A data-driven approach to friendship

As a machine learning researcher, I love working with data, which of course got me thinking about other areas, including friendship. Treated as a data analysis problem, what insights can we derive, and which of those are meaningful? Let’s start with some obvious ones:

  • Who is putting in the most work?
    • Especially with long-distance relationships (of any sort), keeping in touch requires effort. So who’s really carrying the team, and who’s mooching off the efforts of the other?
  • How often do you communicate?
    • And who communicates at what rate?

Now we need some definitions: even if we had the answers to the questions above, we’d still need a priori definitions of what “good” looks like. Let’s make some reasonable ones:

  • We expect non-trivial communication at least once every 9 months.
  • Each party should start at least 1 in every 4 non-trivial conversations.

The first just ensures you’re keeping in touch at all; the second ensures that each party puts in at least 25% of the work. “Non-trivial” excludes things like birthday wishes, which are social obligations rather than actual conversations.
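Concretely, the two definitions boil down to a pair of thresholds (the variable names here are my own; 36 weeks approximates 9 months):

```python
from datetime import timedelta

# ~9 months; any longer silence between messages counts as an infraction
MIN_DELTA = timedelta(weeks=36)

# Each party should start at least 25% of non-trivial conversations
MIN_INITIATION_SHARE = 25.0
```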

So we have our questions; now we need data. Fortunately, Telegram has a built-in export feature: we export our data to JSON so we can analyze it. If you use iMessage and have a Mac, the history is stored locally as a SQLite database. It’s a little annoying to parse, but it can be done. Let’s get started.
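I won’t cover iMessage further here, but as a rough sketch of what parsing chat.db might involve: the table and column names below are what I’ve seen on recent macOS, so treat them as assumptions, and note that the `date` column switched from seconds to nanoseconds since 2001-01-01 at some point.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# iMessage keeps its history in ~/Library/Messages/chat.db
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def apple_date_to_datetime(raw):
    # Assumes the modern nanoseconds-since-2001 representation
    return APPLE_EPOCH + timedelta(seconds=raw / 1e9)

def load_imessage(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        'SELECT text, date, is_from_me FROM message ORDER BY date'
    ).fetchall()
    conn.close()
    return [
        {'text': text, 'date': apple_date_to_datetime(raw), 'from_me': bool(me)}
        for text, raw, me in rows
        if text  # attachments and reactions can have NULL text
    ]
```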

I’ll skip the part where we load the JSON from Telegram, since it’s trivial. Note that when requesting the export, it’s best to uncheck photos, videos, and so on; we won’t parse those anyway. Our function, which takes the array of messages, looks like this:

from datetime import datetime, timedelta

import pandas as pd
from tqdm import tqdm

def parse(data, resolution=2):
    RESOLUTION = timedelta(hours=resolution)
    MIN_DELTA = timedelta(weeks=36)

    infractions = 0

    # Service messages (calls, pins, etc.) have no 'from' field; drop them
    data = [message for message in data if 'from' in message]
    parties = set(message['from'] for message in data)
    initiated = {party: [] for party in parties}

    for i, message in enumerate(tqdm(data)):
        if i == 0:
            continue  # the first message has no predecessor to compare against

        date = datetime.fromisoformat(message['date'])
        text = message['text']
        sender = message['from']
        not_sender = (parties - {sender}).pop()

        gap = date - datetime.fromisoformat(data[i - 1]['date'])

        if gap >= MIN_DELTA:
            # Over 9 months of silence: an infraction
            infractions += 1

        if gap >= RESOLUTION:
            # A long enough gap means this message starts a new conversation
            if isinstance(text, list):
                text = parse_complex_message(text)

            if 'happy birthday' not in text.lower():
                # `sender` initiated a new non-trivial conversation
                initiated[sender].append(1)
                initiated[not_sender].append(0)

    # Running initiation percentages, from the start and from the end
    initiated_forward = {}
    initiated_backward = {}
    for party in parties:
        flags = initiated[party]
        initiated_forward[party] = [sum(flags[:i]) / i * 100 for i in range(1, len(flags) + 1)]
        initiated_backward[party] = [sum(flags[-i:]) / i * 100 for i in range(1, len(flags) + 1)]

    initiated_forward = pd.DataFrame(initiated_forward)
    initiated_backward = pd.DataFrame(initiated_backward)

    return initiated_forward, initiated_backward, infractions

It’s all pretty straightforward: we pick a threshold for what constitutes a new conversation (two hours of silence by default) and set our 9-month (36-week) infraction threshold. We iterate over the messages, keeping a tally of who started each conversation. At the end, we compute a running percentage in both the forward and backward directions so that we can plot a nice graph instead of just printing a number. As for the parse_complex_message function, it simply handles cases where Telegram has broken the message into parts:

def parse_complex_message(message):
    # Telegram splits messages containing links, mentions, formatting, etc.
    # into a list of plain strings and {'type': ..., 'text': ...} dicts;
    # flatten them back into a single string.
    total = ''

    for part in message:
        if isinstance(part, str):
            total += f'{part} '
        elif isinstance(part, dict):
            total += f'{part["text"]} '

    return total
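Before moving on to plotting, the running-percentage step at the end of parse is easy to sanity-check on a toy sequence (the names here are made up):

```python
# Alice initiated conversations 1, 2, and 4; Bob initiated conversation 3
initiated = {'Alice': [1, 1, 0, 1], 'Bob': [0, 0, 1, 0]}

# Running share of conversations each party has initiated so far, in percent
initiated_forward = {
    party: [sum(flags[:i]) / i * 100 for i in range(1, len(flags) + 1)]
    for party, flags in initiated.items()
}
# Alice: [100.0, 100.0, ~66.7, 75.0]; Bob: [0.0, 0.0, ~33.3, 25.0]
```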

In the main function, we load the data and then use Streamlit to develop a dashboard of sorts and plot the two data frames we returned. Here’s a rather healthy-looking graph:

The x-axis here is the conversation index. Overall, this is very healthy. While the beginning of the plot is a little rough (or rather the end, since this is the backward, from-the-end view), it does eventually even out.

The code for this project is open source; try it for yourself!
