A few weeks ago I read the book Bursts: The Hidden Pattern Behind Everything We Do, which discusses the predictability of human actions. While I thought the writing left something to be desired, the ultimate conclusion of the book intrigued me: that human behavior is predictable, if only in a statistical sense. This conclusion was derived from the observation that many aspects of human behavior are “bursty”, or follow a power law. In this context, it means that the time between successive events is far more often short than long.
One example provided in the book is the sending of emails. If you look at the time delay between successive emails, the length of time between one email and the next is most often small. That is, emails are usually sent in close succession (or, when you send emails, you usually send them or reply to them in blocks). Then a longer period of time later, another batch of emails is written and sent. Summing up this behavior results in a power law.
I suspected I too followed this pattern of “bursty” behavior. However, my inner scientist wasn’t content with accepting the assertions of the book without a little data mining. I decided to test this with a little analysis of my twitter posts. I downloaded the most recent 3000 or so posts in XML format using twitterbackup (twitter currently only allows you to download your most recent 3000 tweets). This file includes a wealth of information, but for my purposes I was only after the timestamp of each tweet. The oldest tweet was from 11 November 2010, so this analysis covers the past three months of activity.
I whipped up a short python script to go through the tweets, calculate the time between successive posts, and convert that information into a histogram (number of tweets which were sent after a given time interval). My expectation was that this would follow a power law, meaning a lot of tweets would be sent in rapid succession (short time intervals), while few tweets would be sent with long delays from the previous tweet.
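The core of that calculation is simple enough to sketch here. This isn’t the exact script (that comes later); it assumes the timestamps have already been parsed out of the XML into datetime objects, and the 10-minute bin width is just an illustrative choice:

```python
# Sketch of the interval-histogram step. Assumes timestamps are already
# parsed into datetime objects; bin width is an arbitrary example value.
from datetime import datetime
from collections import Counter

def interval_histogram(timestamps, bin_minutes=10):
    """Count the gaps between successive tweets, binned by bin_minutes."""
    times = sorted(timestamps)
    gaps = [(b - a).total_seconds() / 60.0  # gap in minutes
            for a, b in zip(times, times[1:])]
    # floor each gap to the start of its bin and tally
    return Counter(int(g // bin_minutes) * bin_minutes for g in gaps)

tweets = [
    datetime(2011, 2, 1, 12, 0),
    datetime(2011, 2, 1, 12, 7),
    datetime(2011, 2, 1, 12, 11),
    datetime(2011, 2, 1, 18, 30),
]
hist = interval_histogram(tweets)
print(hist)  # two short gaps in the 0-minute bin, one long gap in the 370 bin
```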
So, was my suspicion (and the assertion of Albert-Laszlo Barabasi) borne out?
The top graph is for all 2959 tweets. It doesn’t immediately show the predicted power law. But this is mostly due to the few outliers where there were long delays (several days) between tweets. Notably these are infrequent, and so it’s not too surprising that this part of the histogram isn’t well sampled. So I trimmed the range, creating a histogram of only those tweets sent after intervals of less than 1500 minutes (25 hours). Now it’s starting to look like a power law. The instances where I have gone days without tweeting are few and far between. Part of this is due to the short, 3 month range of data. There are only so many 4 day blocks in 3 months during which I could have ignored twitter.
Let’s zoom in a bit more and look at only the tweets with delays less than 300 minutes (18,000 seconds or 5 hours). That looks pretty close to a power law, with some occasional deviations (the y-axis is log-scale but the x-axis is linear, so the shape is correct for a power law). But on the whole, pretty good.
Why is there a dip around 10,000 seconds (~3 hours)? I have no clue.
Based on this I will admit that I, like most people, follow a predictable pattern (at least statistically). (For those of you who are curious, the best-fit power law index is about -1.7)
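One simple way to estimate such an index is a least-squares line fit in log-log space, since a power law is a straight line there. This is only a sketch of the idea, not the actual fit used above, and the `hist` dict of {interval: count} values below is a synthetic example rather than my real data:

```python
# Estimate a power-law index by fitting a line to log(count) vs.
# log(interval). Sketch only; hist here is synthetic example data.
import math

def power_law_index(hist):
    pts = [(math.log(x), math.log(n))
           for x, n in hist.items() if x > 0 and n > 0]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    # slope of the least-squares line in log-log space = power-law index
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

# Counts drawn from an exact x^-1.7 law should recover the index:
hist = {x: x ** -1.7 for x in (10, 20, 40, 80, 160)}
print(round(power_law_index(hist), 2))  # -1.7
```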
But what about non-humans? I administer two twitter accounts for the Astronomy Department at UVa. The first, @UVaAstro, is aimed at disseminating information about departmental events and news. The second, @UVaAstroLabs, is a notification system for students in astronomy labs. The former has some manual interaction while the latter is entirely script driven. I ran the same analysis as above for these accounts. I expected there to be a difference. UVaAstroLabs should have a very peaked histogram, as the script runs once per day on days with labs. So the intervals between tweets should only have specific values (e.g., 24 hours [one status update per day] and 96 hours [no labs Fri-Sun]).
For UVaAstro, there is a script that runs daily to post departmental events for that day. But this isn’t as regular in terms of time between tweets because there are some days with no events (> 24 hours between tweets) and some days with multiple events (< 1 min between tweets). Also, this account has occasional manual posts which would be more likely to follow the bursty power law behavior. It seems reasonable that this account would show some combination of the above two behaviors.
The plot below shows the distribution for @UVaAstro (blue) and @UVaAstroLabs (green).
As expected @UVaAstroLabs is peaky, with peaks corresponding to the time between lab days (24 hours [1440 min]), the time period over the weekend without labs (4 days [5760 min]), and even the time between semesters (~40 days for Christmas break and ~90 for summer; they are longer than the actual breaks because there are no labs during finals week or the first week of classes). The @UVaAstro distribution is a bit different, due to the combination of scripted and human input. I submit that it is a combination of human (power law / bursty) and bot (peaky) activity.
“But George,” you say, “what if I want to do this on my own tweets?” or “I don’t believe you”. Well, here is the basic script used to generate the above data*. You can download your tweets using the program linked above and feed them into the script. See if you too follow a power law!
* – Yes, I know some of the code is ugly, particularly the handling of the duplicate &lt;created_at&gt; tag, which appears twice per tweet entry: once for when the tweet was created and once for when the account was created. More clever XML parsing would certainly clean this up, but I’ll leave that as an exercise for the reader.
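For anyone attempting that exercise, here is one way the cleaner parsing could go: with ElementTree, asking each status element for its direct &lt;created_at&gt; child skips the account-creation timestamp nested inside the user element. The element names below mirror Twitter’s old XML format and are an assumption; they may not match twitterbackup’s output exactly:

```python
# Sketch of cleaner XML parsing: find() only searches direct children,
# so the account's nested <created_at> is never picked up. Element
# names (<statuses>, <status>, <user>) are assumed, not verified
# against twitterbackup's actual output.
import xml.etree.ElementTree as ET

XML = """<statuses>
  <status>
    <created_at>Tue Feb 01 12:00:00 +0000 2011</created_at>
    <user><created_at>Sat Mar 01 00:00:00 +0000 2008</created_at></user>
  </status>
  <status>
    <created_at>Tue Feb 01 12:07:00 +0000 2011</created_at>
    <user><created_at>Sat Mar 01 00:00:00 +0000 2008</created_at></user>
  </status>
</statuses>"""

root = ET.fromstring(XML)
stamps = [s.find("created_at").text for s in root.findall("status")]
print(stamps)  # only the two tweet timestamps, both from 2011
```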