The Vocabulary of Reddit

Posted on December 2, 2014 in: Personal, Programming

Reddit. A place of unrelenting procrastination and unexpected inspiration. At first glance, a confusing concoction of trite memes, current affairs, cute animals and rampant shitposts. But amidst the apparent chaos one discovers an underlying order: the endless stream of links is neatly segmented into subreddits, each boasting its own sub-community and distinct personality.

I thought it might be fun to investigate the subcultures that exist within these subreddits. With the help of some basic natural language processing, we gain insight into the distinct lexicon of a subreddit. Most likely, the results will simply reinforce our preconceptions, but hopefully we might learn something new.

A quick note: all of the results you see below were generated using a Python app that I’ve made available as open source. You can read more about it here or type git clone https://github.com/jaijuneja/reddit-nlp.git into your terminal. To avoid boring you with details of the implementation, let’s jump straight to the results.

I grabbed a list of the 25 top subreddits from here. For each one, I processed approximately 10,000 recent comments, tokenising them and keeping a running count of every word that appeared. What you see above are the most common words in each subreddit, ranked by their term frequency-inverse document frequency (tf-idf) score. This score reflects how important a word is to one document (here, a subreddit) relative to the rest of the collection. For example, the word “and” might appear very frequently in a subreddit, but since it also appears frequently in all other subreddits it is down-weighted. This approach helps to filter out a lot of uninformative words and has been supplemented with a stop-word list. I also toyed with stemming (using the Porter Stemmer algorithm), but found that it wasn’t particularly effective. This explains why you can see both singular and plural forms of certain words appearing above.
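For the curious, here is a rough sketch of how a tf-idf score can be computed. It is not the code from the reddit-nlp app; the function name tfidf_scores, the toy subreddit names and the plain log(N / df) weighting are all assumptions made for illustration, and the real app may weight things differently.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Score every word in every document by tf-idf.

    docs maps a document name (e.g. a subreddit) to its list of tokens.
    Hypothetical helper: uses the textbook weighting tf * log(N / df).
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    scores = {}
    for name, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in word_counts.items()
        }
    return scores


# Toy example: "the" and "and" occur in both documents, so they score zero.
toy = {
    "aww": "the cat and the dog".split(),
    "python": "the code and the bug".split(),
}
ranked = tfidf_scores(toy)
```

A word that shows up in every document gets log(1) = 0, which is exactly the down-weighting described above, while a word unique to one subreddit keeps a positive score.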

Un-friending: a social media odyssey

Posted on February 20, 2014 in: Personal

Exhibit A

Please refer to Exhibit A.

That’s me scrolling through Facebook.

How did I get there? To be honest, I don’t know.

That’s what social media does to you nowadays. It sits idly in your brain, like one of those pesky background processes you forgot to disable, waiting to take admin privileges and seize control.

Social media is our captain now.

Before you know it, you’re swiping through pics of Tiffany’s yoga retreat in Thailand.

Thailand looks dope, you think to yourself. And Tiffany’s gone hard on the hashtags. #lifesabeach. Good one.

Her play on words is fittingly ironic: Tiffany has a good life.

Scroll down further. Now it’s that dude Ryan. He’s kicking it with the bros. #brewskis #watchingthegame.

Who are Ryan and Tiffany? You have no idea. You met them once.

But when you added them as Facebook friends back in ’08, you entered a mutually beneficial contract. A contract in which you got to increment your friend counter, and in exchange had to serve as a cheerleading bystander to their online projection. You pump my stats, I pump yours.

What can we learn from our emails?

Posted on January 10, 2014 in: Personal, Programming

I’ve become fascinated with the idea of amassing personal data to conduct some basic self-analysis (a.k.a. self-PRISMing). We can learn a lot about our habits, and it is inevitable that some time in the near future people will routinely gather all kinds of data on themselves (many tech companies already do this, not to mention the NSA). Since I started experimenting with Python yesterday it seemed like a good idea to abandon my never-ending university workload and write up a small Python program that could analyse my email behaviour.

Fast-forward a few hours and I’ve got a working program that can access any email account via IMAP, retrieve all messages and provide a new perspective on one’s life that may not have otherwise been obvious. It’s currently very basic, but delivers some initial food for thought.
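To give a flavour of the IMAP part, here is a minimal sketch using Python's standard imaplib and email modules. It is not the code from the email-analytics repository; the function names, the INBOX default and the choice to fetch only the Date header are assumptions for illustration, and you would need to supply your own host and credentials.

```python
import email
import imaplib


def date_from_raw_header(raw):
    """Pull the Date header out of a raw header block (bytes)."""
    return email.message_from_bytes(raw)["Date"]


def fetch_date_headers(host, user, password, mailbox="INBOX"):
    """Return the Date header of every message in a mailbox.

    Hypothetical helper: host, user and password are placeholders.
    """
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select(mailbox, readonly=True)  # read-only: leave flags untouched
        _, data = conn.search(None, "ALL")
        dates = []
        for num in data[0].split():
            # BODY.PEEK fetches the header without marking the message as read.
            _, msg_data = conn.fetch(num, "(BODY.PEEK[HEADER.FIELDS (DATE)])")
            dates.append(date_from_raw_header(msg_data[0][1]))
        return dates
    finally:
        conn.logout()
```

Fetching only the Date header keeps the download small, which matters when a mailbox holds several years of mail.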

As a side note, all of the code is available on GitHub (just type git clone https://github.com/jaijuneja/email-analytics.git into Terminal/Command Prompt or download it from the link below).

Download Code

So, has this program provided me with enlightenment and newfound self-awareness? Not quite, but it has affirmed some of my suspicions. The first thing that is clear from my university email account is my non-existent/nocturnal sleep cycle. It appears that I regularly send mail between midnight and 6am, as do some of my peers (especially during term time). “I should probably go to bed” is not a thought I’m particularly receptive to…

However, this past term (October-December 2013) was perhaps the first where I successfully maintained a regular human sleep cycle for the most part – and it actually shows from the graph!

Upon deeper inspection we find that the probability distribution for my sent mail is essentially bimodal, with one peak in the late afternoon/evening and another around 1-2am.
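A distribution like this can be estimated by bucketing each message's Date header by hour of day. The sketch below is an assumption about how one might do it rather than the program's actual method; hourly_distribution is a hypothetical name, and the sample headers are made up.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def hourly_distribution(date_headers):
    """Return a 24-element list: the fraction of messages sent in each hour.

    The hour is taken as written in the header, i.e. in the sender's
    own UTC offset, with no timezone conversion applied.
    """
    hours = Counter(parsedate_to_datetime(h).hour for h in date_headers)
    total = sum(hours.values())
    return [hours[h] / total if total else 0.0 for h in range(24)]


# Made-up sample: two messages around 1am, one in the late afternoon.
sample = [
    "Fri, 10 Jan 2014 01:30:00 +0000",
    "Fri, 10 Jan 2014 01:45:00 +0000",
    "Thu, 09 Jan 2014 17:05:00 +0000",
]
dist = hourly_distribution(sample)
```

With real data, two separated humps in this list are what "bimodal" looks like in practice.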

What is also evident is the sharp drop in incoming mail during the summer. This is further emphasised in the plot below: you can see the termly cycles of email traffic, with distinct troughs during holiday seasons. Note that the received mail only goes back 200 days because Oxford’s MS Exchange server was giving me problems.