Donald Trump

Everyone in media wants a piece of the US election pie. On both ends of the political spectrum, traditional party politics are being rocked by new personalities. Voters frustrated with the establishment see this as an opportunity for protest. On the left they’re Feeling the Bern and on the right they’re hoping to Make America Great Again, but in both cases voters have expressed a cocktail of emotions from anger to optimism.

Front and centre of this circus act sits an orange caricature. To cynics, Donald Trump is a parody of himself, a charlatan who wins support by pandering to public fear. To supporters, he represents the anti-establishment and has the ability to “drain the swamp” from within. And while the media fumbles over his unexpected rise, the rest of the world is watching in shock, horror and awe as the prospect of President Trump becomes achingly real. On the surface, it’s easy to condemn the media for falling prey to his shallow tactics, but deep down I’m relishing every bite of this amuse-bouche under the naive assumption that it won’t last.

However, I’m not here to push my opinion. Instead, I want to uncover what the numbers tell us: who are the likely candidates for each party? Are the pundits’ claims backed by hard statistics? What insights can we glean from voting data and a simple predictive model? Let’s find out… Continue reading »

Reddit. A place of unrelenting procrastination and unexpected inspiration. At first glance, a confusing concoction of trite memes, current affairs, cute animals and rampant shitposts. But amidst the apparent chaos one discovers an underlying order: the endless stream of links is neatly segmented into subreddits, each boasting its own sub-community and distinct personality.

I thought it might be fun to investigate the subcultures that exist within these subreddits. With the help of some basic natural language processing, we can gain insight into the distinct lexicon of each subreddit. Most likely the results will simply reinforce our preconceptions, but hopefully we’ll learn something new.

A quick note: all of the results you see below were generated using a Python app that I’ve made available as open source. You can read more about it here or type git clone https://github.com/jaijuneja/reddit-nlp.git into your terminal. To avoid boring you with details of the implementation, let’s jump straight to the results.

I grabbed a list of the 25 top subreddits from here. For each one, I processed approximately 10,000 recent comments, tokenising them and keeping a running count of every word that appeared. What you see above are the most common words in each subreddit by their term frequency-inverse document frequency (tf-idf) score. This score reflects how important a word is to a specific document relative to the rest of the corpus; here, each subreddit is treated as a single document. For example, the word “and” might appear very frequently in a subreddit, but since it also appears frequently in every other subreddit it is down-weighted. This approach filters out a lot of uninformative words, and I supplemented it with a stop-word list. I also toyed with stemming (using the Porter Stemmer algorithm), but found that it wasn’t particularly effective. This explains why you can see both singular and plural forms of certain words above.
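If you’re curious how the weighting works, here’s a minimal sketch in pure Python. It isn’t the code from the reddit-nlp repo – the tf_idf function and the toy docs dictionary below are purely illustrative – but it captures the idea of treating each subreddit as one document:

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs maps subreddit name -> list of word tokens.
        Returns subreddit -> {word: tf-idf score}."""
        # Term frequency: how often each word appears within a subreddit
        tf = {name: Counter(tokens) for name, tokens in docs.items()}
        # Document frequency: how many subreddits each word appears in at all
        df = Counter(word for counts in tf.values() for word in counts)
        n_docs = len(docs)
        scores = {}
        for name, counts in tf.items():
            total = sum(counts.values())
            scores[name] = {
                word: (count / total) * math.log(n_docs / df[word])
                for word, count in counts.items()
            }
        return scores

    docs = {
        "aww": "cute cat and cute dog".split(),
        "worldnews": "the election and the economy".split(),
    }
    print(sorted(tf_idf(docs)["aww"].items(), key=lambda kv: -kv[1])[:3])

Run on the toy corpus, “cute” comes out on top for r/aww while “and” scores zero, which is exactly the down-weighting described above.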

Continue reading »

During my final year at Oxford I spent many nights ruminating on the topic of computer vision. Haggard, bearded, and jaded by the prospect of partying and gallivanting that once dominated my life, I was left with only my thesis. Over the course of a year, I transformed from a youthful Frodo into a schizophrenic Gollum, hunched over a laptop and typing furiously as freshers frolicked beyond the library window.

All jokes and prose aside, my time at Oxford was awesome and irreplaceable. But given that my master’s thesis did, in fact, consume a sizeable chunk of my lifespan, it would be unfortunate if it got lost in the colossal annals of academia. So I shall post it here for the world (i.e. 3 people tops) to read, where it will instead become lost in the deep web.


Paper
Poster

Abstract

This report develops a large-scale, offline localisation and mapping scheme for textured surfaces using the bag-of-visual-words model. The proposed system builds heavily on the work of Philbin et al. and Vedaldi and Fulkerson, taking as an input a corpus of images that are rapidly indexed and correlated to construct a coherent map of the imaged scene. Thereafter, new images can be reliably queried against the map to obtain their corresponding 3D camera pose. A simple bundle adjustment algorithm is formulated to enforce global map consistency, exhibiting good performance on real datasets and large loop closures. Furthermore, a proposed submapping scheme provides dramatic computational improvements over the baseline without any deterioration in localisation performance. A comprehensive implementation written in MATLAB as proof of concept is pressure tested against a variety of textures to examine the conditions for system failure. The application unambiguously outperforms the naked eye, demonstrating an ability to discriminate between very fine details in diverse settings. Tests validate its robustness against broad changes in lighting and perspective, as well as its notable resilience to high levels of noise. Areas for further development to augment the original work and achieve real-time performance are also suggested.

I conducted my research in the Visual Geometry Group under the supervision of Prof Andrea Vedaldi. The project focused on building a system to coherently reconstruct large-scale “textured” scenes using a single camera. The idea was to see whether a computer could reconstruct an environment which, to the human eye, contains very little physical or visual information. A complete architecture for offline localisation and mapping using the Bag-of-Words (BoW) model was proposed and subsequently implemented in MATLAB.

The poster above provides a good overview of the system, but to give you a better sense of what the algorithm does at a high level, consider the image below. First, a corpus of training images is used to extract “interesting” features (using the Scale-Invariant Feature Transform, or SIFT) – typically distinctive areas of high contrast, corners, edges and so on. These are then quantised into visual words using k-means clustering. Once a “vocabulary” has been established, the features of a large set of images can be extracted and geometrically matched, such that the scene can be stitched together and reconstructed. Thereafter, a new input image can be rapidly localised within the scene.
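The full implementation is in MATLAB (link below), but if you wanted to prototype the front end of that pipeline in Python it might look roughly like the sketch below, using OpenCV for SIFT and scikit-learn for k-means. The file paths, vocabulary size and bag_of_words helper are placeholders, and the geometric matching, mapping and localisation stages are omitted entirely:

    import glob
    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    TRAIN_IMAGES = glob.glob("corpus/*.jpg")  # hypothetical training corpus
    VOCAB_SIZE = 500                          # number of visual words (k in k-means)

    sift = cv2.SIFT_create()

    # 1. Extract SIFT descriptors from every training image
    all_descriptors = []
    for path in TRAIN_IMAGES:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descriptors = sift.detectAndCompute(img, None)
        if descriptors is not None:
            all_descriptors.append(descriptors)
    all_descriptors = np.vstack(all_descriptors)

    # 2. Quantise the descriptors into a vocabulary of visual words
    kmeans = KMeans(n_clusters=VOCAB_SIZE, n_init=4).fit(all_descriptors)

    # 3. Represent any image as a normalised histogram of visual words
    def bag_of_words(path):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descriptors = sift.detectAndCompute(img, None)
        words = kmeans.predict(descriptors)
        hist = np.bincount(words, minlength=VOCAB_SIZE).astype(float)
        return hist / hist.sum()

    query_hist = bag_of_words("query.jpg")  # ready to be matched against the map

For a corpus of any real size you’d want MiniBatchKMeans or an approximate nearest-neighbour quantiser rather than vanilla k-means, but the structure of the pipeline is the same.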

The MATLAB implementation can be downloaded from the link below.

Download Code

Alternatively, if you use Git, you can clone the repository from GitHub using the command:

git clone https://github.com/jaijuneja/texture-localisation-matlab.git


Exhibit A

Please refer to Exhibit A.

That’s me scrolling through Facebook.

How did I get there? To be honest, I don’t know.

That’s what social media does to you nowadays. It sits idly in your brain, like one of those pesky background processes you forgot to disable, waiting to take admin privileges and seize control.

Social media is our captain now.

Before you know it, you’re swiping through pics of Tiffany’s yoga retreat in Thailand.

Thailand looks dope, you think to yourself. And Tiffany’s gone hard on the hashtags. #lifesabeach. Good one.

Her play on words is fittingly ironic; Tiffany has a good life.

Scroll down further. Now it’s that dude Ryan. He’s kicking it with the bros. #brewskis #watchingthegame.

Who are Ryan and Tiffany? You have no idea. You met them once.

But when you added them as Facebook friends back in ’08, you entered a mutually beneficial contract. A contract in which you got to increment your friend counter, and in exchange had to serve as a cheerleading bystander to their online projection. You pump my stats, I pump yours.

Continue reading »

I’ve become fascinated with the idea of amassing personal data to conduct some basic self-analysis (a.k.a. self-PRISMing). We can learn a lot about our habits this way, and it is inevitable that sometime in the near future people will routinely gather all kinds of data on themselves (many tech companies already do this, not to mention the NSA). Since I started experimenting with Python yesterday, it seemed like a good idea to abandon my never-ending university workload and write a small Python program to analyse my email behaviour.

Fast-forward a few hours and I’ve got a working program that can access any email account via IMAP, retrieve all messages and offer a perspective on one’s life that might not otherwise be obvious. It’s currently very basic, but delivers some initial food for thought.

As a side note, all of the code is available on GitHub (just type git clone https://github.com/jaijuneja/email-analytics.git into Terminal/Command Prompt or download it from the link below).

Download Code
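The repo does the heavy lifting, but the core of it – pulling message timestamps over IMAP – needs only the Python standard library. The sketch below is not the repo’s code: the server, credentials and mailbox name are placeholders you’d replace with your own.

    import imaplib
    from email.utils import parsedate_to_datetime

    HOST, USER, PASSWORD = "imap.example.com", "me@example.com", "password"

    conn = imaplib.IMAP4_SSL(HOST)
    conn.login(USER, PASSWORD)
    conn.select("Sent", readonly=True)  # mailbox name varies by provider

    # Fetch only the Date header of every message in the folder
    _, data = conn.search(None, "ALL")
    send_times = []
    for num in data[0].split():
        _, msg_data = conn.fetch(num, "(BODY.PEEK[HEADER.FIELDS (DATE)])")
        header = msg_data[0][1].decode(errors="ignore").strip()
        if header.lower().startswith("date:"):
            send_times.append(parsedate_to_datetime(header[5:].strip()))

    conn.logout()
    print(f"Fetched {len(send_times)} sent messages")

From there it’s just a matter of bucketing those datetimes by hour, day or month.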

So, has this program provided me with enlightenment and newfound self-awareness? Not quite, but it has affirmed some of my suspicions. The first thing that is clear from my university email account is my non-existent/nocturnal sleep cycle. It appears that I regularly send mail between midnight and 6am, as do some of my peers (especially during term time). “I should probably go to bed” is not a thought I’m particularly receptive to…

However, this past term (October-December 2013) was perhaps the first in which I (for the most part) successfully maintained a regular human sleep cycle – and it actually shows in the graph!

Upon deeper inspection we find that the probability distribution for my sent mail is essentially bimodal, with one peak in the late afternoon/evening and another around 1-2am.
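For the curious, the hour-of-day distribution is a few lines of matplotlib once you have a list of send times. In the sketch below, send_times stands in for the list gathered over IMAP; a few dummy timestamps are hard-coded so the snippet runs on its own.

    from datetime import datetime
    import matplotlib.pyplot as plt

    # Stand-in for the datetimes pulled from the mail server
    send_times = [datetime(2013, 11, 4, 1, 30),
                  datetime(2013, 11, 5, 17, 10),
                  datetime(2013, 11, 6, 2, 5)]

    hours = [t.hour for t in send_times]
    plt.hist(hours, bins=range(25), edgecolor="black")
    plt.xticks(range(0, 24, 2))
    plt.xlabel("Hour of day")
    plt.ylabel("Emails sent")
    plt.show()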

What is also evident is the sharp drop in incoming mail during the summer. This is further emphasised in the plot below: you can see the termly cycles of email traffic, with distinct troughs during holiday seasons. Note that I can only see received mail going back about 200 days because Oxford’s MS Exchange server was giving me problems.

Continue reading »