Reddit. A place of unrelenting procrastination and unexpected inspiration. At first glance, a confusing concoction of trite memes, current affairs, cute animals and rampant shitposts. But amidst the apparent chaos one discovers an underlying order: the endless stream of links is neatly segmented into subreddits, each boasting its own sub-community and distinct personality.

I thought it might be fun to investigate the subcultures that exist within these subreddits. With the help of some basic natural language processing, we can gain insight into the distinct lexicon of each subreddit. Most likely, the results will simply reinforce our preconceptions, but hopefully we might learn something new.

A quick note: all of the results you see below were generated using a Python app that I’ve made available as open source. You can read more about it here or type git clone https://github.com/jaijuneja/reddit-nlp.git into your terminal. To avoid boring you with details of the implementation, let’s jump straight to the results.

[Interactive bubble chart: the highest-scoring words by tf-idf for each of the 25 subreddits analysed. Subreddits shown: /r/funny, /r/pics, /r/AskReddit, /r/todayilearned, /r/worldnews, /r/science, /r/blog, /r/IAmA, /r/videos, /r/gaming, /r/movies, /r/Music, /r/aww, /r/technology, /r/bestof, /r/WTF, /r/AdviceAnimals, /r/news, /r/gifs, /r/askscience, /r/books, /r/television, /r/politics, /r/explainlikeimfive, /r/EarthPorn]

I grabbed a list of the 25 top subreddits from here. For each one, I processed approximately 10,000 recent comments, tokenising them and keeping a running count of every word that appeared. What you see above are the highest-ranked words in each subreddit by their term frequency-inverse document frequency (tf-idf) score. This score reflects how important a word is to one document (here, one subreddit) relative to the rest of the corpus. For example, the word “and” might appear very frequently in a subreddit, but since it also appears frequently in all other subreddits it is down-weighted. This approach helps to filter out a lot of uninformative words and has been supplemented with the use of a stop-word list. I also toyed with stemming (using the Porter stemming algorithm), but found that it wasn’t particularly effective. This explains why you can see both singular and plural forms of certain words appearing above.
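
To make the weighting concrete, here is a minimal sketch of tf-idf in plain Python 3. It is not the code from the app: the three toy "subreddits", the two-word stop list and the particular formula (term frequency as count over document length, idf as log(N / df)) are all assumptions chosen for illustration, and other tf-idf variants exist.

```python
import math
import re
from collections import Counter

# Toy "documents": in the post, each document is one subreddit's comments.
# These strings are invented purely for illustration.
docs = {
    "funny": "repost and another repost and a watermelon",
    "science": "the particle and the quark and the genome",
    "aww": "a puppy and a corgi and a kitty",
}

STOP_WORDS = {"a", "the"}  # tiny illustrative stop-word list (assumption)


def tokenise(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf(docs):
    """Return {doc: {word: score}} with tf = count / doc length, idf = log(N / df)."""
    counts = {name: Counter(tokenise(text)) for name, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for c in counts.values() for word in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {w: (n / total) * math.log(n_docs / df[w]) for w, n in c.items()}
    return scores


scores = tfidf(docs)
for name, word_scores in scores.items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(name, ranked[:3])
```

Because "and" occurs in every toy document, its idf is log(3/3) = 0 and it drops to the bottom of each ranking, while "repost" rises to the top of the invented /r/funny text.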

The bubble chart reveals some immediate observations. People on /r/funny and /r/pics are hell-bent on pointing out reposts, even going so far as to use websites like karmadecay.com in their quest to identify them. By contrast, more serious subreddits like /r/science and /r/news tend to stay firmly on topic. Of course, one must appreciate the limitation that these results are highly time-sensitive, and a more robust analysis would require orders of magnitude more data.

I next ran a script to rank subreddits by their use of swear words. Using this list I obtained the following results:

For those familiar with /r/WTF and /r/AdviceAnimals, this is not surprising. What’s more interesting is that there is a strong negative correlation between the prevalence of swear words in a subreddit and the average word length (Pearson’s r = -0.68). Essentially, subreddits that avoid swearing also tend to use longer, arguably more sophisticated, words. We end up with two distinct classes of subreddits: those driven by reasoned, civilised discussion (labelled “intellectual”) and those driven by clickbait and karma (labelled “trash”).
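
For anyone who wants to reproduce this kind of comparison, the sketch below shows one way to compute a swear-word rate, an average word length and Pearson's r in plain Python 3. Everything in it is an assumption made for illustration: the two-word `SWEAR_WORDS` set is a stand-in for the real list, and the four comment strings are invented, so the number it prints says nothing about Reddit.

```python
import math

SWEAR_WORDS = {"damn", "hell"}  # hypothetical stand-in for the real swear-word list

# Invented text, used only to exercise the code; not data from the analysis.
comments = {
    "wtf": "damn that is hell what the hell",
    "askscience": "neutrons interact through gravitational acceleration",
    "pics": "damn nice photo of the canyon",
    "books": "beautiful prose throughout the trilogy",
}


def swear_rate(words):
    """Fraction of words that appear in the swear-word list."""
    return sum(w in SWEAR_WORDS for w in words) / len(words)


def mean_word_length(words):
    """Average number of characters per word."""
    return sum(len(w) for w in words) / len(words)


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


tokens = {name: text.split() for name, text in comments.items()}
rates = {name: swear_rate(words) for name, words in tokens.items()}
lengths = {name: mean_word_length(words) for name, words in tokens.items()}

# Rank subreddits from most to least sweary.
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)

names = list(comments)
r = pearson_r([rates[n] for n in names], [lengths[n] for n in names])
print(round(r, 2))
```

A negative r means that the subreddits with more swearing tend to have shorter words, which is the direction of the relationship reported above.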

Given the scope for machine learning, a final feature I developed was a simple multiclass classifier to determine which subreddit a piece of external content is most suited to. To verify that the classifier was working as expected, I ran several webpages against it. Unsurprisingly, pages such as Telegraph World News were matched up with /r/worldnews. More interestingly, my Facebook news feed was classified under /r/pics, which falls squarely in the “trash” department – perhaps a result of the number of references to photos/images on Facebook. This blog fell under /r/askscience, which I’ll take. I’ve tried to make the Python app simple to use so that anyone can build their own classifier and have included some sample code below for those trying to achieve this.

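# Python 2 example (urllib2 and the print statement are Python 2 only)
import urllib2
import re
from html2text import html2text
from redditnlp import WordCounter, TfidfCorpus

# Download the page and convert its HTML to plain text
html = urllib2.urlopen('http://www.telegraph.co.uk/news/worldnews/').read()
text = html2text(html.decode('utf-8'))
# Remove lines that start with a link
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)

# Count how often each word appears in the page
counter = WordCounter()
word_count = counter.get_word_count(text)

# Load the saved corpus and train the classifier on it
corpus = TfidfCorpus(corpus_path='corpus.json')
corpus.train_classifier(classifier_type='LinearSVC', tfidf=True)
# Print the subreddit predicted for the page
print corpus.classify_document(word_count)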
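
The sample above leaves the matching step to the library. As a rough, library-free illustration of the general idea, the Python 3 sketch below assigns a document to whichever subreddit has the most similar word-count vector, using cosine similarity. Note that this is a deliberately simpler stand-in and not the LinearSVC that the app trains; the three-subreddit `corpus` and its counts are invented for the example.

```python
import math
from collections import Counter

# Invented word counts per subreddit, standing in for a trained corpus.
corpus = {
    "worldnews": Counter({"ukraine": 5, "putin": 4, "nato": 3}),
    "gaming": Counter({"nintendo": 5, "skyrim": 3, "dlc": 2}),
    "aww": Counter({"puppy": 6, "corgi": 3, "kitty": 2}),
}


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(word_count):
    """Return the subreddit whose counts are most similar to word_count."""
    return max(corpus, key=lambda name: cosine(word_count, corpus[name]))


document = Counter("nato warns putin over ukraine".split())
print(classify(document))
```

A trained classifier such as a linear SVM learns per-word weights for each class instead of comparing raw counts, but the input is the same kind of word-count vector.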

If you’re interested in analytics-related stuff, check out my other post on analysing emails.