Predicting the Primaries: The US Election Story As Told by Data
Everyone in media wants a piece of the US election pie. On both ends of the political spectrum, traditional party politics are being rocked by new personalities. Voters frustrated with the establishment see this as an opportunity for protest. On the left they’re Feeling the Bern and on the right they’re hoping to Make America Great Again, but in both cases voters have expressed a cocktail of emotions from anger to optimism.
Front and centre of this circus act sits an orange caricature. To cynics, Donald Trump is a parody of himself, a charlatan who wins support by pandering to public fear. To supporters, he represents the anti-establishment and has the ability to “drain the swamp” from within. And while the media fumbles over his unexpected rise, the rest of the world is watching in shock, horror and awe as the prospect of President Trump becomes achingly real. On the surface, it’s easy to condemn the media for falling prey to his shallow tactics; but deep down I’m relishing every bite of this amuse-bouche under the naive assumption that it won’t last.
However, I’m not here to push my opinion. Instead, I want to uncover what the numbers tell us: who are the likely candidates for each party? Are the pundits’ claims backed by hard statistics? What insights can we gain from voting data and a simple predictive model? Let’s find out…
The Data
With the primaries/caucuses of the four carve-out states just passed (Iowa, New Hampshire, South Carolina and Nevada), we have a rich set of county-level voting data for each of the candidates. Using US census records, we can associate each of these counties with demographic features (such as race, education, income and homeownership) which we assume are adequate predictors of voting outcome. Ultimately, the results of our election predictor are only as good as this assumption.
With that in mind, here’s a sample of what our data looks like:
[easytable style="white-space:nowrap;font-size:13px;"]
state,county,party,candidate,votes,fraction_votes,income,black,hispanic,bilingual,senior,college,homeownership,firms
Iowa,Adair,Republican,Donald Trump,104,0.256,47892,0.4,1.7,1.1,22.1,16.3,77.1,0.098
Iowa,Adams,Republican,Donald Trump,68,0.249,45871,0.4,1.1,1.2,22,13.7,78.5,0.108
Iowa,Allamakee,Republican,Donald Trump,193,0.281,48831,1.4,5.7,8,21.3,14.9,79.5,0.114
Iowa,Appanoose,Republican,Donald Trump,292,0.348,39208,0.7,1.5,2.3,21.4,18.3,72.5,0.11
Iowa,Audubon,Republican,Donald Trump,99,0.265,48313,0.4,1.1,1.2,24.3,16.6,80.4,0.074
Iowa,Benton,Republican,Donald Trump,410,0.251,56669,0.6,1.3,2,16.9,18.8,80.4,0.091
[/easytable]
For each candidate we have the number and fraction of votes they received as well as eight representative features per county:
- Median household income
- % black population
- % Hispanic population
- % population speaking more than one language at home
- % population over 65 years
- % population with a bachelor’s degree or higher
- Homeownership rate
- # of firms per capita
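As a minimal sketch of how rows like these might be loaded for modelling, the snippet below parses the sample table with Python's standard `csv` module and separates the eight demographic features from the voting outcome. The names `SAMPLE`, `FEATURES` and `load_rows` are illustrative assumptions, not part of the original analysis.

```python
import csv
import io

# The six sample rows from the table above, copied verbatim.
SAMPLE = """state,county,party,candidate,votes,fraction_votes,income,black,hispanic,bilingual,senior,college,homeownership,firms
Iowa,Adair,Republican,Donald Trump,104,0.256,47892,0.4,1.7,1.1,22.1,16.3,77.1,0.098
Iowa,Adams,Republican,Donald Trump,68,0.249,45871,0.4,1.1,1.2,22,13.7,78.5,0.108
Iowa,Allamakee,Republican,Donald Trump,193,0.281,48831,1.4,5.7,8,21.3,14.9,79.5,0.114
Iowa,Appanoose,Republican,Donald Trump,292,0.348,39208,0.7,1.5,2.3,21.4,18.3,72.5,0.11
Iowa,Audubon,Republican,Donald Trump,99,0.265,48313,0.4,1.1,1.2,24.3,16.6,80.4,0.074
Iowa,Benton,Republican,Donald Trump,410,0.251,56669,0.6,1.3,2,16.9,18.8,80.4,0.091
"""

# The eight demographic columns used as model inputs.
FEATURES = ["income", "black", "hispanic", "bilingual",
            "senior", "college", "homeownership", "firms"]


def load_rows(text):
    """Parse CSV text into dicts, converting the numeric columns."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        row["votes"] = int(raw["votes"])
        row["fraction_votes"] = float(raw["fraction_votes"])
        for name in FEATURES:
            row[name] = float(raw[name])
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
# One feature vector per county row, in the order given by FEATURES.
X = [[row[name] for name in FEATURES] for row in rows]
print(len(rows), "rows,", len(X[0]), "features each")
```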
Initial Findings
A quick run through the data reveals some of the correlations that our model will rely on. If we examine the counties where different candidates have won, Marco Rubio and Bernie Sanders tend to win in higher-income counties. Trump wins in lower-income counties overall, but seems to appeal to a broader range of voters. [For those wondering how the chart works, the boxed area represents the interquartile range while the lines represent the min/max range #GCSEmaths.]
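For readers who want to see the box plot ingredients computed rather than drawn, here is a small sketch using the median household incomes of the six sample Iowa counties shown earlier. It only illustrates how the box and whiskers are derived; it is not the per-candidate comparison behind the chart, and the quartile method (`"inclusive"`) is an assumption, since plotting tools differ on this.

```python
import statistics

# Median household incomes of the six sample counties from the data table.
incomes = [47892, 45871, 48831, 39208, 48313, 56669]


def box_summary(values):
    """Return the five numbers a min/max box plot draws."""
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),   # lower whisker
        "q1": q1,             # bottom of the box
        "median": median,     # line inside the box
        "q3": q3,             # top of the box
        "max": max(values),   # upper whisker
    }


summary = box_summary(incomes)
print(summary)
```

The interquartile range is simply `summary["q3"] - summary["q1"]`, i.e. the height of the box.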
Diving a bit deeper, Rubio dominates in the few counties with the highest rates of college education (bachelor’s degree or higher), while Trump and Cruz jockey for position among the rest.
A look at race demographics further reveals Trump’s widespread dominance, even winning in counties with the largest black and Hispanic populations. It’s worth caveating that this does not necessarily imply that black and Hispanic people are voting for Trump, but rather that he has won counties where such populations exist. Given his political agenda, there may be other hidden dynamics at play – for example, the Republican voting population may be relatively small and predominantly white. This theory has some basis given that caucus and primary voting is typically restricted to registered party members, who in the case of Republicans are predominantly white.
Among the Democrats, Clinton won overwhelmingly in counties with large black populations. It’s a striking chart, but one that has been reaffirmed by other news sources, which report Clinton as securing 86% of African-American votes vs. 14% for Sanders in the recent South Carolina primary.
Predicting Voting Outcomes
Our predictive model will use the implied insights above – such as Rubio’s relative popularity in high-income, college-educated counties – to project winners in each of the remaining states. We’ll use a random forest classifier to achieve this. In simple terms, a random forest classifier constructs a set of “decision trees” that relate our demographic features to a voting outcome (i.e. a candidate). You can think of each tree as a flow chart that asks a series of yes/no questions about the features (e.g. is median income higher than X?), descending the appropriate branch of the tree after each question until it reaches a leaf (the winning candidate).
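To make the flow-chart picture concrete, here is a minimal sketch of a single decision tree and the loop that descends it. The tree itself is entirely hypothetical: the questions, thresholds and leaf candidates are invented for illustration and were not learned from the voting data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Either a yes/no question (feature > threshold?) or a leaf."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    candidate: Optional[str] = None  # set only on leaves


def predict(node, county):
    """Descend from the root, answering one question per level."""
    while node.candidate is None:
        answer = county[node.feature] > node.threshold
        node = node.yes if answer else node.no
    return node.candidate


# HYPOTHETICAL tree: thresholds and outcomes are made up for illustration.
tree = Node(
    feature="income", threshold=55000,
    yes=Node(
        feature="college", threshold=30.0,
        yes=Node(candidate="Marco Rubio"),
        no=Node(candidate="Donald Trump"),
    ),
    no=Node(
        feature="senior", threshold=20.0,
        yes=Node(candidate="Donald Trump"),
        no=Node(candidate="Ted Cruz"),
    ),
)

# Adair County, Iowa, using values from the sample table.
adair = {"income": 47892, "college": 16.3, "senior": 22.1}
print(predict(tree, adair))
```

A real classifier learns the questions and thresholds from the training data instead of having them written by hand.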
Traditional decision tree-based learning algorithms tend to “overfit” to their training data: the trees are often grown very deep to fit irregular patterns in the data. This makes them good at matching the original dataset, but bad at predicting outcomes for new data. To counter this, a random forest classifier builds an ensemble of decision trees, each trained on a random subset of the data points and features. When the prediction algorithm is run, it chooses the outcome that is most commonly output by the various decision trees (i.e. the mode). This has been found to reduce the problem of overfitting.
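The two ingredients described here, random resampling and majority voting, can be sketched in a few lines. This is a deliberately stripped-down stand-in for a random forest: each "tree" is a one-question stump trained on a bootstrap sample and a single randomly chosen feature, whereas a real random forest grows deeper trees and samples features at every split. The toy counties and the labels "A" and "B" are synthetic.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label (the mode)."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """One-question tree: split `feature` at its mean value."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    above = [l for r, l in zip(rows, labels) if r[feature] > threshold]
    below = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
    fallback = majority(labels)  # used if one side of the split is empty
    return (feature, threshold,
            majority(above) if above else fallback,
            majority(below) if below else fallback)


def train_forest(rows, labels, n_trees, rng):
    """Train each stump on a bootstrap sample and one random feature."""
    features = list(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.choice(features)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Each stump votes; the ensemble returns the most common vote."""
    votes = [hi if row[f] > t else lo for f, t, hi, lo in forest]
    return majority(votes)


# SYNTHETIC toy counties: candidate "A" wins the low-income ones, "B" the rest.
rows = [
    {"income": 40000, "college": 12.0}, {"income": 42000, "college": 14.0},
    {"income": 45000, "college": 15.0}, {"income": 70000, "college": 38.0},
    {"income": 75000, "college": 40.0}, {"income": 72000, "college": 42.0},
]
labels = ["A", "A", "A", "B", "B", "B"]

forest = train_forest(rows, labels, n_trees=25, rng=random.Random(0))
print(forest_predict(forest, {"income": 80000, "college": 45.0}))
```

Because each stump sees a different resample, an unlucky sample can mislead an individual tree, but the vote across the ensemble smooths that out.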
And The Winner Is…
Hillary Clinton wins for the Democrats, with a notable East-West divide. Clinton wins 31 states including her home of New York versus 19 for Sanders.
And a landslide victory for Donald Trump among the Republicans. A tad exaggerated, perhaps? Only time will tell!
Some caveats: As mentioned earlier, these predictions are only as good as the assumptions upon which they are based. In this case, we assume that demographic features are universally strong predictors of voting outcome. We also simplify by treating the presidential nomination as decided by votes, when in reality it’s based on the number of delegates assigned to each candidate. Finally, we face the typical problem of limited data: in some instances (e.g. Rubio), the model had little information to go on given the small number of wins, while in others (e.g. Trump) it may be exaggerating future success.
In any case Super Tuesday is today, so we’ll see how the model fares!