Wednesday, March 23, 2016

2016 Political Primary Models

For the last month I have been working on statistical approaches to "fit" the presidential primary results in each state to some type of linear model. Initially, I was interested only in economic variables, such as unemployment rates, cost of living differentials, median income, and rates of poverty. These models had good success--as of March 2, three economic variables correctly fit 13 of the 15 states that had voted (or caucused) up to that point on the Democratic side. I gradually included other variables to make the models more complicated--education, race, the prevalence of various industries in a state, changes in those rates over time, age, health, and violence, for example. In all, I incorporated over 3,000 variables into potential models. I present the final models below--three for the Democratic side, and three for the Republican side.

The Democratic models correctly fit 27-28 of the 29 states that have voted as of March 23, 2016, using 3-4 variables. Two of the Republican models correctly fit all 31 states, and the third correctly fits 30 of 31, using 4-5 variables. The outcome variable in both cases is a simple subtraction between two candidates: Clinton minus Sanders on the Democratic side, and Trump minus Cruz on the Republican side. I used these two difference scores as the dependent variables in two multiple regression equations, with economic, health, cultural, and other variables as the independent predictors. Several statistics are available to help determine which models are better than others, such as AIC, BIC, R-squared, and the residuals, and in this specific case, the accuracy of the model in correctly finding the "winner" of the state-level contests, i.e., whether the model predicted a higher score for the winner than for the losing competitor.
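As an illustration, here is a minimal sketch of how one of these regressions could be fit in R. The data frame and column names (primary, dem_diff, and the predictor names) are hypothetical stand-ins for the actual data files, not the script I ran:

    # Hypothetical data frame 'primary': one row per state, dem_diff = Clinton
    # minus Sanders, plus candidate predictor columns from the larger pool.
    m1d <- lm(dem_diff ~ unemployment + pct_no_religion + out_migration + rep_dem_margin_0812,
              data = primary)
    summary(m1d)         # coefficients, p-values, adjusted R-squared
    AIC(m1d); BIC(m1d)   # information criteria for comparing competing models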

Table A shows the three models for the Democratic contest for all 50 states. The second column shows the "Democratic Difference" score, while the 3rd-5th columns show the predicted values from the three models. A positive Democratic Difference score means a win for Clinton by that margin, while a negative score means a win for Sanders by that margin. For example, Clinton won Arizona by 17.7 pts, so the score in column 2 is +17.7, while Sanders won Kansas by 35.4 pts, so the score in column 2 is -35.4. Table B shows the same for the Republican side: a positive score means a win for Trump by that margin, and a negative score means a win for Cruz by that margin. I did not include any other candidates from either party in these models, and I made no attempt to predict the results of the general election, just the primaries. The pink-shaded rows are the states that the model has incorrectly fit.
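Concretely, a state counts as incorrectly fit when the predicted difference and the actual difference have opposite signs. A sketch of that check, reusing the hypothetical names from above:

    # Flag states the model fit incorrectly, i.e. where the sign of the
    # predicted difference disagrees with the sign of the actual difference.
    primary$pred_m1d  <- predict(m1d)
    primary$incorrect <- sign(primary$pred_m1d) != sign(primary$dem_diff)
    sum(primary$incorrect, na.rm = TRUE)   # number of states fit incorrectly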

Tables C & D show the statistical results of each model, along with the variables used by each model, and can be matched with the state-level predictions in Tables A & B. For example, in Table A, model 1 (column M1D) predicts that Sanders will win Alaska by a wide margin (51.5 pts). In fact, all three models predict big wins for Sanders in Alaska--most of the models make similar state-level predictions, but do so using different variables. Table C shows the three specific Democratic models. The left third of the table describes model 1 (M1-D), which uses four variables: unemployment, "nones" (unaffiliated with any religion), the ratio of people who migrated out of the state in 2015, and the difference between the Republican and Democratic votes for president, averaged over 2008-12. This latter measure is a simple subtraction of Republican minus Democrat, so a positive value indicates a Republican win by that margin.

Interpreting Tables C & D can be challenging. The B-coefficients are standardized coefficients, so within model 1 (M1-D) you can compare the four variables to each other by the strength and direction of their contribution. For example, the strongest predictor in this model is the "nones" variable, and it is negative, implying that it works in the opposite direction from the outcome variable, the Democratic Difference (DemDiff) measure, which is Clinton minus Sanders. A higher positive value indicates a bigger win for Clinton, and a negative value indicates a win for Sanders. The negative coefficient on "no religious affiliation" implies that the higher the DemDiff score, the lower the rate of people who claim no religious affiliation. In other words, Clinton tends to win in states with more people who claim to be affiliated with a religion, while Sanders tends to win in states where more people claim to be unaffiliated with a religion.
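The standardized coefficients can be reproduced by refitting the model on z-scored variables, so that each coefficient is comparable in strength. A sketch, again using the hypothetical names introduced above:

    # Standardized coefficients: z-score the outcome and predictors, then refit.
    vars   <- c("dem_diff", "unemployment", "pct_no_religion",
                "out_migration", "rep_dem_margin_0812")
    scaled <- as.data.frame(scale(primary[, vars]))
    m1d_std <- lm(dem_diff ~ ., data = scaled)
    coef(m1d_std)   # standardized coefficients, comparable across predictors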

The unemployment variable is the next strongest in M1-D, and it is positive. This means that the higher the DemDiff value, the higher the unemployment rate, implying that Clinton tends to win in states with higher unemployment, and Sanders tends to win in states with lower unemployment. The next strongest variable is the Rep-Dem presidential election result from 2008-2012, and it is negative. Since a higher value means that Republicans won by a larger margin, Clinton tends to win in states where Democrats won with higher margins, while Sanders wins in states with lower Democratic margins, or even Republican wins. Finally, the out-migration variable measures how many people were leaving the state compared to people moving into the state. This variable is positive, so Clinton tends to win in states where more people have been moving out to other states, while Sanders tends to win in states where more people have been moving in from other states. In model 2, M2-D, the male-female ratio measures the number of men versus women in the population: a higher value means relatively more men, while a lower value means relatively more women. In this case the coefficient is negative, meaning that Clinton tends to do better in states with a higher ratio of women to men.

The bottom half of Tables C & D provides statistical information about the models. All six models have very low p-values, implying strong confidence in the results of the models. Below the "Model P" row, I provide the number of states that each model got wrong, out of the races that had been decided as of March 23. Two of the Republican models correctly fit all 31 of the state contests up to that point. The BIC is a Bayesian measure that helps compare models to each other--the lower the number, the better. The residuals summarize how far the "predicted" values are from the "measured" values that actually took place, so lower values are better. Finally, the adjusted R^2 describes how much of the variability in the data is explained by the model, adjusted for the number of predictors relative to the sample size (in this case, 51). So, for example, for M1-D the adjusted R^2 is 0.729, meaning that this model explains around 72.9% of the variation in the data.
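All of these statistics can be pulled directly from a fitted lm object in R. A sketch, continuing with the hypothetical m1d fit from above:

    # Fit statistics of the kind reported in the bottom half of Tables C & D.
    s <- summary(m1d)
    s$adj.r.squared               # adjusted R-squared
    BIC(m1d)                      # Bayesian information criterion
    summary(residuals(m1d))       # spread of observed-minus-fitted values
    f <- s$fstatistic             # overall model p-value from the F test
    pf(f[1], f[2], f[3], lower.tail = FALSE)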

Table D describes the Republican models, where the outcome measure is the difference Trump minus Cruz, so positive values mean that Trump won by that margin, while negative values mean Cruz won by that margin. While race did not factor into the Clinton-Sanders models, several Trump-Cruz models are strengthened by race measures. For example, M1-R has a positive B-coefficient for "% population Black," which implies that Trump does better in states with a higher percentage of Black residents. The change in the number of mining jobs for men from 2000-2013 has a negative coefficient--states where this number is higher gave Cruz stronger wins compared to Trump, implying that mining job losses in a state boost Trump's margins there. In model 2, M2-R, the Gallup Well-Being Index is an annual measure of how well the people in each state are doing--a higher number means they are doing better. This coefficient is negative, meaning that as states do poorly in well-being, Trump wins by more, while as states do better, Cruz does better.

Model M1-R contains a variable describing the difference between Clinton's and Obama's primary results in 2008. This is a subtraction of Clinton minus Obama, so positive numbers mean a win for Clinton, and negative numbers mean a win for Obama. In this model, the coefficient is positive, meaning that in states where Clinton won by larger margins against Obama in 2008, Trump does better. Models 2-3 (M2-R, M3-R) have another political variable, the Republican-Democrat differences from the 2000 & 2004 presidential elections. Positive values mean a Republican win. In both models, this variable is negative, meaning that larger Democratic wins in those states (negative values) tend to go with stronger wins for Trump, while stronger wins for Republicans in those states give higher margins for Cruz.

Finally, model 3 (M3-R) includes a variable on slavery in 1860, which measures the percentage of a state's population that was enslaved. I pulled together a number of variables to create a "Southern Culture" factor that I believed would help predict primary results, and historical slavery in Southern states was one of those variables. The Southern Culture Index proved far less valuable in the models than I predicted, so it was not used in any of the best models. However, the slavery variable was predictive of the Trump-Cruz contest. This variable is positive, meaning that the higher the proportion of a state's population that was enslaved in 1860, the better Trump does in that state. Clearly there is a strong geographical feature to this variable, which is why I included it in the Southern Culture Index, where, as expected, it had strong associations with other measures that differentiate North vs South. However, the fact that the Southern Culture Index was poorly predictive of the Trump-Cruz difference would seem to imply that the relationship of this variable to the Trump vs Cruz vote differences is more than just geography, but carries a cultural residue from that history. It is beyond the scope of this paper to postulate a mechanism that links that slave history with higher win margins for Trump.

As a technical note, I used the open-source software R for this analysis. The 3,000+ variables were processed by a script that performed the following steps (a simplified sketch of the pipeline appears after the list):
  1. Create all possible combinations of the variables, in ranges of 2-6 variables per model. I did not use any interaction terms, and I only included variables whose correlation with the outcome was above 0.15 in absolute value.
  2. Process all of these combinations of variables as linear regression models.
  3. Eliminate all models that had an adjusted R^2 < 0.7, and any model containing a variable with a VIF (variance inflation factor) > 4.
  4. Eliminate all models that had 3 or more incorrect "fits" to the states.
  5. Produce a summary report of the remaining models, listing the # incorrect, BIC, residuals, and adjusted R^2.
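The following is a simplified reconstruction of that pipeline under assumed names (a data frame primary, outcome column dem_diff, and a character vector predictors of candidate variable names), not the actual script I ran; it uses vif() from the car package for the collinearity check in step 3:

    library(car)   # provides vif()

    # Step 1: keep variables whose correlation with the outcome exceeds 0.15 in
    # absolute value, then form all combinations of 2-6 of them (no interactions).
    cors   <- sapply(primary[, predictors], cor, y = primary$dem_diff,
                     use = "pairwise.complete.obs")
    keep   <- names(cors)[abs(cors) > 0.15]
    combos <- unlist(lapply(2:6, function(k) combn(keep, k, simplify = FALSE)),
                     recursive = FALSE)

    results <- lapply(combos, function(vars) {
      # Step 2: fit the linear regression for this combination of variables.
      fit <- lm(reformulate(vars, response = "dem_diff"), data = primary)
      # Step 3: drop models with a low adjusted R^2 or any VIF above 4.
      if (summary(fit)$adj.r.squared < 0.7 || any(vif(fit) > 4)) return(NULL)
      # Step 4: drop models with 3 or more incorrect state-level fits (sign mismatch).
      wrong <- sum(sign(predict(fit, newdata = primary)) != sign(primary$dem_diff),
                   na.rm = TRUE)
      if (wrong >= 3) return(NULL)
      # Step 5: one summary line per surviving model.
      data.frame(vars = paste(vars, collapse = " + "), incorrect = wrong,
                 BIC = BIC(fit), resid_sd = sd(residuals(fit)),
                 adj_r2 = summary(fit)$adj.r.squared,
                 stringsAsFactors = FALSE)
    })
    report <- do.call(rbind, results)   # summary report of the remaining models

The same loop would be run a second time with the Trump-Cruz difference as the outcome column.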
After R produced this abbreviated list (the original set of models was several million possibilities per outcome variable), I used Excel to rank them according to the lowest number of incorrect fits and lowest residuals, while also visually inspecting the BIC and adjusted R^2, although those patterns largely followed the trends of the lowest residuals. I eliminated models that had similar "types" of variables, such as multiple "jobs" variables, religion variables, or voting variables, even when the model and the variables had low VIFs and no indications of multicollinearity. I also eliminated models where individual variables had p-values over 0.2. Only one model above (M2-R) had a variable p-value above 0.1, and I retained that model because of its accuracy in having zero incorrect fits, and reasonable residuals, BIC, & adjusted R^2 compared to other models. I gave preference to models with fewer predictor variables.
