I tried 36 different combinations of these 6 variables. Race was an important predictor; used by itself, it correctly predicted 12/15 of the races. However, in several models it dropped out (it failed to reach statistical significance), and in other models including it did not improve predictive accuracy. Education, by contrast, proved to be a useful predictor in combination. On its own, it was one of the worst predictors, missing almost half of the races. However, when combined with economic variables, specifically the cost of living and unemployment, it produced the only model that missed just one state, Oklahoma. The rest of the models missed 2 or more states.
In addition to accuracy of prediction, I also produced results for AIC, BIC and the residuals. The 36 models, the p-value significance of each variable, and the AIC/BIC/residuals data are in the image below. The column for B*U is the interaction term for % Black population * unemployment. The last column is the number of states incorrectly predicted, and the table is sorted first by the number of states missed, and then by lowest AIC. In statistics, you can use AIC and BIC to compare different regression models fit to the same data--the lower the value, the stronger the case that it is the better model (lowest values highlighted in green). Similarly, lower residuals also tend to indicate a better model. As you can see from the chart, the top model does not have the best AIC/BIC/residuals, despite the fact that it has the best prediction record. In the models where I did not use a specific variable, that cell is highlighted in red and marked with an "x." In models where the variable failed to reach statistical significance (the threshold being p less than 0.05), I have crossed out the value and made the font red.
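To make the AIC/BIC comparison concrete, here is a minimal sketch of how the two scores can be computed from a linear model's residual sum of squares. It assumes an ordinary least-squares fit with Gaussian errors and counts the error variance as an extra parameter, which is the convention R follows for `lm` objects. The function names and the residual sums of squares are hypothetical, not values from the table.

```python
import math


def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a least-squares fit with Gaussian errors."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def aic_bic(rss, n, n_coefs):
    """Return (AIC, BIC) for a linear model.

    rss     -- residual sum of squares of the fitted model
    n       -- number of observations (contests already decided)
    n_coefs -- number of estimated coefficients, including the intercept
    """
    k = n_coefs + 1  # +1 for the error variance (assumed convention, as in R's lm)
    log_lik = gaussian_log_likelihood(rss, n)
    aic = 2 * k - 2 * log_lik
    bic = k * math.log(n) - 2 * log_lik
    return aic, bic


# Hypothetical residual sums of squares for two models fit to the same 15 contests.
candidates = {
    "3 predictors": aic_bic(rss=900.0, n=15, n_coefs=4),
    "5 predictors": aic_bic(rss=850.0, n=15, n_coefs=6),
}

# Sort by AIC, lowest (best) first.
for name, (aic, bic) in sorted(candidates.items(), key=lambda kv: kv[1][0]):
    print(f"{name}: AIC={aic:.1f}  BIC={bic:.1f}")
```

In this made-up example the larger model has the smaller residual sum of squares but the higher AIC, because the penalty for the two extra parameters outweighs the small gain in fit. That is one way a model can look better on residuals and worse on AIC/BIC.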
This image shows the predicted values of Sanders' wins in each state based on this model (cost of living, unemployment, and college education). The only state it missed is Oklahoma. A positive value is a win for Clinton (highlighted in red), and a negative value is a win for Sanders (highlighted in blue). In the "model prediction" column, the correct predictions are highlighted in green.
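The sign rule described above (positive means Clinton, negative means Sanders) and the count of missed states can be sketched in a few lines. The state names and margins below are placeholders for illustration only, and the helper names are hypothetical; none of the numbers come from the table.

```python
def predicted_winner(margin):
    """Map a predicted margin to a winner: positive is Clinton, negative is Sanders."""
    if margin > 0:
        return "Clinton"
    if margin < 0:
        return "Sanders"
    return "Tie"


def missed_states(predicted_margins, actual_winners):
    """Return the states where the predicted winner disagrees with the actual result."""
    return sorted(
        state
        for state, margin in predicted_margins.items()
        if state in actual_winners and predicted_winner(margin) != actual_winners[state]
    )


# Placeholder numbers for illustration only, not values from the table.
predicted = {"State A": 12.5, "State B": -8.0, "State C": 3.1}
actual = {"State A": "Clinton", "State B": "Sanders", "State C": "Sanders"}

misses = missed_states(predicted, actual)
print(f"{len(misses)} missed: {misses}")
```

Counting misses this way ignores how large each predicted margin was, so a model can have the fewest misses without having the smallest residuals.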
This image is the actual R output for this model, showing the p-values, model significance, adjusted R-squared, the coefficients for each variable, etc.
(Addendum--Predictions)
This final image is a list of all 50 states + DC with the original data used to calculate the models, and predictions for the outcome of the rest of the primaries/caucuses. Model 1 is just the 2 economic variables + education. Model 2 is those same 3 variables, plus median earnings and the % of the state population that self-identified as Black in the American Community Survey 5-year estimate (2010-2014). It is, arguably, the 2nd best model. One of its problems is that both race and education drop out of statistical significance. However, removing them from the analysis creates an inferior model, so for the purpose of comparison, I left this model intact alongside Model 1.