Saturday, March 26, 2016

Brain Drain--Indiana is at the Bottom of the Barrel

According to recently released 2014 Census data, Indiana ranks at the very bottom of the states, 50th, for "brain drain," a not-so-fancy social science term for the migration of people with advanced degrees out of one place and into another. Brain drain is typically attributed to a country or state having lower wages, fewer professional employment opportunities, and poorer quality of life. There are various ways of measuring brain drain, but the most typical measures simply count the number of people with college degrees who are leaving your area compared to those who are moving in. For example, in the table to the right, the column "BA+Grad" is the number of people with bachelor's and graduate degrees who have moved into your state minus the number who have moved out. A positive value means more college-educated people have moved into your state than have left, while a negative number means that state had a net loss of college-educated people.

In the table I have included three additional columns. The 4th column, "BA+Grad," is the simple calculation described above. The 5th column, "No HS/Only HS," is a similar calculation, but measures the migration of people with only a high school education (no college at all) or with less than a high school education. A positive value means that poorly educated people are moving into that state at higher rates than they are moving out, while a negative value means that state had a net loss of poorly educated residents. The 3rd column, "Coll/NoColl Migration," simply describes the signs of these two types of migration: +/+ indicates that both the highly educated and the poorly educated are moving into your state, while -/- indicates that both groups are moving out of your state.

The first column, "Brain Drain Rank," ranks this migration process on two measures. The primary ranking is the sign of the flows: if more educated people are moving into a state than leaving it, the state ranks higher; if more educated people are leaving than arriving, it ranks lower. The lowest ranks in this list go to states that not only have negative migration of highly educated people, but positive migration of poorly educated people--meaning that people with college degrees are leaving the state while people with only a high school degree or less are moving in. Thus, a state could have a net gain of people overall, but that gain comes entirely from poorly educated people.
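The two-part ranking described above can be sketched in a few lines of Python (the author worked in R and Excel; the state names and migration counts below are invented for illustration):

```python
# Tier 1: net gain of the college-educated (best).
# Tier 2: net loss of both groups ("dying states").
# Tier 3: losing the college-educated while gaining the
#         poorly educated (worst -- the Indiana pattern).
def brain_drain_tier(ba_grad, no_hs_hs):
    """ba_grad / no_hs_hs: net migration of each education group."""
    if ba_grad >= 0:
        return 1
    if no_hs_hs < 0:
        return 2
    return 3

# Hypothetical net-migration figures: (BA+Grad, No HS/Only HS)
states = {
    "A": (12000, 3000),   # gains both groups
    "B": (-4000, -2500),  # loses both groups
    "C": (-6000, 1500),   # loses educated, gains poorly educated
}
# Rank by tier, then by the size of the college-educated gain
ranked = sorted(states, key=lambda s: (brain_drain_tier(*states[s]), -states[s][0]))
# ranked == ["A", "B", "C"]
```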

Data indicate that there is a strong relationship between education and employment. People with college degrees have a far higher likelihood of finding jobs than people with only a high school diploma or less, and those jobs tend to pay far more. Thus, a net in-flow into a state of people with only a high school degree (or less) means that state could face greater demands on its social services budgets due to higher rates of unemployment among its residents, while a net out-flow of people with college degrees means fewer resources for the tax base and for social service providers. On the one hand, it is of course problematic for a state to have net losses of population--"dying states," so to speak--for example, Alabama, Kansas, and Kentucky, which lost both the highly educated and the poorly educated in 2014. On the other hand, it becomes an even greater problem for a state's economy when the highly educated are leaving but the poorly educated are moving in.

Indiana ranks the worst on this combined measure--among states that lost people with college degrees but gained people with only a high school education or less. Six states had higher overall rates of loss of college graduates--New Jersey, Illinois, South Dakota, New York, Wyoming, and Alaska. However, those states lost both the highly educated and the poorly educated, with Alaska hemorrhaging both. Fourteen states, including Indiana, showed the worst pattern, losing the college-educated while gaining the poorly educated; of those, Indiana had the highest rate of loss of college graduates, and the 3rd highest rate of increase in poorly educated migrants, after North Dakota and Wisconsin.

Wednesday, March 23, 2016

2016 Political Primary Models

For the last month I have been working on statistical approaches to "fit" the presidential primary results in each state to some type of linear model. Initially, I was interested only in economic variables, such as unemployment rates, cost of living differentials, median income, and rates of poverty. These models had good success--as of March 2, three economic variables correctly fit 13/15 of the states that had voted (or caucused) up to that point on the Democratic side. I gradually started including other variables to make the models richer--education, race, the prevalence of various industries in a state, changes in those rates over time, age, health, and violence, for example. In all, I incorporated over 3,000 variables into potential models. I present the final models below--three for the Democratic side, and three for the Republican side.

The Democratic models correctly fit 27-28 out of the 29 states that have voted as of March 23, 2016, using 3-4 variables. Two of the Republican models correctly fit all 31 states, and the third correctly fits 30/31 states, using 4-5 variables. The outcome variable in both cases is a simple subtraction of two candidates' vote shares: on the Democratic side, Clinton-Sanders, and on the Republican side, Trump-Cruz. I used these two variables as the dependent variables in two multiple regression equations, with economic, health, cultural (etc.) variables as the independent predictors. Various statistics are available to help determine which models are better than others, such as AIC, BIC, R-squared, and the residuals--and in this specific case, the accuracy of the model in correctly finding the "winner" of the state-level contests, i.e., whether the model predicted a higher score for the winner than for the losing competitor.
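The "winner accuracy" check amounts to asking whether a model's fitted margin has the same sign as the actual margin in each state. A minimal Python sketch (the author used R; the predicted margins below are invented, and only the Arizona and Kansas actual margins come from the post):

```python
# Margins are Clinton - Sanders: positive = Clinton win, negative = Sanders win.
actual    = {"AZ": 17.7, "KS": -35.4, "OK": -10.3, "VT": -72.5}
predicted = {"AZ": 12.1, "KS": -20.0, "OK":   4.2, "VT": -55.0}

def winners_correct(actual, predicted):
    """Count states where predicted and actual margins share a sign."""
    return sum(
        1 for s in actual
        if (actual[s] > 0) == (predicted[s] > 0)
    )

winners_correct(actual, predicted)  # 3 of 4 -- this sketch misses OK
```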

Table A shows the three models for the Democratic contest for all 50 states. The second column shows the "Democratic Difference" score, while the 3rd-5th columns show the predicted values based on the three models. A positive Democratic Difference score means a win for Clinton by that margin, while a negative score means a win for Sanders by that margin. For example, Clinton won Arizona by 17.7 pts, so the score in column 2 is +17.7. However, Sanders won Kansas by 35.4 pts, so the score in column 2 is -35.4. Table B shows the same, but for the Republican side. A positive score means a win for Trump by that margin, and a negative score means a win for Cruz by that margin. I did not include any other candidates from either party in these models, and I made no effort to attempt to predict the results of the general election, just the primaries. The pink-shaded areas are the states that the model has incorrectly fit.

Tables C & D show the statistical results of each model, along with the variables used by each model, which can be matched with the state-level predictions in Tables A & B. For example, in Table A, model 1 (column M1D) predicts that Sanders will win Alaska by a wide margin (51.5 pts). Actually, all three models predict big wins for Sanders in Alaska--most of the models have similar state-level predictions, but arrive at them using different variables. Table C shows the three specific Democratic models. The left third of the table describes model 1 (M1-D)--it uses four variables: unemployment, "nones" (unaffiliated with any religion), the rate of people who migrated out of the state in 2015, and the difference between the Republican and Democratic votes for president, averaged over 2008-12. This latter measure is a simple subtraction, Republican-Democrat, so a positive value indicates a Republican win by that margin.

Interpreting Tables C & D can be challenging. The B-coefficients are the standardized coefficients. In model 1 (M1-D) you can compare the four variables to each other by the strength and direction of their contributions. For example, the strongest predictor in this model is the "nones," and it is negative, implying that it works in the opposite direction from the outcome variable, the Democratic Difference (DemDiff) measure, which is Clinton-Sanders. A higher positive value indicates a bigger win for Clinton, and a negative value indicates a win for Sanders. The negative coefficient on "no religious affiliation" implies that the higher the DemDiff score, the lower the rate of people who claim no religious affiliation. In other words, Clinton tends to win in states with more people who claim to be affiliated with a religion, while Sanders tends to win in states where more people claim to be unaffiliated.

The unemployment variable is the next strongest variable in M1-D, and it is positive. This means that the higher the DemDiff value, the higher the unemployment, implying that Clinton tends to win in states with higher unemployment, and Sanders in states with lower unemployment. The next strongest variable is the Rep-Dem presidential election results from 2008-2012, and it is negative. Since a higher value means that Republicans won by a larger margin, this means that Clinton tends to win in states where Democrats won by higher margins, while Sanders wins in states with lower Democratic margins, or even Republican wins. Finally, the out-migration variable measures how many people were leaving the state compared to people moving in. This variable is positive, so Clinton tends to win in states where more people have been moving out to other states, while Sanders tends to win in states where more people have been moving in from other states.

In model 2, M2-D, the male-female ratio is a measure of the number of men versus women in the population. A higher value means a higher ratio of men to women, while a lower value means a higher ratio of women to men. In this case, the coefficient is negative, meaning that Clinton tends to do better in states with a higher ratio of women to men.

The information in the bottom half of Tables C & D provides statistical information about the models. All six models have very low p-values, allowing strong confidence in their results. Below the "Model P" row, I provide the number of states that each model got wrong, of the races that had been decided as of March 23. Two of the Republican models correctly fit all 31 of the state contests up to that point. The BIC is a Bayesian measure that helps compare models to each other--the lower the number, the better. The residuals summarize how far the "predicted" values are from the "measured" values that actually took place, so the lower these values, the better. Finally, the adjusted R^2 describes how much of the variability of the data is explained by the model, adjusted for the sample size (in this case, 51). So, for example, for M1-D, the adjusted R^2 is 0.729, meaning that this model explains around 72.9% of the variation in the data.
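For concreteness, the quantities reported in these tables--residuals, adjusted R^2, and BIC--can be computed by hand for an ordinary least-squares fit. A minimal numpy sketch on synthetic data (the 51-observation sample size mirrors the post; the predictors, coefficients, and noise level are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 51, 3                      # observations, predictors
X = rng.normal(size=(n, k))
beta = np.array([2.0, -1.5, 0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

X1 = np.column_stack([np.ones(n), X])          # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ coef                          # "predicted" minus "measured"
rss = float(resid @ resid)
tss = float(((y - y.mean()) ** 2).sum())

r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalized for # of predictors
p = k + 1                                      # parameters incl. intercept
bic = n * np.log(rss / n) + p * np.log(n)      # lower is better
```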

Table D describes the Republican models, where the outcome measure is the difference Trump-Cruz, so positive values mean that Trump won by that margin, while negative values mean Cruz won by that margin. While race did not factor into the Clinton-Sanders models, several Trump-Cruz models are strengthened by race measures. For example, M1-R has a positive B-coefficient for "% population Black," which implies that Trump does better in states with a higher percent of Blacks compared to Whites. The change in the number of mining jobs for men from 2000-2013 has a negative value--states where this number is higher gave Cruz stronger wins, implying that mining job losses in a state boost Trump's margins there. In model 2, M2-R, the Gallup Well-Being Index is an annual measure of how well the people in each state are doing--a higher number means they are doing better. This value is negative, meaning that as states do poorly in well-being, Trump wins by more, while as states do better, Cruz does better.

Model M1-R contains a variable describing the difference between Clinton's and Obama's primary results in 2008. This is a subtraction, Clinton-Obama, so positive numbers mean a win for Clinton, and negative numbers mean a win for Obama. In this model, the variable is positive, meaning that Trump does better in states where Clinton won by larger margins against Obama in 2008. Models 2-3 (M2-R, M3-R) have another political variable, the Republican-Democrat differences from the 2000 & 2004 presidential elections. Positive values mean a Republican win. In both models, this variable is negative, meaning that higher Democratic wins in those states (negative values) tend to accompany stronger wins for Trump, while stronger Republican wins give higher margins for Cruz.

Finally, model 3 (M3-R) includes a variable on slave ownership in 1860, measured as the percent of the population of that state that was enslaved. I pulled together a number of variables to create a "Southern Culture" factor that I believed would help predict primary results, and historical slave ownership in Southern states was one of those variables. The Southern Culture Index proved far less valuable in the models than I predicted, so it was not used in any of the best models. However, the slaves variable was predictive of the Trump-Cruz contest. This variable is positive, meaning that the higher the percent of the population enslaved in that state in 1860, the better Trump does there. Clearly there is a strong geographical feature to this variable, which is why I included it in the Southern Culture Index, where, as expected, it had strong associations with other measures that differentiate North vs South. However, the fact that the Southern Culture Index was poorly predictive in the Trump-Cruz models would seem to imply that the relationship of this variable to the Trump vs Cruz vote differences is more than just geography--it carries a cultural residue from that history. It is beyond the scope of this post to postulate a mechanism linking that slave history with higher win margins for Trump.

As a technical note, I used the open-source software R for this analysis. The 3000+ variables were processed by creating a script to do the following steps:
  1. Create all possible combinations of the variables, with 2-6 variables per model. I did not use any interaction terms, and I only included variables whose correlation with the outcome was above abs(0.15).
  2. Process all of these combinations of variables as linear regression models.
  3. Eliminate all models that had an adjusted R^2 < 0.7, and any variable with a VIF (variance inflation factor) > 4.
  4. Eliminate all models that had 3 or more incorrect "fits" to the states.
  5. Produce a summary report of the remaining models, listing the # incorrect, BIC, residuals, and Adj R^2.
After R produced this abbreviated list (the original set of models was several million possibilities per outcome variable), I used Excel to rank them by the lowest number of incorrect fits and the lowest residuals, while also visually inspecting the BIC and Adj R2, although those patterns largely followed the trends of the lowest residuals. I eliminated models that had multiple similar "types" of variables--such as several "jobs" variables, religion variables, or voting variables--even when the model and the variables had low VIFs and no indications of multicollinearity. I also eliminated models where individual variables had p-values over 0.2. Only one model above (M2-R) had a variable p-value above 0.1, and I retained it because of its accuracy in having zero incorrect fits and reasonable residuals, BIC & adjusted R^2 compared to other models. I gave preference to models with fewer predictor variables.
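The search loop in steps 1-4 can be sketched in Python (the original script was in R and searched 3,000+ variables; the variable names, coefficients, and miniature scale below are invented, and the VIF and incorrect-fit checks are omitted for brevity):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 51
data = {name: rng.normal(size=n)
        for name in ["unemp", "nones", "migrate", "noise1", "noise2"]}
y = 2 * data["unemp"] - 3 * data["nones"] + 2 * data["migrate"] \
    + rng.normal(scale=0.3, size=n)

def adj_r2(names):
    """Fit an OLS model on the named variables; return adjusted R^2."""
    X = np.column_stack([np.ones(n)] + [data[v] for v in names])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - len(names) - 1)

# Step 1: screen variables by correlation with the outcome, |r| > 0.15
screened = [v for v in data if abs(np.corrcoef(data[v], y)[0, 1]) > 0.15]

# Steps 2-3: fit every 2-3 variable combination, keep adjusted R^2 >= 0.7
survivors = [
    combo
    for size in (2, 3)
    for combo in itertools.combinations(screened, size)
    if adj_r2(combo) >= 0.7
]
```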

Friday, March 18, 2016

"Southern Culture" Index--Part 2

Earlier this week I described my first attempt to create a "Southern Culture" Index that relied on non-economic variables to help generate a model to predict the primaries. The purpose of this approach was to have a single "cultural" factor that could distinguish regional differences in US voting patterns, as opposed to economic factors, which I treat separately. Two weeks ago I posted my early attempt at a regression model that fit the Democratic primaries that had taken place up to that point between Clinton and Sanders (and O'Malley). That model used two economic factors--cost of living and rate of unemployment--along with rates of college attendance, and correctly fit 14/15 of the primary elections. Neither cultural nor race/ethnicity variables improved the model.

This post about creating a factor index for "Southern Culture" will be more technical than the first, but will also propose a revised and improved model. In both cases I used exploratory factor analysis (principal axis factor extraction) in R, specifically the psych package, to reduce 15 variables down to a 4-variable model that has good statistical properties, more so than the proposed factors I mapped in my last post. Here is the current map of the factor I am calling the "Southern Culture" Index, with states divided into four groups: darkest red is "most Southern," and lighter shades are "less Southern."

Using R, and the original 15 variables chosen from the literature as correlates of Southern states, I created a script that put all 15 variables into every possible combination and tested each of those models against nine common measures of goodness of fit for exploratory factor analysis (EFA) (one heavily cited reference on these issues is Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55). This produced just over 5,000 combinations of 2-6 variables per model. Typically with EFA a researcher wants to reduce a number of variables into a smaller number of factors. Say, for instance, you have a survey of 100 personality-related questions, and you want to create a small set of personality "factors"--like introversion, agreeableness, etc. In that case, you could run the answers to all of the survey questions through a statistical analysis and let the software find a few factors based on patterns in how the 100 questions were answered by your respondents. In this case, I wanted to generate just one factor, which is why I took the approach that I did.

I filtered the 5000 models based on specific cutoff criteria in the literature:

  • Moderate correlation of the variables: ideally between 0.3-0.8
  • Bartlett's test for sphericity: the p-value should be less than 0.05
  • Kaiser-Meyer-Olkin Index (KMO): ideally above 0.8
  • Tucker-Lewis Index (TLI, also called NNFI): ideally over 0.95
  • Root mean square error of approximation (RMSEA): ideally less than 0.05
  • Root mean square of the residuals (corrected for degrees of freedom, cRMSR): ideally less than 0.05
  • Bayesian Information Criterion (BIC): there is no standard value, but the lower the raw value the better, ie, if you have negative values, the more negative the better.
  • The communalities (h^2) of the variables in the factor: at least 0.32 (Tabachnick, B. G., & Fidell, L. S. (2001). Using Multivariate Statistics. Boston: Allyn and Bacon.), but the closer to 1 the better.
  • R2, the proportion of the variance explained by the factor: the closer to 1 the better, I chose a cutoff of 0.5 (a factor that explains 50% of the variation).
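The filtering step amounts to a simple predicate over each candidate model's fit statistics. In this hypothetical Python sketch, the first model's numbers are the ones reported later in this post, while the second model's numbers are invented failures:

```python
# Cutoff criteria from the EFA literature, as listed above
CUTOFFS = {
    "bartlett_p": lambda v: v < 0.05,
    "kmo":        lambda v: v > 0.8,
    "tli":        lambda v: v > 0.95,
    "rmsea":      lambda v: v < 0.05,
    "crmsr":      lambda v: v < 0.05,
    "r2":         lambda v: v >= 0.5,
}

def passes(stats):
    """A model survives only if every statistic clears its cutoff."""
    return all(test(stats[name]) for name, test in CUTOFFS.items())

models = [
    {"bartlett_p": 0.001, "kmo": 0.81, "tli": 1.04, "rmsea": 0.0,  "crmsr": 0.03, "r2": 0.67},
    {"bartlett_p": 0.20,  "kmo": 0.65, "tli": 0.80, "rmsea": 0.12, "crmsr": 0.09, "r2": 0.30},
]
kept = [m for m in models if passes(m)]   # only the first model survives
```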

In R's psych package, the results are produced with the "fa" command, specifying principal axis factoring, fm="pa". Since I only wanted 1 factor, the rotation didn't matter, so I specified rotate="none". Neither KMO nor Bartlett's test is automatically produced by the "fa" command, but both are available as separate commands: KMO(your variables) and cortest.bartlett(your variables). In this case, "your variables" would be only the specific variables you are testing in a given model, not your entire dataset.
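For readers without R, here is a rough numpy sketch of what single-factor principal axis extraction does: iterate an eigendecomposition of the correlation matrix with communalities substituted on the diagonal. This is a simplified stand-in for psych's fa(), not its exact algorithm, run on synthetic one-factor data:

```python
import numpy as np

# Synthetic data: one latent factor drives four observed variables
rng = np.random.default_rng(2)
n = 50
f = rng.normal(size=n)                          # latent factor scores
load_true = np.array([0.85, 0.9, 0.6, 0.8])     # invented true loadings
X = np.outer(f, load_true) + rng.normal(scale=0.4, size=(n, 4))
R = np.corrcoef(X, rowvar=False)

# Initial communalities: squared multiple correlations (SMC)
h2 = 1 - 1 / np.diag(np.linalg.inv(R))
for _ in range(100):
    Rr = R.copy()
    np.fill_diagonal(Rr, h2)                    # reduced correlation matrix
    vals, vecs = np.linalg.eigh(Rr)             # ascending eigenvalues
    loadings = np.sqrt(max(vals[-1], 0)) * vecs[:, -1]
    loadings *= np.sign(loadings.sum())         # fix the arbitrary sign
    h2 = loadings ** 2                          # updated communalities

r2_explained = h2.sum() / len(h2)               # variance explained by the factor
```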

I pulled all of the resulting models into Excel to filter and peruse. After filtering using the above criteria, I was left with 400 models. One variable that I felt had to be included in the model was the percent of the population in 1860 who were slaves. Little else has so fundamentally shaped the history of the South over the last 200 years as an entire way of life built on slavery. This social model did not completely end after the Civil War, since similar cultural practices continued for generations to subjugate racial minorities--Jim Crow laws, lynchings, etc.--and arguably it still has a profound impact on Southern culture today.

Even though I wanted a "cultural," non-economic model, I still included 5 economic and employment variables in this analysis, such as median family income (2014), income growth, unemployment, and levels of employment change in certain industries, such as manufacturing, over the last 15 years. Many of the filtered models contained these economic variables, and while several such models ranked very high, models without economic variables were also highly ranked. After excluding all models that had economic variables or that did not include the slave population measure, I had fewer than 40 models remaining. These I ranked by BIC (lowest value), TLI (highest value), and KMO (highest value).

After applying the cutoff values listed above, the model produced is different from the one I posted earlier this week. This model has far better statistical properties, and is composed of the following 4 variables: White Evangelicals, death rates (2005-2014), slave population (1860), and teen birth rates (2014, 15-19 year olds). The factor loadings for each variable were strong and positive, meaning, in this case, that higher incidence of what was being measured was more strongly associated with "Southern Culture," while lower incidence was more weakly associated with it. More specifically, research indicates that each of these variables is strongly associated with Southern states: higher rates of White Evangelicals, higher rates of teen births, shorter life spans (higher death rates), and of course the history of slave ownership. The principal factor extraction pulled out these four variables, of the 15 tested, as best explaining the variability of these measures across all 50 states (I did not include Washington DC or Puerto Rico). The analysis in R produced the following results:

Variables            Factor Loadings   Communalities
White Evangelicals        0.84              0.71
death rates               0.95              0.90
slave population          0.64              0.41
teen birth rates          0.80              0.65
  • Bartlett's sphericity: p<0.001
  • KMO: 0.81
  • RMSEA: 0
  • cRMSR: 0.03
  • TLI: 1.04
  • BIC: -7.07
  • R2: 0.67
I did not include these tests in my filter, but they were produced by R, so I reproduce them here:

Correlation of scores with factors: ............. 0.97

Multiple R square of scores with factors: ....... 0.93

Minimum correlation of possible factor scores: ... 0.87

Wednesday, March 16, 2016

"Southern Culture" Index

In trying to generate models to predict/fit this year's election cycle, I wanted to eliminate cultural factors to focus on economic factors. Doing so meant that I had to find a legitimate way to control for regional patterns of cultural difference--for example, differences between the South, Midwest, Northeast, and West, presuming such cultural differences exist. Various demographic maps lend visual credibility to the existence of regional differences, although rigorously disentangling economic from cultural factors is challenging. Prior literature indicates various factors associated with "Southern Culture," including several attempts to create a "Southern Culture Index."

Below are seven maps that show regional differences based on some of the more common factors mentioned in the academic literature as distinguishing the South from other parts of the country. In total, I found 20 variables that I included in an exploratory factor analysis, including voting patterns (Republican to the South, Democratic to the North), occupational differences (manufacturing to the South, science and finance to the North), and income differences (higher to the North and along the coasts, lower to the South). However, many of these variables produced weak statistical results as factors, so I did not create maps for them. The maps below represent the variables that produced the strongest results in creating a "Southern Culture" Index. In addition, I produce four more maps of the best index models generated from various combinations of these 20 variables.

To summarize the data, the South faces a number of challenges compared to the North, Midwest, and West. For example, in addition to the lower incomes mentioned above (although the cost of living is lower, helping to compensate for income disparities), there are lower rates of college graduation and union membership. The South has significantly higher rates of firearm deaths, teen births, and death rates from various causes. Some researchers have identified a relationship between rates of violence in a region and high rates of Scotch-Irish ancestry there, both of which are found in the South. Similarly, the South has a long history of human rights abuses in terms of slavery, a history which continues to shape the South. These factors all contributed to the strongest indices. However, the final models did not use Scotch-Irish ancestry, income, or cost of living. The best models used combinations of the following six variables: death rates, all causes (CDC, 2010-2015), firearm death rates (CDC, 2010-2014), union membership (BLS, 2015), teen birth rates, 15-19 years (CDC, 2015), white Evangelicals (PRRI, 2015), and slave ownership as a percent of the population (1860).

I generated the following maps using the open-source software QGIS, and I used the open-source software R for the factor analysis that generated the indices. The psych package has several nice factor analysis features. The first seven maps show the individual factors, while the final four maps show the best index models (combinations of factors), along with technical information about the strength of each model. As can be seen, all four models produce very similar results. Red states are the most "South-like," blue states are the least "South-like," and purple states have mid-range "South-like" characteristics, according to each of the four models.

Death rates, all causes

Firearm death rates

Scotch-Irish ancestry

Slave ownership as a percent of the population (1860)

Teen birth rates

Union membership

White Evangelicals

4 factor index: White Evangelical + Union Membership + Death Rate + Slave Ownership

4 factor index: White Evangelical + Union Membership + Firearm Death Rates + Slave Ownership

5 factor index: White Evangelical + Union Membership + Firearm Death Rates + Teen Birth Rates + Slave Ownership

4 factor index: White Evangelical + Union Membership + Teen Birth Rates + Slave Ownership

Thursday, March 3, 2016

Regressing the Democratic Primary (Part 2)

Yesterday I posted a regression analysis that correctly predicts 13 of the 15 Democratic primaries/caucuses that have occurred so far this year. One of the most interesting aspects of the model is that it used no polling data or historical voting patterns--just 3 economic variables: median earnings from 2014, the cost of living for 2015, and unemployment for December 2015 (all the latest available data for these measures). Based on a Facebook conversation that ensued, I added education and race to the model, specifically, the percent of residents in the state with a bachelor's degree or higher, and the percent Black population. Based on a specific recommendation, I also tried one interaction term, race with unemployment. The outcome variable is the difference between Clinton's and Sanders' vote shares. So, for example, in Vermont, Sanders beat Clinton by 72.5%, but in Alabama, Clinton beat Sanders by 58.6%.

I tried 36 different combinations of these 6 variables. Race was an important predictor; used by itself, it correctly predicted 12/15 of the races. However, in several models it dropped out (it failed to reach statistical significance), and other models failed to show improved predictability. Education, however, proved to be a useful predictor. On its own, it was one of the worst predictors, missing almost half of the races. But when combined with economic variables, specifically the cost of living and unemployment, it produced the only model that missed just one state, Oklahoma. The rest of the models missed 2 or more states.

In addition to accuracy of prediction, I also produced results for AIC, BIC, and the residuals. The 36 models, the p-value significance of each variable, and the AIC/BIC/residuals data are in the image below. The column B*U is the interaction term, %Black population * unemployment. The last column is the number of states incorrectly predicted, and the table is sorted first by states predicted, and then by lowest AIC. In statistics, you can use AIC and BIC to compare different regression models--the lower the value, the better the case that it is the better model (lowest values highlighted in green). Similarly, lower residuals also tend to indicate a better model. As you can see from the chart, the top model does not have the best AIC/BIC/residuals, despite having the best prediction record. In the models where I did not use a specific variable, that cell is marked with an "x" and highlighted in red. In models where a variable failed to reach statistical significance (p > 0.05), I have crossed out the value and made the font red.

This image shows the predicted margins for each state based on this model (cost of living, unemployment, and college education). The only state it missed is Oklahoma. A positive value is a win for Clinton (highlighted in red), and a negative value is a win for Sanders (highlighted in blue). In the "model prediction" column, the correct predictions are highlighted in green.

This image is from the actual R-output for this model, showing the p-values, model significance, adjusted R-square, the coefficients for each variable, etc.


This final image is a list of all 50 states + DC with the original data used to calculate the models, and predictions for the outcomes of the rest of the primaries/caucuses. Model 1 is just the 2 economic variables + education. Model 2 is those same 3 variables, plus median earnings and the % of the state population that self-identified as Black in the American Community Survey 5-year estimate (2010-2014). It is, arguably, the 2nd best model--one problem with it is that both race and education drop out of statistical significance. However, removing them from the analysis creates an inferior model, so for the purpose of comparison, I left this model intact alongside Model 1.

Wednesday, March 2, 2016

Economic Factors and the Democratic Primary

Political prognosticators use many factors to attempt predictions of elections, and there are many theories of what factors should predict elections. Over the last several weeks, the US has been starting to pick presidential candidates at the state level, through caucuses and primaries. On the Democratic side, 15 states have gone through this process and apportioned delegates, at this point (March 2, 2016) narrowing the field to two candidates, Bernie Sanders and Hillary Clinton.

Of the factors proposed to predict how elections will go, economic factors and historical voting patterns are at the top of the list. Using these as the basis for a model to predict Democratic primary outcomes, I sorted through approximately 20 economic factors, presidential voting data since 1992, state-level voting data from 2014, and federal congressional voting data from 2014. I also incorporated polling data, primarily from FiveThirtyEight (538), which compiles and lists public polling data, as well as polling data from other sources when 538 did not list a particular state.

Using only economic and past voting variables as a basis, I constructed a regression model that explains 75% of the variation (the "adjusted r-square") in the primary & caucus results from the 15 states that have voted so far--this model correctly predicted 13/15 of the elections (missing Oklahoma and Massachusetts). In fact, the final model that I chose uses only 3 economic variables, and no past political voting or current polling data. A second model, using the same 3 economic variables plus the state-wide results of the last presidential election (2012), explains 82% of the variation; however, it only predicted 12/15 of the primaries/caucuses. This second model gave results that were closer to the actual state-level results, but it missed the very tight race in Iowa (in addition to Oklahoma and Massachusetts, also missed by the first model). The dependent/outcome variable for this second model was the percent of the vote given to Sanders, rather than the difference between the Clinton and Sanders votes.
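The "adjusted" part of the adjusted r-square is a penalty for the number of predictors, which matters with only 15 observations. A quick Python illustration of the standard formula--the raw R-square of 0.805 below is a hypothetical number chosen only to show how the penalty works for 3 predictors and 15 states, not a figure from this analysis:

```python
def adjusted_r2(r2, n, p):
    # Penalize the raw R-square for fitting p predictors
    # on only n observations:
    # adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical raw R-square for a 3-predictor model on 15 states
adj = adjusted_r2(r2=0.805, n=15, p=3)  # roughly 0.752
```

With so few observations, each added predictor shrinks the adjusted value noticeably, which is why the adjusted figure is the fairer one to report when comparing models with different numbers of variables.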

Model 1: Only economic factors (R-square=0.752, p<0.0001)

Difference between Clinton vs Sanders = Unemployment (Dec 2015) + Median Earnings (2014) + Cost of Living (2015)

Difference = -13.6 + 2805.6 x Unemployment + 0.0021 x Earnings - 0.0023 x COL
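The fitted equation can be wrapped in a small helper that turns state-level inputs into a predicted margin. This is a Python sketch using the coefficients printed above (the model itself was fit in R); note that the exact units of each input (e.g., whether unemployment enters as a proportion or a percent) are not spelled out here, and must match the original data for the output to be meaningful:

```python
def predicted_margin(unemployment, earnings, col):
    # Model 1 coefficients as printed above; a positive margin
    # predicts a Clinton win, a negative margin a Sanders win
    return -13.6 + 2805.6 * unemployment + 0.0021 * earnings - 0.0023 * col

def predicted_winner(unemployment, earnings, col):
    # Only the sign of the margin decides the predicted winner
    margin = predicted_margin(unemployment, earnings, col)
    return "Clinton" if margin > 0 else "Sanders"
```

A near-zero margin (as in a state like Massachusetts) means the model is effectively calling a toss-up, which is exactly where its two misses occurred.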

The three economic variables that I used were 1) median earnings for 2014 (this data is not yet available for 2015), 2) unemployment for December 2015 (the latest data available), and 3) cost-of-living variation for 2015. Using the statistics package R, I used these three economic variables as my predictor/independent factors, and the raw difference between the percent of the vote given to Clinton vs Sanders in each respective state-wide primary/caucus as my dependent/outcome variable. Using only these three economic variables, the regression model correctly predicted that Vermont, New Hampshire, and Colorado would go for Bernie Sanders, while incorrectly predicting that Massachusetts would also go for Sanders. The model predicted that all other states would go for Clinton, which was correct except for Oklahoma.

What is particularly interesting about this model is that it does not use any cultural or political variables--not even past historical voting data or the expensive polling that newspapers and parties invest in. I expected that votes in the past several presidential elections would help make the model more accurate, and while adding the 2008 presidential election results produced a mathematically better fit (smaller residuals and a larger r-square), it actually did slightly worse at predicting the outcomes of the elections. Similarly, I presumed that the results of the midterm elections might be a good predictor, since there are common social patterns between midterm elections and primaries--specifically, both events typically draw only high-information, highly-motivated voters. However, neither the federal congressional elections of 2014 nor the state-level house/senate votes produced a better model. In fact, each of those midterm election variables produced a far worse model.

As for polling, I did not actually factor it into any of the final models. Part of the difficulty was determining which polls to use. Considering that no single pollster produces data for all of the states, and no two pollsters use the exact same methods, I did not believe it was reasonable, in the end, to include the polling data. In the table below where I show my data, I include an estimated average of the most recent polls listed at 538 for each state. Another interesting feature of the economics-based regression model I created is that it has more predictive value than the polls--while this model predicted 13/15 state outcomes, the poll averages only predicted 12/15, missing New Hampshire, Oklahoma, and Massachusetts. If you count the margin of error in the Iowa polls as an incorrect prediction, the polling averages actually only predicted 11/15. These averages did not take into account that certain individual polls may have had more predictive success than the average.

For each of the three economic variables, the correlations show that the better a state's economy was doing, the more likely it was to vote for Sanders. For example, the higher a state's median earnings for 2014, the more likely it was to vote for Sanders. Similarly, the lower its unemployment rate for Dec 2015, the more likely it was to vote for Sanders. On the other hand, the higher the cost of living in a state, the more likely it was to vote for Clinton.
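The direction of each of these relationships is just the sign of a correlation coefficient. Here is a minimal pure-Python sketch of a Pearson correlation, with made-up numbers standing in for the actual state data, showing how a positive coefficient corresponds to "higher earnings, larger Sanders share":

```python
def pearson_r(xs, ys):
    # Pearson correlation: covariance of x and y divided by the
    # product of their standard deviations; ranges from -1 to 1
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data: median earnings (thousands) vs Sanders vote share (%)
earnings = [30, 35, 40, 45, 50]
sanders_share = [40, 44, 47, 52, 55]
r = pearson_r(earnings, sanders_share)  # positive, near 1
```

A positive r matches the earnings and unemployment findings (better economy, more Sanders), while the cost-of-living relationship would show up as a negative r against the Sanders share.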

Finally, I have not tested the model against the results of the Republican primaries/caucuses, or against any prior elections. Below is the data I used for this analysis (the "Model 1" values are the predicted difference between Sanders and Clinton, with a negative value favoring Sanders and a positive value favoring Clinton; the "Model 2" values are the predicted final percent in that state going for Sanders):