Of the factors proposed to predict election outcomes, economic conditions and historical voting patterns are at the top of the list. Using these as the basis for a model to predict Democratic primary outcomes, I sorted through approximately 20 economic factors, presidential voting data since 1992, state-level voting data from 2014, and federal congressional voting data from 2014. I also gathered polling data, primarily from 538.com, which compiles and lists public polls, and from other sources when 538 did not list a particular state.
Using only economic and past voting variables as a basis, I constructed a regression model that explains 75% of the variation (the "adjusted r-square") in the primary and caucus results from the 15 states that have voted so far. This model correctly predicted 13/15 of the elections (missing Oklahoma and Massachusetts). In fact, the final model that I chose uses only 3 economic variables, and no past voting or current polling data. A second model, using the same 3 economic variables plus the state-wide results of the last presidential election (2012), explains 82% of the variation; however, it correctly predicted only 12/15 of the primaries/caucuses. This second model gave results that were closer to the actual state-level results, but it missed the very tight race in Iowa (in addition to Oklahoma and Massachusetts, which the first model also missed). The dependent/outcome variable for this second model was the percent of the vote given to Sanders, rather than the difference between the Clinton and Sanders votes.
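For readers unfamiliar with the term, here is a minimal sketch of how an adjusted r-square is usually derived from a plain r-square. The function name and the input of 0.80 are hypothetical and chosen only for illustration; they are not values from my models.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    # Residual degrees of freedom: observations minus predictors minus intercept.
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus intercept")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Hypothetical R-squared of 0.80 with 15 observations (one per state).
# Adding a fourth predictor lowers the adjusted value if the fit does not improve.
for n_predictors in (3, 4):
    print(n_predictors, round(adjusted_r_squared(0.80, 15, n_predictors), 3))
```

With only 15 states, each extra predictor costs a noticeable amount of adjusted r-square, which is one reason to prefer a small model.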
Model 1: Only economic factors (R-square=0.752, p<0.0001)
Difference between Clinton vs Sanders = Unemployment (Dec 2015) + Median Earnings (2014) + Cost of Living (2015)
Difference = -13.6 + 2805.6 x Unemployment + 0.0021 x Earnings - 0.0023 x COL
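The equation above can be applied to a single state with a few lines of Python. This is only a sketch: the coefficients are copied from the equation, but the function names are my own, and the units of each input must match whatever was used when the model was fit, which the equation itself does not spell out. A positive result favors Clinton and a negative result favors Sanders.

```python
# Coefficients copied from the Model 1 equation.
INTERCEPT = -13.6
B_UNEMPLOYMENT = 2805.6
B_EARNINGS = 0.0021
B_COST_OF_LIVING = -0.0023


def predict_difference(unemployment, median_earnings, cost_of_living):
    """Predicted Clinton-minus-Sanders difference for one state.

    Assumption: inputs use the same units as the data the model was fit on.
    """
    return (INTERCEPT
            + B_UNEMPLOYMENT * unemployment
            + B_EARNINGS * median_earnings
            + B_COST_OF_LIVING * cost_of_living)


def favored_candidate(difference):
    """Map the sign of the predicted difference to a candidate."""
    if difference > 0:
        return "Clinton"
    if difference < 0:
        return "Sanders"
    return "Tie"
```

Calling favored_candidate(predict_difference(...)) with one state's three economic values gives that state's predicted winner.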
The three economic variables that I used were 1) median earnings for 2014 (this data is not yet available for 2015), 2) unemployment for December 2015 (the latest data available), and 3) cost of living variation for 2015. Using the statistics package R, I used these three economic variables as my predictor/independent factors, and the raw difference between the percent of the vote given to Clinton and to Sanders in each state-wide primary/caucus as the outcome/dependent variable. Using only these three economic variables, the regression model correctly predicted that Vermont, New Hampshire and Colorado would go for Bernie Sanders, while incorrectly predicting that Massachusetts would also go for Sanders. The model predicted that all other states would go for Clinton, which was correct except for Oklahoma.
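To make the fitting step concrete, here is a rough sketch of an ordinary least squares fit with three predictors, written in plain Python. It is not the R code I ran; the function names are hypothetical and the data is synthetic, generated from a known equation so the fit can be checked. For real data, a statistics package (such as R's lm) is the better choice, since it also handles badly scaled predictors and reports p-values.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations X'X b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(rows, y, coefs):
    """Share of the variation in y explained by the fitted equation."""
    preds = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data only: y = 2 + 3*x1 - 1*x2 + 0.5*x3, with no noise.
rows = [(1, 2, 3), (2, 1, 0), (0, 1, 4), (3, 3, 1), (1, 0, 2), (4, 2, 2)]
y = [2 + 3 * x1 - x2 + 0.5 * x3 for x1, x2, x3 in rows]
coefs = fit_ols(rows, y)
print([round(c, 6) for c in coefs], round(r_squared(rows, y, coefs), 6))
```

In the actual analysis, each row would hold one state's unemployment, median earnings, and cost of living, and y would hold the Clinton-minus-Sanders difference for that state.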
What is particularly interesting about this model is that it does not use any cultural or political variables, not even historical voting data or the expensive polling that newspapers and parties invest in. I expected that votes in the past several presidential elections would help make the model more accurate. While the model that included the presidential election of 2008 was mathematically more accurate (smaller residuals and a larger r-square), it actually did slightly worse at predicting the outcome of the elections. Similarly, I presumed that the results of the midterm elections might be a good predictor, since midterm elections and primaries share a common social pattern: both typically draw only high-information, highly motivated voters. However, neither the federal congressional elections of 2014 nor the state-level house/senate votes produced a better model. In fact, each of those midterm election variables produced a far worse model.
As for polling, I did not actually factor it into any of the final models. Part of the difficulty was determining which polls to use. Since no single pollster produces data for all of the states, and no two pollsters use exactly the same methods, I did not believe it was reasonable, in the end, to include the polling data. In the table below where I show my data, I include an estimated average of the most recent polls listed at 538 for each state. Another interesting feature of the economic-based regression model is that it has more predictive value than the polls: while this model predicted 13/15 state outcomes, the poll averages predicted only 12/15, missing New Hampshire, Oklahoma, and Massachusetts. If you count the Iowa polls, which were within the margin of error, as an incorrect prediction, the polling averages actually predicted only 11/15. These averages do not take into account that certain individual polls may have had more predictive success than the average.
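The "13/15" and "12/15" tallies come from comparing the predicted winner with the actual winner in each state. A minimal sketch of that comparison is below. The function names are hypothetical, and the margins in the usage note are made-up placeholders rather than real results. Margins are Clinton minus Sanders, so the sign picks the winner.

```python
def winner_from_margin(clinton_minus_sanders):
    """Return the candidate implied by the sign of the margin."""
    if clinton_minus_sanders > 0:
        return "Clinton"
    if clinton_minus_sanders < 0:
        return "Sanders"
    return "Tie"


def count_correct(predicted, actual):
    """Compare predicted and actual margins state by state.

    Both arguments map a state name to a Clinton-minus-Sanders margin.
    Returns the number of correctly called states and a sorted list of misses.
    """
    misses = sorted(
        state for state in predicted
        if winner_from_margin(predicted[state]) != winner_from_margin(actual[state])
    )
    return len(predicted) - len(misses), misses
```

For example, with placeholder margins, count_correct({"State A": 5, "State B": -3, "State C": 2}, {"State A": 10, "State B": -1, "State C": -4}) reports 2 correct calls and lists "State C" as the miss. The same function can score either the model's predictions or the poll averages against the actual results.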
For each of the three economic variables, the correlations show that the better a state's economy was doing, the more likely that state was to vote for Sanders. For example, the higher a state's median earnings for 2014, the more likely it was to vote for Sanders. Similarly, the lower a state's unemployment rate for December 2015, the more likely it was to vote for Sanders. On the other hand, the higher the cost of living in a state, the more likely it was to vote for Clinton.
Finally, I have not tested the model for the results of the Republican primaries/caucuses, or any prior elections. Below is the data I used for this analysis (the "Model 1" values are the predicted difference between Sanders and Clinton, with a negative value favoring Sanders and a positive value favoring Clinton; the "Model 2" values are the predicted final percent in that state going for Sanders):