Friday, March 18, 2016

"Southern Culture" Index--Part 2

Earlier this week I described my first attempt to create a "Southern Culture" Index built from non-economic variables, as part of a model to help predict the primaries. The goal was a single "cultural" factor that could capture regional differences in US voting patterns, kept separate from economic factors. Two weeks ago I posted my early attempt at a regression model fit to the Democratic primaries that had taken place up to that point between Clinton and Sanders (and O'Malley). That model used two economic factors--cost of living and rate of unemployment--along with rates of college attendance, and correctly fit 14 of the 15 primary elections. Neither cultural nor race/ethnicity variables improved the model.

This post about creating a factor index for "Southern Culture" will be more technical than the first, but it also proposes a revised and improved model. In both cases I used exploratory factor analysis (principal axis factor extraction) in R, specifically the psych package. This time I reduced 15 candidate variables to a 4-variable model that has good statistical properties, more so than the factors I mapped in my last post. Here is the current map that represents the factor I am calling the "Southern Culture" Index, with states divided into four groups: darkest red is "most Southern" and lighter shades are "less Southern."

Starting from the original 15 variables, chosen from the literature because they seemed to correlate with Southern states, I wrote an R script that put the variables into every possible combination and tested each resulting model against nine common measures of goodness of fit for exploratory factor analysis (EFA). One heavily cited reference in the literature on these issues is Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. This produced just over 5,000 combinations of 2-6 variables per model. Typically with EFA a researcher wants to reduce a large number of variables to a smaller number of factors. Say, for instance, you have a survey of 100 personality-related questions, and you want to create a small set of personality "factors"--like introversion, agreeableness, etc. In that case, you could run the answers to all of the survey questions through a statistical analysis and let the software find a few different factors based on patterns in how your respondents answered the 100 questions. In this case, I wanted to generate just 1 factor, which is why I took the approach that I did.
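
The enumeration step can be sketched as follows. This is a minimal illustration in Python rather than the R script used for the analysis; the function name `candidate_models` and the five variable names are made up for the example, and the size limits are parameters rather than a claim about the original script.

```python
from itertools import combinations

def candidate_models(variables, min_size=2, max_size=6):
    """Return every subset of `variables` with min_size..max_size members."""
    models = []
    for size in range(min_size, max_size + 1):
        models.extend(combinations(variables, size))
    return models

# Hypothetical variable names, for illustration only.
example_vars = ["evangelicals", "death_rate", "slave_pop_1860",
                "teen_births", "median_income"]

# Keep the example small: all 2- and 3-variable subsets of 5 variables.
models = candidate_models(example_vars, min_size=2, max_size=3)
print(len(models))
print(models[0])
```

Each tuple returned here would then be passed to the factor analysis as one candidate model.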

I filtered the 5,000 models based on specific cutoff criteria in the literature:

  • Moderate correlation of the variables: ideally between 0.3 and 0.8
  • Bartlett's test for sphericity: the p-value should be less than 0.05
  • Kaiser-Meyer-Olkin Index (KMO): ideally above 0.8
  • Tucker-Lewis Index (TLI, also called NNFI): ideally over 0.95
  • Root mean square error of approximation (RMSEA): ideally less than 0.05
  • Root mean square of the residuals (corrected for degrees of freedom, cRMSR): ideally less than 0.05
  • Bayesian Information Criterion (BIC): there is no standard value, but the lower the raw value the better, i.e., if you have negative values, the more negative the better.
  • The communalities (h^2) of the variables in the factor: at least 0.32 (Tabachnick, B. G., & Fidell, L. S. (2001). Using Multivariate Statistics. Boston: Allyn and Bacon.), but the closer to 1 the better.
  • R2, the proportion of the variance explained by the factor: the closer to 1 the better. I chose a cutoff of 0.5 (a factor that explains 50% of the variation).
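
As a rough sketch of how such a filter might look in code (Python here, not the Excel filtering I actually used), the function below applies the fixed cutoffs from the list to one model's fit statistics. The dictionary keys and the function name `passes_cutoffs` are hypothetical. The correlation-range criterion and BIC are left out, since the former needs the full correlation matrix and the latter has no fixed cutoff. The example values are the ones reported for the final four-variable model later in this post; Bartlett's p is only reported as p<0.001, so 0.001 is used as a stand-in upper bound.

```python
def passes_cutoffs(fit):
    """Check one model's fit statistics against the cutoffs listed above."""
    return (
        fit["bartlett_p"] < 0.05                 # Bartlett's test of sphericity
        and fit["kmo"] > 0.8                     # Kaiser-Meyer-Olkin index
        and fit["tli"] > 0.95                    # Tucker-Lewis index
        and fit["rmsea"] < 0.05                  # RMSEA
        and fit["crmsr"] < 0.05                  # df-corrected RMSR
        and min(fit["communalities"]) >= 0.32    # weakest variable's h^2
        and fit["r2"] >= 0.5                     # variance explained
    )

# Values reported for the final model; bartlett_p is a stand-in for "p<0.001".
final_model = {
    "bartlett_p": 0.001,
    "kmo": 0.81,
    "tli": 1.04,
    "rmsea": 0.0,
    "crmsr": 0.03,
    "communalities": [0.71, 0.90, 0.41, 0.65],
    "r2": 0.67,
}

print(passes_cutoffs(final_model))
```

A model that misses any single cutoff is dropped, which is why the candidate pool shrinks so quickly.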

In psych, the results come from the "fa" command, with principal axis factor analysis specified as fm="pa". Since I only wanted 1 factor, the rotation didn't matter, so I specified rotate="none". Neither KMO nor Bartlett's test is produced automatically by the "fa" command, but both are available as separate commands: KMO(your variables) and cortest.bartlett(your variables). In this case, "your variables" means only the specific variables you are testing in a given model, not your entire dataset.

I pulled all of the resulting models into Excel to filter and peruse. After filtering using the above criteria, I was left with 400 models. One of the variables that I felt had to be included in the model was the percent of the population in 1860 who were slaves. Very little else has shaped the history of the South in the last 200 years as fundamentally as an entire way of life built on slavery. This social model did not completely end after the Civil War: similar practices, such as Jim Crow laws and lynchings, continued for generations to subjugate racial minorities, and slavery arguably still has a profound impact on Southern culture today.

Even though I wanted a "cultural", non-economic model, I still included 5 economic and employment variables in this analysis, such as median family income (2014), income growth, unemployment, and levels of employment change in certain industries, such as manufacturing, over the last 15 years. Many of the filtered models contained these economic variables, and several of them ranked very high, but models without any economic variables were also highly ranked. After excluding all models that had economic variables or that did not include the slave population measure, I had fewer than 40 models remaining. These I ranked by BIC (lowest value), TLI (highest value), and KMO (highest value).
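
One way to express that ranking in code is a single sort with a compound key. This Python sketch assumes the three criteria are applied in order (BIC first, then TLI and KMO as tie-breakers), which is only one reading of the ranking. The three models and their values are invented for the example.

```python
# Invented fit statistics for three hypothetical models.
models = [
    {"name": "A", "bic": -7.0, "tli": 1.04, "kmo": 0.81},
    {"name": "B", "bic": -5.2, "tli": 1.01, "kmo": 0.84},
    {"name": "C", "bic": -7.0, "tli": 0.98, "kmo": 0.85},
]

# Lowest BIC first; ties broken by highest TLI, then highest KMO.
ranked = sorted(models, key=lambda m: (m["bic"], -m["tli"], -m["kmo"]))

print([m["name"] for m in ranked])
```

Here A and C tie on BIC, so TLI decides between them, and B comes last because its BIC is the highest.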

After applying the cutoff values listed above, the resulting model is different from the one I posted earlier this week. This model has far better statistical properties, and is composed of the following 4 variables: White Evangelicals, death rates (2005-2014), slave population (1860), and teen birth rates (2014, 15-19 year olds). The factor loadings for all four variables were strong and positive, meaning, in this case, that increases in what was being measured were more strongly associated with "Southern Culture," while decreases were more weakly associated with it. More specifically, research indicates that each of these variables is strongly associated with Southern states: higher rates of White Evangelicals, higher rates of teen births, shorter life spans (increased death rates), and of course the history of slave ownership. Of the 15 variables tested, the principal factor extraction method identified these four as the set whose variability across all 50 states is best explained by a single factor (I did not include Washington DC or Puerto Rico). The analysis in R produced the following results:

Variable              Factor loading   Communality (h^2)
White Evangelicals    0.84             0.71
death rates           0.95             0.90
slave population      0.64             0.41
teen birth rates      0.80             0.65

  • Bartlett's sphericity: p<0.001
  • KMO: 0.81
  • RMSEA: 0
  • cRMSR: 0.03
  • TLI: 1.04
  • BIC: -7.07
  • R2: 0.67

I did not use the following statistics in my filter, but R produced them, so I reproduce them here:

Correlation of scores with factors: ............. 0.97

Multiple R square of scores with factors: ....... 0.93

Minimum correlation of possible factor scores: ... 0.87
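
The loadings, communalities, and R2 reported above are tied together by simple arithmetic, which makes for a quick consistency check. In a one-factor solution, each variable's communality is its squared loading, and the proportion of variance explained is the average communality. The Python sketch below uses only the rounded values from this post, so small differences in the last decimal place are expected.

```python
# Rounded values copied from the results reported above.
loadings = {
    "White Evangelicals": 0.84,
    "death rates": 0.95,
    "slave population": 0.64,
    "teen birth rates": 0.80,
}
reported_h2 = {
    "White Evangelicals": 0.71,
    "death rates": 0.90,
    "slave population": 0.41,
    "teen birth rates": 0.65,
}

# With a single factor, communality = loading squared.
implied_h2 = {name: value ** 2 for name, value in loadings.items()}

# Proportion of variance explained = mean communality across variables.
variance_explained = sum(reported_h2.values()) / len(reported_h2)

for name in loadings:
    print(f"{name}: implied {implied_h2[name]:.2f}, reported {reported_h2[name]:.2f}")
print(f"mean communality: {variance_explained:.4f}")
```

The mean communality lands close to the reported R2 of 0.67, which suggests that value is the share of variance in the four variables accounted for by the single factor.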
