Wednesday, October 23, 2019

Interpreting Election Polling: Pay Attention to Margins of Error

I have to confess--I had absolutely no doubt Clinton would win the 2016 presidential election. Despite months of working with datasets on state-level polling, despite teaching statistics and sociological research methods, and despite closely following several sites that monitor state-level polling, I had no doubts. Zero doubts. In hindsight, however, it is clear I was not wearing my statistician's hat when I made that prediction--I was relying on intuition and emotion.

Probability gave Clinton the odds to win (according to 538, their final estimate was 71-29 in Clinton's favor, one of the most conservative estimates among news and polling sites). However, as any gambler knows, odds are just that--odds, chances, possibilities. If you flip a coin weighted to come up heads 71% of the time, it will still come up tails 29% of the time. A 71% chance of winning is reasonably good odds, but clearly not a sure thing. And in the case of election polling, those odds can fluctuate from week to week based on current events and public mood. In fact, as of the end of September, 538 gave Clinton only 56-44 odds of winning.
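The weighted-coin analogy is easy to see in a quick simulation. This is just an illustrative sketch (the 71% probability is the only number taken from the post; everything else is made up for the demo):

```python
import random

def simulate_elections(p_win=0.71, n_trials=100_000, seed=42):
    """Simulate repeated 'elections' where the favorite wins with
    probability p_win; return the fraction of trials the favorite loses."""
    rng = random.Random(seed)
    losses = sum(rng.random() >= p_win for _ in range(n_trials))
    return losses / n_trials

# With a 71% favorite, the 'upset' still happens in roughly 29% of runs.
print(simulate_elections())
```

In other words, a 29% event is not rare at all--it is about the chance of flipping two heads in a row.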

Ironically, in my classes, as the campaign season heightened in late 2015 and early 2016, I warned students about the perils of ignoring the tedious details of polling results. I specifically showed the students a bar graph from a respected poll comparing Trump vs Clinton--it looked like Clinton was far ahead... until you added in the confidence intervals. In the first graph below, showing results from a Michigan poll, the blue bar is the Clinton estimate and the red is Trump. The pollster is Public Policy Polling (538 gives them a B+ rating). In this poll, Clinton appears to have a clear 5% lead over Trump. However, the margin of error for this poll (once you read the fine print) is 3.2%. That means the Trump estimate, shown at 41%, could actually be up to 3.2% above or below that, a range of 37.8-44.2% (with 95% confidence). Similarly, Clinton, shown at 46%, could be anywhere from 42.8-49.2%. Notice on the graph that the lower bound for Clinton overlaps with Trump's upper bound. This is what we call overlapping confidence intervals. In practice, it means the two candidates are in a statistical tie. So while I was teaching this to my students, I was ignoring my own advice. Clearly a case of "do as I say, not as I do."
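The overlap check is simple enough to write out. Here is a minimal sketch using the numbers from that Michigan PPP poll (Clinton 46, Trump 41, margin of error 3.2):

```python
def poll_interval(estimate, moe):
    """95% confidence interval for a poll estimate, given its margin of error."""
    return (estimate - moe, estimate + moe)

def statistical_tie(est_a, est_b, moe):
    """True if the two candidates' confidence intervals overlap."""
    lo_a, hi_a = poll_interval(est_a, moe)
    lo_b, hi_b = poll_interval(est_b, moe)
    return lo_a <= hi_b and lo_b <= hi_a

print(poll_interval(41, 3.2))        # Trump's range: (37.8, 44.2)
print(poll_interval(46, 3.2))        # Clinton's range: (42.8, 49.2)
print(statistical_tie(46, 41, 3.2))  # True -- the intervals overlap
```

A 5-point lead with a 3.2-point margin of error on each estimate is not enough to separate the candidates.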

Bring this idea to the national level. Many (especially Democrats, and clearly Clinton) believed Michigan, Wisconsin, and Pennsylvania were a so-called 'blue wall' that would definitely vote for a Democrat for president. Clearly this was not the case. Their 46 combined electoral votes would have swung the election in Clinton's favor. Instead, they all went for Trump--though barely: none voted for Trump by a margin of more than 1%, and across all three states combined, fewer than 78,000 voters swung the election to Trump. The graph below represents polling from just these three states (MI, WI, PA), limited to polls conducted in November by pollsters 538 rates B or higher, where the margin of error was reported. I include the raw data, and graphed the results with error bars. Contrary to conspiracy theorists who proposed interference with the election results, claiming the results were far outside polling estimates, the Clinton-Trump margins were actually all within the polls' margins of error. The only exception is one Wisconsin poll by PPP--while its error bars do not overlap, they come very close, which should make any skeptical poll watcher nervous. What does this tell us about 2020? When interpreting election polls, never ignore the confidence intervals (and don't let your gut feelings pull you away from the data).

Tuesday, July 30, 2019

Mapping in R & Jupyter Notebook (Python)

For a decade I have been doing all my mapping in Quantum GIS (QGIS). However, I recently tried to do some spatial regression and could not figure out how to make QGIS do it. This forced me to try other options, and I discovered that there are great packages available for mapping in both Jupyter Notebook (using Python) and R, with the latter having several packages for spatial regression (I'm sure such packages exist in Python as well, but since I've been using R for 15 years, that's where I'm more comfortable).

A long-running project of mine is mapping police violence--specifically, the number of people killed by police in the United States every year. Importing data from Mapping Police Violence and the U.S. Census, I generated maps in both Jupyter and R. This first set of four maps is a county-level depiction of rates of killings--specifically, the number of people killed by police in each county relative to that county's estimated 2015 population. The Mapping Police Violence data cover all of 2013-2018. These maps were generated in Jupyter, which automatically created the nice legend on the side. The first map shows killings for all race/ethnic groups combined, while the next three break the killings out by Black, Hispanic, and White.
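The underlying rate calculation is straightforward: killings over the 2013-2018 window divided by the county's 2015 population, scaled to a per-100,000 rate so that large and small counties are comparable. A minimal sketch--the FIPS codes and counts below are placeholders, not values from the actual dataset:

```python
# Hypothetical county records: (county FIPS, killings 2013-2018, 2015 population)
counties = [
    ("06037", 314, 10_170_292),
    ("48201", 101, 4_538_028),
    ("56017", 0, 4_800),
]

def rate_per_100k(killings, population):
    """Killings per 100,000 residents over the full 2013-2018 window."""
    return 100_000 * killings / population

rates = {fips: round(rate_per_100k(k, pop), 2) for fips, k, pop in counties}
print(rates)
```

Each county's rate then maps onto a color bin in the choropleth.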

This next map is a screenshot of a map created in RStudio, using the packages rgdal and leaflet. What I like about this is that not only does it map the county-level data as before, but you can easily overlay any available basemap underneath it--in this case I imported OpenStreetMap tiles, the leaflet default. This allows the user to zoom in and out, and scroll all around the country.

The next part of this project was to do some spatial regression. Several packages were available, and I chose spgwr, because it specifically included geographically-weighted regression and a way to map the results with the sp package. The first issue was the regression itself. Since just over half of US counties have no recorded killings by police from 2013-2018 (setting aside the shocking fact that just under half of all US counties DO have recorded killings by police in this six-year period), there are a lot of zeroes in these data. This means the usual Gaussian OLS approach will not work, since no transformation can make the data look normally distributed. Several distributions allow modelling count data--for example, Poisson, quasi-Poisson, and negative binomial--with the latter being the best fit for my data. Another reason I used the spgwr package is that it allows geographically-weighted modelling with negative binomial approaches.
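The symptoms that rule out OLS--and that separate Poisson from negative binomial--are easy to see in simulated counts. A Poisson variable has variance equal to its mean; when the variance is well above the mean (overdispersion), negative binomial is the better fit. This sketch uses simulated data, not the actual county counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate county-like counts: mostly zeroes, with a long right tail
# (illustrative parameters, not fitted to the Mapping Police Violence data).
counts = rng.negative_binomial(n=0.5, p=0.5, size=10_000)

zero_share = (counts == 0).mean()   # share of 'counties' with zero events
mean, var = counts.mean(), counts.var()

# Overdispersion: variance well above the mean rules out Poisson,
# and the pile-up at zero rules out any transformed-Gaussian OLS model.
print(round(zero_share, 2), round(mean, 2), round(var, 2))
```

For this parameterization, roughly 70% of the simulated counts are zero and the variance is about twice the mean--exactly the pattern that pushes you toward a negative binomial model.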

One final issue is that there is some statistical question whether p-values have any meaning with geographically-weighted variables, since spatial autocorrelation is a serious issue--i.e., effects are unlikely to be discretely localized inside county borders, but rather are likely to be spread out over many counties, if not across most of a given state and across state lines as well. More discussion can be found from the author of the spgwr package (Roger Bivand, Norwegian School of Economics) here, and from the creators of ArcGIS here--both removed p-values from their software after initially including them for negative binomial approaches to geographically-weighted regression.
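Spatial autocorrelation itself can be quantified; the standard statistic is Moran's I, which runs from roughly -1 (neighbors dissimilar) through 0 (random) to +1 (neighbors similar). Here is a minimal from-scratch sketch on toy data--six 'counties' on a line with simple neighbor adjacency, with made-up values:

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I for spatial autocorrelation.
    `weights` is an n x n symmetric adjacency matrix with a zero diagonal."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = len(x)
    z = x - x.mean()                       # deviations from the mean
    num = n * (w * np.outer(z, z)).sum()   # cross-products of neighbors
    den = w.sum() * (z ** 2).sum()
    return num / den

# Toy adjacency: each 'county' neighbors the next one along a line.
w = np.zeros((6, 6))
for i in range(5):
    w[i, i + 1] = w[i + 1, i] = 1

smooth = [1, 2, 3, 4, 5, 6]        # similar neighbors -> positive I
alternating = [1, 9, 1, 9, 1, 9]   # dissimilar neighbors -> negative I
print(round(morans_i(smooth, w), 2), round(morans_i(alternating, w), 2))
# prints: 0.6 -1.0
```

High positive Moran's I in the residuals is precisely the situation where the usual p-values stop being trustworthy.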

On the other hand, they still include p-values for OLS models. Because of this, as well as just for comparison, I generated spatial regression models for both OLS and negative binomial, along with maps, plus p-values for OLS (recognizing they couldn't be trusted; I just wanted to see what they looked like). Out of about 20 demographic variables I originally included for analysis (many of which have been shown to be predictive in previous research), the best model, where all predictor variables remained statistically significant at p<0.05, included seven county-level variables (generally from the 2015 American Community Survey estimates): divorce rates, education (those 25+ years old lacking a high school diploma), house crowding (occupancy rates with more than one family per housing unit), poverty rates, inequality (GINI), segregation (measured by the White-NonWhite Exposure Index, the likelihood of these two groups regularly interacting with each other in their communities), and percent of Trump voters in 2016. Here are summaries of the OLS (left) and negative binomial (right) models.

While all variables are significant in both, what is interesting is the sign flip for Trump voters and poverty between the models. The OLS model indicates a negative relationship for poverty and Trump voters as predictors of the number of killings by police. In other words, the fewer Trump voters, the more police killings, and vice versa--and similarly for poverty. However, given that this model is unreliable because the data don't meet regression assumptions (normally distributed residuals), the negative binomial results matter much more: they indicate that higher rates of poverty, and a higher percentage of Trump voters, in any given county predict more killings by police.

Finally, I mapped these results. Given that there is no good way to visualize a regression model with seven predictor variables, I mapped each individual predictor variable listed above. Here, I show only the Trump-voter coefficients. First, the OLS map, which, as mentioned above, indicates that fewer Trump voters predicts more killings by police in most of the country--this is indicated by the pink, purple, and blue areas on the map. Only in the orange and yellow areas does a higher rate of Trump voters predict police killings. After that map is the map of p-values for the OLS model. Since the software doesn't generate p-values for the negative binomial model, those can't be shown--I show the OLS p-values here just to demonstrate what the package can do if your data met OLS regression assumptions. If one could trust the OLS model, the coefficients for Trump voters would only be statistically significant in the green and blue areas (or the yellow areas as well, if you did not mind using p < 0.1). Finally, the last map shows the negative binomial results, indicating that everywhere in the country, the relationship between Trump voters in 2016 and killings by police in any given county is positive, though the areas in blue have weaker predictive value, and the areas in yellow have the strongest. Since this model had seven predictor variables, each is stronger or weaker in different areas of the country.