Wednesday, October 29, 2014

Soda, Telomeres, Aging, and Statistics

Anderson Cooper recently highlighted a pre-print analysis of "telomere length" and drinking sweetened carbonated beverages (soda, pop, or coke, in the vernacular) on his Ridiculist. He even includes an interview with neurosurgeon/health reporter Sanjay Gupta. I'm currently teaching a statistics course, so I'm always on the lookout for cutting-edge, peer-reviewed research that may have statistics at the appropriate level for my class, and data that would interest university undergraduates. I downloaded the paper and pulled out the Results table.

The researchers' hypothesis is framed with the theoretical belief that telomere length is related to aging. I won't address that issue in this blog post, except to say that it's a controversial (and, in my personal opinion, poorly defended) proposition. I will limit my comments just to the statistics presented in this research, and specifically, just to Table 3--spoiler alert, I wonder if the editors of AJPH were impaired when they let it into the journal.

In the table, we see Models 1 and 2, which they describe in the notes. Model 1 is just age, gender and energy (which I couldn't find that they define--perhaps it's simple daily caloric intake?), while Model 2 includes a mish-mash of "healthy habits," such as healthy eating, BMI, smoking and alcohol, as well as some extra socioeconomic demographics, such as race, education, poverty level, etc. They compare four drinks--carbonated sugar-sweetened, noncarbonated sugar-sweetened, diet, and 100% fruit juice. They provide the quartiles, the b (regression coefficient; similar to the "m" or "slope" in the dreaded high-school algebra linear equation "y=mx+b"), and the 95% confidence intervals.

My first clue that something is amiss is that there is not a consistent linearity in the quartiles (not to mention that they don't provide Q0, the minimum--we don't necessarily need the Q0, but then why provide Q4, the maximum if you're not going to provide the minimum? it's just a consistency issue that doesn't affect the analysis but makes the table feel unbalanced and sketchy to me). Their regression is self-described as linear. However, the quartiles themselves are decidedly NOT linear--at least not for anything except the noncarbonated sugar-sweetened beverages, and the combined sugar-sweetened beverages index. That is problematic for me.

Let's ignore the non-linearity question for a moment, and just look at their base analysis, which is comparing the median values of the four beverages with their published b coefficients and confidence intervals. Let's ignore Model 1, since it's just demographics. Once you control for the everyday behaviors of the people in their study (they use data from the NHANES), they only have one variable that they claim reaches the level of statistical significance: people who consume sugar-sweetened carbonated beverages apparently have shorter telomeres. While it would be nice to overstate the other Model 2 coefficients, such as claiming that people who drink fruit juice have longer telomeres, and people who drink diet drinks have neither positive nor negative impacts on telomeres, it's statistically inappropriate to make those claims, since neither of those measurements achieved statistical significance (p<0.05), therefore we can completely ignore them. So let's compare the two extremes, based solely on the published medians of telomere lengths: sugar-sweetened soda with 1.13 & 100% fruit juice (diet soda median is equal to fruit juice) with 1.08. On the surface it looks like those two numbers are "different"--clearly they are "different numbers," but that doesn't mean that in "reality" in the general population they are different, since this output is based on a sample, and therefore an estimate. That's what "statistical analysis" does by definition--creates reasonable estimates of the general population based on samples.

Let's assume the sample meets standard scientific guidelines of randomness, etc., so we just have to determine if 1.08 vs 1.13 translates into an "actual" difference when applied to a general population estimate. According to the p-value of the coefficients, it does indeed appear to be different, but we'll get to that later. For now let's stick to the medians. Notice the spread from Q1, Q2 (the median), Q3 to Q4. Since they aren't linear, I'm not quite sure what the Q2 value actually represents. As any of my undergraduate statistics students can tell you, one of the data assumptions that must be met before you can do a regression analysis is data linearity--this non-linearity of quartiles makes me suspicious. Putting that question aside, I wonder whether the differences between the quartiles are "actually" differences, or whether the non-linearity indicates that these are merely natural variation and there actually are no regular increases in telomere lengths from Q0 to Q4. We can't know that based solely on this study, but to me personally, I don't see that the assumption is met given this table. Let's then take a leap and say--what if the actual Q2 measurement is anomalous, and perhaps a better estimate of Q2 would be to take the average of Q1 and Q3? In that case, the "estimated" Q2 for sugar-sweetened carbonated beverages is not 1.13 at all, but (1.04+1.09)/2 = 1.065, which is shorter than the fruit juice telomere length of 1.08. This is, in fact, what the researchers predict with their regression equation--shorter telomeres with sugar-sweetened carbonated beverages. But this isn't necessarily self-evident, since the actual median shows that soda/pop has longer telomeres than fruit juice. Based on the presented median lengths, one could interpret that drinking sugar-sweetened sodas is actually better for you than drinking fruit juice!
Granted, I'm not going to make that claim--although there is significant evidence that drinking fruit juice is not much healthier, if at all, than drinking soda.
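
The arithmetic behind that "what if" can be checked in a few lines. This is only a sketch using the three soda values and the fruit juice median quoted above; the variable names are mine, and nothing here comes from the paper's underlying data.

```python
# Telomere-length values for sugar-sweetened carbonated beverages as quoted
# in this post (Q1, the median Q2, and Q3), plus the fruit juice median.
soda_q1, soda_q2, soda_q3 = 1.04, 1.13, 1.09
juice_median = 1.08

# Do the three quoted soda values increase steadily from Q1 to Q3?
is_monotonic = soda_q1 < soda_q2 < soda_q3

# The "what if" estimate: replace Q2 with the midpoint of Q1 and Q3.
estimated_q2 = (soda_q1 + soda_q3) / 2

print(f"Q1 < Q2 < Q3 holds: {is_monotonic}")
print(f"reported soda median {soda_q2:.3f} vs juice median {juice_median:.3f}")
print(f"midpoint estimate    {estimated_q2:.3f} vs juice median {juice_median:.3f}")
```

The reported median sits above the juice median, while the midpoint estimate sits below it, which is the reversal described above.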

Finally, let's look at the b coefficients. I always make a point of telling my statistics students to ignore any published data that doesn't include confidence intervals, since hiding confidence intervals (by not publishing them) is a GREAT way to completely misrepresent a data analysis to your benefit. One can't interpret any parametric analysis (like a regression coefficient) without the confidence intervals. In this case, it seems that in Model 2, sugar-sweetened carbonated drinks shorten telomeres (b=-0.010) compared to fruit juice, which seems to actually lengthen telomeres (b=+0.016). BUT! Remember that these are statistical estimates, and not "real" numbers--the real number for soda-related telomere shortening is likely somewhere between -0.020 and -0.001, and we can't know exactly where; working at 95% confidence means accepting a 5% risk of a Type 1 error (i.e., claiming this result is real, when it isn't). So in reality, the authors aren't claiming that the "actual" telomere shortening is exactly -0.010, but almost certainly somewhere between -0.020 and -0.001.

Similarly, the alleged telomere-lengthening effect of fruit juice isn't "exactly" +0.016, but likely somewhere between 0.000 and +0.033. So for my Introduction to Statistics students, I have them look at the maximum confidence value of the lowest measurement, and the lowest confidence value of the highest measurement before making an assessment. This means that the lowest likely value of telomere lengthening of fruit juice is actually 0.000 (i.e., no change at all), vs. telomere shortening of soda of -0.001. "Statistically," if looking at the p-values, that appears to be a measurable difference, but personally, I would not interpret that as an important clinical difference at all (I won't get into "effect sizes" in this particular blog post, but since they don't post the standard deviations, we can't calculate those; I would guess they come out to be completely insignificant).
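
The endpoint comparison I ask my students to do can be written out as a small sketch. The coefficients and interval bounds are the Model 2 values quoted above; the `Estimate` type and the two helper functions are names I made up for illustration, not anything from the paper.

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    """A regression coefficient with its 95% confidence interval."""
    b: float
    lower: float
    upper: float


# Model 2 values as quoted in this post.
soda = Estimate(b=-0.010, lower=-0.020, upper=-0.001)
juice = Estimate(b=0.016, lower=0.000, upper=0.033)


def excludes_zero(e: Estimate) -> bool:
    """True when the whole interval lies strictly on one side of zero."""
    return e.lower > 0 or e.upper < 0


def gap_between(low: Estimate, high: Estimate) -> float:
    """Distance from the top of the lower interval to the bottom of the higher one."""
    return high.lower - low.upper


print("soda interval excludes zero: ", excludes_zero(soda))
print("juice interval excludes zero:", excludes_zero(juice))
print(f"gap between intervals: {gap_between(soda, juice):.3f}")
```

The gap between the closest endpoints is one unit in the third decimal place, which is the "measurable but not important" difference discussed above.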

What makes this difference perhaps even less relevant is taking the decimal points and rounding into consideration. If the soda telomere maximum confidence level is b=-0.001, potentially that could be -0.0005. Similarly, the minimum fruit juice value of b=0.000 could be -0.0005; in other words, they could be basically the same value, depending on how they rounded! On the one hand, I would have liked to have seen that one extra decimal place to rule out that possibility. On the other hand, one can simply eliminate a decimal place, and then the maximum soda level becomes b=0.00 and minimum fruit juice level would also be b=0.00. So, in actuality, I don't really need to see that extra decimal place at all--I don't think these results support their conclusions, going by this table alone.
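
To make the rounding point concrete, here is a sketch of the range of unrounded values that could sit behind each published endpoint. It assumes ordinary rounding to the number of decimals shown (the paper does not say how the values were rounded), and `rounding_range` is a helper name of my own.

```python
from decimal import Decimal


def rounding_range(reported: str) -> tuple[Decimal, Decimal]:
    """Lowest and highest unrounded values that would round to `reported`.

    Assumes ordinary rounding to the number of decimals shown.
    """
    value = Decimal(reported)
    # Half of one unit in the last reported decimal place, e.g. 0.0005.
    half_unit = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_unit, value + half_unit


soda_upper = rounding_range("-0.001")  # top of the soda interval
juice_lower = rounding_range("0.000")  # bottom of the fruit juice interval

print("soda upper bound could be anywhere in ", soda_upper)
print("juice lower bound could be anywhere in", juice_lower)

# Dropping one decimal place makes the two published endpoints identical.
two_places = Decimal("0.01")
soda_2dp = Decimal("-0.001").quantize(two_places)
juice_2dp = Decimal("0.000").quantize(two_places)
print("equal at two decimals:", soda_2dp == juice_2dp)
```

The two ranges meet at -0.0005, so the published table cannot rule out that the two endpoints are essentially the same number.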

Sunday, October 26, 2014

Just over a week left before the 2014 election, with major shake-ups likely for U.S. Senate and Governor races. Why are 6 seats important? Because that's all the GOP needs to take control of the Senate by 1 vote.

Comparing 6 different prediction sites, all agree that Republicans will pick up at least 6 seats, with the most likely being Alaska, Arkansas, Iowa, Kentucky, Louisiana and South Dakota. The only exception is Real Clear Politics, which is making the most conservative estimates (scientifically conservative, not politically conservative) and declaring many of the polls so close that they are still within the margin of error, and so keeping those races as "Toss-ups" (although they have published a "no toss-up map" that agrees with the other 5 prediction sites that the GOP will pick up 6 seats). Of the other differences between the sites, one is Kansas, where Politico and the Washington Post are calling likely Republican, whereas 538 and Princeton are calling "leans Independent". Both of the latter, along with the Washington Post, say Georgia is slightly leaning Democrat, and both Colorado and Louisiana are leaning Republican, each of which the other 3 sites still call toss-ups. Sabato at the UVa Center for Politics is still counting Kansas as a toss-up.

However, that number "6" is contingent on a couple of things. First, it presumes that Republicans would be certain of keeping all of the seats they currently control. But three of these seats are actually far closer than expected--Georgia, Kansas and Kentucky. In Kansas, Governor Brownback has made the Republican brand so toxic that the incumbent Republican senator may get kicked out of office, replaced by an independent, Orman, who has not said whom he would support for Senate leader, except that he would support neither Reid nor McConnell. So while McConnell's seat may end up being safe from challenger Lundergan-Grimes, unless Republicans vote in a new Senate leader other than McConnell, Orman might refuse to caucus with them, potentially giving Democrats a hail-Mary if the race is closer than expected. The second problem for Republicans is Georgia, where they lost their safe incumbent Saxby Chambliss, and now that race may turn into a runoff which 3 of the prediction sites are calling a toss-up, and the other 3 are saying leans slightly Democrat.

The second "6 seats to victory" contingency is that the close races will cut evenly between Democrats and Republicans. If the 6 states listed above all go for the Republicans, which all of the sites agree is likely, and they keep Georgia and Kansas, then they are safe. However, if either one of these states goes for the Democrats, then Republicans will need Colorado and/or Louisiana for the definitive win. Three prediction sites (Politico, Sabato and RCP) are not calling either of these races, leaving them as toss-ups as of Oct 26, while the other three sites (538, Princeton and WaPo) are calling them likely Republican (in the "no toss-ups" model, RCP calls both of these for Republicans). If Republicans bring in BOTH Colorado AND Louisiana, then they don't need Georgia or Kansas for the tie-breaker. But if three of these 4 states (GA, KS, CO, LA) go for Democrats, then the situation gets complicated.

First, if the Senate decision comes down to one seat, then Georgia or Louisiana may have to break the tie, and neither of those races may be determined by this election because both states use runoffs. For example, in Georgia, if none of the candidates breaks 50%, it requires a runoff which would be held in mid-January. Louisiana's potential runoff would be in early December. Second, if 3 of these borderline states go to Democrats, then the Senate would be tied, meaning Vice President Biden would be needed to break any tie votes.

So what all of this means is that if the Senate goes the way that all of these 6 sites predict, based on polling as of October 26, then control of the Senate in 2015-2016 will likely be Republican. The spoilers are Georgia, Kansas, Colorado and Louisiana. Republicans only need 2 of these for the win. But if Democrats get 3, Biden will be needed for a tie-breaker, and if they get all 4, then Democrats remain in control of the Senate.
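
For anyone who wants to count the scenarios, the summary rule above can be enumerated over all 16 win/loss combinations of the four spoiler states. This is a sketch of the rule as stated in this post, not a seat-by-seat model; it treats each state as simply "Republican" or "not Republican" (which glosses over how an independent win in Kansas would actually play out), and the function name is my own.

```python
from collections import Counter
from itertools import product

SPOILER_STATES = ("GA", "KS", "CO", "LA")


def senate_outcome(republican_wins: int) -> str:
    """Apply this post's summary rule to the number of spoiler states the GOP wins."""
    if republican_wins >= 2:
        return "Republican control"
    if republican_wins == 1:
        return "tie, Vice President breaks it"
    return "Democratic control"


# Enumerate every combination of Republican win (True) or loss (False).
tally = Counter()
for results in product((True, False), repeat=len(SPOILER_STATES)):
    tally[senate_outcome(sum(results))] += 1

for outcome, count in tally.items():
    print(f"{outcome}: {count} of 16 combinations")
```

Counting combinations says nothing about how likely each one is, since the four races are not coin flips; it only shows how many paths lead to each result under the rule.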

Friday, October 17, 2014

Health and Human Services Spending since 2003

The recent Texas Ebola outbreak has caused a political frenzy with mutual-blame casting by Republicans and Democrats. I looked up the official outlays as published by the U.S. Treasury. While we have had an increasing budget for several decades in terms of raw dollars, once adjusted for inflation and population growth, those budgets look increasingly anemic. Below are four separate budgets from the Department of Health and Human Services, which is the agency most tied into U.S. health, as well as specifically preparedness for infectious disease control. The first two graphs are the total relevant budgets for the National Institutes of Health and the Centers for Disease Control from 2003-2014, adjusted for inflation (in 2014 $) and population growth. Both budgets have dropped during that period. The CDC budget has dropped from $20.07 per person in 2003 to $19.95 today. The NIH budget has dropped from $101.33 per person in 2003 to $97.57 per person today.
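
For readers who want to reproduce this kind of adjustment, here is a minimal sketch of the calculation, assuming the usual approach of scaling nominal outlays by the ratio of the base-year CPI to that year's CPI and then dividing by population. The function names are mine, the demonstration inputs are round placeholders and NOT real budget data, and only the per-person figures in the last two lines come from this post.

```python
def real_per_capita(outlays_millions: float, cpi_year: float,
                    cpi_base: float, population_millions: float) -> float:
    """Outlays per person in base-year dollars.

    Millions of dollars divided by millions of people gives dollars per person.
    """
    return outlays_millions * (cpi_base / cpi_year) / population_millions


def percent_change(start: float, end: float) -> float:
    """Change from start to end, as a percentage of start."""
    return (end - start) / start * 100


# Placeholder inputs only, chosen to make the arithmetic easy to follow.
demo = real_per_capita(outlays_millions=1000.0, cpi_year=100.0,
                       cpi_base=200.0, population_millions=10.0)
print(f"demo: ${demo:.2f} per person")

# Per-person figures quoted in this post (2003 vs. 2014, in 2014 dollars).
cdc_change = percent_change(20.07, 19.95)
nih_change = percent_change(101.33, 97.57)
print(f"CDC: {cdc_change:.1f}%   NIH: {nih_change:.1f}%")
```

In percentage terms, the CDC figure is nearly flat over the period while the NIH figure falls by a few percent.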

The second set of charts organizes the data slightly differently. Within the Health and Human Services budgets, there are three separate agencies that engage in "health care research and training": NIH, CDC and HRSA. Two of these, the HRSA and CDC, also have budgets for "health care services." I have aggregated these two categories of outlays, and depicted both the inflation-adjusted spending per capita, and spending as a percent of real GDP. The health care research and training budget (inflation adjusted per capita) has doubled since 1979, but when compared to real GDP, spending on health care research and training has actually gone down, from 0.76% to 0.65%. Per capita, this budget has declined since 2003. Similarly, the HRSA and CDC budgets for direct health care have gone up (inflation adjusted per capita), not quite doubling from $24.35 to $40.21, but have decreased when compared to our real GDP, from 0.37% to 0.25%. Per capita, this budget has remained approximately the same since 2003.

Year | CDC Outlays (Inflation-Adjusted/Capita, 2014$) | NIH Outlays (Inflation-Adjusted/Capita, 2014$) | CDC Total Outlays (in millions) | NIH Total Outlays (in millions) | Health Care Research and Training (CDC+HRSA+NIH; Inflation Adjusted Per Capita) | Health Care Research and Training (as % of Real GDP) | Health Care Services (CDC+HRSA; Inflation Adjusted Per Capita) | Health Care Services (as % of Real GDP) | Health Resources and Services Administration-Health care services | Health Resources and Services Administration-Health research and training | Centers for Disease Control and Prevention-Health care services | Centers for Disease Control and Prevention-Health research and training | National Institutes of Health-Health research and training | CPI | Population (in millions)

Saturday, October 11, 2014

District 29, 2014: Delph vs. Ford

Last February, the current state senator from District 29, Mike Delph, had a Twitter meltdown about the same-sex marriage constitutional amendment shenanigans at the Indiana state house. That meltdown arguably (along with some other issues) caused the Senate leader to impose several sanctions against Delph, which included moving him to the back of the chamber, where he was forced to sit with the Democrats, removing him from leadership roles, and taking away his press secretary. Shortly thereafter, a Democratic competitor arose from his district, JD Ford.

The district has traditionally been solidly Republican, but with the 2010 redistricting, and steady urbanization of that region, I wondered if the situation had changed. Using TIGER geographic redistricting shape files, Indiana election results for 2012, and ACS population data, I created some estimates of who voted for whom at the precinct level in 2 specific races--Attorney General, and School Superintendent. As background for these races, AG Zoeller was running for his 2nd term, having been active in far-right political cases (like submitting anti-gay-marriage amicus briefs to various cases around the country), and School Superintendent Tony Bennett had been a charter-school activist while in office. Zoeller won his 2012 run, but Bennett lost to Glenda Ritz, despite the GOP sweeping the rest of the offices in the state, including electing a supermajority in both the Senate and House chambers.

In what is now the newly redistricted D29 for senate (some of these counts are estimates based on precinct line changes), Zoeller (R) won approximately 57% of the vote, while Ritz (D) won approximately 51% of the vote. There are several factors working in Delph's favor. First, he has an incumbent advantage. Second, he has the party advantage for the "6 year presidential itch;" i.e., in the 6th year of an incumbent president's term, the "other" party (in this case, Republicans) tends to have an advantage. Third, he has a turnout advantage--midterms tend to favor the GOP. Fourth, he has the numbers in this district--Zoeller beat his competitor by a wide margin.

However, several factors may also work against him. The hit that he took from the Senate leader, and Delph's very public meltdown about the same-sex marriage issue, which was accompanied by several very insulting remarks about what makes a "true" Christian (i.e., "any professed Christian minister that teaches any sin is acceptable is NOT acting in true love but in eternal condemnation #truth"), and arguably "crazy" comments on several other issues, may cause voters in his district to question his ability to represent them. While midterms already have miserably low turnout (usually about 1/3 of registered voters come out to midterms in the Marion County area), some typically stalwart GOP voters in his district may be incentivized to stay home rather than be forced to choose between Delph and the gay Democrat. And as demonstrated in the 2012 elections, D29 voters are more than willing to vote for the Democrat if they like the candidate, rather than simply doing a party-line vote.

Any of these factors could shift the vote in Ford's direction. Additionally, the demographic numbers that we have for that district are largely from the 2008-2012 ACS--but that entire area is dramatically and rapidly changing, largely in the direction of younger voters and racial minorities, so the influx of new residents, if they register and vote, will mostly be potential Ford voters, especially when given the choice between Delph and Ford.

In terms of raw numbers, there are about 100k residents in D29 over 18 (2012 ACS 5-year estimate), with an average age of 36, 52% female, and 69% White. But these estimates are 2-6 years old--a lot has changed in 2 years in that area. According to my redistricting estimates above, there were about 60k votes cast in the 2012 election, so perhaps half of that will show up for the midterms. Much of the new housing in the area, because of the recession and housing price bust, has been apartments, which will bring in voters more inclined towards Democrats (younger, poorer, single females).
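
A quick back-of-the-envelope sketch with the figures above. The turnout projection is just the "perhaps half" guess from this post. The split-ticket number is an extra inference of mine: it assumes every 2012 voter cast a ballot in both races and that the 57% and 51% shares are of the same total, neither of which I have verified, and the variable names are purely illustrative.

```python
# Estimated 2012 votes cast in the redistricted D29 (from this post).
votes_2012 = 60_000

# "Perhaps half of that" shows up for the midterm.
midterm_share = 0.5
projected_midterm_votes = int(votes_2012 * midterm_share)

# Approximate 2012 vote shares in D29, in whole percentage points.
zoeller_r_pct = 57  # Attorney General, Republican
ritz_d_pct = 51     # School Superintendent, Democrat

# ASSUMPTION: same voters in both races. If 57% voted R in one race and 51%
# voted D in the other, at least the overlap beyond 100% must have split.
min_split_pct = max(0, zoeller_r_pct + ritz_d_pct - 100)
min_split_voters = votes_2012 * min_split_pct // 100

print(f"projected midterm votes: {projected_midterm_votes:,}")
print(f"minimum split-ticket share: {min_split_pct}% (about {min_split_voters:,} voters)")
```

Under those assumptions, the overlap is a lower bound on how many D29 voters were willing to cross party lines in 2012.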

Maps of the redistricting, by % of vote for AG and School Superintendent, are below.