Sunday, April 21, 2013

Comparing Between Regression Models: Akaike Information Criterion (AIC)

In preparing for my final week of sociological statistics class, the textbook takes us to "nested regression models," which is simply a way of comparing multiple regression models that have one or more independent variables removed. In the example I'll be using in my class, we'll be looking at a dataset of Florida crime by county as the dependent variable, with urbanization, education, and average income as the independent variables. To evaluate how well the independent variables predict crime rates, we can generate any of several regression equations, using all three variables, two of them, or just one.
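
To make the set of candidate equations concrete, here is a minimal Python sketch that lists every combination of the three predictors. The variable names and the helper function are my own labels for illustration, not columns from the actual dataset.

```python
from itertools import combinations

# Hypothetical labels for the three independent variables described above.
predictors = ["urbanization", "education", "income"]


def candidate_models(names):
    """Return every non-empty subset of predictors, largest subsets first."""
    models = []
    for size in range(len(names), 0, -1):
        models.extend(combinations(names, size))
    return models


# Print one regression equation per candidate model.
for model in candidate_models(predictors):
    print("crime ~ " + " + ".join(model))
```

With three predictors that gives one model with all three, three models with two, and three models with one.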

Distinguishing the "best" equation is somewhat subjective, but statisticians have developed some criteria to evaluate whether one model is likely better than another. For my class we are using SPSS as our statistical software, since that's the licensed software on our campus (IUPUI). It's expensive, and even with our campus license, you have to "rent" it every semester you want to use it. I personally don't use it for my research: while it's a reasonable GUI option, there are many advanced functions that it just can't do, and its flexibility to alter parameters is limited. I use the free, open-source software R, which has a steep learning curve because it is command-line, but is far more powerful and flexible. I don't have my students learn it because of that learning curve, and because most of them in their future careers will just want simple software that they can point-and-click to get reasonable results. They likely won't want to do any programming to get their results.

In my search for a way to allow my students to compare "nested regression models" using SPSS, I spent a great deal of time Googling ways to get SPSS to generate AIC, and it just won't, except for logistic regression, or using the advanced Generalized Linear Models feature--both of which are great options, but the former is a specialized technique for probability outcomes, while the latter is not necessarily a good option for an introductory sociological stats class.

I found 5 ways to get an AIC for a model run in SPSS, and I will teach the students 2 of them--one formula, and manually forcing SPSS to produce the regression AIC using syntax. I reproduce all 5 methods below, since there is no simple "checkbox" for regular linear regression in SPSS. Note that the linear regression method and the GZM (generalized linear model) method produce different AIC numbers. The absolute AIC number is not meaningful; only the difference between the AICs of different models fit to the same data matters, so choose the model that produces the smallest AIC.

In the equations below, n = sample size, k = number of parameters, SSE = sum of squared errors (listed as the residual sum of squares in SPSS output), and Ln = the natural logarithm.

  1. AIC formula #1 (same result as SPSS linear regression syntax)
    n*Ln(SSE/n) + 2*(k+1)

  2. AIC formula #2 (same result as SPSS GZM)
    2k + n [Ln(2(pi) SSE/n ) + 1]

  3. AIC formula #3 (same result as SPSS GZM)
    Requires you to obtain the log-likelihood, which in SPSS, you can only get using GZM (generalized linear model, see option #5 below)
    2*k - 2*loglikelihood

  4. SPSS method #1: Use linear regression syntax
    REGRESSION
    /DESCRIPTIVES MEAN STDDEV CORR SIG N
    /MISSING LISTWISE
    /STATISTICS COEFF OUTS R ANOVA Selection
    /CRITERIA=PIN(.05) POUT(.10)
    /NOORIGIN
    /DEPENDENT DependentVariable
    /METHOD=ENTER IndependentVariables separated by a space.

  5. SPSS method #2: Use GZM
    Click the following:
    Analyze --> Generalized Linear Models --> Generalized Linear Models
    Under the "Response" tab, put the outcome variable into the "Dependent Variable" box.
    Under the "Predictors" tab, put all continuous independent variables into the "Covariates" box--if you have any categorical predictors, those go into the "Factors" box.
    Under the "Model" tab, put all listed variables (independents) into the "Model" box and make sure that as "Type" they are listed as "Main effects"
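
As a rough sketch outside SPSS, the three formulas above can be written as small Python functions. The function names, the sample size, and the SSE values are made up for illustration (they are not results from the Florida dataset), and the log-likelihood helper assumes normally distributed errors with the error variance estimated as SSE/n.

```python
import math


def aic_formula_1(n, sse, k):
    """AIC formula #1: n*Ln(SSE/n) + 2*(k+1)."""
    return n * math.log(sse / n) + 2 * (k + 1)


def aic_formula_2(n, sse, k):
    """AIC formula #2: 2k + n*[Ln(2*pi*SSE/n) + 1]."""
    return 2 * k + n * (math.log(2 * math.pi * sse / n) + 1)


def gaussian_loglik(n, sse):
    """Maximized normal log-likelihood (assumption: variance = SSE/n)."""
    return -0.5 * n * (math.log(2 * math.pi * sse / n) + 1)


def aic_formula_3(k, loglik):
    """AIC formula #3: 2*k - 2*loglikelihood."""
    return 2 * k - 2 * loglik


# Hypothetical numbers, for illustration only.
n = 67
sse_full, k_full = 1200.0, 3        # model with three predictors
sse_reduced, k_reduced = 1300.0, 2  # model with one predictor removed

# Difference in AIC between the two models under each formula.
delta_1 = aic_formula_1(n, sse_full, k_full) - aic_formula_1(n, sse_reduced, k_reduced)
delta_2 = aic_formula_2(n, sse_full, k_full) - aic_formula_2(n, sse_reduced, k_reduced)

print("Formula #1 AICs:", aic_formula_1(n, sse_full, k_full), aic_formula_1(n, sse_reduced, k_reduced))
print("Formula #2 AICs:", aic_formula_2(n, sse_full, k_full), aic_formula_2(n, sse_reduced, k_reduced))
print("Differences (full - reduced):", delta_1, delta_2)
```

Formula #1 and formula #2 differ only by a term that depends on n, so the difference between two models' AICs comes out the same under either formula, provided k is counted the same way for both models. That is why the two SPSS methods can report different absolute numbers and still lead to the same choice of model.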

8 comments:

  1. Thank you so much! You saved my soul!

  2. How can I use the syntax for a Cox regression (SPSS 17)? What should I adjust in your code?

    Thanks!

  3. I assume there are some mistakes in #5 above--I assume that the dependent variable should be entered as the Response variable?

    And I'm a little confused about what goes in the covariates box--is this just continuous independent variables?

    Do ALL the variables go in the model box, or all the independent variables?

    Finally, to calculate the AIC for different models, do I alter which independent variables I put in both the covariates box and the model box, or just one of those? And then run it repeatedly with the different sets of independent variables?

    Replies
    1. I believe you are correct regarding the Response box and dependent variables--I have corrected the error and clarified some instructions, thanks! As for your final question about the covariates box, only your predictors (independents) go there. Let's say you have 3 possible predictors. In your first model, you propose that A+B-->Y, and you run the GZM for that model, and it produces AIC-M1. You want to compare that to a second model, A+C-->Y, so you re-run the analysis, removing B and replacing it with C (keeping A) in the "Covariates" box AND the "Model" box. That produces AIC-M2. For every new model you want to test, you remove the old model's variables from all options and replace them with the new model's. At that point you can compare AIC-M1 & AIC-M2 and make a more informed decision about your models.

  4. Thank you for this very useful post! However, if I am testing several interaction terms (moderators), do I need to put the interaction terms in the 'predictors'? Thank you.

  5. Hello,
    I was wondering if you have any information on what the AIC criterion should be. My output has resulted in -557.696 for the AIC. I was wondering if you could let me know what this means?

    Thank you,
    Caitlin

    Replies
    1. Caitlin--AIC ("Akaike information criterion", so there's no need to specify "AIC criterion") gives you a way to compare models to each other. It doesn't mean anything by itself, so "-557" can't tell you anything. But if you add or remove variables and rerun the model, you can use the new AIC to see which model is more efficient. Models with lower values (i.e., -560 < -557, or 99 < 105) are considered more efficient, giving you an objective reason to choose the model with the lower value. For example, any time you add a variable, R^2 will stay the same or go up, but you may have gained that at the cost of efficiency. Science likes simplicity when possible. AIC gives you a way to balance model strength with simplicity.
