Sunday, April 21, 2013

Comparing Between Regression Models: Akaike Information Criterion (AIC)

In preparing for my final week of sociological statistics class, the textbook takes us to "nested regression models," which is simply a way of comparing multiple regression models that have one or more independent variables removed. In the example I'll be using in my class, we'll be looking at a dataset of Florida crime by county, with crime as the dependent variable and urbanization, education, and average income as the independent variables. To evaluate how well the independent variables predict crime rates, we can generate any of several regression equations, using all three variables, two of them, or just one.

Distinguishing the "best" equation is somewhat subjective, but statisticians have developed some criteria to evaluate whether one model is likely better than another. For my class we are using SPSS as our statistical software, since that's the licensed software on our campus (IUPUI). It's expensive, and even with our campus license, you have to "rent" it every semester you want to use it. I personally don't use it for my research: while it's a reasonable GUI option, there are many advanced functions that it just can't do, and its flexibility to alter parameters is limited. I use the free, open-source software R, which has a steep learning curve, since it is command-line, but is far more powerful and flexible. I don't have my students learn it because of that learning curve, and because most of them in their future careers will just want simple point-and-click software that gives reasonable results. They likely won't want to do any programming to get their results.

In my search for a way to let my students compare "nested regression models" using SPSS, I spent a great deal of time Googling ways to get SPSS to generate AIC, and it just won't, except for logistic regression or the advanced Generalized Linear Models feature. Both are great options, but the former is a specialized technique for probability outcomes, while the latter is not necessarily a good option for an introductory sociological stats class.

I found 5 ways to obtain AIC for a model in SPSS, and I will teach the students 2 of them: one formula, and manually forcing SPSS to produce the regression AIC using syntax. I reproduce all 5 methods below, since there is no simple "checkbox" for AIC in regular linear regression in SPSS. Note that the linear regression method and the GZM (generalized linear model) method produce different AIC values. The absolute AIC value is not meaningful on its own; only the difference between the AICs of different models matters, so compare values computed by the same method and choose the model with the smallest AIC.

In the equations below, n = sample size, k = number of parameters, and SSE = sum of squared errors (listed as the residual sum of squares in SPSS output).

  1. AIC formula #1 (same result as SPSS linear regression syntax)
    n*Ln(SSE/n) + 2*(k+1)

  2. AIC formula #2 (same result as SPSS GZM)
    2*k + n*[Ln(2*pi*SSE/n) + 1]

  3. AIC formula #3 (same result as SPSS GZM)
    Requires the log-likelihood, which in SPSS you can only get using GZM (generalized linear model; see option #5 below)
    2*k - 2*loglikelihood

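  4. SPSS method #1: Use linear regression syntax
    Run the following from a syntax window; the SELECTION keyword on /STATISTICS is what requests the selection criteria, including AIC.
    REGRESSION
    /STATISTICS COEFF OUTS R ANOVA SELECTION
    /CRITERIA=PIN(.05) POUT(.10)
    /DEPENDENT DependentVariable
    /METHOD=ENTER IndependentVariables separated by a space.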

  5. SPSS method #2: Use GZM
    Click the following:
    Analyze --> Generalized Linear Models --> Generalized Linear Models
    Under the "Response" tab, put the outcome variable into the "Dependent Variable" box.
    Under the "Predictors" tab, put all continuous independent variables into the "Covariates" box--if you have any categorical predictors, those go into the "Factors" box.
    Under the "Model" tab, put all listed variables (independents) into the "Model" box and make sure that as "Type" they are listed as "Main effects"
    Click "OK" to run the model; the AIC is reported in the "Goodness of Fit" table of the output.
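
For anyone who wants to check formulas #1 through #3 by hand outside of SPSS, here is a minimal Python sketch. It is not part of the SPSS workflow above. The function names are made up for illustration, the SSE values are invented rather than taken from the Florida data, and k is plugged in exactly as the formulas are written (how SPSS counts k in each procedure is not verified here). The log-likelihood helper assumes normally distributed errors with the error variance estimated as SSE/n.

```python
import math


def aic_formula_1(sse, n, k):
    """Formula #1: n*Ln(SSE/n) + 2*(k+1)."""
    return n * math.log(sse / n) + 2 * (k + 1)


def aic_formula_2(sse, n, k):
    """Formula #2: 2*k + n*[Ln(2*pi*SSE/n) + 1]."""
    return 2 * k + n * (math.log(2 * math.pi * sse / n) + 1)


def gaussian_loglik(sse, n):
    """Maximized normal log-likelihood of a least-squares fit.

    Assumption: errors are normal and the variance estimate is SSE/n.
    """
    return -0.5 * n * (math.log(2 * math.pi * sse / n) + 1)


def aic_formula_3(loglik, k):
    """Formula #3: 2*k - 2*loglikelihood."""
    return 2 * k - 2 * loglik


# Hypothetical inputs: n = 67 matches the number of Florida counties,
# but the SSE values below are invented for illustration only.
n = 67
model_a = {"sse": 5200.0, "k": 3}  # e.g. three predictors
model_b = {"sse": 5600.0, "k": 2}  # e.g. one predictor removed

# Difference in AIC between the two models under each formula.
delta_1 = aic_formula_1(model_a["sse"], n, model_a["k"]) - aic_formula_1(
    model_b["sse"], n, model_b["k"]
)
delta_2 = aic_formula_2(model_a["sse"], n, model_a["k"]) - aic_formula_2(
    model_b["sse"], n, model_b["k"]
)

print("Formula #1, model A:", aic_formula_1(model_a["sse"], n, model_a["k"]))
print("Formula #2, model A:", aic_formula_2(model_a["sse"], n, model_a["k"]))
print("Formula #3, model A:", aic_formula_3(gaussian_loglik(model_a["sse"], n), model_a["k"]))
print("AIC(A) - AIC(B), formula #1:", delta_1)
print("AIC(A) - AIC(B), formula #2:", delta_2)
```

With the same n and k, formulas #2 and #3 give the same number, and the gap between two models is the same under formula #1 and formula #2. That is why the absolute AIC differs between methods while the model you would choose does not. A negative difference here means model A has the smaller AIC.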


  1. How can I use the syntax for a Cox regression (SPSS 17)? What should I adjust in your code?


  2. I assume there are some mistakes in #5 above--I assume that the dependent variable should be entered as the Response variable?

    And I'm a little confused about what goes in the covariates box--is this just continuous independent variables?

    Do ALL the variables go in the model box, or all the INdependent variables?

    Finally, to calculate the AIC for different models, do I alter which independent variables I put in both the covariates box and the model box, or just one of those? And then run it repeatedly with the different sets of independent variables?

    1. I believe you are correct regarding the Response box and dependent variables--I have corrected the error and clarified some instructions, thanks! As for your final question about the covariates box, only your predictors (independents) go there. Let's say you have 3 possible predictors. In your first model, you propose that A+B-->Y, and you run the GZM for that model, and it produces AIC-M1. You want to compare that to a second model, A+C-->Y, so you re-run the analysis, removing B and replacing it with C (keeping A) in both the "Covariates" box AND the "Model" box. That produces AIC-M2. For every new model you want to test, you remove the old model's predictors from all options and replace them with the new model's predictors. At that point you can compare AIC-M1 & AIC-M2 and make a more informed decision about your models.

  3. Thank you for this very useful post! However, if I am testing several interaction terms (moderators), do I need to put the interaction terms in the 'predictors'? Thank you.