Friday, April 26, 2013

Testing for Data Normality in R

This tutorial uses the free, open-source statistics software R. It uses primarily command-line entry, not as simple as SPSS, but far more powerful.

For regression analysis, your univariate data (all of the continuous variables used, independent and dependent) has to be normal, equivalent to a bell-shaped, gaussian distribution. You can visualize the curve of this distribution with a histogram, verification of which can help you establish the normality of your data. Visualization with a Q-Q plot can also help verify normality. However, these "eye-ball" tests have never satisfied my empirical instincts, since what seems to be "normal enough" for one person, may not be normal to another.

While many in the social sciences still are not implementing, or at least reporting, these tests, without using normal data, the regression analysis may be invalid. Several tests exist for demonstrating data normality, with the follow-up procedures for non-normality being data transformation and/or removal of outliers. The Law of Large Numbers and the Central Limit Theorem provide some protection for your analysis if you have a large sample (n>30) even if your data is not normal, as do certain kinds of tests that are considered robust. However, it is good practice to try to get your data as normal as possible, since the math behind these types of analyses assume univariate data normality. The primary normality tests are as follows:

  1. Visual tests: Histogram and Q-Q plots
  2. Z-score tests for skew and kurtosis (Statistic/Standard Error for that Statistic)
  3. Specialized tests: Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera, etc.
  4. Filliben's test for Q-Q plot correlation
For the numerical tests, significance of the model typically requires your p-value > 0.05. This causes my students immeasurable confusion, since they have been taught in other classes, and in most instances of my class, that they should look for p < 0.05 to determine if a test is significant. The latter is for tests of means, such as ANOVA, or significance for their regression model and coefficient. In those cases, the null hypothesis is: the means are equal. If you are testing that one intervention is effective, and another one isn't, or that one group is different from another, you want to reject the null hypothesis, therefore, you are looking for p < 0.05. On the other hand, for data normality tests, the null hypothesis is: the data is normal, or the data is equivalent to a normal distribution. In this case, you want your data to be normal, since otherwise you cannot even proceed to the means test. You want to be able to "fail to reject the null hypothesis," i.e., you want p > 0.05.

USING DATASET “WORLD” VARIABLE “HIV” (UPDATED 5/26/2013)

Find the optimal transformation using Box-Cox procedure: generates the optimal λ (power)
(Note: The first version of this post recommended the command box.cox.powers, which has since been deprecated, so is no longer available.)
A handy way to obtain a good univariate lambda power value is from the AID package, that you will need to install.
library(AID)
>boxcoxnc(world$hiv)
The output is a best-guess lambda, plus the normality values for each of 7 common normality tests. Be certain to look at the plots that are generated. If the software can generate a reasonable lambda, then the plots will show a "peak" (either minimum or maximum). Sometimes a reasonable lambda cannot be determined within the -2 to 2 boundaries, and the plots will not peak for any of the normality tests, but will simply appear linear. In this case, it may be that the data cannot be transformed.

Generating a Box-Cox transformation: Generates [(xλ-1)/λ], where x is each value, and λ is a power
>bcPower(world$hiv, λ)

Produce histogram
>hist(world$hiv)
Importing text files seems to create non-numeric data problems. Import from SPSS works OK, as does importing from the clipboard using x.num <- as.numeric(readClipboard())

Produce qq-plot
>qqPlot(world$hiv)

QQ-Plot Correlation Coefficient (Filliben, 1975)
>qqp <- qqnorm(hivt) [also produces a qq-plot, but less informative than qqPlot]
>cor(qqp$x,qqp$y)
Interpretation for sample size at http://www.dm.unibo.it/~simoncin/QQCritVal.pdf

Remove missing data—some tests won’t work with missing data
>hivt <- na.omit(world$hiv)

Skewness and Kurtosis
Some argue that any skew and kurtosis < 1 represent normal data (assuming k-3, the value most software provides, since for normal distribution, unmodified k=3). Most use the Z-score of skew and kurtosis. If n < 200, then most argue that normal data is given by Z < 1.96, and if n>200, then Z < 2.58 likely represents normal data. For n > 2,000, these numbers often produce unreliable results. Z-Skew = [skew/standard error of skew], and Z-Kurtosis = [kurtosis/standard error of kurtosis].

In R, package e1071 produces skew and kurtosis, so will need to be installed and then activated. The default for these commands is "Type 3", which is the standard skewness formula adjusted for the sample size, called the "adjusted Fisher-Pearson standardized moment coefficient," modified by multiplying skew by: n/(n-1)(n-2). If you change this to "Type=1" then you generate the results from the standard formula, and if you change this to "Type=2" then you get the values produced by SPSS. You can also load the package "moments" to get a skew and kurtosis command that matches the hand-calculated values. Package e1071, Type=1, uses sample size = n-1, presuming you are using a "sample," while package Moments uses sample size = n, presuming you are using a population. Each of these methods used to calculate skew and kurtosis can produce different values, but which are relatively close to each other, and should be statistically insignificant.

Skewness and Z-skew
>skewness(hivt)
>skewness(hivt)/sqrt(6/n) [n=sample size]
or >skewness(hivt)/sqrt(6/length(hivt))

Kurtosis and Z-kurtosis
>kurtosis(hivt)
>kurtosis(hivt)/sqrt(24/n) [n=sample size]
or >kurtosis(hivt)/sqrt(24/length(hivt))

Specialized Numerical Tests: Empirical Distribution Functions
Kolmogorov-Smirnov test-Liliefors Correction (need to install library nortest)
lillie.test(hivt)

Kolmogorov-Smirnov (uncorrected)
ks.test(hivt,pnorm)

Shapiro-Wilk test (Royston correction)
>shapiro.test(hivt)

Jarque-Bera test (need to install library tseries)
>jarque.bera.test(hivt)

Robust Jarque-Bera test (need to install library lawstat)
>rjb.test(hivt)

Anderson-Darling test (from nortest)
>ad.test(hivt)

Cramer-von Mises test (from nortest)
>cvm.test(hivt)

Shapiro-Francia test (from nortest)
>sf.test(hivt)

No comments:

Post a Comment