How to Deal with Violation of Normality of Errors in R

Linear regression assumes that error terms are normally distributed. This is especially important when we are using linear regression for prediction purposes and our sample size is small (see: Understand Linear Regression Assumptions).

When the normality of errors assumption is violated, try:

  1. Transforming the outcome variable
  2. Removing outliers
  3. Transforming the outcome into a binary variable then using logistic regression

Let’s create some data to demonstrate these methods:

set.seed(1) # for reproducibility
x = runif(100)
y = x + runif(100) # uniform (hence non-normal) errors
model = lm(y ~ x)

par(mfrow = c(1, 2)) # plot side by side
hist(model$residuals) # histogram of residuals
plot(model, 2, pch = 16) # residuals normal Q-Q plot

Output:

histogram and normal q-q plot showing non-normal residuals

So we see that the residuals (which are our estimates of the errors) are not normally distributed.
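
As a complementary numeric check (an addition to the plots shown above), the Shapiro-Wilk test can back up this visual impression; a small p-value is evidence against normality:

# Shapiro-Wilk test of normality on the residuals:
# a p-value below 0.05 suggests the residuals are not normal
shapiro.test(model$residuals)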

Solution #1: Transforming the outcome variable

The first solution we can try is to transform the outcome Y by using a log or a square root transformation.

The advantage of a log transformation is that the model remains interpretable, so this is what we are going to use here (for information on how to interpret such a model, see: Interpret Log Transformations in Linear Regression):

model1 = lm(log(y) ~ x)

par(mfrow = c(1, 2)) # plot side by side
hist(model1$residuals) # histogram of residuals
plot(model1, 2, pch = 16) # residuals normal Q-Q plot

Output:

histogram and normal q-q plot showing non-normal residuals even after a log transformation of the outcome

In this case, the log transformation of Y did not solve our problem since the residuals are still far from being normally distributed.
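
For comparison, here is a minimal sketch of the square root transformation mentioned above (the name model2 is ours, for illustration); the same diagnostic plots apply:

model2 = lm(sqrt(y) ~ x)

par(mfrow = c(1, 2)) # plot side by side
hist(model2$residuals) # histogram of residuals
plot(model2, 2, pch = 16) # residuals normal Q-Q plot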

Solution #2: Removing outliers

In many cases, non-normality of errors is due to the presence of outliers. An outlier is an observation whose Y value is far from the value predicted by the regression line.

To identify outliers, we can plot studentized residuals (where each residual is divided by an estimate of its standard deviation) versus fitted values. Observations whose studentized residuals are larger than 3 in absolute value are possible outliers [James et al., 2021].

Here’s how to plot studentized residuals vs fitted values in R:

model = lm(y ~ x)

fitted_values = predict(model)
studentized_residuals = rstudent(model)

plot(fitted_values, studentized_residuals, pch = 16,
     xlab = 'Fitted values', ylab = 'Studentized residuals',
     ylim = c(min(c(-3.3, min(studentized_residuals))),
              max(c(3.3, max(studentized_residuals)))))
abline(h = -3, lty = 3, col = 'red')
abline(h = 3, lty = 3, col = 'red')

Output:

a plot of studentized residuals versus fitted values showing no outliers

The plot shows no outliers since no observations have studentized residuals above 3 or below -3.
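
We can confirm this numerically with a quick count (a small addition for completeness):

# number of observations with |studentized residual| > 3
sum(abs(studentized_residuals) > 3)
# 0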

But suppose we had an outlier. Here's how to determine its position in the dataset:

which.max(studentized_residuals) # index of the largest studentized residual
# 12

In this case, the 12th observation is an outlier.
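
Note that which.max() only flags the largest positive residual; a slightly more general sketch checks both tails at once:

# indices of all observations with |studentized residual| > 3
which(abs(studentized_residuals) > 3)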

Next, we can delete this observation and rerun the linear regression to check whether the residual standard error decreased or the R² increased (both signs of improved model fit):

dat = data.frame(x, y) # gather the variables in a data frame

# R-squared before and after removing the outlier
summary(lm(y ~ x, data = dat))$r.squared
summary(lm(y ~ x, data = dat[-12, ]))$r.squared
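
Similarly, the residual standard error mentioned above can be compared through the sigma element of the model summary:

# residual standard error before and after removing the outlier
summary(lm(y ~ x, data = dat))$sigma
summary(lm(y ~ x, data = dat[-12, ]))$sigma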

Solution #3: Transforming the outcome into a binary variable then using logistic regression

This is the least favorable solution, since converting a continuous variable into a binary one loses information.

The cutpoint used to split the y variable should be chosen according to background knowledge. In the example below, we will split y at its median:

y_binary = as.factor(ifelse(y > median(y), 'high', 'low')) # dichotomize y at its median

# run logistic regression
summary(glm(y_binary ~ x, family = 'binomial'))

Output:

Call:
glm(formula = y_binary ~ x, family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.09229  -0.63413  -0.01602   0.74769   1.94530  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.4261     0.7003   4.892 9.96e-07 ***
x            -6.5840     1.2478  -5.277 1.32e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.629  on 99  degrees of freedom
Residual deviance:  92.896  on 98  degrees of freedom
AIC: 96.896

Number of Fisher Scoring iterations: 4

Note that because the factor levels are ordered alphabetically ('high' before 'low'), glm() models the probability that y_binary is 'low'; this is why the coefficient of x is negative: a larger x predicts a larger y, and therefore lower odds of falling in the 'low' category.

If you are interested, see: How to Run and Interpret a Logistic Regression Model in R.

References

  • James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer; 2021.

Further reading