Linear regression assumes that error terms are normally distributed. This is especially important when we are using linear regression for prediction purposes and our sample size is small (see: Understand Linear Regression Assumptions).
When the normality of errors assumption is violated, try:
- Transforming the outcome variable
- Removing outliers
- Transforming the outcome into a binary variable then using logistic regression
Let’s create some data to demonstrate these methods:
set.seed(1)
x = runif(100)
y = x + runif(100)
model = lm(y ~ x)

par(mfrow = c(1, 2))        # plot side by side
hist(model$residuals)       # histogram of residuals
plot(model, 2, pch = 16)    # residuals normal Q-Q plot
Output:
So we see that the residuals (which are our estimates of the errors) are not normally distributed.
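If you want a formal check to supplement the visual one, base R's shapiro.test can be applied to the residuals (a supplementary sketch; a small p-value suggests a departure from normality):

# Shapiro-Wilk normality test on the residuals
shapiro.test(model$residuals)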
Solution #1: Transforming the outcome variable
The first solution we can try is to transform the outcome Y by using a log or a square root transformation.
The difference is that with a log transformation the model coefficients remain directly interpretable, so this is what we are going to use here (for information on how to interpret such a model, see: Interpret Log Transformations in Linear Regression):
model1 = lm(log(y) ~ x)

par(mfrow = c(1, 2))         # plot side by side
hist(model1$residuals)       # histogram of residuals
plot(model1, 2, pch = 16)    # residuals normal Q-Q plot
Output:
In this case, the log transformation of Y did not solve our problem since the residuals are still far from being normally distributed.
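The square root transformation mentioned above can be tried in exactly the same way; a quick sketch, shown for completeness:

model2 = lm(sqrt(y) ~ x)

par(mfrow = c(1, 2))         # plot side by side
hist(model2$residuals)       # histogram of residuals
plot(model2, 2, pch = 16)    # residuals normal Q-Q plot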
Solution #2: Removing outliers
In many cases, non-normality of errors is due to the presence of outliers. An outlier is an observation with a Y value that is far from the regression line.
To identify outliers, we can plot studentized residuals (where each residual is divided by an estimate of its standard deviation) versus fitted values. Observations whose studentized residuals are larger than 3 in absolute value are possible outliers [James et al., 2021].
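For reference, rstudent in R computes externally studentized residuals; one standard formulation is

$$t_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}}$$

where $e_i$ is the $i$-th residual, $h_{ii}$ its leverage, and $\hat{\sigma}_{(i)}$ the residual standard error estimated with observation $i$ excluded.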
Here’s how to plot studentized residuals vs fitted values in R:
model = lm(y ~ x)
fitted_values = predict(model)
studentized_residuals = rstudent(model)

plot(fitted_values, studentized_residuals,
     pch = 16,
     xlab = 'Fitted values',
     ylab = 'Studentized residuals',
     ylim = c(min(c(-3.3, min(studentized_residuals))),
              max(c(3.3, max(studentized_residuals)))))
abline(h = -3, lty = 3, col = 'red')   # lower outlier threshold
abline(h = 3, lty = 3, col = 'red')    # upper outlier threshold
Output:
The plot shows no outliers since no observations have studentized residuals above 3 or below -3.
But suppose we had an outlier.
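Since our simulated data contains none, we can inject one ourselves for demonstration. A minimal sketch, working on a copy of the data so the original y is left untouched (the choice of observation 12 and the offset of 2 are arbitrary, purely for illustration):

# copy the data into a data frame and shift the 12th observation upward
# (index and offset are arbitrary, for illustration only)
dat = data.frame(x, y)
dat$y[12] = dat$y[12] + 2

model = lm(y ~ x, data = dat)            # refit on the modified data
studentized_residuals = rstudent(model)  # recompute studentized residuals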
Here’s how to determine its position in the dataset:
which.max(studentized_residuals) # 12
In this case, the 12th observation is an outlier.
Next, we can delete this observation and rerun the linear regression to check whether the residual standard error decreased or the R² increased (both are signs of an improved model fit):
# R-squared before and after removing the outlier
summary(lm(y ~ x, data = dat))$r.squared
summary(lm(y ~ x, data = dat[-12, ]))$r.squared
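The residual standard error can be compared in the same way; a short sketch using the same dat data frame:

# residual standard error before and after removing the outlier
summary(lm(y ~ x, data = dat))$sigma
summary(lm(y ~ x, data = dat[-12, ]))$sigma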
Solution #3: Transforming the outcome into a binary variable then using logistic regression
This is the least favorable solution, since converting a continuous variable into binary involves losing information.
The cutpoint used to split the y variable should be chosen according to background knowledge. In the example below, we split y in half using its median as the cutpoint:
# dichotomize y at its median
y_binary = as.factor(ifelse(y > median(y), 'high', 'low'))

# run logistic regression
summary(glm(y_binary ~ x, family = 'binomial'))
Output:
Call:
glm(formula = y_binary ~ x, family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.09229  -0.63413  -0.01602   0.74769   1.94530  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.4261     0.7003   4.892 9.96e-07 ***
x            -6.5840     1.2478  -5.277 1.32e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.629  on 99  degrees of freedom
Residual deviance:  92.896  on 98  degrees of freedom
AIC: 96.896

Number of Fisher Scoring iterations: 4
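As a quick usage note, the fitted model can produce predicted probabilities for new values of x; a minimal sketch (the new x values are arbitrary, and note that glm models the probability of the second factor level, 'low' here):

# fit once, then predict P(y_binary = 'low') for new x values
# ('low' is the second factor level, hence the modeled outcome)
logit_model = glm(y_binary ~ x, family = 'binomial')
new_x = data.frame(x = c(0.2, 0.5, 0.8))
predict(logit_model, newdata = new_x, type = 'response')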
If you are interested, see: How to Run and Interpret a Logistic Regression Model in R.
References
- James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer; 2021.