Understand Linear Regression Assumptions

The 4 assumptions of linear regression, in order of importance, are:

  1. Linearity
  2. Independence of errors
  3. Constant variance of errors
  4. Normality of errors

1. Linearity

1.1. Explanation

The relationship between each predictor Xi and the outcome Y should be linear.

1.2. How to check the linearity assumption

Checking the relationship between each predictor Xi and the outcome Y separately is impractical in a multivariable model, so instead we can plot the residuals versus the fitted values. The plot should show no pattern:

[Figure: Part A — a residuals vs. fitted plot where the linearity assumption is satisfied; Part B — a residuals vs. fitted plot where it is violated]

In the figure above, focus on the red LOESS line that smoothly fits the data:

  • In part A: the LOESS line looks like “a child’s freehand drawing of a straight line” [Cohen et al. 2002], so the linearity assumption is satisfied.
  • In part B: the LOESS line is curved, so the linearity assumption is violated.
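
Here is a minimal sketch of this check in base R, using simulated data (the data frame df and the variables x1, x2, and y are hypothetical names):

  # Minimal sketch with simulated data (all object names are hypothetical)
  set.seed(1)
  df <- data.frame(x1 = runif(100, 1, 10), x2 = rnorm(100))
  df$y <- 2 + 1.5 * df$x1 - 0.5 * df$x2 + rnorm(100)

  fit <- lm(y ~ x1 + x2, data = df)

  # Base R's first diagnostic plot is "Residuals vs Fitted", with a smooth red line
  plot(fit, which = 1)

  # Or build the same plot by hand with a lowess smoother
  plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
  lines(lowess(fitted(fit), resid(fit)), col = "red")  # should look roughly flat
  abline(h = 0, lty = 2)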

1.3. What happens when the linearity assumption is violated

When this assumption is violated, we may get:

  • Biased regression coefficients
  • Biased standard errors
  • Biased p-values
  • Biased R²

So linearity is the most important linear regression assumption since its violation biases all the model’s output.

1.4. How to deal with non-linearity

When the linearity assumption is violated, try:

  • Adding a quadratic term to the model: Y = X1 + X1² + X2 + X2²
  • Transforming the predictor X (log, square root): Y = log(X1) + log(X2)
  • Adding an interaction term (since non-linearity can be due to an interaction between predictors): Y = X1 + X2 + X1×X2
  • Categorizing the predictor X (when X is a numeric variable)

How to check if your solution works:

  • If you added a quadratic or an interaction term: use ANOVA to compare the model with the quadratic or interaction term to the one without it.
  • If you used a log- or square-root-transformed predictor: an increase in the adjusted R² of the new model is a sign of a better fit.

Here’s a tutorial on how to implement these fixes in R.
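
For a quick orientation, here is a minimal sketch of these comparisons in R, continuing the hypothetical df and fit from the sketch in section 1.2:

  # Quadratic term: compare nested models with an F-test via anova()
  fit_quad <- lm(y ~ x1 + I(x1^2) + x2, data = df)
  anova(fit, fit_quad)     # a small p-value suggests the quadratic term improves the fit

  # Interaction term: same idea
  fit_int <- lm(y ~ x1 * x2, data = df)   # expands to x1 + x2 + x1:x2
  anova(fit, fit_int)

  # Log-transformed predictor: the models are not nested, so compare the adjusted R²
  fit_log <- lm(y ~ log(x1) + x2, data = df)   # requires x1 > 0
  summary(fit)$adj.r.squared
  summary(fit_log)$adj.r.squared               # a higher value suggests a better fit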

2. Independence of errors

2.1. Explanation

Error terms are independent (uncorrelated) when each observation is drawn independently at random from the population.

2.2. How to check the independence of errors

This assumption is violated in time series, spatial, and multilevel settings. So check the study design and the data collection process.

For example, errors can be correlated when we are dealing with multiple measurements made on the same participants, or when cluster sampling is used.

2.3. What happens when errors are not independent

When the independence of errors assumption is violated, we may get:

  • Biased standard errors (standard errors and confidence intervals that are too narrow)
  • Biased p-values (p-values that are too small)

So checking the independence of errors helps us assess whether the confidence intervals and p-values can be trusted.

2.4. How to deal with correlated errors

The fix depends on how the errors are correlated. You should use a data analysis method that is appropriate for your study design (for example, an ARIMA model can be used for time series data, and a multilevel model for nested data).
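
As an illustration only (not a substitute for choosing the method that fits your design), here is a minimal sketch with simulated clustered and time series data; it assumes the lme4 package is installed, and all object names are hypothetical:

  library(lme4)   # assumed installed; provides lmer() for multilevel models

  # Simulated clustered data: observations within a cluster share an error component
  set.seed(1)
  cluster_id <- rep(1:20, each = 10)
  cluster_effect <- rnorm(20, sd = 2)[cluster_id]
  x <- rnorm(200)
  y <- 1 + 0.5 * x + cluster_effect + rnorm(200)
  df_clustered <- data.frame(y, x, cluster_id)

  # A random intercept per cluster accounts for the within-cluster correlation
  fit_ml <- lmer(y ~ x + (1 | cluster_id), data = df_clustered)
  summary(fit_ml)

  # For time series data, base R's arima() fits a regression with AR(1) errors
  ts_y <- arima.sim(model = list(ar = 0.7), n = 200) + 0.5 * x
  fit_ts <- arima(ts_y, order = c(1, 0, 0), xreg = x)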

3. Constant variance of errors

3.1. Explanation

The dispersion of the data around the regression line should be constant across the range of fitted values.

3.2. How to check the constant variance of errors

⚠ Errors vs Residuals
In linear regression, the residual (the distance between an observed data point and the fitted regression line) estimates the error (the distance between that data point and the true population regression line). Therefore, we can use residuals (which we can measure) instead of errors (which we cannot) to check the assumptions of a linear regression model.

The plot of the residuals versus the fitted values should show no discernible shape. A fan, butterfly, or blob shape is a sign of violation of this assumption:

[Figure: Parts A and B — fan-shaped residuals plots; Part C — a butterfly-shaped residuals plot; Part D — a blob-shaped residuals plot]

What if the predictor X is a categorical variable?

In this case, calculate the variance of the residuals for each subgroup of X. The constant variance assumption will be violated if the variance of the residuals:

  • Changes by a factor of 2 or more between subgroups that differ in size by a factor of 2 or more.
  • Changes by a factor of 3 or more between subgroups that differ in size by a factor of less than 2.

[source: Huntington-Klein, 2022]
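
A minimal sketch of this subgroup check in base R, using simulated data with a hypothetical three-level categorical predictor:

  set.seed(1)
  group <- factor(rep(c("A", "B", "C"), times = c(100, 50, 50)))
  y <- 10 + 2 * as.numeric(group) + rnorm(200, sd = ifelse(group == "A", 1, 3))
  fit_cat <- lm(y ~ group)

  # Residual variance and sample size per subgroup
  tapply(resid(fit_cat), group, var)
  table(group)
  # Compare the ratio of the variances against the ratio of the subgroup sizes
  # using the rule of thumb above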

⚠ Do not use statistical tests to check for constant variance of errors
Statistical tests are sensitive to sample size: in small samples they do not have enough power to reject the null, and therefore give a false sense of reassurance. Also, instead of getting a pass/fail answer from a statistical test, we need to examine the severity of the violation of the constant variance assumption [source: Vittinghoff et al., 2011].

3.3. What happens when errors do not have constant variance

When the assumption of constant variance of errors is violated, we may get:

  • Biased standard errors
  • Biased p-values

Note that this assumption is less important for estimating the relationship between a predictor X and an outcome Y. It is more important for predictive modeling, when the objective of using linear regression is to predict Y given X.

3.4. How to deal with non-constant error variance

When the assumption of constant variance of errors is violated, try:

  • Transforming the outcome Y (log, square root)
  • Using weighted regression
  • Calculating heteroscedasticity-robust standard errors
  • Converting the outcome into a binary variable and then using logistic regression (least favorable solution)

Here’s a tutorial on how to implement these fixes in R.
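
For a quick orientation, here is a minimal sketch of the first three fixes in R; it uses simulated data and assumes the sandwich and lmtest packages are installed:

  library(sandwich)   # assumed installed; provides vcovHC()
  library(lmtest)     # assumed installed; provides coeftest()

  set.seed(1)
  x <- runif(200, 1, 10)
  y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.5))   # error variance grows with the mean
  fit_raw <- lm(y ~ x)

  # 1. Transform the outcome
  fit_log_y <- lm(log(y) ~ x)

  # 2. Weighted regression (weights must be chosen or estimated; 1/x is only an illustration)
  fit_wls <- lm(y ~ x, weights = 1 / x)

  # 3. Heteroscedasticity-robust (sandwich) standard errors
  coeftest(fit_raw, vcov. = vcovHC(fit_raw, type = "HC3"))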

4. Normality of errors

4.1. Explanation

Error terms should be normally distributed.

4.2. How to check the normality of errors

The normality of the errors can be checked in 2 ways:

  1. By looking at a histogram of the residuals: the distribution should be bell-shaped.
  2. By looking at a normal Q-Q plot of the residuals: The points should follow the diagonal straight line.

Here are some examples:

[Figure: Part A — histogram and Q-Q plot of a normal distribution; Part B — histogram and Q-Q plot of a left-skewed distribution; Part C — histogram and Q-Q plot of a right-skewed distribution]
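
A minimal sketch of both checks in base R, continuing the hypothetical fit from the sketch in section 1.2:

  par(mfrow = c(1, 2))
  hist(resid(fit), main = "Histogram of residuals", xlab = "Residuals")
  qqnorm(resid(fit)); qqline(resid(fit))   # points should follow the line
  par(mfrow = c(1, 1))

  # plot(fit, which = 2) also draws a normal Q-Q plot of the standardized residuals
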
⚠ Do not use statistical tests to check the normality of errors
Statistical tests are sensitive to sample size:
  • In small samples, where the assumption of normality matters most for linear regression, they do not have enough power to reject the null.
  • In large samples, where linear regression is robust to departures from normality, they reject the null even for small departures.
[source: Vittinghoff et al., 2011]

4.3. What happens when errors are not normally distributed

When the normality of errors assumption is violated, we may get:

  • Biased standard errors
  • Biased p-values

This is the least important of the linear regression assumptions since it should be taken into consideration only in 2 cases:

  1. When working with a small sample size.
  2. When using linear regression for prediction purposes, i.e. when calculating prediction intervals. By contrast, when the objective is to estimate the relationship between a predictor X and an outcome Y, Gelman and colleagues do not even recommend checking the normality of the residuals [Gelman et al., 2020].

4.4. How to deal with non-normal errors

When the normality of errors assumption is violated, try:

  • Transforming the outcome Y (log, square root)
  • Removing outliers (observations with Y values that are far from the regression line)
  • Transforming the outcome into a binary variable and then using logistic regression (least favorable solution)

Here’s a tutorial on how to implement these fixes in R.
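
As a quick orientation, here is a minimal sketch in base R of the first two fixes, using simulated right-skewed data (all object names are hypothetical):

  set.seed(1)
  x <- runif(100, 1, 10)
  y <- exp(0.5 + 0.2 * x + rnorm(100, sd = 0.4))   # right-skewed outcome
  fit_skew <- lm(y ~ x)

  # 1. Transform the outcome
  fit_logy <- lm(log(y) ~ x)

  # 2. Flag observations that are far from the regression line before deciding
  #    whether they should be removed
  which(abs(rstandard(fit_skew)) > 3)               # large standardized residuals
  which(cooks.distance(fit_skew) > 4 / length(y))   # influential points (a common rule of thumb)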

⚠ Note:
Linear regression does not require the outcome Y (nor the predictor X) to be normally distributed. Only the errors need to be normally distributed, and they can be even when Y is not.

References

  • Keith TZ. Multiple Regression and Beyond: An Introduction to Multiple Regression and Structural Equation Modeling. 3rd edition. Routledge; 2019.
  • Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd edition. Springer; 2011.
  • Gelman A, Hill J, Vehtari A. Regression and Other Stories. 1st edition. Cambridge University Press; 2020.
  • Cohen J, Cohen P, West SG, Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd edition. Routledge; 2002.
  • Huntington-Klein N. The Effect. 1st edition. Routledge; 2022.
