How to Deal with Multicollinearity in R

Multicollinearity occurs when there is a strong linear relationship between two or more predictors in a regression model. It is a problem because it inflates the standard errors of the regression coefficients, leading to noisy, unstable estimates.

Let’s simulate some data in R:

# create 2 highly correlated variables: var1 and var2
set.seed(1)
var1 = rnorm(100, 0, 1)
var2 = var1 + 0.3*rnorm(100, 0, 1)

# generate the outcome as a function of both variables plus noise
y = var1 + var2 + 2*rnorm(100, 0, 1)

# fit a linear model with both correlated predictors
model = lm(y ~ var1 + var2)

# check for multicollinearity
library(car)
vif(model)
#    var1    var2 
# 10.7635 10.7635

Both variables have a VIF (Variance Inflation Factor) above 10, a common rule-of-thumb cutoff, so we have a collinearity problem in our model.
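
To see where these numbers come from, we can compute a VIF by hand: regress one predictor on the other(s) and plug the resulting R-squared into 1/(1 - R^2). A quick sketch of that calculation, which should reproduce the vif() output above:

# VIF of var1 = 1 / (1 - R^2 from regressing var1 on the other predictor)
r2 = summary(lm(var1 ~ var2))$r.squared
1 / (1 - r2)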

We can deal with multicollinearity by:

  1. Combining the variables.
  2. Using Principal Components Regression (PCR).
  3. Removing one of the variables from the model.

1. Combining the variables

HOW IT WORKS: We can combine var1 and var2 by using any linear combination of the two, for example, taking their average.

WHEN TO USE IT: Replacing the variables with their average makes sense when they are measured on a comparable scale, for instance if var1 and var2 represent grades in different courses.

Let’s replace var1 and var2 in the model with their average:

var_avg = (var1 + var2)/2

summary(lm(y ~ var_avg))
# Call:
# lm(formula = y ~ var_avg)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -5.8303 -0.9738 -0.0806  1.3214  5.2636 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.05692    0.20924   0.272    0.786    
# var_avg      2.02349    0.22974   8.808 4.66e-14 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.079 on 98 degrees of freedom
# Multiple R-squared:  0.4418,	Adjusted R-squared:  0.4361 
# F-statistic: 77.58 on 1 and 98 DF,  p-value: 4.66e-14
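
The average is just one choice of linear combination. If one variable should carry more weight (based on domain knowledge, say), a weighted sum works the same way; the weights below are purely illustrative:

# an illustrative weighted combination (weights chosen arbitrarily)
var_wtd = 0.7*var1 + 0.3*var2

summary(lm(y ~ var_wtd))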

2. Using principal components regression

HOW IT WORKS: Principal Components Analysis (PCA) lets us replace the variables that are causing the collinearity problem with a smaller set of uncorrelated variables, called principal components, that capture their shared variation and can be used as predictors in the regression model.

WHEN TO USE IT: Replacing variables with a few principal components that represent their shared part works best when these variables are different measures of the same underlying concept (for example, different test scores used to measure intelligence).

In our case, we will replace var1 and var2 with only 1 principal component that best represents both of them:

# first put var1 and var2 in a data frame
dat = data.frame(var1 = var1, var2 = var2)

# run PCA
pca = prcomp(dat, scale = TRUE) # scale=TRUE to standardize the variables

# print the first few rows of the principal components
head(pca$x)
#             PC1         PC2
# [1,] -1.2615022  0.10370865
# [2,]  0.1329040 -0.01520087
# [3,] -1.6484387  0.16129968
# [4,]  2.3290610  0.01126323
# [5,]  0.2003748  0.14699174
# [6,] -1.0225152 -0.44075409

Note that scale = TRUE standardizes the variables var1 and var2 so that the scale on which each variable is measured does not affect the results of the PCA.
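
If you want to check what prcomp() is doing under the hood, the scores in pca$x can be reproduced by standardizing the variables and multiplying by the loading matrix pca$rotation. A quick sanity check (scores_manual is just an illustrative name):

# standardize var1 and var2, then project onto the loadings
scores_manual = scale(as.matrix(dat)) %*% pca$rotation
head(scores_manual) # should match head(pca$x) above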

So far PCA has only transformed var1 and var2 into two new components, PC1 and PC2; the dimension reduction comes from keeping just the first one. Let’s see the proportion of variance explained by each of them:

# we divide the variance explained by each component
# by the variance explained by all components
pca$sdev^2 / sum(pca$sdev^2)
# 0.97620727 0.02379273

So the first component PC1 explains 97.6% of the variance in the data, and PC2 explains only 2.4%. Therefore, PC1 alone is a good representative of both var1 and var2.
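
It is also worth inspecting the loadings in pca$rotation. With only two standardized variables, PC1 weights var1 and var2 equally in magnitude (about ±0.71 each), which is what makes it a good stand-in for both and links this approach to the averaging strategy above:

# the loadings show how PC1 and PC2 combine var1 and var2
pca$rotation

# summary(pca) reports the same proportions of variance explained
summary(pca)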

Finally, we replace both var1 and var2 in the regression model with PC1:

pc1 = pca$x[,'PC1']

summary(lm(y ~ pc1))
# Call:
# lm(formula = y ~ pc1)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -5.8338 -0.9596 -0.0768  1.3088  5.2650 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.2658     0.2079   1.279    0.204    
# pc1           1.3173     0.1495   8.811 4.58e-14 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.079 on 98 degrees of freedom
# Multiple R-squared:  0.442,	Adjusted R-squared:  0.4363 
# F-statistic: 77.64 on 1 and 98 DF,  p-value: 4.583e-14
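
One practical note if the PCR model will be used for prediction: new observations have to be passed through the same PCA transformation before the regression can score them. A minimal sketch, where new_dat and pcr_model are illustrative names and the new values are made up:

# hypothetical new observations (values chosen for illustration only)
new_dat = data.frame(var1 = c(0.5, -1.2), var2 = c(0.6, -1.0))

# project the new data onto the fitted principal components
new_pc1 = predict(pca, newdata = new_dat)[, 'PC1']

# predict y from the PC1-based regression
pcr_model = lm(y ~ pc1)
predict(pcr_model, newdata = data.frame(pc1 = new_pc1))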

3. Removing one of the variables from the model

HOW IT WORKS: Remove one of the predictor variables that are causing the problem; if more than one predictor remains, look at the VIFs again to make sure that the collinearity problem is fixed.

WHEN TO USE IT: Dropping one of the variables may be beneficial when the variables causing the collinearity problem are not measuring the same thing but just happen to be highly correlated (perhaps because one is a direct cause of the other).

In our example, we will drop var2:

model = lm(y ~ var1)
summary(model)
# Call:
# lm(formula = y ~ var1)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -5.9936 -0.8893  0.0257  1.4725  5.5166 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.04343    0.20999   0.207    0.837    
# var1         2.04202    0.23324   8.755 6.06e-14 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.084 on 98 degrees of freedom
# Multiple R-squared:  0.4389,	Adjusted R-squared:  0.4332 
# F-statistic: 76.65 on 1 and 98 DF,  p-value: 6.06e-14
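
If it is not obvious which variable to drop, domain knowledge should drive the choice, but comparing the two candidate models with an information criterion such as AIC can help as a tiebreaker:

# compare the two single-predictor models; lower AIC is preferred
AIC(lm(y ~ var1), lm(y ~ var2))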


Further reading