Multicollinearity occurs when there is a strong linear relationship between 2 or more predictors in a regression model. It is a problem because it increases the standard errors of the regression coefficients, leading to noisy estimates.
Let’s simulate some data in R:
# create 2 highly correlated variables: var1 and var2 set.seed(1) var1 = rnorm(100, 0, 1) var2 = var1 + 0.3*rnorm(100, 0, 1) y = var1 + var2 + 2*rnorm(100, 0, 1) model = lm(y ~ var1 + var2) # check for multicollinearity library(car) vif(model) # var1 var2 # 10.7635 10.7635
We have a collinearity problem in our model since our variables’ VIFs (Variance Inflation Factor) are higher than 10.
We can deal with multicollinearity by:
- Combining the variables.
- Using Principal Components Regression (PCR).
- Removing one of the variables from the model.
1. Combining the variables
HOW IT WORKS: We can combine var1 and var2 by using any linear combination of the two, for example, taking their average.
WHEN TO USE IT: Replacing variables with their average works, for instance, if var1 and var2 represent grades in different courses.
Let’s replace var1 and var2 in the model with their average:
var_avg = (var1 + var2)/2 summary(lm(y ~ var_avg)) # Call: # lm(formula = y ~ var_avg) # # Residuals: # Min 1Q Median 3Q Max # -5.8303 -0.9738 -0.0806 1.3214 5.2636 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.05692 0.20924 0.272 0.786 # var_avg 2.02349 0.22974 8.808 4.66e-14 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 2.079 on 98 degrees of freedom # Multiple R-squared: 0.4418, Adjusted R-squared: 0.4361 # F-statistic: 77.58 on 1 and 98 DF, p-value: 4.66e-14
2. Using principal components regression
HOW IT WORKS: Principal Components Analysis (PCA) provides us with a way to replace the variables that are causing a collinearity problem with a set of fewer uncorrelated variables that represent their shared part, called principal components, as predictors in the regression model.
WHEN TO USE IT: Replacing variables with few principal components that represent their shared part works best when these variables are different measures of the same underlying concept. (for example, if you are using different test scores to measure intelligence)
In our case, we will replace var1 and var2 with only 1 principal component that best represents both of them:
# first put var1 and var2 in a data frame dat = data.frame(var1 = var1, var2 = var2) # run PCA pca = prcomp(dat, scale = TRUE) # scale=TRUE to standardize the variables # print the first few rows of the principle components head(pca$x) # PC1 PC2 # [1,] -1.2615022 0.10370865 # [2,] 0.1329040 -0.01520087 # [3,] -1.6484387 0.16129968 # [4,] 2.3290610 0.01126323 # [5,] 0.2003748 0.14699174 # [6,] -1.0225152 -0.44075409
Note that scale = True standardizes the variables var1 and var2 so that the scale on which each variable was measured will not affect the results of PCA.
Now we reduced var1 and var2 to 2 other components PC1 and PC2. Let’s see the proportion of variance explained by each of them:
# we divide the variance explained by each component # by the variance explained by all components pca$sdev^2 / sum(pca$sdev^2) # 0.97620727 0.02379273
So the first component PC1 explains 97.6% of the variance in the data, and PC2 explains only 2.4%. Therefore, PC1 alone is a good representative of both var1 and var2.
Finally, we replace both var1 and var2 in the regression model by PC1:
pc1 = pca$x[,'PC1'] summary(lm(y ~ pc1)) # Call: # lm(formula = y ~ pc1) # # Residuals: # Min 1Q Median 3Q Max # -5.8338 -0.9596 -0.0768 1.3088 5.2650 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.2658 0.2079 1.279 0.204 # pc1 1.3173 0.1495 8.811 4.58e-14 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 2.079 on 98 degrees of freedom # Multiple R-squared: 0.442, Adjusted R-squared: 0.4363 # F-statistic: 77.64 on 1 and 98 DF, p-value: 4.583e-14
3. Removing one of the variables from the model
HOW IT WORKS: Remove one of the predictor variables that are causing the problem, then look at the VIFs again to make sure that the collinearity problem is fixed.
WHEN TO USE IT: Dropping one of the variables may be beneficial when the variables that are causing the collinearity problem are not measuring the same thing, instead they just happen to be highly correlated (maybe because one is a direct cause of the other).
In our example, we will drop var2:
model = lm(y ~ var1) summary(model) # Call: # lm(formula = y ~ var1) # # Residuals: # Min 1Q Median 3Q Max # -5.9936 -0.8893 0.0257 1.4725 5.5166 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.04343 0.20999 0.207 0.837 # var1 2.04202 0.23324 8.755 6.06e-14 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 2.084 on 98 degrees of freedom # Multiple R-squared: 0.4389, Adjusted R-squared: 0.4332 # F-statistic: 76.65 on 1 and 98 DF, p-value: 6.06e-14
- Huntington-Klein N. The Effect. 1st edition. Routledge; 2022.
- James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. 2nd ed. 2021 edition. Springer; 2021.