How to Deal with Heteroscedasticity in Regression in R

Linear regression assumes that the variance of the errors is constant, i.e. that the dispersion of data points around the regression line is the same across the range of the predictors (an assumption known as homoscedasticity).

We can deal with violation of this assumption (i.e. with heteroscedasticity) by:

  1. Transforming the outcome variable
  2. Calculating heteroscedasticity-robust standard errors
  3. Using weighted regression
  4. Using a t-test with unequal variances (when the only predictor is a binary categorical variable)
  5. Converting the outcome into a binary variable then using logistic regression

Let’s create some heteroscedastic data to demonstrate these methods:

set.seed(0)
x = runif(100)
# the noise standard deviation grows with x, which makes y heteroscedastic
y = x + rnorm(n = 100, mean = 10, sd = x)

model = lm(y ~ x)

# plot residuals vs fitted values
plot(model, 1, pch = 16)

Output:

residuals versus fitted values plot showing heteroscedasticity

The residuals vs fitted values plot shows a fan shape, which is evidence of heteroscedasticity. (For more information, see: How to Check Linear Regression Assumptions in R)
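If you prefer a formal check alongside the plot, the Breusch-Pagan test from the lmtest package (which we will also load later for coeftest) tests the null hypothesis of constant error variance; this step is optional and not required for the solutions below:

library("lmtest")

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
bptest(model)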

Solution #1: Transforming the outcome variable

The first solution we can try is to transform the outcome Y by using a log or a square root transformation.

The advantage of the log transformation is that the model remains interpretable, so that is what we will use here (for information on how to interpret such a model, see: Interpret Log Transformations in Linear Regression):

model1 = lm(log(y) ~ x)

# re-check residuals vs fitted values plot
plot(model1, 1)

Output:

residuals versus fitted values plot still showing a fan shape

The log transformation did not solve our problem in this case: the residuals vs fitted values plot still shows a fan shape instead of a random pattern.
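If you want to try the square-root transformation mentioned above as well, the check is analogous (whether it removes the fan shape depends on how the variance of y relates to its mean, so it may not help here either):

# alternative: square-root transformation of the outcome
model_sqrt = lm(sqrt(y) ~ x)
plot(model_sqrt, 1)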

Solution #2: Calculating heteroscedasticity-robust standard errors

Since heteroscedasticity biases only the standard errors (the regression coefficients remain unbiased), we can replace the ordinary standard errors with ones that are robust to heteroscedasticity.

Here’s how to do it in R:

model = lm(y ~ x)

library("lmtest") # for coeftest
library("sandwich") # for vcovHC

# estimating heteroscedasticity-robust standard errors
coeftest(model, vcov = vcovHC(model))

Output:

t test of coefficients:

            Estimate Std. Error  t value  Pr(>|t|)    
(Intercept) 9.986870   0.073587 135.7143 < 2.2e-16 ***
x           0.920409   0.193715   4.7514 6.922e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this case, the robust standard errors differ only slightly from the ordinary standard errors calculated using summary(model).
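To see the two sets of standard errors side by side, we can extract them from the objects created above (this comparison is just a convenience and is not part of the original output):

# ordinary standard errors from the OLS fit
ordinary_se = summary(model)$coefficients[, "Std. Error"]

# heteroscedasticity-robust standard errors from the sandwich estimator
robust_se = sqrt(diag(vcovHC(model)))

round(cbind(ordinary_se, robust_se), 4)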

Solution #3: Using weighted regression

Another solution is weighted regression, which handles heteroscedasticity by giving less weight to the observations where the variance of y is high and more weight to those where it is low.

How to calculate the weights for weighted regression?

First of all, these weights should be inversely proportional to the variance of Y, so:

weights = 1/variance of Y

Next, we have to estimate the variance of Y for different values of X. A common approach is to regress the absolute values of the residuals on the fitted values; the squared fitted values from this auxiliary regression are then our estimates of the variance. (For more information, see Weighted Regression)

In R we run the following code:

# ordinary linear regression model
model = lm(y ~ x)

# estimating the variance of y for different values of x
variance = lm(abs(model$residuals) ~ model$fitted.values)$fitted.values^2

# calculating the weights
weights = 1 / variance

# weighted regression model
weighted_model = lm(y ~ x, weights = weights)
summary(weighted_model)

Output:

Call:
lm(formula = y ~ x, weights = weights)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-3.0384 -0.9777  0.0516  0.6491  3.4628 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.00850    0.02994 334.282  < 2e-16 ***
x            0.86790    0.11637   7.458 3.57e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.207 on 98 degrees of freedom
Multiple R-squared:  0.3621,	Adjusted R-squared:  0.3556 
F-statistic: 55.62 on 1 and 98 DF,  p-value: 3.57e-11
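To check whether the weighting has addressed the problem, one option (a quick extra check, not part of the original walkthrough) is to plot the weighted residuals against the fitted values; if the variance estimates are reasonable, the fan shape should be much less pronounced:

# weighted residuals vs fitted values for the weighted model
plot(fitted(weighted_model), weighted.residuals(weighted_model),
     pch = 16, xlab = "Fitted values", ylab = "Weighted residuals")
abline(h = 0, lty = 2)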

Solution #4: Using a t-test with unequal variances

When the only predictor is a binary categorical variable (i.e. the data fall into two groups), we can use a t-test with unequal variances (a.k.a. Welch's t-test) instead of linear regression, since it does not assume that the two groups have equal variances. Our simulated data do not contain such a predictor, so the call below simply demonstrates the syntax by comparing the means of x and y:

t.test(x, y) # the default is to run a Welch's t-test

Output:

	Welch Two Sample t-test

data:  x and y
t = -159.43, df = 142.38, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.068735  -9.822109
sample estimates:
 mean of x  mean of y 
 0.5207647 10.4661865 
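In a real application, we would compare y across the two levels of the categorical predictor. Here is a minimal sketch with a hypothetical grouping variable (group is made up for illustration; it is not part of the data simulated above):

# hypothetical binary predictor: split the observations into two groups based on x
group = ifelse(x > median(x), "high_x", "low_x")

# Welch's t-test comparing y across the two groups (var.equal = FALSE is the default)
t.test(y ~ group)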

Solution #5: Converting the outcome into a binary variable then using logistic regression

This is the least favorable solution, since converting a continuous variable into binary involves losing information.

The cutpoint to split the y variable should be chosen according to background knowledge. In the example below, we will split y in half using its median as a cutpoint:

y_binary = as.factor(ifelse(y > median(y), 'high', 'low'))

# run logistic regression
summary(glm(y_binary ~ x, family = 'binomial'))

Output:

Call:
glm(formula = y_binary ~ x, family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.61361  -0.95142  -0.01876   0.89945   1.84797  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.0157     0.5327   3.784 0.000155 ***
x            -3.8594     0.9217  -4.187 2.82e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.63  on 99  degrees of freedom
Residual deviance: 116.96  on 98  degrees of freedom
AIC: 120.96

Number of Fisher Scoring iterations: 4

If you are interested, see: How to Run and Interpret a Logistic Regression Model in R.
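As an optional follow-up, we can store the model and use predict() to obtain predicted probabilities; with the coding above, glm models the probability of the second factor level, which is 'low' here since the levels are ordered alphabetically (the x values below are arbitrary, chosen only for illustration):

# refit and store the logistic regression model
logit_model = glm(y_binary ~ x, family = 'binomial')

# predicted probability that y is 'low' for two example values of x
predict(logit_model, newdata = data.frame(x = c(0.2, 0.8)), type = "response")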

Further reading