Stepwise (Linear & Logistic) Regression in R

In this article, we will cover:

  1. How to run forward stepwise linear regression
  2. How to run backward stepwise linear regression
  3. How to run forward stepwise logistic regression
  4. How to run backward stepwise logistic regression
  5. How to force stepwise selection to keep certain variables in the final model
  6. How to interpret stepwise regression output

Let’s start by creating some data:

# create a dataset of 100 observations on 10 numeric variables X1 to X10
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

To run a stepwise regression, use the stepAIC function from the MASS package.
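As a side note, base R's step function (from the built-in stats package) accepts the same direction, scope, and trace arguments and should give the same selection for the linear and logistic models used here, so MASS is optional for these examples. A minimal sketch on the same simulated data:

```r
# base R alternative to MASS::stepAIC for lm/glm models
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

nullModel = lm(X1 ~ 1, data = dat)
fullModel = lm(X1 ~ ., data = dat)

selectedModel = step(nullModel,
                     direction = 'forward',
                     scope = list(upper = fullModel, lower = nullModel),
                     trace = 0)
summary(selectedModel)
```
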

1. How to run forward stepwise linear regression

library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start with a model containing no variables
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection

Output:

Call:
lm(formula = X1 ~ X4 + X3 + X7, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52407 -0.23122 -0.00609  0.19931  0.42532 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X4          -0.19252    0.09173  -2.099   0.0385 *  
X3           0.15369    0.09456   1.625   0.1074    
X7           0.14837    0.09495   1.563   0.1214    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,	Adjusted R-squared:  0.05042 
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682

Forward stepwise regression kept only 3 of the 9 candidate variables in the final model: X3, X4, and X7.

2. How to run backward stepwise linear regression

Here we can use the same code as for forward selection, but we should change 2 things:

  1. Start with the full model (instead of the null model)
  2. Change the direction from forward to backward

library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start with a model containing all the variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection

Output:

Call:
lm(formula = X1 ~ X3 + X4 + X7, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52407 -0.23122 -0.00609  0.19931  0.42532 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X3           0.15369    0.09456   1.625   0.1074    
X4          -0.19252    0.09173  -2.099   0.0385 *  
X7           0.14837    0.09495   1.563   0.1214    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,	Adjusted R-squared:  0.05042 
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682

Backward stepwise regression kept only 3 of the 9 candidate variables in the final model: X3, X4, and X7.

3. How to run forward stepwise logistic regression

Since our dataset contains only numeric variables, we will first transform X1 into a binary variable:

# transform X1 to a binary variable
dat$X1 = dat$X1 > 0.5

library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start with a model containing no variables
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection

Output:

Call:
glm(formula = X1 ~ X4 + X3, family = "binomial", data = dat)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.740  -1.127  -0.769   1.152   1.644  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.08439    0.49725   0.170   0.8652  
X4          -1.60056    0.74724  -2.142   0.0322 *
X3           1.46221    0.76637   1.908   0.0564 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4

Forward stepwise logistic regression kept only 2 variables in the final model: X3 and X4.

4. How to run backward stepwise logistic regression

Here we can use the same code as for forward selection, but we should change 2 things:

  1. Start with the full model (instead of the null model)
  2. Change the direction from forward to backward

library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start with a model containing all variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection

Output:

Call:
glm(formula = X1 ~ X3 + X4, family = "binomial", data = dat)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.740  -1.127  -0.769   1.152   1.644  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.08439    0.49725   0.170   0.8652  
X3           1.46221    0.76637   1.908   0.0564 .
X4          -1.60056    0.74724  -2.142   0.0322 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4

Backward stepwise logistic regression kept only 2 variables in the final model: X3 and X4.
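If you want the selected variables programmatically rather than reading them off the summary() printout, the object returned by stepAIC is an ordinary fitted model, and it also carries an anova component recording the steps taken. A short sketch (re-creating the binary outcome used above):

```r
# extract the selected variables from a stepAIC result
library(MASS)
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
dat$X1 = dat$X1 > 0.5  # binary outcome, as in the logistic examples

fullModel = glm(X1 ~ ., family = 'binomial', data = dat)
finalModel = stepAIC(fullModel, direction = 'backward', trace = 0)

formula(finalModel)          # formula of the final model
names(coef(finalModel))[-1]  # names of the selected predictors
finalModel$anova             # the sequence of steps stepAIC took
```
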

5. How to force stepwise regression to keep certain variables in the final model

Sometimes you want to keep an important variable in the final model regardless of its contribution (i.e. even if it does not improve the model's fit). This may be because the variable is a confounder that you want to control for, or because it is the main predictor in your study. In either case, you want to force stepwise selection to keep it.

Suppose we want to force the stepwise algorithm to keep the variable X10. Taking forward stepwise linear regression as an example, we replace the null model with a minimum model that contains X10 (note: if you followed along with the logistic examples, re-run the data-creation code first so that X1 is continuous again):

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
minModel = lm(X1 ~ X10, data = dat) # the minimum model contains X10

summary(stepAIC(minModel, # start with the minModel instead of nullModel
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = minModel), # the minimum to consider is a model with X10 only
                trace = 0)) # do not show the step-by-step process of model selection

Output:

Call:
lm(formula = X1 ~ X10 + X4 + X3, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56696 -0.21124 -0.01518  0.22209  0.45376 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.58628    0.07482   7.836  6.3e-12 ***
X10         -0.11068    0.08603  -1.287   0.2013    
X4          -0.16710    0.09140  -1.828   0.0706 .  
X3           0.15897    0.09492   1.675   0.0972 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2618 on 96 degrees of freedom
Multiple R-squared:  0.07178,	Adjusted R-squared:  0.04277 
F-statistic: 2.474 on 3 and 96 DF,  p-value: 0.06617

Forward stepwise regression kept only 3 variables in the final model: X3, X4, and X10. Notice that X10 stayed in the model even though it is not statistically significant (p = 0.20): it was forced in.

6. How to interpret stepwise regression output

I already covered this topic in detail in the following articles: Understand Forward and Backward Stepwise Regression and How to Report Stepwise Regression. So here I will only mention 3 important things to keep in mind when interpreting the output of stepwise regression:

1. The function stepAIC uses the Akaike Information Criterion (AIC) to limit the size of the final model

If you want to use the Bayesian Information Criterion (BIC) instead of AIC, set the argument k inside the stepAIC function to log(n), where n is the sample size.

Here’s how to run forward stepwise linear regression with BIC to set the limit on the size of the final model:

n = nrow(dat) # sample size

# limit the size of the final model according to BIC
summary(stepAIC(nullModel,
                direction = 'forward',
                scope = list(upper = fullModel,
                             lower = nullModel),
                trace = 0,
                k = log(n))) # use BIC instead of AIC

Note that BIC is a more restrictive criterion than AIC, and it therefore tends to produce a smaller final model.
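You can see this directly by comparing the size of the model selected under each criterion on the article's simulated data. In forward selection, a larger penalty can only stop the search earlier, never later, so the BIC model is never larger than the AIC model:

```r
# compare the model sizes selected under AIC vs. BIC
library(MASS)
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

nullModel = lm(X1 ~ 1, data = dat)
fullModel = lm(X1 ~ ., data = dat)
n = nrow(dat)

aicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0)               # default penalty k = 2 (AIC)
bicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0, k = log(n))   # BIC penalty

length(coef(aicModel)) - 1  # number of predictors kept under AIC
length(coef(bicModel)) - 1  # number of predictors kept under BIC
```
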

2. The regression coefficients, confidence intervals, p-values, and R-squared reported by stepwise regression are biased

The output of a stepwise regression cannot be interpreted in the same way as that of an ordinary linear or logistic regression: because the same data were used both to select the variables and to estimate their effects, the p-values come out too small and the coefficients are biased away from zero. Stepwise regression is a good exploratory tool, but it should not be used for inferential purposes.
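One common safeguard (an illustration, not covered in the articles above) is data splitting: run the selection on one half of the data, then refit the chosen formula on the other half, where the usual p-values are not distorted by the selection step. A sketch:

```r
# select variables on one half of the data, do inference on the other
library(MASS)
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

idx = sample(nrow(dat), nrow(dat) / 2)
selectDat = dat[idx, ]   # half used only for variable selection
inferDat  = dat[-idx, ]  # half used only for inference

nullModel = lm(X1 ~ 1, data = selectDat)
fullModel = lm(X1 ~ ., data = selectDat)

chosen = stepAIC(nullModel, direction = 'forward',
                 scope = list(upper = fullModel, lower = nullModel),
                 trace = 0)

inferFit = lm(formula(chosen), data = inferDat)
summary(inferFit)  # p-values not contaminated by the selection step
```

The cost of this approach is that both the selection and the inference use only half the sample.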

3. Stepwise regression is not guaranteed to select the best possible combination of variables

Stepwise selection is a greedy approach that does not consider all possible combinations of available predictors. So do not assume that the final model is the best possible model, nor that it contains the best predictors of the outcome.
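With only 9 candidate predictors, an exhaustive best-subset search is still feasible and sidesteps the greediness problem. One way to do it (assuming the leaps package, a separate CRAN install, is available) is with its regsubsets function:

```r
# exhaustive best-subset search as an alternative to greedy stepwise
# (assumes install.packages("leaps") has been run)
library(leaps)
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

# evaluate every subset of the 9 candidate predictors
allSubsets = regsubsets(X1 ~ ., data = dat, nvmax = 9)
subsetSummary = summary(allSubsets)

# coefficients of the best subset according to BIC
coef(allSubsets, which.min(subsetSummary$bic))
```
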

Further reading