In this article, we will cover:
- How to run forward stepwise linear regression
- How to run backward stepwise linear regression
- How to run forward stepwise logistic regression
- How to run backward stepwise logistic regression
- How to force stepwise selection to keep certain variables in the final model
- How to interpret stepwise regression output
Let’s start by creating some data:
# create a dataset containing 10 numeric variables X1 to X10
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
To run a stepwise regression, use the stepAIC function from the MASS package.
1. How to run forward stepwise linear regression
library(MASS)
fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only
summary(stepAIC(nullModel, # start with a model containing no variables
direction = 'forward', # run forward selection
scope = list(upper = fullModel, # the maximum to consider is a model with all variables
lower = nullModel), # the minimum to consider is a model with no variables
trace = 0)) # do not show the step-by-step process of model selection
Output:
Call:
lm(formula = X1 ~ X4 + X3 + X7, data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.52407 -0.23122 -0.00609 0.19931 0.42532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47077 0.07644 6.158 1.71e-08 ***
X4 -0.19252 0.09173 -2.099 0.0385 *
X3 0.15369 0.09456 1.625 0.1074
X7 0.14837 0.09495 1.563 0.1214
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared: 0.07919, Adjusted R-squared: 0.05042
F-statistic: 2.752 on 3 and 96 DF, p-value: 0.04682
Forward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.
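If you want to reuse the selected model afterwards (for predictions or diagnostics), store the result of stepAIC in an object instead of passing it straight to summary. Here is a sketch (the object name `selected` is my own):

```r
library(MASS)

# same data as above
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat)
nullModel = lm(X1 ~ 1, data = dat)

# store the selected model instead of passing it straight to summary()
selected = stepAIC(nullModel,
                   direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0)

names(coef(selected))[-1] # "X4" "X3" "X7": the variables kept
```

The stored object is an ordinary lm fit, so functions like predict, residuals, and plot work on it directly.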
2. How to run backward stepwise linear regression
Here we can reuse the code from forward selection, with 2 changes:
- Start with the full model (instead of the null model)
- Change the direction from forward to backward
library(MASS)
fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only
summary(stepAIC(fullModel, # start with a model containing all the variables
direction = 'backward', # run backward selection
scope = list(upper = fullModel, # the maximum to consider is a model with all variables
lower = nullModel), # the minimum to consider is a model with no variables
trace = 0)) # do not show the step-by-step process of model selection
Output:
Call:
lm(formula = X1 ~ X3 + X4 + X7, data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.52407 -0.23122 -0.00609 0.19931 0.42532
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47077 0.07644 6.158 1.71e-08 ***
X3 0.15369 0.09456 1.625 0.1074
X4 -0.19252 0.09173 -2.099 0.0385 *
X7 0.14837 0.09495 1.563 0.1214
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared: 0.07919, Adjusted R-squared: 0.05042
F-statistic: 2.752 on 3 and 96 DF, p-value: 0.04682
Backward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.
3. How to run forward stepwise logistic regression
Since our data contains only numeric variables, we will transform X1 to binary:
# transform X1 to a binary variable
dat$X1 = dat$X1 > 0.5
library(MASS)
fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only
summary(stepAIC(nullModel, # start with a model containing no variables
direction = 'forward', # run forward selection
scope = list(upper = fullModel, # the maximum to consider is a model with all variables
lower = nullModel), # the minimum to consider is a model with no variables
trace = 0)) # do not show the step-by-step process of model selection
Output:
Call:
glm(formula = X1 ~ X4 + X3, family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.740 -1.127 -0.769 1.152 1.644
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.08439 0.49725 0.170 0.8652
X4 -1.60056 0.74724 -2.142 0.0322 *
X3 1.46221 0.76637 1.908 0.0564 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 130.78 on 97 degrees of freedom
AIC: 136.78
Number of Fisher Scoring iterations: 4
Forward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.
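As with the linear case, you can store the selected logistic model and use it to get predicted probabilities. A sketch (the object names `selected` and `probs` are my own):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
dat$X1 = dat$X1 > 0.5 # binary outcome, as above

fullModel = glm(X1 ~ ., family = 'binomial', data = dat)
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat)

selected = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0)

# predicted probability that X1 is TRUE, one per observation
probs = predict(selected, type = 'response')
range(probs) # all values lie strictly between 0 and 1
```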
4. How to run backward stepwise logistic regression
Here we can reuse the code from forward selection, with 2 changes:
- Start with the full model (instead of the null model)
- Change the direction from forward to backward
library(MASS)
fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only
summary(stepAIC(fullModel, # start with a model containing all variables
direction = 'backward', # run backward selection
scope = list(upper = fullModel, # the maximum to consider is a model with all variables
lower = nullModel), # the minimum to consider is a model with no variables
trace = 0)) # do not show the step-by-step process of model selection
Output:
Call:
glm(formula = X1 ~ X3 + X4, family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.740 -1.127 -0.769 1.152 1.644
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.08439 0.49725 0.170 0.8652
X3 1.46221 0.76637 1.908 0.0564 .
X4 -1.60056 0.74724 -2.142 0.0322 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 130.78 on 97 degrees of freedom
AIC: 136.78
Number of Fisher Scoring iterations: 4
Backward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.
5. How to force stepwise regression to keep certain variables in the final model
Sometimes you want to keep an important variable in the final model regardless of its contribution (i.e. even if it does not improve the model). This may be because the variable is a confounder that you want to control for, or because it is the main predictor in your study. In either case, you want to force stepwise selection to keep it.
Suppose we want to force the stepwise algorithm to keep the variable X10. Here’s how to do it:
We will take the example of forward stepwise linear regression and replace the null model with a minimum model that contains X10:
# re-create the continuous data first, since we overwrote X1 in section 3
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
minModel = lm(X1 ~ X10, data = dat) # the minimum model contains X10
summary(stepAIC(minModel, # start with the minModel instead of nullModel
direction = 'forward', # run forward selection
scope = list(upper = fullModel, # the maximum to consider is a model with all variables
lower = minModel), # the minimum to consider is a model with X10 only
trace = 0)) # do not show the step-by-step process of model selection
Output:
Call:
lm(formula = X1 ~ X10 + X4 + X3, data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.56696 -0.21124 -0.01518 0.22209 0.45376
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.58628 0.07482 7.836 6.3e-12 ***
X10 -0.11068 0.08603 -1.287 0.2013
X4 -0.16710 0.09140 -1.828 0.0706 .
X3 0.15897 0.09492 1.675 0.0972 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2618 on 96 degrees of freedom
Multiple R-squared: 0.07178, Adjusted R-squared: 0.04277
F-statistic: 2.474 on 3 and 96 DF, p-value: 0.06617
Forward stepwise regression only kept 3 variables in the final model: X3, X4, and X10.
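You can also check programmatically that the forced variable survived the selection. A sketch (the object name `selected` is my own):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat)
minModel = lm(X1 ~ X10, data = dat)

selected = stepAIC(minModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = minModel),
                   trace = 0)

'X10' %in% names(coef(selected)) # TRUE: the forced variable stays
```

Because minModel is the lower bound of the scope, stepAIC is never allowed to drop X10, so this check returns TRUE by construction.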
6. How to interpret stepwise regression output
I already covered this topic in detail in the following articles: Understand Forward and Backward Stepwise Regression and How to Report Stepwise Regression. So here I will only mention 3 important things to keep in mind when interpreting the output of stepwise regression:
1. The function stepAIC uses the Akaike Information Criterion (AIC) to limit the size of the final model
If you want to use the Bayesian Information Criterion (BIC) instead of AIC, set the argument k of the stepAIC function to log(n), where n is the sample size.
Here’s how to run forward stepwise linear regression with BIC to set the limit on the size of the final model:
n = nrow(dat) # sample size
# limit the size of the final model according to BIC
summary(stepAIC(nullModel,
direction = 'forward',
scope = list(upper = fullModel,
lower = nullModel),
trace = 0,
k = log(n))) # use BIC instead of AIC
Note that BIC is a more restrictive criterion than AIC, and therefore tends to produce a smaller final model.
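You can verify this on our simulated data by running forward selection under both criteria and counting the variables kept. Since a larger penalty only makes forward selection stop earlier along the same path, the BIC model can never be larger than the AIC model. A sketch (the object names `aicModel` and `bicModel` are my own):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat)
nullModel = lm(X1 ~ 1, data = dat)
n = nrow(dat)

# forward selection under AIC (penalty k = 2, the default) ...
aicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0)

# ... and under BIC (penalty k = log(n))
bicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0, k = log(n))

length(coef(aicModel)) - 1 # number of variables kept under AIC
length(coef(bicModel)) - 1 # number kept under BIC (never more)
```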
2. The regression coefficients, confidence intervals, p-values, and R-squared reported by stepwise regression are biased
The output of a stepwise regression cannot be interpreted in the same way as that of an ordinary linear or logistic regression: because the same data were used both to select the variables and to estimate their coefficients, the p-values come out too small and the effect sizes too large. Stepwise regression is a good exploratory tool, but it should not be used for inferential purposes.
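One common workaround is data splitting: run the selection on one half of the data, then refit the chosen formula on the held-out half, so that the reported coefficients and p-values are not distorted by the selection step. A sketch (the split and the object names `selHalf`, `infHalf`, `sel`, and `refit` are my own):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

# split: one half for selection, the other half for inference
idx = sample(nrow(dat), nrow(dat) / 2)
selHalf = dat[idx, ]
infHalf = dat[-idx, ]

# run stepwise selection on the first half only
sel = stepAIC(lm(X1 ~ 1, data = selHalf),
              direction = 'forward',
              scope = list(upper = lm(X1 ~ ., data = selHalf),
                           lower = lm(X1 ~ 1, data = selHalf)),
              trace = 0)

# refit the selected formula on the held-out half:
# these p-values were not distorted by the selection step
refit = lm(formula(sel), data = infHalf)
summary(refit)
```

The price is that both the selection and the final fit use only half of the data, so this is most useful when the sample size is comfortable.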
3. Stepwise regression is not guaranteed to select the best possible combination of variables
Stepwise selection is a greedy approach that does not consider all possible combinations of available predictors. So do not assume that the final model is the best possible model, nor that it contains the best predictors of the outcome.
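With only 9 candidate predictors, you can see what "greedy" means concretely by brute-forcing all 2^9 = 512 subsets and comparing the overall AIC winner against what stepAIC picked. This loop is my own sketch, not part of stepAIC; it uses stats::AIC, which ranks models the same way as the criterion stepAIC minimizes:

```r
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

vars = paste0('X', 2:10)

# all 2^9 = 512 subsets of the 9 predictors
subsets = expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))

# AIC of the linear model fitted on each subset
aics = apply(subsets, 1, function(keep) {
  rhs = if (any(keep)) paste(vars[keep], collapse = ' + ') else '1'
  AIC(lm(as.formula(paste('X1 ~', rhs)), data = dat))
})

# the best subset overall; compare it with what stepAIC selected
vars[unlist(subsets[which.min(aics), ])]
```

Exhaustive search like this quickly becomes infeasible (20 predictors already mean over a million subsets), which is why greedy stepwise procedures exist in the first place.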