# Stepwise (Linear & Logistic) Regression in R

Let’s start by creating some data:

```r
# create a dataset containing 10 numeric variables X1 to X10
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
```

To run a stepwise regression, use the stepAIC function from the MASS package.

## 1. How to run forward stepwise linear regression

```r
library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start from the intercept-only model
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
lm(formula = X1 ~ X4 + X3 + X7, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-0.52407 -0.23122 -0.00609  0.19931  0.42532

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X4          -0.19252    0.09173  -2.099   0.0385 *
X3           0.15369    0.09456   1.625   0.1074
X7           0.14837    0.09495   1.563   0.1214
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,	Adjusted R-squared:  0.05042
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682
```

Forward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.
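If you want to work with the selected model programmatically rather than just print its summary, save the object returned by stepAIC and inspect it. A minimal, self-contained sketch (it re-creates the same simulated data; the variable names are illustrative):

```r
library(MASS)

# re-create the simulated data
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # intercept-only model

# keep the selected model instead of only printing its summary
selectedModel = stepAIC(nullModel,
                        direction = 'forward',
                        scope = list(upper = fullModel, lower = nullModel),
                        trace = 0)

formula(selectedModel) # the formula of the final model: X1 ~ X4 + X3 + X7
names(coef(selectedModel))[-1] # the retained predictors, without the intercept
```

The returned object is an ordinary lm fit, so anything that works on lm (predict, residuals, AIC, etc.) works on it too.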

## 2. How to run backward stepwise linear regression

Here we can use the same code as for forward selection, but we need to change 2 things:

1. Start the selection from the full model instead of the null model
2. Change the direction from forward to backward
```r
library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start from the model with all variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
lm(formula = X1 ~ X3 + X4 + X7, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-0.52407 -0.23122 -0.00609  0.19931  0.42532

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X3           0.15369    0.09456   1.625   0.1074
X4          -0.19252    0.09173  -2.099   0.0385 *
X7           0.14837    0.09495   1.563   0.1214
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,	Adjusted R-squared:  0.05042
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682
```

Backward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.
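Besides 'forward' and 'backward', stepAIC also accepts direction = 'both', which lets a variable enter or leave the model at each step. A short sketch on the same simulated data (self-contained, so it rebuilds the dataset):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # intercept-only model

# bidirectional selection: at each step a variable can be added or dropped
bothModel = stepAIC(nullModel,
                    direction = 'both',
                    scope = list(upper = fullModel, lower = nullModel),
                    trace = 0)

formula(bothModel) # the formula of the final model
```

On this dataset the result matches forward selection, but on data with correlated predictors bidirectional selection can reach a model that pure forward or pure backward selection would miss.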

## 3. How to run forward stepwise logistic regression

Since our data contains only numeric variables, we will transform X1 to binary:

```r
# transform X1 to a binary variable
dat$X1 = dat$X1 > 0.5
```
```r
library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start from the intercept-only model
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
glm(formula = X1 ~ X4 + X3, family = "binomial", data = dat)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.740  -1.127  -0.769   1.152   1.644

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.08439    0.49725   0.170   0.8652
X4          -1.60056    0.74724  -2.142   0.0322 *
X3           1.46221    0.76637   1.908   0.0564 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4
```

Forward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.

## 4. How to run backward stepwise logistic regression

Here we can use the same code as for forward selection, but we need to change 2 things:

1. Start the selection from the full model instead of the null model
2. Change the direction from forward to backward
```r
library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start from the model with all variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
glm(formula = X1 ~ X3 + X4, family = "binomial", data = dat)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.740  -1.127  -0.769   1.152   1.644

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.08439    0.49725   0.170   0.8652
X3           1.46221    0.76637   1.908   0.0564 .
X4          -1.60056    0.74724  -2.142   0.0322 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4
```

Backward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.
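The object returned by stepAIC is an ordinary glm fit, so you can use it for prediction just like any logistic regression. A self-contained sketch that computes predicted probabilities from the selected model (variable names are illustrative):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
dat$X1 = dat$X1 > 0.5 # binary outcome, as above

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # intercept-only model

logisticModel = stepAIC(nullModel,
                        direction = 'forward',
                        scope = list(upper = fullModel, lower = nullModel),
                        trace = 0)

# predicted probabilities that X1 is TRUE, one per observation
probs = predict(logisticModel, type = 'response')
summary(probs)
```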

## 5. How to force stepwise regression to keep certain variables in the final model

Sometimes you want to keep an important variable in the final model regardless of its influence (i.e. even if it does not improve the model). This may be because the variable is a confounder that you want to control for, or because it is the main predictor in your study. In either case, you want to force stepwise selection to keep it.

Suppose we want to force the stepwise algorithm to keep the variable X10. Here’s how to do it:

We will take the example of forward stepwise linear regression and replace the null model with a minimum model that contains X10:

```r
fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
minModel = lm(X1 ~ X10, data = dat) # the minimum model contains X10

summary(stepAIC(minModel, # start from the minimum model containing X10
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = minModel), # the minimum to consider is a model with X10 only
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
lm(formula = X1 ~ X10 + X4 + X3, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-0.56696 -0.21124 -0.01518  0.22209  0.45376

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.58628    0.07482   7.836  6.3e-12 ***
X10         -0.11068    0.08603  -1.287   0.2013
X4          -0.16710    0.09140  -1.828   0.0706 .
X3           0.15897    0.09492   1.675   0.0972 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2618 on 96 degrees of freedom
Multiple R-squared:  0.07178,	Adjusted R-squared:  0.04277
F-statistic: 2.474 on 3 and 96 DF,  p-value: 0.06617
```

Forward stepwise regression kept 3 variables in the final model: X3 and X4, selected by the algorithm, plus X10, which was forced in through the minimum model.

## 6. How to interpret stepwise regression output

I already covered this topic in detail in the following articles: Understand Forward and Backward Stepwise Regression and How to Report Stepwise Regression. So here I will only mention 3 important things to keep in mind when interpreting the output of stepwise regression:

### 1. The stepAIC function uses the Akaike Information Criterion (AIC) to limit the size of the final model

If you want to use the Bayesian Information Criterion (BIC) instead of AIC, set the argument k inside the stepAIC function to log(n), where n is the sample size.

Here’s how to run forward stepwise linear regression with BIC to set the limit on the size of the final model:

```r
n = nrow(dat) # sample size

# limit the size of the final model according to BIC
summary(stepAIC(nullModel,
                direction = 'forward',
                scope = list(upper = fullModel,
                             lower = nullModel),
                trace = 0,
                k = log(n))) # use BIC instead of AIC
```

Note that BIC is a more restrictive criterion than AIC, and therefore tends to produce a smaller final model.
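You can check this on the simulated data by running forward selection twice, once with each criterion, and comparing how many predictors each final model keeps. A self-contained sketch (the variable names are illustrative):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # intercept-only model
n = nrow(dat) # sample size

aicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0) # default penalty k = 2, i.e. AIC

bicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0, k = log(n)) # heavier BIC penalty

length(coef(aicModel)) - 1 # number of predictors kept under AIC
length(coef(bicModel)) - 1 # number of predictors kept under BIC (never more than AIC here)
```

For forward selection the penalty does not change which candidate is best at a given step, only when the algorithm stops, so the BIC model is always a subset of the AIC model.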

### 2. The regression coefficients, confidence intervals, p-values, and R-squared reported by stepwise regression are biased

The output of a stepwise regression cannot be interpreted in the same way as an ordinary linear or logistic regression. Stepwise regression is a good exploratory tool that should not be used for inferential purposes.
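One way to see this bias is to run stepwise selection on pure noise, where the outcome is independent of every predictor: because the same data are used both to choose the variables and to test them, the selected model often still shows seemingly significant coefficients and a nonzero R-squared. A rough simulation sketch (the seed and sample size are arbitrary choices, and the exact result varies with the seed):

```r
library(MASS)

set.seed(42) # arbitrary seed for this illustration
noise = data.frame(matrix(rnorm(1000), ncol = 10)) # outcome and predictors are all independent noise

fullNoise = lm(X1 ~ ., data = noise)
nullNoise = lm(X1 ~ 1, data = noise)

selNoise = stepAIC(nullNoise,
                   direction = 'forward',
                   scope = list(upper = fullNoise, lower = nullNoise),
                   trace = 0)

# any retained predictors are pure false positives, yet their p-values
# and the model's R-squared can look respectable at face value
summary(selNoise)$r.squared
```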

### 3. Stepwise regression is not guaranteed to select the best possible combination of variables

Stepwise selection is a greedy approach that does not consider all possible combinations of available predictors. So do not assume that the final model is the best possible model, nor that it contains the best predictors of the outcome.