In this article, we will cover:

- How to run forward stepwise linear regression
- How to run backward stepwise linear regression
- How to run forward stepwise logistic regression
- How to run backward stepwise logistic regression
- How to force stepwise selection to keep certain variables in the final model
- How to interpret stepwise regression output

Let’s start by creating some data:

```r
# create a dataset containing 10 numeric variables X1 to X10
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
```

To run a stepwise regression, use the *stepAIC* function from the *MASS* package.

## 1. How to run forward stepwise linear regression

```r
library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start with a model containing no variables
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

**Output:**

```
Call:
lm(formula = X1 ~ X4 + X3 + X7, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52407 -0.23122 -0.00609  0.19931  0.42532 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X4          -0.19252    0.09173  -2.099   0.0385 *  
X3           0.15369    0.09456   1.625   0.1074    
X7           0.14837    0.09495   1.563   0.1214    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,   Adjusted R-squared:  0.05042 
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682
```

Forward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.
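If you want to do more with the selected model than print its summary, you can store the result of *stepAIC* and inspect it with the usual accessor functions. A small sketch, reusing the *nullModel* and *fullModel* defined above:

```r
library(MASS)

# store the selected model instead of printing its summary directly
selectedModel = stepAIC(nullModel,
                        direction = 'forward',
                        scope = list(upper = fullModel, lower = nullModel),
                        trace = 0)

formula(selectedModel) # formula of the final model
coef(selectedModel)    # estimated coefficients
AIC(selectedModel)     # AIC of the final model
```

The returned object is an ordinary *lm* object, so functions such as *predict* and *residuals* work on it as well.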

## 2. How to run backward stepwise linear regression

Here we can use the same code as for forward selection, but we should change 2 things:

- Start with the full model (instead of the null model)
- Change the *direction* from *forward* to *backward*

```r
library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start with a model containing all the variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

**Output:**

```
Call:
lm(formula = X1 ~ X3 + X4 + X7, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52407 -0.23122 -0.00609  0.19931  0.42532 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X3           0.15369    0.09456   1.625   0.1074    
X4          -0.19252    0.09173  -2.099   0.0385 *  
X7           0.14837    0.09495   1.563   0.1214    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,   Adjusted R-squared:  0.05042 
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682
```

Backward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.

## 3. How to run forward stepwise logistic regression

Since our data contains only numeric variables, we will transform X1 to binary:

```r
# transform X1 to a binary variable
dat$X1 = dat$X1 > 0.5
```

```r
library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start with a model containing no variables
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

**Output:**

```
Call:
glm(formula = X1 ~ X4 + X3, family = "binomial", data = dat)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.740  -1.127  -0.769   1.152   1.644  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.08439    0.49725   0.170   0.8652  
X4          -1.60056    0.74724  -2.142   0.0322 *
X3           1.46221    0.76637   1.908   0.0564 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4
```

Forward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.

## 4. How to run backward stepwise logistic regression

Here we can use the same code as for forward selection, but we should change 2 things:

- Start with the full model (instead of the null model)
- Change the *direction* from *forward* to *backward*

```r
library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start with a model containing all variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

**Output:**

```
Call:
glm(formula = X1 ~ X3 + X4, family = "binomial", data = dat)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.740  -1.127  -0.769   1.152   1.644  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.08439    0.49725   0.170   0.8652  
X3           1.46221    0.76637   1.908   0.0564 .
X4          -1.60056    0.74724  -2.142   0.0322 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4
```

Backward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.

## 5. How to force stepwise regression to keep certain variables in the final model

Sometimes you want to keep an important variable in the final model regardless of its influence (i.e. even if it does not improve the model). This may be because the variable is a confounder that you want to control for, or because it is the main predictor in your study. In either case, you want to force stepwise selection to keep it.

Suppose we want to force the stepwise algorithm to keep the variable X10. Here’s how to do it:

We will take the example of forward stepwise linear regression and replace the null model with a minimum model that contains X10:

```r
fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
minModel = lm(X1 ~ X10, data = dat) # the minimum model contains X10

summary(stepAIC(minModel, # start with the minModel instead of nullModel
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = minModel), # the minimum to consider is a model with X10 only
                trace = 0)) # do not show the step-by-step process of model selection
```

**Output:**

```
Call:
lm(formula = X1 ~ X10 + X4 + X3, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56696 -0.21124 -0.01518  0.22209  0.45376 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.58628    0.07482   7.836  6.3e-12 ***
X10         -0.11068    0.08603  -1.287   0.2013    
X4          -0.16710    0.09140  -1.828   0.0706 .  
X3           0.15897    0.09492   1.675   0.0972 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2618 on 96 degrees of freedom
Multiple R-squared:  0.07178,   Adjusted R-squared:  0.04277 
F-statistic: 2.474 on 3 and 96 DF,  p-value: 0.06617
```

Forward stepwise regression only kept 3 variables in the final model: X3, X4, and X10.

## 6. How to interpret stepwise regression output

I already covered this topic in detail in the following articles: Understand Forward and Backward Stepwise Regression and How to Report Stepwise Regression. So here I will only mention 3 important things to keep in mind when interpreting the output of a stepwise regression:

### 1. The function *stepAIC* uses the Akaike Information Criterion (AIC) to limit the size of the final model

If you want to use the Bayesian Information Criterion (BIC) instead of AIC, set the argument *k* of the *stepAIC* function to *log(n)*, where *n* is the sample size.

Here’s how to run forward stepwise linear regression with BIC as the criterion that limits the size of the final model (reusing *nullModel* and *fullModel* from section 1):

```r
n = nrow(dat) # sample size

# limit the size of the final model according to BIC
summary(stepAIC(nullModel,
                direction = 'forward',
                scope = list(upper = fullModel,
                             lower = nullModel),
                trace = 0,
                k = log(n))) # use BIC instead of AIC
```

Note that BIC is a more restrictive criterion than AIC, and therefore tends to produce a smaller final model.
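To see this in practice, you can (for example) run the same forward selection with both criteria and compare how many predictors each one keeps. A sketch, assuming *nullModel* and *fullModel* are the linear models from section 1:

```r
library(MASS)

n = nrow(dat) # sample size

aicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0)             # default penalty k = 2, i.e. AIC
bicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0, k = log(n)) # penalty k = log(n), i.e. BIC

# number of predictors kept by each criterion (excluding the intercept)
length(coef(aicModel)) - 1
length(coef(bicModel)) - 1
```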

### 2. The regression coefficients, confidence intervals, p-values, and R-squared reported by stepwise regression are biased

The output of a stepwise regression cannot be interpreted in the same way as that of an ordinary linear or logistic regression: because the same data were used both to select the variables and to estimate their coefficients, the estimates tend to be biased away from zero and the p-values too small. Stepwise regression is a good exploratory tool, but it should not be used for inferential purposes.

### 3. Stepwise regression is not guaranteed to select the best possible combination of variables

Stepwise selection is a greedy approach that does not consider all possible combinations of available predictors. So do not assume that the final model is the best possible model, nor that it contains the best predictors of the outcome.
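With only 9 candidate predictors there are just 2^9 = 512 possible models, so you can check this claim yourself with an exhaustive search. A sketch using the linear-regression setup from section 1 (i.e. before X1 was transformed to binary), comparing every subset by AIC:

```r
# fit all 2^9 = 512 possible models and find the one with the lowest AIC
predictors = paste0('X', 2:10)
subsets = expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))

aics = apply(subsets, 1, function(keep) {
  rhs = if (any(keep)) paste(predictors[keep], collapse = ' + ') else '1'
  AIC(lm(as.formula(paste('X1 ~', rhs)), data = dat))
})

# the predictors in the best model found by exhaustive search
predictors[unlist(subsets[which.min(aics), ])]
```

If the exhaustive search returns a different subset than stepwise selection, that directly demonstrates that the greedy path missed the AIC-optimal model (with many more predictors, however, exhaustive search quickly becomes infeasible).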