# Stepwise (Linear & Logistic) Regression in R

Let’s start by creating some data:

```r
# create a dataset containing 10 numeric variables X1 to X10
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
```

To run a stepwise regression, use the stepAIC function from the MASS package.

## 1. How to run forward stepwise linear regression

```r
library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start from the intercept-only model
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
lm(formula = X1 ~ X4 + X3 + X7, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-0.52407 -0.23122 -0.00609  0.19931  0.42532

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X4          -0.19252    0.09173  -2.099   0.0385 *
X3           0.15369    0.09456   1.625   0.1074
X7           0.14837    0.09495   1.563   0.1214
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,	Adjusted R-squared:  0.05042
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682
```

Forward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.
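If you want to work with the selected model programmatically rather than just print its summary, save the object returned by stepAIC and inspect it. A minimal, self-contained sketch (it re-creates the same simulated data; the variable names are illustrative):

```r
library(MASS)

# re-create the simulated data
set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # intercept-only model

# keep the selected model instead of only printing its summary
selectedModel = stepAIC(nullModel,
                        direction = 'forward',
                        scope = list(upper = fullModel, lower = nullModel),
                        trace = 0)

formula(selectedModel) # the formula of the final model: X1 ~ X4 + X3 + X7
names(coef(selectedModel))[-1] # the retained predictors, without the intercept
```

The returned object is an ordinary lm fit, so anything that works on lm (predict, residuals, AIC, etc.) works on it too.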

## 2. How to run backward stepwise linear regression

Here we can use the same code as for forward selection, but we need to change 2 things:

1. Start the selection from the full model instead of the null model
2. Change the direction from forward to backward
```r
library(MASS)

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start from the model with all variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
lm(formula = X1 ~ X3 + X4 + X7, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-0.52407 -0.23122 -0.00609  0.19931  0.42532

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.47077    0.07644   6.158 1.71e-08 ***
X3           0.15369    0.09456   1.625   0.1074
X4          -0.19252    0.09173  -2.099   0.0385 *
X7           0.14837    0.09495   1.563   0.1214
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2608 on 96 degrees of freedom
Multiple R-squared:  0.07919,	Adjusted R-squared:  0.05042
F-statistic: 2.752 on 3 and 96 DF,  p-value: 0.04682
```

Backward stepwise regression only kept 3 variables in the final model: X3, X4, and X7.
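Besides 'forward' and 'backward', stepAIC also accepts direction = 'both', which lets a variable enter or leave the model at each step. A short sketch on the same simulated data (self-contained, so it rebuilds the dataset):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # intercept-only model

# bidirectional selection: at each step a variable can be added or dropped
bothModel = stepAIC(nullModel,
                    direction = 'both',
                    scope = list(upper = fullModel, lower = nullModel),
                    trace = 0)

formula(bothModel) # the formula of the final model
```

On this dataset the result matches forward selection, but on data with correlated predictors bidirectional selection can reach a model that pure forward or pure backward selection would miss.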

## 3. How to run forward stepwise logistic regression

Since our data contains only numeric variables, we will transform X1 to binary:

```r
# transform X1 to a binary variable
dat$X1 = dat$X1 > 0.5
```
```r
library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(nullModel, # start from the intercept-only model
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
glm(formula = X1 ~ X4 + X3, family = "binomial", data = dat)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.740  -1.127  -0.769   1.152   1.644

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.08439    0.49725   0.170   0.8652
X4          -1.60056    0.74724  -2.142   0.0322 *
X3           1.46221    0.76637   1.908   0.0564 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4
```

Forward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.

## 4. How to run backward stepwise logistic regression

Here we can use the same code as for forward selection, but we need to change 2 things:

1. Start the selection from the full model instead of the null model
2. Change the direction from forward to backward
```r
library(MASS)

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # model with the intercept only

summary(stepAIC(fullModel, # start from the model with all variables
                direction = 'backward', # run backward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = nullModel), # the minimum to consider is a model with no variables
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
glm(formula = X1 ~ X3 + X4, family = "binomial", data = dat)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.740  -1.127  -0.769   1.152   1.644

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.08439    0.49725   0.170   0.8652
X3           1.46221    0.76637   1.908   0.0564 .
X4          -1.60056    0.74724  -2.142   0.0322 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 130.78  on 97  degrees of freedom
AIC: 136.78

Number of Fisher Scoring iterations: 4
```

Backward stepwise logistic regression only kept 2 variables in the final model: X3 and X4.
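The object returned by stepAIC is an ordinary glm fit, so you can use it for prediction just like any logistic regression. A self-contained sketch that computes predicted probabilities from the selected model (variable names are illustrative):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))
dat$X1 = dat$X1 > 0.5 # binary outcome, as above

fullModel = glm(X1 ~ ., family = 'binomial', data = dat) # model with all 9 variables
nullModel = glm(X1 ~ 1, family = 'binomial', data = dat) # intercept-only model

logisticModel = stepAIC(nullModel,
                        direction = 'forward',
                        scope = list(upper = fullModel, lower = nullModel),
                        trace = 0)

# predicted probabilities that X1 is TRUE, one per observation
probs = predict(logisticModel, type = 'response')
summary(probs)
```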

## 5. How to force stepwise regression to keep certain variables in the final model

Sometimes you want to keep an important variable in the final model regardless of its influence (i.e. even if it does not improve the model). This may be because the variable is a confounder that you want to control for, or because it is the main predictor in your study. In either case, you want to force stepwise selection to keep it.

Suppose we want to force the stepwise algorithm to keep the variable X10. Here’s how to do it:

We will take the example of forward stepwise linear regression and replace the null model with a minimum model that contains X10:

```r
fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
minModel = lm(X1 ~ X10, data = dat) # the minimum model contains X10

summary(stepAIC(minModel, # start from the minimum model containing X10
                direction = 'forward', # run forward selection
                scope = list(upper = fullModel, # the maximum to consider is a model with all variables
                             lower = minModel), # the minimum to consider is a model with X10 only
                trace = 0)) # do not show the step-by-step process of model selection
```

Output:

```
Call:
lm(formula = X1 ~ X10 + X4 + X3, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-0.56696 -0.21124 -0.01518  0.22209  0.45376

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.58628    0.07482   7.836  6.3e-12 ***
X10         -0.11068    0.08603  -1.287   0.2013
X4          -0.16710    0.09140  -1.828   0.0706 .
X3           0.15897    0.09492   1.675   0.0972 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2618 on 96 degrees of freedom
Multiple R-squared:  0.07178,	Adjusted R-squared:  0.04277
F-statistic: 2.474 on 3 and 96 DF,  p-value: 0.06617
```

Forward stepwise regression kept 3 variables in the final model: X3 and X4, selected by the algorithm, plus X10, which was forced in through the minimum model.

## 6. How to interpret stepwise regression output

I already covered this topic in detail in the following articles: Understand Forward and Backward Stepwise Regression and How to Report Stepwise Regression. So here I will only mention 3 important things to keep in mind when interpreting the output of stepwise regression:

### 1. The stepAIC function uses the Akaike Information Criterion (AIC) to limit the size of the final model

If you want to use the Bayesian Information Criterion (BIC) instead of AIC, set the argument k inside the stepAIC function to log(n), where n is the sample size.

Here’s how to run forward stepwise linear regression with BIC to set the limit on the size of the final model:

```r
n = nrow(dat) # sample size

# limit the size of the final model according to BIC
summary(stepAIC(nullModel,
                direction = 'forward',
                scope = list(upper = fullModel,
                             lower = nullModel),
                trace = 0,
                k = log(n))) # use BIC instead of AIC
```

Note that BIC is a more restrictive criterion than AIC, and therefore tends to produce a smaller final model.
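You can check this on the simulated data by running forward selection twice, once with each criterion, and comparing how many predictors each final model keeps. A self-contained sketch (the variable names are illustrative):

```r
library(MASS)

set.seed(1)
dat = data.frame(matrix(runif(1000), ncol = 10))

fullModel = lm(X1 ~ ., data = dat) # model with all 9 variables
nullModel = lm(X1 ~ 1, data = dat) # intercept-only model
n = nrow(dat) # sample size

aicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0) # default penalty k = 2, i.e. AIC

bicModel = stepAIC(nullModel, direction = 'forward',
                   scope = list(upper = fullModel, lower = nullModel),
                   trace = 0, k = log(n)) # heavier BIC penalty

length(coef(aicModel)) - 1 # number of predictors kept under AIC
length(coef(bicModel)) - 1 # number of predictors kept under BIC (never more than AIC here)
```

For forward selection the penalty does not change which candidate is best at a given step, only when the algorithm stops, so the BIC model is always a subset of the AIC model.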

### 2. The regression coefficients, confidence intervals, p-values, and R-squared reported by stepwise regression are biased

The output of a stepwise regression cannot be interpreted in the same way as an ordinary linear or logistic regression. Stepwise regression is a good exploratory tool that should not be used for inferential purposes.
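One way to see this bias is to run stepwise selection on pure noise, where the outcome is independent of every predictor: because the same data are used both to choose the variables and to test them, the selected model often still shows seemingly significant coefficients and a nonzero R-squared. A rough simulation sketch (the seed and sample size are arbitrary choices, and the exact result varies with the seed):

```r
library(MASS)

set.seed(42) # arbitrary seed for this illustration
noise = data.frame(matrix(rnorm(1000), ncol = 10)) # outcome and predictors are all independent noise

fullNoise = lm(X1 ~ ., data = noise)
nullNoise = lm(X1 ~ 1, data = noise)

selNoise = stepAIC(nullNoise,
                   direction = 'forward',
                   scope = list(upper = fullNoise, lower = nullNoise),
                   trace = 0)

# any retained predictors are pure false positives, yet their p-values
# and the model's R-squared can look respectable at face value
summary(selNoise)$r.squared
```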

### 3. Stepwise regression is not guaranteed to select the best possible combination of variables

Stepwise selection is a greedy approach that does not consider all possible combinations of available predictors. So do not assume that the final model is the best possible model, nor that it contains the best predictors of the outcome.