How to Run a Linear Regression in R tidymodels

In this article, we are going to use the iris dataset available in R to build a linear regression model using the tidymodels package.

# data set
head(iris)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#1            5.1         3.5          1.4         0.2     setosa
#2            4.9         3.0          1.4         0.2     setosa
#3            4.7         3.2          1.3         0.2     setosa
#4            4.6         3.1          1.5         0.2     setosa
#5            5.0         3.6          1.4         0.2     setosa

Building the model

In order to fit a linear regression model in tidymodels, we need to do 4 things:

  1. Specify which model we are going to use: in this case, a linear regression using lm
  2. Describe how we want to prepare the data before feeding it to the model: here we will tell R what the recipe is (in this specific example, we won’t do variable transformations, so we only need to specify the role of each variable using a formula: y ~ x1 + x2 + …)
  3. Create a workflow object that combines the model with the recipe.
  4. Fit the workflow object to the data: steps 1 to 3 are only to specify how we will work, they won’t do anything to the data. So we need this final step to explicitly tell tidymodels to fit the model to the data.
library(tidymodels)

#1. specify model type
model <- linear_reg() |> 
  set_engine('lm')

#2. specify the role of each variable
rec <- recipe(
  Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
  data = iris
)

#3. combine the 2 in a workflow object
wkflow <- workflow() |> 
  add_model(model) |> 
  add_recipe(rec)

#4. fit the model to the data
model_fit <- wkflow |> 
  fit(data = iris)

Checking linear regression assumptions

After fitting the model, we should check whether the assumptions of linear regression are met. For this we will use the performance package.

# check assumptions of linear regression
library(performance)

model_fit |> 
  extract_fit_engine() |> 
  check_model()

Output:

Checking the assumptions of the first linear regression model

We see that the high correlation between Petal.Length and Petal.Width is causing a collinearity problem. So, we will remove Petal.Width from the recipe and re-run the model.

Refining the model

# removing Petal.Width from the recipe
rec <- recipe(
  Sepal.Length ~ Sepal.Width + Petal.Length,
  data = iris
)

# update the workflow
wkflow <- workflow() |> 
  add_model(model) |> 
  add_recipe(rec)

# re-fit the model to the data
model_fit <- wkflow |> 
  fit(data = iris)

# re-check assumptions
model_fit |> 
  extract_fit_engine() |> 
  check_model()

Output:

Everything is looking good now!

Next, we will look at the model output.

Examining the model output

Getting the regression coefficients and p-values

# get coefficients and p-values
model_fit |> tidy()
## A tibble: 3 x 5
#  term         estimate std.error statistic  p.value
#  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#1 (Intercept)     2.25     0.248       9.07 7.04e-16
#2 Sepal.Width     0.596    0.0693      8.59 1.16e-14
#3 Petal.Length    0.472    0.0171     27.6  5.85e-60

Checking the model fit

# check model performance
model_fit |> glance()
## A tibble: 1 x 12
#  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual  nobs
#      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
#1     0.840         0.838 0.333      386. 2.93e-59     2  -46.5  101.  113.     16.3         147   150

Using the model to predict the outcome

Finally, we can calculate the predicted values for Sepal.Length.

# use the model to predict the outcome
model_fit |> 
  predict(new_data = iris)
## A tibble: 150 x 1
#   .pred
#   <dbl>
# 1  4.99
# 2  4.70
# 3  4.77
# 4  4.80
# 5  5.05
# 6  5.37
# 7  4.93
# 8  4.98
# 9  4.64
#10  4.80
## ... with 140 more rows
## i Use `print(n = ...)` to see more rows

The predict function is mostly useful to predict new/test data to check the out-of-sample model accuracy.

Further reading