In this article, we are going to use the iris dataset available in R to build a linear regression model using the tidymodels package.
# data set head(iris) # Sepal.Length Sepal.Width Petal.Length Petal.Width Species #1 5.1 3.5 1.4 0.2 setosa #2 4.9 3.0 1.4 0.2 setosa #3 4.7 3.2 1.3 0.2 setosa #4 4.6 3.1 1.5 0.2 setosa #5 5.0 3.6 1.4 0.2 setosa
Building the model
In order to fit a linear regression model in tidymodels, we need to do 4 things:
- Specify which model we are going to use: in this case, a linear regression using lm
- Describe how we want to prepare the data before feeding it to the model: here we will tell R what the recipe is (in this specific example, we won’t do variable transformations, so we only need to specify the role of each variable using a formula: y ~ x1 + x2 + …)
- Create a workflow object that combines the model with the recipe.
- Fit the workflow object to the data: steps 1 to 3 are only to specify how we will work, they won’t do anything to the data. So we need this final step to explicitly tell tidymodels to fit the model to the data.
library(tidymodels) #1. specify model type model <- linear_reg() |> set_engine('lm') #2. specify the role of each variable rec <- recipe( Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris ) #3. combine the 2 in a workflow object wkflow <- workflow() |> add_model(model) |> add_recipe(rec) #4. fit the model to the data model_fit <- wkflow |> fit(data = iris)
Checking linear regression assumptions
After fitting the model, we should check whether the assumptions of linear regression are met. For this we will use the performance package.
# check assumptions of linear regression library(performance) model_fit |> extract_fit_engine() |> check_model()
Output:
We see that the high correlation between Petal.Length
and Petal.Width
is causing a collinearity problem. So, we will remove Petal.Width
from the recipe and re-run the model.
Refining the model
# removing Petal.Width from the recipe rec <- recipe( Sepal.Length ~ Sepal.Width + Petal.Length, data = iris ) # update the workflow wkflow <- workflow() |> add_model(model) |> add_recipe(rec) # re-fit the model to the data model_fit <- wkflow |> fit(data = iris) # re-check assumptions model_fit |> extract_fit_engine() |> check_model()
Output:
Everything is looking good now!
Next, we will look at the model output.
Examining the model output
Getting the regression coefficients and p-values
# get coefficients and p-values model_fit |> tidy() ## A tibble: 3 x 5 # term estimate std.error statistic p.value # <chr> <dbl> <dbl> <dbl> <dbl> #1 (Intercept) 2.25 0.248 9.07 7.04e-16 #2 Sepal.Width 0.596 0.0693 8.59 1.16e-14 #3 Petal.Length 0.472 0.0171 27.6 5.85e-60
Checking the model fit
# check model performance model_fit |> glance() ## A tibble: 1 x 12 # r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs # <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> #1 0.840 0.838 0.333 386. 2.93e-59 2 -46.5 101. 113. 16.3 147 150
Using the model to predict the outcome
Finally, we can calculate the predicted values for Sepal.Length
.
# use the model to predict the outcome model_fit |> predict(new_data = iris) ## A tibble: 150 x 1 # .pred # <dbl> # 1 4.99 # 2 4.70 # 3 4.77 # 4 4.80 # 5 5.05 # 6 5.37 # 7 4.93 # 8 4.98 # 9 4.64 #10 4.80 ## ... with 140 more rows ## i Use `print(n = ...)` to see more rows
The predict function is mostly useful to predict new/test data to check the out-of-sample model accuracy.