In this tutorial, we are going to use the tidymodels package to run a logistic regression on the Titanic dataset available in R.
1. Preparing the data
# transforming Titanic into a tibble
library(tidymodels)   # loads tibble, tidyr, and dplyr, which we use below

df <- Titanic |>
  as_tibble() |>
  uncount(n) |>
  mutate_if(is.character, as.factor)

df
## A tibble: 2,201 x 4
#   Class Sex   Age   Survived
#   <fct> <fct> <fct> <fct>
# 1 3rd   Male  Child No
# 2 3rd   Male  Child No
# 3 3rd   Male  Child No
# 4 3rd   Male  Child No
# 5 3rd   Male  Child No
# 6 3rd   Male  Child No
# 7 3rd   Male  Child No
# 8 3rd   Male  Child No
# 9 3rd   Male  Child No
#10 3rd   Male  Child No
## ... with 2,191 more rows
## i Use `print(n = ...)` to see more rows
2. Running a logistic regression model
In order to fit a logistic regression model in tidymodels, we need to do 4 things:
- Specify which model we are going to use: in this case, a logistic regression using glm
- Describe how we want to prepare the data before feeding it to the model: this is done with a recipe (in this specific example, we won’t apply any variable transformations, so we only need to specify the role of each variable using a formula: y ~ x1 + x2 + …)
- Create a workflow object that combines the model with the recipe.
- Fit the workflow object to the data: steps 1 to 3 only specify what we are going to do; nothing has touched the data yet, so this final step explicitly tells tidymodels to fit the model to the data.
library(tidymodels)

# 1. specify model type
model <- logistic_reg() |>
  set_engine("glm")

# 2. specify the role of each variable
rec <- recipe(Survived ~ Age + Class + Sex, data = df)

# 3. combine the 2 in a workflow
wkflow <- workflow() |>
  add_model(model) |>
  add_recipe(rec)

# 4. fit the model to the data
model_fit <- wkflow |>
  fit(data = df)
3. Examining the relationship between the predictors and the outcome
3.1. Getting the coefficients, p-values, and confidence intervals
tidy(model_fit, exponentiate = TRUE, conf.int = TRUE)
## A tibble: 6 x 7
#  term        estimate std.error statistic  p.value conf.low conf.high
#  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#1 (Intercept)   7.72       0.168     12.2  4.44e-34   5.59     10.8
#2 AgeChild      2.89       0.244      4.35 1.36e- 5   1.79      4.67
#3 Class2nd      0.361      0.196     -5.19 2.05e- 7   0.245     0.529
#4 Class3rd      0.169      0.172    -10.4  3.69e-25   0.120     0.236
#5 ClassCrew     0.424      0.157     -5.45 5.00e- 8   0.312     0.577
#6 SexMale       0.0889     0.140    -17.2  1.43e-66   0.0672    0.117
The coefficients are exponentiated and so can be interpreted as odds ratios. For example, the second row shows that the exponentiated coefficient of AgeChild is 2.89, which means that a child has 2.89 times the odds of survival of an adult, with a 95% CI of [1.79, 4.67].
Note: you should ignore the standard errors in this output since they are still on the log-odds scale while the coefficients are exponentiated (see this issue on GitHub).
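If you need the coefficients and their standard errors on the same (log-odds) scale, a minimal sketch is to call tidy() without exponentiating:

# coefficients, standard errors, and confidence intervals on the log-odds scale
tidy(model_fit, conf.int = TRUE)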
3.2. Testing the global effect of a categorical variable with multiple levels
Let’s say we are interested in answering the question: Does the passenger’s class affect survival?
In the previous section, we studied how the survival rate differs between passenger classes, but we did not answer the question: is passenger class, as a whole, a good predictor of survival?
The drop1 function in R tests whether dropping the variable Class significantly affects the model. The output will be a single p-value, no matter how many levels the variable has:
# global effect of a categorical variable
drop1(model_fit |> extract_fit_engine(), . ~ ., test = "Chisq")
#Single term deletions
#
#Model:
#..y ~ Age + Class + Sex
#       Df Deviance    AIC    LRT  Pr(>Chi)
#<none>      2210.1 2222.1
#Age     1   2228.9 2238.9  18.85 1.413e-05 ***
#Class   3   2329.1 2335.1 119.03 < 2.2e-16 ***
#Sex     1   2563.0 2573.0 352.91 < 2.2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the variable Class is < 2.2e-16 (smaller than 0.05); therefore, we can say that the passenger’s class has a statistically significant effect on survival.
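drop1() is essentially running a likelihood-ratio test for each variable. As a sanity check, we can do the same comparison for Class by hand: refit the model without it and compare the two nested fits with anova(). A sketch (the fit_full and fit_noclass names are just illustrative):

# refit without Class and compare the nested models with a likelihood-ratio test
fit_full    <- model_fit |> extract_fit_engine()
fit_noclass <- glm(Survived ~ Age + Sex, family = binomial, data = df)
anova(fit_noclass, fit_full, test = "Chisq")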
4. Checking the model’s performance
Before checking the performance of our logistic regression model, we first need to predict the outcome using the model and add these predictions to our original dataset, as we will use them later in our calculations.
4.1. Predicting the outcome
# predict the outcome using the model
df_preds <- model_fit |>
  augment(new_data = df)

df_preds
## A tibble: 2,201 x 7
#   Class Sex   Age   Survived .pred_class .pred_No .pred_Yes
#   <fct> <fct> <fct> <fct>    <fct>          <dbl>     <dbl>
# 1 3rd   Male  Child No       No             0.749     0.251
# 2 3rd   Male  Child No       No             0.749     0.251
# 3 3rd   Male  Child No       No             0.749     0.251
# 4 3rd   Male  Child No       No             0.749     0.251
# 5 3rd   Male  Child No       No             0.749     0.251
# 6 3rd   Male  Child No       No             0.749     0.251
# 7 3rd   Male  Child No       No             0.749     0.251
# 8 3rd   Male  Child No       No             0.749     0.251
# 9 3rd   Male  Child No       No             0.749     0.251
#10 3rd   Male  Child No       No             0.749     0.251
## ... with 2,191 more rows
## i Use `print(n = ...)` to see more rows
So augment() added 3 columns to our data: the predicted class (.pred_class), the predicted probability that the passenger did not survive (.pred_No), and the predicted probability that the passenger survived (.pred_Yes).
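For a two-class model like this one, .pred_class is simply the class with the higher predicted probability, which amounts to a 0.5 cutoff on .pred_Yes. A quick check (the manual_class column is just for illustration):

# verify that .pred_class matches a 0.5 cutoff on .pred_Yes
df_preds |>
  mutate(manual_class = if_else(.pred_Yes > 0.5, "Yes", "No")) |>
  count(.pred_class, manual_class)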
4.2. Deviance, AIC, and BIC
glance(model_fit)
## A tibble: 1 x 8
#  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
#          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
#1         2769.    2200 -1105. 2222. 2256.    2210.        2195  2201
Deviance measures the goodness of fit of a logistic regression model. A deviance of 0 means that the model describes the data perfectly, and a higher value corresponds to a less accurate model. In our case, the null deviance = 2769 (which measures the fit of a model that only includes the intercept) is larger than the residual deviance = 2210 (which measures the fit of the model with all the predictors included). So we can conclude that the predictors we added improved the model fit compared to the baseline model.
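The drop in deviance can also be turned into a formal likelihood-ratio test: under the null hypothesis that none of the predictors matter, the difference follows a chi-square distribution with df.null − df.residual degrees of freedom. A minimal sketch using the glance() output above:

# likelihood-ratio test of the full model against the intercept-only model
g <- glance(model_fit)
pchisq(g$null.deviance - g$deviance,
       df = g$df.null - g$df.residual,
       lower.tail = FALSE)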
The Akaike Information Criterion (AIC) estimates the prediction error of the logistic regression model: a lower AIC corresponds to more accurate model predictions. AIC can be used to compare the current model to one that contains more or fewer predictors. For example, if adding another predictor X to the model does not cause a drop in the AIC, then we can conclude that X does not improve the model’s prediction of the outcome probability.
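To make this concrete, we could refit the model without Age and compare the AICs (the *_noage names below are just illustrative):

# refit the model without Age and compare AICs
rec_noage <- recipe(Survived ~ Class + Sex, data = df)
fit_noage <- workflow() |> add_model(model) |> add_recipe(rec_noage) |> fit(data = df)
glance(fit_noage)$AIC   # the drop1() output above suggests roughly 2239, vs. 2222 for the full model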
4.3. Confusion matrix
conf_mat(df_preds, truth = Survived, estimate = .pred_class)
#          Truth
#Prediction   No  Yes
#       No  1364  362
#       Yes  126  349
This matrix contains 4 numbers:
- True Negatives (TN): where Prediction = No and Truth = No (1364 cases)
- False Negatives (FN): where Prediction = No and Truth = Yes (362 cases)
- False Positives (FP): where Prediction = Yes and Truth = No (126 cases)
- True Positives (TP): where Prediction = Yes and Truth = Yes (349 cases)
Using this matrix, we can calculate many performance metrics. But instead of calculating them by hand, we can use the following R code:
perf <- metric_set(accuracy, sensitivity, specificity, mcc, precision, recall)

perf(df_preds, truth = Survived, estimate = .pred_class)
## A tibble: 6 x 3
#  .metric     .estimator .estimate
#  <chr>       <chr>          <dbl>
#1 accuracy    binary         0.778
#2 sensitivity binary         0.915
#3 specificity binary         0.491
#4 mcc         binary         0.462
#5 precision   binary         0.790
#6 recall      binary         0.915
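As a sanity check, these numbers can be reproduced by hand from the confusion matrix. Keep in mind that yardstick treats the first factor level (No) as the event by default, so sensitivity here is the proportion of non-survivors correctly predicted as No:

# reproducing the metrics from the confusion-matrix counts
(1364 + 349) / 2201    # accuracy    ~ 0.778
1364 / (1364 + 126)    # sensitivity ~ 0.915 (truth = No predicted as No)
349  / (349 + 362)     # specificity ~ 0.491 (truth = Yes predicted as Yes)
1364 / (1364 + 362)    # precision   ~ 0.790 (predictions of No that are truly No)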
mcc is the Matthews correlation coefficient, which is a number between -1 and 1, where:
- 0 represents no agreement at all between the predictions and the true values of the outcome
- -1 represents perfect disagreement, and
- 1 represents perfect agreement.
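For reference, mcc can be computed directly from the four counts of the confusion matrix (the value is the same whichever class you treat as positive):

# Matthews correlation coefficient from the confusion-matrix counts
tp <- 349; tn <- 1364; fp <- 126; fn <- 362
(tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
# ~ 0.462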
4.4. Area under the ROC curve
The ROC AUC tells us how good the model is at separating the 2 classes of the outcome.
# AUC ROC
roc_auc(df_preds, truth = Survived, .pred_Yes, event_level = "second")
## A tibble: 1 x 3
#  .metric .estimator .estimate
#  <chr>   <chr>          <dbl>
#1 roc_auc binary         0.760
Note that we set event_level = "second" because, by default, roc_auc() considers the first level of the variable Survived as the event. Since the factor levels are ordered alphabetically (No before Yes), we need to tell the function to treat the second level (Yes) as the event.
roc_curve(df_preds, truth = Survived, .pred_Yes, event_level = "second") |> autoplot()
Output: the ROC curve of the model.