Logistic Regression in R (with Categorical Variables)

In this article, we will run and interpret a logistic regression model where the predictor is a categorical variable with multiple levels.

Loading the data

We will use the Titanic dataset available in R:

# transform the Titanic data from table to data.frame
dat = as.data.frame(Titanic)
dat = dat[rep(1:nrow(dat), dat$Freq), -5]

summary(dat)
#   Class         Sex          Age       Survived  
#  1st :325   Male  :1731   Child: 109   No :1490  
#  2nd :285   Female: 470   Adult:2092   Yes: 711  
#  3rd :706                                        
#  Crew:885 

Running a logistic regression model

Next, we will use logistic regression to examine the effect of class (the predictor) on survival (the outcome):

model = glm(Survived ~ Class, family = 'binomial', data = dat)

summary(model)
# Call:
# glm(formula = Survived ~ Class, family = "binomial", data = dat)
# 
# Deviance Residuals: 
#     Min       1Q   Median       3Q      Max  
# -1.3999  -0.7623  -0.7401   0.9702   1.6906  
# 
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   0.5092     0.1146   4.445 8.79e-06 ***
# Class2nd     -0.8565     0.1661  -5.157 2.51e-07 ***
# Class3rd     -1.5965     0.1436 -11.114  < 2e-16 ***
# ClassCrew    -1.6643     0.1390 -11.972  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 2769.5  on 2200  degrees of freedom
# Residual deviance: 2588.6  on 2197  degrees of freedom
# AIC: 2596.6
# 
# Number of Fisher Scoring iterations: 4

Explaining the output

Since Class has 4 categories (1st, 2nd, 3rd, and Crew), R will divide it into 3 binary variables and leave one category as a reference. In this case, “1st” will be the reference category, so the 3 binary variables will be:

  • Class2nd = 1 (if the person is in the second class) and 0 (if the person is not in the second class)
  • Class3rd = 1 (if the person is in the third class) and 0 (if the person is not in the third class)
  • ClassCrew = 1 (if the person is a crew member) and 0 (if the person is not a crew member)

Note that the variable Class1st is coded implicitly: if the score on all 3 variables is 0, then the person is in the first class.

Interpreting the model

For example, the coefficient of the variable Class2nd: β = -0.8565 (p < 0.05), can be interpreted as follows:

People in the second class have 0.42 (eβ = 0.42) times the survival odds of passengers in the first class (the reference group).

Alternatively, we can say that:

People in the second class have 58% (1 - 0.42 = 0.58) less odds of surviving than passengers in the first class (the reference group).

For more information on how to interpret logistic regression output, see my other articles: Interpret Logistic Regression Coefficients and Interpret Logistic Regression Intercept.

Changing the reference group

Suppose we are interested in explaining the survival of passengers compared to crew members. In this case, we should use the category “Crew” as a reference group:

# make Crew the reference group in Class
dat$Class = relevel(dat$Class, ref = "Crew")

# rerun logistic regression
model = glm(Survived ~ Class, family = 'binomial', data = dat)

summary(model)
# Call:
# glm(formula = Survived ~ Class, family = "binomial", data = dat)
# 
# Deviance Residuals: 
#     Min       1Q   Median       3Q      Max  
# -1.3999  -0.7623  -0.7401   0.9702   1.6906  
# 
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.15516    0.07876 -14.667  < 2e-16 ***
# Class1st     1.66434    0.13902  11.972  < 2e-16 ***
# Class2nd     0.80785    0.14375   5.620 1.91e-08 ***
# Class3rd     0.06785    0.11711   0.579    0.562    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 2769.5  on 2200  degrees of freedom
# Residual deviance: 2588.6  on 2197  degrees of freedom
# AIC: 2596.6
# 
# Number of Fisher Scoring iterations: 4

Interpreting the model

Now the coefficient of Class2nd: β = 0.80785 (p < 0.05) can be interpreted as follows:

People in the second class have 2.24 (eβ = 2.24) times the survival odds of crew members (the reference group).

Alternatively, we can say that:

People in the second class have 124% (2.24 - 1 = 1.24) more odds of surviving than crew members (the reference group).

Global effect of the categorical variable

In the previous sections, we studied how the survival rate differs between passenger classes, but we still did not answer our main question: Does passenger class affect survival? i.e. is passenger class, as a whole, a good predictor of survival?

The drop1 function in R tests whether dropping the variable Class significantly affects the model. The output will be a single p-value no matter how many levels the variable has:

# test the global effect of the variable Class
drop1(model, .~., test = "Chisq")
# Single term deletions
# 
# Model:
# Survived ~ Class
#        Df Deviance    AIC   LRT  Pr(>Chi)    
# <none>      2588.6 2596.6                    
# Class   3   2769.5 2771.5 180.9 < 2.2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpreting the output

The p-value of the variable Class is < 2.2e-16, smaller than 0.05, therefore:

Passenger class has a statistically significant effect on survival.

Further reading