In this article, we will run and interpret a logistic regression model where the predictor is a categorical variable with multiple levels.
Loading the data
We will use the Titanic dataset available in R:
# transform the Titanic data from table to data.frame dat = as.data.frame(Titanic) dat = dat[rep(1:nrow(dat), dat$Freq), -5] summary(dat) # Class Sex Age Survived # 1st :325 Male :1731 Child: 109 No :1490 # 2nd :285 Female: 470 Adult:2092 Yes: 711 # 3rd :706 # Crew:885
Running a logistic regression model
Next, we will use logistic regression to examine the effect of class (the predictor) on survival (the outcome):
model = glm(Survived ~ Class, family = 'binomial', data = dat) summary(model) # Call: # glm(formula = Survived ~ Class, family = "binomial", data = dat) # # Deviance Residuals: # Min 1Q Median 3Q Max # -1.3999 -0.7623 -0.7401 0.9702 1.6906 # # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) 0.5092 0.1146 4.445 8.79e-06 *** # Class2nd -0.8565 0.1661 -5.157 2.51e-07 *** # Class3rd -1.5965 0.1436 -11.114 < 2e-16 *** # ClassCrew -1.6643 0.1390 -11.972 < 2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 2769.5 on 2200 degrees of freedom # Residual deviance: 2588.6 on 2197 degrees of freedom # AIC: 2596.6 # # Number of Fisher Scoring iterations: 4
Explaining the output
Since Class has 4 categories (1st, 2nd, 3rd, and Crew), R will divide it into 3 binary variables and leave one category as a reference. In this case, “1st” will be the reference category, so the 3 binary variables will be:
- Class2nd = 1 (if the person is in the second class) and 0 (if the person is not in the second class)
- Class3rd = 1 (if the person is in the third class) and 0 (if the person is not in the third class)
- ClassCrew = 1 (if the person is a crew member) and 0 (if the person is not a crew member)
Note that the variable Class1st is coded implicitly: if the score on all 3 variables is 0, then the person is in the first class.
Interpreting the model
For example, the coefficient of the variable Class2nd: β = -0.8565 (p < 0.05), can be interpreted as follows:
People in the second class have 0.42 (eβ = 0.42) times the survival odds of passengers in the first class (the reference group).
Alternatively, we can say that:
People in the second class have 58% (1 - 0.42 = 0.58) less odds of surviving than passengers in the first class (the reference group).
For more information on how to interpret logistic regression output, see my other articles: Interpret Logistic Regression Coefficients and Interpret Logistic Regression Intercept.
Changing the reference group
Suppose we are interested in explaining the survival of passengers compared to crew members. In this case, we should use the category “Crew” as a reference group:
# make Crew the reference group in Class dat$Class = relevel(dat$Class, ref = "Crew") # rerun logistic regression model = glm(Survived ~ Class, family = 'binomial', data = dat) summary(model) # Call: # glm(formula = Survived ~ Class, family = "binomial", data = dat) # # Deviance Residuals: # Min 1Q Median 3Q Max # -1.3999 -0.7623 -0.7401 0.9702 1.6906 # # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) -1.15516 0.07876 -14.667 < 2e-16 *** # Class1st 1.66434 0.13902 11.972 < 2e-16 *** # Class2nd 0.80785 0.14375 5.620 1.91e-08 *** # Class3rd 0.06785 0.11711 0.579 0.562 # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 2769.5 on 2200 degrees of freedom # Residual deviance: 2588.6 on 2197 degrees of freedom # AIC: 2596.6 # # Number of Fisher Scoring iterations: 4
Interpreting the model
Now the coefficient of Class2nd: β = 0.80785 (p < 0.05) can be interpreted as follows:
People in the second class have 2.24 (eβ = 2.24) times the survival odds of crew members (the reference group).
Alternatively, we can say that:
People in the second class have 124% (2.24 - 1 = 1.24) more odds of surviving than crew members (the reference group).
Global effect of the categorical variable
In the previous sections, we studied how the survival rate differs between passenger classes, but we still did not answer our main question: Does passenger class affect survival? i.e. is passenger class, as a whole, a good predictor of survival?
The drop1 function in R tests whether dropping the variable Class significantly affects the model. The output will be a single p-value no matter how many levels the variable has:
# test the global effect of the variable Class drop1(model, .~., test = "Chisq") # Single term deletions # # Model: # Survived ~ Class # Df Deviance AIC LRT Pr(>Chi) # <none> 2588.6 2596.6 # Class 3 2769.5 2771.5 180.9 < 2.2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpreting the output
The p-value of the variable Class is < 2.2e-16, smaller than 0.05, therefore:
Passenger class has a statistically significant effect on survival.