In this article, we will run and interpret a logistic regression model where the predictor is a categorical variable with multiple levels.

## Loading the data

We will use the Titanic dataset available in R:

# transform the Titanic data from table to data.frame dat = as.data.frame(Titanic) dat = dat[rep(1:nrow(dat), dat$Freq), -5] summary(dat) # Class Sex Age Survived # 1st :325 Male :1731 Child: 109 No :1490 # 2nd :285 Female: 470 Adult:2092 Yes: 711 # 3rd :706 # Crew:885

## Running a logistic regression model

Next, we will use logistic regression to examine the effect of class (the predictor) on survival (the outcome):

model = glm(Survived ~ Class, family = 'binomial', data = dat) summary(model) # Call: # glm(formula = Survived ~ Class, family = "binomial", data = dat) # # Deviance Residuals: # Min 1Q Median 3Q Max # -1.3999 -0.7623 -0.7401 0.9702 1.6906 # # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) 0.5092 0.1146 4.445 8.79e-06 *** # Class2nd -0.8565 0.1661 -5.157 2.51e-07 *** # Class3rd -1.5965 0.1436 -11.114 < 2e-16 *** # ClassCrew -1.6643 0.1390 -11.972 < 2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 2769.5 on 2200 degrees of freedom # Residual deviance: 2588.6 on 2197 degrees of freedom # AIC: 2596.6 # # Number of Fisher Scoring iterations: 4

### Explaining the output

Since *Class* has 4 categories (1st, 2nd, 3rd, and Crew), R will divide it into 3 binary variables and leave one category as a reference. In this case, “1st” will be the reference category, so the 3 binary variables will be:

**Class2nd**= 1 (if the person is in the second class) and 0 (if the person is not in the second class)**Class3rd**= 1 (if the person is in the third class) and 0 (if the person is not in the third class)**ClassCrew**= 1 (if the person is a crew member) and 0 (if the person is not a crew member)

Note that the variable **Class1st** is coded implicitly: if the score on all 3 variables is 0, then the person is in the first class.

### Interpreting the model

For example, the coefficient of the variable **Class2nd: β = -0.8565 (p < 0.05)**, can be interpreted as follows:

People in the second class have 0.42 (e^{β}= 0.42) times the survival odds of passengers in the first class (the reference group).

Alternatively, we can say that:

People in the second class have 58% (1 - 0.42 = 0.58) less odds of surviving than passengers in the first class (the reference group).

For more information on how to interpret logistic regression output, see my other articles: Interpret Logistic Regression Coefficients and Interpret Logistic Regression Intercept.

## Changing the reference group

Suppose we are interested in explaining the survival of passengers compared to crew members. In this case, we should use the category “Crew” as a reference group:

# make Crew the reference group in Class dat$Class = relevel(dat$Class, ref = "Crew") # rerun logistic regression model = glm(Survived ~ Class, family = 'binomial', data = dat) summary(model) # Call: # glm(formula = Survived ~ Class, family = "binomial", data = dat) # # Deviance Residuals: # Min 1Q Median 3Q Max # -1.3999 -0.7623 -0.7401 0.9702 1.6906 # # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) -1.15516 0.07876 -14.667 < 2e-16 *** # Class1st 1.66434 0.13902 11.972 < 2e-16 *** # Class2nd 0.80785 0.14375 5.620 1.91e-08 *** # Class3rd 0.06785 0.11711 0.579 0.562 # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 2769.5 on 2200 degrees of freedom # Residual deviance: 2588.6 on 2197 degrees of freedom # AIC: 2596.6 # # Number of Fisher Scoring iterations: 4

### Interpreting the model

Now the coefficient of **Class2nd: β = 0.80785 (p < 0.05)** can be interpreted as follows:

People in the second class have 2.24 (e^{β}= 2.24) times the survival odds of crew members (the reference group).

Alternatively, we can say that:

People in the second class have 124% (2.24 - 1 = 1.24) more odds of surviving than crew members (the reference group).

## Global effect of the categorical variable

In the previous sections, we studied how the survival rate differs between passenger classes, but we still did not answer our main question: **Does passenger class affect survival?** i.e. is passenger class, as a whole, a good predictor of survival?

The *drop1* function in R tests whether dropping the variable *Class* significantly affects the model. The output will be a single p-value no matter how many levels the variable has:

# test the global effect of the variable Class drop1(model, .~., test = "Chisq") # Single term deletions # # Model: # Survived ~ Class # Df Deviance AIC LRT Pr(>Chi) # <none> 2588.6 2596.6 # Class 3 2769.5 2771.5 180.9 < 2.2e-16 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

### Interpreting the output

The p-value of the variable Class is < 2.2e-16, smaller than 0.05, therefore:

Passenger class has a statistically significant effect on survival.