# Logistic Regression in R (with Categorical Variables)

In this article, we will run and interpret a logistic regression model where the predictor is a categorical variable with multiple levels.

We will use the Titanic dataset available in R:

```# transform the Titanic data from table to data.frame
dat = as.data.frame(Titanic)
dat = dat[rep(1:nrow(dat), dat\$Freq), -5]

summary(dat)
#   Class         Sex          Age       Survived
#  1st :325   Male  :1731   Child: 109   No :1490
#  2nd :285   Female: 470   Adult:2092   Yes: 711
#  3rd :706
#  Crew:885 ```

## Running a logistic regression model

Next, we will use logistic regression to examine the effect of class (the predictor) on survival (the outcome):

```model = glm(Survived ~ Class, family = 'binomial', data = dat)

summary(model)
# Call:
# glm(formula = Survived ~ Class, family = "binomial", data = dat)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -1.3999  -0.7623  -0.7401   0.9702   1.6906
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   0.5092     0.1146   4.445 8.79e-06 ***
# Class2nd     -0.8565     0.1661  -5.157 2.51e-07 ***
# Class3rd     -1.5965     0.1436 -11.114  < 2e-16 ***
# ClassCrew    -1.6643     0.1390 -11.972  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 2769.5  on 2200  degrees of freedom
# Residual deviance: 2588.6  on 2197  degrees of freedom
# AIC: 2596.6
#
# Number of Fisher Scoring iterations: 4```

### Explaining the output

Since Class has 4 categories (1st, 2nd, 3rd, and Crew), R will divide it into 3 binary variables and leave one category as a reference. In this case, “1st” will be the reference category, so the 3 binary variables will be:

• Class2nd = 1 (if the person is in the second class) and 0 (if the person is not in the second class)
• Class3rd = 1 (if the person is in the third class) and 0 (if the person is not in the third class)
• ClassCrew = 1 (if the person is a crew member) and 0 (if the person is not a crew member)

Note that the variable Class1st is coded implicitly: if the score on all 3 variables is 0, then the person is in the first class.

### Interpreting the model

For example, the coefficient of the variable Class2nd: β = -0.8565 (p < 0.05), can be interpreted as follows:

`People in the second class have 0.42 (eβ = 0.42) times the survival odds of passengers in the first class (the reference group).`

Alternatively, we can say that:

`People in the second class have 58% (1 - 0.42 = 0.58) less odds of surviving than passengers in the first class (the reference group).`

For more information on how to interpret logistic regression output, see my other articles: Interpret Logistic Regression Coefficients and Interpret Logistic Regression Intercept.

## Changing the reference group

Suppose we are interested in explaining the survival of passengers compared to crew members. In this case, we should use the category “Crew” as a reference group:

```# make Crew the reference group in Class
dat\$Class = relevel(dat\$Class, ref = "Crew")

# rerun logistic regression
model = glm(Survived ~ Class, family = 'binomial', data = dat)

summary(model)
# Call:
# glm(formula = Survived ~ Class, family = "binomial", data = dat)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -1.3999  -0.7623  -0.7401   0.9702   1.6906
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.15516    0.07876 -14.667  < 2e-16 ***
# Class1st     1.66434    0.13902  11.972  < 2e-16 ***
# Class2nd     0.80785    0.14375   5.620 1.91e-08 ***
# Class3rd     0.06785    0.11711   0.579    0.562
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 2769.5  on 2200  degrees of freedom
# Residual deviance: 2588.6  on 2197  degrees of freedom
# AIC: 2596.6
#
# Number of Fisher Scoring iterations: 4```

### Interpreting the model

Now the coefficient of Class2nd: β = 0.80785 (p < 0.05) can be interpreted as follows:

`People in the second class have 2.24 (eβ = 2.24) times the survival odds of crew members (the reference group).`

Alternatively, we can say that:

`People in the second class have 124% (2.24 - 1 = 1.24) more odds of surviving than crew members (the reference group).`

## Global effect of the categorical variable

In the previous sections, we studied how the survival rate differs between passenger classes, but we still did not answer our main question: Does passenger class affect survival? i.e. is passenger class, as a whole, a good predictor of survival?

The drop1 function in R tests whether dropping the variable Class significantly affects the model. The output will be a single p-value no matter how many levels the variable has:

```# test the global effect of the variable Class
drop1(model, .~., test = "Chisq")
# Single term deletions
#
# Model:
# Survived ~ Class
#        Df Deviance    AIC   LRT  Pr(>Chi)
# <none>      2588.6 2596.6
# Class   3   2769.5 2771.5 180.9 < 2.2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1```

### Interpreting the output

The p-value of the variable Class is < 2.2e-16, smaller than 0.05, therefore:

`Passenger class has a statistically significant effect on survival.`