Ordinal logistic regression is a type of regression analysis that models the relationship between one or more predictors (numerical or categorical) and an ordinal outcome. An ordinal outcome is a variable that has more than 2 categories that have a logical order, such as:
- Cancer stage: 0, 1, 2, 3, or 4
- Income level: low, medium, or high
In this tutorial, we will use ordinal logistic regression on the titanic data set, which can be downloaded from this GitHub link, to predict the variable Pclass (passenger class), which can take on 1 of 3 values (1: first class, 2: second class, and 3: third class), using the predictors: Age, Sex, and Survived.
1. Loading the data
First, we will transform the numeric variable Pclass into an ordinal variable, and then remove all observations with missing values:
library(tidyverse) titanic <- read_csv("titanic.csv") |> mutate(Pclass = factor(Pclass, levels = c(1, 2, 3), ordered = TRUE)) |> select(Pclass, Age, Sex, Survived) |> drop_na() titanic ## A tibble: 714 × 4 # Pclass Age Sex Survived # <ord> <dbl> <chr> <dbl> # 1 3 22 male 0 # 2 1 38 female 1 # 3 3 26 female 1 # 4 1 35 female 1 # 5 3 35 male 0 # 6 1 54 male 0 # 7 3 2 male 0 # 8 3 27 female 1 # 9 2 14 female 1 #10 3 4 female 1 ## ℹ 704 more rows ## ℹ Use `print(n = ...)` to see more rows
2. Fitting an ordinal logistic regression model
We will use the most common type of ordinal logistic regression, which is proportional odds logistic regression, since it is the easiest to interpret.
The model can be run using the
clm() function from the ordinal package as follows:
library(ordinal) model <- clm(Pclass ~ Age + Sex + Survived, data = titanic) summary(model) #formula: Pclass ~ Age + Sex + Survived #data: titanic # # link threshold nobs logLik AIC niter max.grad cond.H # logit flexible 714 -627.62 1265.23 4(0) 1.74e-07 4.1e+04 # #Coefficients: # Estimate Std. Error z value Pr(>|z|) #Age -0.063208 0.005857 -10.792 <2e-16 *** #Sexmale -0.315159 0.199921 -1.576 0.115 #Survived -1.969870 0.204830 -9.617 <2e-16 *** #--- #Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # #Threshold coefficients: # Estimate Std. Error z value #1|2 -4.2024 0.3091 -13.60 #2|3 -2.8120 0.2838 -9.91
Interpreting the relationship between each predictor and the outcome
The general interpretation of an ordinal logistic regression coefficient β1 associated with a predictor X1 is:
A 1 unit increase in the predictor X1 is associated with a β1 change in the log odds of the outcome Y of being in a higher category.
Since the coefficients of the predictors are the log odds ratios, we will have to exponentiate them to get the odds ratios.
In R, we can call the function
tidy() on the model fit to extract the coefficients and p-values, with the arguments
exponentiate = TRUE and
conf.int = TRUE (to print the 95% confidence intervals). Then, we remove the standard errors and z-scores from the output to make the table smaller:
library(tidymodels) tidy(model, exponentiate = TRUE, conf.int = TRUE) |> select(-c(std.error, statistic)) ## A tibble: 5 × 6 # term estimate p.value conf.low conf.high coef.type # <chr> <dbl> <dbl> <dbl> <dbl> <chr> #1 1|2 0.0150 4.17e-42 NA NA intercept #2 2|3 0.0601 3.77e-23 NA NA intercept #3 Age 0.939 3.77e-27 0.928 0.949 location #4 Sexmale 0.730 1.15e- 1 0.491 1.08 location #5 Survived 0.139 6.77e-22 0.0927 0.207 location
We will only look at the coefficients of the predictors (the last 3 lines), not the intercepts. Here’s how to interpret them:
Interpreting the coefficient of Age:
The exponentiated coefficient (the odds ratio) related to Age is 0.939 which is less than 1: this means that age is negatively related to higher passenger class values. But since a 1st class ticket is better than a 2nd class which in turn is better than a 3rd class ticket, then higher age is positively related to having a higher-class ticket, and the interpretation becomes:
Older passengers tend to have higher-class tickets (p < 0.001). Specifically, a passenger that is 1 year older has 6.1% more odds (0.939 – 1 = -0.061) of having a ticket that is one class higher (i.e. a lower Pclass value).
Interpreting the coefficient of Sex:
The exponentiated coefficient (the odds ratio) related to male Sex is not statistically significant (p = 0.115) which means that gender is not a good predictor of passenger class.
Interpreting the coefficient of Survived:
The exponentiated coefficient (the odds ratio) related to Survived is 0.139 which is less than 1, this means that:
Passengers who survived tend to have lower Pclass values, i.e. tend to be have higher-class tickets (p < 0.001). Specifically, a passenger who survived had 86.1% (0.139 – 1 = 0.861) more odds of having a ticket that is one class higher than a passenger who did not survive.
Checking the model fit
nagelkerke() from the rcompanion package will calculate the pseudo-R-squared and run a likelihood ratio test:
library(rcompanion) nagelkerke(model) #$Models # #Model: "clm, Pclass ~ Age + Sex + Survived, titanic" #Null: "clm, Pclass ~ 1, titanic" # #$Pseudo.R.squared.for.model.vs.null # Pseudo.R.squared #McFadden 0.155862 #Cox and Snell (ML) 0.277186 #Nagelkerke (Cragg and Uhler) 0.316640 # #$Likelihood.ratio.test # Df.diff LogLik.diff Chisq p.value # -3 -115.88 231.77 5.738e-50 # #$Number.of.observations # #Model: 714 #Null: 714 # #$Messages # "Note: For models fit with REML, these statistics are based on refitting with ML" # #$Warnings # "None"
Of particular importance are:
- Nagelkerke’s R-squared: which is a number between 0 and 1 that measures the goodness of fit of a logistic regression model.
- The likelihood ratio test: which tests if the full model (the model with all the predictors included) fits the data better than the null model (the model with no variables). In our case, the LogLik.diff is -115.88 with p < 0.001, which means that adding the predictors is better than the null model with no predictors.
We can also run the Lipsitz test to check the goodness of fit:
library(generalhoslem) lipsitz.test(model) # Lipsitz goodness of fit test for ordinal response models # #data: formula: Pclass ~ Age + Sex + Survived #LR statistic = 7.0386, df = 9, p-value = 0.6331
Since the null hypothesis is a good model fit, then the p = 0.6331 obtained means that we cannot reject that hypothesis — which is a good thing.
Accuracy of the ordinal logistic regression model
Next, we will calculate the model’s accuracy in 3 steps:
Step #1: Get the fitted values and save them in preds:
preds <- augment(model, type = "class") preds ## A tibble: 714 × 5 # Pclass Age Sex Survived .fitted # <ord> <dbl> <chr> <dbl> <fct> # 1 3 22 male 0 3 # 2 1 38 female 1 1 # 3 3 26 female 1 1 # 4 1 35 female 1 1 # 5 3 35 male 0 3 # 6 1 54 male 0 1 # 7 3 2 male 0 3 # 8 3 27 female 1 1 # 9 2 14 female 1 3 #10 3 4 female 1 3 ## ℹ 704 more rows ## ℹ Use `print(n = ...)` to see more rows
Step #2: Look at the confusion matrix:
conf_mat(preds, truth = Pclass, estimate = .fitted) # Truth #Prediction 1 2 3 # 1 113 59 49 # 2 0 0 0 # 3 73 114 306
Notice that the model did not classify anyone to have a 2nd class ticket.
Now we can stop here and manually calculate the accuracy using the confusion matrix, but the function
accuracy() from tidymodels can do it faster.
Step #3: Calculate the model accuracy:
accuracy(preds, truth = Pclass, estimate = .fitted) ## A tibble: 1 × 3 # .metric .estimator .estimate # <chr> <chr> <dbl> #1 accuracy multiclass 0.587
So we can say that, based on Age, Sex, and Survived, the model correctly guesses the passenger’s class 58.7% of the times.
Checking the proportional odds assumption
The specific type of ordinal logistic regression model that we used (the proportional odds logistic regression) assumes that the effect of each predictor on the outcome is the same for all levels of the outcome. For example, “the effect of increasing age by 1 year on the odds of having a second-class ticket vs a first-class ticket” is the same as “the effect of increasing age by 1 year on the odds of having a third-class ticket vs a second-class ticket”.
We can check this assumption by using the Brant test, where the null hypothesis is that the proportional odds assumption holds. The assumption is considered violated if p < 0.05 on the Omnibus test plus at least one of the variables [source: McNulty K. Handbook of Regression Modeling in People Analytics: With Examples in R and Python. 1st edition. Chapman and Hall/CRC; 2021.]
library(gofcat) brant.test(model) #Brant Test: # chi-sq df pr(>chi) #Omnibus 3.699 3 0.30 #Age 2.027 1 0.15 #Sexmale 1.554 1 0.21 #Survived 0.679 1 0.41 # #H0: Proportional odds assumption holds
Here, all p-values are above 0.05, so the proportional odds assumption holds.