A regression model that includes a categorical predictor with many levels might not contain enough observations in each category to detect a reasonable effect size with adequate power. Even when it does, the large number of dummy variables created can be difficult to interpret.
In this article, we present 4 ways to deal with such a variable. We start with simple solutions and move gradually to more complex ones:
- Combine similar categories using domain knowledge
- Group categories with low frequency
- Replace categories with the mean values of a numeric proxy variable
- Combine categories based on the similarity of their relationship with the outcome
1. Combine similar categories using domain knowledge
Use your expertise in the field to define larger categories for your variable. The coding from previous studies can be used as a guide.
For instance, you can replace the 9 AJCC stages of prostate cancer (I, IIA, IIB, IIC, IIIA, IIIB, IIIC, IVA, IVB) with a binary variable, metastasized (Yes/No), which reflects whether the cancer has spread to other parts of the body.
Since combining similar categories involves information loss, you should compare the performance metrics (e.g. RMSE for linear regression, or the area under the ROC curve (AUC) for logistic regression) of a model that uses the original coding of the variable with those of a model that uses the newly created categories, to make sure that performance did not drop substantially.
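As a minimal sketch of this comparison, the Python snippet below fits an ordinary least squares model on dummy variables for each coding and reports the RMSE on held-out rows. The data are synthetic, the helper holdout_rmse is hypothetical, and treating stages IVA and IVB as metastasized is an assumption made for illustration only, not a clinical definition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a 9-level stage variable and a numeric outcome.
stages = ["I", "IIA", "IIB", "IIC", "IIIA", "IIIB", "IIIC", "IVA", "IVB"]
n = 900
df = pd.DataFrame({"stage": rng.choice(stages, size=n)})

# Assumed grouping, for illustration only: stage IV substages count as "Yes".
df["metastasized"] = np.where(df["stage"].isin(["IVA", "IVB"]), "Yes", "No")

# Simulated outcome driven by the binary split, plus noise.
is_meta = (df["metastasized"] == "Yes").astype(float)
df["y"] = 2.0 * is_meta + rng.normal(0, 1, size=n)


def holdout_rmse(data, column, train_frac=0.7):
    """Fit OLS on dummy variables of `column`; return RMSE on the held-out rows."""
    X = pd.get_dummies(data[column], drop_first=True).to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    y = data["y"].to_numpy()
    cut = int(train_frac * len(data))  # rows are already in random order
    beta, *_ = np.linalg.lstsq(X[:cut], y[:cut], rcond=None)
    resid = y[cut:] - X[cut:] @ beta
    return float(np.sqrt(np.mean(resid ** 2)))


rmse_original = holdout_rmse(df, "stage")
rmse_collapsed = holdout_rmse(df, "metastasized")
print(f"RMSE with 9 stages:       {rmse_original:.3f}")
print(f"RMSE with binary variable: {rmse_collapsed:.3f}")
```

Evaluating on held-out rows matters here: on the training data, the model with more categories can never look worse, so an in-sample comparison would always favor the original coding.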
2. Group categories with low frequency
Arrange the categories by frequency, then group low-frequency categories into a single new category.
For instance, you can combine all categories that each account for less than 5% of the observations in your sample.
Since grouping low-frequency categories also involves information loss, you should make sure that this transformation did not cause a large drop in the model performance.
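The grouping rule can be sketched in a few lines of pandas. The function name group_rare_categories, the label "Other", and the example data are hypothetical; the 5% threshold follows the example above.

```python
import pandas as pd


def group_rare_categories(series, min_share=0.05, other_label="Other"):
    """Replace categories whose relative frequency is below `min_share`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent categories as they are; relabel the rare ones.
    return series.where(~series.isin(rare), other_label)


# Hypothetical example: 100 observations, three levels below 5%.
occupation = pd.Series(
    ["nurse"] * 50 + ["teacher"] * 40 + ["pilot"] * 4 + ["chef"] * 3 + ["actor"] * 3
)
grouped = group_rare_categories(occupation)
print(grouped.value_counts())
```

Note that the combined category can itself remain small if only a few rare levels exist, so it is worth checking its frequency after grouping.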
3. Replace categories with the mean values of a numeric proxy variable
Find a numeric variable Z that is a good proxy for your categorical variable X (e.g. one that is strongly associated with it). Next, calculate the mean of Z for each category of X, then replace the categories of X with these means. This way, you create a single numeric variable from a categorical variable that has many levels. [Source: Frank Harrell on StackExchange]
For instance, you can replace categories of the variable college major with their average corresponding years of education. Another example would be to replace city names in the variable location with an approximate combination of longitude and latitude (thus creating 2 numeric variables from location).
This method works in cases where you don’t care about explaining the relationship between each category of your variable and the outcome, i.e. when you are only interested in the global effect of your categorical variable.
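The replacement by proxy means can be illustrated as follows. This is only a sketch: the column names and the toy values are made up for the college major example.

```python
import pandas as pd

# Hypothetical data: X is `major`, the numeric proxy Z is `years_education`.
df = pd.DataFrame({
    "major": ["math", "math", "art", "art", "law", "law"],
    "years_education": [16, 18, 15, 17, 19, 21],
})

# Mean of Z within each category of X.
major_means = df.groupby("major")["years_education"].mean()

# Replace each category with its mean, giving a single numeric predictor.
df["major_encoded"] = df["major"].map(major_means)
print(df)
```

The new column major_encoded can then enter the regression as one numeric predictor instead of one dummy variable per major.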
A variation of this method, called target encoding (a.k.a. effect encoding, likelihood encoding, or criterion scaling), uses the outcome to encode the categorical variable. In this case, each category will be replaced by:
- The corresponding outcome mean (if the outcome is numeric, i.e. in a linear regression setting)
- The corresponding outcome probability (if the outcome is binary, i.e. in a logistic regression setting)
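For target encoding, since you will be using the outcome to create the new variable, be sure to use cross-validation to avoid overfitting: the value assigned to an observation should be computed from other observations, not from its own outcome.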
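One common way to combine target encoding with cross-validation is out-of-fold encoding, sketched below with pandas and numpy. The function target_encode_oof, the fold scheme, the fallback for unseen categories, and the simulated data are all assumptions for illustration, not a reference implementation.

```python
import numpy as np
import pandas as pd


def target_encode_oof(df, cat_col, target_col, n_splits=5, seed=0):
    """Encode each row with category means computed from the other folds only."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(df)) % n_splits  # random, balanced fold ids
    encoded = np.full(len(df), np.nan)
    for k in range(n_splits):
        in_fold = fold == k
        train = df[~in_fold]
        means = train.groupby(cat_col)[target_col].mean()
        # Categories unseen in the training folds fall back to the overall training mean.
        values = df.loc[in_fold, cat_col].map(means).fillna(train[target_col].mean())
        encoded[in_fold] = values.to_numpy()
    return pd.Series(encoded, index=df.index)


# Hypothetical binary outcome: purchase probability differs by city.
rng = np.random.default_rng(1)
df = pd.DataFrame({"city": rng.choice(list("ABCD"), size=200)})
prob = df["city"].map({"A": 0.2, "B": 0.4, "C": 0.6, "D": 0.8})
df["bought"] = (prob > rng.random(200)).astype(int)

df["city_te"] = target_encode_oof(df, "city", "bought")
print(df.groupby("city")["city_te"].mean())
```

With a binary outcome coded 0/1, the category mean is the observed outcome proportion, so the same function covers both the numeric and the binary case.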
4. Combine categories based on the similarity of their relationship with the outcome
Combine categories that have similar effects on the outcome variable. The similarity of these categories can be judged using the residuals of an initial model. [Source: Bruce P, Bruce A, Gedeck P. Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. 2nd edition. O’Reilly Media; 2020.]
Here’s a step-by-step description of how this works:
Suppose you want to create 5 categories from a variable called Occupation that has 100 categories:
- Step 1: Run a linear regression model using the rest of your predictors (without the variable Occupation). Compute the residuals of this model and add them as a new column to your original data.
- Step 2: Calculate the mean (or median) residual value for each category of Occupation.
- Step 3: Sort the categories according to the mean (or median) residual values assigned to them, and group every 20 consecutive categories into 1, thus obtaining 5 categories in total.
- Step 4: Re-run the linear regression model including the newly created Occupation variable with 5 categories.
Since this method uses the outcome to break the variable into larger categories, you should use cross-validation to avoid overfitting.
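The four steps above can be sketched with numpy and pandas as follows. The data are simulated, the column names (age, income) are hypothetical, and for brevity the example fits and evaluates on the same rows; in practice the grouping should be derived inside cross-validation folds, as noted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 100 occupations, one other predictor, a numeric outcome.
n = 5000
occupations = [f"occ_{i:03d}" for i in range(100)]
occ_effect = dict(zip(occupations, rng.normal(0, 2, size=100)))
df = pd.DataFrame({
    "occupation": rng.choice(occupations, size=n),
    "age": rng.uniform(20, 60, size=n),
})
noise = rng.normal(0, 1, size=n)
df["income"] = 0.5 * df["age"] + df["occupation"].map(occ_effect) + noise

# Step 1: regress the outcome on the other predictors only, keep the residuals.
X = np.column_stack([np.ones(n), df["age"].to_numpy()])
y = df["income"].to_numpy()
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
df["residual"] = y - X @ beta

# Step 2: mean residual for each occupation.
mean_resid = df.groupby("occupation")["residual"].mean()

# Step 3: rank occupations by mean residual, then cut the ranking into 5 groups of 20.
ranks = mean_resid.rank(method="first").astype(int) - 1  # 0..99
group_of = ranks // 20                                   # 0..4
df["occupation_group"] = df["occupation"].map(group_of)

# Step 4: re-fit the model with the 5-level variable added as dummy variables.
dummies = pd.get_dummies(df["occupation_group"], prefix="grp", drop_first=True)
X2 = np.column_stack([X, dummies.to_numpy(dtype=float)])
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)

rmse_before = float(np.sqrt(np.mean(df["residual"] ** 2)))
rmse_after = float(np.sqrt(np.mean((y - X2 @ beta2) ** 2)))
print(f"In-sample RMSE without Occupation:  {rmse_before:.3f}")
print(f"In-sample RMSE with 5 groups added: {rmse_after:.3f}")
```

Using the median instead of the mean in Step 2 only requires replacing .mean() with .median() in the groupby line.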