In this article, we will run and interpret a linear regression model where the predictor is a categorical variable with multiple levels.
Loading the Data
We will use the chickwts dataset available in R.
(These data come from an experiment where newly hatched chickens were randomly divided into 6 groups, each group receiving a different feed supplement. Their weights in grams are given after 6 weeks.)
summary(chickwts) # weight feed # Min. :108.0 casein :12 # 1st Qu.:204.5 horsebean:10 # Median :258.0 linseed :12 # Mean :261.3 meatmeal :11 # 3rd Qu.:323.5 soybean :14 # Max. :423.0 sunflower:12
Running a linear regression model
Next, we will use linear regression to examine the effect of feed supplements (the predictor) on the weight of chickens (the outcome):
model = lm(weight ~ feed, data = chickwts) summary(model) # Call: # lm(formula = weight ~ feed, data = chickwts) # # Residuals: # Min 1Q Median 3Q Max # -123.909 -34.413 1.571 38.170 103.091 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 323.583 15.834 20.436 < 2e-16 *** # feedhorsebean -163.383 23.485 -6.957 2.07e-09 *** # feedlinseed -104.833 22.393 -4.682 1.49e-05 *** # feedmeatmeal -46.674 22.896 -2.039 0.045567 * # feedsoybean -77.155 21.578 -3.576 0.000665 *** # feedsunflower 5.333 22.393 0.238 0.812495 # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 54.85 on 65 degrees of freedom # Multiple R-squared: 0.5417, Adjusted R-squared: 0.5064 # F-statistic: 15.36 on 5 and 65 DF, p-value: 5.936e-10
Explaining the output
Since feed has 6 categories (casein, horsebean, linseed, meatmeal, soybean, and sunflower), R will divide it into 5 binary variables and leave one category as a reference. In this case, “casein” will be the reference category, so the 5 binary variables will be:
- feedhorsebean = 1 (if the supplement is horsebean) and 0 (if the supplement is not horsebean)
- feedlinseed= 1 (if the supplement is linseed) and 0 (if the supplement is not linseed)
- feedmeatmeal = 1 (if the supplement is meatmeal) and 0 (if the supplement is not meatmeal)
- feedsoybean = 1 (if the supplement is soybean) and 0 (if the supplement is not soybean)
- feedsunflower = 1 (if the supplement is sunflower) and 0 (if the supplement is not sunflower)
Note that the variable feedcasein is coded implicitly: if the score on all 5 variables is 0, then the supplement is casein.
Interpreting the model
For example, the coefficient of the variable feedhorsebean: β = -163.383 (p < 0.05), can be interpreted as follows:
The average weight of chickens receiving horsebean supplements is 163.383 grams less than those receiving casein (the reference group).
The intercept β0 = 323.583 (p < 0.05) should be interpreted assuming a value of 0 for all the predictors in the model, i.e. assuming that chickens were given casein supplements (the reference group):
The average weight of chickens receiving casein supplements is 323.583 grams.
For more information on how to interpret linear regression output, see my other articles: Interpret Linear Regression Coefficients and Interpret the Linear Regression Intercept.
Changing the reference group
Now suppose we are interested in explaining the effects of various feed types compared to sunflower. In this case, we should use the category “sunflower” as a reference group:
# make sunflower the reference group in feed chickwts$feed = relevel(chickwts$feed, ref = "sunflower") # rerun linear regression model = lm(weight ~ feed, data = chickwts) summary(model) # Call: # lm(formula = weight ~ feed, data = chickwts) # # Residuals: # Min 1Q Median 3Q Max # -123.909 -34.413 1.571 38.170 103.091 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 328.917 15.834 20.773 < 2e-16 *** # feedcasein -5.333 22.393 -0.238 0.812495 # feedhorsebean -168.717 23.485 -7.184 8.20e-10 *** # feedlinseed -110.167 22.393 -4.920 6.21e-06 *** # feedmeatmeal -52.008 22.896 -2.271 0.026435 * # feedsoybean -82.488 21.578 -3.823 0.000298 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 54.85 on 65 degrees of freedom # Multiple R-squared: 0.5417, Adjusted R-squared: 0.5064 # F-statistic: 15.36 on 5 and 65 DF, p-value: 5.936e-10
Interpreting the model
Now the coefficient of the variable feedhorsebean: β = -168.717 (p < 0.05), can be interpreted as follows:
The average weight of chickens receiving horsebean supplements is 168.717 grams less than those receiving sunflower (the reference group).
Global effect of the categorical variable
In the previous sections, we compared the effectiveness of various feed types on the weight of chickens, but we still did not answer our main question: Does the feed type affect weight? i.e. is feed type, as a whole, a good predictor of weight?
The drop1 function in R tests whether dropping the variable feed significantly affects the model. The output will be a single p-value no matter how many levels the variable has:
# test the global effect of the variable feed drop1(model, .~., test = "F") # Single term deletions # # Model: # weight ~ feed # Df Sum of Sq RSS AIC F value Pr(>F) # <none> 195556 574.39 # feed 5 231129 426685 619.78 15.365 5.936e-10 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpreting the output
The p-value of the variable feed is < 5.963e-10, smaller than 0.05, therefore:
The feed type has a statistically significant effect on chicken weight.