Linear Regression in R (with a Categorical Variable)

In this article, we will run and interpret a linear regression model where the predictor is a categorical variable with multiple levels.

Loading the Data

We will use the chickwts dataset available in R.

(These data come from an experiment where newly hatched chickens were randomly divided into 6 groups, each group receiving a different feed supplement. Their weights in grams are given after 6 weeks.)

summary(chickwts)
#      weight             feed   
#  Min.   :108.0   casein   :12  
#  1st Qu.:204.5   horsebean:10  
#  Median :258.0   linseed  :12  
#  Mean   :261.3   meatmeal :11  
#  3rd Qu.:323.5   soybean  :14  
#  Max.   :423.0   sunflower:12 

Running a linear regression model

Next, we will use linear regression to examine the effect of feed supplements (the predictor) on the weight of chickens (the outcome):

model = lm(weight ~ feed, data = chickwts)

summary(model)
# Call:
# lm(formula = weight ~ feed, data = chickwts)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -123.909  -34.413    1.571   38.170  103.091 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    323.583     15.834  20.436  < 2e-16 ***
# feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
# feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
# feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
# feedsoybean    -77.155     21.578  -3.576 0.000665 ***
# feedsunflower    5.333     22.393   0.238 0.812495    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 54.85 on 65 degrees of freedom
# Multiple R-squared:  0.5417,	Adjusted R-squared:  0.5064 
# F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10

Explaining the output

Since feed has 6 categories (casein, horsebean, linseed, meatmeal, soybean, and sunflower), R will divide it into 5 binary variables and leave one category as a reference. In this case, “casein” will be the reference category, so the 5 binary variables will be:

  • feedhorsebean = 1 (if the supplement is horsebean) and 0 (if the supplement is not horsebean)
  • feedlinseed= 1 (if the supplement is linseed) and 0 (if the supplement is not linseed)
  • feedmeatmeal = 1 (if the supplement is meatmeal) and 0 (if the supplement is not meatmeal)
  • feedsoybean = 1 (if the supplement is soybean) and 0 (if the supplement is not soybean)
  • feedsunflower = 1 (if the supplement is sunflower) and 0 (if the supplement is not sunflower)

Note that the variable feedcasein is coded implicitly: if the score on all 5 variables is 0, then the supplement is casein.

Interpreting the model

For example, the coefficient of the variable feedhorsebean: β = -163.383 (p < 0.05), can be interpreted as follows:

The average weight of chickens receiving horsebean supplements is 163.383 grams less than those receiving casein (the reference group).

The intercept β0 = 323.583 (p < 0.05) should be interpreted assuming a value of 0 for all the predictors in the model, i.e. assuming that chickens were given casein supplements (the reference group):

The average weight of chickens receiving casein supplements is 323.583 grams.

For more information on how to interpret linear regression output, see my other articles: Interpret Linear Regression Coefficients and Interpret the Linear Regression Intercept.

Changing the reference group

Now suppose we are interested in explaining the effects of various feed types compared to sunflower. In this case, we should use the category “sunflower” as a reference group:

# make sunflower the reference group in feed
chickwts$feed = relevel(chickwts$feed, ref = "sunflower")

# rerun linear regression
model = lm(weight ~ feed, data = chickwts)

summary(model)
# Call:
# lm(formula = weight ~ feed, data = chickwts)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -123.909  -34.413    1.571   38.170  103.091 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    328.917     15.834  20.773  < 2e-16 ***
# feedcasein      -5.333     22.393  -0.238 0.812495    
# feedhorsebean -168.717     23.485  -7.184 8.20e-10 ***
# feedlinseed   -110.167     22.393  -4.920 6.21e-06 ***
# feedmeatmeal   -52.008     22.896  -2.271 0.026435 *  
# feedsoybean    -82.488     21.578  -3.823 0.000298 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 54.85 on 65 degrees of freedom
# Multiple R-squared:  0.5417,	Adjusted R-squared:  0.5064
# F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10

Interpreting the model

Now the coefficient of the variable feedhorsebean: β = -168.717 (p < 0.05), can be interpreted as follows:

The average weight of chickens receiving horsebean supplements is 168.717 grams less than those receiving sunflower (the reference group).

Global effect of the categorical variable

In the previous sections, we compared the effectiveness of various feed types on the weight of chickens, but we still did not answer our main question: Does the feed type affect weight? i.e. is feed type, as a whole, a good predictor of weight?

The drop1 function in R tests whether dropping the variable feed significantly affects the model. The output will be a single p-value no matter how many levels the variable has:

# test the global effect of the variable feed
drop1(model, .~., test = "F")
# Single term deletions
# 
# Model:
# weight ~ feed
#        Df Sum of Sq    RSS    AIC F value    Pr(>F)    
# <none>              195556 574.39                      
# feed    5    231129 426685 619.78  15.365 5.936e-10 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpreting the output

The p-value of the variable feed is < 5.963e-10, smaller than 0.05, therefore:

The feed type has a statistically significant effect on chicken weight.

Further reading