# How to Deal with Violation of the Linearity Assumption in R

The most important assumption of linear regression is that the relationship between each predictor and the outcome is linear.

When the linearity assumption is violated, try:

1. Adding a quadratic term
2. Adding an interaction term
3. Log transforming a predictor
4. Categorizing a numeric predictor

Let’s create some non-linear data to demonstrate these methods:

```r
set.seed(2)
# creating the variables:
# age
age = rnorm(n = 100, mean = 50, sd = 10)

# ldl cholesterol
ldl = rnorm(n = 100, mean = 120, sd = 35)

# systolic blood pressure
sbp = 0.01*age^2 - 3*log(age) + 2*sqrt(ldl) + rnorm(100, 90, 10)

model1 = lm(sbp ~ age + ldl)

# plotting residuals vs fitted values
plot(model1, 1)
```

The residuals vs fitted values plot shows a curved relationship; therefore, the linearity assumption is violated.

## Solution #1: Adding a quadratic term

Non-linearity may be fixed by adding age² and/or ldl² to our model. We can check whether this solution works by:

1. Using ANOVA to compare the model with a quadratic term with the model without a quadratic term (the null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the model with a quadratic term is superior).
2. Comparing the adjusted R² of the model with a quadratic term with that of the model without it.
3. Plotting the residuals versus the fitted values for the model with a quadratic term.
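Step 2 (the adjusted R² comparison) is easy to run with `summary()`. Here is a minimal self-contained sketch that re-simulates the same data as above and compares the two models:

```r
# re-create the simulated data from above
set.seed(2)
age = rnorm(n = 100, mean = 50, sd = 10)
ldl = rnorm(n = 100, mean = 120, sd = 35)
sbp = 0.01*age^2 - 3*log(age) + 2*sqrt(ldl) + rnorm(100, 90, 10)

model1 = lm(sbp ~ age + ldl)
model2 = lm(sbp ~ age + I(age^2) + ldl)

# the model with the quadratic term should have a higher adjusted R-squared
summary(model1)$adj.r.squared
summary(model2)$adj.r.squared
```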

```r
# adding age-squared
model2 = lm(sbp ~ age + I(age^2) + ldl)
# note the use of the I() function, since age^2 alone has another meaning in R

# checking our solution
#1. comparing models with ANOVA and getting the p-value
anova(model1, model2)['Pr(>F)'][2,] # 4.983351e-05

#3. plotting residuals vs fitted values for the new model
plot(model2, 1)
```

From these results, we can conclude that adding age² fixed the violation of the linearity assumption since:

1. The p-value of the ANOVA is 4.983e-05 (< 0.05), which means that the model containing age² fits the data better.
2. The adjusted R² of the model with age² is higher.
3. The non-linear pattern in the residuals vs fitted values plot disappeared.
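As a side note (not part of the original walkthrough): the same quadratic model can also be specified with `poly()`; with `raw = TRUE`, it is equivalent to using `I(age^2)`, since both produce the same design matrix:

```r
# re-create the simulated data from above
set.seed(2)
age = rnorm(n = 100, mean = 50, sd = 10)
ldl = rnorm(n = 100, mean = 120, sd = 35)
sbp = 0.01*age^2 - 3*log(age) + 2*sqrt(ldl) + rnorm(100, 90, 10)

m_I    = lm(sbp ~ age + I(age^2) + ldl)
m_poly = lm(sbp ~ poly(age, 2, raw = TRUE) + ldl)

# the two parameterizations give identical fitted values
all.equal(fitted(m_I), fitted(m_poly))
```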

```r
# adding ldl-squared
model3 = lm(sbp ~ age + I(age^2) + ldl + I(ldl^2))

# checking our solution
#1. comparing models with ANOVA and getting the p-value
anova(model2, model3)['Pr(>F)'][2,] # 0.9640057
```

These results suggest that adding ldl² does not improve the model, since:

1. The p-value of the ANOVA is > 0.05, which means that the model containing ldl² does not fit the data better than model2.
2. The adjusted R² of the model with ldl² is approximately the same as that of model2.

## Solution #2: Adding an interaction term

Since non-linearity may be due to an interaction between predictors, we can try adding an interaction term between age and ldl:

```r
model4 = lm(sbp ~ age + ldl + age:ldl)

# checking our solution
#1. comparing models with ANOVA and getting the p-value
anova(model1, model4)['Pr(>F)'][2,] # 0.3097294

#3. plotting residuals vs fitted values for the new model
plot(model4, 1)
```

These results suggest that adding an interaction term does not fix the violation of the linearity assumption, since the model with the interaction:

- Did not fit the data better (ANOVA p-value > 0.05).
- Does not have a higher adjusted R².
- Still shows a non-linear pattern in the residuals vs fitted values plot.

(For more information, see: Why and When to Include Interactions? and Interpret Interactions in Linear Regression)

## Solution #3: Transforming a predictor variable

Next, we can try using a log or a square root transformation of the predictors age and/or ldl.

The advantage of a log transformation is that the model remains interpretable, so this is what we are going to use here (for more information, see: Interpret Log Transformations in Linear Regression):
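To illustrate the interpretability point with a small sketch (the data here are simulated for the illustration, not taken from the article): when a predictor enters the model as log(x), its coefficient divided by 100 approximates the change in the outcome associated with a 1% increase in x.

```r
# sketch: interpreting a log-transformed predictor (illustrative data)
set.seed(1)
x = runif(100, min = 50, max = 150)
y = 3*log(x) + rnorm(100)

m = lm(y ~ log(x))
b = coef(m)[["log(x)"]]

# a 1% increase in x is associated with a change of about b/100 in y
b / 100
```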

### Log transforming age:

```r
model5 = lm(sbp ~ log(age) + ldl)

# checking our solution

# plotting residuals vs fitted values for the new model
plot(model5, 1)
```

The results show that replacing age with log(age) does not fix the violation of the linearity assumption since:

- It did not improve the model’s adjusted R².
- It did not eliminate the non-linear pattern in the residuals vs fitted values plot.

### Log transforming ldl:

```r
model6 = lm(sbp ~ age + log(ldl))

# checking our solution

# plotting residuals vs fitted values for the new model
plot(model6, 1)
```

Replacing ldl with log(ldl) also does not fix the violation of the linearity assumption, since:

- It only provided a tiny improvement in the model’s adjusted R².
- It did not eliminate the non-linear pattern in the residuals vs fitted values plot.

## Solution #4: Categorizing a numeric predictor

When other solutions fail, we can handle non-linearity by categorizing a numeric predictor according to cutpoints that make sense theoretically.

Here, for the sake of demonstration, we will not spend time choosing ideal cutpoints; we will simply categorize both age and ldl by dividing each into 3 categories:
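For reference, `cut()` also accepts explicit cutpoints when theory suggests them. The breaks and labels below are hypothetical, chosen only to illustrate the syntax:

```r
# hypothetical cutpoints for age (illustration only; not from the article)
age_example = c(35, 47, 52, 68)
age_cat_example = cut(age_example,
                      breaks = c(0, 45, 60, Inf),
                      labels = c("young", "middle-aged", "older"))

# by default intervals are right-closed: (0,45], (45,60], (60,Inf]
as.character(age_cat_example)
```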

### Categorizing age

```r
# divide age into 3 categories
age_cat = cut(age, breaks = 3)

model7 = lm(sbp ~ age_cat + ldl)

# checking our solution
#2. comparing the adjusted R-squared
summary(model7)$adj.r.squared
```

Since the adjusted R² improved, we will keep the categorical version of age and try categorizing ldl as well:

### Categorizing ldl

```r
# divide ldl into 3 categories
ldl_cat = cut(ldl, breaks = 3)

model8 = lm(sbp ~ age_cat + ldl_cat)

# checking our solution
#2. comparing the adjusted R-squared
summary(model8)$adj.r.squared
```