In a regression model, consider including the interaction between 2 variables when:
- They have large main effects.
- The effect of one changes for various subgroups of the other.
- The interaction has been proven in previous studies.
- You want to explore new hypotheses.
Below we will explore each of these points in detail, but first let’s start with why we need to study interactions in the first place.
Why include an interaction term?
A model without interactions assumes that the effect of each predictor on the outcome is independent of other predictors in the model.
We say that 2 variables interact when one influences the effect of the other. In this case, their main effects (the separate effect of each of them on the outcome) should no longer be considered in isolation as it doesn’t make sense anymore to interpret the effect of one while holding the other constant.
So a linear regression equation should be changed from:
Y = β0 + β1X1 + β2X2 + ε
to:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
And if the interaction term is statistically significant (associated with a p-value < 0.05), then:
β3 can be interpreted as the change in the effect of X1 on Y for each 1-unit increase in X2 (and vice versa).
(For more information, see: Interpret Interactions in Linear Regression, and how to code a linear regression model with interaction in R)
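As a rough illustration of this interpretation, the sketch below fits the interaction model by ordinary least squares on simulated data. The coefficient values, sample size, and noise level are assumptions invented for the example, and NumPy is used only as a convenient way to solve the least-squares problem.

```python
import numpy as np

# Simulated data: the "true" coefficients below are invented for illustration.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # β0, β1, β2, β3
noise = rng.normal(scale=0.5, size=n)
y = (beta_true[0] + beta_true[1] * x1 + beta_true[2] * x2
     + beta_true[3] * x1 * x2 + noise)

# Design matrix: intercept, both main effects, and the product term X1*X2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b3 = beta_hat


def effect_of_x1(x2_value):
    """Estimated change in Y per 1-unit increase in X1, at a given X2."""
    return b1 + b3 * x2_value


# The effect of X1 is no longer one number: it shifts by b3 per unit of X2.
for value in (0.0, 1.0, 2.0):
    print(f"X2 = {value}: effect of X1 = {effect_of_x1(value):.2f}")
```

In this setup the effect of X1 at X2 = 0 is just β1, which is why a main effect should not be read in isolation once the product term is in the model.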
When you include an interaction between 2 independent variables X1 and X2, DO NOT remove the main effects of X1 and X2 from the model, even if their p-values are larger than 0.05 (i.e. their effects are not statistically significant).
An interaction can also occur between 3 or more variables making the situation much more complex, but for practical purposes and in most real-world situations, you won’t have to deal with such complexities.
So we’ve established that when an interaction exists between 2 variables, you wouldn’t want to miss out on it. On the other hand, including all possible interactions for all predictors in your model will make it both uninterpretable and statistically flawed (see below).
Next we discuss how to choose which interaction terms to include in your regression model.
When to include an interaction term?
Consider including an interaction term between 2 variables:
1. When they have large main effects
Variables that have a large influence on the outcome are more likely to have a statistically significant interaction with other factors that influence this outcome.
Because alcohol is known to be an important factor that increases the risk of liver cirrhosis, when studying the effect of other factors on cirrhosis, many studies consider the interaction between them and alcohol. [Chevillotte et al., Corrao et al., Stroffolini et al.]
2. When the effect of one changes for various subgroups of the other
This change can be:
- an opposite effect for each subgroup of the other variable
- or the same effect with different intensity between subgroups of the other variable
In general, you should study the interaction between 2 variables whenever you suspect that a change in one variable will increase (or decrease) the effectiveness of another one in the model.
Here are a few signs that a variable has an influence on the effect of another one:
- Whenever you need to know the value of the one variable in order to estimate the effect of the other.
- Whenever you cannot separate in your mind the effects of both variables (e.g. the effects of genetic and environmental factors on the risk of developing cancer).
- Whenever a certain treatment is beneficial for some category of patients and not for others.
Gastric bypass surgery is beneficial for extremely obese individuals (BMI > 40) and not for those who are just overweight (25 < BMI < 30). So when studying the effect of this type of surgery on the risk of mortality, including its interaction with BMI is a reasonable decision:
Mortality = Surgery + BMI + Surgery × BMI
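To make the idea of an effect that changes across subgroups concrete, here is a small arithmetic sketch. The two coefficients are hypothetical numbers chosen only so that the sign of the surgery effect flips as BMI increases. They are not estimates from any study, and the outcome scale (a generic risk score where negative means lower risk) is also an assumption.

```python
# Hypothetical coefficients, invented for illustration (NOT clinical estimates).
B_SURGERY = 2.0        # surgery coefficient: the effect at BMI = 0, not meaningful alone
B_INTERACTION = -0.07  # change in the surgery effect per 1-unit increase in BMI


def surgery_effect(bmi):
    """Effect of surgery on the (hypothetical) risk score at a given BMI."""
    return B_SURGERY + B_INTERACTION * bmi


for bmi in (27, 45):
    effect = surgery_effect(bmi)
    direction = "lower risk" if effect < 0 else "no benefit"
    print(f"BMI = {bmi}: surgery effect = {effect:+.2f} ({direction})")
```

With an interaction in the model, asking for "the" effect of surgery has no single answer: it has to be evaluated at a specific BMI.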
3. When the importance of the interaction has already been proven in previous studies
A literature review will help you spot important, and sometimes less intuitive, interactions.
A meta-analysis showed that cigarette smoking interacts with hepatitis B and C infections on the risk of liver cancer. So whenever we want to study the effect of chronic hepatitis on liver cancer, we should include smoking as a main effect and as an interaction with hepatitis infections:
Liver Cancer = Smoking + Hepatitis + Smoking × Hepatitis
4. When you want to explore new hypotheses
So far we discussed how to choose which interactions to include in your model BEFORE even looking at the data. And this is a good thing because, in general, it is always better to develop your hypotheses before looking at your data in order to avoid multiple testing, which inflates the risk of false positive results.
For a statistical significance threshold of 5% (i.e. in cases where we consider results with p-values < 0.05 statistically significant), if you test 20 interactions that have no real effect, then on average, 1 of them will have a statistically significant coefficient JUST BY CHANCE. Therefore it would be wrong, from a statistical point of view, to test all possible interactions for all predictors in the model.
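The arithmetic behind this can be checked in a few lines. The sketch assumes that none of the 20 interactions has a real effect and that the 20 tests are independent, which is a simplification.

```python
alpha = 0.05   # significance threshold
n_tests = 20   # number of interactions tested, all assumed to be truly null

# Expected number of false positives across the 20 tests.
expected_false_positives = alpha * n_tests

# Probability of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {p_at_least_one:.2f}")
```

Under these assumptions, the expected count is 1 and the chance of seeing at least one spurious "significant" interaction is about 64%.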
However, choosing which interactions to include based only on theory is limited by our intuition and our understanding of the problem which can be very narrow in some cases.
So here are 3 options to select interactions based on data while avoiding multiple testing:
i. Test all possible interactions using a single global test
With this option you test all the interactions at once with a single global test (based on the Wald statistic). [see Regression Modeling Strategies by Frank Harrell]
Here, no matter how many predictors you have, the number of tests to run will be = 1.
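A minimal sketch of such a global test for a linear model is shown below. It uses simulated data with no true interactions, and all variable names, sample sizes, and coefficients are assumptions made for the example. It computes the Wald statistic for the hypothesis that every pairwise interaction coefficient is zero and converts it to an F-based p-value. This is one way to set up the test, not necessarily the exact procedure described in the cited book.

```python
import itertools

import numpy as np
from scipy import stats

# Simulated data with main effects only (no true interactions).
rng = np.random.default_rng(1)
n, p = 500, 4
X_main = rng.normal(size=(n, p))
y = 1.0 + X_main @ np.array([0.5, -0.3, 0.8, 0.0]) + rng.normal(size=n)

# Build all p(p-1)/2 pairwise product terms.
pairs = list(itertools.combinations(range(p), 2))
X_inter = np.column_stack([X_main[:, i] * X_main[:, j] for i, j in pairs])
X_full = np.column_stack([np.ones(n), X_main, X_inter])

# Ordinary least squares fit and the estimated covariance of the coefficients.
beta_hat, *_ = np.linalg.lstsq(X_full, y, rcond=None)
resid = y - X_full @ beta_hat
df_resid = n - X_full.shape[1]
sigma2 = resid @ resid / df_resid
cov_beta = sigma2 * np.linalg.inv(X_full.T @ X_full)

# Wald statistic for H0: all interaction coefficients are zero.
idx = np.arange(1 + p, X_full.shape[1])  # columns holding the product terms
b_int = beta_hat[idx]
wald = b_int @ np.linalg.solve(cov_beta[np.ix_(idx, idx)], b_int)

# In a linear model, Wald / q follows an F distribution under H0.
q = len(idx)
f_stat = wald / q
p_value = stats.f.sf(f_stat, q, df_resid)

print(f"{q} interactions tested at once: F = {f_stat:.2f}, p = {p_value:.3f}")
```

A single p-value comes out regardless of how many product terms are in the model. If it is not significant, the interactions can be dropped as a group without looking at them one by one.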
ii. For each predictor, test all of its possible interactions with 1 test
For each predictor worthy of consideration, test all its interactions with a single test.
For p predictors, the number of tests to run will be = p.
iii. Test each possible interaction alone
The last option is to run a statistical test for each possible interaction alone for all variables in the model. But, in order to avoid the multiple testing problem, you can:
- Lower the threshold of statistical significance: for instance, decide which interactions to keep based on a p-value < 0.01 instead of 0.05.
- Or split your dataset into 2 subsets: Using the first subset, test each possible interaction, find which ones are statistically significant, and then retest these using the other subset of your data. Because you will be throwing away half of your data, this approach is only beneficial in cases where data collection is cheap both in terms of time and money.
In this case, for p predictors, the number of tests to run will be = p(p-1)/2.
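To compare how the number of tests grows under the 3 options, the sketch below counts them for a small set of predictors. The predictor names are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical predictor names, used only to enumerate the pairs.
predictors = ["age", "bmi", "smoking", "alcohol", "sex"]
p = len(predictors)

# Option iii tests every unordered pair of predictors separately.
pairwise = list(combinations(predictors, 2))

n_tests = {
    "i. one global test": 1,
    "ii. one test per predictor": p,
    "iii. one test per pair": len(pairwise),
}

for option, count in n_tests.items():
    print(f"{option}: {count}")
```

The pairwise count grows quadratically with p, so the multiple testing problem in option iii gets worse quickly as predictors are added.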
Variables that are correlated with each other don’t have a higher chance of interacting with each other in a model. Interaction means that the effect of one on the outcome depends on the other, while correlation only means that the 2 variables tend to vary together in a linear fashion. The latter says nothing about the effect of each on the outcome Y. [Source: Clinical Prediction Models – by Ewout Steyerberg]
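The distinction can be illustrated with two simulated scenarios. All coefficients, the correlation level, and the sample size are assumptions chosen for the example. In the first scenario the predictors are strongly correlated but do not interact, and in the second they are independent but do interact.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000


def fit_with_interaction(x1, x2, y):
    """OLS fit of y on 1, x1, x2 and x1*x2; returns the 4 coefficients."""
    X = np.column_stack([np.ones(len(y)), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Scenario A: correlated predictors (correlation about 0.8), NO interaction.
x1_a = rng.normal(size=n)
x2_a = 0.8 * x1_a + 0.6 * rng.normal(size=n)
y_a = 1 + 2 * x1_a + 3 * x2_a + rng.normal(size=n)
corr_a = np.corrcoef(x1_a, x2_a)[0, 1]
b3_a = fit_with_interaction(x1_a, x2_a, y_a)[3]

# Scenario B: independent predictors, true interaction coefficient of 1.5.
x1_b = rng.normal(size=n)
x2_b = rng.normal(size=n)
y_b = 1 + 2 * x1_b + 3 * x2_b + 1.5 * x1_b * x2_b + rng.normal(size=n)
corr_b = np.corrcoef(x1_b, x2_b)[0, 1]
b3_b = fit_with_interaction(x1_b, x2_b, y_b)[3]

print(f"A: correlation = {corr_a:.2f}, estimated interaction = {b3_a:.2f}")
print(f"B: correlation = {corr_b:.2f}, estimated interaction = {b3_b:.2f}")
```

In scenario A the estimated interaction coefficient stays near zero despite the strong correlation, while in scenario B it is close to the simulated value even though the predictors are essentially uncorrelated.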
References
- Harrell FE Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer; 2015.
- Falissard B. Analysis of Questionnaire Data with R. 1st ed. Chapman and Hall/CRC; 2011.
- James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. 1st ed. (corrected 7th printing, 2017). Springer; 2013.
- Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2009 ed. Springer; 2008.