Regression analysis can be used to:
- estimate the effect of an exposure on a given outcome
- predict an outcome using known factors
- balance dissimilar groups
- model and replace missing data
- detect unusual records
In the text below, we will go through these points in greater detail and provide a real-world example of each.
1. Estimate the effect of an exposure on a given outcome
Regression can model linear and non-linear associations between an exposure (or treatment) and an outcome of interest. It can also model the relationship between several exposures and an outcome at the same time, even when these exposures interact with each other.
Example: Exploring the relationship between Body Mass Index (BMI) and all-cause mortality
Berrington de Gonzalez et al. used a Cox regression model to estimate the association between BMI and mortality among 1.46 million white adults.
As expected, they found that the risk of mortality increases as BMI rises further above the normal range.
The takeaway message is that regression analysis enabled them to quantify that association while adjusting for smoking, alcohol consumption, physical activity, educational level and marital status — all potential confounders of the relationship between BMI and mortality.
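To make the idea of adjustment concrete, here is a minimal sketch on simulated data. It is not the study's method: it swaps the Cox model for ordinary least squares, and the variable names (`smoking`, `bmi`, `outcome`) and all effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data (all values are made up for illustration).
smoking = rng.binomial(1, 0.3, size=n)              # confounder
bmi = 25 + 2 * smoking + rng.normal(0, 3, size=n)   # exposure, related to the confounder
# The outcome depends on both; the true BMI coefficient is 1.0.
outcome = 1.0 * bmi + 5.0 * smoking + rng.normal(0, 2, size=n)


def ols(columns, y):
    """Fit y = b0 + b1*x1 + ... by least squares and return the coefficients."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


unadjusted = ols([bmi], outcome)[1]            # BMI only
adjusted = ols([bmi, smoking], outcome)[1]     # BMI plus the confounder

print(f"BMI coefficient, unadjusted: {unadjusted:.2f}")
print(f"BMI coefficient, adjusted for smoking: {adjusted:.2f}")
```

Because smoking is related to both BMI and the outcome in this simulation, the unadjusted coefficient absorbs part of the smoking effect, while the adjusted coefficient should land close to the true value of 1.0.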
2. Predict an outcome using known factors
A regression model can also be used to predict things like stock prices, weather conditions, the risk of getting a disease, mortality, etc. based on a set of known predictors (also called independent variables).
Example: Predicting malaria in South Africa using seasonal climate data
Kim et al. used Poisson regression to develop a malaria prediction model using climate data such as temperature and precipitation in South Africa.
The model performed best with short-term predictions.
The important thing to notice here is how much complexity a regression model can handle. In this example, for instance, the model had to be flexible enough to account for non-linear and delayed associations between malaria transmission and climate factors.
This is a recurring theme with predictive models: we start with a simple model, then keep adding complexity until we get a satisfactory result, which is why the process is called model building.
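As a rough illustration of a count model with a delayed predictor, the sketch below fits a Poisson regression to simulated weekly data in which case counts respond to the previous week's temperature. It is far simpler than the model described above: a single one-week lag, a linear effect on the log scale, invented coefficients, and a hand-written fitting routine (`fit_poisson` is a hypothetical helper; in practice a statistics library would normally do this step).

```python
import numpy as np

rng = np.random.default_rng(1)
n_weeks = 600

# Simulated weekly data (invented for illustration): a standardized
# temperature series, and case counts that respond to it one week later.
temperature = rng.normal(0, 1, size=n_weeks)
temp_lag1 = temperature[:-1]                 # temperature in week t-1
true_rate = np.exp(0.5 + 0.3 * temp_lag1)    # expected cases in week t
cases = rng.poisson(true_rate)               # observed cases in weeks 1..n_weeks-1


def fit_poisson(X, y, n_iter=25):
    """Fit a log-link Poisson regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                       # linear predictor
        mu = np.exp(eta)                     # fitted mean counts
        z = eta + (y - mu) / mu              # working response
        XtW = X.T * mu                       # Poisson working weights equal mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta


X = np.column_stack([np.ones(len(temp_lag1)), temp_lag1])
beta = fit_poisson(X, cases)

# Predict next week's expected count from the latest observed temperature.
next_week = np.exp(beta[0] + beta[1] * temperature[-1])
print("intercept, slope:", np.round(beta, 2))
print("predicted cases next week:", round(float(next_week), 2))
```

Adding more lags, or non-linear terms for temperature, would mean adding columns to `X`, which is the kind of step-by-step extension that model building involves.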
3. Balance dissimilar groups
Proving that a relationship exists between some independent variable X and an outcome Y does not mean much if this result cannot be generalized beyond your sample.
In order for your results to generalize well, the sample you’re working with has to resemble the population from which it was drawn. If it doesn’t, you can use regression to balance some important characteristics in the sample to make it representative of the population of interest.
Another case where you would want to balance dissimilar groups is a randomized controlled trial, where the objective is to compare the outcome between the group that received the intervention and another group that serves as a control (or reference). For the comparison to make sense, the two groups must have similar characteristics.
Example: Evaluating how sleep quality is affected by sleep hygiene education and behavioral therapy
Nishinoue et al. conducted a randomized controlled trial to compare sleep quality between two groups of participants:
- The treatment group: Participants received sleep hygiene education and behavioral therapy
- The control group: Participants received sleep hygiene education only
A generalized linear model (a generalized form of linear regression) was used to:
- Evaluate how sleep quality changed between groups
- Adjust for age, gender, job title, smoking and drinking habits, body-mass index, and mental health to make the groups more comparable
4. Model and replace missing data
Modeling missing data is an important part of data analysis, especially when non-response rates are high (and therefore many values are missing), as in telephone surveys.
Before imputing missing data, you must first determine:
- How important the variables that have missing values are in your analysis
- The percentage of missing values
- Whether the values are missing at random
Based on this analysis, you can then choose to:
- Delete observations with missing values
- Replace missing data with the column’s mean or median
- Use a regression model to replace missing data
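The three options above can be compared side by side in a small sketch. The data are simulated, the variables (`age`, `income`) and the 30% missingness rate are invented for illustration, and the values are made to go missing completely at random, which real data often do not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Simulated data (invented for illustration): income depends on age,
# and about 30% of income values are missing completely at random.
age = rng.uniform(20, 70, size=n)
income = 10 + 0.8 * age + rng.normal(0, 5, size=n)
missing = rng.random(n) < 0.3
income_obs = np.where(missing, np.nan, income)

# Option 1: delete observations with missing values.
complete_age = age[~missing]
complete_income = income_obs[~missing]

# Option 2: replace missing values with the mean of the observed values.
mean_imputed = np.where(missing, complete_income.mean(), income_obs)

# Option 3: predict missing values from a regression of income on age,
# fitted on the complete cases (np.polyfit returns slope first for deg=1).
slope, intercept = np.polyfit(complete_age, complete_income, deg=1)
reg_imputed = np.where(missing, intercept + slope * age, income_obs)


def rmse(imputed):
    """Root mean squared error of the filled-in values against the truth."""
    return float(np.sqrt(np.mean((imputed[missing] - income[missing]) ** 2)))


print(f"rows kept after deletion: {len(complete_age)} of {n}")
print(f"RMSE, mean imputation: {rmse(mean_imputed):.2f}")
print(f"RMSE, regression imputation: {rmse(reg_imputed):.2f}")
```

The comparison against the true values is only possible because the data are simulated. A single regression fill-in like this also treats the predicted values as if they had been observed, so it understates uncertainty; multiple imputation methods address that by generating several plausible values for each missing entry.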
Example: Using multiple imputation to replace missing data in a medical study
Beynon et al. studied the prognostic role of alcohol and smoking at diagnosis of head and neck cancer.
Before building their statistical model, they noticed that 11 variables (including smoking status, alcohol intake and other covariates) had missing values, so they used a technique called MICE (Multiple Imputation by Chained Equations), which runs regression models under the hood to replace missing values.
5. Detect unusual records
Regression models, alongside other statistical techniques, can be used to model what “normal data” should look like, the purpose being to detect values that deviate from this norm. These are referred to as “anomalies” or “outliers” in the data.
Most applications of anomaly detection are outside the healthcare domain. It is typically used to detect financial fraud, atypical online behavior of website visitors, anomalies in machine performance in a factory, etc.
Example: Detecting critical cases of patients undergoing heart surgery
Presbitero et al. used a time-varying autoregressive model (along with other statistical measures) to flag abnormal cases of patients undergoing heart surgery using data on their blood measurements.
Their ultimate goal is to prevent patient deaths by using this early warning algorithm to make early intervention possible.
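The general idea of flagging records that a regression model predicts poorly can be sketched as follows. This is a deliberately simplified stand-in and not the authors' algorithm: it fits a fixed-coefficient AR(1) model rather than a time-varying one, it runs on a simulated series with one injected jump, and the cutoff of 4 robust standard deviations is an arbitrary choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Simulated measurement series (invented for illustration) following an
# AR(1) process, with one abrupt jump injected at a known position.
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + rng.normal(0, 1)
anomaly_at = 200
series[anomaly_at] += 10.0

# Fit x[t] = c + phi * x[t-1] by least squares.
prev, curr = series[:-1], series[1:]
X = np.column_stack([np.ones(n - 1), prev])
(c, phi), *_ = np.linalg.lstsq(X, curr, rcond=None)

# One-step-ahead prediction errors: large ones mark unusual points.
residuals = curr - (c + phi * prev)

# Robust scale estimate: median absolute deviation, rescaled so that it
# matches the standard deviation for normally distributed errors.
mad = np.median(np.abs(residuals - np.median(residuals)))
scale = 1.4826 * mad

# +1 maps residual positions back to positions in the original series.
flagged = np.where(np.abs(residuals) > 4 * scale)[0] + 1

print("estimated phi:", round(float(phi), 2))
print("flagged time points:", flagged.tolist())
```

The point right after the jump may be flagged as well, because the model's prediction for it is based on the abnormal value. The injected jump also pulls the fitted coefficient away from the value used in the simulation, which is one reason to use a robust scale estimate for the cutoff.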