# Data Analysis

## 4 Ways to Handle a Categorical Predictor With Many Levels

A regression model that includes a categorical predictor with many levels may not contain enough observations in each category to detect a reasonable effect size with reasonable power; even then, the large number of dummy variables created can be difficult to interpret. In this article, we present 4 ways to deal with …

## Understand Linear Regression Assumptions

The 4 assumptions of linear regression, in order of importance, are: 1. Linearity. Explanation: the relationship between each predictor Xi and the outcome Y should be linear. How to check the linearity assumption: instead of checking the relationship between each predictor Xi and the outcome Y in a multivariable model, we can plot …

## Weighted Regression: An Intuitive Introduction

Weighted regression (a.k.a. weighted least squares) is a regression model where each observation is given a certain weight that tells the software how important it should be in the model fit. Weighted regression can be used to: 1. Handle non-constant variance of error terms. Linear regression assumes that the error terms have …

## How to Report Interaction Effects in Regression

For a linear regression model: Y = β0 + β1X + β2Z + β3XZ + ε If the coefficient of the interaction term β3 is statistically significant, then there is evidence of an interaction between X and Z. This means that the effect of X on the outcome Y is different for different sub-categories of Z, …
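As a minimal sketch (with made-up coefficients, not taken from any study), we can simulate data from such a model in Python and check that the estimated effect of X on Y changes with Z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Z = rng.integers(0, 2, size=n).astype(float)  # binary Z for simplicity

# Hypothetical true coefficients: Y = 2 + 1.5*X + 0.5*Z + 3*X*Z + noise
Y = 2 + 1.5 * X + 0.5 * Z + 3 * X * Z + rng.normal(scale=0.1, size=n)

# Least-squares fit with an explicit interaction column X*Z
A = np.column_stack([np.ones(n), X, Z, X * Z])
b0, b1, b2, b3 = np.linalg.lstsq(A, Y, rcond=None)[0]

# The effect of X on Y is b1 when Z = 0 and b1 + b3 when Z = 1
print(round(b1, 1))       # ~1.5
print(round(b1 + b3, 1))  # ~4.5
```

Because the interaction coefficient b3 is non-zero, no single number summarizes "the effect of X": it must be reported separately for each level of Z.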

## Interpret Log Transformations in Linear Regression

The following table summarizes how to interpret a linear regression model with logarithmic transformations:

| Transformation | Model | Interpretation |
|---|---|---|
| No transformations | Y = β0 + β1X | A 1 unit increase in X is associated with an average change of β1 units in Y. |
| Log-transformed predictor | Y = β0 + β1 log(X) | A 1% increase in X … |
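The "1% increase" reading for a log-transformed predictor comes from the fact that β1·log(1.01X) − β1·log(X) = β1·log(1.01) ≈ β1/100. A quick numeric check, using an arbitrary coefficient for illustration:

```python
import math

b1 = 5.0  # hypothetical coefficient on log(X)
x = 40.0  # arbitrary starting value of X

# Predicted change in Y when X increases by 1%
change = b1 * (math.log(1.01 * x) - math.log(x))

print(round(change, 4))               # 0.0498
print(round(b1 * math.log(1.01), 4))  # 0.0498: same value, independent of x
```

The change is roughly b1/100 = 0.05 no matter what value X starts from, which is what makes the percentage interpretation convenient.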

## Why Add & How to Interpret a Quadratic Term in Regression

Linear regression assumes that the relationship between the predictor X and the outcome Y is linear. If this assumption is not met, linear regression will be a poor fit to the data (as shown in the figure below). In this case, adding a quadratic term to the regression equation may help model the relationship between …
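A minimal illustration (with a made-up curved relationship): fitting a straight line to quadratic data leaves large residuals, while adding the quadratic term captures the curve exactly:

```python
import numpy as np

x = np.linspace(-3, 3, 50)
y = 1 + 2 * x + 0.5 * x ** 2  # made-up quadratic relationship, no noise

lin = np.polyfit(x, y, 1)   # straight-line fit: misspecified
quad = np.polyfit(x, y, 2)  # fit including a quadratic term: correct form

lin_resid = y - np.polyval(lin, x)
quad_resid = y - np.polyval(quad, x)

print(float(np.abs(lin_resid).max()) > 1)                # True: large residuals remain
print(round(float(np.abs(quad_resid).max()), 6) == 0.0)  # True: quadratic fit is exact
```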

## Interpret Linear Regression Output in R

Here’s an example of linear regression in R: 1. Linear regression equation. The formula \(y \sim x + z\) corresponds to the regression equation \(y = β_0 + β_1x + β_2z\), where: 2. Residuals. The residuals are the difference between the regression line that we fitted (using the predictors x and z) and the real …

## When Does Correlation Imply Causation?

Short answer: Correlation implies causation when alternative explanations of the relationship between the correlated variables (such as confounding and bias) are removed (by appropriately modifying the study design) or controlled for (by adjusting for them in the statistical analysis). Explanation: Causation means that changing the treatment X for a person will affect the probability of …

## Correlation Coefficient vs Regression Coefficient

Both the correlation and regression coefficients rely on the assumption that the data can be represented by a straight line. They are similar in many ways, but they serve different purposes. Here’s a table that summarizes the similarities and differences between the correlation coefficient, r, and the regression coefficient, β: Correlation coefficient: r Regression coefficient: …

## An Example of Using Marginal and Conditional Distributions

The conditional distribution of a variable, for example heights, is the distribution of heights given the value of another variable, for example gender. Plotting the conditional distribution of heights given gender is a way of visualizing the relationship between the 2 variables. The marginal distribution of heights is the distribution of heights for everybody, independent …

## Why Divide Sample Standard Deviation by n-1?

The problem: the standard deviation is a measure of the spread of the data — roughly, the average distance of the data points from the mean. We are rarely interested in the amount of variation in our sample: the sample standard deviation is only useful as an approximation of the population standard deviation. When our …
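A small simulation (with arbitrary parameters) shows why the n−1 divisor is used: dividing the sum of squared deviations by n systematically underestimates the population variance, while dividing by n−1 does not:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5            # small sample size
sigma2 = 4.0     # true population variance (sd = 2)

# Draw many small samples and average the two competing estimators
samples = rng.normal(0.0, 2.0, size=(100_000, n))
mean = samples.mean(axis=1, keepdims=True)
ss = ((samples - mean) ** 2).sum(axis=1)  # sum of squared deviations from the sample mean

biased = (ss / n).mean()          # dividing by n underestimates the variance
unbiased = (ss / (n - 1)).mean()  # dividing by n-1 is unbiased on average

print(round(biased, 1), round(unbiased, 1))  # ~3.2 vs ~4.0
```

The n divisor converges to σ²(n−1)/n = 3.2 here, while the n−1 divisor recovers the true variance of 4.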

## How to Handle Missing Data in Practice: Guide for Beginners

Handling missing data involves 2 steps: (1) determining the type of missing data, which can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR); and (2) choosing a method to deal with these missing values, such as deleting variables (i.e. columns) that contain missing values, or deleting observations (i.e. rows) whose values are …

## 5 Variable Transformations to Improve Your Regression Model

In this article, we will discuss how you can use the following transformations to build better regression models: Log transformation Square root transformation Polynomial transformation Standardization Centering by subtracting the mean Compared to fitting a model using variables in their raw form, transforming them can help: Make the model’s coefficients more interpretable. Meet the model’s …

## Interpret Interactions in Linear Regression

For a linear regression model with interaction: Y = β0 + β1 X1 + β2 X2 + β3 X1X2 The coefficient of the interaction term (β3) is the increase in the effect of X1 on Y for a 1 unit increase in X2, and vice versa. For example: Suppose we used linear regression to study the effect of physical …

## Interpret the Linear Regression Intercept

For a linear regression model: Y = β0 + β1 X The linear regression intercept β0 is the predicted value of the outcome Y when the predictor X equals zero. As an example, we will try to interpret the intercept β0 = 78.66 in the following linear regression model: Heart Rate = 78.66 + 2.94 …

## Using the 4 D-Separation Rules to Study a Causal Association

Suppose we want to study whether coffee causes cancer, which we will represent as follows: Randomizing people to either consume coffee or not for many years in order to study its effect on cancer is neither ethical nor practical. So we have to use an observational design, where we would have to deal with bias and …

## What is a Good R-Squared Value? [Based on Real-World Data]

I analyzed the content of 43,110 randomly chosen research papers from PubMed to learn more about R-squared. Specifically, I wanted to answer the following questions: What is a good value for R-squared? What is a low value for R-squared? Is a higher R-squared always better? Is a low R-squared necessarily bad? Let’s start with a …

## Statistical Power: What It Is and How It Is Used in Practice

Statistical power is a measure of study efficiency, calculated before conducting the study to estimate the chance of detecting a true effect rather than obtaining a false negative result or, worse, overestimating the effect by picking up noise in the data. Here are 5 seemingly different, but actually similar, ways of describing statistical power: Definition …

## Identify Variable Types in Statistics (with Examples)

Here’s a summary of the types of variables:

- Quantitative (a.k.a. numerical)
  - Continuous: consists of numerical values that can be measured but not counted.
  - Discrete: consists of numerical values that can be counted.
- Qualitative (a.k.a. categorical)
  - Ordinal: consists of text or labels that have a logical order.
  - Nominal: consists of text or labels that …

## Assess Variable Importance in Linear and Logistic Regression

In this article, we will be concerned with the following question: Given a regression model, which of the predictors X1, X2, X3, etc. has the most influence on the outcome Y? In general, assessing the relative importance of predictors by directly comparing their (unstandardized) regression coefficients is not a good idea because: For numerical predictors: …

## Interpret Poisson Regression Coefficients

The Poisson regression coefficient β associated with a predictor X is the expected change, on the log scale, in the outcome Y per unit change in X. So holding all other variables in the model constant, increasing X by 1 unit (or going from 1 level to the next) multiplies the rate of Y by …
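For example (using a made-up coefficient), a Poisson coefficient of β = 0.3 corresponds to a rate ratio of e^0.3 ≈ 1.35:

```python
import math

beta = 0.3  # hypothetical Poisson regression coefficient for a predictor X

# On the log scale, increasing X by 1 adds beta to log(rate);
# on the original scale, it multiplies the rate of Y by exp(beta)
rate_ratio = math.exp(beta)
print(round(rate_ratio, 2))  # 1.35: the rate of Y increases by ~35% per unit of X
```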

## Regression Tree vs Linear Regression

Both linear regression and regression tree models take 1 or more predictors (Xi) as input, and their goal is to explain the relationship between these predictors and the outcome (Y). For simplicity, we will consider the case of modeling Y using only 1 predictor X. Linear regression tries to find the equation of the line that …

## How to Report a Random Forest Model

In this article we discuss: How to report the use of a random forest model How to report the results of a random forest model 1. How to report the use of a random forest model The following information should be mentioned in the METHODS section of your research paper: The reason why you chose …

## How to Report a Chi-Square Test

The 3 main types of Chi-square tests are: Chi-square goodness-of-fit test: used to compare the distribution of a categorical variable (with more than 2 levels) to a hypothetical distribution. Chi-square homogeneity test: used to test whether 2 groups (coming from 2 different samples) have the same distribution regarding a certain categorical variable. Chi-square independence test: …

## How to Report a Chi-Square Independence Test

The Chi-square independence test is used to test whether 2 categorical variables, each having 2 or more categories, are dependent or independent of each other. The null hypothesis H0 states that the 2 variables are independent (i.e. knowing the value of one does not tell us anything about the other) The alternative hypothesis H1 states …
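A minimal sketch of running this test, using SciPy and a hypothetical 2×2 contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = two groups, columns = outcome yes/no
table = np.array([[30, 70],
                  [55, 45]])

chi2, p, dof, expected = chi2_contingency(table)
print(dof)       # (rows - 1) * (cols - 1) = 1
print(p < 0.05)  # True for this table: reject H0 of independence
```

When reporting the result, the test statistic, degrees of freedom, and p-value would all typically be included.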

## How to Report a Chi-Square Goodness-of-Fit Test

A Chi-square goodness-of-fit test is used to evaluate the distribution of a categorical variable with more than 2 levels/categories against a theoretical one. Simply put, we would like to compare the counts in each level of this categorical variable with the counts that we expect to find given some hypothesis. Therefore, the objective of this …

## How to Report the Shapiro-Wilk Test

The Shapiro-Wilk test is a statistical test used to check if a continuous variable follows a normal distribution. The null hypothesis (H0) states that the variable is normally distributed, and the alternative hypothesis (H1) states that the variable is NOT normally distributed. So after running this test: If p ≤ 0.05: then the null hypothesis …
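A quick sketch of the test in Python, using SciPy on simulated, clearly skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
skewed = rng.exponential(size=200)  # simulated, clearly non-normal variable

# H0: the variable is normally distributed
stat, p = stats.shapiro(skewed)
print(p <= 0.05)  # True: reject H0 for this strongly skewed sample
```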

## How to Report Stepwise Regression

In this article we will discuss: 1. Reporting the use of stepwise regression. The following information should be mentioned in the METHODS section of the research paper: (For an easy explanation of the stopping rule and a step-by-step description of how stepwise selection works, I recommend my other article: Understand Forward and Backward Stepwise Regression) …

## Interpret Linear Regression Coefficients

For a simple linear regression model: Y = β0 + β1 X + ε The linear regression coefficient β1 associated with a predictor X is the expected difference in the outcome Y when comparing 2 groups that differ by 1 unit in X. Another common interpretation of β1 is: β1 is the expected change in the outcome …

## Interpret the Logistic Regression Intercept

Here’s the equation of a logistic regression model with 1 predictor X: log(P / (1 − P)) = β0 + β1X, where P is the probability of having the outcome and P / (1 − P) is the odds of the outcome. The easiest way to interpret the intercept is to set X = 0: when X = 0, the intercept β0 is the log of the …

## Interpret Logistic Regression Coefficients [For Beginners]

The logistic regression coefficient β associated with a predictor X is the expected change in log odds of having the outcome per unit change in X. So increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by eβ. Here’s an example: Suppose we …
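A quick numeric check (with made-up coefficients): the ratio of the odds at X = x + 1 to the odds at X = x equals e^β:

```python
import math

beta0, beta1 = -1.0, 0.8  # hypothetical logistic regression coefficients

def probability(x):
    # Logistic model: log odds = beta0 + beta1 * x
    log_odds = beta0 + beta1 * x
    return 1 / (1 + math.exp(-log_odds))

def odds(x):
    p = probability(x)
    return p / (1 - p)

# Increasing X by 1 unit multiplies the odds by exp(beta1)
print(round(odds(1) / odds(0), 3))  # odds ratio from the model
print(round(math.exp(beta1), 3))    # same number: e^beta
```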

## When to Use Regression Analysis (With Examples)

Regression analysis can be used for several purposes. In the text below, we will go through these points in greater detail and provide a real-world example of each. 1. Estimate the effect of an exposure on a given outcome. Regression can model linear and non-linear associations between an exposure (or treatment) and an outcome of interest. It …

## Deviance in the Context of Logistic Regression

Deviance is a number that measures the goodness of fit of a logistic regression model. Think of it as the distance from the perfect fit — a measure of how much your logistic regression model deviates from an ideal model that perfectly fits the data. Deviance ranges from 0 to infinity. The smaller the number …
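As a small illustration with made-up binary outcomes and predicted probabilities, the deviance can be computed directly as −2 times the log-likelihood:

```python
import numpy as np

# Hypothetical observed outcomes and predicted probabilities from a logistic model
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.8, 0.2, 0.7, 0.9, 0.4])

# Deviance = -2 * log-likelihood; it is 0 only when every prediction is perfect
# (p = 1 wherever y = 1 and p = 0 wherever y = 0)
deviance = -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(round(deviance, 3))  # 2.838
```

Predictions closer to the observed 0/1 outcomes shrink the deviance toward 0; predictions near 0.5 (or on the wrong side) inflate it.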

## Why and When to Include Interactions in a Regression Model

In a regression model, consider including the interaction between 2 variables when: They have large main effects. The effect of one changes for various subgroups of the other. The interaction has been proven in previous studies. You want to explore new hypotheses. Below we will explore each of these points in detail, but first let’s …

## Understand Regularized Regression

Regularized regression is a regression method with an additional constraint designed to deal with a large number of independent variables (a.k.a. predictors). It does so by imposing a larger penalty on unimportant ones, thus shrinking their coefficients towards zero. The objective of regularization is to end up with a model: Other methods that also deal …

## Understand Best Subset Selection

When building a regression model, removing irrelevant variables will make the model easier to interpret and less prone to overfit the data, therefore more generalizable. Best subset selection is a method that aims to find the subset of independent variables (Xi) that best predict the outcome (Y) and it does so by considering all possible …

## Square Root Transformation: A Beginner’s Guide

A square root transformation can be useful for: Normalizing a skewed distribution Transforming a non-linear relationship between 2 variables into a linear one Reducing heteroscedasticity of the residuals in linear regression Focusing on visualizing certain parts of your data Below we will discuss each of these points in detail. When you apply a square root …

## What is an Acceptable Value for VIF? (With References)

Most research papers consider a VIF (Variance Inflation Factor) > 10 as an indicator of multicollinearity, but some choose a more conservative threshold of 5 or even 2.5. So what threshold should YOU choose? When choosing a VIF threshold, you should take into account that multicollinearity is a lesser problem when dealing with a large …
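A minimal sketch of computing VIF by hand on simulated data (VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the other predictors):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)  # nearly a copy of x1: strong collinearity
x3 = rng.normal(size=n)                  # independent of the others

def vif(target, others):
    # Regress one predictor on the others and compute 1 / (1 - R^2)
    A = np.column_stack([np.ones(len(target))] + others)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1 - (resid ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return 1 / (1 - r2)

print(vif(x1, [x2, x3]) > 10)   # True: collinear predictor gets a large VIF
print(vif(x3, [x1, x2]) < 2.5)  # True: independent predictor stays near 1
```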

## Coefficient of Alienation, Non-determination and Tolerance

When running a linear regression model: Y = β0 + β1 × X1 + β2 × X2 + ε One way of determining if the independent variables X1 and X2 were useful in predicting Y is to calculate the coefficient of determination R2. R2 measures the proportion of variability in Y that can be explained …
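A quick simulated check (with made-up coefficients) that R² and the coefficient of alienation (1 − R²) are complementary shares of the total variability:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x1, x2 = rng.normal(size=(2, n))
y = 1 + 0.8 * x1 - 0.5 * x2 + rng.normal(size=n)

# Fit Y = b0 + b1*X1 + b2*X2 by least squares
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

sse = (resid ** 2).sum()           # unexplained variability
sst = ((y - y.mean()) ** 2).sum()  # total variability

r2 = 1 - sse / sst      # coefficient of determination
alienation = sse / sst  # coefficient of alienation (non-determination)
print(abs(r2 + alienation - 1.0) < 1e-12)  # True: the two shares sum to 1
```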

## Correlation vs Collinearity vs Multicollinearity

Here’s a table that summarizes the differences between correlation, collinearity and multicollinearity:

| | Correlation | Collinearity | Multicollinearity |
|---|---|---|---|
| Definition | Correlation refers to the linear relationship between 2 variables. | Collinearity refers to a problem when running a regression model where 2 or more independent variables (a.k.a. predictors) have a strong linear relationship. | Multicollinearity is a special case of … |

## Standardized vs Unstandardized Regression Coefficients

Here’s a table that summarizes the similarities and differences between standardized and unstandardized linear regression coefficients:

| | Unstandardized β | Standardized β |
|---|---|---|
| Definition | Unstandardized coefficients are obtained after running a regression model on variables measured in their original scales. | Standardized coefficients are obtained after running a regression model on standardized variables (i.e. rescaled variables that have … |
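A small simulated check (with made-up coefficients) of the link between the two: the standardized slope equals the unstandardized slope times sd(X)/sd(Y):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
x = rng.normal(loc=50, scale=10, size=n)  # predictor on its original scale
y = 3 + 0.2 * x + rng.normal(size=n)      # hypothetical true model

b1, b0 = np.polyfit(x, y, 1)  # unstandardized slope and intercept

# Refit on z-scored (standardized) variables
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
b1_std, _ = np.polyfit(zx, zy, 1)

# Identity: standardized slope = unstandardized slope * sd(x) / sd(y)
print(abs(b1_std - b1 * x.std() / y.std()) < 1e-9)  # True
```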

## Relationship Between r and R-squared in Linear Regression

R-squared is a measure of how well a linear regression model fits the data. It can be interpreted as the proportion of variance of the outcome Y explained by the linear regression model. It is a number between 0 and 1 (0 ≤ R2 ≤ 1). The closer its value is to 1, the more …
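A quick simulated check of the identity that, in simple linear regression, R² equals the square of Pearson's r:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)  # made-up linear relationship with noise

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient

# R-squared from a simple linear regression of y on x
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(abs(r ** 2 - r2) < 1e-9)  # True: in simple regression, R² = r²
```

Note that this identity holds only for a single predictor; with multiple predictors, R² is instead the squared correlation between the observed and fitted values.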

## Understand the F-statistic in Linear Regression

When running a multiple linear regression model: Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + … + ε The F-statistic provides us with a way for globally testing if ANY of the independent variables X1, X2, X3, X4… is related to the outcome Y. For a significance level of 0.05: If …
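A minimal sketch of computing the F-statistic by hand on simulated data, using F = (R²/k) / ((1 − R²)/(n − k − 1)) with k predictors and made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))  # k = 3 predictors
y = 1 + X @ np.array([0.5, 0.0, -0.3]) + rng.normal(size=n)  # hypothetical model

# Least-squares fit with an intercept column
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

k = 3
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(f_stat > 1)  # in practice, compare f_stat to an F(k, n-k-1) critical value
```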

## Residual Standard Deviation/Error: Guide for Beginners

The residual standard deviation (or residual standard error) is a measure used to assess how well a linear regression model fits the data. (The other measure to assess this goodness of fit is R2). But before we discuss the residual standard deviation, let’s try to assess the goodness of fit graphically. Consider the following linear …

## 7 Tricks to Get Statistically Significant p-Values

The objective of this article is to show that getting a p-value below the threshold of 0.05 is not that hard, and that a statistically significant result, by itself, proves nothing. Study results should always be interpreted in the context of: the study design, the effect size, the size of the sample, the results of …

## P-Value: A Simple Explanation for Non-Statisticians

A p-value is a probability, a number between 0 and 1, calculated after running a statistical test on data. A small p-value (< 0.05 in general) means that the observed results would be very unusual if they were due to chance alone. It is a way of telling if the results obtained should be taken …

## Which Variables Should You Include in a Regression Model?

When building a linear or logistic regression model, you should consider including: However, you should watch out for: Below we discuss each of these points in detail. 1. Selecting variables based on background knowledge. Advantages of using background knowledge to select variables. How to choose variables based on background knowledge? You can find out whether …

## Understand Forward and Backward Stepwise Regression

Running a regression model with many variables, including irrelevant ones, will lead to a needlessly complex model. Stepwise regression is a way of selecting important variables to get a simple and easily interpretable model. Below we discuss how forward and backward stepwise selection work, their advantages and limitations, and how to deal with those limitations. Forward …