# Data Analysis

## 5 Variable Transformations to Improve Your Regression Model

In this article, we will discuss how you can use the following transformations to build better regression models:

- Log transformation
- Square root transformation
- Polynomial transformation
- Standardization
- Centering by subtracting the mean

Compared to fitting a model using variables in their raw form, transforming them can help: Make the model’s coefficients more interpretable. Meet the model’s …

## Interpret Interactions in Linear Regression

For a linear regression model with interaction: Y = β0 + β1 X1 + β2 X2 + β3 X1X2 The coefficient of the interaction term (β3) is the increase in the effect of X1 for a 1-unit change in X2, and vice versa. For example: Suppose we used linear regression to study the effect of physical …
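The excerpt's equation can be sketched with simulated data (all numbers below are made up for illustration): fitting Y = β0 + β1 X1 + β2 X2 + β3 X1X2 by ordinary least squares recovers β3, the change in the effect of X1 per 1-unit change in X2.

```python
import numpy as np

# Simulate data from Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 with known
# (made-up) coefficients, then recover them with ordinary least squares.
rng = np.random.default_rng(0)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
b0, b1, b2, b3 = 1.0, 2.0, -1.0, 0.5
Y = b0 + b1 * X1 + b2 * X2 + b3 * X1 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix: intercept, X1, X2, and the product term X1*X2
X = np.column_stack([np.ones(n), X1, X2, X1 * X2])
coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)

# coefs[3] estimates b3: the change in the effect of X1 per 1-unit change in X2
print(coefs.round(2))
```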

## Interpret the Linear Regression Intercept

For a linear regression model: Y = β0 + β1 X The linear regression intercept β0 is the predicted value of the outcome Y when the predictor X equals zero. As an example, we will try to interpret the intercept β0 = 78.66 in the following linear regression model: Heart Rate = 78.66 + 2.94 …
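As a minimal sketch of the model quoted in the excerpt (the predictor's name is truncated there, so X is left generic):

```python
# Model from the excerpt: Heart Rate = 78.66 + 2.94 * X
# (the predictor is truncated in the text, so X is kept generic here).
def predict_heart_rate(x):
    return 78.66 + 2.94 * x

# When X = 0, the predicted heart rate is exactly the intercept.
print(predict_heart_rate(0))  # 78.66
```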

## Using the 4 D-Separation Rules to Study a Causal Association

Suppose we want to study whether coffee causes cancer, which we will represent as follows: Randomizing people to either consume coffee or not for many years in order to study its effect on cancer is neither ethical nor practical. So we have to use an observational design, where we would have to deal with bias and …

## What is a Good R-Squared Value? [Based on Real-World Data]

I analyzed the content of 43,110 randomly chosen research papers from PubMed to learn more about R-squared. Specifically, I wanted to answer the following questions: What is a good value for R-squared? What is a low value for R-squared? Is a higher R-squared always better? Is a low R-squared necessarily bad? Let’s start with a …

## Statistical Power: What It Is and How It Is Used in Practice

Statistical power is a measure of study efficiency, calculated before conducting the study to estimate the chance of discovering a true effect rather than obtaining a false negative result, or, worse, overestimating the effect by mistaking noise in the data for a real effect. Here are 5 seemingly different, but actually similar, ways of describing statistical power: Definition …

## Identify Variable Types in Statistics (with Examples)

Here’s a table that summarizes the types of variables:

| Type | Subtype | Description |
|---|---|---|
| Quantitative (a.k.a. Numerical) | Continuous | Consists of numerical values that can be measured but not counted. |
| Quantitative (a.k.a. Numerical) | Discrete | Consists of numerical values that can be counted. |
| Qualitative (a.k.a. Categorical) | Ordinal | Consists of text or labels that have a logical order. |
| Qualitative (a.k.a. Categorical) | Nominal | Consists of text or labels that … |

## Assess Variable Importance in Linear and Logistic Regression

In this article, we will be concerned with the following question: Given a regression model, which of the predictors X1, X2, X3, etc. has the most influence on the outcome Y? In general, assessing the relative importance of predictors by directly comparing their (unstandardized) regression coefficients is not a good idea because: For numerical predictors: …

## Interpret Poisson Regression Coefficients

The Poisson regression coefficient β associated with a predictor X is the expected change, on the log scale, in the outcome Y per unit change in X. So holding all other variables in the model constant, increasing X by 1 unit (or going from 1 level to the next) multiplies the rate of Y by …
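A quick numeric illustration of this rule, using a made-up coefficient:

```python
import math

# Hypothetical Poisson coefficient on the log scale (made-up value).
beta = 0.25

# Holding the other variables constant, a 1-unit increase in X
# multiplies the expected rate of Y by exp(beta).
rate_ratio = math.exp(beta)
print(round(rate_ratio, 3))  # 1.284
```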

## Regression Tree vs Linear Regression

Both the linear regression and the regression tree models take as input 1 or more predictors (Xi), and their goal is to explain the relationship between these predictors and the outcome (Y). For simplicity, we will consider the case of modeling Y using only 1 predictor X. Linear regression tries to find the equation of the line that …

## How to Report a Random Forest Model

In this article we discuss:

- How to report the use of a random forest model
- How to report the results of a random forest model

1. How to report the use of a random forest model

The following information should be mentioned in the METHODS section of your research paper: The reason why you chose …

## How to Report a Chi-Square Test

The 3 main types of Chi-square tests are:

- Chi-square goodness-of-fit test: used to compare the distribution of a categorical variable (with more than 2 levels) to a hypothetical distribution.
- Chi-square homogeneity test: used to test whether 2 groups (coming from 2 different samples) have the same distribution regarding a certain categorical variable.
- Chi-square independence test: …

## How to Report a Chi-Square Independence Test

The Chi-square independence test is used to test whether 2 categorical variables, each having 2 or more categories, are dependent or independent of each other. The null hypothesis H0 states that the 2 variables are independent (i.e. knowing the value of one does not tell us anything about the other). The alternative hypothesis H1 states …
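As an illustrative sketch, such a test can be run in Python with `scipy.stats.chi2_contingency` on a made-up 2×2 table (the counts below are invented, not from the article):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (made-up counts):
# rows = exposure yes/no, columns = outcome yes/no.
table = np.array([[30, 70],
                  [20, 80]])

# H0: the two categorical variables are independent.
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
```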

## How to Report a Chi-Square Goodness-of-Fit Test

A Chi-square goodness-of-fit test is used to evaluate the distribution of a categorical variable with more than 2 levels/categories against a theoretical one. Simply put, we would like to compare the counts in each level of this categorical variable with the counts that we expect to find given some hypothesis. Therefore, the objective of this …

## How to Report the Shapiro-Wilk Test

The Shapiro-Wilk test is a statistical test used to check if a continuous variable follows a normal distribution. The null hypothesis (H0) states that the variable is normally distributed, and the alternative hypothesis (H1) states that the variable is NOT normally distributed. So after running this test: If p ≤ 0.05: then the null hypothesis …
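As an illustrative sketch, the test can be run in Python with `scipy.stats.shapiro` (the simulated sample below is an assumption for illustration, not data from the article):

```python
import numpy as np
from scipy import stats

# Draw 200 values from a normal distribution and test them.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=200)

# H0: the sample comes from a normal distribution.
stat, p = stats.shapiro(sample)
print(f"W = {stat:.3f}, p = {p:.3f}")
```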

## How to Report Stepwise Regression

In this article we will discuss: How to report the use of stepwise regression How to report the output of stepwise regression 1. Reporting the use of stepwise regression The following information should be mentioned in the METHODS section of the research paper: the outcome variable (i.e. the dependent variable Y) the predictor variables (i.e. …

## Interpret Linear Regression Coefficients

For a simple linear regression model: Y = β0 + β1 X + ε The linear regression coefficient β1 associated with a predictor X is the expected difference in the outcome Y when comparing 2 groups that differ by 1 unit in X. Another common interpretation of β1 is: β1 is the expected change in the outcome …

## Interpret the Logistic Regression Intercept

Here’s the equation of a logistic regression model with 1 predictor X: log(P / (1-P)) = β0 + β1 X, where P is the probability of having the outcome and P / (1-P) is the odds of the outcome. The easiest way to interpret the intercept is when X = 0: When X = 0, the intercept β0 is the log of the …

## Interpret Logistic Regression Coefficients [For Beginners]

The logistic regression coefficient β associated with a predictor X is the expected change in log odds of having the outcome per unit change in X. So increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by eβ. Here’s an example: Suppose we …
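The "multiplies the odds by e^β" rule can be checked numerically with made-up coefficients:

```python
import math

# Made-up coefficients for illustration: log-odds = b0 + b1 * X.
b0, b1 = -1.0, 0.7

def odds(x):
    # odds = P / (1 - P) = exp(log-odds)
    return math.exp(b0 + b1 * x)

# Increasing X by 1 multiplies the odds by exp(b1),
# regardless of the starting value of X.
ratio = odds(3) / odds(2)
print(round(ratio, 2))  # equals exp(0.7), about 2.01
```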

## When to Use Regression Analysis (With Examples)

Regression analysis can be used to:

- Estimate the effect of an exposure on a given outcome
- Predict an outcome using known factors
- Balance dissimilar groups
- Model and replace missing data
- Detect unusual records

In the text below, we will go through these points in greater detail and provide a real-world example of each. 1. Estimate …

## Deviance in the Context of Logistic Regression

Deviance is a number that measures the goodness of fit of a logistic regression model. Think of it as the distance from the perfect fit — a measure of how much your logistic regression model deviates from an ideal model that perfectly fits the data. Deviance ranges from 0 to infinity. The smaller the number …
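For binary outcomes, deviance can be computed as -2 times the model's log-likelihood (the saturated model's log-likelihood is 0, so this is the "distance" from a perfect fit). A toy calculation with invented outcomes and predicted probabilities:

```python
import math

# Made-up observed outcomes and predicted probabilities for illustration.
y = [1, 0, 1, 1, 0]            # observed binary outcomes
p = [0.8, 0.3, 0.6, 0.9, 0.2]  # model's predicted probabilities

# Bernoulli log-likelihood of the model on this data.
log_lik = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
              for yi, pi in zip(y, p))

# Deviance: -2 * log-likelihood; 0 would mean a perfect fit.
deviance = -2 * log_lik
print(round(deviance, 3))  # 2.838
```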

## Why and When to Include Interactions in a Regression Model

In a regression model, consider including the interaction between 2 variables when:

- They have large main effects.
- The effect of one changes for various subgroups of the other.
- The interaction has been proven in previous studies.
- You want to explore new hypotheses.

Below we will explore each of these points in detail, but first let’s …

## Understand Regularized Regression

Regularized regression is a regression method with an additional constraint designed to deal with a large number of independent variables (a.k.a. predictors). It does so by imposing a larger penalty on unimportant ones, thus shrinking their coefficients towards zero. The objective of regularization is to end up with a model: That is simple and interpretable. …
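A minimal sketch of one regularized method, ridge regression, in closed form (the data and penalty values below are made up for illustration, not the article's code): the heavier the penalty, the more the coefficients shrink toward zero.

```python
import numpy as np

# Simulated data: only the first of 5 predictors truly matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^-1 X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

small = ridge(X, y, lam=0.01)
large = ridge(X, y, lam=100.0)

# A heavier penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(small) > np.linalg.norm(large))  # True
```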

## Understand Best Subset Selection

When building a regression model, removing irrelevant variables will make the model easier to interpret and less prone to overfit the data, therefore more generalizable. Best subset selection is a method that aims to find the subset of independent variables (Xi) that best predict the outcome (Y) and it does so by considering all possible …

## Square Root Transformation: A Beginner’s Guide

A square root transformation can be useful for:

- Normalizing a skewed distribution
- Transforming a non-linear relationship between 2 variables into a linear one
- Reducing heteroscedasticity of the residuals in linear regression
- Focusing on visualizing certain parts of your data

Below we will discuss each of these points in detail. When you apply a square root …
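The "normalizing a skewed distribution" point can be illustrated with simulated data (the exponential sample below is an assumption chosen for illustration):

```python
import numpy as np

# Simulate a right-skewed variable, then apply the square root transform.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)
transformed = np.sqrt(skewed)

def skewness(x):
    # Sample skewness: third standardized moment.
    return np.mean(((x - x.mean()) / x.std()) ** 3)

# The transformed values have a much smaller skew than the raw values.
print(round(skewness(skewed), 2), round(skewness(transformed), 2))
```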

## What is an Acceptable Value for VIF? (With References)

Most research papers consider a VIF (Variance Inflation Factor) > 10 as an indicator of multicollinearity, but some choose a more conservative threshold of 5 or even 2.5. So what threshold should YOU choose? When choosing a VIF threshold, you should take into account that multicollinearity is a lesser problem when dealing with a large …
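For reference, VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the remaining predictors. A sketch with simulated data (x2 is deliberately built to be collinear with x1; all values are made up):

```python
import numpy as np

# Simulated predictors: x2 is strongly collinear with x1, x3 is independent.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress predictor j on the other predictors (plus an intercept).
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

# x1 and x2 get a high VIF; x3 stays near 1.
print([round(vif(X, j), 1) for j in range(3)])
```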

## Coefficient of Alienation, Non-determination and Tolerance

When running a linear regression model: Y = β0 + β1 × X1 + β2 × X2 + ε One way of determining if the independent variables X1 and X2 were useful in predicting Y is to calculate the coefficient of determination R2. R2 measures the proportion of variability in Y that can be explained …

## Correlation vs Collinearity vs Multicollinearity

Here’s a table that summarizes the differences between correlation, collinearity and multicollinearity:

| | Correlation | Collinearity | Multicollinearity |
|---|---|---|---|
| Definition | Correlation refers to the linear relationship between 2 variables | Collinearity refers to a problem when running a regression model where 2 or more independent variables (a.k.a. predictors) have a strong linear relationship | Multicollinearity is a special case of … |

## Standardized vs Unstandardized Regression Coefficients

Here’s a table that summarizes the similarities and differences between standardized and unstandardized linear regression coefficients:

| | Unstandardized β | Standardized β |
|---|---|---|
| Definition | Unstandardized coefficients are obtained after running a regression model on variables measured in their original scales | Standardized coefficients are obtained after running a regression model on standardized variables (i.e. rescaled variables that have … |

## Relationship Between r and R-squared in Linear Regression

R-squared is a measure of how well a linear regression model fits the data. It can be interpreted as the proportion of variance of the outcome Y explained by the linear regression model. It is a number between 0 and 1 (0 ≤ R2 ≤ 1). The closer its value is to 1, the more …
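In simple linear regression (one predictor), R-squared equals the square of Pearson's correlation r between X and Y. A quick numerical check with simulated data (the data is made up for illustration):

```python
import numpy as np

# Simulate a simple linear relationship with noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Pearson's correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

# Fit y = b0 + b1*x and compute R-squared from the residuals.
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
r_squared = 1 - resid.var() / y.var()

# In simple linear regression, r^2 and R-squared coincide.
print(round(r ** 2 - r_squared, 10))  # 0.0
```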

## Understand the F-statistic in Linear Regression

When running a multiple linear regression model: Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + … + ε The F-statistic provides a way to globally test whether ANY of the independent variables X1, X2, X3, X4… is related to the outcome Y. For a significance level of 0.05: If …
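The F-statistic can be computed from R-squared, the number of predictors k, and the sample size n via F = (R2/k) / ((1 - R2)/(n - k - 1)). A quick calculation with made-up numbers:

```python
# Made-up summary numbers for illustration: R-squared, number of
# predictors, and sample size.
r2, k, n = 0.40, 4, 100

# Overall F-statistic for testing whether any predictor is related to Y.
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 2))  # 15.83
```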

## Residual Standard Deviation/Error: Guide for Beginners

The residual standard deviation (or residual standard error) is a measure used to assess how well a linear regression model fits the data. (The other measure to assess this goodness of fit is R2). But before we discuss the residual standard deviation, let’s try to assess the goodness of fit graphically. Consider the following linear …

## 7 Tricks to Get Statistically Significant p-Values

The objective of this article is to show that getting a p-value below the threshold of 0.05 is not that hard, and that a statistically significant result proves nothing by itself. Study results should always be interpreted in the context of:

- The study design
- The effect size
- The size of the sample
- The results of …

## P-Value: A Simple Explanation for Non-Statisticians

A p-value is a probability, a number between 0 and 1, calculated after running a statistical test on data. A small p-value (< 0.05 in general) means that the observed results would be very unusual if they were due to chance alone. It is a way of telling if the results obtained should be taken …

## Which Variables Should You Include in a Regression Model?

When building a linear or logistic regression model, you should consider including:

- Variables that are already proven in the literature to be related to the outcome
- Variables that can be considered the cause of the exposure, the outcome, or both
- Interaction terms of variables that have large main effects

However, you should watch out …

## Understand Forward and Backward Stepwise Regression

Running a regression model with many variables, including irrelevant ones, will lead to a needlessly complex model. Stepwise regression is a way of selecting important variables to get a simple and easily interpretable model. Below we discuss how forward and backward stepwise selection work, their advantages and limitations, and how to deal with them. Forward …