Data Analysis – QUANTIFYING HEALTH

4 Ways to Handle a Categorical Predictor With Many Levels

A regression model that includes a categorical predictor with many levels might not contain enough observations in each category to detect a reasonable effect size with reasonable power. Even then, the large number of dummy variables created could be difficult to interpret. In this article, we present 4 ways to deal with …

Understand Linear Regression Assumptions

Data Analysis

The 4 assumptions of linear regression in order of importance are: 1. Linearity 1.1. Explanation The relationship between each predictor Xi and the outcome Y should be linear. 1.2. How to check the linearity assumption Instead of checking the relationship between each predictor Xi and the outcome Y in a multivariable model, we can plot …

Weighted Regression: An Intuitive Introduction

Data Analysis

Weighted regression (a.k.a. weighted least squares) is a regression model where each observation is given a certain weight that tells the software how important it should be in the model fit. Weighted regression can be used to: 1. Weighted regression to handle non-constant variance of error terms Linear regression assumes that the error terms have …

How to Report Interaction Effects in Regression

Data Analysis

For a linear regression model: Y = β0 + β1X + β2Z + β3XZ + ε If the coefficient of the interaction term β3 is statistically significant, then there is evidence of an interaction between X and Z. This means that the effect of X on the outcome Y is different for different sub-categories of Z, …
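As a rough illustration of the model above, the following Python sketch shows that a non-zero β3 makes the effect of a 1 unit increase in X depend on the value of Z. The coefficient values and the helper names predict and effect_of_x are invented for this example and do not come from the article.

```python
# Hypothetical coefficients for Y = b0 + b1*X + b2*Z + b3*X*Z
# (made-up values, for illustration only)
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 1.5


def predict(x, z):
    """Predicted outcome, ignoring the error term."""
    return b0 + b1 * x + b2 * z + b3 * x * z


def effect_of_x(z):
    """Change in predicted Y for a 1 unit increase in X, at a fixed Z."""
    return predict(1, z) - predict(0, z)


effect_at_z0 = effect_of_x(0)  # equals b1
effect_at_z1 = effect_of_x(1)  # equals b1 + b3
print(effect_at_z0, effect_at_z1)
```

When b3 is set to 0, the two effects coincide, which is the no-interaction case.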

Interpret Log Transformations in Linear Regression

Data Analysis

The following table summarizes how to interpret a linear regression model with logarithmic transformations: Transformation Model Interpretation No transformations Y = β0 + β1 X A 1 unit increase in X is associated with an average change of β1 units in Y. Log-transformed predictor Y = β0 + β1 log(X) A 1% increase in X …

Why Add & How to Interpret a Quadratic Term in Regression

Data Analysis

Linear regression assumes that the relationship between the predictor X and the outcome Y is linear. If this assumption is not met, linear regression will be a poor fit to the data (as shown in the figure below). In this case, adding a quadratic term to the regression equation may help model the relationship between …

Interpret Linear Regression Output in R

Data Analysis

Here’s an example of linear regression in R: 1. Linear regression equation The formula \(y \sim x + z\) corresponds to the regression equation: \(y = β_0 + β_1x + β_2z\) where: 2. Residuals The residuals are the difference between the regression line that we fitted (using the predictors x and z) and the real …

When Does Correlation Imply Causation?

Data Analysis

Short answer: Correlation implies causation when alternative explanations of the relationship between the correlated variables (such as confounding and bias) are removed (by appropriately modifying the study design) or controlled for (by adjusting for them in the statistical analysis). Explanation: Causation means that changing the treatment X for a person will affect the probability of …

Correlation Coefficient vs Regression Coefficient

Data Analysis

Both the correlation and regression coefficients rely on the hypothesis that the data can be represented by a straight line. They are similar in many ways, but they serve different purposes. Here’s a table that summarizes the similarities and differences between the correlation coefficient, r, and the regression coefficient, β: Correlation coefficient: r Regression coefficient: …

An Example of Using Marginal and Conditional Distributions

Data Analysis

The conditional distribution of a variable, for example heights, is the distribution of heights given the value of another variable, for example gender. Plotting the conditional distribution of heights given gender is a way of visualizing the relationship between the 2 variables. The marginal distribution of heights is the distribution of heights for everybody, independent …

Why Divide Sample Standard Deviation by n-1?

Data Analysis

The problem The standard deviation is a measurement of the spread of the data — it is the average distance of the data from the mean. We are rarely interested in the amount of variation in our sample: the sample standard deviation is only useful as an approximation of the population standard deviation. When our …
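To make the two divisors concrete, here is a small Python sketch that computes the standard deviation of a sample once dividing by n and once dividing by n-1. The sample values are made up for illustration; the standard library functions statistics.pstdev and statistics.stdev are used only as a cross-check.

```python
import math
import statistics

# Made-up sample, for illustration only
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(sample)
mean = sum(sample) / n
sum_of_squares = sum((x - mean) ** 2 for x in sample)

# Dividing by n treats the sample as if it were the whole population
sd_divide_by_n = math.sqrt(sum_of_squares / n)

# Dividing by n-1 gives the usual sample standard deviation
sd_divide_by_n_minus_1 = math.sqrt(sum_of_squares / (n - 1))

print(sd_divide_by_n, statistics.pstdev(sample))
print(sd_divide_by_n_minus_1, statistics.stdev(sample))
```

The n-1 version is always slightly larger, and the gap shrinks as the sample size grows.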

How to Handle Missing Data in Practice: Guide for Beginners

Data Analysis

Handling missing data involves 2 steps: Determining the type of missing data, which can be: Missing completely at random (MCAR) Missing at random (MAR) Missing not at random (MNAR) Choosing a method to deal with these missing values, such as: Deleting variables (i.e. columns) that contain missing values Deleting observations (i.e. rows) whose values are …

5 Variable Transformations to Improve Your Regression Model

Data Analysis

In this article, we will discuss how you can use the following transformations to build better regression models: Log transformation Square root transformation Polynomial transformation Standardization Centering by subtracting the mean Compared to fitting a model using variables in their raw form, transforming them can help: Make the model’s coefficients more interpretable. Meet the model’s …

Interpret Interactions in Linear Regression

Data Analysis

For a linear regression model with interaction: Y = β0 + β1 X1 + β2 X2 + β3 X1X2 The coefficient of the interaction term (β3) is the increase in effectiveness of X1 for a 1 unit change in X2, and vice-versa. For example: Suppose we used linear regression to study the effect of physical …

Interpret the Linear Regression Intercept

Data Analysis

For a linear regression model: Y = β0 + β1 X The linear regression intercept β0 is the predicted value of the outcome Y when the predictor X equals zero. As an example, we will try to interpret the intercept β0 = 78.66 in the following linear regression model: Heart Rate = 78.66 + 2.94 …

Using the 4 D-Separation Rules to Study a Causal Association

Data Analysis

Suppose we want to study whether coffee causes cancer, which we will represent as follows: Randomizing people to either consume coffee or not for many years in order to study its effect on cancer is neither ethical nor practical. So we have to use an observational design, where we would have to deal with bias and …

What is a Good R-Squared Value? [Based on Real-World Data]

Data Analysis, Lessons from Research Papers

I analyzed the content of 43,110 randomly chosen research papers from PubMed to learn more about R-squared. Specifically, I wanted to answer the following questions: What is a good value for R-squared? What is a low value for R-squared? Is a higher R-squared always better? Is a low R-squared necessarily bad? Let’s start with a …

Statistical Power: What It Is and How It Is Used in Practice

Data Analysis

Statistical power is a measure of study efficiency, calculated before conducting the study to estimate the chance of discovering a true effect rather than obtaining a false negative result, or worse, overestimating the effect by detecting the noise in the data. Here are 5 seemingly different, but actually similar, ways of describing statistical power: Definition …

Identify Variable Types in Statistics (with Examples)

Data Analysis

Here’s a table that summarizes the types of variables: Types of variables Quantitative(a.k.a. Numerical) Qualitative(a.k.a. Categorical) Continuous Discrete Ordinal Nominal Consists of numerical values that can be measured but not counted. Consists of numerical values that can be counted. Consists of text or labels that have a logical order. Consists of text or labels that …

Assess Variable Importance in Linear and Logistic Regression

Data Analysis

In this article, we will be concerned with the following question: Given a regression model, which of the predictors X1, X2, X3, etc. has the most influence on the outcome Y? In general, assessing the relative importance of predictors by directly comparing their (unstandardized) regression coefficients is not a good idea because: For numerical predictors: …

Interpret Poisson Regression Coefficients

Data Analysis

The Poisson regression coefficient β associated with a predictor X is the expected change, on the log scale, in the outcome Y per unit change in X. So holding all other variables in the model constant, increasing X by 1 unit (or going from 1 level to the next) multiplies the rate of Y by …

Regression Tree vs Linear Regression

Data Analysis

Both the linear regression and the regression tree models take as input 1 or more predictors (Xi) and their goal is to explain their relationship with the outcome (Y). For simplicity, we will consider the case of modeling Y using only 1 predictor X. Linear regression tries to find the equation of the line that …

How to Report a Random Forest Model

Data Analysis

In this article we discuss: How to report the use of a random forest model How to report the results of a random forest model 1. How to report the use of a random forest model The following information should be mentioned in the METHODS section of your research paper: The reason why you chose …

How to Report a Chi-Square Test

Data Analysis

The 3 main types of Chi-square tests are: Chi-square goodness-of-fit test: used to compare the distribution of a categorical variable (with more than 2 levels) to a hypothetical distribution. Chi-square homogeneity test: used to test whether 2 groups (coming from 2 different samples) have the same distribution regarding a certain categorical variable. Chi-square independence test: …

How to Report a Chi-Square Independence Test

Data Analysis

The Chi-square independence test is used to test whether 2 categorical variables, each having 2 or more categories, are dependent or independent of each other. The null hypothesis H0 states that the 2 variables are independent (i.e. knowing the value of one does not tell us anything about the other). The alternative hypothesis H1 states …

How to Report a Chi-Square Goodness-of-Fit Test

Data Analysis

A Chi-square goodness-of-fit test is used to evaluate the distribution of a categorical variable with more than 2 levels/categories against a theoretical one. Simply put, we would like to compare the counts in each level of this categorical variable with the counts that we expect to find given some hypothesis. Therefore, the objective of this …

How to Report the Shapiro-Wilk Test

Data Analysis

The Shapiro-Wilk test is a statistical test used to check if a continuous variable follows a normal distribution. The null hypothesis (H0) states that the variable is normally distributed, and the alternative hypothesis (H1) states that the variable is NOT normally distributed. So after running this test: If p ≤ 0.05: then the null hypothesis …

How to Report Stepwise Regression

Data Analysis

In this article we will discuss: 1. Reporting the use of stepwise regression The following information should be mentioned in the METHODS section of the research paper: (For an easy explanation of the stopping rule and a step-by-step description of how stepwise selection works, I recommend my other article: Understand Forward and Backward Stepwise Regression) …

Interpret Linear Regression Coefficients

Data Analysis

For a simple linear regression model: Y = β0 + β1 X + ε The linear regression coefficient β1 associated with a predictor X is the expected difference in the outcome Y when comparing 2 groups that differ by 1 unit in X. Another common interpretation of β1 is: β1 is the expected change in the outcome …

Interpret the Logistic Regression Intercept

Data Analysis

Here’s the equation of a logistic regression model with 1 predictor X: log(P / (1-P)) = β0 + β1X, where P is the probability of having the outcome and P / (1-P) is the odds of the outcome. The easiest way to interpret the intercept is when X = 0: When X = 0, the intercept β0 is the log of the …

Interpret Logistic Regression Coefficients [For Beginners]

Data Analysis

The logistic regression coefficient β associated with a predictor X is the expected change in log odds of having the outcome per unit change in X. So increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by e^β. Here’s an example: Suppose we …
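The link between the coefficient and the odds can be checked numerically. The sketch below assumes a logistic model with a single predictor and made-up values for the intercept and the coefficient (they are not taken from the article), then confirms that moving X up by 1 unit multiplies the odds by e^β.

```python
import math

# Hypothetical intercept and coefficient (made-up values, for illustration only)
beta0 = -1.2
beta = 0.38


def odds(x):
    """Odds of the outcome at X = x, i.e. exp of the log odds."""
    return math.exp(beta0 + beta * x)


def probability(x):
    """Probability of the outcome at X = x, recovered from the odds."""
    return odds(x) / (1 + odds(x))


# Ratio of the odds at X = 3 to the odds at X = 2
odds_ratio = odds(3) / odds(2)

print(odds_ratio, math.exp(beta))
print(probability(2), probability(3))
```

The ratio is the same for any starting value of X, while the change in probability is not, which is why the coefficient is usually reported on the odds scale.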

When to Use Regression Analysis (With Examples)

Data Analysis

Regression analysis can be used to: In the text below, we will go through these points in greater detail and provide a real-world example of each. 1. Estimate the effect of an exposure on a given outcome Regression can model linear and non-linear associations between an exposure (or treatment) and an outcome of interest. It …

Deviance in the Context of Logistic Regression

Data Analysis

Deviance is a number that measures the goodness of fit of a logistic regression model. Think of it as the distance from the perfect fit — a measure of how much your logistic regression model deviates from an ideal model that perfectly fits the data. Deviance ranges from 0 to infinity. The smaller the number …

Why and When to Include Interactions in a Regression Model

Data Analysis

In a regression model, consider including the interaction between 2 variables when: They have large main effects. The effect of one changes for various subgroups of the other. The interaction has been proven in previous studies. You want to explore new hypotheses. Below we will explore each of these points in detail, but first let’s …

Understand Regularized Regression

Data Analysis

Regularized regression is a regression method with an additional constraint designed to deal with a large number of independent variables (a.k.a. predictors). It does so by imposing a larger penalty on unimportant ones, thus shrinking their coefficients towards zero. The objective of regularization is to end up with a model: Other methods that also deal …

Understand Best Subset Selection

Data Analysis

When building a regression model, removing irrelevant variables will make the model easier to interpret and less prone to overfit the data, therefore more generalizable. Best subset selection is a method that aims to find the subset of independent variables (Xi) that best predict the outcome (Y) and it does so by considering all possible …

Square Root Transformation: A Beginner’s Guide

Data Analysis

A square root transformation can be useful for: Normalizing a skewed distribution Transforming a non-linear relationship between 2 variables into a linear one Reducing heteroscedasticity of the residuals in linear regression Focusing on visualizing certain parts of your data Below we will discuss each of these points in detail. When you apply a square root …

What is an Acceptable Value for VIF? (With References)

Data Analysis

Most research papers consider a VIF (Variance Inflation Factor) > 10 as an indicator of multicollinearity, but some choose a more conservative threshold of 5 or even 2.5. So what threshold should YOU choose? When choosing a VIF threshold, you should take into account that multicollinearity is a lesser problem when dealing with a large …
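As a minimal sketch of how a VIF compares with these thresholds, the Python snippet below computes it for a model with only 2 predictors, using the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the other. With only 2 predictors, that R² equals their squared correlation. The data and the helper name pearson_r are made up for illustration.

```python
import math

# Made-up values for two predictors, for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(ss_a * ss_b)


r = pearson_r(x1, x2)

# With 2 predictors, R² from regressing x1 on x2 is r squared
vif = 1 / (1 - r ** 2)

# Compare against the thresholds mentioned in the text
for threshold in (2.5, 5, 10):
    print(threshold, vif > threshold)
```

With more than 2 predictors, each R² would instead come from regressing one predictor on all the others.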

Coefficient of Alienation, Non-determination and Tolerance

Data Analysis

When running a linear regression model: Y = β0 + β1 × X1 + β2 × X2 + ε One way of determining if the independent variables X1 and X2 were useful in predicting Y is to calculate the coefficient of determination R². R² measures the proportion of variability in Y that can be explained …

Correlation vs Collinearity vs Multicollinearity

Data Analysis

Here’s a table that summarizes the differences between correlation, collinearity and multicollinearity: Correlation Collinearity Multicollinearity Definition Correlation refers to the linear relationship between 2 variables Collinearity refers to a problem when running a regression model where 2 or more independent variables (a.k.a. predictors) have a strong linear relationship Multicollinearity is a special case of …

Standardized vs Unstandardized Regression Coefficients

Data Analysis

Here’s a table that summarizes the similarities and differences between standardized and unstandardized linear regression coefficients: Unstandardized β Standardized β Definition Unstandardized coefficients are obtained after running a regression model on variables measured in their original scales Standardized coefficients are obtained after running a regression model on standardized variables (i.e. rescaled variables that have …

Relationship Between r and R-squared in Linear Regression

Data Analysis

R-squared is a measure of how well a linear regression model fits the data. It can be interpreted as the proportion of variance of the outcome Y explained by the linear regression model. It is a number between 0 and 1 (0 ≤ R² ≤ 1). The closer its value is to 1, the more …
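For a linear regression with a single predictor, R-squared equals the square of the correlation coefficient r between X and Y. The Python sketch below checks this on a small made-up dataset by fitting the least-squares line by hand and computing R-squared as 1 minus the ratio of the residual to the total sum of squares. The data and variable names are assumptions made for this example.

```python
import math

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Least-squares line for a single predictor
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - ss_residual / syy

# Pearson correlation coefficient between x and y
r = sxy / math.sqrt(sxx * syy)

print(r_squared, r ** 2)
```

This equality holds for a single predictor; with several predictors, R-squared is instead the squared correlation between the observed and the fitted values of Y.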

Understand the F-statistic in Linear Regression

Data Analysis

When running a multiple linear regression model: Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + … + ε The F-statistic provides us with a way for globally testing if ANY of the independent variables X1, X2, X3, X4… is related to the outcome Y. For a significance level of 0.05: If …

Residual Standard Deviation/Error: Guide for Beginners

Data Analysis

The residual standard deviation (or residual standard error) is a measure used to assess how well a linear regression model fits the data. (The other measure to assess this goodness of fit is R².) But before we discuss the residual standard deviation, let’s try to assess the goodness of fit graphically. Consider the following linear …

7 Tricks to Get Statistically Significant p-Values

Data Analysis

The objective of this article is to prove that getting a p-value below the threshold of 0.05 is not that hard, and that a statistically significant result proves nothing by itself. Study results should always be interpreted in the context of: The study design The effect size The size of the sample The results of …

P-Value: A Simple Explanation for Non-Statisticians

Data Analysis

A p-value is a probability, a number between 0 and 1, calculated after running a statistical test on data. A small p-value (< 0.05 in general) means that the observed results would be very unusual if they were due to chance alone. It is a way of telling if the results obtained should be taken …

Which Variables Should You Include in a Regression Model?

Data Analysis

When building a linear or logistic regression model, you should consider including: However, you should watch out for: Below we discuss each of these points in detail. 1. Selecting variables based on background knowledge Advantages of using background knowledge to select variables How to choose variables based on background knowledge? You can find out whether …

Understand Forward and Backward Stepwise Regression

Data Analysis

Running a regression model with many variables, including irrelevant ones, will lead to a needlessly complex model. Stepwise regression is a way of selecting important variables to get a simple and easily interpretable model. Below we discuss how forward and backward stepwise selection work, their advantages and limitations, and how to deal with them. Forward …
