Data Analysis

Correlation Coefficient vs Regression Coefficient

Both the correlation and regression coefficients rely on the hypothesis that the data can be represented by a straight line. They are similar in many ways, but they serve different purposes. Here’s a table that summarizes the similarities and differences between the correlation coefficient, r, and the regression coefficient, β: Correlation coefficient: r Regression coefficient: …
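
As a rough illustration of how the two quantities relate in simple linear regression, the sketch below computes r and the slope β on a small made-up dataset (the values of `x` and `y` are hypothetical, not taken from the article) and checks the standard identity β = r · (s_y / s_x).

```python
import math

# Hypothetical toy data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation coefficient: unitless, always between -1 and 1.
r = sxy / math.sqrt(sxx * syy)

# Least-squares slope of y on x: expressed in units of y per unit of x.
beta = sxy / sxx

# Sample standard deviations, used to link the two coefficients.
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))

print(f"r = {r:.4f}, beta = {beta:.4f}, r * s_y / s_x = {r * s_y / s_x:.4f}")
```

Because β carries the units of the variables while r does not, rescaling `x` (for example from meters to centimeters) changes β but leaves r unchanged.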

An Example of Using Marginal and Conditional Distributions

The conditional distribution of a variable, for example heights, is the distribution of heights given the value of another variable, for example gender. Plotting the conditional distribution of heights given gender is a way of visualizing the relationship between the 2 variables. The marginal distribution of heights is the distribution of heights for everybody, independent …
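
A minimal sketch of the distinction, using a handful of hypothetical (gender, height) records invented for illustration: the marginal distribution pools every height, while each conditional distribution keeps only the heights for one value of gender.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (gender, height in cm) records, invented for illustration.
records = [
    ("female", 160), ("female", 165), ("female", 170),
    ("male", 172), ("male", 178), ("male", 184),
]

# Marginal distribution: all heights pooled, ignoring gender.
marginal = [height for _, height in records]

# Conditional distributions: heights grouped by the value of gender.
conditional = defaultdict(list)
for gender, height in records:
    conditional[gender].append(height)

print("marginal mean height:", mean(marginal))
for gender, heights in conditional.items():
    print(f"mean height given gender={gender}:", mean(heights))
```

Plotting a histogram of each list in `conditional` side by side would give the kind of visual comparison the excerpt describes.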

How to Handle Missing Data in Practice: Guide for Beginners

Handling missing data involves 2 steps: (1) determining the type of missing data, which can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR); and (2) choosing a method to deal with these missing values, such as deleting variables (i.e. columns) that contain missing values, or deleting observations (i.e. rows) whose values are …
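
The two deletion methods named above can be sketched as follows. The dataset is hypothetical, `None` stands in for a missing value, and because the excerpt is truncated, the row rule used here (drop any row with at least one missing value) is an assumption rather than the article's exact criterion.

```python
# Hypothetical dataset with None marking a missing value, invented for illustration.
rows = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": 41, "income": None, "score": 5},
    {"age": 29, "income": 48000, "score": None},
    {"age": 50, "income": None, "score": 9},
]

# Option 1: delete variables (columns) that contain any missing value.
complete_columns = [
    col for col in rows[0]
    if all(row[col] is not None for row in rows)
]
without_incomplete_columns = [
    {col: row[col] for col in complete_columns} for row in rows
]

# Option 2 (assumed rule): delete observations (rows) with any missing value.
complete_rows = [
    row for row in rows
    if all(value is not None for value in row.values())
]

print("columns kept:", complete_columns)
print("rows kept:", len(complete_rows), "of", len(rows))
```

On this toy data, column deletion keeps every observation but only one variable, while row deletion keeps every variable but only one observation, which is the trade-off to weigh when choosing between them.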

5 Variable Transformations to Improve Your Regression Model

In this article, we will discuss how you can use the following transformations to build better regression models: log transformation, square root transformation, polynomial transformation, standardization, and centering by subtracting the mean. Compared to fitting a model using variables in their raw form, transforming them can help: make the model’s coefficients more interpretable; meet the model’s …
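
As a minimal sketch, each of the five transformations can be applied to a single predictor as shown below. The values in `x` are hypothetical, and the polynomial case is illustrated with a squared term only, which is one assumed choice of degree.

```python
import math
from statistics import mean, stdev

# Hypothetical right-skewed predictor, invented for illustration.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# Log transformation (requires strictly positive values).
log_x = [math.log(v) for v in x]

# Square root transformation (requires non-negative values).
sqrt_x = [math.sqrt(v) for v in x]

# Polynomial transformation: an extra squared term for a degree-2 model.
x_squared = [v ** 2 for v in x]

m, s = mean(x), stdev(x)

# Centering: subtract the mean so the new variable has mean 0.
centered = [v - m for v in x]

# Standardization: subtract the mean and divide by the standard deviation.
standardized = [(v - m) / s for v in x]

print("mean of centered:", round(mean(centered), 10))
print("stdev of standardized:", round(stdev(standardized), 10))
```

The transformed lists would then replace (or, for the squared term, accompany) the raw predictor when fitting the regression model.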

What is a Good R-Squared Value? [Based on Real-World Data]

I analyzed the content of 43,110 randomly chosen research papers from PubMed to learn more about R-squared. Specifically, I wanted to answer the following questions: What is a good value for R-squared? What is a low value for R-squared? Is a higher R-squared always better? Is a low R-squared necessarily bad? Let’s start with a …

Statistical Power: What It Is and How It Is Used in Practice

Statistical power is a measure of study efficiency, calculated before conducting the study to estimate the chance of discovering a true effect rather than obtaining a false negative result, or worse, overestimating the effect by detecting the noise in the data. Here are 5 seemingly different, but actually similar, ways of describing statistical power: Definition …

Identify Variable Types in Statistics (with Examples)

Here’s a summary of the types of variables. Quantitative (a.k.a. numerical) variables are either continuous or discrete; qualitative (a.k.a. categorical) variables are either ordinal or nominal. Continuous: consists of numerical values that can be measured but not counted. Discrete: consists of numerical values that can be counted. Ordinal: consists of text or labels that have a logical order. Nominal: consists of text or labels that …

Assess Variable Importance in Linear and Logistic Regression

In this article, we will be concerned with the following question: Given a regression model, which of the predictors X1, X2, X3, etc. has the most influence on the outcome Y? In general, assessing the relative importance of predictors by directly comparing their (unstandardized) regression coefficients is not a good idea because: For numerical predictors: …

How to Report a Chi-Square Test

The 3 main types of Chi-square tests are: Chi-square goodness-of-fit test: used to compare the distribution of a categorical variable (with more than 2 levels) to a hypothetical distribution. Chi-square homogeneity test: used to test whether 2 groups (coming from 2 different samples) have the same distribution regarding a certain categorical variable. Chi-square independence test: …

Correlation vs Collinearity vs Multicollinearity

Here’s a summary of the differences between correlation, collinearity and multicollinearity, starting with their definitions. Correlation refers to the linear relationship between 2 variables. Collinearity refers to a problem when running a regression model where 2 or more independent variables (a.k.a. predictors) have a strong linear relationship. Multicollinearity is a special case of …

Standardized vs Unstandardized Regression Coefficients

Here’s a summary of the similarities and differences between standardized and unstandardized linear regression coefficients, starting with their definitions. Unstandardized coefficients are obtained after running a regression model on variables measured in their original scales. Standardized coefficients are obtained after running a regression model on standardized variables (i.e. rescaled variables that have …

Which Variables Should You Include in a Regression Model?

When building a linear or logistic regression model, you should consider including: However, you should watch out for: Below we discuss each of these points in detail. 1. Selecting variables based on background knowledge Advantages of using background knowledge to select variables How to choose variables based on background knowledge? You can find out whether …
