George Choueiry

I am George Choueiry, PharmD, MPH, and a PhD student in epidemiology.

Run and Interpret a Mixed Model in R

A mixed model (also called a mixed-effects model) is used when the observations (i.e. rows) in the data are not independent. This can occur, for example, when measurements are taken from the same individuals multiple times or when observations can be grouped into clusters, such as families, co-workers, etc. In this tutorial, we will run …

Run and Interpret Ordinal Logistic Regression in R

Ordinal logistic regression is a type of regression analysis that models the relationship between one or more predictors (numerical or categorical) and an ordinal outcome. An ordinal outcome is a variable that has more than 2 categories that have a logical order, such as: In this tutorial, we will use ordinal logistic regression on the …

How to Work With Time Series Data in R (Using fpp3 package)

In this tutorial, we will use the fpp3 library in R to manipulate and plot time series data. fpp3 loads other useful packages such as: dplyr, tidyr, lubridate, and ggplot2. We will start with a simple example and then work with a more complicated one. 1. A simple time series example 1.1. Create the data …

Linear Regression Example for Time Series Data in R

In this tutorial, we will use a linear regression model to examine the relationship between the Google search trends for the terms headache and ibuprofen. 1. Prepare the data 1.1. Download the data The package gtrendsR presents an interface to retrieve the number of Google searches over time for a specific term: 1.2. Plot the …

How to Plot Time Series in R + Basic Analysis

In this tutorial, we will use the discoveries dataset available in R as an example of a time series. The dataset contains the yearly count of important scientific discoveries from 1860 to 1959. 1. Load the data A tsibble is a time series table that has an index (the year) and a value (the number of …

Plot Monthly Distribution for Each Year in R (Seasonality)

In this article, we will produce the following plot in R: 1. Load the data First, we will use the COVID19 package in R to download data for Lebanon. We will limit our analysis to 2 variables: (1) the date and (2) the number of new daily cases. 2. Count monthly cases To do that, …

Run and Interpret a Multinomial Logistic Regression in R

In this tutorial, we will use the penguins dataset from the palmerpenguins package in R to examine the relationship between the predictors, bill length and flipper length, and the outcome species (which has 3 categories). 1. Loading the data We will start by loading the necessary packages and summarizing the data: 2. Fitting a multinomial …

How to Run a Logistic Regression in R tidymodels

In this tutorial, we are going to use the tidymodels package to run a logistic regression on the Titanic dataset available in R. 1. Preparing the data 2. Running a logistic regression model In order to fit a logistic regression model in tidymodels, we need to do 4 things: 3. Examining the relationship between the predictors …

Easiest Way to Plot Data on a Map in R (Using ggmap)

In this tutorial, we will use the packages ggmap and COVID19 to create the following plot: 1. Downloading and plotting the map Since we are going to plot the Mediterranean region only, we first need to specify its borders. A simple Google search shows that these are roughly: Output: Since we don’t want country labels …

How to Run a Linear Regression in R tidymodels

In this article, we are going to use the iris dataset available in R to build a linear regression model using the tidymodels package. Building the model In order to fit a linear regression model in tidymodels, we need to do 4 things: Checking linear regression assumptions After fitting the model, we should check whether …

4 Ways to Handle a Categorical Predictor With Many Levels

A regression model that includes a categorical predictor with many levels might not contain enough observations in each category to detect a reasonable effect size with reasonable power. Even when it does, the large number of dummy variables created could be difficult to interpret. In this article, we present 4 ways to deal with …

Create and Plot Graphs from data.frame: Intro to igraph in R

A graph consists of points — called nodes or vertices — connected by line segments — called edges. The only library we will need for this tutorial is igraph: 1. Directed graphs A directed graph, where the edges indicate a one-way relationship between vertices, can be created from a data.frame that simply defines the edges: …

Join Dataframes in R: Left/Right/Inner/Full Joins

Joining 2 dataframes involves linking the rows in one to the rows in the other. This can be done in different ways using the dplyr functions: left_join(), right_join(), inner_join(), and full_join(). Here’s an illustration of the differences between them: In order to demonstrate how these functions work, let’s first create some data: 1. Left join …

Extract Multiple Occurrences of a Pattern in a String in R

Our goal is to extract all p-values reported in these abstracts: Notice that most reported p-values follow this 3-part pattern: In regular expressions, this pattern can be written as: Then we can call str_extract_all() to extract all text that follows this pattern: The new column is a list which we can unnest(): The next step is …

Extract p-values from Text in R: Using separate_wider_regex

Our goal is to extract p-values from article abstracts to create 2 new variables: sign and p-value: We will break the problem into 2 simple steps: But first, let’s load the data: Step 1: determine which abstracts report a p-value Here, we can use the function str_detect() which takes a string and a pattern, and …

Extract Numbers from Strings in R

The functions parse_integer(), parse_double(), and parse_number() from the readr library transform a character vector into a numeric vector. Here’s an example that compares these 3 functions: Exercises 1. Extract the number 1000000 from “1 000 000” Not all characters in this string can be transformed into an integer (since we have white spaces), so we …

In this article, we will use R to: 1. Get PMID numbers of relevant articles Let’s say we are interested in analyzing articles about lung cancer, published in PLOS Medicine, in the year 2022. Here’s the PubMed link that contains PMID (PubMed ID) numbers for these articles: If you visit this URL, you see an …

Plot Median and Interquartile Range in R

In this tutorial, we are going to create the following plot of the median and the interquartile range of sepal length for each iris species using the iris dataset: 1. Using geom_pointrange() We start by calculating the median, the 1st quartile, and the 3rd quartile as follows: Then we give these variables to geom_pointrange() in …

Using pivot_longer with names_sep and names_pattern in R

In this article, we will explain how to use the arguments names_sep and names_pattern of the function pivot_longer() from the tidyr package. First, let’s create some data: The general syntax of pivot_longer() is: 1. Using the names_sep argument in pivot_longer() When calling pivot_longer(), we can create 2 new columns (instead of 1) by splitting the …

Convert Columns to Rows in R

In this article, we will show 2 ways to convert columns into rows in R using the following data: 1. Using matrix transpose The function t() takes the data frame df and returns a matrix where columns and rows are switched: But we have 3 problems with this function: So let’s clean this output a …
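
The excerpt is cut off before it lists the problems with t(), so the following is only a minimal sketch of the transpose step on a hypothetical data frame (the column names and values are made up for illustration). It shows one behavior that is easy to verify: t() returns a matrix, and when the columns have mixed types, every value is coerced to character.

```r
# Hypothetical data frame (not the one used in the article)
df <- data.frame(id = c("a", "b"), x = c(1, 2), y = c(3, 4))

# t() switches rows and columns and returns a matrix
m <- t(df)

class(m)         # a matrix, not a data frame
dim(m)           # 3 rows (one per original column), 2 columns
is.character(m)  # numbers were coerced to character because of the id column
```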

How to Summarize Data in R (Using Dplyr)

In this article, we will cover how to apply the function summarize() from the dplyr package using the following data: 1. Summarizing a variable Use the following code to calculate the average age in our dataset: The summarize() function can take many arguments: 2. Grouping and summarizing Use the following code to calculate the average …

How to Deal with Violation of Normality of Errors in R

Linear regression assumes that error terms are normally distributed. This is especially important when we are using linear regression for prediction purposes and our sample size is small (see: Understand Linear Regression Assumptions). When the normality of errors assumption is violated, try: Let’s create some data to demonstrate these methods: Output: So we see that …

How to Deal with Heteroscedasticity in Regression in R

Linear regression assumes that the dispersion of data points around the regression line is constant. We can deal with violation of this assumption (i.e. with heteroscedasticity) by: Let’s create some heteroscedastic data to demonstrate these methods: Output: The residuals vs fitted values plot shows a fan shape, which is evidence of heteroscedasticity. (For more information, …

How to Deal with Violation of the Linearity Assumption in R

The most important assumption of linear regression is that the relationship between each predictor and the outcome is linear. When the linearity assumption is violated, try: Let’s create some non-linear data to demonstrate these methods: The residuals vs fitted values plot shows a curved relationship; therefore, the linearity assumption is violated. Solution #1: Adding a quadratic …

How to Check Linear Regression Assumptions in R

Linear regression has 4 assumptions: 1. How to check linearity Instead of checking the relationship between each predictor and the outcome in a multivariable model, we can plot the residuals versus the fitted values. The plot should show no discernible pattern: The output will look like one of the following: In this plot, R draws …

Understand Linear Regression Assumptions

The 4 assumptions of linear regression in order of importance are: 1. Linearity 1.1. Explanation The relationship between each predictor Xi and the outcome Y should be linear. 1.2. How to check the linearity assumption Instead of checking the relationship between each predictor Xi and the outcome Y in a multivariable model, we can plot …

Linear Regression in R (with a Categorical Variable)

In this article, we will run and interpret a linear regression model where the predictor is a categorical variable with multiple levels. Loading the Data We will use the chickwts dataset available in R. (These data come from an experiment where newly hatched chickens were randomly divided into 6 groups, each group receiving a different …

How to Run and Interpret a Logistic Regression Model in R

In this tutorial, we are going to run a logistic regression using the Titanic dataset available in R: 1. Logistic regression equation The formula $$Survived \sim Age$$ corresponds to the logistic regression equation: $$\log\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 Age$$ where $$P$$ is the probability of having the outcome, i.e. the probability of surviving. …

Logistic Regression in R (with Categorical Variables)

In this article, we will run and interpret a logistic regression model where the predictor is a categorical variable with multiple levels. Loading the data We will use the Titanic dataset available in R: Running a logistic regression model Next, we will use logistic regression to examine the effect of class (the predictor) on survival …

Plot Logistic Regression Decision Boundary in R

In this article, we will produce the following R plot that represents the decision boundary of a logistic regression model: Here’s the full code used to generate it: Code explanation First, we create some data (2 continuous variables x1 and x2, and 1 binary variable y) and run a logistic regression: Next, we will create …

Stepwise (Linear & Logistic) Regression in R

In this article, we will cover: Let’s start by creating some data: To run a stepwise regression, use the stepAIC function from the MASS library. 1. How to run forward stepwise linear regression Output: Call: lm(formula = X1 ~ X4 + X3 + X7, data = dat) Residuals: Min 1Q Median 3Q Max -0.52407 -0.23122 …

How to Deal with Multicollinearity in R

Multicollinearity occurs when there is a strong linear relationship between 2 or more predictors in a regression model. It is a problem because it increases the standard errors of the regression coefficients, leading to noisy estimates. Let’s simulate some data in R: We have a collinearity problem in our model since our variables’ VIFs (Variance …
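
As a rough illustration of what a VIF measures, here is a sketch that computes it by hand in base R as 1 / (1 - R²), where R² comes from regressing one predictor on the others. The simulated variables x1, x2, and x3 are assumptions made for this example and are not the data used in the article, which may also rely on a packaged VIF function instead.

```r
set.seed(1)
n <- 200

# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)

# VIF of x1: regress x1 on the other predictors and use 1 / (1 - R^2)
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# VIF of x3, computed the same way
r2_x3 <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2_x3)

vif_x1  # large, because x1 is nearly a linear function of x2
vif_x3  # close to 1, because x3 is unrelated to x1 and x2
```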

Weighted Regression: An Intuitive Introduction

Weighted regression (a.k.a. weighted least squares) is a regression model where each observation is given a certain weight that tells the software how important it should be in the model fit. Weighted regression can be used to: 1. Weighted regression to handle non-constant variance of error terms Linear regression assumes that the error terms have …

How to Report Interaction Effects in Regression

For a linear regression model: Y = β0 + β1X + β2Z + β3XZ + ε If the coefficient of the interaction term β3 is statistically significant, then there is evidence of an interaction between X and Z. This means that the effect of X on the outcome Y is different for different sub-categories of Z, …

Linear Regression with Interaction in R

Output: Call: lm(formula = Y ~ X + Z + X:Z, data = dat) Residuals: Min 1Q Median 3Q Max -1.00058 -0.25209 0.00766 0.21640 0.89542 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.57717 0.11195 40.885 < 2e-16 *** X 0.44168 0.06551 6.742 3.38e-10 *** Z -1.23932 0.21937 -5.649 8.16e-08 *** X:Z 0.18859 0.03357 5.617 …

Interpret Log Transformations in Linear Regression

The following table summarizes how to interpret a linear regression model with logarithmic transformations: Transformation Model Interpretation No transformations Y = β0 + β1 X A 1 unit increase in X is associated with an average change of β1 units in Y. Log-transformed predictor Y = β0 + β1 log(X) A 1% increase in X …

Linear regression assumes that the relationship between the predictor X and the outcome Y is linear. If this assumption is not met, linear regression will be a poor fit to the data (as shown in the figure below). In this case, adding a quadratic term to the regression equation may help model the relationship between …

Interpret Linear Regression Output in R

Here’s an example of linear regression in R: 1. Linear regression equation The formula $$y \sim x + z$$ corresponds to the regression equation: $$y = β_0 + β_1x + β_2z$$ where: 2. Residuals The residuals are the difference between the regression line that we fitted (using the predictors x and z) and the real …

8 Types of Treatment Effects Explained (with Examples)

When studying the effect of a treatment (or an intervention) on an outcome, we should keep in mind that it will probably not be the same for everyone. In other words, each person will likely experience a different effect of the same treatment — we say that the treatment has a heterogeneous effect. We can …

3 Real-World Examples of Using Instrumental Variables

The instrumental variable approach is a method to identify the causal effect of a treatment on an outcome of interest by controlling for unobserved confounding between them. A valid instrumental variable, Z, is one that influences the outcome, Y, through the treatment, X, without being related to the confounding variable, C, as shown in the …

When Does Correlation Imply Causation?

Short answer: Correlation implies causation when alternative explanations of the relationship between the correlated variables (such as confounding and bias) are removed (by appropriately modifying the study design) or controlled for (by adjusting for them in the statistical analysis). Explanation: Causation means that changing the treatment X for a person will affect the probability of …

Correlation Coefficient vs Regression Coefficient

Both the correlation and regression coefficients rely on the hypothesis that the data can be represented by a straight line. They are similar in many ways, but they serve different purposes. Here’s a table that summarizes the similarities and differences between the correlation coefficient, r, and the regression coefficient, β: Correlation coefficient: r Regression coefficient: …

An Example of Using Marginal and Conditional Distributions

The conditional distribution of a variable, for example heights, is the distribution of heights given the value of another variable, for example gender. Plotting the conditional distribution of heights given gender is a way of visualizing the relationship between the 2 variables. The marginal distribution of heights is the distribution of heights for everybody, independent …

Why Divide Sample Standard Deviation by n-1?

The problem The standard deviation is a measurement of the spread of the data — it is the average distance of the data from the mean. We are rarely interested in the amount of variation in our sample: the sample standard deviation is only useful as an approximation of the population standard deviation. When our …
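
To make the n-1 divisor concrete, here is a small sketch on a made-up sample (the values in x are hypothetical). It checks that R's sd() divides by n - 1 and shows that dividing by n gives a slightly smaller value.

```r
# Hypothetical sample
x <- c(4, 8, 15, 16, 23, 42)
n <- length(x)

# Standard deviation with the n - 1 divisor (what sd() computes)
sd_n_minus_1 <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Standard deviation with the n divisor
sd_n <- sqrt(sum((x - mean(x))^2) / n)

all.equal(sd(x), sd_n_minus_1)  # sd() matches the n - 1 version
sd_n < sd_n_minus_1             # dividing by n gives a smaller value
```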

How to Describe/Summarize Numerical Data in R (Example)

Let’s start by creating IQ, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants: Next, we will (1) summarize this variable and (2) describe its distribution. 1. Summary statistics Reminder: The 1st quartile is the 25th percentile, …

How to Describe/Summarize Categorical Data in R (Example)

Let’s start by creating our own data, consisting of 2 categorical variables: gender and smoking: Next, we will create a frequency table and a bar plot to summarize these data one variable at a time, then we will create a contingency table and a stacked bar plot to describe the relationship between the 2 variables. …

Modulo Operator (%%) in R: Explained + Practical Examples

The modulo operator (%% in R) returns the remainder of the division of 2 numbers. Here are some examples: 5 %% 2 returns 1, because 2 goes into 5 two times and the remainder is 1 (i.e. 5 = 2 × 2 + 1). 4 %% 2 returns 0, since 4 = 2 × 2 …
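
The two examples above can be run directly. The sketch below also adds one common use of the operator, testing whether a number is even; is_even is a hypothetical helper name introduced here for illustration and is not taken from the article.

```r
# The two examples from the text
5 %% 2  # remainder of 5 divided by 2
4 %% 2  # remainder of 4 divided by 2

# Hypothetical helper: a number is even when the remainder
# of its division by 2 is 0
is_even <- function(x) x %% 2 == 0

is_even(c(1, 2, 3, 4))
```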

Find the Minimum and Maximum of a Function in R

The function optimize (also spelled optimise) in R returns the minimum or maximum of a function f(x) within a specified interval. It takes as inputs: f: a function. interval: a vector containing the lower and upper bounds of the domain where we want to search for the minimum or maximum. maximum: a logical, where TRUE …
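
Here is a minimal sketch of these inputs in use. The functions f and g and their intervals are made up for this example. optimize() returns a list whose first element is named minimum (or maximum when maximum = TRUE) and whose second element, objective, is the value of the function at that point.

```r
# Hypothetical function: a parabola with its minimum at x = 2
f <- function(x) (x - 2)^2

# Search for the minimum of f over the interval [0, 5]
res_min <- optimize(f, interval = c(0, 5))
res_min$minimum    # location of the minimum (close to 2)
res_min$objective  # value of f at that point (close to 0)

# Hypothetical function: a parabola with its maximum at x = 1
g <- function(x) -(x - 1)^2 + 3

# maximum = TRUE searches for the maximum instead
res_max <- optimize(g, interval = c(-5, 5), maximum = TRUE)
res_max$maximum    # location of the maximum (close to 1)
res_max$objective  # value of g at that point (close to 3)
```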

How to Handle Missing Data in Practice: Guide for Beginners

Handling missing data involves 2 steps: Determining the type of missing data, which can be: Missing completely at random (MCAR) Missing at random (MAR) Missing not at random (MNAR) Choosing a method to deal with these missing values, such as: Deleting variables (i.e. columns) that contain missing values Deleting observations (i.e. rows) whose values are …

Write a Function that Returns the nth Fibonacci Number in R

Challenge: Write a function in R that prints the nth Fibonacci number. Reminder: the Fibonacci sequence is: 1, 1, 2, 3, 5, 8, … So, the first Fibonacci number is 1, the second is also 1, and then each subsequent number is the sum of the previous 2 in the sequence. Solution: In this article, …
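
The excerpt ends before the article's own solution, so the following is only one possible approach: an iterative sketch that follows the definition in the reminder. The function name fib is an assumption made for this example.

```r
# fib(1) = 1, fib(2) = 1, and each later number is the sum of the previous 2
fib <- function(n) {
  stopifnot(n >= 1)
  if (n <= 2) return(1)
  prev <- 1  # the (k - 1)th Fibonacci number
  curr <- 1  # the kth Fibonacci number
  for (k in 3:n) {
    nxt <- prev + curr
    prev <- curr
    curr <- nxt
  }
  curr
}

sapply(1:6, fib)  # the first 6 Fibonacci numbers
```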

Plot a Step Function in Base R and ggplot2

As an example of a step function, we will use the floor function floor(x) that takes a real number x and returns the greatest integer less than or equal to x. Coding the floor function in R: Plotting in base R Output: We can add the data points to the plot with the following code: …

Solve a Polynomial in R

A polynomial p(x) is an expression of the form: $$p(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + … + a_nx^n$$ Where n is any non-negative integer. Solve a polynomial p(x) in R To solve the equation $$p(x) = 0$$ in R, we can use the function: polyroot. For example, let’s solve the equation: …
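
As a sketch of how polyroot is called, the example below solves a made-up equation, 6 - 5x + x² = 0, which is not necessarily the one used in the article. polyroot expects the coefficients in increasing order of power (a_0, a_1, a_2, ...) and returns the roots as complex numbers.

```r
# Hypothetical polynomial: p(x) = 6 - 5x + x^2, whose roots are 2 and 3
# Coefficients are given in increasing order of power: a_0, a_1, a_2
roots <- polyroot(c(6, -5, 1))
roots  # returned as complex numbers

# When the imaginary parts are numerically zero, keep only the real parts
Re(roots)
```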

Coding and Plotting a Piecewise Function in R

In this tutorial, we are going to code the following function in R: $$f(x) = \begin{cases} -x, & \text{if } x < -1 \\ x^2, & \text{if } x \geq -1 \end{cases}$$ And produce the following plot: Coding the piecewise function f(x) in R Using if else statements While this code is easy to read and understand, it does not support …
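
The excerpt is truncated, so the article's own code is not shown here. As one possible way to code this piecewise function so that it accepts a whole vector of x values, here is a sketch using ifelse(), which evaluates the condition element by element.

```r
# f(x) = -x   if x < -1
# f(x) = x^2  if x >= -1
f <- function(x) ifelse(x < -1, -x, x^2)

# Works on a single number or on a vector
f(c(-3, -1, 0, 2))
```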

How to Plot a Quadratic Function in R

For the following quadratic function: $$f(x) = x^2 + 2x - 20$$ Here’s the plot that we want to produce: Coding the function f(x) in R A quadratic function is a function of the form: $$ax^2 + bx + c$$, where $$a \neq 0$$. So for $$f(x) = x^2 + 2x - 20$$: a = …

Find the Line Equation From 2 Points in R

Suppose we want to know the equation of the line that passes through 2 points A and B, such that: Quick solution Output: Call: lm(formula = ys ~ xs) Coefficients: (Intercept) xs 58.833 -2.417 So, the equation of the line that passes through A and B is: $$f(x) = -2.417x + 58.833$$ To get more …
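
The coordinates of A and B are not shown in this excerpt, so the sketch below uses two hypothetical points, A = (1, 3) and B = (3, 7), to illustrate the same lm(ys ~ xs) approach. It then checks the result against the slope formula.

```r
# Hypothetical points (not the ones used in the article)
xs <- c(1, 3)  # x coordinates of A and B
ys <- c(3, 7)  # y coordinates of A and B

# With exactly 2 points, lm() fits the line that passes through both
fit <- lm(ys ~ xs)
coef(fit)  # intercept first, then slope

# Check against the slope formula (y2 - y1) / (x2 - x1)
slope <- (ys[2] - ys[1]) / (xs[2] - xs[1])
intercept <- ys[1] - slope * xs[1]
```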

Which Sampling Methods Are Most Commonly Used in Research?

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to check the popularity of different sampling methods and assess their correlation with the quality of research. I used the BioC API to download the data (see the References section below). Here’s a summary of …

7 Sampling Methods Explained Visually

In this article, we will cover 7 sampling methods, which we are going to divide into 2 types: probability sampling methods and non-probability sampling methods. Probability sampling methods involve random selection of participants, and therefore tend to produce unbiased samples; non-probability methods do not involve random selection of participants, and therefore are cheaper to apply, …

Writing Custom Functions in R

In this article, you will learn how to write your own functions in R. Specifically, we will cover: How to write a simple function How to write a more complex function How to write an anonymous function How to write a function with an unfixed number of arguments How to write a recursive function 1. …
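
For a quick feel of the function types listed above, here is a short sketch with one tiny example of each. All function names (square, sum_all, fact) are made up for illustration and are not taken from the article.

```r
# A simple function
square <- function(x) x^2

# An anonymous function, passed directly to sapply() without a name
plus_one <- sapply(1:3, function(x) x + 1)

# A function with an unfixed number of arguments, collected by ...
sum_all <- function(...) sum(...)

# A recursive function: n! = n * (n - 1)!
fact <- function(n) if (n <= 1) 1 else n * fact(n - 1)

square(3)
plus_one
sum_all(1, 2, 3)
fact(5)
```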

How to Solve an Equation in R

In this article, we will use the uniroot.all() function from the rootSolve package to find all the solutions of an equation over a given interval (or domain). Input: uniroot.all() takes 2 arguments: a function f and an interval. How it works: It searches the interval for all possible roots of f. Output: uniroot.all() returns a vector …

Create and Graph Intervals in R

A quick review of intervals The open interval from a to b, denoted (a, b), consists of all numbers between a and b excluding the endpoints a and b. Open circles in the graph indicate that the endpoints are excluded: The closed interval from a to b, denoted [a, b], consists of all numbers between …

Working with Sets in R (Tutorial)

A set is an unordered collection of unique elements. It is helpful to keep track of distinct objects. In this tutorial, you will learn how to: Create a set Manipulate sets Work with subsets Apply set operations 1. Create a set 1.1. Create a set from scratch A set can contain different types of elements, …
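
As a sketch of these steps using only base R functions (the article itself may use a different approach or package), the example below builds two made-up sets, a and b, and applies the usual operations to them.

```r
# unique() drops duplicates, so the result can be treated as a set
a <- unique(c(1, 2, 2, 3))
b <- c(3, 4, 5)

union(a, b)      # elements in a or in b
intersect(a, b)  # elements in both a and b
setdiff(a, b)    # elements in a but not in b

2 %in% a                          # is 2 a member of a?
all(c(1, 2) %in% a)               # is {1, 2} a subset of a?
setequal(c(1, 2, 3), c(3, 2, 1))  # same elements, order ignored
```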

5 Real-World Examples of Confounding [With References]

An association between 2 variables X and Y cannot be interpreted as causal if it can be attributed to an alternative mechanism. Confounding is an example of such a mechanism: it alters the relationship between X and Y, and therefore leads to an over- or underestimation of the true effect between them. In its simplest form, …

Front-Door Criterion to Adjust for Unmeasured Confounding

Suppose we conducted an observational study to estimate the causal effect of some depression treatment on the quality of life of patients: The problem is that the relationship between the two is confounded by the severity of depression: The arrows in the diagram reflect causal associations: The arrow from “depression severity” to “treatment” reflects the …

How to Start a Discussion Section in Research? [with Examples]

The examples below are from 72,017 full-text PubMed research papers that I analyzed in order to explore common ways to start writing the Discussion section. Research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to …

How to Start a Methods Section in Research? [with Examples]

The examples below are from 76,350 full-text PubMed research papers that I analyzed in order to explore common ways to start the Materials and Methods section. Research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. I used the BioC API to download …

How to Start an Abstract? Examples from 94,745 Research Papers

The examples below are from 94,745 full-text PubMed research papers that I analyzed in order to explore common ways to start writing the Abstract. Research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to download …

How to Start an Introduction? Examples from 98,093 Research Papers

The examples below are from 98,093 full-text PubMed research papers that I analyzed in order to explore common ways to start the Introduction section. The research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to …

How to Start a Conclusion in Research? [With Examples]

The examples below are from 47,803 full-text PubMed research papers that I analyzed in order to explore common ways to start a Conclusion section. The research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to …

How to Start a Research Title? Examples from 105,975 Titles

I analyzed a random sample of 105,975 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore common ways to start a research title. I used the BioC API to download the data (see the References section below). Common ways to start a title The most common 3-word …

Meta-Analysis Software Popularity in 1,321 Research Papers

I analyzed a random sample of 1,957 meta-analysis full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to check the popularity of meta-analysis software packages among medical researchers. (I used the BioC API to download the articles; see the References section below.) Out of these 1,957 meta-analysis papers, …

Does the Number of Authors Matter? Data from 101,580 Research Papers

I analyzed a random sample of 101,580 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore the influence of the number of authors of a research paper on its quality. I used the BioC API to download the data (see the References section below). Here’s a summary …

How to Write & Publish a Research Paper: Step-by-Step Guide

This guide is far more than a list of instructions on what to include in each section of your research paper. In fact, we will: Use a research paper I wrote specifically as an example to illustrate the key ideas in this guide (link to the full-text PDF of the research paper). Use real-world data …

“I” & “We” in Academic Writing: Examples from 9,830 Studies

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore whether, and how, first-person pronouns are used in the scientific literature. I used the BioC API to download the data (see the References section below). Popularity of first-person pronouns in the …

Paragraph Length: Data from 9,830 Research Papers

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to answer the question: How long should a paragraph be in a research paper? I used the BioC API to download the data (see the References section below). Paragraph length Our sample of 9,830 research …

Can a Research Title Be a Question? Real-World Examples

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to answer the questions: Can a research title be a question? And do questions make good titles? I used the BioC API to download the data (see the References section below). Popularity of question titles …

How Long Should a Research Title Be? Data from 104,161 Examples

I analyzed a random sample of 104,161 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to learn more about title length. I used the BioC API to download the data (see the References section below). Here’s a summary of the key findings 1. The median title was 14 words long …

How Long Should the Abstract Be? Data from 61,429 Examples

I analyzed a random sample of 61,429 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of an abstract? and which factors influence it? I used the BioC API to download the data (see the References section below). Here’s a …

How Long Should the Discussion Section Be? Data from 61,517 Examples

I analyzed a random sample of 61,517 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a discussion section? and which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Results Section Be? Data from 61,458 Examples

I analyzed a random sample of 61,458 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a results section? and which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Methods Section Be? Data from 61,514 Examples

I analyzed a random sample of 61,514 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a methods section? and which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Introduction of a Research Paper Be? Data from 61,518 Examples

I analyzed a random sample of 61,518 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of an introduction section? and which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

5 Variable Transformations to Improve Your Regression Model

In this article, we will discuss how you can use the following transformations to build better regression models: Log transformation Square root transformation Polynomial transformation Standardization Centering by subtracting the mean Compared to fitting a model using variables in their raw form, transforming them can help: Make the model’s coefficients more interpretable. Meet the model’s …

Interpret Interactions in Linear Regression

For a linear regression model with interaction: Y = β0 + β1 X1 + β2 X2 + β3 X1X2 The coefficient of the interaction term (β3) is the increase in effectiveness of X1 for a 1 unit change in X2, and vice-versa. For example: Suppose we used linear regression to study the effect of physical …

Interpret the Linear Regression Intercept

For a linear regression model: Y = β0 + β1 X The linear regression intercept β0 is the predicted value of the outcome Y when the predictor X equals zero. As an example, we will try to interpret the intercept β0 = 78.66 in the following linear regression model: Heart Rate = 78.66 + 2.94 …

7 Different Ways to Control for Confounding

Confounding can be controlled in the design phase of the study by using: Random assignment Restriction Matching Or in the data analysis phase by using: Stratification Regression Inverse probability weighting Instrumental variable estimation Here’s a quick summary of the similarities and differences between these methods: Study Phase Method Can easily control for multiple confounders Can …

4 Simple Ways to Identify Confounding

A variable is a confounder if it satisfies one of the following conditions: It has been proven so in previous studies. Adjusting for it produces more than 10% change in the relationship between the exposure and the outcome. It is associated with both the exposure and the outcome, without being on the causal pathway between …
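The 10% change criterion mentioned above can be sketched in Python as follows. The crude and adjusted estimates are hypothetical numbers, and dividing by the crude estimate is an assumption made here (some authors divide by the adjusted estimate instead).

```python
def percent_change(crude, adjusted):
    """Percent change in the exposure-outcome estimate after adjustment.

    Assumption: the change is expressed relative to the crude estimate.
    """
    return abs(crude - adjusted) / abs(crude) * 100

def flags_confounding(crude, adjusted, threshold=10.0):
    """True if adjusting changes the estimate by more than the threshold (in %)."""
    return percent_change(crude, adjusted) > threshold

# Hypothetical regression coefficients for the exposure,
# before and after adjusting for the candidate confounder.
print(percent_change(2.0, 1.5))     # 25.0
print(flags_confounding(2.0, 1.5))  # True
print(flags_confounding(2.0, 1.9))  # False
```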

An Example of Identifying and Adjusting for Confounding

Suppose we are interested in studying whether smoking increases heart rate. Because it would not be ethical to randomly assign people to smoke, we are stuck with an observational design where we have to deal with bias and confounding ourselves. The questions that we are going to be concerned with in this article are: Which …

Why Confounding is Not a Type of Bias

Bias is an error in the estimation of an association between an exposure and an outcome due to a flaw in the design or conduct of the study. Confounding, on the other hand, is a real but non-causal association between the exposure and the outcome. Although their mechanisms are different, both bias and confounding can …

Using the 4 D-Separation Rules to Study a Causal Association

Suppose we want to study whether coffee causes cancer, which we will represent as follows: Randomizing people to either consume coffee or not for many years in order to study its effect on cancer is neither ethical nor practical. So we have to use an observational design, where we would have to deal with bias and …

List of All Biases [Sorted by Popularity in Research Papers]

I analyzed the content of 98,709 randomly chosen research papers from PubMed to learn more about bias. Specifically, I wanted to do 2 things: Rank 64 types of biases by popularity, in order to determine on which ones professional researchers focus the most in practice. Test the hypothesis that addressing bias issues is a sign …

What is a Good R-Squared Value? [Based on Real-World Data]

I analyzed the content of 43,110 randomly chosen research papers from PubMed to learn more about R-squared. Specifically, I wanted to answer the following questions: What is a good value for R-squared? What is a low value for R-squared? Is a higher R-squared always better? Is a low R-squared necessarily bad? Let’s start with a …

Statistical Power: What It Is and How It Is Used in Practice

Statistical power is a measure of study efficiency, calculated before conducting the study to estimate the chance of discovering a true effect rather than obtaining a false negative result, or worse, overestimating the effect by detecting the noise in the data. Here are 5 seemingly different, but actually similar, ways of describing statistical power: Definition …

Solomon Four-Group Design: An Introduction

The Solomon four-group design is a type of experiment where participants get randomly assigned to 1 of 4 groups that differ in whether the participants receive the treatment or not, and whether the outcome of interest is measured once or twice in each group. The four groups in this design are (see figure below): …

Matched Pairs Design vs Randomized Block Design

In a matched pairs design, treatment options are randomly assigned to pairs of similar participants, whereas in a randomized block design, treatment options are randomly assigned to groups of similar participants. The objective of both is to balance baseline confounding variables by distributing them evenly between the treatment and the control group. Matched pairs design …

Randomized Block Design vs Completely Randomized Design

A randomized block design differs from a completely randomized design by ensuring that an important predictor of the outcome is evenly distributed between study groups in order to force them to be balanced, something that a completely randomized design cannot guarantee. A completely randomized design uses simple randomization to assign participants to different treatment options …

Identify Variable Types in Statistics (with Examples)

Here’s a table that summarizes the types of variables. Quantitative (a.k.a. numerical): Continuous, which consists of numerical values that can be measured but not counted; Discrete, which consists of numerical values that can be counted. Qualitative (a.k.a. categorical): Ordinal, which consists of text or labels that have a logical order; Nominal, which consists of text or labels that …

Purpose and Limitations of Random Assignment

In an experimental study, random assignment is a process by which participants are assigned, with the same chance, to either a treatment or a control group. The goal is to assure an unbiased assignment of participants to treatment options. Random assignment is considered the gold standard for achieving comparability across study groups, and therefore is …

Pretest-Posttest Control Group Design: An Introduction

The pretest-posttest control group design, also called the pretest-posttest randomized experimental design, is a type of experiment where participants get randomly assigned to either receive an intervention (the treatment group) or not (the control group). The outcome of interest is measured 2 times, once before the treatment group gets the intervention — the pretest — …

Assess Variable Importance in Linear and Logistic Regression

In this article, we will be concerned with the following question: Given a regression model, which of the predictors X1, X2, X3, etc. has the most influence on the outcome Y? In general, assessing the relative importance of predictors by directly comparing their (unstandardized) regression coefficients is not a good idea because: For numerical predictors: …

Separate-Sample Pretest-Posttest Design: An Introduction

The separate-sample pretest-posttest design is a type of quasi-experiment where the outcome of interest is measured 2 times: once before and once after an intervention, each time on a separate group of randomly chosen participants. The difference between the pretest and posttest measures will estimate the intervention’s effect on the outcome. The intervention can be: …