George Choueiry

Run and Interpret Ordinal Logistic Regression in R

Ordinal logistic regression is a type of regression analysis that models the relationship between one or more predictors (numerical or categorical) and an ordinal outcome. An ordinal outcome is a variable that has more than 2 categories that have a logical order, such as: In this tutorial, we will use ordinal logistic regression on the …

How to Work With Time Series Data in R (Using fpp3 package)

In this tutorial, we will use the fpp3 library in R to manipulate and plot time series data. fpp3 loads other useful packages such as: dplyr, tidyr, lubridate, and ggplot2. We will start with a simple example and then work with a more complicated one. 1. A simple time series example 1.1. Create the data …

Linear Regression Example for Time Series Data in R

In this tutorial, we will use a linear regression model to examine the relationship between the Google search trends for the terms headache and ibuprofen. 1. Prepare the data 1.1. Download the data The package gtrendsR presents an interface to retrieve the number of Google searches over time for a specific term: 1.2. Plot the …

How to Plot Time Series in R + Basic Analysis

In this tutorial, we will use the discoveries dataset available in R as an example of a time series. The dataset contains the yearly count of important scientific discoveries from 1860 to 1959. 1. Load the data A tsibble is a time series table that has an index (the year) and a value (the number of …

Plot Monthly distribution for each year in R (Seasonality)

In this article, we will produce the following plot in R: 1. Load the data First, we will use the COVID19 package in R to download data for Lebanon. We will limit our analysis to 2 variables: (1) the date and (2) the number of new daily cases. 2. Count monthly cases To do that, …

Run and Interpret a Multinomial Logistic Regression in R

In this tutorial, we will use the penguins dataset from the palmerpenguins package in R to examine the relationship between the predictors, bill length and flipper length, and the outcome species (which has 3 categories). 1. Loading the data We will start by loading the necessary packages and summarizing the data: 2. Fitting a multinomial …

How to Run a Logistic Regression in R tidymodels

In this tutorial, we are going to use the tidymodels package to run a logistic regression on the Titanic dataset available in R. 1. Preparing the data 2. Running a logistic regression model In order to fit a logistic regression model in tidymodels, we need to do 4 things: 3. Examining the relationship between the predictors …

Easiest Way to Plot Data on a Map in R (Using ggmap)

In this tutorial, we will use the packages ggmap and COVID19 to create the following plot: 1. Downloading and plotting the map Since we are going to plot the Mediterranean region only, we first need to specify its borders. A simple Google search shows that these are roughly: Output: Since we don’t want country labels …

How to Run a Linear Regression in R tidymodels

In this article, we are going to use the iris dataset available in R to build a linear regression model using the tidymodels package. Building the model In order to fit a linear regression model in tidymodels, we need to do 4 things: Checking linear regression assumptions After fitting the model, we should check whether …

4 Ways to Handle a Categorical Predictor With Many Levels

A regression model that includes a categorical predictor with many levels might not contain enough observations in each category to detect a reasonable effect size with reasonable power. Even then, the large number of dummy variables created could be difficult to interpret. In this article, we present 4 ways to deal with …

Create and Plot Graphs from data.frame: Intro to igraph in R

A graph consists of points — called nodes or vertices — connected by line segments — called edges. The only library we will need for this tutorial is igraph: 1. Directed graphs A directed graph, where the edges indicate a one-way relationship between vertices, can be created from a data.frame that simply defines the edges: …

Join Dataframes in R: Left/Right/Inner/Full Joins

Joining 2 dataframes involves linking the rows in one to the rows in the other. This can be done in different ways using the dplyr functions: left_join(), right_join(), inner_join(), and full_join(). Here’s an illustration of the differences between them: In order to demonstrate how these functions work, let’s first create some data: 1. Left join …

Extract Multiple Occurrences of a Pattern in a String in R

Our goal is to extract all p-values reported in these abstracts: Notice that most reported p-values follow this 3-part pattern: In regular expressions, this pattern can be written as: Then we can call str_extract_all() to extract all text that follows this pattern: The new column is a list which we can unnest(): The next step is …

Extract p-values from Text in R: Using separate_wider_regex

Our goal is to extract p-values from article abstracts to create 2 new variables: sign and p-value: We will break the problem into 2 simple steps: But first, let’s load the data: Step 1: determine which abstracts report a p-value Here, we can use the function str_detect() which takes a string and a pattern, and …

Extract Numbers from Strings in R

The functions parse_integer(), parse_double(), and parse_number() from the readr library transform a character vector into a numeric vector. Here’s an example that compares these 3 functions: Exercises 1. Extract the number 1000000 from “1 000 000” Not all characters in this string can be transformed into an integer (since we have white spaces), so we …

Download and Analyze PubMed Articles in R (Example)

In this article, we will use R to: 1. Get PMID numbers of relevant articles Let’s say we are interested in analyzing articles about lung cancer, published in PLOS Medicine, in the year 2022. Here’s the PubMed link that contains PMID (PubMed ID) numbers for these articles: If you visit this URL, you see an …

Plot Median and Interquartile Range in R

In this tutorial, we are going to create the following plot of the median and the interquartile range of sepal length for each iris species using the iris dataset: 1. Using geom_pointrange() We start by calculating the median, the 1st quartile, and the 3rd quartile as follows: Then we give these variables to geom_pointrange() in …

Using pivot_longer with names_sep and names_pattern in R

In this article, we will explain how to use the arguments names_sep and names_pattern of the function pivot_longer() from the tidyr package. First, let’s create some data: The general syntax of pivot_longer() is: 1. Using the names_sep argument in pivot_longer() When calling pivot_longer(), we can create 2 new columns (instead of 1) by splitting the …

Convert Columns to Rows in R

In this article, we will show 2 ways to convert columns into rows in R using the following data: 1. Using matrix transpose The function t() takes the data frame df and returns a matrix where columns and rows are switched: But we have 3 problems with this function: So let’s clean this output a …

How to Summarize Data in R (Using Dplyr)

In this article, we will cover how to apply the function summarize() from the dplyr package using the following data: 1. Summarizing a variable Use the following code to calculate the average age in our dataset: The summarize() function can take many arguments: 2. Grouping and summarizing Use the following code to calculate the average …

How to Deal with Violation of Normality of Errors in R

Linear regression assumes that error terms are normally distributed. This is especially important when we are using linear regression for prediction purposes and our sample size is small (see: Understand Linear Regression Assumptions). When the normality of errors assumption is violated, try: Let’s create some data to demonstrate these methods: Output: So we see that …

How to Deal with Heteroscedasticity in Regression in R

Linear regression assumes that the dispersion of data points around the regression line is constant. We can deal with violation of this assumption (i.e. with heteroscedasticity) by: Let’s create some heteroscedastic data to demonstrate these methods: Output: The residuals vs fitted values plot shows a fan shape, which is evidence of heteroscedasticity. (For more information, …

How to Deal with Violation of the Linearity Assumption in R

The most important assumption of linear regression is that the relationship between each predictor and the outcome is linear. When the linearity assumption is violated, try: Let’s create some non-linear data to demonstrate these methods: The residuals vs fitted values plot shows a curved relationship; therefore, the linearity assumption is violated. Solution #1: Adding a quadratic …

How to Check Linear Regression Assumptions in R

Linear regression has 4 assumptions: 1. How to check linearity Instead of checking the relationship between each predictor and the outcome in a multivariable model, we can plot the residuals versus the fitted values. The plot should show no discernible pattern: The output will look like one of the following: In this plot, R draws …

Understand Linear Regression Assumptions

The 4 assumptions of linear regression in order of importance are: 1. Linearity 1.1. Explanation The relationship between each predictor Xi and the outcome Y should be linear. 1.2. How to check the linearity assumption Instead of checking the relationship between each predictor Xi and the outcome Y in a multivariable model, we can plot …

Linear Regression in R (with a Categorical Variable)

In this article, we will run and interpret a linear regression model where the predictor is a categorical variable with multiple levels. Loading the Data We will use the chickwts dataset available in R. (These data come from an experiment where newly hatched chickens were randomly divided into 6 groups, each group receiving a different …
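As a rough sketch of the kind of model described above (not necessarily the article's exact code), the chickwts dataset that ships with base R can be passed straight to lm(). The variable names weight and feed come from the dataset itself; the object name model is only illustrative.

```r
# chickwts ships with base R: chick weights and the feed type each chick received
data(chickwts)

# feed is a factor with 6 levels, so lm() creates 5 dummy variables automatically
model <- lm(weight ~ feed, data = chickwts)

# The intercept is the mean weight of the reference level (the first factor level);
# every other coefficient is the difference in mean weight from that reference
coef(model)

# Mean weight of the reference group, for comparison with the intercept
ref_level <- levels(chickwts$feed)[1]
ref_mean  <- mean(chickwts$weight[chickwts$feed == ref_level])
```

Comparing ref_mean with the intercept is a quick way to confirm which feed R picked as the reference category.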

How to Run and Interpret a Logistic Regression Model in R

In this tutorial, we are going to run a logistic regression using the Titanic dataset available in R: 1. Logistic regression equation The formula $Survived \sim Age$ corresponds to the logistic regression equation: $\log\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 Age$ where $P$ is the probability of having the outcome, i.e. the probability of surviving. …

Logistic Regression in R (with Categorical Variables)

In this article, we will run and interpret a logistic regression model where the predictor is a categorical variable with multiple levels. Loading the data We will use the Titanic dataset available in R: Running a logistic regression model Next, we will use logistic regression to examine the effect of class (the predictor) on survival …

Plot Logistic Regression Decision Boundary in R

In this article, we will produce the following R plot that represents the decision boundary of a logistic regression model: Here’s the full code used to generate it: Code explanation First, we create some data (2 continuous variables x1 and x2, and 1 binary variable y) and run a logistic regression: Next, we will create …

Stepwise (Linear & Logistic) Regression in R

In this article, we will cover: Let’s start by creating some data: To run a stepwise regression, use the stepAIC function from the MASS library. 1. How to run forward stepwise linear regression Output: Call: lm(formula = X1 ~ X4 + X3 + X7, data = dat) Residuals: Min 1Q Median 3Q Max -0.52407 -0.23122 …

How to Deal with Multicollinearity in R

Multicollinearity occurs when there is a strong linear relationship between 2 or more predictors in a regression model. It is a problem because it increases the standard errors of the regression coefficients, leading to noisy estimates. Let’s simulate some data in R: We have a collinearity problem in our model since our variables’ VIFs (Variance …
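To make the VIF idea concrete, here is a minimal base R sketch on simulated data. The data, the variable names x1, x2, x3, and the helper vif_manual() are all assumptions made for illustration; the article's own simulation is not shown in this excerpt. The sketch relies on the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is almost a copy of x1 (strong collinearity)
x3 <- rnorm(n)                 # x3 is unrelated to x1 and x2

# Hypothetical helper: regress one predictor on the others and
# convert the resulting R-squared into a variance inflation factor
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))  # large: x1 is well explained by x2
vif_x3 <- vif_manual(x3, data.frame(x1, x2))  # close to 1: x3 is not collinear
```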

Weighted Regression: An Intuitive Introduction

Weighted regression (a.k.a. weighted least squares) is a regression model where each observation is given a certain weight that tells the software how important it should be in the model fit. Weighted regression can be used to: 1. Weighted regression to handle non-constant variance of error terms Linear regression assumes that the error terms have …

How to Report Interaction Effects in Regression

For a linear regression model: Y = β0 + β1X + β2Z + β3XZ + ε If the coefficient of the interaction term β3 is statistically significant, then there is evidence of an interaction between X and Z. This means that the effect of X on the outcome Y is different for different sub-categories of Z, …
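The model above implies that the effect of X on Y equals β1 + β3Z. A small simulation can illustrate this; the true coefficients, the binary Z, and the object names below are assumptions chosen for the example, not values from the article.

```r
set.seed(1)
n <- 500
X <- rnorm(n)
Z <- rbinom(n, 1, 0.5)  # binary moderator (assumed for simplicity)

# Simulated outcome with true values b0 = 1, b1 = 2, b2 = 0.5, b3 = 1.5
Y <- 1 + 2 * X + 0.5 * Z + 1.5 * X * Z + rnorm(n)

model <- lm(Y ~ X + Z + X:Z)
b <- coef(model)

# Effect of a 1-unit increase in X within each sub-category of Z
slope_z0 <- unname(b["X"])             # when Z = 0: beta1
slope_z1 <- unname(b["X"] + b["X:Z"])  # when Z = 1: beta1 + beta3
```

Reporting the two slopes separately is often easier for readers to follow than reporting the raw interaction coefficient alone.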

Linear Regression with Interaction in R

Output: Call: lm(formula = Y ~ X + Z + X:Z, data = dat) Residuals: Min 1Q Median 3Q Max -1.00058 -0.25209 0.00766 0.21640 0.89542 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.57717 0.11195 40.885 < 2e-16 *** X 0.44168 0.06551 6.742 3.38e-10 *** Z -1.23932 0.21937 -5.649 8.16e-08 *** X:Z 0.18859 0.03357 5.617 …

Interpret Log Transformations in Linear Regression

The following table summarizes how to interpret a linear regression model with logarithmic transformations: Transformation Model Interpretation No transformations Y = β0 + β1 X A 1 unit increase in X is associated with an average change of β1 units in Y. Log-transformed predictor Y = β0 + β1 log(X) A 1% increase in X …

Why Add & How to Interpret a Quadratic Term in Regression

Linear regression assumes that the relationship between the predictor X and the outcome Y is linear. If this assumption is not met, linear regression will be a poor fit to the data (as shown in the figure below). In this case, adding a quadratic term to the regression equation may help model the relationship between …
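A minimal sketch of adding a quadratic term, using simulated data (the data-generating values and object names are assumptions for illustration only):

```r
set.seed(1)
x <- runif(200, -3, 3)
y <- 1 + 2 * x + 3 * x^2 + rnorm(200)  # the true relationship is curved

linear_fit    <- lm(y ~ x)
quadratic_fit <- lm(y ~ x + I(x^2))    # I() makes ^ mean "square" inside a formula

# Compare how much variance each model explains
r2_linear    <- summary(linear_fit)$r.squared
r2_quadratic <- summary(quadratic_fit)$r.squared
```

With curved data like this, the quadratic model should explain noticeably more variance than the straight-line fit.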

Interpret Linear Regression Output in R

Here’s an example of linear regression in R: 1. Linear regression equation The formula $y \sim x + z$ corresponds to the regression equation: $y = β_0 + β_1x + β_2z$ where: 2. Residuals The residuals are the difference between the regression line that we fitted (using the predictors x and z) and the real …

8 Types of Treatment Effects Explained (with Examples)

When studying the effect of a treatment (or an intervention) on an outcome, we should keep in mind that it will probably not be the same for everyone. In other words, each person will likely experience a different effect of the same treatment — we say that the treatment has a heterogeneous effect. We can …

3 Real-World Examples of Using Instrumental Variables

The instrumental variable approach is a method to identify the causal effect of a treatment on an outcome of interest by controlling for unobserved confounding between them. A valid instrumental variable, Z, is one that influences the outcome, Y, through the treatment, X, without being related to the confounding variable, C, as shown in the …

When Does Correlation Imply Causation?

Short answer: Correlation implies causation when alternative explanations of the relationship between the correlated variables (such as confounding and bias) are removed (by appropriately modifying the study design) or controlled for (by adjusting for them in the statistical analysis). Explanation: Causation means that changing the treatment X for a person will affect the probability of …

Correlation Coefficient vs Regression Coefficient

Both the correlation and regression coefficients rely on the hypothesis that the data can be represented by a straight line. They are similar in many ways, but they serve different purposes. Here’s a table that summarizes the similarities and differences between the correlation coefficient, r, and the regression coefficient, β: Correlation coefficient: r Regression coefficient: …

An Example of Using Marginal and Conditional Distributions

The conditional distribution of a variable, for example heights, is the distribution of heights given the value of another variable, for example gender. Plotting the conditional distribution of heights given gender is a way of visualizing the relationship between the 2 variables. The marginal distribution of heights is the distribution of heights for everybody, independent …

Why Divide Sample Standard Deviation by n-1?

The problem The standard deviation is a measurement of the spread of the data — it is the average distance of the data from the mean. We are rarely interested in the amount of variation in our sample: the sample standard deviation is only useful as an approximation of the population standard deviation. When our …
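For reference, R's built-in sd() already divides by n - 1. The short check below, on a made-up vector, compares both versions of the formula:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up sample, mean = 5
n <- length(x)

# Sample standard deviation: divide the sum of squared deviations by n - 1
sd_n_minus_1 <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Dividing by n instead gives a smaller value
sd_n <- sqrt(sum((x - mean(x))^2) / n)

sd(x)  # matches sd_n_minus_1, not sd_n
```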

How to Describe/Summarize Numerical Data in R (Example)

Let’s start by creating IQ, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants: Next, we will (1) summarize this variable and (2) describe its distribution. 1. Summary statistics Reminder: The 1st quartile is the 25th percentile, …
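One way the variable described above could be simulated and summarized in base R is sketched below. The seed is an arbitrary choice made here for reproducibility; the article may use different code.

```r
set.seed(1)  # arbitrary seed so the results can be reproduced

# 100 participants, IQ normally distributed with mean 100 and SD 15
IQ <- rnorm(100, mean = 100, sd = 15)

summary(IQ)                       # min, 1st quartile, median, mean, 3rd quartile, max
sd(IQ)                            # sample standard deviation
quantile(IQ, c(0.25, 0.5, 0.75))  # quartiles, i.e. 25th, 50th, 75th percentiles
```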

How to Describe/Summarize Categorical Data in R (Example)

Let’s start by creating our own data, consisting of 2 categorical variables: gender and smoking: Next, we will create a frequency table and a bar plot to summarize these data one variable at a time, then we will create a contingency table and a stacked bar plot to describe the relationship between the 2 variables. …

Modulo Operator (%%) in R: Explained + Practical Examples

The modulo operator (%% in R) returns the remainder of the division of 2 numbers. Here are some examples: 5 %% 2 returns 1, because 2 goes into 5 two times and the remainder is 1 (i.e. 5 = 2 × 2 + 1). 4 %% 2 returns 0, since 4 = 2 × 2 …
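The examples above can be tried directly in the console. The helper is_even() is a hypothetical name added here to show one common use of the operator:

```r
5 %% 2  # 1, because 5 = 2 * 2 + 1
4 %% 2  # 0, because 4 = 2 * 2 + 0

# Hypothetical helper: a number is even when the remainder of division by 2 is 0
is_even <- function(n) n %% 2 == 0

is_even(c(1, 2, 3, 4))  # FALSE TRUE FALSE TRUE
```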

Find the Minimum and Maximum of a Function in R

The function optimize (also spelled optimise) in R returns the minimum or maximum of a function f(x) within a specified interval. It takes as inputs: f: a function. interval: a vector containing the lower and upper bounds of the domain where we want to search for the minimum or maximum. maximum: a logical, where TRUE …
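A short sketch of both uses of optimize(); the functions f and g are made-up examples, not taken from the article:

```r
# Made-up function with a minimum at x = 2
f <- function(x) (x - 2)^2 + 1
res_min <- optimize(f, interval = c(0, 5))
res_min$minimum    # x value where f is smallest (close to 2)
res_min$objective  # value of f at that point (close to 1)

# Made-up function with a maximum at x = 1
g <- function(x) -(x - 1)^2 + 4
res_max <- optimize(g, interval = c(-3, 3), maximum = TRUE)
res_max$maximum    # x value where g is largest (close to 1)
res_max$objective  # value of g at that point (close to 4)
```

Note that the result is approximate and only covers the interval supplied, so the bounds should bracket the region of interest.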

How to Handle Missing Data in Practice: Guide for Beginners

Handling missing data involves 2 steps: Determining the type of missing data, which can be: Missing completely at random (MCAR) Missing at random (MAR) Missing not at random (MNAR) Choosing a method to deal with these missing values, such as: Deleting variables (i.e. columns) that contain missing values Deleting observations (i.e. rows) whose values are …

Write a Function that Returns the nth Fibonacci Number in R

Challenge: Write a function in R that prints the nth Fibonacci number. Reminder: the Fibonacci sequence is: 1, 1, 2, 3, 5, 8, … So, the first Fibonacci number is 1, the second is also 1, and then each subsequent number is the sum of the previous 2 in the sequence. Solution: In this article, …
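One possible solution, written here as an iterative sketch (the article's own solution is not shown in this excerpt and may differ). The function name fib is an assumption:

```r
# Returns the nth Fibonacci number, with the sequence starting 1, 1, 2, 3, 5, 8, ...
fib <- function(n) {
  stopifnot(n >= 1)
  a <- 1  # Fibonacci number at position i - 2
  b <- 1  # Fibonacci number at position i - 1
  if (n > 2) {
    for (i in 3:n) {
      tmp <- a + b  # each number is the sum of the previous 2
      a <- b
      b <- tmp
    }
  }
  b
}

sapply(1:6, fib)  # 1 1 2 3 5 8
```

The loop avoids the repeated work of a naive recursive version, which recomputes the same values many times.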

Plot a Step Function in Base R and ggplot2

As an example of a step function, we will use the floor function floor(x) that takes a real number x and returns the greatest integer less than or equal to x. Coding the floor function in R: Plotting in base R Output: We can add the data points to the plot with the following code: …

Solve a Polynomial in R

A polynomial p(x) is an expression of the form: $p(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + \dots + a_nx^n$ where n is any non-negative integer. Solve a polynomial p(x) in R To solve the equation $p(x) = 0$ in R, we can use the function: polyroot. For example, let’s solve the equation: …
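As an illustration with a made-up equation (not the one used in the article), polyroot() takes the coefficients in increasing order of power, from a_0 up to a_n:

```r
# Made-up example: solve x^2 - 3x + 2 = 0
# Coefficients in increasing order: a_0 = 2, a_1 = -3, a_2 = 1
roots <- polyroot(c(2, -3, 1))

roots  # returned as complex numbers, even when the roots are real

# Keep the real parts when the imaginary parts are (numerically) zero
real_roots <- sort(Re(roots))
```

Here the equation factors as (x - 1)(x - 2), so the real parts should be 1 and 2 up to rounding error.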

Coding and Plotting a Piecewise Function in R

In this tutorial we are going to code the following function in R: $f(x) =\begin{cases}-x, & \text{if $x < -1$} \\x^2, & \text{if $x \geq -1$}\end{cases}$ And produce the following plot: Coding the piecewise function f(x) in R Using if else statements While this code is easy to read and understand, it does not support …
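Besides if else statements, one common way to code f(x) is the vectorized ifelse(), sketched below. This is an alternative offered for illustration and not necessarily the approach the article goes on to use.

```r
# f(x) = -x when x < -1, and x^2 when x >= -1
# ifelse() evaluates the condition element by element, so x can be a vector
f <- function(x) ifelse(x < -1, -x, x^2)

f(c(-3, -1, 0, 2))  # 3 1 0 4
```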

How to Plot a Quadratic Function in R

For the following quadratic function: $f(x) = x^2 + 2x - 20$ Here’s the plot that we want to produce: Coding the function f(x) in R A quadratic function is a function of the form: $ax^2 + bx + c$, where $a \neq 0$. So for $f(x) = x^2 + 2x - 20$: a = …

Find the Line Equation From 2 Points in R

Suppose we want to know the equation of the line that passes through 2 points A and B, such that: Quick solution Output: Call: lm(formula = ys ~ xs) Coefficients: (Intercept) xs 58.833 -2.417 So, the equation of the line that passes through A and B is: $f(x) = -2.417x + 58.833$ To get more …

Which Sampling Methods Are Most Commonly Used in Research?

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to check the popularity of different sampling methods and assess their correlation with the quality of research. I used the BioC API to download the data (see the References section below). Here’s a summary of …

7 Sampling Methods Explained Visually

In this article, we will cover 7 sampling methods, which we are going to divide into 2 types: probability sampling methods and non-probability sampling methods. Probability sampling methods involve random selection of participants, and therefore tend to produce unbiased samples; non-probability methods do not involve random selection of participants, and therefore are cheaper to apply, …

Writing Custom Functions in R

In this article, you will learn how to write your own functions in R. Specifically, we will cover: How to write a simple function How to write a more complex function How to write an anonymous function How to write a function with an unfixed number of arguments How to write a recursive function 1. …

How to Solve an Equation in R

In this article, we will use the uniroot.all() function from the rootSolve package to find all the solutions of an equation over a given interval (or domain). Input: uniroot.all() takes 2 arguments: a function f and an interval. How it works: It searches the interval for all possible roots of f. Output: uniroot.all() returns a vector …

Create and Graph Intervals in R

A quick review of intervals The open interval from a to b, denoted (a, b), consists of all numbers between a and b excluding the endpoints a and b. Open circles in the graph indicate that the endpoints are excluded: The closed interval from a to b, denoted [a, b], consists of all numbers between …

Working with Sets in R (Tutorial)

A set is an unordered collection of unique elements. It is helpful to keep track of distinct objects. In this tutorial, you will learn how to: Create a set Manipulate sets Work with subsets Apply set operations 1. Create a set 1.1. Create a set from scratch A set can contain different types of elements, …
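A minimal base R sketch of the four topics listed above, using plain vectors as sets. The vectors A and B are made-up examples, and the article itself may rely on different tools:

```r
# Create a set: unique() drops duplicated elements
A <- unique(c(1, 2, 2, 3))  # 1 2 3
B <- c(3, 4, 5)

# Set operations
union(A, B)      # 1 2 3 4 5
intersect(A, B)  # 3
setdiff(A, B)    # 1 2 (elements of A that are not in B)

# Membership and subset checks
2 %in% A             # TRUE
all(c(1, 2) %in% A)  # TRUE: {1, 2} is a subset of A

# Order does not matter when comparing sets
setequal(c(3, 1, 2), A)  # TRUE
```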

5 Real-World Examples of Confounding [With References]

An association between 2 variables X and Y cannot be interpreted as causal if it can be attributed to an alternative mechanism. Confounding is an example of such a mechanism: it alters the relationship between X and Y, and therefore leads to an over- or underestimation of the true effect between them. In its simplest form, …

Front-Door Criterion to Adjust for Unmeasured Confounding

Suppose we conducted an observational study to estimate the causal effect of some depression treatment on the quality of life of patients: The problem is that the relationship between the two is confounded by the severity of depression: The arrows in the diagram reflect causal associations: The arrow from “depression severity” to “treatment” reflects the …

How to Start a Discussion Section in Research? [with Examples]

The examples below are from 72,017 full-text PubMed research papers that I analyzed in order to explore common ways to start writing the Discussion section. Research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to …

How to Start a Methods Section in Research? [with Examples]

The examples below are from 76,350 full-text PubMed research papers that I analyzed in order to explore common ways to start the Materials and Methods section. Research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. I used the BioC API to download …

How to Start an Abstract? Examples from 94,745 Research Papers

The examples below are from 94,745 full-text PubMed research papers that I analyzed in order to explore common ways to start writing the Abstract. Research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to download …

How to Start an Introduction? Examples from 98,093 Research Papers

The examples below are from 98,093 full-text PubMed research papers that I analyzed in order to explore common ways to start the Introduction section. The research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to …

How to Start a Conclusion in Research? [With Examples]

The examples below are from 47,803 full-text PubMed research papers that I analyzed in order to explore common ways to start a Conclusion section. The research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to …

How to Start a Research Title? Examples from 105,975 Titles

I analyzed a random sample of 105,975 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore common ways to start a research title. I used the BioC API to download the data (see the References section below). Common ways to start a title The most common 3-word …

Meta-Analysis Software Popularity in 1,321 Research Papers

I analyzed a random sample of 1,957 meta-analysis full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to check the popularity of meta-analysis software packages among medical researchers. (I used the BioC API to download the articles; see the References section below.) Out of these 1,957 meta-analysis papers, …

Does the Number of Authors Matter? Data from 101,580 Research Papers

I analyzed a random sample of 101,580 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore the influence of the number of authors of a research paper on its quality. I used the BioC API to download the data (see the References section below). Here’s a summary …

How to Write & Publish a Research Paper: Step-by-Step Guide

This guide is far more than a list of instructions on what to include in each section of your research paper. In fact, we will: Use a research paper I wrote specifically as an example to illustrate the key ideas in this guide (link to the full-text PDF of the research paper). Use real-world data …

“I” & “We” in Academic Writing: Examples from 9,830 Studies

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore whether, and how, first-person pronouns are used in the scientific literature. I used the BioC API to download the data (see the References section below). Popularity of first-person pronouns in the …

Paragraph Length: Data from 9,830 Research Papers

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to answer the question: How long should a paragraph be in a research paper? I used the BioC API to download the data (see the References section below). Paragraph length Our sample of 9,830 research …

Can a Research Title Be a Question? Real-World Examples

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to answer the questions: Can a research title be a question? And do questions make good titles? I used the BioC API to download the data (see the References section below). Popularity of question titles …

How Long Should a Research Title Be? Data from 104,161 Examples

I analyzed a random sample of 104,161 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, to learn more about title length. I used the BioC API to download the data (see the References section below). Here’s a summary of the key findings: 1. The median title was 14 words long …

How Long Should the Abstract Be? Data from 61,429 Examples

I analyzed a random sample of 61,429 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of an abstract? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s a …

How Long Should the Discussion Section Be? Data from 61,517 Examples

I analyzed a random sample of 61,517 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a discussion section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Results Section Be? Data from 61,458 Examples

I analyzed a random sample of 61,458 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a results section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Methods Section Be? Data from 61,514 Examples

I analyzed a random sample of 61,514 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a methods section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Introduction of a Research Paper Be? Data from 61,518 Examples

I analyzed a random sample of 61,518 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of an introduction section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

5 Variable Transformations to Improve Your Regression Model

In this article, we will discuss how you can use the following transformations to build better regression models: log transformation, square root transformation, polynomial transformation, standardization, and centering by subtracting the mean. Compared to fitting a model using variables in their raw form, transforming them can help: Make the model’s coefficients more interpretable. Meet the model’s …

Interpret Interactions in Linear Regression

For a linear regression model with interaction: Y = β0 + β1 X1 + β2 X2 + β3 X1X2, the coefficient of the interaction term (β3) is the change in the effect of X1 on Y for a 1-unit increase in X2, and vice versa. For example: Suppose we used linear regression to study the effect of physical …

Interpret the Linear Regression Intercept

For a linear regression model: Y = β0 + β1 X The linear regression intercept β0 is the predicted value of the outcome Y when the predictor X equals zero. As an example, we will try to interpret the intercept β0 = 78.66 in the following linear regression model: Heart Rate = 78.66 + 2.94 …

7 Different Ways to Control for Confounding

Confounding can be controlled in the design phase of the study by using random assignment, restriction, or matching, or in the data analysis phase by using stratification, regression, inverse probability weighting, or instrumental variable estimation. Here’s a quick summary of the similarities and differences between these methods: Study Phase | Method | Can easily control for multiple confounders | Can …

4 Simple Ways to Identify Confounding

A variable is a confounder if it satisfies one of the following conditions: It has been proven so in previous studies. Adjusting for it produces more than 10% change in the relationship between the exposure and the outcome. It is associated with both the exposure and the outcome, without being on the causal pathway between …

An Example of Identifying and Adjusting for Confounding

Suppose we are interested in studying whether smoking increases heart rate. Because it would not be ethical to randomly assign people to smoke, we are stuck with an observational design where we have to deal with bias and confounding ourselves. The questions that we are going to be concerned with in this article are: Which …

Why Confounding is Not a Type of Bias

Bias is an error in the estimation of an association between an exposure and an outcome due to a flaw in the design or conduct of the study. Confounding, on the other hand, is a real but non-causal association between the exposure and the outcome. Although their mechanisms are different, both bias and confounding can …

Using the 4 D-Separation Rules to Study a Causal Association

Suppose we want to study whether coffee causes cancer, which we will represent as follows: Randomizing people to either consume coffee or not for many years in order to study its effect on cancer is neither ethical nor practical. So we have to use an observational design, where we would have to deal with bias and …

List of All Biases [Sorted by Popularity in Research Papers]

Lessons from Research Papers, Study Design

I analyzed the content of 98,709 randomly chosen research papers from PubMed to learn more about bias. Specifically, I wanted to do 2 things: Rank 64 types of biases by popularity, in order to determine on which ones professional researchers focus the most in practice. Test the hypothesis that addressing bias issues is a sign …

What is a Good R-Squared Value? [Based on Real-World Data]

Data Analysis, Lessons from Research Papers

I analyzed the content of 43,110 randomly chosen research papers from PubMed to learn more about R-squared. Specifically, I wanted to answer the following questions: What is a good value for R-squared? What is a low value for R-squared? Is a higher R-squared always better? Is a low R-squared necessarily bad? Let’s start with a …

Statistical Power: What It Is and How It Is Used in Practice

Statistical power is a measure of study efficiency, calculated before conducting the study to estimate the chance of discovering a true effect rather than obtaining a false negative result, or worse, overestimating the effect by detecting the noise in the data. Here are 5 seemingly different, but actually similar, ways of describing statistical power: Definition …

Solomon Four-Group Design: An Introduction

The Solomon four-group design is a type of experiment where participants get randomly assigned to 1 of 4 groups that differ in whether the participants receive the treatment or not, and whether the outcome of interest is measured once or twice in each group. The four groups in this design are (see figure below): …

Matched Pairs Design vs Randomized Block Design

In a matched pairs design, treatment options are randomly assigned to pairs of similar participants, whereas in a randomized block design, treatment options are randomly assigned to groups of similar participants. The objective of both is to balance baseline confounding variables by distributing them evenly between the treatment and the control group. Matched pairs design …

Randomized Block Design vs Completely Randomized Design

A randomized block design differs from a completely randomized design by ensuring that an important predictor of the outcome is evenly distributed between study groups in order to force them to be balanced, something that a completely randomized design cannot guarantee. A completely randomized design uses simple randomization to assign participants to different treatment options …

Identify Variable Types in Statistics (with Examples)

Here’s a table that summarizes the types of variables:
Quantitative (a.k.a. Numerical):
Continuous: Consists of numerical values that can be measured but not counted.
Discrete: Consists of numerical values that can be counted.
Qualitative (a.k.a. Categorical):
Ordinal: Consists of text or labels that have a logical order.
Nominal: Consists of text or labels that …

Purpose and Limitations of Random Assignment

In an experimental study, random assignment is a process by which participants are assigned, with the same chance, to either a treatment or a control group. The goal is to ensure an unbiased assignment of participants to treatment options. Random assignment is considered the gold standard for achieving comparability across study groups, and therefore is …

Pretest-Posttest Control Group Design: An Introduction

The pretest-posttest control group design, also called the pretest-posttest randomized experimental design, is a type of experiment where participants get randomly assigned to either receive an intervention (the treatment group) or not (the control group). The outcome of interest is measured 2 times, once before the treatment group gets the intervention — the pretest — …

Assess Variable Importance in Linear and Logistic Regression

In this article, we will be concerned with the following question: Given a regression model, which of the predictors X1, X2, X3, etc. has the most influence on the outcome Y? In general, assessing the relative importance of predictors by directly comparing their (unstandardized) regression coefficients is not a good idea because: For numerical predictors: …

Separate-Sample Pretest-Posttest Design: An Introduction