George Choueiry

I am George Choueiry, PharmD, MPH, and a PhD student in epidemiology.

Extract Numbers from Strings in R

The functions parse_integer(), parse_double(), and parse_number() from the readr library transform a character vector into a numeric vector. Here’s an example that compares these 3 functions: Exercises 1. Extract the number 1000000 from “1 000 000” Not all characters in this string can be transformed into an integer (since we have white spaces), so we …
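
As a rough illustration of the exercise above, here is a base-R sketch that does not use readr. It is a swapped-in approach, not the readr solution from the article: it drops every character that is not a digit or a decimal point and then converts what remains. The helper name extract_number is hypothetical.

```r
# Hypothetical helper (not part of readr): keep digits and decimal points,
# drop everything else, then convert to numeric.
# Limitation of this sketch: signs and exponents are ignored.
extract_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

x <- "1 000 000"

# Direct conversion fails because of the white spaces and returns NA
direct <- suppressWarnings(as.integer(x))

# Stripping the spaces first gives the number we want
number <- extract_number(x)
```

Here direct is NA, while number holds the value 1000000.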

How to Deal with Violation of Normality of Errors in R

Linear regression assumes that error terms are normally distributed. This is especially important when we are using linear regression for prediction purposes and our sample size is small (see: Understand Linear Regression Assumptions). When the normality of errors assumption is violated, try: Let’s create some data to demonstrate these methods: Output: So we see that …
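
To make the problem concrete, here is a minimal sketch of data that violates the assumption, together with one common check. The simulated data, the seed, and the choice of a shifted exponential error distribution are assumptions made for this example, not the data used in the article.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)

# Assumption for this sketch: errors follow a shifted exponential
# distribution (mean 0, strongly right-skewed), so they are not normal.
errors <- rexp(n, rate = 1) - 1
y <- 2 + 3 * x + errors

model <- lm(y ~ x)
res <- residuals(model)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value

# Visual check: with normal residuals the points would follow the line
# qqnorm(res); qqline(res)
```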

How to Deal with Heteroscedasticity in Regression in R

Linear regression assumes that the dispersion of data points around the regression line is constant. We can deal with violation of this assumption (i.e. with heteroscedasticity) by: Let’s create some heteroscedastic data to demonstrate these methods: Output: The residuals vs fitted values plot shows a fan shape, which is evidence of heteroscedasticity. (For more information, …
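
A minimal sketch of what heteroscedastic data can look like, assuming an error standard deviation that grows with the predictor (the data-generating process, seed, and variable names are made up for this example):

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)

# Assumption: the spread of the errors increases with x
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = x)

model <- lm(y ~ x)

# The residuals vs fitted values plot would show the fan shape
# plot(fitted(model), residuals(model))

# Numeric summary of the fan shape: the size of the residuals
# increases with the fitted values
spread_trend <- cor(fitted(model), abs(residuals(model)))
spread_trend
```

A clearly positive spread_trend is the numeric counterpart of the fan shape in the plot.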

How to Deal with Violation of the Linearity Assumption in R

The most important assumption of linear regression is that the relationship between each predictor and the outcome is linear. When the linearity assumption is violated, try: Let’s create some non-linear data to demonstrate these methods: The residuals vs fitted values plot shows a curved relationship, therefore, the linearity assumption is violated. Solution #1: Adding a quadratic …
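
The first solution mentioned above, adding a quadratic term, might look like the following sketch. The simulated data and coefficient values are assumptions chosen for illustration.

```r
set.seed(1)
x <- runif(200, -3, 3)

# Assumption: the true relationship between x and y is curved
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(200)

# Straight-line model: misses the curvature
linear_fit <- lm(y ~ x)

# Same model with a quadratic term; I() protects ^ inside a formula
quadratic_fit <- lm(y ~ x + I(x^2))

r2_linear <- summary(linear_fit)$r.squared
r2_quadratic <- summary(quadratic_fit)$r.squared

# Residuals vs fitted values for each model
# plot(linear_fit, which = 1); plot(quadratic_fit, which = 1)
```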

How to Run and Interpret a Logistic Regression Model in R

In this tutorial, we are going to run a logistic regression using the Titanic dataset available in R: 1. Logistic regression equation The formula \(Survived \sim Age\) corresponds to the logistic regression equation: \(\log\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 Age\) Where \(P\) is the probability of having the outcome, i.e. the probability of surviving. …
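
A minimal sketch of fitting this model, assuming the built-in Titanic table from base R (the article may use a different version of the data). In this table Age is a two-level factor (Child, Adult) rather than a number, so \(\beta_1\) here is the log odds ratio of survival for adults versus children.

```r
# The built-in Titanic object is a 4-way contingency table;
# convert it to one row per combination with a Freq count column
titanic_df <- as.data.frame(Titanic)

# Logistic regression of survival on age group,
# weighting each row by the number of passengers it represents
fit <- glm(Survived ~ Age, family = binomial,
           weights = Freq, data = titanic_df)

# Coefficients on the log-odds scale
coef(fit)

# Exponentiate to get the odds ratio for adults vs children
odds_ratio <- exp(coef(fit)[["AgeAdult"]])
odds_ratio
```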

Correlation Coefficient vs Regression Coefficient

Both the correlation and regression coefficients rely on the hypothesis that the data can be represented by a straight line. They are similar in many ways, but they serve different purposes. Here’s a table that summarizes the similarities and differences between the correlation coefficient, r, and the regression coefficient, β: Correlation coefficient: r Regression coefficient: …
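
One way to see how closely the two are related: in simple linear regression, the slope equals the correlation rescaled by the ratio of standard deviations, \(\beta = r \cdot s_y / s_x\). The sketch below checks this on the built-in mtcars data; the choice of variables is arbitrary and only for illustration.

```r
x <- mtcars$wt   # car weight
y <- mtcars$mpg  # miles per gallon

# Correlation coefficient: unitless, between -1 and 1
r <- cor(x, y)

# Regression coefficient: change in y per 1-unit change in x
beta <- coef(lm(y ~ x))[["x"]]

# The slope recovered from the correlation
beta_from_r <- r * sd(y) / sd(x)

c(r = r, beta = beta, beta_from_r = beta_from_r)
```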

An Example of Using Marginal and Conditional Distributions

The conditional distribution of a variable, for example heights, is the distribution of heights given the value of another variable, for example gender. Plotting the conditional distribution of heights given gender is a way of visualizing the relationship between the 2 variables. The marginal distribution of heights is the distribution of heights for everybody, independent …
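
A small sketch of the distinction, using simulated heights. The group means, standard deviation, and sample sizes are made-up values for illustration, not real measurements.

```r
set.seed(1)

# Simulated data: heights in cm for two groups (assumed parameters)
heights <- data.frame(
  gender = rep(c("female", "male"), each = 500),
  height = c(rnorm(500, mean = 162, sd = 7),
             rnorm(500, mean = 176, sd = 7))
)

# Marginal distribution: heights for everybody, ignoring gender
marginal_mean <- mean(heights$height)

# Conditional distributions: heights given each value of gender
conditional_means <- tapply(heights$height, heights$gender, mean)

# Plots of the marginal and conditional distributions
# hist(heights$height)
# boxplot(height ~ gender, data = heights)
```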

How to Handle Missing Data in Practice: Guide for Beginners

Handling missing data involves 2 steps: (1) Determining the type of missing data, which can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). (2) Choosing a method to deal with these missing values, such as: deleting variables (i.e. columns) that contain missing values; deleting observations (i.e. rows) whose values are …
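
The two deletion methods named above can be sketched in base R as follows. The small data frame is invented for this example.

```r
# Toy data with missing values (made up for illustration)
df <- data.frame(
  age    = c(25, 31, NA, 47),
  weight = c(70, NA, 82, 90),
  height = c(170, 165, 180, 175)
)

# Number of missing values per variable
missing_per_column <- colSums(is.na(df))

# Deleting variables (columns) that contain missing values
drop_cols <- df[, missing_per_column == 0, drop = FALSE]

# Deleting observations (rows) that contain missing values
drop_rows <- na.omit(df)
```

In this toy example, column deletion keeps only height, while row deletion keeps the 2 complete rows.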

Solve a Polynomial in R

A polynomial p(x) is an expression of the form: \(p(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + \dots + a_nx^n\) Where n is any non-negative integer. Solve a polynomial p(x) in R To solve the equation \(p(x) = 0\) in R, we can use the function: polyroot. For example, let’s solve the equation: …
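
As a quick sketch of how polyroot() is called, here is a polynomial chosen for this illustration (it is not the equation from the article): \(p(x) = 2 - 3x + x^2 = (x - 1)(x - 2)\). polyroot() takes the coefficients in increasing order, \(a_0, a_1, \dots, a_n\), and returns complex numbers.

```r
# Coefficients of p(x) = 2 - 3x + x^2, in increasing order of power
coefficients <- c(2, -3, 1)

# polyroot() returns the roots as complex numbers
roots <- polyroot(coefficients)

# Here the imaginary parts are (numerically) zero,
# so we keep only the real parts
real_roots <- sort(Re(roots))
real_roots
```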

How to Solve an Equation in R

In this article, we will use the uniroot.all() function from the rootSolve package to find all the solutions of an equation over a given interval (or domain). Input: uniroot.all() takes 2 arguments: a function f and an interval. How it works: It searches the interval for all possible roots of f. Output: uniroot.all() returns a vector …
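
To give an intuition for this kind of search, here is a base-R sketch of the general idea: split the interval into small pieces, look for sign changes of f, and refine each one with uniroot(). The function find_all_roots and its parameter n are hypothetical names for this illustration; this is not the rootSolve implementation, and like any sign-change search it can miss roots where f touches zero without crossing it.

```r
# Hypothetical helper illustrating the idea (not rootSolve's code).
# f must be vectorized; interval is c(lower, upper).
find_all_roots <- function(f, interval, n = 100) {
  # Split the interval into n equal sub-intervals
  grid <- seq(interval[1], interval[2], length.out = n + 1)
  vals <- f(grid)

  # Grid points that happen to be exact roots
  exact <- grid[vals == 0]

  # Sub-intervals where f changes sign contain a root
  idx <- which(vals[-1] * vals[-length(vals)] < 0)

  # Refine each bracketed root with base R's uniroot()
  bracketed <- vapply(idx, function(i) {
    uniroot(f, lower = grid[i], upper = grid[i + 1], tol = 1e-9)$root
  }, numeric(1))

  sort(c(exact, bracketed))
}

# Example: x^2 - 1 = 0 has the roots -1 and 1 on [-3, 3]
f <- function(x) x^2 - 1
roots <- find_all_roots(f, c(-3, 3))
roots
```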

Front-Door Criterion to Adjust for Unmeasured Confounding

Suppose we conducted an observational study to estimate the causal effect of some depression treatment on the quality of life of patients: The problem is that the relationship between the two is confounded by the severity of depression: The arrows in the diagram reflect causal associations: The arrow from “depression severity” to “treatment” reflects the …

How to Start an Introduction? Examples from 98,093 Research Papers

The examples below are from 98,093 full-text PubMed research papers that I analyzed in order to explore common ways to start the Introduction section. The research papers included in this analysis were selected at random from those uploaded to PubMed Central between the years 2016 and 2021. Note that I used the BioC API to …

Meta-Analysis Software Popularity in 1,321 Research Papers

I analyzed a random sample of 1,957 meta-analysis full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to check the popularity of meta-analysis software packages among medical researchers. (I used the BioC API to download the articles; see the References section below). Out of these 1,957 meta-analysis papers, …

Does the Number of Authors Matter? Data from 101,580 Research Papers

I analyzed a random sample of 101,580 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore the influence of the number of authors of a research paper on its quality. I used the BioC API to download the data (see the References section below). Here’s a summary …

“I” & “We” in Academic Writing: Examples from 9,830 Studies

I analyzed a random sample of 9,830 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to explore whether and how first-person pronouns are used in the scientific literature. I used the BioC API to download the data (see the References section below). Popularity of first-person pronouns in the …

How Long Should the Discussion Section Be? Data from 61,517 Examples

I analyzed a random sample of 61,517 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a discussion section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Results Section Be? Data from 61,458 Examples

I analyzed a random sample of 61,458 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a results section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Methods Section Be? Data from 61,514 Examples

I analyzed a random sample of 61,514 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of a methods section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

How Long Should the Introduction of a Research Paper Be? Data from 61,518 Examples

I analyzed a random sample of 61,518 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical length of an introduction section? And which factors influence it? I used the BioC API to download the data (see the References section below). Here’s …

5 Variable Transformations to Improve Your Regression Model

In this article, we will discuss how you can use the following transformations to build better regression models: log transformation, square root transformation, polynomial transformation, standardization, and centering by subtracting the mean. Compared to fitting a model using variables in their raw form, transforming them can help: Make the model’s coefficients more interpretable. Meet the model’s …
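
For orientation, each of these transformations is a one-liner in base R. The sketch below applies them to the built-in mtcars data; the choice of dataset, variables, and model is an assumption for illustration only.

```r
d <- mtcars

# Log and square root transformations (for positive, right-skewed variables)
d$log_hp  <- log(d$hp)
d$sqrt_hp <- sqrt(d$hp)

# Centering by subtracting the mean
d$wt_centered <- d$wt - mean(d$wt)

# Standardization: subtract the mean, then divide by the standard deviation
d$wt_standardized <- as.numeric(scale(d$wt))

# Polynomial transformation: a quadratic term added inside the formula
fit <- lm(mpg ~ log_hp + wt_centered + I(wt_centered^2), data = d)
coef(fit)
```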

7 Different Ways to Control for Confounding

Confounding can be controlled in the design phase of the study by using random assignment, restriction, or matching; or in the data analysis phase by using stratification, regression, inverse probability weighting, or instrumental variable estimation. Here’s a quick summary of the similarities and differences between these methods: Study Phase Method Can easily control for multiple confounders Can …

List of All Biases [Sorted by Popularity in Research Papers]

I analyzed the content of 98,709 randomly chosen research papers from PubMed to learn more about bias. Specifically, I wanted to do 2 things: (1) Rank 64 types of biases by popularity, in order to determine which ones professional researchers focus on the most in practice. (2) Test the hypothesis that addressing bias issues is a sign …

What is a Good R-Squared Value? [Based on Real-World Data]

I analyzed the content of 43,110 randomly chosen research papers from PubMed to learn more about R-squared. Specifically, I wanted to answer the following questions: What is a good value for R-squared? What is a low value for R-squared? Is a higher R-squared always better? Is a low R-squared necessarily bad? Let’s start with a …

Statistical Power: What It Is and How It Is Used in Practice

Statistical power is a measure of study efficiency, calculated before conducting the study to estimate the chance of discovering a true effect rather than obtaining a false negative result, or worse, overestimating the effect by detecting the noise in the data. Here are 5 seemingly different, but actually similar, ways of describing statistical power: Definition …
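
Since power is calculated before conducting the study, a typical calculation can be sketched with power.t.test() from base R's stats package. The design (a two-sample t-test), the effect size of 0.5 standard deviations, and the 5% significance level are assumptions picked for this example.

```r
# Power of a two-sample t-test with 64 participants per group,
# assuming a true difference of 0.5 standard deviations
res <- power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)
res$power  # about 0.80

# The reverse question: how many participants per group
# are needed to reach 80% power for the same effect?
n_needed <- power.t.test(power = 0.8, delta = 0.5, sd = 1,
                         sig.level = 0.05)$n
ceiling(n_needed)
```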

Matched Pairs Design vs Randomized Block Design

In a matched pairs design, treatment options are randomly assigned to pairs of similar participants, whereas in a randomized block design, treatment options are randomly assigned to groups of similar participants. The objective of both is to balance baseline confounding variables by distributing them evenly between the treatment and the control group. Matched pairs design …

Randomized Block Design vs Completely Randomized Design

A randomized block design differs from a completely randomized design by ensuring that an important predictor of the outcome is evenly distributed between study groups in order to force them to be balanced, something that a completely randomized design cannot guarantee. A completely randomized design uses simple randomization to assign participants to different treatment options …
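
The difference between the two assignment schemes can be sketched as follows. The 12 participants, the blocking variable (sex), and the seed are invented for this example.

```r
set.seed(42)

# Toy sample: 12 participants, 6 per block (made up for illustration)
participants <- data.frame(
  id  = 1:12,
  sex = rep(c("F", "M"), each = 6)
)

# Completely randomized design: shuffle the treatment labels
# over all participants, ignoring sex
participants$simple <- sample(rep(c("treatment", "control"), times = 6))

# Randomized block design: shuffle the labels separately within each block,
# so every block gets the same number of treated and control participants
participants$blocked <- NA_character_
for (b in unique(participants$sex)) {
  rows <- which(participants$sex == b)
  participants$blocked[rows] <-
    sample(rep(c("treatment", "control"), length.out = length(rows)))
}

# Balance of the blocking variable under each design
table(participants$sex, participants$simple)
table(participants$sex, participants$blocked)
```

Under the blocked assignment each sex contributes exactly 3 participants to each group, whereas the simple assignment only balances the groups on average.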

Identify Variable Types in Statistics (with Examples)

Here’s a table that summarizes the types of variables: Types of variables Quantitative (a.k.a. Numerical) Qualitative (a.k.a. Categorical) Continuous Discrete Ordinal Nominal Consists of numerical values that can be measured but not counted. Consists of numerical values that can be counted. Consists of text or labels that have a logical order. Consists of text or labels that …

Pretest-Posttest Control Group Design: An Introduction

The pretest-posttest control group design, also called the pretest-posttest randomized experimental design, is a type of experiment where participants get randomly assigned to either receive an intervention (the treatment group) or not (the control group). The outcome of interest is measured 2 times, once before the treatment group gets the intervention — the pretest — …

Assess Variable Importance in Linear and Logistic Regression

In this article, we will be concerned with the following question: Given a regression model, which of the predictors X1, X2, X3, etc. has the most influence on the outcome Y? In general, assessing the relative importance of predictors by directly comparing their (unstandardized) regression coefficients is not a good idea because: For numerical predictors: …

Separate-Sample Pretest-Posttest Design: An Introduction

The separate-sample pretest-posttest design is a type of quasi-experiment where the outcome of interest is measured 2 times: once before and once after an intervention, each time on a separate group of randomly chosen participants. The difference between the pretest and posttest measures will estimate the intervention’s effect on the outcome. The intervention can be: …
