# George Choueiry

I am Georges Choueiry, PharmD, MPH, PhD student in epidemiology.

## One-Group Pretest-Posttest Design: An Introduction

The one-group pretest-posttest design is a type of quasi-experiment in which the outcome of interest is measured twice: once before and once after exposing a non-random group of participants to a certain intervention/treatment. The objective is to evaluate the effect of that intervention, which can be: a training program, a policy change, a medical …

## Interpret Poisson Regression Coefficients

The Poisson regression coefficient β associated with a predictor X is the expected change, on the log scale, in the outcome Y per unit change in X. So holding all other variables in the model constant, increasing X by 1 unit (or going from 1 level to the next) multiplies the rate of Y by …
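As a minimal sketch of the interpretation above: because β lives on the log scale, exponentiating it gives the multiplicative change in the rate (a standard property of the log link). The coefficient value below is hypothetical and is not taken from the article.

```python
import math


def rate_ratio(beta: float, delta_x: float = 1.0) -> float:
    """Multiplicative change in the expected rate of Y for a delta_x change in X."""
    return math.exp(beta * delta_x)


beta = 0.2  # hypothetical Poisson coefficient, for illustration only
rr = rate_ratio(beta)
print(f"Rate ratio per 1-unit increase in X: {rr:.3f}")
print(f"Percent change in the rate: {(rr - 1) * 100:.1f}%")
```

A coefficient of 0 gives a rate ratio of 1 (no change), and a negative coefficient gives a rate ratio below 1.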

## Regression Tree vs Linear Regression

Both the linear regression and the regression tree models take as input 1 or more predictors (Xi) and their goal is to explain their relationship with the outcome (Y). For simplicity, we will consider the case of modeling Y using only 1 predictor X. Linear regression tries to find the equation of the line that …

## How to Report a Random Forest Model

In this article we discuss:

- How to report the use of a random forest model
- How to report the results of a random forest model

1. How to report the use of a random forest model

The following information should be mentioned in the METHODS section of your research paper: The reason why you chose …

## How to Report a Chi-Square Test

The 3 main types of Chi-square tests are:

- Chi-square goodness-of-fit test: used to compare the distribution of a categorical variable (with more than 2 levels) to a hypothetical distribution.
- Chi-square homogeneity test: used to test whether 2 groups (coming from 2 different samples) have the same distribution regarding a certain categorical variable.
- Chi-square independence test: …

## How to Report a Chi-Square Independence Test

The Chi-square independence test is used to test whether 2 categorical variables, each having 2 or more categories, are dependent or independent of each other.

- The null hypothesis H0 states that the 2 variables are independent (i.e. knowing the value of one does not tell us anything about the other).
- The alternative hypothesis H1 states …
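To make the test concrete, here is a minimal sketch of how the Chi-square statistic for independence is usually computed: under H0, the expected count of each cell is (row total × column total) / grand total. The contingency table is made up for illustration, and the function name is an assumption, not something from the article.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the 2 variables were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


observed = [[30, 20], [20, 30]]  # hypothetical 2x2 table of counts
stat, df = chi_square_independence(observed)
print(f"Chi-square = {stat:.2f}, df = {df}")
```

The p-value would then be read from a Chi-square distribution with `df` degrees of freedom, which is left out here to keep the sketch dependency-free.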

## How to Report a Chi-Square Goodness-of-Fit Test

A Chi-square goodness-of-fit test is used to evaluate the distribution of a categorical variable with more than 2 levels/categories against a theoretical one. Simply put, we would like to compare the counts in each level of this categorical variable with the counts that we expect to find given some hypothesis. Therefore, the objective of this …
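The comparison between observed and expected counts can be sketched as follows. This is an illustrative example under assumed data: the observed counts and the hypothesized proportions are invented, and `chi_square_gof` is a hypothetical helper name.

```python
def chi_square_gof(observed, expected_props):
    """Return (statistic, degrees of freedom) for a goodness-of-fit test."""
    n = sum(observed)
    statistic = 0.0
    for count, prop in zip(observed, expected_props):
        expected = n * prop  # count expected under the hypothesis
        statistic += (count - expected) ** 2 / expected
    df = len(observed) - 1
    return statistic, df


observed = [50, 30, 20]          # hypothetical counts in 3 categories
hypothesis = [1 / 3, 1 / 3, 1 / 3]  # assumed: all categories equally likely
stat, df = chi_square_gof(observed, hypothesis)
print(f"Chi-square = {stat:.2f}, df = {df}")
```

The larger the statistic, the further the observed counts are from what the hypothesis predicts.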

## How to Report the Shapiro-Wilk Test

The Shapiro-Wilk test is a statistical test used to check if a continuous variable follows a normal distribution. The null hypothesis (H0) states that the variable is normally distributed, and the alternative hypothesis (H1) states that the variable is NOT normally distributed. So after running this test: If p ≤ 0.05: then the null hypothesis …

## How to Report Stepwise Regression

In this article we will discuss: 1. Reporting the use of stepwise regression The following information should be mentioned in the METHODS section of the research paper: (For an easy explanation of the stopping rule and a step-by-step description of how stepwise selection works, I recommend my other article: Understand Forward and Backward Stepwise Regression) …

## Checking the Popularity of 125 Statistical Tests and Models

I analyzed the methods sections of 43,110 randomly chosen research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to check the popularity of 125 statistical methods in medical research. I used the BioC API to download the articles (see the References section below). Here’s a summary of the key findings …

## How Long Should a Research Paper Be? Data from 61,519 Examples

I analyzed a random sample of 61,519 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: What is the typical overall length of a research paper? and how long should each section be? I used the BioC API to download the data (see the References …

## Interpret Linear Regression Coefficients

For a simple linear regression model:

Y = β0 + β1X + ε

The linear regression coefficient β1 associated with a predictor X is the expected difference in the outcome Y when comparing 2 groups that differ by 1 unit in X. Another common interpretation of β1 is: β1 is the expected change in the outcome …

## Interpret the Logistic Regression Intercept

Here’s the equation of a logistic regression model with 1 predictor X:

log(P / (1 − P)) = β0 + β1X

Where P is the probability of having the outcome and P / (1 − P) is the odds of the outcome. The easiest way to interpret the intercept is when X = 0: When X = 0, the intercept β0 is the log of the …
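Since the model is written on the log-odds scale, an intercept can be converted back to a probability at X = 0 by exponentiating it to get the odds and then applying P = odds / (1 + odds). The sketch below assumes a made-up intercept value; it is not a result from the article.

```python
import math


def intercept_to_probability(beta0: float) -> float:
    """Probability of the outcome when X = 0, given the intercept beta0."""
    odds = math.exp(beta0)      # odds of the outcome at X = 0
    return odds / (1 + odds)    # convert odds back to a probability


beta0 = -1.5  # hypothetical intercept, for illustration only
print(f"P(outcome | X = 0) = {intercept_to_probability(beta0):.3f}")
```

An intercept of 0 corresponds to odds of 1, i.e. a probability of 0.5.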

## Posttest-Only Control Group Design: An Introduction

The posttest-only control group design is a basic experimental design where participants get randomly assigned to either receive an intervention or not, and then the outcome of interest is measured only once after the intervention takes place in order to determine its effect. The intervention can be: a medical treatment, a training program, an exposure …

## Length of a Conclusion Section: Analysis of 47,810 Examples

I analyzed a random sample of 47,810 conclusion sections found in 98,778 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the questions: When to include a conclusion section in a research paper? and how long should it be? I used the BioC API to download the …

## How Old Should Your Article References Be? Based on 3,823,919 Examples

I analyzed 3,823,919 references cited in 96,685 research papers, chosen at random from those uploaded to PubMed Central between the years 2016 and 2021, in order to answer the question: How to determine if a reference is too old to be included in a research article? I used the BioC API to download the data …

## How Many References Should a Research Paper Have? Study of 96,685 Articles

I analyzed a random sample of 96,685 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to answer the question: How many references should you cite when writing a research article? I used the BioC API to download the data (see the References section below). Here’s a summary of …

## Case Report vs Case-Control Study: A Simple Explanation

A case report is the description of the clinical story of a single patient, whereas a case-control study compares 2 groups of participants differing in outcome in order to determine if a suspected exposure in their past caused that difference.

| | Case Report | Case-Control Study |
|---|---|---|
| Participants involved | A case report describes the medical case of 1 … | |

## Case Report vs Cross-Sectional Study: A Simple Explanation

A case report is the description of the clinical story of a single patient. A cross-sectional study involves a group of participants on which data is collected at a single point in time to investigate the relationship between a certain exposure and an outcome. Here’s a table that summarizes the relationship between a case report …

## Programming Languages Popularity in 12,086 Research Papers

I analyzed a random sample of 76,147 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to check the popularity of programming languages among medical researchers. I used the BioC API to download the articles (see the References section below) of which only 12,086 mentioned the use of at …

## Statistical Software Popularity in 40,582 Research Papers

I analyzed a random sample of 76,147 full-text research papers, uploaded to PubMed Central between the years 2016 and 2021, in order to check the popularity of statistical software among medical researchers. (I used the BioC API to download the articles — see the References section below). Out of these 76,147 research papers, only 40,582 …

## Static-Group Comparison Design: An Introduction

The static-group comparison design is a quasi-experimental design in which the outcome of interest is measured only once, after exposing a non-random group of participants to a treatment, and compared to a control group. The objective is to evaluate the effect of this treatment (or intervention), which can be: a medical treatment, a training program …

## Interpret Logistic Regression Coefficients [For Beginners]

The logistic regression coefficient β associated with a predictor X is the expected change in log odds of having the outcome per unit change in X. So increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by e^β. Here’s an example: Suppose we …
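The multiplicative effect on the odds can be sketched in a few lines. The coefficient and the baseline probability below are hypothetical values chosen for illustration, and the helper names are assumptions rather than anything defined in the article.

```python
import math


def odds_ratio(beta: float) -> float:
    """Factor by which the odds are multiplied per 1-unit increase in X."""
    return math.exp(beta)


def updated_probability(p: float, beta: float, delta_x: float = 1.0) -> float:
    """Probability of the outcome after increasing X by delta_x, starting from p."""
    new_odds = p / (1 - p) * math.exp(beta * delta_x)
    return new_odds / (1 + new_odds)


beta = 0.4       # hypothetical logistic regression coefficient
baseline = 0.20  # hypothetical probability of the outcome before the change
print(f"Odds ratio: {odds_ratio(beta):.3f}")
print(f"Probability after +1 unit in X: {updated_probability(baseline, beta):.3f}")
```

Note that the odds are multiplied by a constant factor, but the change in probability depends on the baseline.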

## When to Use Regression Analysis (With Examples)

Regression analysis can be used to: In the text below, we will go through these points in greater detail and provide a real-world example of each. 1. Estimate the effect of an exposure on a given outcome Regression can model linear and non-linear associations between an exposure (or treatment) and an outcome of interest. It …

## Deviance in the Context of Logistic Regression

Deviance is a number that measures the goodness of fit of a logistic regression model. Think of it as the distance from the perfect fit — a measure of how much your logistic regression model deviates from an ideal model that perfectly fits the data. Deviance ranges from 0 to infinity. The smaller the number …
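For individual (ungrouped) binary outcomes, the deviance is commonly computed as −2 times the log-likelihood of the fitted model, because a model that fits such data perfectly has a log-likelihood of 0. The sketch below assumes that setting; the outcomes and predicted probabilities are invented for illustration.

```python
import math


def binary_deviance(y, p):
    """Deviance for 0/1 outcomes y and predicted probabilities p (0 < p < 1)."""
    log_likelihood = sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
        for yi, pi in zip(y, p)
    )
    return -2 * log_likelihood


y = [1, 0, 1, 1, 0]                  # hypothetical observed outcomes
good_fit = [0.9, 0.1, 0.8, 0.9, 0.2]  # predictions close to the outcomes
poor_fit = [0.5, 0.5, 0.5, 0.5, 0.5]  # uninformative predictions
print(f"Deviance (good fit): {binary_deviance(y, good_fit):.3f}")
print(f"Deviance (poor fit): {binary_deviance(y, poor_fit):.3f}")
```

Predictions closer to the observed outcomes produce a smaller deviance, in line with the "distance from the perfect fit" description.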

## Detection Bias vs Performance Bias

Detection bias refers to systematic differences between groups of a study in how the outcome is assessed, while performance bias is introduced by unequal care between groups and has nothing to do with how the outcome is assessed. In other words, detection bias occurs when the patient’s characteristics influence the probability of detecting the outcome …

## Why and When to Include Interactions in a Regression Model

In a regression model, consider including the interaction between 2 variables when:

- They have large main effects.
- The effect of one changes for various subgroups of the other.
- The interaction has been proven in previous studies.
- You want to explore new hypotheses.

Below we will explore each of these points in detail, but first let’s …

## One-Group Posttest Only Design: An Introduction

The one-group posttest-only design (a.k.a. one-shot case study) is a type of quasi-experiment in which the outcome of interest is measured only once after exposing a non-random group of participants to a certain intervention. The objective is to evaluate the effect of that intervention, which can be: a training program, a policy change, a medical …

## Understand Quasi-Experimental Design Through an Example

Suppose you developed a mobile application whose aim is to help diabetic patients control their blood glucose by providing them information and practical tips on how to behave in different situations. So you decided to design a study to figure out if this app does in fact help these patients control their blood glucose. Here’s …

## How to Identify Different Types of Cohort Studies

The most important characteristics that you should look for to identify a cohort are the following:

- It is an observational study (the investigator is an observer and does not intervene).
- It follows participants over time (several months, or even years).
- It compares the incidence of the outcome (i.e. the number of participants who developed that …

## Understand Regularized Regression

Regularized regression is a regression method with an additional constraint designed to deal with a large number of independent variables (a.k.a. predictors). It does so by imposing a larger penalty on unimportant ones, thus shrinking their coefficients towards zero. The objective of regularization is to end up with a model: Other methods that also deal …

## Cohort vs Cross-Sectional Study: Similarities and Differences

In a cohort study, the researcher selects a group of exposed and another group of unexposed individuals and follows them over time to determine whether or not a particular outcome of interest will occur. The objective is to find out which group is more likely to develop the outcome (e.g. disease) by comparing its incidence (i.e. …

## Understand Best Subset Selection

When building a regression model, removing irrelevant variables will make the model easier to interpret and less prone to overfitting the data, and therefore more generalizable. Best subset selection is a method that aims to find the subset of independent variables (Xi) that best predict the outcome (Y) and it does so by considering all possible …
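To give a feel for the size of the search, here is a small sketch that enumerates every subset of a list of predictors: with p predictors there are 2^p candidate subsets, including the empty one. The predictor names are hypothetical, and the sketch only lists the subsets; fitting and comparing a model for each one is not shown.

```python
from itertools import combinations


def all_subsets(predictors):
    """Yield every subset of the predictors, from the empty set to the full set."""
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


predictors = ["age", "sex", "bmi"]  # hypothetical variable names
subsets = list(all_subsets(predictors))
print(f"{len(predictors)} predictors -> {len(subsets)} candidate subsets")
for subset in subsets:
    print(subset)
```

Because the number of subsets doubles with each added predictor, an exhaustive search of this kind quickly becomes expensive.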

## Square Root Transformation: A Beginner’s Guide

A square root transformation can be useful for:

- Normalizing a skewed distribution
- Transforming a non-linear relationship between 2 variables into a linear one
- Reducing heteroscedasticity of the residuals in linear regression
- Focusing on visualizing certain parts of your data

Below we will discuss each of these points in detail. When you apply a square root …

## Randomized Block Design: An Introduction

A randomized block design is a type of experiment where participants who share certain characteristics are grouped together to form blocks, and then the treatment (or intervention) gets randomly assigned within each block. The objective of the randomized block design is to form groups where participants are similar, and therefore can be compared with each …
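The assignment step described above can be sketched as follows: group participants by the blocking characteristic, then shuffle and split each block separately. This is one possible implementation under assumptions (2 arms, blocks split in half, a fixed seed for reproducibility); the participant data and the function name are hypothetical.

```python
import random
from collections import defaultdict


def randomize_within_blocks(participants, block_of, seed=0):
    """Randomly assign half of each block to treatment and the rest to control."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for person in participants:
        blocks[block_of(person)].append(person)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # with an odd block size, control gets the extra one
        for person in shuffled[:half]:
            assignment[person] = "treatment"
        for person in shuffled[half:]:
            assignment[person] = "control"
    return assignment


# Hypothetical participants as (id, sex) pairs, blocked on sex
participants = [("p1", "F"), ("p2", "F"), ("p3", "F"), ("p4", "F"),
                ("p5", "M"), ("p6", "M"), ("p7", "M"), ("p8", "M")]
for person, arm in randomize_within_blocks(participants, lambda p: p[1]).items():
    print(person, arm)
```

Randomizing inside each block guarantees that the treatment and control groups are balanced on the blocking characteristic.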

## Matched Pairs Design: An Introduction

A matched pairs design is an experimental design where participants having the same characteristics get grouped into pairs, then within each pair, 1 participant gets randomly assigned to either the treatment or the control group and the other is automatically assigned to the other group. In other words, if we take each pair alone, the …

## Experimental vs Quasi-Experimental Design: Which to Choose?

Here’s a table that summarizes the similarities and differences between an experimental and a quasi-experimental study design:

| | Experimental Study (a.k.a. Randomized Controlled Trial) | Quasi-Experimental Study |
|---|---|---|
| Objective | Evaluate the effect of an intervention or a treatment | Evaluate the effect of an intervention or a treatment |
| How participants get assigned to groups? | Random assignment | Non-random assignment |

…

## What is an Acceptable Value for VIF? (With References)

Most research papers consider a VIF (Variance Inflation Factor) > 10 as an indicator of multicollinearity, but some choose a more conservative threshold of 5 or even 2.5. So what threshold should YOU choose? When choosing a VIF threshold, you should take into account that multicollinearity is a lesser problem when dealing with a large …
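To relate these thresholds to something more familiar: the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors in the model. The sketch below only applies this formula to a few illustrative R² values; it does not fit any regression.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """VIF of a predictor, given the R-squared of regressing it on the other predictors."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return 1 / (1 - r_squared)


# Illustrative R-squared values (not from any real dataset)
for r2 in (0.0, 0.6, 0.8, 0.9):
    print(f"R-squared = {r2:.1f} -> VIF = {vif_from_r_squared(r2):.2f}")
```

Read this way, the thresholds of 2.5, 5 and 10 correspond to roughly 60%, 80% and 90% of a predictor's variance being explained by the other predictors.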

## Coefficient of Alienation, Non-determination and Tolerance

When running a linear regression model:

Y = β0 + β1 × X1 + β2 × X2 + ε

One way of determining if the independent variables X1 and X2 were useful in predicting Y is to calculate the coefficient of determination R². R² measures the proportion of variability in Y that can be explained …

## Correlation vs Collinearity vs Multicollinearity

Here’s a table that summarizes the differences between correlation, collinearity and multicollinearity:

| | Correlation | Collinearity | Multicollinearity |
|---|---|---|---|
| Definition | Correlation refers to the linear relationship between 2 variables | Collinearity refers to a problem when running a regression model where 2 or more independent variables (a.k.a. predictors) have a strong linear relationship | Multicollinearity is a special case of … |

## Standardized vs Unstandardized Regression Coefficients

Here’s a table that summarizes the similarities and differences between standardized and unstandardized linear regression coefficients:

| | Unstandardized β | Standardized β |
|---|---|---|
| Definition | Unstandardized coefficients are obtained after running a regression model on variables measured in their original scales | Standardized coefficients are obtained after running a regression model on standardized variables (i.e. rescaled variables that have … |

## Relationship Between r and R-squared in Linear Regression

R-squared is a measure of how well a linear regression model fits the data. It can be interpreted as the proportion of variance of the outcome Y explained by the linear regression model. It is a number between 0 and 1 (0 ≤ R² ≤ 1). The closer its value is to 1, the more …
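For a simple linear regression (1 predictor), R-squared equals the square of the Pearson correlation coefficient r between X and Y. The sketch below checks this numerically on a small made-up dataset, computing both quantities from scratch; the data and function names are assumptions for illustration.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def r_squared_simple(x, y):
    """R-squared of the least-squares line of y on x: 1 - SSE / SST."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst


x = [1, 2, 3, 4, 5]              # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical outcome values
print(f"r squared = {pearson_r(x, y) ** 2:.6f}")
print(f"R-squared = {r_squared_simple(x, y):.6f}")
```

This equality holds for a single predictor; with several predictors, R-squared is instead the squared correlation between the observed and fitted values of Y.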

## Understand the F-statistic in Linear Regression

When running a multiple linear regression model:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + … + ε

The F-statistic provides us with a way of globally testing whether ANY of the independent variables X1, X2, X3, X4… is related to the outcome Y. For a significance level of 0.05: If …
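One common way to obtain the overall F-statistic of a linear model with an intercept is from its R-squared: F = (R² / k) / ((1 − R²) / (n − k − 1)), where k is the number of predictors and n the sample size. The values below are hypothetical; this is a sketch of the formula, not output from a fitted model.

```python
def f_statistic(r_squared: float, n: int, k: int) -> float:
    """Overall F-statistic from R-squared, sample size n and number of predictors k."""
    explained = r_squared / k                   # explained variation per predictor
    unexplained = (1 - r_squared) / (n - k - 1)  # residual variation per degree of freedom
    return explained / unexplained


# Hypothetical model: 4 predictors, 100 observations, R-squared of 0.30
f = f_statistic(r_squared=0.30, n=100, k=4)
print(f"F = {f:.2f} on 4 and 95 degrees of freedom")
```

The statistic is then compared to an F distribution with k and n − k − 1 degrees of freedom to obtain the global p-value.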

## Residual Standard Deviation/Error: Guide for Beginners

The residual standard deviation (or residual standard error) is a measure used to assess how well a linear regression model fits the data. (The other measure to assess this goodness of fit is R²). But before we discuss the residual standard deviation, let’s try to assess the goodness of fit graphically. Consider the following linear …
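As a rough numerical companion to the description above, the residual standard error is usually computed as the square root of the sum of squared residuals divided by the residual degrees of freedom (n minus the number of predictors minus 1 for the intercept). The observed and fitted values below are invented for illustration.

```python
import math


def residual_standard_error(y, fitted, n_predictors):
    """Square root of SSE / (n - n_predictors - 1) for a model with an intercept."""
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    df = len(y) - n_predictors - 1
    return math.sqrt(sse / df)


y = [1.0, 2.0, 3.0, 4.0]        # hypothetical observed outcomes
fitted = [1.5, 1.5, 3.5, 3.5]   # hypothetical fitted values from a 1-predictor model
rse = residual_standard_error(y, fitted, n_predictors=1)
print(f"Residual standard error: {rse:.3f}")
```

The result is expressed in the units of Y, so it can be read as the typical distance between the observed values and the regression line.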

## 12 Famous Epidemiologists and Why

“In science, credit goes to the man who convinces the world, not to whom the idea first occurs.” Francis Darwin

Epidemiology certainly has many more contributors than can be described in a single article. So this will be a list of 12 of the most famous epidemiologists who have greatly influenced the field. Note that …

## 7 Tricks to Get Statistically Significant p-Values

The objective of this article is to prove that getting a p-value below the threshold of 0.05 is not that hard, and that a statistically significant result proves nothing by itself. Study results should always be interpreted in the context of: The study design The effect size The size of the sample The results of …

## P-Value: A Simple Explanation for Non-Statisticians

A p-value is a probability, a number between 0 and 1, calculated after running a statistical test on data. A small p-value (< 0.05 in general) means that the observed results would be very unusual if they were due to chance alone. It is a way of telling if the results obtained should be taken …

## Objectives of Epidemiology (With Real-World Examples)

Epidemiology is the study of health issues at the population level which can provide information not available at the individual level. The ultimate goal of epidemiology is to improve health — lower the risk of death and increase the quality of life — by refining preventive measures and treatments of diseases. The objectives of epidemiology …

## Neyman’s [Prevalence-Incidence] Bias: A Simple Explanation

Neyman’s bias, also known as prevalence-incidence bias, occurs when studying the relationship between an exposure and an outcome using prevalence of the outcome instead of incidence in cases where prevalence is a biased estimator of incidence.

Reminder: Prevalence is the proportion of individuals who have the outcome/disease at a given time. Incidence (or risk) is the number …

## Protopathic Bias: Simple Explanation + Examples

Protopathic bias occurs when an exposure is initiated (or stopped) in response to a symptom of the disease (outcome) which is not yet diagnosed. This leads to a false conclusion on the causal relationship between exposure and outcome. This bias is especially known in pharmacoepidemiological studies where: The exposure is the prescription of a medication …

## Proxy Bias: Simple Explanation + Example

Proxy bias occurs when the proxy variable used is systematically different from the variable of interest. A proxy (or surrogate) variable is a variable related closely enough to the variable of interest to be used as its substitute. But why use a proxy in the first place? One reason could be that the variable of interest …

## Exposure Suspicion Bias: Simple Explanation + Example

Exposure suspicion bias occurs when the knowledge of the subject’s disease status influences the search for the exposure to the cause. For instance, when subjects who have the disease undergo a more rigorous search for the cause than those who do not have the disease, leading to an overestimation of the relationship between the risk …

## Temporal Bias in Research

Temporal bias occurs when we assume a wrong sequence of events which misleads our reasoning about causality. It mostly affects study designs where participants are not followed over time. The most common study designs that are subject to temporal bias are:

- Cross-sectional studies: Because information is collected at a single moment in time
- Case-control studies: …

## Cohort vs Randomized Controlled Trials: A Simple Explanation

A randomized controlled trial (RCT) is an experiment controlled by the researcher. A cohort study is an observational study where the researcher observes the events and does not control them. In short:

- If you want to prove a causal relationship between a treatment and an outcome, use a randomized controlled trial.
- If randomization is not …

## Performance Bias in Medical Research

Performance bias occurs when there is unequal care between study groups. This can happen in 2 scenarios:

- If researchers provided, intentionally or unintentionally, unequal treatment/care to different groups in the study
- If patients in different groups behaved differently

Performance bias affects the study validity since the observed outcome can now be attributed either: To the …

## Risk vs Rate: What’s the Difference?

Here’s a table that summarizes the similarities and differences between risk and rate: (Note that the text below contains all the necessary details to understand this table)

| | Risk | Rate |
|---|---|---|
| Definition | Proportion of individuals who developed the disease over a specified period of time (the follow-up period) | Proportion of individuals who developed the disease over … |

## Length Time Bias: Simple Explanation + Example

Length time bias occurs when cases who were detected earlier by SCREENING seem to have survived longer than cases DIAGNOSED after symptoms appear just because screening tests tend to identify less aggressive cases of the disease more often than aggressive ones. When screening a population, we can imagine that the slower-developing cases of a disease …

## Case Report: A Beginner’s Guide with Examples

A case report is a descriptive study that documents an unusual clinical phenomenon in a single patient. It describes in detail the patient’s history, signs, symptoms, test results, diagnosis, prognosis and treatment. It also contains a short literature review, and discusses the importance of the case and how it improves the existing knowledge on the subject. …

## Which Variables Should You Include in a Regression Model?

When building a linear or logistic regression model, you should consider including: However, you should watch out for: Below we discuss each of these points in detail. 1. Selecting variables based on background knowledge Advantages of using background knowledge to select variables How to choose variables based on background knowledge? You can find out whether …

## Understand Forward and Backward Stepwise Regression

Running a regression model with many variables, including irrelevant ones, will lead to a needlessly complex model. Stepwise regression is a way of selecting important variables to get a simple and easily interpretable model. Below we discuss how forward and backward stepwise selection work, their advantages and limitations, and how to deal with them. Forward …

## Lead Time Bias: Simple Explanation + Example

Lead time bias occurs when cases who were detected by screening seem to have survived longer than diagnosed cases just because the disease was detected earlier, not because death was delayed. For example: Consider the following 2 scenarios of a patient who has suffered from dementia since the age of 65: Scenario 1: The patient was not diagnosed …

## Prevalence: Simple Explanation + Examples

Prevalence is the proportion of individuals who have the disease at a given time. It is used to quantify the burden of disease in a population. Understanding what is going on in society at a certain point in time can help us plan a policy change and create the right health service. How to calculate …
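Following the definition above, prevalence can be computed by dividing the number of individuals who have the disease at a given time by the size of the population at that time. The numbers below are hypothetical and serve only to illustrate the calculation.

```python
def prevalence(existing_cases: int, population_size: int) -> float:
    """Proportion of the population that has the disease at a given time."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    return existing_cases / population_size


cases = 150          # hypothetical number of people with the disease today
population = 10_000  # hypothetical population size today
p = prevalence(cases, population)
print(f"Prevalence: {p:.3f} ({p * 100:.1f}%)")
```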

## Risk Difference, Relative Risk and Odds Ratio

Throughout this article we will use the following example: Suppose we conducted a study and found out that moderate consumers of red wine have a 10-year risk of heart disease of 0.9%, and non-consumers have a risk of 1.2%. Our objective is to find out whether red wine is good for the heart or not. So …
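Using the two risks from the example above (0.9% for moderate consumers and 1.2% for non-consumers), the three measures in the title can be computed with their standard definitions. This is a sketch of the arithmetic only; the variable names are assumptions, and it says nothing about whether the association is causal.

```python
def odds(p: float) -> float:
    """Convert a risk (probability) into odds."""
    return p / (1 - p)


risk_consumers = 0.009      # 10-year risk of heart disease, moderate consumers
risk_non_consumers = 0.012  # 10-year risk of heart disease, non-consumers

risk_difference = risk_consumers - risk_non_consumers
relative_risk = risk_consumers / risk_non_consumers
odds_ratio = odds(risk_consumers) / odds(risk_non_consumers)

print(f"Risk difference: {risk_difference * 100:.1f} percentage points")
print(f"Relative risk:   {relative_risk:.2f}")
print(f"Odds ratio:      {odds_ratio:.3f}")
```

Because both risks are small, the odds ratio ends up very close to the relative risk.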