# George Choueiry

I am George Choueiry, PharmD, MPH. My objective is to help you conduct studies, from conception to publication.

## Static-Group Comparison Design: An Introduction

The static-group comparison design is a quasi-experimental design in which the outcome of interest is measured only once, after exposing a non-random group of participants to a treatment, and compared to a control group. The objective is to evaluate the effect of this treatment (or intervention), which can be:

- a medical treatment
- a training program

…

## Interpret Logistic Regression Coefficients [For Beginners]

The logistic regression coefficient β associated with a predictor X is the expected change in log odds of having the outcome per unit change in X. So increasing the predictor by 1 unit (or going from 1 level to the next) multiplies the odds of having the outcome by e^β. Here’s an example: Suppose we …
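As a minimal sketch of this interpretation: exponentiating a logistic regression coefficient gives the factor by which the odds are multiplied for a 1-unit increase in the predictor. The coefficient value 0.7 and the function name `odds_ratio` are made up for illustration and do not come from the article.

```python
import math

def odds_ratio(beta: float, delta: float = 1.0) -> float:
    """Factor by which the odds are multiplied when X increases by `delta` units."""
    return math.exp(beta * delta)

beta = 0.7  # hypothetical coefficient, for illustration only

print(round(odds_ratio(beta), 2))     # odds multiplier for a 1-unit increase in X
print(round(odds_ratio(beta, 2), 2))  # odds multiplier for a 2-unit increase in X
```

A coefficient of 0 gives an odds ratio of 1 (no change in the odds), and a negative coefficient gives an odds ratio below 1.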

## When to Use Regression Analysis (With Examples)

Regression analysis can be used to: In the text below, we will go through these points in greater detail and provide a real-world example of each. 1. Estimate the effect of an exposure on a given outcome Regression can model linear and non-linear associations between an exposure (or treatment) and an outcome of interest. It …

## Deviance in the Context of Logistic Regression

Deviance is a number that measures the goodness of fit of a logistic regression model. Think of it as the distance from the perfect fit — a measure of how much your logistic regression model deviates from an ideal model that perfectly fits the data. Deviance ranges from 0 to infinity. The smaller the number …
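To make the "distance from the perfect fit" idea concrete, here is a small sketch that computes the deviance for ungrouped 0/1 outcomes. It assumes the standard definition for this case, where the deviance equals -2 times the model's log-likelihood. The outcomes and fitted probabilities are invented for illustration.

```python
import math

def binomial_deviance(y, p):
    """Deviance of a logistic model for 0/1 outcomes y and fitted probabilities p.

    For ungrouped binary data the saturated (perfect) model has a
    log-likelihood of 0, so the deviance is -2 times the model's log-likelihood.
    """
    log_lik = sum(
        math.log(pi) if yi == 1 else math.log(1 - pi)
        for yi, pi in zip(y, p)
    )
    return -2 * log_lik

y = [1, 0, 1, 1, 0]                     # made-up outcomes
good_fit = [0.9, 0.1, 0.8, 0.95, 0.2]   # made-up fitted probabilities, close to y
poor_fit = [0.5, 0.5, 0.5, 0.5, 0.5]    # a model that cannot tell the cases apart

# The model whose probabilities are closer to the outcomes has the smaller deviance.
print(binomial_deviance(y, good_fit) < binomial_deviance(y, poor_fit))
```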

## Detection Bias vs Performance Bias

Detection bias refers to systematic differences between groups of a study in how the outcome is assessed, while performance bias is introduced by unequal care between groups and has nothing to do with how the outcome is assessed. In other words, detection bias occurs when the patient’s characteristics influence the probability of detecting the outcome …

## Why and When to Include Interactions in a Regression Model

In a regression model, consider including the interaction between 2 variables when:

- They have large main effects.
- The effect of one changes for various subgroups of the other.
- The interaction has been proven in previous studies.
- You want to explore new hypotheses.

Below we will explore each of these points in detail, but first let’s …

## One-Group Posttest Only Design: An Introduction

The one-group posttest-only design (a.k.a. one-shot case study) is a type of quasi-experiment in which the outcome of interest is measured only once after exposing a non-random group of participants to a certain intervention. The objective is to evaluate the effect of that intervention, which can be:

- A training program
- A policy change
- A medical …

## Understand Quasi-Experimental Design Through an Example

Suppose you developed a mobile application whose aim is to help diabetic patients control their blood glucose by providing them with information and practical tips on how to behave in different situations. So you decided to design a study to figure out if this app does in fact help these patients control their blood glucose. Here’s …

## How to Identify Different Types of Cohort Studies

The most important characteristics that you should look for to identify a cohort are the following:

- It is an observational study (the investigator is an observer and does not intervene)
- It follows participants over time (several months, or even years)
- It compares the incidence of the outcome (i.e. the number of participants who developed that …

## Understand Regularized Regression

Regularized regression is a regression method with an additional constraint designed to deal with a large number of independent variables (a.k.a. predictors). It does so by imposing a larger penalty on unimportant ones, thus shrinking their coefficients towards zero. The objective of regularization is to end up with a model: Other methods that also deal …

## Cohort vs Cross-Sectional Study: Similarities and Differences

In a cohort study, the researcher selects a group of exposed and another group of unexposed individuals and follows them over time to determine whether or not a particular outcome of interest will occur. The objective is to find out which group is more likely to develop the outcome (e.g. disease) by comparing its incidence (i.e. …

## Understand Best Subset Selection

When building a regression model, removing irrelevant variables will make the model easier to interpret and less prone to overfitting the data, and therefore more generalizable. Best subset selection is a method that aims to find the subset of independent variables (Xᵢ) that best predict the outcome (Y) and it does so by considering all possible …
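The sketch below only enumerates the candidate models that an exhaustive search would have to compare. It does not fit or score them. The variable names are hypothetical. It illustrates why the number of candidates doubles with every additional predictor.

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every subset of predictors, from the empty model to the full model."""
    for k in range(len(predictors) + 1):
        for subset in combinations(predictors, k):
            yield subset

predictors = ["age", "sex", "bmi"]  # hypothetical variable names
candidates = list(all_subsets(predictors))

for subset in candidates:
    print(subset)

# With p predictors there are 2 ** p candidate models to compare.
print(len(candidates) == 2 ** len(predictors))
```

In practice, each candidate would be fitted and scored with a criterion of your choice, and the best-scoring subset kept.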

## Square Root Transformation: A Beginner’s Guide

A square root transformation can be useful for:

- Normalizing a skewed distribution
- Transforming a non-linear relationship between 2 variables into a linear one
- Reducing heteroscedasticity of the residuals in linear regression
- Focusing on visualizing certain parts of your data

Below we will discuss each of these points in detail. When you apply a square root …

## Randomized Block Design: An Introduction

A randomized block design is a type of experiment where participants who share certain characteristics are grouped together to form blocks, and then the treatment (or intervention) gets randomly assigned within each block. The objective of the randomized block design is to form groups where participants are similar, and therefore can be compared with each …
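Here is one possible way to sketch the assignment step in code: participants are first grouped into blocks, then shuffled and split within each block. The participant ids, the blocking variable (sex) and the function name `block_randomize` are all hypothetical. Splitting each block into two halves is just one simple allocation scheme.

```python
import random
from collections import defaultdict

def block_randomize(participants, block_of, seed=None):
    """Randomly assign 'treatment' or 'control' within each block.

    participants: list of participant ids
    block_of: dict mapping participant id -> block label
    """
    rng = random.Random(seed)

    # Group participants who share the same characteristic into blocks.
    blocks = defaultdict(list)
    for pid in participants:
        blocks[block_of[pid]].append(pid)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # random order within the block
        half = len(members) // 2
        for i, pid in enumerate(members):
            # The first half of the shuffled block gets the treatment.
            assignment[pid] = "treatment" if i < half else "control"
    return assignment

# Hypothetical participants, blocked by sex
block_of = {"p1": "F", "p2": "F", "p3": "F", "p4": "F", "p5": "M", "p6": "M"}
print(block_randomize(list(block_of), block_of, seed=1))
```

Because the randomization happens inside each block, every block contributes equally to the treatment and control groups (up to one participant when a block has an odd size).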

## Matched Pairs Design: An Introduction

A matched pairs design is an experimental design where participants having the same characteristics get grouped into pairs, then within each pair, 1 participant gets randomly assigned to either the treatment or the control group and the other is automatically assigned to the other group. In other words, if we take each pair alone, the …

## Experimental vs Quasi-Experimental Design: Which to Choose?

Here’s a table that summarizes the similarities and differences between an experimental and a quasi-experimental study design:

| | Experimental Study (a.k.a. Randomized Controlled Trial) | Quasi-Experimental Study |
|---|---|---|
| Objective | Evaluate the effect of an intervention or a treatment | Evaluate the effect of an intervention or a treatment |
| How do participants get assigned to groups? | Random assignment | Non-random assignment |

…

## What is an Acceptable Value for VIF? (With References)

Most research papers consider a VIF (Variance Inflation Factor) > 10 as an indicator of multicollinearity, but some choose a more conservative threshold of 5 or even 2.5. So what threshold should YOU choose? When choosing a VIF threshold, you should take into account that multicollinearity is a lesser problem when dealing with a large …
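To see what these thresholds mean, it can help to use the standard definition VIF = 1 / (1 − R²ⱼ), where R²ⱼ is the R-squared obtained by regressing predictor j on all the other predictors. The sketch below applies that formula. The function name is an assumption made for illustration.

```python
def vif_from_r2(r2_j: float) -> float:
    """VIF of predictor j, given the R-squared from regressing it on the other predictors."""
    if not 0 <= r2_j < 1:
        raise ValueError("r2_j must be in [0, 1)")
    return 1 / (1 - r2_j)

# R-squared values that correspond to the commonly cited thresholds of 2.5, 5 and 10
for r2 in (0.6, 0.8, 0.9):
    print(r2, round(vif_from_r2(r2), 2))
```

So a threshold of 10 tolerates a predictor that is 90% explained by the other predictors, while a threshold of 2.5 flags it once 60% is explained.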

## Coefficient of Alienation, Non-determination and Tolerance

When running a linear regression model: Y = β₀ + β₁ × X₁ + β₂ × X₂ + ε One way of determining if the independent variables X₁ and X₂ were useful in predicting Y is to calculate the coefficient of determination R². R² measures the proportion of variability in Y that can be explained …
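The three quantities in the title are all simple functions of an R-squared. The sketch below uses their usual textbook definitions, which are assumed here because the excerpt above is cut off before it states them. The value 0.64 is hypothetical.

```python
import math

def coefficient_of_non_determination(r2: float) -> float:
    """Proportion of the variability in Y left unexplained by the model: 1 - R^2."""
    return 1 - r2

def coefficient_of_alienation(r2: float) -> float:
    """Square root of the coefficient of non-determination: sqrt(1 - R^2)."""
    return math.sqrt(1 - r2)

def tolerance(r2_j: float) -> float:
    """1 - R_j^2, where R_j^2 comes from regressing predictor j on the other predictors."""
    return 1 - r2_j

r2 = 0.64  # hypothetical R-squared of the model
print(round(coefficient_of_non_determination(r2), 2))
print(round(coefficient_of_alienation(r2), 2))
```

Note that tolerance uses a different R-squared than the other two: it describes how much of one predictor is explained by the other predictors, not how well the model explains Y.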

## Correlation vs Collinearity vs Multicollinearity

Here’s a table that summarizes the differences between correlation, collinearity and multicollinearity:   Correlation Collinearity Multicollinearity Definition Correlation refers to the linear relationship between 2 variables Collinearity refers to a problem when running a regression model where 2 or more independent variables (a.k.a. predictors) have a strong linear relationship Multicollinearity is a special case of …

## Standardized vs Unstandardized Regression Coefficients

Here’s a table that summarizes the similarities and differences between standardized and unstandardized linear regression coefficients:   Unstandardized β Standardized β Definition Unstandardized coefficients are obtained after running a regression model on variables measured in their original scales Standardized coefficients are obtained after running a regression model on standardized variables (i.e. rescaled variables that have …

## Relationship Between r and R-squared in Linear Regression

R-squared is a measure of how well a linear regression model fits the data. It can be interpreted as the proportion of variance of the outcome Y explained by the linear regression model. It is a number between 0 and 1 (0 ≤ R² ≤ 1). The closer its value is to 1, the more …
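For a linear regression with a single predictor, R-squared equals the square of the Pearson correlation coefficient r between X and Y. The sketch below checks this numerically on made-up data, computing both quantities from scratch so that no particular library is assumed.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r_squared_simple(x, y):
    """R-squared of the least-squares line y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))  # unexplained variation
    ss_tot = sum((b - my) ** 2 for b in y)                        # total variation
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]               # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# The two numbers agree up to floating-point error.
print(abs(pearson_r(x, y) ** 2 - r_squared_simple(x, y)) < 1e-9)
```

This equality holds only for simple linear regression. With several predictors, R-squared is no longer the square of any single pairwise correlation.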

## Understand the F-statistic in Linear Regression

When running a multiple linear regression model: Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + … + ε The F-statistic provides us with a way for globally testing if ANY of the independent variables X₁, X₂, X₃, X₄… is related to the outcome Y. For a significance level of 0.05: If …
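One common way to write the overall F-statistic is in terms of R-squared: F = (R² / p) / ((1 − R²) / (n − p − 1)), where p is the number of predictors and n the number of observations. The sketch below implements that formula. The input values are hypothetical.

```python
def f_statistic(r2: float, n: int, p: int) -> float:
    """Overall F-statistic for a linear model with p predictors fitted on n observations."""
    explained_per_predictor = r2 / p
    unexplained_per_df = (1 - r2) / (n - p - 1)
    return explained_per_predictor / unexplained_per_df

# Hypothetical model: R-squared of 0.5 with 2 predictors and 23 observations
print(f_statistic(r2=0.5, n=23, p=2))
```

The resulting value is compared against an F distribution with p and n − p − 1 degrees of freedom to obtain the p-value of the global test.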

## Residual Standard Deviation/Error: Guide for Beginners

The residual standard deviation (or residual standard error) is a measure used to assess how well a linear regression model fits the data. (The other measure to assess this goodness of fit is R².) But before we discuss the residual standard deviation, let’s try to assess the goodness of fit graphically. Consider the following linear …
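As a rough sketch, the residual standard error can be computed as the square root of the residual sum of squares divided by the residual degrees of freedom (n − p − 1 for a model with an intercept and p predictors). The observed and predicted values below are made up.

```python
import math

def residual_standard_error(observed, predicted, n_predictors):
    """sqrt(RSS / (n - p - 1)) for a linear model with an intercept and p predictors."""
    residuals = [o - f for o, f in zip(observed, predicted)]
    rss = sum(r ** 2 for r in residuals)       # residual sum of squares
    df = len(residuals) - n_predictors - 1     # residual degrees of freedom
    return math.sqrt(rss / df)

observed = [3, 5, 8, 8]     # made-up outcome values
predicted = [2, 6, 6, 10]   # made-up fitted values from a 1-predictor model

print(round(residual_standard_error(observed, predicted, n_predictors=1), 3))
```

The result is expressed in the same units as the outcome, which is what makes it easier to interpret than R-squared in some settings.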

## 12 Famous Epidemiologists and Why

> In science, credit goes to the man who convinces the world, not to whom the idea first occurs.
>
> Francis Darwin

Epidemiology certainly has many more contributors than can be described in a single article. So this will be a list of 12 of the most famous epidemiologists who have greatly influenced the field. Note that …

## 7 Tricks to Get Statistically Significant p-Values

The objective of this article is to prove that getting a p-value below the threshold of 0.05 is not that hard, and that a statistically significant result proves nothing by itself. Study results should always be interpreted in the context of:

- The study design
- The effect size
- The size of the sample
- The results of …

## P-Value: A Simple Explanation for Non-Statisticians

A p-value is a probability, a number between 0 and 1, calculated after running a statistical test on data. A small p-value (< 0.05 in general) means that the observed results would be very unusual if they were due to chance alone. It is a way of telling if the results obtained should be taken …

## Objectives of Epidemiology (With Real-World Examples)

Epidemiology is the study of health issues at the population level which can provide information not available at the individual level. The ultimate goal of epidemiology is to improve health — lower the risk of death and increase the quality of life — by refining preventive measures and treatments of diseases. The objectives of epidemiology …

## Neyman’s [Prevalence-Incidence] Bias: A Simple Explanation

Neyman’s bias, also known as prevalence-incidence bias, occurs when studying the relationship between an exposure and an outcome using prevalence of the outcome instead of incidence in cases where prevalence is a biased estimator of incidence. Reminder: Prevalence is the proportion of individuals who have the outcome/disease at a given time. Incidence (or risk) is the number …

## Protopathic Bias: Simple Explanation + Examples

Protopathic bias occurs when an exposure is initiated (or stopped) in response to a symptom of the disease (outcome) which is not yet diagnosed. This leads to a false conclusion on the causal relationship between exposure and outcome. This bias is especially known in pharmacoepidemiological studies where: The exposure is the prescription of a medication …

## Proxy Bias: Simple Explanation + Example

Proxy bias occurs when the proxy variable used is systematically different from the variable of interest. A proxy (or surrogate) variable is a variable related closely enough to the variable of interest to be used as its substitute. But why use a proxy in the first place? One reason could be that the variable of interest …

## Exposure Suspicion Bias: Simple Explanation + Example

Exposure suspicion bias occurs when the knowledge of the subject’s disease status influences the search for the exposure to the cause. For instance, when subjects who have the disease undergo a more rigorous search for the cause than those who do not have the disease, leading to an overestimation of the relationship between the risk …

## Temporal Bias in Research

Temporal bias occurs when we assume a wrong sequence of events which misleads our reasoning about causality. It mostly affects study designs where participants are not followed over time. The most common study designs that are subject to temporal bias are:

- Cross-sectional studies: Because information is collected at a single moment in time
- Case-control studies: …

## Cohort vs Randomized Controlled Trials: A Simple Explanation

A randomized controlled trial (RCT) is an experiment controlled by the researcher. A cohort study is an observational study where the researcher observes the events and does not control them. In short, if you want to prove a causal relationship between a treatment and an outcome, use a randomized controlled trial. If randomization is not …

## Performance Bias in Medical Research

Performance bias occurs when there is unequal care between study groups. This can happen in 2 scenarios:

- If researchers provided, intentionally or unintentionally, unequal treatment/care to different groups in the study
- If patients in different groups behaved differently

Performance bias affects the study validity since the observed outcome can now be attributed either:

- To the …

## Risk vs Rate: What’s the Difference?

Here’s a table that summarizes the similarities and differences between risk and rate: (Note that the text below contains all the necessary details to understand this table)   Risk Rate Definition Proportion of individuals who developed the disease over a specified period of time (the follow-up period) Proportion of individuals who developed the disease over …
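The sketch below contrasts the two measures using their standard epidemiological definitions, which are assumed here because the table above is cut off: risk divides new cases by the number of people at risk at the start of follow-up, while rate divides new cases by the total person-time of follow-up. The cohort numbers are hypothetical.

```python
def risk(new_cases: int, at_risk_at_start: int) -> float:
    """Proportion of initially disease-free people who develop the disease during follow-up."""
    return new_cases / at_risk_at_start

def rate(new_cases: int, person_time: float) -> float:
    """New cases per unit of person-time (for example, per person-year)."""
    return new_cases / person_time

# Hypothetical cohort: 100 people, 5 new cases, 480 person-years of follow-up
print(f"risk = {risk(5, 100):.3f}")
print(f"rate = {rate(5, 480):.4f} cases per person-year")
```

Risk is unitless and depends on the length of follow-up, whereas rate carries a unit of time in its denominator.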

## Length Time Bias: Simple Explanation + Example

Length time bias occurs when cases who were detected earlier by SCREENING seem to have survived longer than cases DIAGNOSED after symptoms appear just because screening tests tend to identify less aggressive cases of the disease more often than aggressive ones. When screening a population, we can imagine that the slower-developing cases of a disease …

## Case Report: A Beginner’s Guide with Examples

A case report is a descriptive study that documents an unusual clinical phenomenon in a single patient. It describes in detail the patient’s history, signs, symptoms, test results, diagnosis, prognosis and treatment. It also contains a short literature review and discusses the importance of the case and how it improves the existing knowledge on the subject. …

## Which Variables Should You Include in a Regression Model?

When building a linear or logistic regression model, you should consider including: However, you should watch out for: Below we discuss each of these points in detail. 1. Selecting variables based on background knowledge Advantages of using background knowledge to select variables How to choose variables based on background knowledge? You can find out whether …

## Understand Forward and Backward Stepwise Regression

Running a regression model with many variables including irrelevant ones will lead to a needlessly complex model. Stepwise regression is a way of selecting important variables to get a simple and easily interpretable model. Below we discuss how forward and backward stepwise selection work, their advantages, and limitations and how to deal with them. Forward …

## Lead Time Bias: Simple Explanation + Example

Lead time bias occurs when cases who were detected by screening seem to have survived longer than diagnosed cases just because the disease was detected earlier, not because death was delayed. For example: Consider the following 2 scenarios of a patient who has suffered from dementia since the age of 65: Scenario 1: The patient was not diagnosed …

## Prevalence: Simple Explanation + Examples

Prevalence is the proportion of individuals who have the disease at a given time. It is used to quantify the burden of disease in a population. Understanding what is going on in society at a certain point in time can help us plan a policy change and create the right health service. How to calculate …
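Following the definition above, a minimal sketch of the calculation divides the number of people who have the disease at a given time by the size of the population at that time. The counts are invented for illustration.

```python
def prevalence(existing_cases: int, population: int) -> float:
    """Proportion of the population that has the disease at a given time."""
    return existing_cases / population

# Hypothetical survey: 150 people with the disease in a population of 10,000
print(f"{prevalence(150, 10_000):.1%}")
```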

## Risk Difference, Relative Risk and Odds Ratio

Throughout this article we will use the following example: Suppose we conducted a study and found out that moderate consumers of red wine have a 10-year risk of heart disease of 0.9%, and non-consumers have a risk of 1.2%. Our objective is to find out whether red wine is good for the heart or not. So …
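Using the two risks from the example above (0.9% for moderate consumers and 1.2% for non-consumers), the three measures can be sketched as follows. The function names are assumptions made for illustration, and the formulas are the standard ones for two risks.

```python
def risk_difference(risk_exposed: float, risk_unexposed: float) -> float:
    """Absolute difference in risk between the exposed and the unexposed."""
    return risk_exposed - risk_unexposed

def relative_risk(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of the risk in the exposed to the risk in the unexposed."""
    return risk_exposed / risk_unexposed

def odds_ratio(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of the odds of disease in the exposed to the odds in the unexposed."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed

wine = 0.009     # 10-year risk of heart disease in moderate red wine consumers
no_wine = 0.012  # 10-year risk of heart disease in non-consumers

print(f"risk difference = {risk_difference(wine, no_wine):.4f}")
print(f"relative risk   = {relative_risk(wine, no_wine):.3f}")
print(f"odds ratio      = {odds_ratio(wine, no_wine):.3f}")
```

Because the outcome is rare in both groups, the odds ratio comes out very close to the relative risk.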