R Tutorials

Create and Plot Graphs from data.frame: Intro to igraph in R

A graph consists of points — called nodes or vertices — connected by line segments — called edges. The only library we will need for this tutorial is igraph: 1. Directed graphs A directed graph, where the edges indicate a one-way relationship between vertices, can be created from a data.frame that simply defines the edges: …

Join Dataframes in R: Left/Right/Inner/Full Joins

Joining 2 dataframes involves linking the rows in one to the rows in the other. This can be done in different ways using the dplyr functions: left_join(), right_join(), inner_join(), and full_join(). Here’s an illustration of the differences between them: In order to demonstrate how these functions work, let’s first create some data: 1. Left join …
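As a rough illustration of how the four joins differ, here is a minimal base R sketch. It swaps in merge() for the dplyr functions named above, and the data frames x and y (and their columns) are hypothetical:

```r
# Hypothetical data: both data frames share the key column "id"
x <- data.frame(id = c(1, 2, 3), a = c("a1", "a2", "a3"))
y <- data.frame(id = c(2, 3, 4), b = c("b2", "b3", "b4"))

left  <- merge(x, y, by = "id", all.x = TRUE)  # keeps every row of x
right <- merge(x, y, by = "id", all.y = TRUE)  # keeps every row of y
inner <- merge(x, y, by = "id")                # keeps ids present in both
full  <- merge(x, y, by = "id", all = TRUE)    # keeps ids present in either

nrow(left); nrow(right); nrow(inner); nrow(full)
```

Rows without a match on the other side are filled with NA.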

Extract Multiple Occurrences of a Pattern in a String in R

Our goal is to extract all p-values reported in these abstracts: Notice that most reported p-values follow this 3-part pattern: In regular expressions, this pattern can be written as: Then we can call str_extract_all() to extract all text that follows this pattern: The new column is a list which we can unnest(): The next step is …

Extract p-values from Text in R: Using separate_wider_regex

Our goal is to extract p-values from article abstracts to create 2 new variables: sign and p-value: We will break the problem into 2 simple steps: But first, let’s load the data: Step 1: determine which abstracts report a p-value Here, we can use the function str_detect() which takes a string and a pattern, and …

Extract Numbers from Strings in R

The functions parse_integer(), parse_double(), and parse_number() from the readr library transform a character vector into a numeric vector. Here’s an example that compares these 3 functions: Exercises 1. Extract the number 1000000 from “1 000 000” Not all characters in this string can be transformed into an integer (since we have white spaces), so we …

In this article, we will use R to: 1. Get PMID numbers of relevant articles Let’s say we are interested in analyzing articles about lung cancer, published in PLOS Medicine, in the year 2022. Here’s the PubMed link that contains PMID (PubMed ID) numbers for these articles: If you visit this URL, you see an …

Plot Median and Interquartile Range in R

In this tutorial, we are going to create the following plot of the median and the interquartile range of sepal length for each iris species using the iris dataset: 1. Using geom_pointrange() We start by calculating the median, the 1st quartile, and the 3rd quartile as follows: Then we give these variables to geom_pointrange() in …
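The summary step can be sketched in base R as follows. This is only one possible way to compute the three statistics (the tutorial's own code is not shown here), and the object name q is hypothetical:

```r
# 1st quartile, median, and 3rd quartile of sepal length for each species
q <- sapply(
  split(iris$Sepal.Length, iris$Species),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
q  # rows: 25%, 50%, 75%; columns: one per species
```

Each column of q holds the lower bound, the point, and the upper bound that a point-range plot needs for one species.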

Using pivot_longer with names_sep and names_pattern in R

In this article, we will explain how to use the arguments names_sep and names_pattern of the function pivot_longer() from the tidyr package. First, let’s create some data: The general syntax of pivot_longer() is: 1. Using the names_sep argument in pivot_longer() When calling pivot_longer(), we can create 2 new columns (instead of 1) by splitting the …

Convert Columns to Rows in R

In this article, we will show 2 ways to convert columns into rows in R using the following data: 1. Using matrix transpose The function t() takes the data frame df and returns a matrix where columns and rows are switched: But we have 3 problems with this function: So let’s clean this output a …
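A small sketch of what t() returns, assuming a hypothetical two-column data frame (not the data used in the article):

```r
# Hypothetical data frame with 2 rows and 2 columns
df <- data.frame(x = 1:2, y = c(10, 20))

m <- t(df)    # columns become rows
is.matrix(m)  # the result is a matrix, not a data frame
rownames(m)   # the former column names
```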

How to Summarize Data in R (Using Dplyr)

In this article, we will cover how to apply the function summarize() from the dplyr package using the following data: 1. Summarizing a variable Use the following code to calculate the average age in our dataset: The summarize() function can take many arguments: 2. Grouping and summarizing Use the following code to calculate the average …

How to Deal with Violation of Normality of Errors in R

Linear regression assumes that error terms are normally distributed. This is especially important when we are using linear regression for prediction purposes and our sample size is small (see: Understand Linear Regression Assumptions). When the normality of errors assumption is violated, try: Let’s create some data to demonstrate these methods: Output: So we see that …

How to Deal with Heteroscedasticity in Regression in R

Linear regression assumes that the dispersion of data points around the regression line is constant. We can deal with violation of this assumption (i.e. with heteroscedasticity) by: Let’s create some heteroscedastic data to demonstrate these methods: Output: The residuals vs fitted values plot shows a fan shape, which is evidence of heteroscedasticity. (For more information, …

How to Deal with Violation of the Linearity Assumption in R

The most important assumption of linear regression is that the relationship between each predictor and the outcome is linear. When the linearity assumption is violated, try: Let’s create some non-linear data to demonstrate these methods: The residuals vs fitted values plot shows a curved relationship, so the linearity assumption is violated. Solution #1: Adding a quadratic …

How to Check Linear Regression Assumptions in R

Linear regression has 4 assumptions: 1. How to check linearity Instead of checking the relationship between each predictor and the outcome in a multivariable model, we can plot the residuals versus the fitted values. The plot should show no discernible pattern: The output will look like one of the following: In this plot, R draws …

Linear Regression in R (with a Categorical Variable)

In this article, we will run and interpret a linear regression model where the predictor is a categorical variable with multiple levels. Loading the Data We will use the chickwts dataset available in R. (These data come from an experiment where newly hatched chickens were randomly divided into 6 groups, each group receiving a different …
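A minimal sketch of such a model, assuming the outcome is weight and the categorical predictor is feed (the two columns of chickwts):

```r
# Linear regression with a 6-level categorical predictor
model <- lm(weight ~ feed, data = chickwts)
summary(model)

# The first factor level serves as the reference group
levels(chickwts$feed)[1]
```

With R's default treatment contrasts, the intercept is the mean weight of the reference group, and each remaining coefficient is the difference between one feed group and that reference.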

How to Run and Interpret a Logistic Regression Model in R

Here’s an example of running logistic regression in R: 1. Logistic regression equation The formula $$Survived \sim Age$$ corresponds to the logistic regression equation: $$\log\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 Age$$ Where $$P$$ is the probability of having the outcome, i.e. the probability of surviving. 2. Deviance residuals A deviance residual measures how much …

Logistic Regression in R (with Categorical Variables)

In this article, we will run and interpret a logistic regression model where the predictor is a categorical variable with multiple levels. Loading the data We will use the Titanic dataset available in R: Running a logistic regression model Next, we will use logistic regression to examine the effect of class (the predictor) on survival …
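One way this could look, as a sketch: the built-in Titanic object is a contingency table, so here it is converted to a data frame of counts and the counts are passed as weights. This is an assumption about the setup, not necessarily how the article prepares the data:

```r
# Titanic is a 4-way table; as.data.frame() gives one row per cell with a Freq count
titanic <- as.data.frame(Titanic)

# Effect of passenger class on survival, weighting each row by its count
model <- glm(Survived ~ Class, family = binomial,
             weights = Freq, data = titanic)
summary(model)

# Odds ratios relative to the reference class (1st)
exp(coef(model))
```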

Plot Logistic Regression Decision Boundary in R

In this article, we will produce the following R plot that represents the decision boundary of a logistic regression model: Here’s the full code used to generate it: Code explanation First, we create some data (2 continuous variables x1 and x2, and 1 binary variable y) and run a logistic regression: Next, we will create …

Stepwise (Linear & Logistic) Regression in R

In this article, we will cover: Let’s start by creating some data: To run a stepwise regression, use the stepAIC function from the MASS library. 1. How to run forward stepwise linear regression Output: Call: lm(formula = X1 ~ X4 + X3 + X7, data = dat) Residuals: Min 1Q Median 3Q Max -0.52407 -0.23122 …

How to Deal with Multicollinearity in R

Multicollinearity occurs when there is a strong linear relationship between 2 or more predictors in a regression model. It is a problem because it increases the standard errors of the regression coefficients, leading to noisy estimates. Let’s simulate some data in R: We have a collinearity problem in our model since our variables’ VIFs (Variance …

Linear Regression with Interaction in R

Output: Call: lm(formula = Y ~ X + Z + X:Z, data = dat) Residuals: Min 1Q Median 3Q Max -1.00058 -0.25209 0.00766 0.21640 0.89542 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.57717 0.11195 40.885 < 2e-16 *** X 0.44168 0.06551 6.742 3.38e-10 *** Z -1.23932 0.21937 -5.649 8.16e-08 *** X:Z 0.18859 0.03357 5.617 …

How to Describe/Summarize Numerical Data in R (Example)

Let’s start by creating IQ, a normally distributed numerical variable, with a mean of 100 and a standard deviation of 15, that represents the IQ scores of a sample of 100 participants: Next, we will (1) summarize this variable and (2) describe its distribution. 1. Summary statistics Reminder: The 1st quartile is the 25th percentile, …
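A minimal sketch of the setup described above. The seed is an arbitrary choice added here for reproducibility:

```r
set.seed(1)  # arbitrary seed, not from the article

# 100 simulated IQ scores with mean 100 and standard deviation 15
IQ <- rnorm(100, mean = 100, sd = 15)

summary(IQ)                    # min, quartiles, median, mean, max
sd(IQ)                         # sample standard deviation
quantile(IQ, c(0.25, 0.75))    # 1st and 3rd quartiles
```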

How to Describe/Summarize Categorical Data in R (Example)

Let’s start by creating our own data, consisting of 2 categorical variables: gender and smoking: Next, we will create a frequency table and a bar plot to summarize these data one variable at a time, then we will create a contingency table and a stacked bar plot to describe the relationship between the 2 variables. …
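The tables mentioned above can be sketched with base R. The six observations below are hypothetical and much smaller than a realistic sample:

```r
# Hypothetical data: 6 participants
gender  <- c("F", "M", "F", "F", "M", "M")
smoking <- c("yes", "no", "no", "yes", "no", "no")

freq <- table(gender)   # frequency table for one variable
prop.table(freq)        # the same table as proportions

tab <- table(gender, smoking)  # contingency table for both variables
tab
```

Passing freq to barplot() draws a bar plot, and passing tab draws stacked bars.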

Modulo Operator (%%) in R: Explained + Practical Examples

The modulo operator (%% in R) returns the remainder of the division of 2 numbers. Here are some examples: 5 %% 2 returns 1, because 2 goes into 5 two times and the remainder is 1 (i.e. 5 = 2 × 2 + 1). 4 %% 2 returns 0, since 4 = 2 × 2 …
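The two examples above can be checked directly in the console. The helper is_even() is a hypothetical name added for illustration:

```r
5 %% 2  # 1, since 5 = 2 * 2 + 1
4 %% 2  # 0, since 4 = 2 * 2 + 0

# Hypothetical helper: a number is even when its remainder modulo 2 is 0
is_even <- function(n) n %% 2 == 0
is_even(c(3, 10))  # FALSE TRUE
```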

Find the Minimum and Maximum of a Function in R

The function optimize (also spelled optimise) in R returns the minimum or maximum of a function f(x) within a specified interval. It takes as inputs: f: a function. interval: a vector containing the lower and upper bounds of the domain where we want to search for the minimum or maximum. maximum: a logical, where TRUE …
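A short sketch of both uses, with two hypothetical functions f and g chosen so that the answers are known in advance:

```r
# f has its minimum at x = 2, where f(2) = 1
f <- function(x) (x - 2)^2 + 1
res_min <- optimize(f, interval = c(0, 5))
res_min$minimum    # location of the minimum (approximately 2)
res_min$objective  # value of f at that point (approximately 1)

# g has its maximum at x = 1, where g(1) = 3
g <- function(x) -(x - 1)^2 + 3
res_max <- optimize(g, interval = c(-2, 4), maximum = TRUE)
res_max$maximum    # location of the maximum (approximately 1)
```

The search is numerical, so the returned location is accurate only up to a small tolerance.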

Write a Function that Returns the nth Fibonacci Number in R

Challenge: Write a function in R that prints the nth Fibonacci number. Reminder: the Fibonacci sequence is: 1, 1, 2, 3, 5, 8, … So, the first Fibonacci number is 1, the second is also 1, and then each subsequent number is the sum of the previous 2 in the sequence. Solution: In this article, …
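One possible solution, sketched as an iterative function. The name fib is hypothetical, and this is not necessarily the approach taken in the article:

```r
# Returns the nth Fibonacci number, with fib(1) = fib(2) = 1
fib <- function(n) {
  stopifnot(n >= 1)
  a <- 1  # previous number in the sequence
  b <- 1  # current number in the sequence
  if (n > 2) {
    for (i in 3:n) {
      tmp <- a + b
      a <- b
      b <- tmp
    }
  }
  b
}

print(fib(10))  # 55
```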

Plot a Step Function in Base R and ggplot2

As an example of a step function, we will use the floor function floor(x) that takes a real number x and returns the greatest integer less than or equal to x. Coding the floor function in R: Plotting in base R Output: We can add the data points to the plot with the following code: …

Solve a Polynomial in R

A polynomial p(x) is an expression of the form: $$p(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + \dots + a_nx^n$$ Where n is any non-negative integer. Solve a polynomial p(x) in R To solve the equation $$p(x) = 0$$ in R, we can use the function: polyroot. For example, let’s solve the equation: …
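As a sketch of how polyroot() is called, take the hypothetical polynomial $$x^2 - 3x + 2$$ (chosen here for illustration, not the equation from the article). The coefficients are passed in increasing order of degree, i.e. starting with $$a_0$$:

```r
# x^2 - 3x + 2, written as c(a0, a1, a2)
roots <- polyroot(c(2, -3, 1))
roots      # returned as complex numbers
Re(roots)  # real parts: the roots are 1 and 2
```

Because polyroot() always returns complex values, real roots show up with an imaginary part that is zero or negligibly small.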

Coding and Plotting a Piecewise Function in R

In this tutorial we are going to code the following function in R: $$f(x) =\begin{cases}-x, & \text{if } x < -1 \\x^2, & \text{if } x \geq -1\end{cases}$$ And produce the following plot: Coding the piecewise function f(x) in R Using if else statements While this code is easy to read and understand, it does not support …
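A compact sketch of this function using ifelse(), which evaluates the condition element by element:

```r
# -x when x < -1, x^2 otherwise
f <- function(x) ifelse(x < -1, -x, x^2)

f(c(-3, -1, 0, 2))  # 3 1 0 4
```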

How to Plot a Quadratic Function in R

For the following quadratic function: $$f(x) = x^2 + 2x - 20$$ Here’s the plot that we want to produce: Coding the function f(x) in R A quadratic function is a function of the form: $$ax^2 + bx + c$$, where $$a \neq 0$$. So for $$f(x) = x^2 + 2x - 20$$: a = …

Find the Line Equation From 2 Points in R

Suppose we want to know the equation of the line that passes through 2 points A and B, such that: Quick solution Output: Call: lm(formula = ys ~ xs) Coefficients: (Intercept) xs 58.833 -2.417 So, the equation of the line that passes through A and B is: $$f(x) = -2.417x + 58.833$$ To get more …
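The same approach can be sketched with two hypothetical points, A = (1, 3) and B = (3, 7), which are not the points used in the article:

```r
# Hypothetical points A = (1, 3) and B = (3, 7)
xs <- c(1, 3)
ys <- c(3, 7)

fit <- lm(ys ~ xs)
coef(fit)  # intercept 1 and slope 2, i.e. f(x) = 2x + 1
```

With exactly 2 points the fitted line passes through both, so the coefficients are the intercept and slope of the line itself.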

Writing Custom Functions in R

In this article, you will learn how to write your own functions in R. Specifically, we will cover: How to write a simple function How to write a more complex function How to write an anonymous function How to write a function with an unfixed number of arguments How to write a recursive function 1. …
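A brief sketch of four of these kinds of function. All names (square, sum_all, fact) are hypothetical examples rather than the ones used in the article:

```r
# A simple function
square <- function(x) x^2

# An anonymous function, defined inline and never named
plus_one <- sapply(1:3, function(x) x + 1)

# A function with an unfixed number of arguments, collected through ...
sum_all <- function(...) {
  args <- c(...)
  sum(args)
}

# A recursive function: factorial of n
fact <- function(n) {
  if (n <= 1) 1 else n * fact(n - 1)
}

square(4)         # 16
plus_one          # 2 3 4
sum_all(1, 2, 3)  # 6
fact(5)           # 120
```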

How to Solve an Equation in R

In this article, we will use the uniroot.all() function from the rootSolve package to find all the solutions of an equation over a given interval (or domain). Input: uniroot.all() takes 2 arguments: a function f and an interval. How it works: It searches the interval for all possible roots of f. Output: uniroot.all() returns a vector …

Create and Graph Intervals in R

A quick review of intervals The open interval from a to b, denoted (a, b), consists of all numbers between a and b excluding the endpoints a and b. Open circles in the graph indicate that the endpoints are excluded: The closed interval from a to b, denoted [a, b], consists of all numbers between …

Working with Sets in R (Tutorial)

A set is an unordered collection of unique elements. Sets are useful for keeping track of distinct objects. In this tutorial, you will learn how to: Create a set Manipulate sets Work with subsets Apply set operations 1. Create a set 1.1. Create a set from scratch A set can contain different types of elements, …
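Base R has no dedicated set type, so one common approach is to represent a set as a vector without duplicates. A minimal sketch under that assumption, with hypothetical sets a and b:

```r
a <- unique(c(1, 2, 2, 3))  # duplicates dropped: 1 2 3
b <- c(3, 4, 5)

union(a, b)       # 1 2 3 4 5
intersect(a, b)   # 3
setdiff(a, b)     # 1 2 (elements of a that are not in b)

2 %in% a              # TRUE: membership test
all(c(1, 2) %in% a)   # TRUE: {1, 2} is a subset of a
setequal(a, c(3, 2, 1))  # TRUE: order does not matter
```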