Join Dataframes in R: Left/Right/Inner/Full Joins

Joining 2 dataframes involves linking the rows in one to the rows in the other. This can be done in different ways using the dplyr functions: left_join(), right_join(), inner_join(), and full_join().

Here’s an illustration of the differences between them:

different types of joins in R

In order to demonstrate how these functions work, let’s first create some data:

library(dplyr)

df1 <- data.frame(ID = c(1,2,3),
                  T1 = c("a","b","c"))
df2 <- data.frame(ID = c(1,2,99),
                  T2 = c("d","e","f"))

df1
#  ID T1
#1  1  a
#2  2  b
#3  3  c

df2
#  ID T2
#1  1  d
#2  2  e
#3 99  f 

1. Left join

left_join(df1, df2, join_by(ID)) returns a new dataframe that contains all the variables of df1 and df2, except that it has only 1 ID variable (since ID is the join key here). The values of ID will be those of df1$ID (i.e. the dataframe on the left) and will determine how the rest of the variables will be filled: with the corresponding value from each dataframe, or with NA whenever a row with a matching ID cannot be not found in df2.

left_join(df1, df2, join_by(ID))
#  ID T1   T2
#1  1  a    d
#2  2  b    e
#3  3  c <NA>

2. Right join

right_join(df1, df2, join_by(ID)) works in the same way as left_join except that the ID values will be those of df2$ID (i.e. the dataframe on the right).

right_join(df1, df2, join_by(ID))
#  ID   T1 T2
#1  1    a  d
#2  2    b  e
#3 99 <NA>  f

3. Inner join

With inner_join(df1, df2, join_by(ID)) the values of ID (the join key) in the new dataframe will be those that are in common between df1$ID and df2$ID.

inner_join(df1, df2, join_by(ID))
#  ID T1 T2
#1  1  a  d
#2  2  b  e

4. Full join

With full_join(df1, df2, join_by(ID)) the values of ID (the join key) in the new dataframe will be those that are either in df1$ID or df2$ID.

full_join(df1, df2, join_by(ID))
#  ID   T1   T2
#1  1    a    d
#2  2    b    e
#3  3    c <NA>
#4 99 <NA>    f

Note 1

If the join key is not specified, for instance left_join(df1, df2), the join function will use all variables that appear in both df1 and df2 as join keys. Here’s an example:

df1 <- data.frame(ID = c(1,2,3),
                  NAME = c("Tom","Bob","Ray"),
                  T1 = c("a","b","c"))
df2 <- data.frame(ID = c(1,2,99),
                  NAME = c("Tom","Bob","Amy"),
                  T2 = c("d","e","f"))

left_join(df1, df2)
#Joining with `by = join_by(ID, name)`
#  ID NAME T1   T2
#1  1  Tom  a    d
#2  2  Bob  b    e
#3  3  Ray  c <NA>

Note 2

If the join key has different names in each dataframe, for instance ID1 and ID2, we can use the following syntax: left_join(df1, df2, joinby(ID1 == ID2)).

df1 <- data.frame(ID1 = c(1,2,3),
                  T1 = c("a","b","c"))
df2 <- data.frame(ID2 = c(1,2,99),
                  T2 = c("d","e","f"))

left_join(df1, df2, join_by(ID1 == ID2))
#  ID1 T1   T2
#1   1  a    d
#2   2  b    e
#3   3  c <NA>

References

Wickham H, Çetinkaya-Rundel M, Grolemund G. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. O’Reilly Media; 2023.

Further reading