Joining 2 dataframes involves linking the rows in one to the rows in the other. This can be done in different ways using the dplyr functions: left_join()
, right_join()
, inner_join()
, and full_join()
.
Here’s an illustration of the differences between them:
In order to demonstrate how these functions work, let’s first create some data:
library(dplyr) df1 <- data.frame(ID = c(1,2,3), T1 = c("a","b","c")) df2 <- data.frame(ID = c(1,2,99), T2 = c("d","e","f")) df1 # ID T1 #1 1 a #2 2 b #3 3 c df2 # ID T2 #1 1 d #2 2 e #3 99 f
1. Left join
left_join(df1, df2, join_by(ID))
returns a new dataframe that contains all the variables of df1 and df2, except that it has only 1 ID variable (since ID is the join key here). The values of ID will be those of df1$ID
(i.e. the dataframe on the left) and will determine how the rest of the variables will be filled: with the corresponding value from each dataframe, or with NA whenever a row with a matching ID cannot be not found in df2.
left_join(df1, df2, join_by(ID)) # ID T1 T2 #1 1 a d #2 2 b e #3 3 c <NA>
2. Right join
right_join(df1, df2, join_by(ID))
works in the same way as left_join except that the ID values will be those of df2$ID
(i.e. the dataframe on the right).
right_join(df1, df2, join_by(ID)) # ID T1 T2 #1 1 a d #2 2 b e #3 99 <NA> f
3. Inner join
With inner_join(df1, df2, join_by(ID))
the values of ID (the join key) in the new dataframe will be those that are in common between df1$ID
and df2$ID
.
inner_join(df1, df2, join_by(ID)) # ID T1 T2 #1 1 a d #2 2 b e
4. Full join
With full_join(df1, df2, join_by(ID))
the values of ID (the join key) in the new dataframe will be those that are either in df1$ID
or df2$ID
.
full_join(df1, df2, join_by(ID)) # ID T1 T2 #1 1 a d #2 2 b e #3 3 c <NA> #4 99 <NA> f
Note 1
If the join key is not specified, for instance left_join(df1, df2)
, the join function will use all variables that appear in both df1 and df2 as join keys. Here’s an example:
df1 <- data.frame(ID = c(1,2,3), NAME = c("Tom","Bob","Ray"), T1 = c("a","b","c")) df2 <- data.frame(ID = c(1,2,99), NAME = c("Tom","Bob","Amy"), T2 = c("d","e","f")) left_join(df1, df2) #Joining with `by = join_by(ID, name)` # ID NAME T1 T2 #1 1 Tom a d #2 2 Bob b e #3 3 Ray c <NA>
Note 2
If the join key has different names in each dataframe, for instance ID1 and ID2, we can use the following syntax: left_join(df1, df2, joinby(ID1 == ID2))
.
df1 <- data.frame(ID1 = c(1,2,3), T1 = c("a","b","c")) df2 <- data.frame(ID2 = c(1,2,99), T2 = c("d","e","f")) left_join(df1, df2, join_by(ID1 == ID2)) # ID1 T1 T2 #1 1 a d #2 2 b e #3 3 c <NA>
References
Wickham H, Çetinkaya-Rundel M, Grolemund G. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. O’Reilly Media; 2023.