How to Summarize Data in R (Using Dplyr)

In this article, we will cover how to apply the function summarize() from the dplyr package using the following data:

library(dplyr)

# create a tibble (like a data.frame)
my_df <- tibble(
  gender = c('male', 'female', 'male', 'female'),
  age = c(20, 21, 30, 27),
  weight = c(60, 55, 73, NA)
  )

my_df
## A tibble: 4 x 3
#  gender   age weight
#  <chr>  <dbl>  <dbl>
#1 male      20     60
#2 female    21     55
#3 male      30     73
#4 female    27     NA

1. Summarizing a variable

Use the following code to calculate the average age in our dataset:

How the summarize function in R works
my_df |>
  summarize(avg_age = mean(age))

## A tibble: 1 x 1
#  avg_age
#    <dbl>
#1    24.5

The summarize() function can take many arguments:

my_df |>
  summarize(min_age = min(age), # minimum
            median_age = median(age), # median
            max_age = max(age), # maximum
            avg_age = mean(age), # average
            sd_age = sd(age), # standard deviation
            count = n()) # nb of observations

## A tibble: 1 x 6
#  min_age median_age max_age avg_age sd_age count
#    <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <int>
#1      20         24      30    24.5   4.80     4

2. Grouping and summarizing

Use the following code to calculate the average age in each gender category:

How the functions group_by and summarize work together
my_df |> 
  group_by(gender) |> 
  summarize(avg_age = mean(age))

## A tibble: 2 x 2
#  gender avg_age
#  <chr>    <dbl>
#1 female      24
#2 male        25

3. Dealing with missing values

The variable weight contains a missing value, so calculating the average weight for each gender category using the following code produces an NA value:

my_df |> 
  group_by(gender) |> 
  summarize(avg_weight = mean(weight))

## A tibble: 2 x 2
#  gender avg_weight
#  <chr>       <dbl>
#1 female       NA  
#2 male         66.5

One simple solution would be to use the na.rm = TRUE argument inside mean() to remove NA values when calculating the average weight:

my_df |> 
  group_by(gender) |> 
  summarize(avg_weight = mean(weight, na.rm = TRUE))

## A tibble: 2 x 2
#  gender avg_weight
#  <chr>       <dbl>
#1 female       55  
#2 male         66.5

While this code works, it creates another subtle problem which can be illustrated by calculating the average weight and counting the number of values in each gender category:

my_df |> 
  group_by(gender) |> 
  summarize(avg_weight = mean(weight, na.rm = TRUE),
            count = n())

## A tibble: 2 x 3
#  gender avg_weight count
#  <chr>       <dbl> <int>
#1 female       55       2
#2 male         66.5     2

Notice that the number of observations for the female category is 2 which is misleading because one of them is missing.

To solve this issue, we can remove missing values before calling summarize() by using the function filter(), as follows:

my_df |> 
  filter(!is.na(gender) & !is.na(weight)) |> 
  group_by(gender) |> 
  summarize(avg_weight = mean(weight),
            count = n())

## A tibble: 2 x 3
#  gender avg_weight count
#  <chr>       <dbl> <int>
#1 female       55       1
#2 male         66.5     2

Now it is clear that female has 1 observation and male has 2 observations based on which we calculated the average weights.

Further reading