# How to Summarize Data in R (Using Dplyr)

In this article, we will cover how to apply the function `summarize()` from the `dplyr` package using the following data:

```library(dplyr)

# create a tibble (like a data.frame)
my_df <- tibble(
gender = c('male', 'female', 'male', 'female'),
age = c(20, 21, 30, 27),
weight = c(60, 55, 73, NA)
)

my_df
## A tibble: 4 x 3
#  gender   age weight
#  <chr>  <dbl>  <dbl>
#1 male      20     60
#2 female    21     55
#3 male      30     73
#4 female    27     NA```

## 1. Summarizing a variable

Use the following code to calculate the average age in our dataset:

```my_df |>
summarize(avg_age = mean(age))

## A tibble: 1 x 1
#  avg_age
#    <dbl>
#1    24.5```

The `summarize()` function can take many arguments:

```my_df |>
summarize(min_age = min(age), # minimum
median_age = median(age), # median
max_age = max(age), # maximum
avg_age = mean(age), # average
sd_age = sd(age), # standard deviation
count = n()) # nb of observations

## A tibble: 1 x 6
#  min_age median_age max_age avg_age sd_age count
#    <dbl>      <dbl>   <dbl>   <dbl>  <dbl> <int>
#1      20         24      30    24.5   4.80     4```

## 2. Grouping and summarizing

Use the following code to calculate the average age in each gender category:

```my_df |>
group_by(gender) |>
summarize(avg_age = mean(age))

## A tibble: 2 x 2
#  gender avg_age
#  <chr>    <dbl>
#1 female      24
#2 male        25```

## 3. Dealing with missing values

The variable weight contains a missing value, so calculating the average weight for each gender category using the following code produces an NA value:

```my_df |>
group_by(gender) |>
summarize(avg_weight = mean(weight))

## A tibble: 2 x 2
#  gender avg_weight
#  <chr>       <dbl>
#1 female       NA
#2 male         66.5```

One simple solution would be to use the `na.rm = TRUE` argument inside `mean()` to remove NA values when calculating the average weight:

```my_df |>
group_by(gender) |>
summarize(avg_weight = mean(weight, na.rm = TRUE))

## A tibble: 2 x 2
#  gender avg_weight
#  <chr>       <dbl>
#1 female       55
#2 male         66.5```

While this code works, it creates another subtle problem which can be illustrated by calculating the average weight and counting the number of values in each gender category:

```my_df |>
group_by(gender) |>
summarize(avg_weight = mean(weight, na.rm = TRUE),
count = n())

## A tibble: 2 x 3
#  gender avg_weight count
#  <chr>       <dbl> <int>
#1 female       55       2
#2 male         66.5     2```

Notice that the number of observations for the female category is 2 which is misleading because one of them is missing.

To solve this issue, we can remove missing values before calling `summarize()` by using the function `filter()`, as follows:

```my_df |>
filter(!is.na(gender) & !is.na(weight)) |>
group_by(gender) |>
summarize(avg_weight = mean(weight),
count = n())

## A tibble: 2 x 3
#  gender avg_weight count
#  <chr>       <dbl> <int>
#1 female       55       1
#2 male         66.5     2```

Now it is clear that female has 1 observation and male has 2 observations based on which we calculated the average weights.