Plot the monthly distribution for each year in R (seasonality)

In this article, we will produce the following plot in R:

[Figure: seasonality plot in R showing the monthly distribution of new COVID-19 cases for the past 4 years]

1. Load the data

First, we will use the COVID19 package in R to download data for Lebanon. We will limit our analysis to 2 variables: (1) the date and (2) the number of new daily cases, which we compute as the day-to-day difference in the cumulative count of confirmed cases.

library(COVID19) # load data
library(tidyverse) # manipulate data

dat <- tibble(covid19(country = "Lebanon")) |> 
  # confirmed is cumulative, so daily cases = today's total minus yesterday's
  mutate(daily_cases = confirmed - lag(confirmed)) |>
  select(date, daily_cases)

dat
## A tibble: 1,162 × 2
#   date       daily_cases
#   <date>           <int>
# 1 2020-01-03          NA
# 2 2020-01-04          NA
# 3 2020-01-05          NA
# 4 2020-01-06          NA
# 5 2020-01-07          NA
# 6 2020-01-08          NA
# 7 2020-01-09          NA
# 8 2020-01-10          NA
# 9 2020-01-11          NA
#10 2020-01-12          NA
## ℹ 1,152 more rows
## ℹ Use `print(n = ...)` to see more rows
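
To see what the differencing step does, here is a minimal sketch on a made-up table (the `toy` object and its numbers are hypothetical, not taken from the COVID19 package). It shows why the first rows are NA: `lag()` has no previous value to subtract on the first day, and any day that follows a missing cumulative count is also NA.

```r
library(dplyr)

# Hypothetical cumulative counts for five consecutive days (made-up numbers)
toy <- tibble(
  date = as.Date("2020-03-01") + 0:4,
  confirmed = c(NA, 2L, 5L, 5L, 9L)
)

# Same differencing step as in the article
toy <- toy |>
  mutate(daily_cases = confirmed - lag(confirmed))

toy$daily_cases
# NA NA 3 0 4
```

Note that in this toy example the 2 cases reported on the second day are lost, because the previous cumulative value is missing.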

2. Count monthly cases

To get monthly totals, we group the data by year and month, and then sum the daily cases within each group:

# count new cases each month
dat <- dat |> 
  group_by(year = year(date),
           month = month(date)) |> 
  count(wt = daily_cases, name = "monthly_cases")

dat
## A tibble: 39 × 3
## Groups:   year, month [39]
#    year month monthly_cases
#   <dbl> <dbl>         <int>
# 1  2020     1             0
# 2  2020     2             3
# 3  2020     3           466
# 4  2020     4           255
# 5  2020     5           495
# 6  2020     6           558
# 7  2020     7          2777
# 8  2020     8         12753
# 9  2020     9         22326
#10  2020    10         41594
## ℹ 29 more rows
## ℹ Use `print(n = ...)` to see more rows
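
As a rough check of what `count(wt = ...)` computes, the sketch below compares it with an explicit `summarise()` on a small made-up table (the `toy`, `by_count` and `by_sum` names and the numbers are hypothetical). To our understanding, the weighted count sums the weight column within each group and ignores missing values, which would explain why January 2020 shows 0 rather than NA.

```r
library(dplyr)

# Hypothetical daily data spanning two months (made-up numbers)
toy <- tibble(
  year = c(2020, 2020, 2020, 2020, 2020),
  month = c(1, 1, 2, 2, 2),
  daily_cases = c(NA, NA, 1L, 0L, 2L)
)

# Weighted count, as used in the article
by_count <- toy |>
  group_by(year, month) |>
  count(wt = daily_cases, name = "monthly_cases") |>
  ungroup()

# Explicit sum that skips missing values
by_sum <- toy |>
  group_by(year, month) |>
  summarise(monthly_cases = sum(daily_cases, na.rm = TRUE), .groups = "drop")

by_count$monthly_cases
by_sum$monthly_cases
```

Both approaches should return the same monthly totals; the `summarise()` version is longer but makes the handling of missing values explicit.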

3. Transform to time series table (tsibble)

Next, we transform the tibble object into a tsibble using the fpp3 package, and we specify the monthly date as the index of the table:

library(fpp3) # use time series tables

# transform to time series
dat <- dat |> 
  unite(date, year, month) |> 
  mutate(date = yearmonth(date)) |> 
  as_tsibble(index = date)

dat
## A tsibble: 39 x 2 [1M]
#       date monthly_cases
#      <mth>         <int>
# 1 2020 Jan             0
# 2 2020 Feb             3
# 3 2020 Mar           466
# 4 2020 Apr           255
# 5 2020 May           495
# 6 2020 Jun           558
# 7 2020 Jul          2777
# 8 2020 Aug         12753
# 9 2020 Sep         22326
#10 2020 Oct         41594
## ℹ 29 more rows
## ℹ Use `print(n = ...)` to see more rows
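
An alternative to pasting the year and month together and parsing the result is to build the index directly with `make_yearmonth()` from the tsibble package (which fpp3 loads). The sketch below is only an illustration on a 3-row table; the `toy` and `toy_ts` names are hypothetical, and the counts are copied from the first three rows of the output above.

```r
library(dplyr)
library(tsibble)

# First three monthly totals from the output above
toy <- tibble(
  year = c(2020L, 2020L, 2020L),
  month = c(1L, 2L, 3L),
  monthly_cases = c(0L, 3L, 466L)
)

# Build the year-month index from its numeric parts, then convert to a tsibble
toy_ts <- toy |>
  mutate(date = make_yearmonth(year = year, month = month)) |>
  select(date, monthly_cases) |>
  as_tsibble(index = date)

toy_ts
```

This avoids relying on how `yearmonth()` parses a string such as "2020_1", at the cost of one extra `select()` to drop the year and month columns.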

4. Plot the monthly distribution

Now, we can use the function gg_season() and specify that we want to plot the variable called monthly_cases:

dat |> gg_season(monthly_cases) +
  geom_point()

Output:

[Figure: seasonality plot in R]

Interpretation: The graph shows the seasonal variation in monthly new COVID-19 cases in Lebanon from January 2020 to March 2023. Cases peaked in the winter months (November to February) and in the summer months (July and August). These periods coincide with longer indoor stays due to cold weather and with higher tourism activity, respectively. Case counts declined significantly by the end of 2022.
