Linear Regression Example for Time Series Data in R

In this tutorial, we will use a linear regression model to examine the relationship between the Google search trends for the terms headache and ibuprofen.

1. Prepare the data

The package `gtrendsR` presents an interface to retrieve the number of Google searches over time for a specific term:

```library(gtrendsR)

time = "2020-01-01 2023-01-01",
gprop = "web",
onlyInterest = TRUE)\$interest_over_time

#        date hits  keyword   geo                  time gprop category
#1 2020-01-05   43 headache world 2020-01-01 2023-01-01   web        0
#2 2020-01-12   41 headache world 2020-01-01 2023-01-01   web        0
#3 2020-01-19   42 headache world 2020-01-01 2023-01-01   web        0
#4 2020-01-26   41 headache world 2020-01-01 2023-01-01   web        0
#5 2020-02-02   40 headache world 2020-01-01 2023-01-01   web        0
#6 2020-02-09   41 headache world 2020-01-01 2023-01-01   web        0```

1.2. Plot the trends

```library(fpp3) # to work with time series data

raw_data |>
ggplot(aes(x = date, y = hits, color = keyword)) +
geom_line()```

Output:

1.3. Group data by month

```trends_df <-
as_tibble(raw_data) |>
group_by(year = year(date),
month = month(date),
keyword) |>
count(wt = hits, name = "hits")

trends_df
## A tibble: 74 × 4
## Groups:   year, month, keyword [74]
#    year month keyword    hits
#   <dbl> <dbl> <chr>     <int>
# 1  2020     1 headache    167
# 2  2020     1 ibuprofen   100
# 3  2020     2 headache    166
# 4  2020     2 ibuprofen   103
# 5  2020     3 headache    266
# 6  2020     3 ibuprofen   248
# 7  2020     4 headache    201
# 8  2020     4 ibuprofen    80
# 9  2020     5 headache    220
#10  2020     5 ibuprofen    95
## ℹ 64 more rows
## ℹ Use `print(n = ...)` to see more rows```

Next, we will create 2 separate variables (`headache` and `ibuprofen`) from the column `keyword`:

```trends_df <- trends_df |>
pivot_wider(names_from = keyword,
values_from = hits)
trends_df
## A tibble: 37 × 4
## Groups:   year, month [37]
#   <dbl> <dbl>    <int>     <int>
# 1  2020     1      167       100
# 2  2020     2      166       103
# 3  2020     3      266       248
# 4  2020     4      201        80
# 5  2020     5      220        95
# 6  2020     6      179        83
# 7  2020     7      195        87
# 8  2020     8      219       105
# 9  2020     9      176        84
#10  2020    10      173        85
## ℹ 27 more rows
## ℹ Use `print(n = ...)` to see more rows```

And we plot the relationship between headache and ibuprofen:

```trends_df |>
ggplot(aes(x = ibuprofen, y = headache)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)```

Output:

The relationship looks somewhat linear, so a linear regression model will be appropriate.

As a last step in the data transformation, we group the `month` and `year` variables into 1 variable and we transform the data frame into a time series table (tsibble):

```trends_ts <- trends_df |>
unite(date, year, month) |>
mutate(date = yearmonth(date)) |>
as_tsibble(index = date)

trends_ts
## A tsibble: 37 x 3 [1M]
#      <mth>    <int>     <int>
# 1 2020 Jan      167       100
# 2 2020 Feb      166       103
# 3 2020 Mar      266       248
# 4 2020 Apr      201        80
# 5 2020 May      220        95
# 6 2020 Jun      179        83
# 7 2020 Jul      195        87
# 8 2020 Aug      219       105
# 9 2020 Sep      176        84
#10 2020 Oct      173        85
## ℹ 27 more rows
## ℹ Use `print(n = ...)` to see more rows```

2. Run linear regression

2.1. Run linear regression and check the output

We will now use a Time Series Linear Model (TSLM) to investigate the degree of association between the Google search frequencies of the words headache and ibuprofen.

```lin_reg <- model(trends_ts, TSLM(headache ~ ibuprofen))

report(lin_reg)
#Model: TSLM
#
#Residuals:
#    Min      1Q  Median      3Q     Max
#-81.409 -17.114  -1.486  13.753  47.006
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 105.5416    15.9110   6.633 1.14e-07 ***
#ibuprofen     0.7314     0.1370   5.339 5.74e-06 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 27.15 on 35 degrees of freedom
#Multiple R-squared: 0.4489,	Adjusted R-squared: 0.4331
#F-statistic: 28.51 on 1 and 35 DF, p-value: 5.7435e-06```

The model shows that the relationship is positive and statistically significant.

But maybe we can do better by adding a linear trend as a predictor in the model (this is equivalent to adding a variable that represents row numbers):

```lin_reg <- model(trends_ts, TSLM(headache ~ ibuprofen + trend()))

report(lin_reg)
#Model: TSLM
#
#Residuals:
#    Min      1Q  Median      3Q     Max
#-52.175 -10.126  -1.794  13.898  40.361
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept) 124.6388    14.2737   8.732 3.33e-10 ***
#ibuprofen     0.7972     0.1167   6.832 7.30e-08 ***
#trend()      -1.3914     0.3561  -3.907 0.000422 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 22.89 on 34 degrees of freedom
#Multiple R-squared: 0.6197,	Adjusted R-squared: 0.5973
#F-statistic:  27.7 on 2 and 34 DF, p-value: 7.2947e-08```

Indeed, the model is now better as the trend variable is statistically significant and the adjusted R-squared is higher than the previous model.

2.2. Check how well the fitted values match the actual values

We will do this by comparing the plot of the time series for the fitted values and the actual trend of headache:

```fitted <- augment(lin_reg)

fitted
## A tsibble: 37 x 6 [1M]
## Key:       .model [1]
#   .model                                   date headache .fitted .resid .innov
#   <chr>                                   <mth>    <int>   <dbl>  <dbl>  <dbl>
# 1 TSLM(headache ~ ibuprofen + trend()) 2020 Jan      167    203. -36.0  -36.0
# 2 TSLM(headache ~ ibuprofen + trend()) 2020 Feb      166    204. -38.0  -38.0
# 3 TSLM(headache ~ ibuprofen + trend()) 2020 Mar      266    318. -52.2  -52.2
# 4 TSLM(headache ~ ibuprofen + trend()) 2020 Apr      201    183.  18.1   18.1
# 5 TSLM(headache ~ ibuprofen + trend()) 2020 May      220    193.  26.6   26.6
# 6 TSLM(headache ~ ibuprofen + trend()) 2020 Jun      179    182.  -3.46  -3.46
# 7 TSLM(headache ~ ibuprofen + trend()) 2020 Jul      195    184.  10.7   10.7
# 8 TSLM(headache ~ ibuprofen + trend()) 2020 Aug      219    197.  21.8   21.8
# 9 TSLM(headache ~ ibuprofen + trend()) 2020 Sep      176    179.  -3.08  -3.08
#10 TSLM(headache ~ ibuprofen + trend()) 2020 Oct      173    178.  -5.49  -5.49
## ℹ 27 more rows
## ℹ Use `print(n = ...)` to see more rows

fitted |>
ggplot(aes(x = date)) +
geom_line(aes(y = .fitted), color = "red")```

Output:

The plot shows that the model produced values that are close enough to the real values.

3. Check linear regression assumptions

Assumptions #1 and #2: the residuals must be uncorrelated and normally distributed

`gg_tsresiduals(lin_reg)`

Output:

The bottom left plot shows that the residuals are not autocorrelated (since no spike surpasses the blue borders), and the bottom right plot shows that the residuals’ distribution is close to normal.

We can also check the Ljung-Box test, where:

• H0: the residuals are independently distributed (not autocorrelated).
• H1: the residuals are autocorrelated.
```# we use lag = 10 as recommended by Hyndman and Athanasopoulos (see references below)
features(fitted, .resid, ljung_box, lag = 10)
## A tibble: 1 × 3
#  .model                               lb_stat lb_pvalue
#  <chr>                                  <dbl>     <dbl>
#1 TSLM(headache ~ ibuprofen + trend())    7.31     0.696```

The test is not statistically significant meaning that we don’t have enough evidence that the residuals are autocorrelated (which is what we want).

Assumption #3: the residuals must have equal variance (i.e. homoscedastic)

```augment(lin_reg) |>
ggplot(aes(x = .fitted, y = .resid)) +
geom_point()```

Output:

This assumption is satisfied since the plot of the residuals versus the fitted values does not show any discernible pattern, i.e. no fan, blob, or butterfly shape (for more information see: Understand Linear Regression Assumptions).

Assumption #4: the residuals must not be related to the predictor(s)

If the plot of the residuals versus any of the predictors in the linear model shows a discernible pattern, this means that the relationship may be non-linear and should be accounted for in the model.

```resid_df <-
residuals(lin_reg) |>
left_join(trends_ts, join_by(date))

resid_df
## A tsibble: 37 x 5 [1M]
## Key:       .model [1]
#   .model                                   date .resid headache ibuprofen
#   <chr>                                   <mth>  <dbl>    <int>     <int>
# 1 TSLM(headache ~ ibuprofen + trend()) 2020 Jan -36.0       167       100
# 2 TSLM(headache ~ ibuprofen + trend()) 2020 Feb -38.0       166       103
# 3 TSLM(headache ~ ibuprofen + trend()) 2020 Mar -52.2       266       248
# 4 TSLM(headache ~ ibuprofen + trend()) 2020 Apr  18.1       201        80
# 5 TSLM(headache ~ ibuprofen + trend()) 2020 May  26.6       220        95
# 6 TSLM(headache ~ ibuprofen + trend()) 2020 Jun  -3.46      179        83
# 7 TSLM(headache ~ ibuprofen + trend()) 2020 Jul  10.7       195        87
# 8 TSLM(headache ~ ibuprofen + trend()) 2020 Aug  21.8       219       105
# 9 TSLM(headache ~ ibuprofen + trend()) 2020 Sep  -3.08      176        84
#10 TSLM(headache ~ ibuprofen + trend()) 2020 Oct  -5.49      173        85
## ℹ 27 more rows
## ℹ Use `print(n = ...)` to see more rows

resid_df |>
ggplot(aes(x = ibuprofen, y = .resid)) +
geom_point()```

Output:

The plot shows no discernible pattern, suggesting that the linear model is a good fit to the data.

References

• Hyndman RJ, Athanasopoulos G. Forecasting: Principles and Practice. 3rd ed. edition. Otexts; 2021.