In this tutorial, we will use the *discoveries* dataset, which ships with R, as an example of a time series. The dataset contains the yearly count of important scientific discoveries from 1860 to 1959.

## 1. Load the data

    library(fpp3)

    dat <- as_tsibble(discoveries)
    dat
    ## A tsibble: 100 x 2 [1Y]
    #   index value
    #   <dbl> <dbl>
    # 1  1860     5
    # 2  1861     3
    # 3  1862     0
    # 4  1863     2
    # 5  1864     0
    # 6  1865     3
    # 7  1866     2
    # 8  1867     3
    # 9  1868     6
    #10  1869     1
    ## ℹ 90 more rows
    ## ℹ Use `print(n = ...)` to see more rows

A *tsibble* is a time series table that has an index (the year) and a value (the number of discoveries).

This format makes analysis and plotting a lot easier than working with an ordinary *data.frame* or *tibble*.
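To make the structure concrete, here is a minimal sketch that rebuilds the same two columns by hand in base R. It assumes `discoveries` is the built-in yearly `ts` object from the datasets package; the name `dat_manual` is illustrative and not part of the tutorial's workflow.

```r
# `discoveries` ships with base R as a yearly time series object
class(discoveries)
frequency(discoveries)   # number of observations per year

# Rebuild the two columns of the tsibble by hand (illustrative name)
dat_manual <- data.frame(
  index = as.numeric(time(discoveries)),  # the year of each observation
  value = as.numeric(discoveries)         # the number of discoveries
)

head(dat_manual, 3)
```

Unlike this plain data frame, the tsibble also records which column is the time index and that the observations are one year apart (the `[1Y]` in the printout), which is what the plotting and modelling functions rely on.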

## 2. Plot the time series

    autoplot(dat, value)

**Output:** [line plot of *value* against the year]

The function `autoplot()` automatically plots the change in the variable *value* over time, given that the object *dat* is a *tsibble*.

## 3. Check if the data is white noise

In order to determine if our data is *white noise* (i.e. a sequence of uncorrelated random values with no predictable pattern), we will run a statistical test: the Ljung-Box test.

- $H_0$: The data is white noise.
- $H_1$: There exists an actual pattern in the data.

We will use lag = 10, as suggested by Hyndman and Athanasopoulos for non-seasonal data.

    features(dat, value, ljung_box, lag = 10)
    ## A tibble: 1 × 2
    #  lb_stat lb_pvalue
    #    <dbl>     <dbl>
    #1    28.2   0.00169

The p-value of the Ljung-Box test is 0.00169 (< 0.05), so we reject the null hypothesis that the data is white noise. In other words, the yearly counts are autocorrelated: the number of discoveries in one year carries information about the numbers in nearby years.
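To see what the test measures, the statistic can be computed by hand. The Ljung-Box statistic sums the squared sample autocorrelations up to the chosen lag, each weighted by the number of observations available at that lag, and compares the result with a chi-squared distribution. The sketch below uses base R only and assumes the built-in `discoveries` series; the variable names are illustrative, and `Box.test()` from the stats package serves as a cross-check.

```r
y <- as.numeric(discoveries)
n <- length(y)
h <- 10                                   # number of lags, as in the text

# Sample autocorrelations r_1, ..., r_h (drop lag 0, which is always 1)
r <- acf(y, lag.max = h, plot = FALSE)$acf[-1]

# Ljung-Box statistic: Q = n * (n + 2) * sum over k of r_k^2 / (n - k)
Q <- n * (n + 2) * sum(r^2 / (n - seq_len(h)))

# Under the null hypothesis, Q approximately follows a chi-squared
# distribution with h degrees of freedom
p_value <- pchisq(Q, df = h, lower.tail = FALSE)

c(lb_stat = Q, lb_pvalue = p_value)

# Cross-check with the built-in implementation
Box.test(y, lag = h, type = "Ljung-Box")
```

Large autocorrelations at any of the first 10 lags inflate the statistic, which is why a small p-value is read as evidence against white noise.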

## 4. Fit a simple time series model

A simple model that we can fit to our data is the naïve model, which predicts for each period the value of the previous observation.

    # model the data
    mod <- model(dat, NAIVE(value))

    # get the fitted values and the residuals
    fitted <- augment(mod)
    fitted
    ## A tsibble: 100 x 6 [1Y]
    ## Key:       .model [1]
    #   .model       index value .fitted .resid .innov
    #   <chr>        <dbl> <dbl>   <dbl>  <dbl>  <dbl>
    # 1 NAIVE(value)  1860     5      NA     NA     NA
    # 2 NAIVE(value)  1861     3       5     -2     -2
    # 3 NAIVE(value)  1862     0       3     -3     -3
    # 4 NAIVE(value)  1863     2       0      2      2
    # 5 NAIVE(value)  1864     0       2     -2     -2
    # 6 NAIVE(value)  1865     3       0      3      3
    # 7 NAIVE(value)  1866     2       3     -1     -1
    # 8 NAIVE(value)  1867     3       2      1      1
    # 9 NAIVE(value)  1868     6       3      3      3
    #10 NAIVE(value)  1869     1       6     -5     -5
    ## ℹ 90 more rows
    ## ℹ Use `print(n = ...)` to see more rows
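The fitted values of the naïve model are simply the series shifted forward by one year, so they can be reproduced without any modelling package. The following base R sketch assumes the built-in `discoveries` series; `fitted_naive` and `resid_naive` are illustrative names.

```r
y <- as.numeric(discoveries)

# The naive fitted value for year t is the observation from year t - 1;
# the first year has no predecessor, so its fitted value is NA
fitted_naive <- c(NA, head(y, -1))

# Residual = observed value minus fitted value
resid_naive <- y - fitted_naive

head(data.frame(
  index   = as.numeric(time(discoveries)),
  value   = y,
  .fitted = fitted_naive,
  .resid  = resid_naive
), 5)
```

The first rows should agree with the `.fitted` and `.resid` columns printed above, which also explains why the first row of the model output contains `NA`.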

Let’s plot the fitted values of the model (in red) and compare them to the real values (in black):

    autoplot(fitted, value) +                      # real values
      autolayer(fitted, .fitted, color = "red")    # fitted values

**Output:** [plot of the real values in black with the fitted values overlaid in red]

In order to determine whether this model fits the data well enough, we need to look at the residuals. Specifically, the residuals should (1) be uncorrelated, (2) be normally distributed with mean zero, and (3) have constant variance over time.

    gg_tsresiduals(mod)

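**Output:** [three residual diagnostic panels: a time plot (top), an autocorrelation plot (bottom left), and a histogram (bottom right)]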

The top plot (residuals over time) shows that the variance of the residuals changes over time. The bottom-left plot (the autocorrelation function) shows that the residuals are correlated: 2 spikes extend beyond the blue lines. And the bottom-right plot (the histogram) shows that the residuals are not normally distributed.

Therefore, we can conclude that the naïve model is a poor fit to our data.
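The visual reading of the residual plots can be complemented with a few numerical checks. Because the residuals of the naïve model are just the year-to-year changes, base R's `diff()` is enough to reproduce them. This is a rough sketch rather than a formal diagnostic procedure: it assumes the built-in `discoveries` series, reuses lag = 10 from section 3, and the split of the sample into two halves is an arbitrary illustrative choice.

```r
# Residuals of the naive model are the year-to-year changes
res <- diff(as.numeric(discoveries))

# (1) Autocorrelation: Ljung-Box test on the residuals.
# A small p-value would indicate that the residuals are still correlated.
Box.test(res, lag = 10, type = "Ljung-Box")

# (2) Mean: should be close to zero for an unbiased model
mean(res)

# (3) Variance: compare the first and second half of the sample
half <- seq_len(length(res) %/% 2)
c(first_half = var(res[half]), second_half = var(res[-half]))
```

With the objects created in this tutorial, the same Ljung-Box test should also be available through `features(fitted, .innov, ljung_box, lag = 10)`.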