Empirical Dynamic Models for Forecasting
A tutorial on forecasting with stationary and non-stationary time series
Introduction to EDMs for Forecasting Non-stationary Data
EDMs are a data-driven approach for uncovering hidden dynamic behavior in natural systems, which are often complex and dynamic (behavior referred to as “non-stationarity” or “non-linearity”). Non-linearity means that the sign and magnitude of relationships within a system change over time, so linear statistical approaches fail to properly represent such changes. Rather than assuming that the system is governed by a particular set of equations (as, for example, meteorological models do), EDMs reconstruct the dynamics of the system directly from time series data (hence “data-driven”) and provide a mechanistic understanding of the system. Under EDMs, the dynamics of a system are encoded in the temporal ordering of the time series, and the behavior of the system can be explained by relating its states using time lags (i.e., estimating the mathematical relationship of a variable at time $t$, $X(t)$, to the same variable at other times, such as $X(t+1)$ and $X(t+2)$). By relating states of a system through such lags, causal relationships between variables in the original system may be uncovered, providing a number of ecologically relevant applications, including forecasting.
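To make the lag idea concrete, here is a toy sketch using base R’s embed(), which stacks a series against its own lags (the values below are made up for illustration):
# each row is a reconstructed "state": (X(t), X(t-1), X(t-2))
X <- c(2.1, 2.3, 1.8, 1.5, 1.9, 2.4, 2.2)  # hypothetical measurements
embed(X, 3)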
To reiterate, EDMs are driven by non-linear dynamics in a system (the relationships among states at various time lags vary in sign and magnitude). Takens’ theorem, the basis of EDM, states that an original system’s dynamics can be reconstructed by exploiting the mathematical relationships between historical records (lags) of a single variable; the reconstruction maps one-to-one onto the original attractor. The classic demonstration uses the Lorenz attractor (also known as the butterfly attractor).
Tutorial on forecasting with stationary and non-stationary time series
Load libraries
library(astsa)      # applied statistical time series analysis utilities
library(rEDM)       # empirical dynamic modeling (simplex projection, S-maps)
library(tidyverse)  # data wrangling and ggplot2
library(forecast)   # Arima(), auto.arima(), forecast(), autoplot()
library(ggpubr)     # ggarrange() for multi-panel figures
Set time series parameters, where time is in hours and the temporal range is 4 days (96 hours)
set.seed(1)
time <- 1:96
Stationary time series
Simulate autocorrelated time series data with stationarity (linear data with cyclical autocorrelation) using arima.sim
ARIMA, or AutoRegressive Integrated Moving Average, models necessarily assume linearity, because they predict each value as a linear combination of past values and past errors.
stationary_y_arima <- arima.sim(n = length(time), list(ar = c(0.9, -0.8), ma = c(-0.41, 0.2)),
sd = sqrt(0.1))
df_ts <- data.frame(x = time, y = stationary_y_arima)
autoplot(stationary_y_arima) + ylab("Stationary Time Series")
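As an aside, the ARMA(2,2) process that arima.sim generates is just a linear recursion: each value is a weighted sum of past values and past innovations. Below is a minimal hand-rolled sketch of the same process (assuming Gaussian innovations and a simple burn-in; arima.sim’s internals differ):
# hand-rolled ARMA(2,2): y_t = 0.9*y_{t-1} - 0.8*y_{t-2} + e_t - 0.41*e_{t-1} + 0.2*e_{t-2}
n    <- length(time)
burn <- 50                              # discard the start-up transient
e    <- rnorm(n + burn, sd = sqrt(0.1)) # innovations
y    <- numeric(n + burn)
for (t in 3:(n + burn)) {
  y[t] <- 0.9 * y[t - 1] - 0.8 * y[t - 2] +         # AR terms
          e[t] - 0.41 * e[t - 1] + 0.2 * e[t - 2]   # MA terms
}
y_manual <- ts(tail(y, n))
autoplot(y_manual) + ylab("Hand-rolled ARMA(2,2)")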
Visualize the autocorrelation structure using the autocorrelation (acf()) and partial autocorrelation (pacf()) functions from base R’s stats package:
acf(stationary_y_arima)
pacf(stationary_y_arima)
Partition data into training and predicting subsets:
train <- 1:(length(time)/2) # indices for the first half of the time series
ARIMA models for forecasting:
Run a standard ARIMA model, with no lag dependencies.
This model is mathematically identical to an intercept-only linear model:
$$\Large \hat{y}_t = \mu + \epsilon_{t}$$
Where, the intercept is equal to the mean of the response variable:
$$\Large \mu = \frac{1}{n} \sum_{t=1}^{n} y_{t}$$
a <- Arima(stationary_y_arima[train]) # defaults to ARIMA(0,0,0): an intercept-only model
#plot the fitted values from Arima model
autoplot(fitted(a), col = "blue") + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")
Forecast the prediction window using the no-lag ARIMA model:
autoplot(forecast(a, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")
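As a quick check of the equations above, the model’s intercept should closely match the sample mean of the training data (exact equality is not guaranteed, since Arima() estimates by maximum likelihood):
# the no-lag model's "intercept" is (approximately) the training-sample mean
coef(a)["intercept"]
mean(stationary_y_arima[train])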
Autoregressive model, with one time dependency (an hourly lag term):
$$\Large \hat{y}_{t} = \mu + \phi_{1}y_{t-1} + \epsilon_{t}$$
where $\Large \phi_{1}$ is the coefficient on the first (hourly) lag.
a1 <- Arima(stationary_y_arima[train], c(1,0,0))
#plot the fitted values from Arima model
autoplot(fitted(a1), col = "blue") + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")
#plot the forecasted values from Arima model
autoplot(forecast(a1, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")
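To connect the fitted object back to the equation, we can reproduce a one-step-ahead fitted value by hand (a sketch; note that the “intercept” Arima() reports is actually the process mean $\mu$, so the recursion is written in deviations from the mean):
# reproduce a one-step-ahead fitted value from the AR(1) coefficients
phi <- coef(a1)["ar1"]
mu  <- coef(a1)["intercept"]  # the reported "intercept" is the mean
i   <- 10
mu + phi * (stationary_y_arima[train][i - 1] - mu)  # hand-computed prediction
fitted(a1)[i]                                       # should closely match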
Autoregressive model, with two hourly lags:
$$\Large \hat{y}_{t} = \mu + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \epsilon_{t}$$
a2 <- Arima(stationary_y_arima[train], c(2,0,0))
#plot the fitted values from Arima model
autoplot(fitted(a2), col = "blue") + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")
#plot the forecasted values from Arima model
autoplot(forecast(a2, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) + ylab("Stationary Time Series")
Autoregressive models, with up to 5 hourly lags:
$$\Large \hat{y}_t = \mu + \phi_{1}y_{t-1} + \dots + \phi_{5}y_{t-5} + \epsilon_{t}$$
a3 <- Arima(stationary_y_arima[train], c(3,0,0))
a4 <- Arima(stationary_y_arima[train], c(4,0,0))
a5 <- Arima(stationary_y_arima[train], c(5,0,0))
a3_gg <- autoplot(forecast(a3, h = 48)) + ggtitle("ARIMA Model Forecast: 3 hourly lags") +
  geom_path(data = df_ts, aes(x = x, y = y)) +
  geom_path(aes(x = time[train], y = fitted(a3)[train]), col = "blue") +
  ylab(" ")
a4_gg <- autoplot(forecast(a4, h = 48)) + ggtitle("ARIMA Model Forecast: 4 hourly lags") +
  geom_path(data = df_ts, aes(x = x, y = y)) +
  geom_path(aes(x = time[train], y = fitted(a4)[train]), col = "blue") +
  ylab("Stationary Time Series")
a5_gg <- autoplot(forecast(a5, h = 48)) + ggtitle("ARIMA Model Forecast: 5 hourly lags") +
  geom_path(data = df_ts, aes(x = x, y = y)) +
  geom_path(aes(x = time[train], y = fitted(a5)[train]), col = "blue") +
  ylab(" ")
ggarrange(a3_gg, a4_gg, a5_gg, ncol = 1)
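To choose among these lag structures quantitatively, one option (a sketch) is to compare information criteria on the training fits, where lower AIC is better:
# compare candidate AR(p) fits by AIC (lower is better)
data.frame(p = 1:5, AIC = sapply(list(a1, a2, a3, a4, a5), AIC))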
Now we can move into models with different cycle structures. For this, we will consider half-day lags (12-hour periods).
Autoregressive model with hourly dependencies plus a half-day seasonal dependency (writing only the additive terms; the multiplicative seasonal cross-terms are omitted for clarity):
$$\Large \hat{y}_t = \mu + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \phi_{3}y_{t-3} + \phi_{4}y_{t-4} + \Phi_{1}y_{t-12} + \epsilon_{t}$$
a41 <- Arima(stationary_y_arima[train], order = c(4,0,0),
             seasonal = list(order = c(1,0,0), period = 12))  # 12-hour seasonal AR term
autoplot(forecast(a41, h = 48)) + ggtitle("ARIMA Model Forecast: 4 hourly lags + half-day seasonal lag") +
geom_path(data = df_ts, aes(x = x, y = y)) +
geom_path(aes(x = time[train], y = fitted(a41)[train]), col = "blue") +
ylab("Stationary Time Series")
Now, we will let the algorithm choose the time lag parameters for us, using auto.arima():
aa <- auto.arima(stationary_y_arima[train])
summary(aa)
## Series: stationary_y_arima[train]
## ARIMA(3,0,0) with zero mean
##
## Coefficients:
## ar1 ar2 ar3
## 0.4728 -0.1068 -0.5655
## s.e. 0.1272 0.1513 0.1384
##
## sigma^2 estimated as 0.08692: log likelihood=-9.02
## AIC=26.04 AICc=26.97 BIC=33.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02152847 0.2854554 0.2289932 187.4472 335.9332 0.6855497
## ACF1
## Training set -0.06089878
# auto.arima chose a 3-hour lag structure (ARIMA(3,0,0)), with no half-day effects
autoplot(forecast(aa, h = 48)) + geom_path(data = df_ts, aes(x = x, y = y)) +
ylab("Stationary Time Series")
Non-stationary time series
Now we will simulate non-linear (a.k.a. non-stationary) data, where relationships change through time, using diffinv():
## non-stationary data
set.seed(44)
nonstationary_y <- diffinv(rnorm(length(time))) %>% ts()
autoplot(nonstationary_y) + ylab("Non-stationary Time Series")
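As a quick diagnostic (an aside), a random walk’s ACF decays very slowly, and first-differencing recovers a stationary series; undoing diffinv() with diff() is exactly what the “I” (integration) term in ARIMA does:
acf(nonstationary_y)  # very slow decay: the signature of a unit root
autoplot(diff(nonstationary_y)) + ylab("First-differenced Series")  # stationary again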
Let’s see what the auto.arima algorithm estimates with non-stationary data:
aa_ns <- auto.arima(nonstationary_y[train])
summary(aa_ns)
## Series: nonstationary_y[train]
## ARIMA(0,1,0)
##
## sigma^2 estimated as 1.137: log likelihood=-69.71
## AIC=141.42 AICc=141.51 BIC=143.27
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01182676 1.055224 0.7741009 0.9130602 36.00029 0.9791667
## ACF1
## Training set 0.08409507
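auto.arima selected ARIMA(0,1,0), a random walk: $\hat{y}_t = y_{t-1} + \epsilon_t$. Its point forecast is therefore flat at the last observed value, as a quick check confirms (a sketch using the objects above):
# a random-walk forecast is constant at the last training observation
tail(nonstationary_y[train], 1)
head(forecast(aa_ns, h = 48)$mean, 3)  # all equal to the value above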
Now, visualize the forecast of a linear model applied to non-stationary data!
df_ts_st <- data.frame(x = time, y = nonstationary_y[1:96])
aa_ns_gg <- autoplot(forecast(aa_ns, h = 48)) +  # keep the model object aa_ns intact
  geom_path(data = df_ts_st, aes(x = x, y = y)) +
  ylab("Non-stationary Time Series"); aa_ns_gg
Not a very good prediction… Let’s try empirical dynamic models!
Empirical Dynamic Models for forecasting:
The Lorenz model mentioned in the introduction is a system of three ordinary differential equations, now known as the Lorenz equations:
$$\frac{dx}{dt} = \sigma(y - x)$$ $$\frac{dy}{dt} = x(\rho - z) - y$$ $$\frac{dz}{dt} = xy - \beta z$$
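To make the connection to attractor reconstruction concrete, here is a minimal Euler-integration sketch of the Lorenz system ($\sigma = 10$, $\rho = 28$, $\beta = 8/3$ are the classic parameter values; the step size, lag, and two-lag projection are choices made for illustration, and a full embedding would use more lags):
# Euler-step simulation of the Lorenz system
lorenz <- function(n = 5000, dt = 0.01, sigma = 10, rho = 28, beta = 8/3) {
  x <- y <- z <- numeric(n)
  x[1] <- y[1] <- z[1] <- 1
  for (i in 2:n) {
    x[i] <- x[i - 1] + dt * sigma * (y[i - 1] - x[i - 1])
    y[i] <- y[i - 1] + dt * (x[i - 1] * (rho - z[i - 1]) - y[i - 1])
    z[i] <- z[i - 1] + dt * (x[i - 1] * y[i - 1] - beta * z[i - 1])
  }
  data.frame(x, y, z)
}
lz  <- lorenz()
tau <- 8  # reconstruction lag, in integration steps (an arbitrary choice)
# Takens-style reconstruction: plotting x(t) against x(t - tau) traces an
# object that maps one-to-one onto the original butterfly attractor
plot(lz$x[(tau + 1):nrow(lz)], lz$x[1:(nrow(lz) - tau)], type = "l",
     xlab = "x(t)", ylab = "x(t - tau)")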
We will use the simplex() function to determine how many embedding dimensions (time lags) are needed to effectively develop a data-driven, mechanistic formulation of the time series.
# set data for historical record (library) and prediction
lib <- c(1, 48)
pred <- c(49, 96)
simplex_output <- simplex(nonstationary_y, lib, pred)
str(simplex_output)
## 'data.frame': 10 obs. of 16 variables:
## $ E : int 1 2 3 4 5 6 7 8 9 10
## $ tau : num 1 1 1 1 1 1 1 1 1 1
## $ tp : num 1 1 1 1 1 1 1 1 1 1
## $ nn : num 2 3 4 5 6 7 8 9 10 11
## $ num_pred : num 47 46 45 44 43 42 41 40 39 38
## $ rho : num 0.768 0.796 0.682 0.716 0.515 ...
## $ mae : num 2.81 2.76 3.03 3.1 3.38 ...
## $ rmse : num 3.55 3.46 3.89 3.88 4.21 ...
## $ perc : num 0.979 0.978 1 1 1 ...
## $ p_val : num 7.73e-12 5.15e-13 3.37e-08 4.22e-09 1.56e-04 ...
## $ const_pred_num_pred: num 47 46 45 44 43 42 41 40 39 38
## $ const_pred_rho : num 0.954 0.954 0.947 0.944 0.939 ...
## $ const_pred_mae : num 1.008 0.988 0.989 0.951 0.966 ...
## $ const_pred_rmse : num 1.23 1.21 1.22 1.17 1.18 ...
## $ const_pred_perc : num 0.979 0.978 0.978 1 1 ...
## $ const_p_val : num 8.26e-36 6.46e-35 1.10e-31 2.88e-30 3.02e-28 ...
Let’s visualize the forecast skill (rho) as a function of embedding dimension (E):
par(mar = c(4, 4, 1, 1), mgp = c(2.5, 1, 0)) # set margins for plotting
plot(simplex_output$E, simplex_output$rho, type = "l", lwd = 5, col = "light blue", xlab = "Embedding Dimension (E)",
ylab = "Forecast Skill (rho)")
# how does forecast skill decay with time to prediction (tp), holding E = 2?
simplex_output <- simplex(nonstationary_y, lib, pred, E = 2, tp = 1:10)
plot(simplex_output$tp, simplex_output$rho, type = "l", lwd = 5, col = "light blue", xlab = "Time to Prediction (tp)",
ylab = "Forecast Skill (rho)")
Run simplex() with the chosen embedding dimension (E = 2) to create an EDM model for forecasting:
edm_output <- simplex(nonstationary_y, lib, pred, E = 2, stats_only = FALSE)
predictions <- na.omit(edm_output$model_output[[1]])
df_ts_st_pred <- data.frame(x = time[51:96], y = nonstationary_y[51:96], predictions)
plot(df_ts_st$y ~ df_ts_st$x, type = "l")  # quick base-graphics look at the full series
edm <- ggplot(data = df_ts_st_pred) + ggtitle("Forecasts from EDM") + xlab("Time") +
  geom_ribbon(aes(x = x, ymin = pred - 1.96*sqrt(pred_var), ymax = pred + 1.96*sqrt(pred_var)), fill = "blue", alpha = 0.2) +
  geom_ribbon(aes(x = x, ymin = pred - sqrt(pred_var), ymax = pred + sqrt(pred_var)), fill = "blue", alpha = 0.4) +
  geom_path(aes(x = x, y = pred), col = "blue") +  # the EDM point forecasts
  geom_path(data = df_ts_st, aes(x = x, y = y)) +  # the observed series
  ylab("Non-stationary Time Series"); edm
ggarrange(aa_ns_gg + coord_cartesian(ylim = c(-20, 8)) + ggtitle("Forecast with ARIMA") + theme_bw(),
          edm + coord_cartesian(ylim = c(-20, 8)) + ggtitle("Forecast with EDM") + theme_bw())
ggsave("forecasts.jpeg", dpi = 300)
## Saving 7 x 5 in image