Ecological forecasting in R

Exercises and associated data

The data and modelling objects created in this notebook can be downloaded directly to save computational time.

Users who wish to complete the exercises can download a small template R script. Assuming you have already downloaded the data objects above, this script will load all data objects so that the steps used to create them are not necessary to tackle the exercises.

Load libraries and time series data

This tutorial relates to content covered in Lecture 5, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:

library(dplyr)
library(mvgam) 
library(tidybayes)
library(bayesplot)
library(gratia)
library(ggplot2); theme_set(theme_classic())
library(marginaleffects)

This tutorial will focus on one of mvgam’s more advanced features: the ability to fit dynamic models in a State-Space representation. We have not gone into too much detail about how these models work (but see this outstanding resource from E. E. Holmes, M. D. Scheuerell, and E. J. Ward for many useful details, along with this nice lecture by E. E. Holmes).

Illustration of a basic State-Space model, which assumes that a latent dynamic process (X) can evolve independently from the way we take observations (Y) of that process

Briefly, these models allow us to make separate inferences about the underlying dynamic process model that we are interested in (i.e. the evolution of a time series, or of a collection of time series) and the observation model (i.e. the way that we survey / measure this underlying process). This is extremely useful in ecology because our observations are always imperfect / noisy measurements of the thing we are interested in measuring. It is also helpful because we often know that some covariates will impact our ability to measure accurately (e.g. we cannot take accurate counts of rodents if there is a thunderstorm happening) while other covariates impact the underlying process (e.g. it is highly unlikely that rodent abundance responds to one storm; it more likely responds to longer-term weather and climate variation). A State-Space model allows us to model both components in a single unified modelling framework. A major advantage of mvgam is that it can include nonlinear and random effects in both model components while also capturing dynamic processes. I am not aware of any other packages that can easily do this, but of course there may be some.
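To make the separation between the process model and the observation model concrete, here is a minimal base R sketch that simulates a latent AR1 process and then takes noisy, partly missing observations of it. The parameter values and object names (ar1, sigma_process, sigma_obs) are illustrative assumptions and are not estimates from the plankton data.

```r
# Minimal sketch of a univariate State-Space simulation (illustrative values only)
set.seed(123)
n_time <- 100
ar1 <- 0.8            # assumed autoregressive coefficient
sigma_process <- 0.5  # assumed process error sd
sigma_obs <- 0.3      # assumed observation error sd

# the latent process X evolves on its own
x <- numeric(n_time)
x[1] <- rnorm(1, mean = 0, sd = sigma_process)
for (i in 2:n_time) {
  x[i] <- ar1 * x[i - 1] + rnorm(1, mean = 0, sd = sigma_process)
}

# the observations Y are noisy measurements of X
y <- rnorm(n_time, mean = x, sd = sigma_obs)

# missing observations do not interrupt the latent process
y[sample(n_time, 10)] <- NA

plot(x, type = 'l', xlab = 'Time', ylab = 'Value')
points(y, pch = 16, col = 'darkred')
```

The black line is the latent process and the red points are what we would actually get to see. A State-Space model tries to recover the line from the points.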

Lake Washington plankton data

The data we will use to illustrate how we can fit State-Space models in mvgam are from a long-term monitoring study of plankton counts (cells per mL) taken from Lake Washington in Washington, USA. The data are available as part of the MARSS package and can be downloaded using the following:

load(url('https://github.com/atsa-es/MARSS/raw/master/data/lakeWAplankton.rda'))

We will work with five different groups of plankton:

outcomes <- c(
  'Greens', 
  'Bluegreens', 
  'Diatoms', 
  'Unicells', 
  'Other.algae'
)

As usual, preparing the data into the correct format for mvgam modelling takes a little bit of wrangling in dplyr:

# loop across each plankton group to create the long dataframe
plankton_data <- do.call(
  rbind, 
  lapply(outcomes, function(x){
    
    # create a group-specific dataframe with counts labelled 'y'
    # and the group name in the 'series' variable
    data.frame(year = lakeWAplanktonTrans[, 'Year'],
               month = lakeWAplanktonTrans[, 'Month'],
               y = lakeWAplanktonTrans[, x],
               series = x,
               temp = lakeWAplanktonTrans[, 'Temp'])})
) %>%
  
  # change the 'series' label to a factor
  dplyr::mutate(series = factor(series)) %>%
  
  # filter to only include some years in the data
  dplyr::filter(year >= 1965 & 
                  year < 1975) %>%
  dplyr::arrange(year, month) %>%
  dplyr::group_by(series) %>%
  
  # z-score the counts so they are approximately standard normal
  dplyr::mutate(y = as.vector(scale(y))) %>%
  
  # add the time indicator
  dplyr::mutate(time = dplyr::row_number()) %>%
  dplyr::ungroup()

Inspect the data structure

head(plankton_data)

## # A tibble: 6 × 6
##    year month       y series       temp  time
##   <dbl> <dbl>   <dbl> <fct>       <dbl> <int>
## 1  1965     1 -0.542  Greens      -1.23     1
## 2  1965     1 -0.344  Bluegreens  -1.23     1
## 3  1965     1 -0.0768 Diatoms     -1.23     1
## 4  1965     1 -1.52   Unicells    -1.23     1
## 5  1965     1 -0.491  Other.algae -1.23     1
## 6  1965     2 NA      Greens      -1.32     2

dplyr::glimpse(plankton_data)

## Rows: 600
## Columns: 6
## $ year   <dbl> 1965, 1965, 1965, 1965, 1965, 1965, 1965, 1965, 1965, 1965, 196…
## $ month  <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
## $ y      <dbl> -0.54241769, -0.34410776, -0.07684901, -1.52243490, -0.49055442…
## $ series <fct> Greens, Bluegreens, Diatoms, Unicells, Other.algae, Greens, Blu…
## $ temp   <dbl> -1.2306562, -1.2306562, -1.2306562, -1.2306562, -1.2306562, -1.…
## $ time   <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …

Note that we have z-scored the counts in this example, as that will make it easier to specify priors.
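As a quick illustration of what group-wise z-scoring does, the following sketch uses a small made-up data frame (toy, with hypothetical series 'a' and 'b') to confirm that each group ends up with a mean of 0 and a standard deviation of 1, and that missing values stay missing. Because every series is then on the same scale, a single prior for an error sd is plausible for all of them.

```r
# Toy example: z-score a variable within groups (values are made up)
toy <- data.frame(
  series = rep(c('a', 'b'), each = 5),
  y = c(1, 3, 5, 7, NA, 10, 20, 30, 40, 50)
)

# scale() ignores NAs when computing the mean and sd, and returns NA for them
toy$y_z <- ave(toy$y, toy$series,
               FUN = function(v) as.vector(scale(v)))

# each group should now have mean 0 and sd 1
aggregate(y_z ~ series, data = toy,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```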

plot_mvgam_series(
  data = plankton_data, 
  series = 'all'
)

As usual, check the data for NAs:

image(
  is.na(t(plankton_data)), 
  axes = FALSE,
  col = c('grey80', 'darkred')
)
axis(
  3, 
  at = seq(0, 1, length.out = NCOL(plankton_data)), 
  labels = colnames(plankton_data)
)

Manipulate data for modeling

We have some missing observations, but of course this isn’t an issue for modelling in mvgam. A useful property to understand about these counts is that they tend to be highly seasonal. Below are some plots of z-scored counts against the z-scored temperature measurements in the lake for each month:

plankton_data %>%
  dplyr::filter(series == 'Other.algae') %>%
  ggplot(aes(x = time, y = temp)) +
  geom_line(linewidth = 1.1) +
  geom_line(aes(y = y), col = 'white',
            linewidth = 1.2) +
  geom_line(aes(y = y), col = 'darkred',
            linewidth = 1.1) +
  ylab('z-score') +
  xlab('Time') +
  ggtitle('Temperature (black) vs Other algae (red)')

plankton_data %>%
  dplyr::filter(series == 'Diatoms') %>%
  ggplot(aes(x = time, y = temp)) +
  geom_line(linewidth = 1.1) +
  geom_line(aes(y = y), col = 'white',
            linewidth = 1.2) +
  geom_line(aes(y = y), col = 'darkred',
            linewidth = 1.1) +
  ylab('z-score') +
  xlab('Time') +
  ggtitle('Temperature (black) vs Diatoms (red)')

plankton_data %>%
  dplyr::filter(series == 'Greens') %>%
  ggplot(aes(x = time, y = temp)) +
  geom_line(linewidth = 1.1) +
  geom_line(aes(y = y), col = 'white',
            linewidth = 1.2) +
  geom_line(aes(y = y), col = 'darkred',
            linewidth = 1.1) +
  ylab('z-score') +
  xlab('Time') +
  ggtitle('Temperature (black) vs Greens (red)')

We will have to try and capture this seasonality in our process model, which should be easy to do given the flexibility of GAMs. Next we will split the data into training and testing splits:

plankton_train <- plankton_data %>%
  dplyr::filter(time <= 112)
plankton_test <- plankton_data %>%
  dplyr::filter(time > 112)

Exercises

Calculate the number of timepoints in the training data that have non-missing observations for all five time series.

Capturing seasonality

Now it is time to fit some models. This requires a bit of thinking about how we can best tackle the seasonal variation and the likely dependence structure in the data. These algae are interacting as part of a complex system within the same lake, so we certainly expect there to be some lagged cross-dependencies underlying their dynamics. But if we do not model the seasonal variation with covariates, our multivariate dynamic model will be forced to try and capture it, which could lead to poor convergence and unstable results (we could feasibly capture cyclic dynamics with a more complex multi-species Lotka-Volterra model, but ordinary differential equation approaches are beyond the scope of this workshop).

First we will fit a model that does not include a dynamic component, just to see if it can reproduce the seasonal variation in the observations. This model capitalizes on hierarchical multidimensional smooths. It includes a “global” tensor product of the month and temp variables, capturing our expectation that algal seasonality responds to temperature variation. But this response should depend on when in the year these temperatures are recorded (i.e. a response to warm temperatures in Spring should be different to a response to warm temperatures in Autumn). The model also fits series-specific deviation smooths (i.e. one tensor product per series) to describe how each algal group’s seasonality differs from the overall “global” seasonality. Note that we do not include series-specific intercepts in this model because each series was z-scored to have a mean of 0.

notrend_mod <- mvgam(
  y ~ 
    # tensor of temp and month to capture
    # "global" seasonality
    te(temp, month, k = c(4, 4)) +
    
    # series-specific deviation tensor products
    te(temp, month, k = c(4, 4), by = series),
  family = gaussian(),
  data = plankton_train,
  newdata = plankton_test,
  trend_model = 'None'
)
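The order of the smooth terms matters when we select them for plotting: the “global” tensor product comes first, followed by one deviation smooth per level of series. The sketch below shows the same structure in a plain mgcv model fitted to simulated data. The data frame sim, the three series levels and the model toy_mod are hypothetical and only serve to show how the smooths are ordered and labelled.

```r
library(mgcv)

# simulated data with three hypothetical series (not the plankton data)
set.seed(1)
n <- 300
sim <- data.frame(
  temp = rnorm(n),
  month = sample(1:12, n, replace = TRUE),
  series = factor(sample(c('a', 'b', 'c'), n, replace = TRUE))
)
sim$y <- sin(sim$month / 12 * 2 * pi) + 0.3 * sim$temp + rnorm(n, sd = 0.5)

# same formula structure as above: a global tensor product plus
# one deviation tensor product per series
toy_mod <- gam(
  y ~ te(temp, month, k = c(4, 4)) +
    te(temp, month, k = c(4, 4), by = series),
  data = sim,
  method = 'REML'
)

# one label per smooth, in the order used when selecting smooths to plot
sapply(toy_mod$smooth, function(s) s$label)
```

With five plankton groups the equivalent model has six smooths, which is why indices 1 to 6 are available when plotting.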

The “global” tensor product smooth function can be quickly visualized using gratia:

gratia::draw(
  notrend_mod$mgcv_model, 
  select = 1
)

We can then plot the deviation smooths for each algal group to see how they vary from the “global” pattern:

gratia::draw(
  notrend_mod$mgcv_model, 
  select = 2
)

gratia::draw(
  notrend_mod$mgcv_model, 
  select = 3
)

gratia::draw(
  notrend_mod$mgcv_model, 
  select = 4
)

gratia::draw(
  notrend_mod$mgcv_model, 
  select = 5
)

gratia::draw(
  notrend_mod$mgcv_model, 
  select = 6
)

These multidimensional smooths have done a good job of capturing the seasonal variation in our observations:

fcs <- forecast(notrend_mod)

plot(fcs, series = 1)

## Out of sample CRPS:
## 6.81377214277226

plot(fcs, series = 2)

## Out of sample CRPS:
## 6.80210576695201

plot(fcs, series = 3)

## Out of sample CRPS:
## 4.10537245887331

plot(fcs, series = 4)

## Out of sample CRPS:
## 3.57216173181198

plot(fcs, series = 5)

## Out of sample CRPS:
## 2.86161755725245
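The Continuous Ranked Probability Score (CRPS) printed with each plot rewards forecast distributions that are both well centred and narrow, with lower values being better. As a rough guide to how such a score behaves, here is a minimal sample-based sketch for a single observation. The function crps_sample() is a hypothetical helper written for this illustration. It is not part of mvgam, and mvgam may compute and aggregate the score differently.

```r
# Sample-based CRPS for a single observation (lower is better):
# CRPS = E|X - y| - 0.5 * E|X - X'|
# x: draws from the forecast distribution; y: the observed value
crps_sample <- function(x, y) {
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, '-')))
}

set.seed(42)
truth <- 0.2
sharp_fc  <- rnorm(1000, mean = 0.2, sd = 0.3)  # well centred, narrow
vague_fc  <- rnorm(1000, mean = 0.2, sd = 2)    # well centred, wide
biased_fc <- rnorm(1000, mean = 2, sd = 0.3)    # narrow but off target

c(sharp  = crps_sample(sharp_fc, truth),
  vague  = crps_sample(vague_fc, truth),
  biased = crps_sample(biased_fc, truth))
```

The sharp and well-centred forecast should receive the lowest score, while both the overly wide and the biased forecasts are penalized.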

Multiseries dynamics

The basic model gives us confidence that we can capture the seasonal variation in the observations. But the model has not captured the remaining temporal dynamics, which is obvious when we inspect Dunn-Smyth residuals for each series using pp_check():

pp_check(
  notrend_mod,
  type = 'resid_ribbon_grouped',
  x = 'time',
  group = 'series',
  ndraws = 200
)

Now it is time to get into multivariate State-Space models. Models of this kind can incorporate lagged cross-dependencies in the latent process models through a Vector Autoregressive (VAR) component for the process means, and so can model complex community dynamics. They come in two variants: one assumes that the process errors operate independently from one another, while the other allows contemporaneous correlations in the process errors. Both variants can be described mathematically as follows:

\[\begin{align*} \boldsymbol{count}_t & \sim \text{Normal}(\mu_{obs[t]}, \sigma_{obs}) \\ \mu_{obs[t]} & = process_t \\ process_t & \sim \text{MVNormal}(\mu_{process[t]}, \Sigma_{process}) \\ \mu_{process[t]} & = VAR * process_{t-1} + f_{global}(\boldsymbol{month},\boldsymbol{temp})_t + f_{series}(\boldsymbol{month},\boldsymbol{temp})_t \\ f_{global}(\boldsymbol{month},\boldsymbol{temp}) & = \sum_{k=1}^{K}b_{global} * \beta_{global} \\ f_{series}(\boldsymbol{month},\boldsymbol{temp}) & = \sum_{k=1}^{K}b_{series} * \beta_{series} \end{align*}\]

Here you can see that there are no terms in the observation model apart from the underlying process model. But we could easily add covariates into the observation model if we felt that they could explain some of the systematic observation errors. We also assume independent observation processes (there is no covariance structure in the observation errors \(\sigma_{obs}\)). At present, mvgam does not support multivariate observation models, but this feature will be added in future versions.

The underlying process model, however, is multivariate, and there is a lot going on here. This component has a Vector Autoregressive part, where the process mean at time \(t\) \((\mu_{process[t]})\) is a vector that evolves as a function of where the vector-valued process model was at time \(t-1\). The \(VAR\) matrix captures these dynamics with self-dependencies on the diagonal and possibly asymmetric cross-dependencies on the off-diagonals. The process mean also incorporates the nonlinear smooth functions that capture seasonality for each series.

The contemporaneous process errors are modeled by \(\Sigma_{process}\), which can be constrained so that process errors are independent (i.e. setting the off-diagonals to 0) or can be fully parameterized using a Cholesky decomposition (using Stan’s \(LKJcorr\) distribution to place a prior on the strength of inter-species correlations). For those who are interested in the inner workings, mvgam makes use of a recent breakthrough by Sarah Heaps to enforce stationarity of Bayesian VAR processes. This is advantageous because we often do not expect forecast variance to increase without bound into the future, yet many estimated VARs behave this way.
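The process equation can be sketched in a few lines of base R. The example below simulates a toy bivariate VAR(1) with correlated process errors and checks the usual stationarity condition (all eigenvalues of the coefficient matrix inside the unit circle). The matrices A and Sigma hold assumed values chosen for illustration. The smooth seasonal terms are left out to keep the sketch short.

```r
# Toy bivariate VAR(1): process_t = A %*% process_{t-1} + error_t
set.seed(2024)
A <- matrix(c(0.7, 0.2,
              -0.1, 0.5), nrow = 2, byrow = TRUE)     # assumed VAR coefficients
Sigma <- matrix(c(0.3, 0.1,
                  0.1, 0.2), nrow = 2, byrow = TRUE)  # assumed error covariance

# a VAR(1) is stationary if all eigenvalues of A lie inside the unit circle
Mod(eigen(A)$values)

# draw correlated errors using the Cholesky factor of Sigma
L <- t(chol(Sigma))
n_time <- 200
process <- matrix(0, nrow = n_time, ncol = 2)
for (i in 2:n_time) {
  error <- as.vector(L %*% rnorm(2))
  process[i, ] <- as.vector(A %*% process[i - 1, ]) + error
}

matplot(process, type = 'l', lty = 1, xlab = 'Time', ylab = 'Process')
```

Setting the off-diagonals of Sigma to 0 gives the variant with independent process errors.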

Ok that was a lot to take in. Let’s fit some models to try and inspect what is going on and what they assume. But first, we need to update mvgam’s default priors for the observation and process errors. By default, mvgam uses fairly wide priors on these parameters to avoid being overly informative. But our observations are z-scored, so we do not expect very large process or observation errors. We also do not expect very small observation errors, as we know these measurements are not perfect. So let’s update the priors for these parameters. In doing so, you will get to see how the formula for the latent process (i.e. trend) model is used in mvgam:

priors <- get_mvgam_priors(
  # observation formula, which has no terms in it
  y ~ -1,
  
  # process model formula, which includes the smooth functions
  trend_formula = ~ te(temp, month, k = c(4, 4)) +
    te(temp, month, k = c(4, 4), by = trend),
  
  # VAR1 model with correlated process errors
  trend_model = VAR(cor = TRUE),
  family = gaussian(),
  data = plankton_train
)

Get names of all parameters whose priors can be modified:

priors[, 3]

##  [1] "(Intercept)"                                                                                                                                                                                                                                                           
##  [2] "process error sd"                                                                                                                                                                                                                                                      
##  [3] "diagonal autocorrelation population mean"                                                                                                                                                                                                                              
##  [4] "off-diagonal autocorrelation population mean"                                                                                                                                                                                                                          
##  [5] "diagonal autocorrelation population variance"                                                                                                                                                                                                                          
##  [6] "off-diagonal autocorrelation population variance"                                                                                                                                                                                                                      
##  [7] "shape1 for diagonal autocorrelation precision"                                                                                                                                                                                                                         
##  [8] "shape1 for off-diagonal autocorrelation precision"                                                                                                                                                                                                                     
##  [9] "shape2 for diagonal autocorrelation precision"                                                                                                                                                                                                                         
## [10] "shape2 for off-diagonal autocorrelation precision"                                                                                                                                                                                                                     
## [11] "observation error sd"                                                                                                                                                                                                                                                  
## [12] "(Intercept) for the trend"                                                                                                                                                                                                                                             
## [13] "te(temp,month) smooth parameters, te(temp,month):trendtrend1 smooth parameters, te(temp,month):trendtrend2 smooth parameters, te(temp,month):trendtrend3 smooth parameters, te(temp,month):trendtrend4 smooth parameters, te(temp,month):trendtrend5 smooth parameters"

And their default prior distributions:

priors[, 4]

##  [1] "(Intercept) ~ student_t(3, -0.1, 2.5);" 
##  [2] "sigma ~ inv_gamma(1.418, 0.452);"       
##  [3] "es[1] = 0;"                             
##  [4] "es[2] = 0;"                             
##  [5] "fs[1] = sqrt(0.455);"                   
##  [6] "fs[2] = sqrt(0.455);"                   
##  [7] "gs[1] = 1.365;"                         
##  [8] "gs[2] = 1.365;"                         
##  [9] "hs[1] = 0.071175;"                      
## [10] "hs[2] = 0.071175;"                      
## [11] "sigma_obs ~ inv_gamma(1.418, 0.452);"   
## [12] "(Intercept)_trend ~ student_t(3, 0, 2);"
## [13] "lambda_trend ~ normal(5, 30);"

Setting priors is easy in mvgam as you can use brms routines. Here we use a more informative Beta prior for the process error sd (class sigma), bounded between 0 and 1:

priors <- prior(
  beta(5, 5),
  class = sigma,
  lb = 0,
  ub = 1
)
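Before fitting, it can help to check what a prior implies. The short sketch below uses only base R to look at the quantiles and density of a Beta(5, 5) distribution, which is symmetric around 0.5 and places little mass near 0 or 1.

```r
# implied 2.5%, 50% and 97.5% quantiles of a Beta(5, 5) prior
round(qbeta(c(0.025, 0.5, 0.975), shape1 = 5, shape2 = 5), 2)

# visualize the prior density over its support
curve(dbeta(x, shape1 = 5, shape2 = 5), from = 0, to = 1,
      xlab = 'Process error sd', ylab = 'Prior density')
```

For z-scored series this expresses the belief that process errors are neither negligible nor as large as the total variation in the data.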

You may have noticed something else unique about this model: there is no intercept term in the observation formula. This is because a shared intercept parameter can sometimes be unidentifiable with respect to the latent VAR process, particularly if our series have similar long-run averages (which they do in this case because they were z-scored). We will often get better convergence in these State-Space models if we drop this parameter. mvgam accomplishes this by fixing the coefficient for the intercept to zero. We also assume that all series share the same observation variance, which makes sense given that we z-scored the series. Now we can fit the VAR model, which assumes that process errors may be contemporaneously correlated:

var_mod <- mvgam(  
  # observation formula, which is empty
  y ~ -1,
  
  # process model formula, which includes the smooth functions
  trend_formula = ~ te(temp, month, k = c(4, 4)) +
    te(temp, month, k = c(4, 4), by = trend),
  
  # VAR1 model with correlated process errors
  trend_model = VAR(cor = TRUE),
  family = gaussian(),
  
  # share a single observation error sd across all series
  share_obs_params = TRUE,
  data = plankton_train,
  newdata = plankton_test,
  
  # include the updated priors
  priors = priors
)

Inspecting SS models

This model’s summary is a bit different to other mvgam summaries. It separates parameters based on whether they belong to the observation model or to the latent process model. This is because we may often have covariates that impact the observations but not the latent process, so we can have fairly complex models for each component. You will notice that some parameters have not fully converged: the observation error (sigma_obs) has an Rhat above 1.05, and some of the VAR coefficients (called A in the output) and process error variances (the diagonal of Sigma) have relatively low effective sample sizes. Note that we set include_betas = FALSE to stop the summary from printing output for all of the spline coefficients, which can be dense and hard to interpret:

summary(
  var_mod, 
  include_betas = FALSE
)

## GAM observation formula:
## y ~ 1
## 
## GAM process formula:
## ~te(temp, month, k = c(4, 4)) + te(temp, month, k = c(4, 4), 
##     by = trend)
## 
## Family:
## gaussian
## 
## Link function:
## identity
## 
## Trend model:
## VAR(cor = TRUE)
## 
## 
## N process models:
## 5 
## 
## N series:
## 5 
## 
## N timepoints:
## 120 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 1500; warmup = 1000; thin = 1 
## Total post-warmup draws = 2000
## 
## 
## Observation error parameter estimates:
##           2.5%  50% 97.5% Rhat n_eff
## sigma_obs 0.16 0.26  0.34 1.07    51
## 
## GAM observation model coefficient (beta) estimates:
##             2.5% 50% 97.5% Rhat n_eff
## (Intercept)    0   0     0  NaN   NaN
## 
## Process model VAR parameter estimates:
##          2.5%     50% 97.5% Rhat n_eff
## A[1,1]  0.720  0.8600 0.990 1.02   152
## A[1,2] -0.180 -0.0220 0.130 1.01   711
## A[1,3] -0.063  0.0320 0.140 1.00  1423
## A[1,4] -0.110  0.0053 0.110 1.01   740
## A[1,5] -0.150 -0.0054 0.140 1.01   349
## A[2,1] -0.330 -0.1100 0.052 1.00   925
## A[2,2]  0.190  0.4500 0.710 1.00  1043
## A[2,3] -0.150 -0.0180 0.110 1.00  1274
## A[2,4] -0.071  0.0950 0.270 1.00  1443
## A[2,5] -0.081  0.1100 0.340 1.00  1062
## A[3,1] -0.280 -0.0140 0.220 1.00  1422
## A[3,2] -0.340  0.0073 0.300 1.00  1386
## A[3,3] -0.042  0.2200 0.450 1.00  1164
## A[3,4] -0.042  0.2100 0.500 1.00  1390
## A[3,5] -0.270  0.0160 0.300 1.00  1595
## A[4,1] -0.210 -0.0220 0.130 1.00   950
## A[4,2] -0.360 -0.0880 0.130 1.00   895
## A[4,3] -0.120  0.0290 0.170 1.00  1437
## A[4,4]  0.490  0.6800 0.880 1.00  1052
## A[4,5] -0.130  0.0580 0.280 1.00  1269
## A[5,1] -0.032  0.1100 0.300 1.00  1112
## A[5,2] -0.230 -0.0220 0.180 1.00   987
## A[5,3] -0.034  0.0890 0.220 1.00  1550
## A[5,4] -0.200 -0.0310 0.120 1.00  1462
## A[5,5]  0.360  0.6000 0.820 1.01   406
## 
## Process error parameter estimates:
##               2.5%    50% 97.5% Rhat n_eff
## Sigma[1,1]  0.0760  0.130 0.210 1.02   143
## Sigma[1,2] -0.0840 -0.029 0.020 1.00  1087
## Sigma[1,3] -0.0980 -0.026 0.046 1.00  1161
## Sigma[1,4] -0.0130  0.035 0.093 1.00   796
## Sigma[1,5]  0.0190  0.069 0.130 1.01  1032
## Sigma[2,1] -0.0840 -0.029 0.020 1.00  1087
## Sigma[2,2]  0.1700  0.270 0.380 1.02   211
## Sigma[2,3] -0.0530  0.049 0.150 1.00  1090
## Sigma[2,4]  0.0430  0.110 0.180 1.00  1106
## Sigma[2,5] -0.0560  0.014 0.076 1.00  1306
## Sigma[3,1] -0.0980 -0.026 0.046 1.00  1161
## Sigma[3,2] -0.0530  0.049 0.150 1.00  1090
## Sigma[3,3]  0.4900  0.650 0.810 1.01   707
## Sigma[3,4] -0.0340  0.067 0.170 1.00  1554
## Sigma[3,5] -0.1600 -0.055 0.042 1.00  1133
## Sigma[4,1] -0.0130  0.035 0.093 1.00   796
## Sigma[4,2]  0.0430  0.110 0.180 1.00  1106
## Sigma[4,3] -0.0340  0.067 0.170 1.00  1554
## Sigma[4,4]  0.2100  0.310 0.440 1.02   207
## Sigma[4,5]  0.0021  0.066 0.140 1.00  1379
## Sigma[5,1]  0.0190  0.069 0.130 1.01  1032
## Sigma[5,2] -0.0560  0.014 0.076 1.00  1306
## Sigma[5,3] -0.1600 -0.055 0.042 1.00  1133
## Sigma[5,4]  0.0021  0.066 0.140 1.00  1379
## Sigma[5,5]  0.1600  0.260 0.390 1.02   188
## 
## GAM process model coefficient (beta) estimates:
##                   2.5%    50% 97.5% Rhat n_eff
## (Intercept)_trend -0.2 -0.052  0.11    1  1107
## 
## Approximate significance of GAM process smooths:
##                              edf Ref.df Chi.sq p-value
## te(temp,month)              3.60     15  28.20    0.47
## te(temp,month):seriestrend1 2.35     15   3.84    0.99
## te(temp,month):seriestrend2 3.86     15  41.32    0.50
## te(temp,month):seriestrend3 3.97     15   0.53    1.00
## te(temp,month):seriestrend4 1.29     15   9.96    0.99
## te(temp,month):seriestrend5 3.72     15  12.75    0.45
## 
## Stan MCMC diagnostics:
## ✔ No issues with effective samples per iteration
## ✖ Rhats above 1.05 found for some parameters
##     Use pairs() and mcmc_plot() to investigate
## ✔ No issues with divergences
## ✔ No issues with maximum tree depth
## 
## Samples were drawn using sampling(hmc). For each parameter, n_eff is a
##   crude measure of effective sample size, and Rhat is the potential scale
##   reduction factor on split MCMC chains (at convergence, Rhat = 1)
## 
## Use how_to_cite() to get started describing this model

We can again plot the smooth functions, which this time operate on the process model. The coefficients for this model are now accessible through the trend_mgcv_model slot in the model object:

gratia::draw(
  var_mod$trend_mgcv_model, 
  select = 1
)

The VAR matrix is of particular interest here, as it captures lagged dependencies and cross-dependencies in the latent process model:

mcmc_plot(
  var_mod, 
  variable = 'A', 
  regex = TRUE, 
  type = 'hist'
)

Unfortunately bayesplot doesn’t know this is a matrix of parameters so what we see is actually the transpose of the VAR matrix. A little bit of wrangling gives us these histograms in the correct order:

A_pars <- matrix(NA, nrow = 5, ncol = 5)
for(i in 1:5){
  for(j in 1:5){
    A_pars[i, j] <- paste0('A[', i, ',', j, ']')
  }
}
mcmc_plot(
  var_mod, 
  variable = as.vector(t(A_pars)), 
  type = 'hist'
)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is a lot happening in this matrix. Each cell captures the lagged effect of the process in the column on the process in the row in the next timestep. So for example, the effect in cell [5,1], which is mostly positive (posterior median of 0.11, although the 95% interval includes zero), means that an increase in the process for series 1 (Bluegreens) at time \(t\) is expected to lead to a subsequent increase in the process for series 5 (Unicells) at time \(t+1\). The latent process model is now capturing these effects and the smooth seasonal effects, so the trend plot shows our best estimate of what the true count should have been at each time point:
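The row and column convention can be checked with a little matrix algebra. The sketch below copies the posterior medians of A from the summary output into a matrix (A_med, a name chosen for this illustration) and propagates a unit increase in one process forward by one timestep. Using medians alone ignores posterior uncertainty, so this is only a way to read the matrix and not a substitute for the full posterior.

```r
# posterior medians of A, copied from the summary output
A_med <- matrix(c(
   0.86,  -0.022,  0.032,  0.0053, -0.0054,
  -0.11,   0.45,  -0.018,  0.095,   0.11,
  -0.014,  0.0073, 0.22,   0.21,    0.016,
  -0.022, -0.088,  0.029,  0.68,    0.058,
   0.11,  -0.022,  0.089, -0.031,   0.60
), nrow = 5, byrow = TRUE)

# series are ordered by their factor levels
series_names <- c('Bluegreens', 'Diatoms', 'Greens', 'Other.algae', 'Unicells')
dimnames(A_med) <- list(to = series_names, from = series_names)

# a unit increase in the Bluegreens process at time t ...
x_t <- c(1, 0, 0, 0, 0)

# ... and its expected effect on every process at time t + 1,
# which is simply the first column of the matrix
drop(A_med %*% x_t)
```

The last element of the result is the value in cell [5,1], i.e. the expected change in the Unicells process that follows a unit increase in the Bluegreens process.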

plot(
  var_mod, 
  type = 'trend', 
  series = 1
)

plot(
  var_mod, 
  type = 'trend', 
  series = 3
)

The process error \((\Sigma)\) captures unmodelled variation in the process models, together with any possible correlations among these errors:

Sigma_pars <- matrix(NA, nrow = 5, ncol = 5)
for(i in 1:5){
  for(j in 1:5){
    Sigma_pars[i, j] <- paste0('Sigma[', i, ',', j, ']')
  }
}
mcmc_plot(
  var_mod, 
  variable = as.vector(t(Sigma_pars)), 
  type = 'hist'
)

The observation error estimate \((\sigma_{obs})\) represents how much the model thinks we might miss the true count when we take our imperfect measurements:

mcmc_plot(
  var_mod, 
  variable = 'sigma_obs', 
  regex = TRUE, 
  type = 'hist'
)

Impulse response functions

Because Vector Autoregressions can capture complex lagged dependencies, it is often difficult to understand how the member time series are thought to interact with one another. A method that is commonly used to directly test for possible interactions is to compute an Impulse Response Function (IRF). If \(h\) represents the simulated forecast horizon, an IRF asks how each of the remaining series might respond over times \((t+1):h\) if a focal series is given an innovation “shock” at time \(t = 0\). mvgam can compute Generalized and Orthogonalized IRFs from models that include latent VAR dynamics. We simply feed the fitted model to the irf() function and then use the S3 plot() function to view the estimated responses. By default, irf() will compute IRFs by separately imposing positive shocks of one standard deviation to each series in the VAR process. Here we compute Generalized IRFs over a horizon of 12 timesteps:

irfs <- irf(
  var_mod, 
  h = 12
)
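To build some intuition for what an IRF measures, the following sketch traces a unit shock through a toy bivariate VAR(1) by repeatedly multiplying by the coefficient matrix. This is a deliberately simplified version: it uses assumed coefficients, ignores the correlations among process errors and ignores posterior uncertainty, all of which irf() accounts for. It should not be read as a description of how irf() is implemented.

```r
# simplified impulse responses for a toy VAR(1): response_h = A^h %*% shock
A <- matrix(c(0.7, 0.2,
              -0.1, 0.5), nrow = 2, byrow = TRUE)  # assumed VAR coefficients
shock <- c(1, 0)  # unit shock to series 1 at horizon 0
h <- 12

responses <- matrix(NA, nrow = h + 1, ncol = 2,
                    dimnames = list(horizon = as.character(0:h),
                                    series = c('series1', 'series2')))
current <- shock
for (i in 0:h) {
  responses[i + 1, ] <- current
  current <- as.vector(A %*% current)
}

round(responses, 3)
```

Because this toy process is stationary, the responses of both series decay towards zero as the horizon increases.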

A summary of the IRFs can be computed using the summary() function:

summary(irfs)

## # A tibble: 300 × 5
##    shock                horizon irf_median irf_Qlower irf_Qupper
##    <chr>                  <int>      <dbl>      <dbl>      <dbl>
##  1 Process1 -> Process1       1     0.364      0.276       0.456
##  2 Process1 -> Process1       2     0.303      0.233       0.379
##  3 Process1 -> Process1       3     0.255      0.196       0.325
##  4 Process1 -> Process1       4     0.217      0.161       0.284
##  5 Process1 -> Process1       5     0.184      0.128       0.252
##  6 Process1 -> Process1       6     0.157      0.0992      0.225
##  7 Process1 -> Process1       7     0.134      0.0758      0.205
##  8 Process1 -> Process1       8     0.115      0.0582      0.187
##  9 Process1 -> Process1       9     0.0981     0.0438      0.172
## 10 Process1 -> Process1      10     0.0839     0.0333      0.157
## # ℹ 290 more rows

But it is easier to understand these responses using plots. For example, we can plot the expected responses of the remaining series to a positive shock for series 3 (Greens) using the plot() function:

plot(irfs, series = 3)

This series of plots makes it clear that some of the other series would be expected to show both instantaneous responses to a shock for the Greens (due to their correlated process errors) as well as delayed and nonlinear responses over time (due to the complex lagged dependence structure captured by the \(A\) matrix). This hopefully makes it clear why IRFs are an important tool in the analysis of multivariate autoregressive models. You can also use these IRFs to calculate a relative contribution from each shock to the forecast error variance for a focal series. This method, known as a Forecast Error Variance Decomposition (FEVD), is useful to get an idea about the amount of information that each series contributes to the evolution of all other series in a Vector Autoregression:

fevds <- fevd(
  var_mod, 
  h = 12
)
plot(fevds)

The plot above shows the median contribution to forecast error variance for each series.
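The logic of a FEVD can also be sketched by hand. The example below computes an orthogonalized decomposition for a toy bivariate VAR(1) using the Cholesky factor of the error covariance. The matrices A and Sigma are assumed values, the decomposition depends on the ordering of the series, and fevd() may differ in its details, so treat this as an illustration of the idea only.

```r
# orthogonalized FEVD for a toy VAR(1) (all values are assumed)
A <- matrix(c(0.7, 0.2,
              -0.1, 0.5), nrow = 2, byrow = TRUE)
Sigma <- matrix(c(0.3, 0.1,
                  0.1, 0.2), nrow = 2, byrow = TRUE)
P <- t(chol(Sigma))  # lower-triangular factor, Sigma = P %*% t(P)
h <- 12

# accumulate squared orthogonalized responses over horizons 0, ..., h - 1
Psi <- diag(2)
contrib <- matrix(0, nrow = 2, ncol = 2)
for (i in seq_len(h)) {
  contrib <- contrib + (Psi %*% P)^2
  Psi <- A %*% Psi
}

# rows: responding series; columns: source of the shock
fevd_toy <- contrib / rowSums(contrib)
round(fevd_toy, 3)
```

Each row sums to 1, so the entries can be read as the proportion of a series’ forecast error variance that is attributed to shocks in each series.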

Dynamic factor trends

Let’s see how a dynamic factor model compares. Recall from the lecture that dynamic factor models can induce correlations among the time series using a reduced rank “factor” model. We effectively estimate fewer dynamic factors than we have series, but let each series depend on these factors to form its series-specific trend. Priors for this model are simpler because the variances of the factors are fixed, so we don’t need a prior for those. We can fit the dynamic factor model, using three AR1 factors for the five series. Note that I am using fewer posterior samples for this model so it is easier to work with for completing the exercises:

df_mod <- mvgam(  
  # observation formula
  y ~ te(temp, month, k = c(4, 4)) +
    te(temp, month, k = c(4, 4), by = series),
  
  # three AR1 factors
  trend_model = AR(p = 1),
  use_lv = TRUE,
  n_lv = 3,
  family = gaussian(),
  share_obs_params = TRUE,
  data = plankton_train,
  newdata = plankton_test,
  
  # use reduced samples for inclusion in tutorial data
  samples = 100
)
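To make the reduced-rank idea a bit more concrete, the following base R sketch simulates three AR1 factors and combines them through a loading matrix to form five series-specific trends. Everything here (the AR1 coefficients, the random loadings, the object names) is an assumption chosen for illustration; it is not how mvgam parameterises or estimates the model, only a picture of the structure being estimated.

```r
set.seed(2024)

n_time <- 120    # number of timepoints
n_series <- 5    # number of observed series
n_lv <- 3        # number of latent dynamic factors

# Illustrative AR1 coefficients, one per factor
ar1 <- c(0.9, 0.7, 0.3)

# Simulate each factor as an AR1 process with unit innovation variance,
# mirroring the idea that the factor variances are fixed rather than estimated
factors <- sapply(ar1, function(phi) {
  as.numeric(arima.sim(model = list(ar = phi), n = n_time, sd = 1))
})

# Hypothetical loadings: row i says how strongly series i depends on each factor
Lambda <- matrix(rnorm(n_series * n_lv), nrow = n_series, ncol = n_lv)

# Each series' trend is a weighted sum of the shared factors
trends <- factors %*% t(Lambda)

# Five trends, but they only span as many dimensions as there are factors
dim(trends)
qr(trends)$rank
```

Because all five trends are built from the same three factors, any correlations among them come entirely from the loadings.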

The summary now contains less information as we don’t get details about the factor variances:

summary(
  df_mod, 
  include_betas = FALSE
)

## GAM formula:
## y ~ te(temp, month, k = c(4, 4)) + te(temp, month, k = c(4, 4), 
##     by = series)
## 
## Family:
## gaussian
## 
## Link function:
## identity
## 
## Trend model:
## AR(p = 1)
## 
## 
## N latent factors:
## 3 
## 
## N series:
## 5 
## 
## N timepoints:
## 120 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 1100; warmup = 1000; thin = 1 
## Total post-warmup draws = 400
## 
## 
## Observation error parameter estimates:
##           2.5%  50% 97.5% Rhat n_eff
## sigma_obs 0.44 0.48  0.52 1.02   160
## 
## GAM coefficient (beta) estimates:
##              2.5%    50% 97.5% Rhat n_eff
## (Intercept) -0.17 -0.041 0.046 1.04    79
## 
## Approximate significance of GAM smooths:
##                                   edf Ref.df Chi.sq p-value  
## te(temp,month)                   5.15     15  20.15   0.072 .
## te(temp,month):seriesBluegreens  2.67     15   3.57   0.993  
## te(temp,month):seriesDiatoms     4.48     15  45.23   0.518  
## te(temp,month):seriesGreens      1.41     15   5.87   1.000  
## te(temp,month):seriesOther.algae 2.81     15  10.43   0.916  
## te(temp,month):seriesUnicells    1.49     15  32.96   0.638  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Latent trend AR parameter estimates:
##          2.5%  50% 97.5% Rhat n_eff
## ar1[1] 0.9000 0.96  1.00 1.03   156
## ar1[2] 0.4800 0.69  0.85 1.02   169
## ar1[3] 0.0015 0.30  0.60 1.04    95
## 
## Stan MCMC diagnostics:
## ✔ No issues with effective samples per iteration
## ✖ Rhats above 1.05 found for some parameters
##     Use pairs() and mcmc_plot() to investigate
## ✔ No issues with divergences
## ✔ No issues with maximum tree depth
## 
## Samples were drawn using sampling(hmc). For each parameter, n_eff is a
##   crude measure of effective sample size, and Rhat is the potential scale
##   reduction factor on split MCMC chains (at convergence, Rhat = 1)
## 
## Use how_to_cite() to get started describing this model

Plot estimates for the three factors, which show that all three have captured important temporal dynamics:

plot(
  df_mod, 
  type = 'factors'
)

## # A tibble: 3 × 2
##   Factor   Contribution
##   <chr>           <dbl>
## 1 Factor 1        0.169
## 2 Factor 2        0.302
## 3 Factor 3        0.529

The loading matrix, which determines how each series depends on the set of factors, can be used to calculate correlations among the series.

mean_corrs <- lv_correlations(
  object = df_mod
)$mean_correlations
mean_corrs

##               Bluegreens    Diatoms       Greens Other.algae   Unicells
## Bluegreens   1.000000000 -0.5966785 -0.002814428   0.1841590  0.9246815
## Diatoms     -0.596678520  1.0000000  0.225148625   0.6420348 -0.3309501
## Greens      -0.002814428  0.2251486  1.000000000   0.3994746 -0.0996246
## Other.algae  0.184159039  0.6420348  0.399474626   1.0000000  0.4207321
## Unicells     0.924681469 -0.3309501 -0.099624596   0.4207321  1.0000000

Here we can see some very strong patterns: for example, a strong positive correlation between Unicells and Bluegreens, and a strong negative correlation between Diatoms and Bluegreens. Plotting these as a heatmap, where blue colours show negative correlations and red colours show positive correlations, makes these relationships easier to visualise:

heatmap(
  mean_corrs,
  Colv = NA, 
  Rowv = NA,
  cexRow = 1, 
  cexCol = 1, 
  symm = TRUE,
  distfun = function(c) as.dist(1 - c),
  col = hcl.colors(n = 12, palette = 'Blue-Red')
)
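As a rough illustration of how a loading matrix can imply correlations among series, here is a small base R sketch. The loadings and AR1 coefficients are invented for the example, the factors are assumed to be independent and stationary with unit innovation variance, and this is not necessarily the calculation that `lv_correlations()` performs on the posterior draws.

```r
# Hypothetical loadings for four series on two factors (illustrative values only)
Lambda <- matrix(c( 1.0,  0.0,
                    0.8,  0.3,
                   -0.6,  0.5,
                    0.1, -0.9),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("series_", 1:4),
                                 paste0("factor_", 1:2)))

# Assumed AR1 coefficients for the two factors; with unit innovation variance
# the stationary variance of an AR1 process is 1 / (1 - ar1^2)
ar1 <- c(0.9, 0.5)
factor_var <- 1 / (1 - ar1^2)

# Implied covariance of the series-level trends, assuming independent factors
trend_cov <- Lambda %*% diag(factor_var) %*% t(Lambda)

# Rescale to correlations
trend_cor <- cov2cor(trend_cov)
round(trend_cor, 2)
```

Series that load on the same factor with the same sign end up positively correlated, while opposite signs give negative correlations, which is the kind of pattern the heatmap is designed to show.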

But which model is better? We can compute the variogram score for out of sample forecasts to get a sense of which model does a better job of capturing the dependence structure in the true evaluation set:

# create forecast objects for each model
fcvar <- forecast(var_mod)
fcdf <- forecast(df_mod)

# plot the difference in variogram scores; a negative value means the
# DF model is better, while a positive value means the VAR1 model is better
diff_scores <- score(fcdf, score = 'variogram')$all_series$score -
  score(fcvar, score = 'variogram')$all_series$score
plot(
  diff_scores, 
  pch = 16, 
  col = 'darkred', 
  ylim = c(-1*max(abs(diff_scores), na.rm = TRUE),
           max(abs(diff_scores), na.rm = TRUE)),
  bty = 'l',
  xlab = 'Forecast horizon',
  ylab = expression(variogram[DF]~-~variogram[VAR1])
)
abline(h = 0, lty = 'dashed')

The models tend to provide similar forecasts, though the dynamic factor model does slightly better overall. We would probably need a more extensive rolling forecast evaluation exercise if we had to choose just one model for production. mvgam offers some utilities for doing this (i.e. see ?lfo_cv for guidance).
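For readers who want to see what a variogram score is measuring, here is a minimal base R sketch for a single forecast horizon. The function name `variogram_score_sketch` is hypothetical (it is not part of mvgam), and the sketch assumes equal weights over unique pairs of series and an order of p = 0.5, so the values it returns need not match those from `score()`.

```r
# Hypothetical helper (not an mvgam function): variogram score of order p
#   y     = vector of observed values, one per series
#   draws = matrix of forecast draws (rows = draws, columns = series)
variogram_score_sketch <- function(y, draws, p = 0.5) {
  stopifnot(length(y) == ncol(draws), length(y) >= 2)
  n <- length(y)
  total <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # Observed pairwise difference, raised to the power p
      obs_diff <- abs(y[i] - y[j])^p
      # Forecast expectation of the same pairwise difference
      fc_diff <- mean(abs(draws[, i] - draws[, j])^p)
      # Penalise mismatch between observed and forecast dependence
      total <- total + (obs_diff - fc_diff)^2
    }
  }
  total
}

# Usage with made-up observations and forecast draws for three series
set.seed(42)
y_obs <- c(0.2, 0.3, -1.0)
fc_draws <- cbind(rnorm(500, 0.2, 0.5),
                  rnorm(500, 0.3, 0.5),
                  rnorm(500, -1.0, 0.5))
variogram_score_sketch(y_obs, fc_draws)
```

Lower values are better. Because the score only looks at differences between pairs of series, it is sensitive to how well a forecast captures the dependence among series rather than how well it predicts each series on its own.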

Exercises

Plot conditional effects of month and temperature for each algal group using the dynamic factor model. Hint: see the documentation in ?marginaleffects::plot_predictions for guidance.
Compare in-sample fits from the two models (var_mod and df_mod) using loo_compare(). Does this comparison agree with the forecast comparison above? Why might they differ?
Fit a second dynamic factor model that uses Gaussian Process factors in place of AR1 factors. Compare forecasts from this model to the AR1 factor model using the energy and variogram scores. Which model is preferred?

Check here for template code if you’re having trouble plotting conditional effects by algal group

# Replace the ? with the correct value(s)
# You can use 'plot_predictions' to generate conditional effects plots that are stratified over a number of variables (up to three at once).
# This will feed a particular grid of 'newdata' to the 'predict.mvgam' 
# function, returning conditional predictions on the response scale
?marginaleffects::plot_predictions
plot_predictions(df_mod,
                 condition = c(?, ?, ?),
                 conf_level = 0.8)

Session Info

sessionInfo()

## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## time zone: Australia/Brisbane
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rstan_2.32.6           StanHeaders_2.32.10    marginaleffects_0.25.0
##  [4] ggplot2_3.5.1          gratia_0.10.0          bayesplot_1.11.1.9000 
##  [7] tidybayes_3.0.7        mvgam_1.1.52           dplyr_1.1.4           
## [10] downloadthis_0.4.1    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1      svUnit_1.0.6          farver_2.1.2         
##  [4] loo_2.8.0.9000        fastmap_1.2.0         tensorA_0.36.2.1     
##  [7] digest_0.6.37         timechange_0.3.0      mime_0.12            
## [10] lifecycle_1.0.4       processx_3.8.6        magrittr_2.0.3       
## [13] posterior_1.6.0.9000  compiler_4.4.2        rlang_1.1.5          
## [16] sass_0.4.9            tools_4.4.2           utf8_1.2.4           
## [19] yaml_2.3.10           data.table_1.17.0     knitr_1.49           
## [22] labeling_0.4.3        bridgesampling_1.1-2  curl_6.2.1           
## [25] pkgbuild_1.4.6        cmdstanr_0.8.0        plyr_1.8.9           
## [28] RColorBrewer_1.1-3    abind_1.4-8           klippy_0.0.0.9500    
## [31] withr_3.0.2           purrr_1.0.4           stats4_4.4.2         
## [34] grid_4.4.2            inline_0.3.21         colorspace_2.1-1     
## [37] scales_1.3.0          isoband_0.2.7         insight_1.0.2        
## [40] cli_3.6.4             mvtnorm_1.3-3         rmarkdown_2.29       
## [43] crayon_1.5.3          generics_0.1.3        RcppParallel_5.1.10  
## [46] rstudioapi_0.17.1     reshape2_1.4.4        cachem_1.1.0         
## [49] stringr_1.5.1         splines_4.4.2         assertthat_0.2.1     
## [52] parallel_4.4.2        matrixStats_1.5.0     brms_2.22.9          
## [55] vctrs_0.6.5           V8_6.0.1              Matrix_1.7-1         
## [58] jsonlite_1.9.0        patchwork_1.3.0       arrayhelpers_1.1-0   
## [61] ggdist_3.3.2          b64_0.1.3             jquerylib_0.1.4      
## [64] tidyr_1.3.1           glue_1.8.0            ps_1.9.0             
## [67] codetools_0.2-20      ggokabeito_0.1.0      mvnfast_0.2.8        
## [70] distributional_0.5.0  lubridate_1.9.4       stringi_1.8.4        
## [73] gtable_0.3.6          QuickJSR_1.5.2        munsell_0.5.1        
## [76] tibble_3.2.1          pillar_1.10.1         htmltools_0.5.8.1    
## [79] Brobdingnag_1.2-9     R6_2.6.1              evaluate_1.0.3       
## [82] lattice_0.22-6        backports_1.5.0       bslib_0.9.0          
## [85] rstantools_2.4.0.9000 bsplus_0.1.4          Rcpp_1.0.14          
## [88] zip_2.3.2             gridExtra_2.3         coda_0.19-4.1        
## [91] nlme_3.1-166          checkmate_2.3.2       mgcv_1.9-1           
## [94] xfun_0.51             fs_1.6.5              pkgconfig_2.0.3

¹ n.clark@uq.edu.au, https://github.com/nicholasjclark

Ecological forecasting in R

Tutorial 4: multivariate time series

Nicholas Clark, School of Veterinary Science, University of Queensland¹

Exercises and associated data

Load libraries and time series data

Lake Washington plankton data

Manipulate data for modeling

Exercises

Capturing seasonality

Multiseries dynamics

Inspecting SS models

Impulse response functions

Dynamic factor trends

Exercises

Session Info
