Ecological forecasting in R

Exercises and associated data

The data and modelling objects created in this notebook can be downloaded directly to save computational time.

Users who wish to complete the exercises can download a small template R script. Assuming you have already downloaded the data objects above, this script will load all data objects so that the steps used to create them are not necessary to tackle the exercises.

Load libraries and time series data

This tutorial relates to content covered in Lecture 2 and Lecture 3, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:

library(dplyr)
library(mvgam) 
library(brms)
library(marginaleffects)
library(tidybayes)
library(ggplot2); theme_set(theme_classic())

We will continue with time series of rodent captures from the Portal Project, a long-term monitoring study based near the town of Portal, Arizona, with particular focus on the time series of captures for the Desert Pocket Mouse Chaetodipus penicillatus. Data manipulations will proceed in a similar way as in Tutorial 1, except that we now use a lagged version of mintemp (at a lag of four lunar months; see ?dplyr::lag for details) to give you a sense of how you can create and use lagged predictors in time series models

data("portal_data")

portal_data %>%
  
  # Filter the data to only contain captures of the 'PP' 
  dplyr::filter(series == 'PP') %>%
  droplevels() %>%
  
  # Create a 'count' variable for the outcome
  dplyr::mutate(count = captures) %>%
  
  # Create a lagged version of mintemp to use as a predictor in place of the 
  # contemporaneous mintemp we have been using previously. Note, the data must be 
  # in the correct order for this to work properly
  dplyr::arrange(time) %>%
  dplyr::mutate(mintemp_lag4 = dplyr::lag(mintemp, 4)) %>%
  
  # Select the variables of interest to keep in the model_data
  dplyr::select(series, time, count,
                ndvi_ma12, mintemp_lag4) -> model_data

The data now contain five variables:
series, a factor indexing which time series each observation belongs to
time, the indicator of which time step each observation belongs to
count, the response variable representing the number of captures of the species PP in each sampling observation
ndvi_ma12, a 12-month moving average of the monthly Normalized Difference Vegetation Index at each time step
mintemp_lag4, the lagged monthly average minimum temperature at each time step

Using lagged versions of predictors is a standard practice in time series analysis / forecasting. But unfortunately the software we are using cannot easily handle missing values in the predictors. Because we’ve calculated lags of mintemp, the first 4 rows of our data now have missing values for this predictor:

head(model_data, 6)

##   series time count   ndvi_ma12 mintemp_lag4
## 1     PP    1     0 -0.17214413           NA
## 2     PP    2    NA -0.23736348           NA
## 3     PP    3     0 -0.21212064           NA
## 4     PP    4     1 -0.16043812           NA
## 5     PP    5     7 -0.08267729   -0.7963381
## 6     PP    6     7 -0.03692877   -1.3347160

Visualize the data as a heatmap to get a sense of where these are distributed (NAs are shown as red bars in the below plot)

image(
  is.na(
    t(model_data %>%
        dplyr::arrange(dplyr::desc(time)))
  ), 
  axes = F,
  col = c('grey80', 'darkred'))
axis(3, 
     at = seq(0, 
              1, 
              len = NCOL(model_data)), 
     labels = colnames(model_data))

For now we will simply omit these missing values and restart the time index at 1:

model_data %>%
  dplyr::filter(!is.na(mintemp_lag4)) %>%
  dplyr::mutate(time = time - min(time) + 1) -> model_data_trimmed
head(model_data_trimmed)

As in the previous tutorial, we also split the data into training and testing subsets for inspecting forecast behaviours. We will do the same here, but use a shorter validation period in this example

model_data_trimmed %>% 
  dplyr::filter(time <= 68) -> data_train 
model_data_trimmed %>% 
  dplyr::filter(time > 68) -> data_test

Forecasting with temporal smooths

Our last tutorial ended with a model that included a smooth function of time. The lectures have illustrated why this might be a bad idea if one of our goals is to forecast. We will produce forecasts from this model to show why this can be problematic. Start by fitting a similar model to the one we ended with in Tutorial 1, including the data_test object as newdata to ensure the model automatically produces forecasts.

model0 <- mvgam(
  count ~ s(time, bs = 'bs', k = 15) + 
    mintemp_lag4 + 
    ndvi_ma12,
  family = poisson(),
  data = data_train,
  newdata = data_test
)

Inspect forecasts from this model, which are not great

plot(forecast(model0))

What happens if we change k slightly by using 8 basis functions rather than 15?

model0b <- mvgam(
  count ~ s(time, bs = 'bs', k = 8) + 
    mintemp_lag4 + 
    ndvi_ma12,
  family = poisson(),
  data = data_train,
  newdata = data_test
)

Forecasts from this model are completely different!

plot(forecast(model0b))

## Out of sample DRPS:
## 75.518577

Why is this happening? The forecasts are driven almost entirely by variation in the temporal spline, which is extrapolating linearly forever beyond the training data. Any slight wiggles near the boundary will result in wildly different forecasts. To visualize this, we can plot the extrapolated temporal functions for the two models. Here for the first model, with 15 basis functions:

plot_predictions(
  model0, 
  by = 'time',
  newdata = datagrid(time = 1:max(data_test$time)),
  type = 'link'
) +
  geom_vline(xintercept = max(data_train$time), 
             linetype = 'dashed')

And for the second model, with fewer basis functions (10). The difference is clear, with this model reacting to a slightly upward shift in the smooth toward the boundary that results in predictions heading to infinity:

plot_predictions(
  model0b, 
  by = 'time',
  newdata = datagrid(time = 1:max(data_test$time)),
  type = 'link'
) +
  geom_vline(xintercept = max(data_train$time), 
             linetype = 'dashed')

Try changing the value of k and seeing how forecasts change. This should give you more confidence that extrapolating from penalized smooths can be unpredictable and difficult to control. There are some strategies to help reign in this behaviour (once again this advice comes via a blogpost by Gavin Simpson), but results will still be unpredictable and sensitive to choices such as k or the placement of knots. But if we shouldn’t use smooth functions of time for forecasting, what other options do we have?

Exercises

Using advice from Gavin Simpson’s blog post above, try changing the order of the penalty in the temporal smooth to see how the resulting predictions change (use ?b.spline for guidance)
Try using a cubic regression spline in place of the b-spline and inspect the resulting predictions (use bs = 'cr')

Check here for template code if you’re having trouble changing the penalty order in the b-spline

# Replace the ? with the correct value(s)
# Using ?b.spline you will see examples in the Description and Details sections about how these splines can handle multiple penalties
model0 <- mvgam(
  count ~ s(time, bs = 'bs', k = 15,
            m = c(?)) + 
    mintemp_lag4 + 
    ndvi_ma12,
  family = poisson(),
  data = data_train,
  newdata = data_test
)

Residual correlation structures

One useful strategy to deal with this is to fit models that incorporate autocorrelated residual processes within the GAM. If our model’s residuals are autocorrelated, our estimates of uncertainty around other regression parameters (such as the \(\beta\) coefficients) can be biased and misleading. Failing to address this may lead to false inferences and perhaps to poor forecasts. Fortunately there are ways we can tackle this by modelling the autocorrelation in our model’s errors. This can be done using mgcv functionality, as the blog post above explains. But we will use brms here as it is more flexible and provides an interface that is more consistent with mvgam.

A standard Poisson GAM in `brms`

First, we will fit a similar model to those used in Tutorial 1 so to see how brms works. The interface is very familiar, relying on the usual R formula syntax (see ?brmsformula for details and examples). But there are some key differences between brms and mvgam. For example, brms can handle a much larger diversity of response distributions (see ?brmsfamily for details). But for now, we will stick a similar Poisson model as previously (a nonlinear function of mintemp_lag4 and a parametric, linear function of mintemp):

model1 <- brm(
  count ~ 
    s(mintemp_lag4, k = 9) +
    ndvi_ma12,
  family = poisson(),
  # set a mildly informative prior on beta coefficients
  prior = set_prior("normal(0, 2)", class = "b"),
  data = data_train,
  backend = 'cmdstanr',
  iter = 1000,
  cores = 4
)

Use summary() to ensure no major diagnostic warnings have been produced and inspect posterior distributions for key parameters. You may notice some warnings of divergent transitions for this model, which should give you some immediate concern about the validity of the model’s inferences

summary(model1)

## Warning: There were 179 divergent transitions after warmup. Increasing
## adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

##  Family: poisson 
##   Links: mu = log 
## Formula: count ~ s(mintemp_lag4, k = 9) + ndvi_ma12 
##    Data: data_train (Number of observations: 52) 
##   Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup draws = 2000
## 
## Smoothing Spline Hyperparameters:
##                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sds(smintemp_lag4_1)     2.46      1.79     0.08     6.09 1.04      120
##                      Tail_ESS
## sds(smintemp_lag4_1)      205
## 
## Regression Coefficients:
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept           1.52      0.07     1.38     1.65 1.01      383      517
## ndvi_ma12           0.12      0.25    -0.35     0.59 1.01      278      209
## smintemp_lag4_1    -2.37      1.50    -5.63     0.51 1.02      307      243
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

As witht he mvgam models, for any brms model we can inspect the underlying Stan code using the stancode function:

stancode(model1)

Check here to see the model’s Stan code

## // generated with brms 2.22.9
## functions {
## }
## data {
##   int<lower=1> N;  // total number of observations
##   array[N] int Y;  // response variable
##   int<lower=1> K;  // number of population-level effects
##   matrix[N, K] X;  // population-level design matrix
##   int<lower=1> Kc;  // number of population-level effects after centering
##   // data for splines
##   int Ks;  // number of linear effects
##   matrix[N, Ks] Xs;  // design matrix for the linear effects
##   // data for spline 1
##   int nb_1;  // number of bases
##   array[nb_1] int knots_1;  // number of knots
##   // basis function matrices
##   matrix[N, knots_1[1]] Zs_1_1;
##   int prior_only;  // should the likelihood be ignored?
## }
## transformed data {
##   matrix[N, Kc] Xc;  // centered version of X without an intercept
##   vector[Kc] means_X;  // column means of X before centering
##   for (i in 2:K) {
##     means_X[i - 1] = mean(X[, i]);
##     Xc[, i - 1] = X[, i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b;  // regression coefficients
##   real Intercept;  // temporary intercept for centered predictors
##   vector[Ks] bs;  // unpenalized spline coefficients
##   // parameters for spline 1
##   // standardized penalized spline coefficients
##   vector[knots_1[1]] zs_1_1;
##   vector<lower=0>[nb_1] sds_1;  // SDs of penalized spline coefficients
## }
## transformed parameters {
##   // penalized spline coefficients
##   vector[knots_1[1]] s_1_1;
##   real lprior = 0;  // prior contributions to the log posterior
##   // compute penalized spline coefficients
##   s_1_1 = sds_1[1] * zs_1_1;
##   lprior += normal_lpdf(b | 0, 2);
##   lprior += student_t_lpdf(Intercept | 3, 1.4, 2.5);
##   lprior += normal_lpdf(bs | 0, 2);
##   lprior += student_t_lpdf(sds_1 | 3, 0, 2.5)
##     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## }
## model {
##   // likelihood including constants
##   if (!prior_only) {
##     // initialize linear predictor term
##     vector[N] mu = rep_vector(0.0, N);
##     mu += Intercept + Xs * bs + Zs_1_1 * s_1_1;
##     target += poisson_log_glm_lpmf(Y | Xc, mu, b);
##   }
##   // priors including constants
##   target += lprior;
##   target += std_normal_lpdf(zs_1_1);
## }
## generated quantities {
##   // actual population-level intercept
##   real b_Intercept = Intercept - dot_product(means_X, b);
## }

Of course we still cannot interpret the smooth effects using the posterior summary. We will need to make targeted plots to do that. Fortunately brms has a similar conditional_effects() function to mvgam so this is simple to do (note that by default calling conditional_effects() on brms models will return predictions on the expectation scale):

plot(
  conditional_effects(model1),
     theme = theme_classic(),
     mean = FALSE,
     rug = TRUE, 
     ask = FALSE
  )

Computing residuals with `tidybayes`

A notable difference between brms and mvgam is that the former does not automatically return predictions. These need to be computed after model fitting, which means that residuals also have to be computed. To inspect residual diagnostics we can enlist the help of the tidybayes package to produce randomized quantile residuals. More information on this procedure can be found in this tidybayes vignette. Do not worry if it is difficult to understand, it is not of major relevance to the learning objectives of the course (and if you use mvgam, this will be done for you automatically anyway)

data_train %>%
  # calculate posterior predictions
  add_predicted_draws(model1) %>%
  # use predictions to compute randomized quantile residuals
  summarise(
    p_lower = mean(.prediction < count),
    p_upper = mean(.prediction <= count),
    p_residual = runif(1, 
                       max(p_lower, 0.00001), 
                       min(p_upper, 0.99999)),
    z_residual = qnorm(p_residual),
    .groups = "drop_last"
  ) -> residual_data

Perform the usual residual diagnostics. For example, below are residual ACF, pACF and Q-Q plots, all of which show that the model’s predictions do not adequately match the features of the observed data

acf(residual_data$z_residual, 
    na.action = na.pass, 
    bty = 'l',
    lwd = 2,
    ci.col = 'darkred',
    main = 'Dunn-Smyth residuals')

pacf(residual_data$z_residual, 
    na.action = na.pass, 
    bty = 'l',
    lwd = 2,
    ci.col = 'darkred',
    main = 'Dunn-Smyth residuals')

ggplot(residual_data, 
       aes(sample = z_residual)) +
  geom_qq() +
  geom_abline(col = 'darkred') +
  theme_classic()

Autocorrelated error GAM in `brms`

This model is not doing well. Clearly we need to account for temporal autocorrelation without using a smooth of time. Now onto another great feature of brms: the ability to include (possibly latent) autocorrelated residuals with the ar() function (see ?brms::ar for details). This model will use a separate sub-model for latent residuals that evolve as an AR1 process (i.e. the error in the current time point is a function of the error in the previous time point, plus stochastic noise). When using non-Gaussian observations, we can only select an order of 1 for the AR component (higher orders are only currently allowed for Gaussian observations). Note, this model takes much longer to fit than the previous model because it estimates a full temporal covariance for the residuals

model2 <- brm(
  count ~ 
    s(mintemp_lag4, k = 9) +
    ndvi_ma12 +
    ar(p = 1, 
       time = time,
       cov = TRUE),
  family = poisson(),
  prior = c(set_prior("normal(0, 2)", class = "b")),
  data = data_train,
  backend = 'cmdstanr',
  iter = 1000,
  cores = 4
)

Check here to see a mathematical description of the model

The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ log(\lambda_t) & = f(\boldsymbol{mintemp})_t + \beta_{ndvi} * \boldsymbol{ndvi}_t + z_t \\ z_t & \sim \text{Normal}(ar1 * z_{t-1}, \sigma_{error}) \\ ar1 & \sim \text{Normal}(0, 1)[-1, 1] \\ \sigma_{error} & \sim \text{Exponential}(2) \\ f(\boldsymbol{mintemp}) & = \sum_{k=1}^{K}b * \beta_{smooth} \\ \beta_{smooth} & \sim \text{MVNormal}(0, (\Omega * \lambda)^{-1}) \\ \beta_{ndvi} & \sim \text{Normal}(0, 1) \end{align*}\]

Here the term \(z_t\) captures autocorrelated latent residuals, which are modelled using an AR1 process. The AR coefficients are restricted to the interval \([-1,1]\) to ensure the dynamic process is stationary.

View the model’s summary, and notice how there are now summaries for the residual autoregressive parameters as well

summary(model2)

## Warning: There were 19 divergent transitions after warmup. Increasing
## adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

##  Family: poisson 
##   Links: mu = log 
## Formula: count ~ s(mintemp_lag4, k = 9) + ndvi_ma12 + ar(p = 1, time = time, cov = TRUE) 
##    Data: data_train (Number of observations: 52) 
##   Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup draws = 2000
## 
## Smoothing Spline Hyperparameters:
##                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sds(smintemp_lag4_1)     0.99      0.83     0.04     3.18 1.00      633
##                      Tail_ESS
## sds(smintemp_lag4_1)      804
## 
## Correlation Structures:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## ar[1]     0.79      0.10     0.58     0.95 1.01      636      928
## sderr     0.74      0.13     0.52     1.04 1.00     1022     1141
## 
## Regression Coefficients:
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept           1.16      0.54     0.07     2.28 1.01      845      990
## ndvi_ma12           0.75      0.82    -0.86     2.40 1.00     1147     1077
## smintemp_lag4_1    -1.91      1.53    -4.63     1.32 1.01     1176     1138
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Again you can inspect the Stan code if you’d like to see how the autocorrelation structure is included in the linear predictor

stancode(model2)

Check here to see the model’s Stan code

## // generated with brms 2.22.9
## functions {
##   /* integer sequence of values
##    * Args:
##    *   start: starting integer
##    *   end: ending integer
##    * Returns:
##    *   an integer sequence from start to end
##    */
##   array[] int sequence(int start, int end) {
##     array[end - start + 1] int seq;
##     for (n in 1:num_elements(seq)) {
##       seq[n] = n + start - 1;
##     }
##     return seq;
##   }
##   // are two 1D integer arrays equal?
##   int is_equal(array[] int a, array[] int b) {
##     int n_a = size(a);
##     int n_b = size(b);
##     if (n_a != n_b) {
##       return 0;
##     }
##     for (i in 1:n_a) {
##       if (a[i] != b[i]) {
##         return 0;
##       }
##     }
##     return 1;
##   }
## 
##   /* grouped data stored linearly in "data" as indexed by begin and end
##    * is repacked to be stacked into an array of vectors.
##    */
##   array[] vector stack_vectors(vector long_data, int n, array[] int stack,
##                                array[] int begin, array[] int end) {
##     int S = sum(stack);
##     int G = size(stack);
##     array[S] vector[n] stacked;
##     int j = 1;
##     for (i in 1:G) {
##       if (stack[i] == 1) {
##         stacked[j] = long_data[begin[i]:end[i]];
##         j += 1;
##       }
##     }
##     return stacked;
##   }
##   /* compute the cholesky factor of an AR1 correlation matrix
##    * Args:
##    *   ar: AR1 autocorrelation
##    *   nrows: number of rows of the covariance matrix
##    * Returns:
##    *   A nrows x nrows matrix
##    */
##    matrix cholesky_cor_ar1(real ar, int nrows) {
##      matrix[nrows, nrows] mat;
##      vector[nrows - 1] gamma;
##      mat = diag_matrix(rep_vector(1, nrows));
##      for (i in 2:nrows) {
##        gamma[i - 1] = pow(ar, i - 1);
##        for (j in 1:(i - 1)) {
##          mat[i, j] = gamma[i - j];
##          mat[j, i] = gamma[i - j];
##        }
##      }
##      return cholesky_decompose(mat ./ (1 - ar^2));
##    }
##   /* scale and correlate time-series residuals
##    * using the Cholesky factor of the correlation matrix
##    * Args:
##    *   zerr: standardized and independent residuals
##    *   sderr: standard deviation of the residuals
##    *   chol_cor: cholesky factor of the correlation matrix
##    *   nobs: number of observations in each group
##    *   begin: the first observation in each group
##    *   end: the last observation in each group
##    * Returns:
##    *   vector of scaled and correlated residuals
##    */
##    vector scale_time_err(vector zerr, real sderr, matrix chol_cor,
##                          array[] int nobs, array[] int begin, array[] int end) {
##      vector[rows(zerr)] err;
##      for (i in 1:size(nobs)) {
##        matrix[nobs[i], nobs[i]] L_i;
##        L_i = sderr * chol_cor[1:nobs[i], 1:nobs[i]];
##        err[begin[i]:end[i]] = L_i * zerr[begin[i]:end[i]];
##      }
##      return err;
##    }
## }
## data {
##   int<lower=1> N;  // total number of observations
##   array[N] int Y;  // response variable
##   int<lower=1> K;  // number of population-level effects
##   matrix[N, K] X;  // population-level design matrix
##   int<lower=1> Kc;  // number of population-level effects after centering
##   // data for splines
##   int Ks;  // number of linear effects
##   matrix[N, Ks] Xs;  // design matrix for the linear effects
##   // data for spline 1
##   int nb_1;  // number of bases
##   array[nb_1] int knots_1;  // number of knots
##   // basis function matrices
##   matrix[N, knots_1[1]] Zs_1_1;
##   // data needed for ARMA correlations
##   int<lower=0> Kar;  // AR order
##   int<lower=0> Kma;  // MA order
##   // see the functions block for details
##   int<lower=1> N_tg;
##   array[N_tg] int<lower=1> begin_tg;
##   array[N_tg] int<lower=1> end_tg;
##   array[N_tg] int<lower=1> nobs_tg;
##   int prior_only;  // should the likelihood be ignored?
## }
## transformed data {
##   matrix[N, Kc] Xc;  // centered version of X without an intercept
##   vector[Kc] means_X;  // column means of X before centering
##   int max_lag = max(Kar, Kma);
##   int max_nobs_tg = max(nobs_tg);  // maximum dimension of the autocorrelation matrix
##   for (i in 2:K) {
##     means_X[i - 1] = mean(X[, i]);
##     Xc[, i - 1] = X[, i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b;  // regression coefficients
##   real Intercept;  // temporary intercept for centered predictors
##   vector[Ks] bs;  // unpenalized spline coefficients
##   // parameters for spline 1
##   // standardized penalized spline coefficients
##   vector[knots_1[1]] zs_1_1;
##   vector<lower=0>[nb_1] sds_1;  // SDs of penalized spline coefficients
##   vector[N] zerr;  // unscaled residuals
##   real<lower=0> sderr;  // SD of residuals
##   vector<lower=-1,upper=1>[Kar] ar;  // autoregressive coefficients
## }
## transformed parameters {
##   // penalized spline coefficients
##   vector[knots_1[1]] s_1_1;
##   vector[N] err;  // actual residuals
##   // cholesky factor of the autocorrelation matrix
##   matrix[max_nobs_tg, max_nobs_tg] Lcortime;
##   real lprior = 0;  // prior contributions to the log posterior
##   // compute residual covariance matrix
##   Lcortime = cholesky_cor_ar1(ar[1], max_nobs_tg);
##   // compute correlated time-series residuals
##   err = scale_time_err(zerr, sderr, Lcortime, nobs_tg, begin_tg, end_tg);
##   // compute penalized spline coefficients
##   s_1_1 = sds_1[1] * zs_1_1;
##   lprior += normal_lpdf(b | 0, 2);
##   lprior += student_t_lpdf(Intercept | 3, 1.4, 2.5);
##   lprior += normal_lpdf(bs | 0, 2);
##   lprior += student_t_lpdf(sds_1 | 3, 0, 2.5)
##     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
##   lprior += student_t_lpdf(sderr | 3, 0, 2.5)
##     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## }
## model {
##   // likelihood including constants
##   if (!prior_only) {
##     // initialize linear predictor term
##     vector[N] mu = rep_vector(0.0, N);
##     mu += Intercept + Xc * b + Xs * bs + Zs_1_1 * s_1_1 + err;
##     target += poisson_log_lpmf(Y | mu);
##   }
##   // priors including constants
##   target += lprior;
##   target += std_normal_lpdf(zs_1_1);
##   target += std_normal_lpdf(zerr);
## }
## generated quantities {
##   // actual population-level intercept
##   real b_Intercept = Intercept - dot_product(means_X, b);
## }

Autocorrelated error GAM in `mvgam`

The brms model is estimating autocorrelated errors for the full time period, even though some of these time points have missing observations. This is useful for getting more realistic estimates of the residual autocorrelation parameters. In contrast, if you were to attempt an AR error process in mgcv (which you can do using gamm() or bam()), the time points with missing data would be thrown out before constructing the model. This will cause problems, sometimes small but sometimes very big, for your inferences.

mvgam can also fit the latent trend AR1 model, though it is considerably faster. It also includes an option to reparameterize the latent dynamic process using non-centering, which typically leads to higher effective sample sizes and better convergence. Let’s fit it (note that only 250 posterior samples per chain are taken to reduce the size of the model, which will make it easier to download for completing the exercises):

model3 <- mvgam(
  count ~ 
    s(mintemp_lag4, k = 9) +
    ndvi_ma12,
  family = poisson(),
  data = data_train,
  newdata = data_test,
  priors = prior(normal(0, 2), class = b),
  samples = 250,
  trend_model = AR(p = 1),
  noncentred = TRUE
)

Inspect the Stan code for this model to see how it differs from the brms version

stancode(model3)

Check here to see the model’s Stan code

## // Stan model code generated by package mvgam
## data {
##   int<lower=0> total_obs; // total number of observations
##   int<lower=0> n; // number of timepoints per series
##   int<lower=0> n_sp; // number of smoothing parameters
##   int<lower=0> n_series; // number of series
##   int<lower=0> num_basis; // total number of basis coefficients
##   vector[num_basis] zero; // prior locations for basis coefficients
##   matrix[total_obs, num_basis] X; // mgcv GAM design matrix
##   array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
##   matrix[8, 16] S1; // mgcv smooth penalty matrix S1
##   int<lower=0> n_nonmissing; // number of nonmissing observations
##   array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
##   matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
##   array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
##   // raw basis coefficients
##   vector[num_basis] b_raw;
##   
##   // latent trend AR1 terms
##   vector<lower=-1, upper=1>[n_series] ar1;
##   
##   // latent trend variance parameters
##   vector<lower=0>[n_series] sigma;
##   
##   // raw latent trends
##   matrix[n, n_series] trend_raw;
##   
##   // smoothing parameters
##   vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
##   // basis coefficients
##   vector[num_basis] b;
##   
##   // latent trends
##   matrix[n, n_series] trend;
##   trend = trend_raw .* rep_matrix(sigma', rows(trend_raw));
##   for (s in 1 : n_series) {
##     trend[2 : n, s] += ar1[s] * trend[1 : (n - 1), s];
##   }
##   b[1 : num_basis] = b_raw[1 : num_basis];
## }
## model {
##   // prior for (Intercept)...
##   b_raw[1] ~ student_t(3, 1.4, 2.5);
##   
##   // prior for ndvi_ma12...
##   b_raw[2] ~ normal(0, 2);
##   
##   // prior for s(mintemp_lag4)...
##   b_raw[3 : 10] ~ multi_normal_prec(zero[3 : 10],
##                                     S1[1 : 8, 1 : 8] * lambda[1]
##                                     + S1[1 : 8, 9 : 16] * lambda[2]);
##   
##   // priors for AR parameters
##   ar1 ~ std_normal();
##   
##   // priors for smoothing parameters
##   lambda ~ normal(5, 30);
##   
##   // priors for latent trend variance parameters
##   sigma ~ student_t(3, 0, 2.5);
##   to_vector(trend_raw) ~ std_normal();
##   {
##     // likelihood functions
##     vector[n_nonmissing] flat_trends;
##     flat_trends = to_vector(trend)[obs_ind];
##     flat_ys ~ poisson_log_glm(append_col(flat_xs, flat_trends), 0.0,
##                               append_row(b, 1.0));
##   }
## }
## generated quantities {
##   vector[total_obs] eta;
##   matrix[n, n_series] mus;
##   vector[n_sp] rho;
##   vector[n_series] tau;
##   array[n, n_series] int ypred;
##   rho = log(lambda);
##   for (s in 1 : n_series) {
##     tau[s] = pow(sigma[s], -2.0);
##   }
##   
##   // posterior predictions
##   eta = X * b;
##   for (s in 1 : n_series) {
##     mus[1 : n, s] = eta[ytimes[1 : n, s]] + trend[1 : n, s];
##     ypred[1 : n, s] = poisson_log_rng(mus[1 : n, s]);
##   }
## }

Check the summary:

summary(model3)

## GAM formula:
## count ~ s(mintemp_lag4, k = 9) + ndvi_ma12
## 
## Family:
## poisson
## 
## Link function:
## log
## 
## Trend model:
## AR()
## 
## 
## N series:
## 1 
## 
## N timepoints:
## 76 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 750; warmup = 500; thin = 1 
## Total post-warmup draws = 1000
## 
## 
## GAM coefficient (beta) estimates:
##                     2.5%     50%  97.5% Rhat n_eff
## (Intercept)        0.560  1.1000  1.500 1.00   486
## ndvi_ma12         -0.920  0.4200  1.700 1.00   581
## s(mintemp_lag4).1 -0.430 -0.0380  0.130 1.02   302
## s(mintemp_lag4).2 -0.390  0.0190  0.440 1.01   242
## s(mintemp_lag4).3 -0.150 -0.0051  0.089 1.01   224
## s(mintemp_lag4).4 -0.230  0.0082  0.220 1.01   213
## s(mintemp_lag4).5 -0.074 -0.0008  0.058 1.01   263
## s(mintemp_lag4).6 -0.200  0.0053  0.200 1.01   202
## s(mintemp_lag4).7 -0.550  0.0310  0.700 1.01   228
## s(mintemp_lag4).8 -0.990 -0.4200 -0.011 1.00   363
## 
## Approximate significance of GAM smooths:
##                  edf Ref.df Chi.sq p-value
## s(mintemp_lag4) 1.12      8   16.2    0.71
## 
## Latent trend parameter AR estimates:
##          2.5%  50% 97.5% Rhat n_eff
## ar1[1]   0.47 0.80  0.99 1.00   475
## sigma[1] 0.56 0.77  1.10 1.01   304
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 1000 iterations ended with a divergence (0%)
## 0 of 1000 iterations saturated the maximum tree depth of 10 (0%)
## E-FMI indicated no pathological behavior
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 08 6:23:50 PM 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
## 
## Use how_to_cite(model3) to get started describing this model

Comparing inferences

We can compare inferences from the two models to see that they are broadly similar:

plot(
  conditional_effects(model2), 
  ask = FALSE
  )

conditional_effects(model3, 
                    type = 'expected')

Now compare estimates for the residual AR1 parameters. First, extract posterior draws for the AR1 parameter from the two models, which can be done using the as.data.frame:

ar1_ests <- rbind(
  data.frame(
    ar1 = as.data.frame(model3, variable = 'ar1[1]')$`ar1[1]`,
    model = 'mvgam'),
  data.frame(
    ar1 = as.data.frame(model2, variable = 'ar[1]')$`ar[1]`,
    model = 'brms')
)

We can use ggplot2 to compare these estimates, which shows that they are similar, but not exactly the same:

ggplot(data = ar1_ests, 
       aes(x = ar1, fill = model)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 40, 
                 col = 'white') +
  facet_grid(rows = 'model') +
  theme_bw() +
  theme(legend.position = 'none') +
  scale_fill_manual(values = c('darkblue', 
                               'darkred')) +
  labs(x = expression(ar1[error]), 
       y = 'Frequency') +
  xlim(c(0,1))

There is a major drawback of the brms model here, which happens because any rows in data_train that have NA in the outcome variable (count) are being thrown out before model fitting. In fact, there are quite a few of these in our data:

length(which(is.na(data_train$count)))

## [1] 16

This means that the brms model thinks that certain observations are adjacent to one another in time when actually they are not. For example, the brms model believes that timepoints 5 and 7 are right next to each other, and also that timepoints 12 and 14 are right next to each other:

data_train %>%
  dplyr::select(time, count) %>%
  head(16)

##    time count
## 1     1     7
## 2     2     7
## 3     3     8
## 4     4     8
## 5     5     4
## 6     6    NA
## 7     7     0
## 8     8     0
## 9     9     0
## 10   10     0
## 11   11     0
## 12   12     0
## 13   13    NA
## 14   14     2
## 15   15     4
## 16   16     7

This can often cause problems because we would not expect the level of autocorrelation to be the same when jumping 2 timesteps vs only 1 timestep. In fact, most regression packages in R will do this (i.e. they will fit autocorrelation models to residuals after throwing out observations with missing values). But, mvgam does not do this! It keeps the observations with NAs in the data so that the temporal spacing can be adequately preserved. Other drawbacks of the brms approach to correlated residuals include:

We cannot allow different time series to have different AR1 parameters
We cannot model the correlations among the errors for different time series
We cannot fit the dynamic processes using a State-Space approach

Of course there may be situations where we really do have unequal temporal spacing among observations and we don’t want to force a model into equal spacing by filling missing timepoints with NAs. In this case, mvgam can be used with a Continuous Time Autoregressive process:

\[\begin{align*} z_{i,t} & \sim \text{Normal}(ar1_i^{distance} * z_{i,t-1}, \sigma_i) \end{align*}\]

Where \(distance\) is a vector of non-negative measurements of the time differences between successive observations. See the Examples section in ?CAR for an illustration of how to set these models up.

AR processes and stationarity

As you may remember from the lectures, the larger this AR1 parameter becomes (in absolute value), the more a time series tends toward nonstationarity. We can look at some example functions to investigate this behaviour in more detail here

Click here if you would like to see the R code used to produce these simulations

# Define a function to simulate AR1 processes with a fixed error variance
simulate_ar1 = function(ar1 = 0.5, N = 50){
  # simulate the initial value of the series
  init <- rnorm(1, mean = 0, sd = 0.25)
  
  # create an empty vector to store the time series values
  states <- vector(length = N)
  
  # set the first value of the states as the initial value
  states[1] <- init
  
  # loop over remaining time points and fill in the AR1 process
  for(t in 2:N){
    states[t] <- rnorm(1, mean = ar1 * states[t - 1],
                       sd = 0.25)
  }
  
  return(states)
}

# plot a single AR1 series with an AR1 parameter = 0.5
plot(x = 1:50, 
     y = simulate_ar1(ar1 = 0.5), 
     type = 'l',
     lwd = 2, col = 'darkred',
     ylab = expression(f(Time)),
     xlab = 'Time', bty = 'l',
     ylim = c(-2,2),
     main = expression(AR[1]~'='~0.5))

# overlay an additional 20 possible series using different colours
for(i in 1:20){
  lines(simulate_ar1(ar1 = 0.5),
        col = sample(c("#DCBCBC",
                       "#C79999",
                       "#B97C7C",
                       "#A25050",
                       "#7C0000"), 1),
        lwd = 2)
}
box(bty = 'l', lwd = 2)

This plot shows how an AR1 process with a smallish AR1 parameter (0.5) approaches stationarity fairly quickly. What happens if we increase this parameter toward the types of values that our models have estimated? The code to produce these simulations is the same as above, except the AR1 parameter was changed to 0.95

The variance of this particular series grows into the future for a much longer time than did the series we simulated above. This is useful for getting a sense about the relative stability of the underlying dynamic process. We can enforce a Random Walk so that the dynamic process will not be stationary (note, a Random Walk trend can only be fit in mvgam; there is currently no way of using such a process in brms, that I’m aware of):

model3b <- mvgam(
  count ~ 
    s(mintemp_lag4, k = 9) +
    ndvi_ma12,
  family = poisson(),
  data = data_train,
  newdata = data_test,
  priors = prior(normal(0, 2), class = b),
  trend_model = RW()
)

Check here to see a mathematical description of the model

The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ log(\lambda_t) & = f(\boldsymbol{mintemp})_t + \beta_{ndvi} * \boldsymbol{ndvi}_t + z_t \\ z_t & \sim \text{Normal}(z_{t-1}, \sigma_{error}) \\ \sigma_{error} & \sim \text{Exponential}(2) \\ f(\boldsymbol{mintemp}) & = \sum_{k=1}^{K}b * \beta_{smooth} \\ \beta_{smooth} & \sim \text{MVNormal}(0, (\Omega * \lambda)^{-1}) \\ \beta_{ndvi} & \sim \text{Normal}(0, 1) \end{align*}\]

Note the similarity to model3 above. The only difference is that we no longer estimate the AR1 parameter, so the mean of the latent process at time \(t\) is simply the value of the latent process at time \(t-1\).

Plotting the trend estimates for the AR1 and Random Walk models gives you further insight into the different assumptions they make about the temporal evolution of the system:

patchwork::wrap_plots(
  plot(
    forecast(model3, 
             type = 'trend')
  ) +
    labs(title = 'AR1 trend'),
  plot(
    forecast(model3b, 
             type = 'trend')
  ) +
    labs(title = 'RW trend'),
  ncol = 1
)

Exercises

Compare estimates for the latent residual error terms (\(\sigma_{error}\)) from model2 and model3. In mvgam, this parameter is called sigma[1], while in brms, it is called sderr
Use plot_slopes() to inspect rates of change in the smooth of mintemp_lag4 for model3 (hint, use type = 'link' for a cleaner plot of these slopes)
Look at the Dunn-Smyth residuals for model3 and provide a few comments describing what you see (use plot.mvgam() with type = residuals). Does this model seem to capture the relevant features of the autocorrelated observed data?
Plot residuals against ndvi_ma12 for model3 using pp_check(). Do you see any evidence that a linear association might be too simple for this effect?

Gaussian Processes

Another way to capture dynamic errors in ecological time series is with a Gaussian Process:

“A Gaussian process defines a probability distribution over functions; in other words every sample from a Gaussian Process is an entire function from the covariate space X to the real-valued output space.” (Betancourt; Robust Gaussian Process Modeling)

In many time series analyses, we believe the power of past observations to explain current data is a function of how long ago we observed those past observations. Gaussian Processes (GPs) use covariance functions (often called ‘kernels’) that allow this to happen by specifying how correlations depend on the difference, in time, between observations. For our purposes we will rely on the squared exponential kernel, which depends on \(\rho\) (often called the length scale parameter) to control how quickly correlations decay as a function of time. The functions also depend on a parameter \(\alpha\), which controls how much the GP term contributes to the linear predictor. Below we simulate from such a GP to get an idea of the kinds of functional shapes that can be produced as you vary the length scale \(\rho\):

Click here if you would like to see the R code used to produce these simulations

# set a seed for reproducibility
set.seed(2222)

# simulate from a fairly large length-scale parameter
rho <- 10

# set the alpha parameter to 1 for all simulations
alpha <- 1

# simulate functions that span 50 time points
times <- 1:50

# generate a sequence of prediction values
draw_seq <- seq(min(times), max(times), length.out = 100)

# construct the temporal covariance matrix
Sigma <- alpha^2 * 
  exp(-0.5 * ((outer(draw_seq, draw_seq, "-") / rho) ^ 2)) +
  diag(1e-12, length(draw_seq))

# draw a single realization from the GP distribution
gp_vals <- MASS::mvrnorm(n = 1, 
                         mu = rep(0, length(draw_seq)),
                         Sigma = Sigma)

# plot the single GP draw
plot(x = draw_seq, 
     y = gp_vals, 
     type = 'l',
     lwd = 2, 
     col = 'darkred',
     ylab = expression(f(Time)),
     xlab = 'Time', 
     bty = 'l',
     ylim = c(-3.5, 3.5),
     main = expression(alpha~'='~1*'; 
                       '~rho~'='~10))
# overlay an additional 10 possible functions using different colours
for(i in 1:10){
  draw <- MASS::mvrnorm(n = 1, 
                        mu = rep(0, length(draw_seq)),
                        Sigma = Sigma)
  lines(x = draw_seq, 
        y = draw, 
        lwd = 3.5, 
        col = 'white')
  lines(x = draw_seq, 
        y = draw,
        col = sample(c("#DCBCBC",
                       "#C79999",
                       "#B97C7C",
                       "#A25050",
                       "#7C0000"), 1),
        lwd = 2.5)
}
box(bty = 'l', lwd = 2)

This plot shows that the temporal autocorrelations have a long ‘memory’, i.e. they tend to change slowly over time. In contrast, let’s see what happens when we simulate from a GP with a shorter length scale parameter (the same code as above was used, except the length scale was changed to 4)

These functions are much ‘wigglier’ and can be useful for capturing temporal dynamics with short-term ‘memory’. A big advantage of GPs is that they can easily handle data that are irregularly sampled, much like splines can (and in contrast to autoregressive processes like we’ve been using so far). But although GPs may look similar to splines in the functions they return, they are fundamentally different. In particular, a spline has local knowledge, meaning it has no principled way of understanding how the function itself is evolving across time. But a GP directly models how the function changes over time, which means it is better able to generate out-of-sample predictions.

GP trends in `brms`

We can use GPs in both brms and mvgam. In fact, both packages capitalize on some advances that allow GPs to be approximated using Hilbert space basis functions, which considerably speed up computation at little cost to accuracy or prediction performance. You can incorporate GPs of time (or of any other covariate) using the gp() function (see ?gp for details).

Up to this point we’ve used a Poisson model and have been able to capture excess dispersion with autoregressive error terms. But a GP trend will not readily capture this overdispersion because it evolves smoothly. We’ll instead switch to a Negative Binomial model. This can be helpful for our data as we want to capture overdispersion in the counts while estimating a smooth underlying dynamic trend with the GP function. Read more about the Negative Binomial and its widespread use in ecological modelling in this recent paper.

model4 <- brm(
  count ~ 
    s(mintemp_lag4, k = 9) +
    ndvi_ma12 +
    # use a GP of time with squared exponential kernel;
    # employ Hilbert space approximation
    # for faster computation, with 20 basis functions
    # used for the approximation; set scale = FALSE so
    # the rho parameter is more interpretable
    gp(time, c = 5/4, k = 20, scale = FALSE),
  # use a Negative Binomial observation model
  family = negbinomial(),
  prior = set_prior("normal(0, 2)", class = "b"),
  data = data_train,
  backend = 'cmdstanr',
  iter = 1000,
  cores = 4
)

Check here to see a mathematical description of the model

The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{NegBinomial}(\lambda_t, \phi) \\ log(\lambda_t) & = f(\boldsymbol{mintemp})_t + \beta_{ndvi} * \boldsymbol{ndvi}_t + z_t \\ z & \sim \text{MVNormal}(0, \Sigma_{error}) \\ \Sigma_{error}[t_i, t_j] & = \alpha^2 * exp(-0.5 * ((|t_i - t_j| / \rho))^2) \\ \alpha & \sim \text{StudentT}(3, 0, 2.5)[0, ] \\ \rho & \sim \text{InverseGamma}(1.5, 9) \\ f(\boldsymbol{mintemp}) & = \sum_{k=1}^{K}b * \beta_{smooth} \\ \beta_{smooth} & \sim \text{MVNormal}(0, (\Omega * \lambda)^{-1}) \\ \beta_{ndvi} & \sim \text{Normal}(0, 1) \\ \phi & \sim \text{Gamma}(0.01, 0.01) \end{align*}\]

The latent dynamic process now evolves as a GP. The Negative Binomial distribution also has a second parameter, \(\phi\), which controls any extra dispersion. A summary of this model now returns posterior summaries of the GP function through the parameters sdgp(gptime) (the \(\alpha\) covariance parameter) and lscale(gptime) (the length scale or \(\rho\) covariance parameter). It also contains the overdispersion parameter (\(\phi\)), labelled as shape. The smaller this value, the more support there is for overdispersion in the observations

summary(model4)

## Warning: There were 2 divergent transitions after warmup. Increasing
## adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

##  Family: negbinomial 
##   Links: mu = log; shape = identity 
## Formula: count ~ s(mintemp_lag4, k = 9) + ndvi_ma12 + gp(time, c = 5/4, k = 20, scale = FALSE) 
##    Data: data_train (Number of observations: 52) 
##   Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
##          total post-warmup draws = 2000
## 
## Smoothing Spline Hyperparameters:
##                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## sds(smintemp_lag4_1)     1.12      0.95     0.04     3.70 1.01      558
##                      Tail_ESS
## sds(smintemp_lag4_1)      846
## 
## Gaussian Process Hyperparameters:
##                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sdgp(gptime)       1.40      0.50     0.71     2.62 1.00      948     1235
## lscale(gptime)     2.87      1.33     1.03     6.06 1.00     1869     1385
## 
## Regression Coefficients:
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept           1.32      0.46     0.34     2.24 1.00     1151      993
## ndvi_ma12           0.23      0.87    -1.53     1.87 1.00     1585     1366
## smintemp_lag4_1    -1.75      1.57    -4.47     1.66 1.01      883     1024
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     2.99      1.24     1.34     6.08 1.00     1728     1564
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Plot conditional effects from this model, on the response scale

plot(
  conditional_effects(model4),
  theme = theme_classic(),
  mean = FALSE,
  rug = TRUE,
  ask = FALSE
)

The last plot shows that the GP has clearly captured nonlinear, smooth variation over time.

GP trends in `mvgam`

We can fit a similar model in mvgam using the nb() family:

model5 <- mvgam(
  count ~ 
    s(mintemp_lag4, k = 9) +
    ndvi_ma12 +
    gp(time, c = 5/4, k = 20, scale = FALSE),
  priors = prior(normal(0, 2), class = b),
  family = nb(),
  data = data_train,
  newdata = data_test,
)

This model’s summary also now contains posterior summaries of the GP parameters and the Negative Binomial overdispersion parameter

summary(model5)

## GAM formula:
## count ~ s(mintemp_lag4, k = 9) + ndvi_ma12 + gp(time, c = 5/4, 
##     k = 20, scale = FALSE)
## 
## Family:
## negative binomial
## 
## Link function:
## log
## 
## Trend model:
## None
## 
## N series:
## 1 
## 
## N timepoints:
## 76 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 1000; warmup = 500; thin = 1 
## Total post-warmup draws = 2000
## 
## 
## Observation dispersion parameter estimates:
##        2.5% 50% 97.5% Rhat n_eff
## phi[1]  1.9 4.1   9.1    1  1844
## 
## GAM coefficient (beta) estimates:
##                    2.5%     50%  97.5% Rhat n_eff
## (Intercept)        0.45  1.4000  2.200 1.00   859
## ndvi_ma12         -1.40  0.3400  2.000 1.00  1764
## s(mintemp_lag4).1 -0.48 -0.0530  0.130 1.02   281
## s(mintemp_lag4).2 -0.98 -0.0190  0.360 1.01   160
## s(mintemp_lag4).3 -0.23 -0.0120  0.100 1.01   199
## s(mintemp_lag4).4 -0.52 -0.0064  0.210 1.01   161
## s(mintemp_lag4).5 -0.13 -0.0035  0.060 1.01   187
## s(mintemp_lag4).6 -0.39 -0.0057  0.180 1.01   171
## s(mintemp_lag4).7 -1.00  0.0061  0.590 1.01   190
## s(mintemp_lag4).8 -0.80 -0.4000  0.029 1.01   468
## gp(time).1        -8.00 -0.0680  7.700 1.00   933
## gp(time).2        -7.70 -1.5000  4.100 1.00   812
## gp(time).3        -7.00 -0.0640  6.900 1.00  1515
## gp(time).4        -3.30  2.0000  7.700 1.00   976
## gp(time).5        -5.20  0.7300  6.500 1.00  1461
## gp(time).6        -7.60 -1.5000  4.200 1.00  1808
## gp(time).7        -7.20 -1.5000  3.900 1.00  1704
## gp(time).8        -5.20  0.5800  5.900 1.00  1734
## gp(time).9        -2.80  2.6000  7.900 1.00  1148
## gp(time).10       -5.80 -0.0820  5.100 1.00  1846
## gp(time).11       -7.50 -2.1000  2.800 1.01   762
## gp(time).12       -4.70  0.6000  5.500 1.00  2030
## gp(time).13       -3.90  0.7900  5.600 1.00  1720
## gp(time).14       -6.40 -1.5000  3.600 1.00  2253
## gp(time).15       -4.40  0.0450  4.800 1.00  1741
## gp(time).16       -2.20  2.1000  6.900 1.00  2244
## gp(time).17       -3.30  1.4000  5.700 1.00  1913
## gp(time).18       -6.70 -1.8000  2.200 1.00  1712
## gp(time).19       -7.90 -3.6000 -0.470 1.00  1678
## gp(time).20       -3.80 -0.1000  3.100 1.00  1933
## 
## GAM gp term marginal deviation (alpha) and length scale (rho) estimates:
##                2.5% 50% 97.5% Rhat n_eff
## alpha_gp(time) 0.73 1.3   2.7    1   890
## rho_gp(time)   1.00 2.5   6.1    1  1256
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 10 (0%)
## E-FMI indicated no pathological behavior
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 08 6:25:50 PM 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
## 
## Use how_to_cite(model5) to get started describing this model

And plot conditional effects:

conditional_effects(model5,
                    type = 'expected')

As we saw in Tutorial 1, it can also be useful to understand first derivatives of smooth functions to get a sense of when (in history) they were increasing or decreasing. We can view the first derivative estimates for this model’s smooth trend using the following command:

plot_slopes(
  model5, 
  variable = 'time', 
  condition = 'time',
  type = 'link'
) +
  labs(y = 'First derivative (slope) of linear predictor') +
  geom_hline(yintercept = 0, linetype = 'dashed')

These two models use slightly different priors and parameterisations for the smooth functions, but they should be fairly equivalent. We can check that by looking at their LOO diagnostics

loo(model4)

## 
## Computed from 2000 by 52 log-likelihood matrix.
## 
##          Estimate   SE
## elpd_loo   -128.0  9.2
## p_loo        11.8  2.5
## looic       255.9 18.4
## ------
## MCSE of elpd_loo is NA.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.3]).
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. ESS
## (-Inf, 0.7]   (good)     49    94.2%   252     
##    (0.7, 1]   (bad)       2     3.8%   <NA>    
##    (1, Inf)   (very bad)  1     1.9%   <NA>    
## See help('pareto-k-diagnostic') for details.

loo(model5)

## 
## Computed from 2000 by 52 log-likelihood matrix.
## 
##          Estimate   SE
## elpd_loo   -127.5  8.9
## p_loo        12.5  2.3
## looic       255.0 17.8
## ------
## MCSE of elpd_loo is NA.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.3]).
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. ESS
## (-Inf, 0.7]   (good)     49    94.2%   221     
##    (0.7, 1]   (bad)       3     5.8%   <NA>    
##    (1, Inf)   (very bad)  0     0.0%   <NA>    
## See help('pareto-k-diagnostic') for details.

Of course we can also look at forecasts from the mvgam model:

plot(forecast(model5))

## Out of sample DRPS:
## 19.4762895

These look much better than the spline extrapolations we saw at the beginning of this tutorial. We can also see how the GP is extrapolating on the link scale

plot_predictions(
  model5, 
  by = 'time',
  newdata = datagrid(time = 1:max(data_test$time)),
  type = 'link'
) +
  geom_vline(xintercept = max(data_train$time), 
             linetype = 'dashed')

This is much better extrapolation behaviour compared to splines. Using how_to_cite() for this model will give you information on the approximate GPs and how they are estimated:

description <- how_to_cite(model5)

description

## Methods text skeleton

## We used the R package mvgam (version 1.1.5001; Clark & Wells, 2023) to
##   construct, fit and interrogate the model. mvgam fits Bayesian
##   State-Space models that can include flexible predictor effects in both
##   the process and observation components by incorporating functionalities
##   from the brms (Burkner 2017), mgcv (Wood 2017) and splines2 (Wang & Yan,
##   2023) packages. Gaussian Process functional effects were estimated using
##   a low-rank Hilbert space approximation following methods described by
##   Riutort-Mayol et al. (2023). The mvgam-constructed model and observed
##   data were passed to the probabilistic programming environment Stan
##   (version 2.36.0; Carpenter et al. 2017, Stan Development Team 2025),
##   specifically through the cmdstanr interface (Gabry & Cesnovar, 2021). We
##   ran 4 Hamiltonian Monte Carlo chains for 500 warmup iterations and 500
##   sampling iterations for joint posterior estimation. Rank normalized
##   split Rhat (Vehtari et al. 2021) and effective sample sizes were used to
##   monitor convergence.

## 
## Primary references

## Clark, NJ and Wells K (2023). Dynamic Generalized Additive Models
##   (DGAMs) for forecasting discrete ecological time series. Methods in
##   Ecology and Evolution, 14, 771-784. doi.org/10.1111/2041-210X.13974
## Burkner, PC (2017). brms: An R Package for Bayesian Multilevel Models
##   Using Stan. Journal of Statistical Software, 80(1), 1-28.
##   doi:10.18637/jss.v080.i01
## Wood, SN (2017). Generalized Additive Models: An Introduction with R
##   (2nd edition). Chapman and Hall/CRC.
## Wang W and Yan J (2021). Shape-Restricted Regression Splines with R
##   Package splines2. Journal of Data Science, 19(3), 498-517.
##   doi:10.6339/21-JDS1020 https://doi.org/10.6339/21-JDS1020.
## Riutort-Mayol, G, Burkner, PC, Andersen, MR, Solin, A and Vehtari, A
##   (2023). Practical Hilbert space approximate Bayesian Gaussian processes
##   for probabilistic programming. Statistics and Computing 33, 1.
##   https://doi.org/10.1007/s11222-022-10167-2
## Carpenter, B, Gelman, A, Hoffman, MD, Lee, D, Goodrich, B, Betancourt,
##   M, Brubaker, M, Guo, J, Li, P and Riddell, A (2017). Stan: A
##   probabilistic programming language. Journal of Statistical Software 76.
## Gabry J, Cesnovar R, Johnson A, and Bronder S (2025). cmdstanr: R
##   Interface to 'CmdStan'. https://mc-stan.org/cmdstanr/,
##   https://discourse.mc-stan.org.
## Vehtari A, Gelman A, Simpson D, Carpenter B, and Burkner P (2021).
##   Rank-normalization, folding, and localization: An improved Rhat for
##   assessing convergence of MCMC (with discussion). Bayesian Analysis 16(2)
##   667-718. https://doi.org/10.1214/20-BA1221.

## 
## Other useful references

## Arel-Bundock, V, Greifer, N, and Heiss, A (2024). How to interpret
##   statistical models using marginaleffects for R and Python. Journal of
##   Statistical Software, 111(9), 1-32.
##   https://doi.org/10.18637/jss.v111.i09
## Gabry J, Simpson D, Vehtari A, Betancourt M, and Gelman A (2019).
##   Visualization in Bayesian workflow. Journal of the Royal Statatistical
##   Society A, 182, 389-402. doi:10.1111/rssa.12378.
## Vehtari A, Gelman A, and Gabry J (2017). Practical Bayesian model
##   evaluation using leave-one-out cross-validation and WAIC. Statistics and
##   Computing, 27, 1413-1432. doi:10.1007/s11222-016-9696-4.
## Burkner, PC, Gabry, J, and Vehtari, A. (2020). Approximate
##   leave-future-out cross-validation for Bayesian time series models.
##   Journal of Statistical Computation and Simulation, 90(14), 2499-2523.
##   https://doi.org/10.1080/00949655.2020.1783262

Inference on GP parameters

Another big advantage of GPs is that their parameters are more directly interpretable. For instance, the length scale of a GP is related to the stationarity (or stability) of the system. We can plot how covariance is expected to change over time to understand this stability. Those of you that have worked with variograms or semi-variograms before when analysing spatial data will be used to visualizing how semivariance decays over distance. Here we will visualize how covariance decays over time distances (i.e. how far apart in time to we need to be before our estimate from the GP no longer depends on another?). We can define a few helper functions to generate these plots, which will accept posterior estimates of GP parameters from a fitted mvgam object, compute GP covariance kernels, calculate posterior quantiles and then plot them:

Click here if you would like to see the R function plot_kernels, used to plot GP kernels

# Compute the covariance kernel for a given draw
# of GP alpha and rho parameters
quad_kernel = function(rho, alpha, times){
  covs <- alpha ^ 2 * 
    exp(-0.5 * ((times / rho) ^ 2))
  return(covs)
}

# Compute kernels for each posterior draw
plot_kernels = function(gp_ests, max_time = 50){
  # set up an empty matrix to store the kernel estimates 
  # for each posterior draw
  draw_seq <- seq(0, max_time, length.out = 100)
  kernels <- matrix(NA, nrow = NROW(gp_ests), ncol = 100)
  
  # use a for-loop to estimate the kernel for each draw using
  # our custom quad_kernel() function
  for(i in 1:NROW(gp_ests)){
    kernels[i,] <- quad_kernel(
      rho = gp_ests[, grep('rho', names(gp_ests))][i],
      alpha = gp_ests[, grep('alpha', names(gp_ests))][i],
      times = draw_seq
    )
  }
  
  # Calculate posterior empirical quantiles of the kernels
  probs <- c(0.05, 0.2, 0.5, 0.8, 0.95)
  cred <- sapply(1:NCOL(kernels),
                 function(n) quantile(kernels[,n],
                                      probs = probs))
  
  # Plot the kernel uncertainty estimates
  pred_vals <- draw_seq
  plot(1, 
       xlim = c(0, max_time), 
       ylim = c(0, max(cred)), 
       type = 'n',
       xlab = 'Time difference', 
       ylab = 'Covariance',
       bty = 'L')
  box(bty = 'L', lwd = 2)
  polygon(c(pred_vals, rev(pred_vals)), 
          c(cred[1,], rev(cred[5,])),
          col = "#DCBCBC", 
          border = NA)
  polygon(c(pred_vals, rev(pred_vals)), 
          c(cred[2,], rev(cred[4,])),
          col = "#B97C7C", 
          border = NA)
  lines(pred_vals, 
        cred[3,], 
        col = "#7C0000", 
        lwd = 2.5)
}

Extract posterior estimates of the GP parameters using the as.data.frame function that we’ve been using previously. You can then plot the posterior kernel estimates for model 5 to visualize how temporal error covariance is estimated to change as two time points are further apart. To do this, all we need to supply is the posterior estimates for the GP parameters in the form of a data.frame:

gp_ests <- as.data.frame(model5, variable = 'gp_', regex = TRUE)
plot_kernels(gp_ests = gp_ests, max_time = 30)

This can be helpful to understand how long the temporal ‘memory’ of a given GP is, which also relates to how smooth it’s underlying functions are.

Exercises

Fit a model that uses a spline of time (using bs = 'bs') and a Negative Binomial family in mvgam for comparisons. Plot the 1st derivative of this temporal spline and describe how (or if) it differs from the 1st derivative of the GP effect of model5
Plot extrapolations from the spline model and take a few notes describing how these differ from the GP predictions in model5
Compare the in-sample predictive accuracies of the two models using loo_compare(). Is either model favoured over the other?
Consider watching the below lecture by Richard McElreath on Gaussian Processes and their wide uses in statistical modelling

Session Info

sessionInfo()

## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## time zone: Australia/Brisbane
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rstan_2.32.6           StanHeaders_2.32.10    ggplot2_3.5.1         
##  [4] tidybayes_3.0.7        marginaleffects_0.25.0 brms_2.22.9           
##  [7] Rcpp_1.0.14            mvgam_1.1.5001         dplyr_1.1.4           
## [10] downloadthis_0.4.1    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1      svUnit_1.0.6          farver_2.1.2         
##  [4] loo_2.8.0.9000        fastmap_1.2.0         tensorA_0.36.2.1     
##  [7] digest_0.6.37         timechange_0.3.0      mime_0.12            
## [10] lifecycle_1.0.4       processx_3.8.6        magrittr_2.0.3       
## [13] posterior_1.6.0.9000  compiler_4.4.2        rlang_1.1.5          
## [16] sass_0.4.9            tools_4.4.2           yaml_2.3.10          
## [19] collapse_2.0.19       data.table_1.17.0     knitr_1.49           
## [22] labeling_0.4.3        bridgesampling_1.1-2  curl_6.2.1           
## [25] pkgbuild_1.4.6        plyr_1.8.9            cmdstanr_0.8.0       
## [28] abind_1.4-8           klippy_0.0.0.9500     withr_3.0.2          
## [31] purrr_1.0.4           grid_4.4.2            stats4_4.4.2         
## [34] colorspace_2.1-1      inline_0.3.21         MASS_7.3-61          
## [37] scales_1.3.0          insight_1.0.2         cli_3.6.4            
## [40] mvtnorm_1.3-3         rmarkdown_2.29        generics_0.1.3       
## [43] RcppParallel_5.1.10   rstudioapi_0.17.1     reshape2_1.4.4       
## [46] cachem_1.1.0          stringr_1.5.1         splines_4.4.2        
## [49] bayesplot_1.11.1.9000 assertthat_0.2.1      parallel_4.4.2       
## [52] matrixStats_1.5.0     vctrs_0.6.5           V8_6.0.1             
## [55] Matrix_1.7-1          jsonlite_1.9.0        gratia_0.10.0        
## [58] patchwork_1.3.0       arrayhelpers_1.1-0    ggdist_3.3.2         
## [61] b64_0.1.3             jquerylib_0.1.4       tidyr_1.3.1          
## [64] glue_1.8.0            ps_1.9.0              codetools_0.2-20     
## [67] ggokabeito_0.1.0      mvnfast_0.2.8         distributional_0.5.0 
## [70] lubridate_1.9.4       stringi_1.8.4         gtable_0.3.6         
## [73] QuickJSR_1.5.2        munsell_0.5.1         tibble_3.2.1         
## [76] pillar_1.10.1         htmltools_0.5.8.1     Brobdingnag_1.2-9    
## [79] R6_2.6.1              evaluate_1.0.3        lattice_0.22-6       
## [82] backports_1.5.0       bslib_0.9.0           rstantools_2.4.0.9000
## [85] bsplus_0.1.4          zip_2.3.2             coda_0.19-4.1        
## [88] gridExtra_2.3         nlme_3.1-166          checkmate_2.3.2      
## [91] mgcv_1.9-1            xfun_0.51             fs_1.6.5             
## [94] pkgconfig_2.0.3

n.clark@uq.edu.au, https://github.com/nicholasjclark ↩︎

Ecological forecasting in R

Tutorial 2: residual correlation and dynamic processes

Nicholas Clark, School of Veterinary Science, University of Queensland¹

Exercises and associated data

Load libraries and time series data

Forecasting with temporal smooths

Exercises

Residual correlation structures

A standard Poisson GAM in `brms`

Computing residuals with `tidybayes`

Autocorrelated error GAM in `brms`

Autocorrelated error GAM in `mvgam`

Comparing inferences

AR processes and stationarity

Exercises

Gaussian Processes

GP trends in `brms`

GP trends in `mvgam`

Inference on GP parameters

Exercises

Session Info

Ecological forecasting in R

Tutorial 2: residual correlation and dynamic processes

Nicholas Clark, School of Veterinary Science, University of Queensland1

Exercises and associated data

Load libraries and time series data

Forecasting with temporal smooths

Exercises

Residual correlation structures

A standard Poisson GAM in brms

Computing residuals with tidybayes

Autocorrelated error GAM in brms

Autocorrelated error GAM in mvgam

Comparing inferences

AR processes and stationarity

Exercises

Gaussian Processes

GP trends in brms

GP trends in mvgam

Inference on GP parameters

Exercises

Session Info

Nicholas Clark, School of Veterinary Science, University of Queensland¹

A standard Poisson GAM in `brms`

Computing residuals with `tidybayes`

Autocorrelated error GAM in `brms`

Autocorrelated error GAM in `mvgam`

GP trends in `brms`

GP trends in `mvgam`