Exercises and associated data

The data and modelling objects created in this notebook can be downloaded directly to save computational time.


Users who wish to complete the exercises can download a small template R script. Assuming you have already downloaded the data objects above, this script will load all data objects so that the steps used to create them are not necessary to tackle the exercises.

Load libraries and time series data

This tutorial relates to content covered in Lecture 4, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:

library(dplyr)
library(mvgam) 
library(tidybayes)
library(ggplot2); theme_set(theme_classic())
library(marginaleffects)
A sockeye salmon (Oncorhynchus nerka) in spawning colours; photo Wikimedia Commons


The data we will use for this tutorial is spawner-recruit data for sockeye salmon (Oncorhynchus nerka) from the Bristol Bay region in Southwest Alaska. Commercial fishermen in Alaska target this important species using seines and gillnets for fresh or frozen fillet sales and canning. Records for the rivers in Bristol Bay span the years 1952-2005, and are maintained within the RAM Legacy Stock Assessment Database. Spawner estimates originated from Bristol Bay Area Annual Management Reports from the Alaska Department of Fish and Game, and are based on in-river spawner counts (including both tower counts and aerial surveys). Age-specific catch rates were calculated from records of catch and escapement according to age class. These data were used to determine the number of recruits in each year.

Map of the rivers in Bristol Bay, Alaska sampled for sockeye spawners; photo courtesy of Holmes, Ward and Scheuerell, https://atsa-es.github.io/

Manipulating data for modelling

Below we process the data, keeping observations for a single river and taking \(log(spawners_{t})\) to use as the offset in our models.

# access the data from the ATSA Github site
load(url('https://github.com/atsa-es/atsalibrary/raw/master/data/sockeye.RData'))
sockeye %>%
  
  # Remove any previous grouping structures
  dplyr::ungroup() %>%
  
  # select the target river
  dplyr::filter(region == "KVICHAK") %>%
  
  # filter the data to only include years when number of spawners
  # was estimated, as this will be used as the offset
  dplyr::filter(!is.na(spawners)) %>%
  
  # log the spawners variable to use as an appropriate offset
  dplyr::mutate(log_spawners = log(spawners)) %>%
  
  # add a 'time' variable
  dplyr::mutate(time = dplyr::row_number()) %>%
  
  # add a series variable (as a factor)
  dplyr::mutate(series = as.factor('sockeye')) %>%
  
  # keep variables of interest
  dplyr::select(time, series, log_spawners, recruits) -> model_data

The data now contain four variables:
  time, an index of the time step that each observation belongs to
  series, a factor indexing which time series each observation belongs to
  log_spawners, the logged number of recorded sockeye spawners (in thousands), measured as the number of returning spawners minus catch at each timepoint, which will be used as an offset in models
  recruits, the response variable representing the number of estimated sockeye recruits at each timepoint (in thousands)

Check the data structure

head(model_data)
## # A tibble: 6 × 4
##    time series  log_spawners recruits
##   <int> <fct>          <dbl>    <dbl>
## 1     1 sockeye         9.15    39000
## 2     2 sockeye         7.95     4090
## 3     3 sockeye         6.28      289
## 4     4 sockeye         6.52      547
## 5     5 sockeye         9.59    55300
## 6     6 sockeye         8.22     3500
dplyr::glimpse(model_data)
## Rows: 50
## Columns: 4
## $ time         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ series       <fct> sockeye, sockeye, sockeye, sockeye, sockeye, sockeye, soc…
## $ log_spawners <dbl> 9.152711, 7.951559, 6.282267, 6.522093, 9.588777, 8.21878…
## $ recruits     <dbl> 39000, 4090, 289, 547, 55300, 3500, 5360, 1120, 5750, 450…

Visualize the data as a heatmap to get a sense of where any NAs are distributed (NAs are shown as red bars in the below plot)

image(is.na(t(model_data %>%
                dplyr::arrange(dplyr::desc(time)))), axes = F,
      col = c('grey80', 'darkred'))
axis(3, at = seq(0,1, len = NCOL(model_data)), labels = colnames(model_data))

Plot properties of the time series, which indicate some overdispersion in the recruit count values that we’ll have to address

plot_mvgam_series(data = model_data, y = 'recruits')

Posterior simulation in mvgam

We will fit a number of models to these data to showcase the various methods that are available to us for evaluating and comparing dynamic GLMs / GAMs. We begin with a model that does not include a dynamic component, as this is always a useful starting point. It also allows us to highlight another useful feature of GLMs / GAMs when modelling count data, as we can include an offset variable that allows us to model rates of interest.

Sometimes it is more relevant to model the rate at which an event occurs per some underlying unit. For example, in epidemiological modelling the interest is often in the number of disease cases per head of population, but we never observe this rate directly. An offset allows us to model the underlying rate by algebraically accounting for variation in the number of possible counts in each observation unit. This works naturally in models that use a log link function, such as Poisson and Negative Binomial models, and including an offset is straightforward in most R modelling software, mvgam included.

Our first model addresses overdispersion in the observed counts by using a Negative Binomial observation model with no dynamic trend. The linear predictor contains only the implicit intercept plus an offset of the log_spawners variable that we calculated above, so the model estimates the rate of recruitment per spawner. As in previous tutorials, only 250 samples per chain are used to keep the size of the resulting object small (though we'd always want to use more samples in a real analysis)

model1 <- mvgam(recruits ~ offset(log_spawners),
              family = nb(),
              trend_model = 'None',
              data = model_data,
              samples = 250)
Check here to see a mathematical description of the model

The model can be described mathematically as follows:

\[\begin{align*}
\boldsymbol{recruits}_t & \sim \text{NegBinomial}(\lambda_t, \phi) \\
log(\lambda_t/\boldsymbol{spawners}_t) & = \alpha \\
log(\lambda_t) & = \alpha + 1 \times log(\boldsymbol{spawners}_t) \\
\alpha & \sim \text{Normal}(0, 1) \\
\phi^{-1} & \sim \text{StudentT}(3, 0, 0.1)
\end{align*}\]

Note that this is not a classic density-dependent stock-recruitment model such as the Ricker or Beverton-Holt models. Rather, this model very simplistically assumes that recruits can increase without bound as a function of the number of spawners.
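For readers curious how density dependence could be incorporated, below is a minimal sketch of a Ricker-style alternative (not fitted in this tutorial). It reconstructs the raw spawner counts from the logged offset and simply adds them as a linear covariate, so that expected recruits-per-spawner can decline as spawner abundance grows.

# reconstruct the raw spawner counts from the logged offset
ricker_data <- model_data %>%
  dplyr::mutate(spawners = exp(log_spawners))

# a Ricker-style formulation keeps the offset but adds spawners as a
# linear covariate; a negative coefficient on spawners would induce
# density dependence in the expected recruits-per-spawner
model_ricker <- mvgam(recruits ~ offset(log_spawners) + spawners,
                      family = nb(),
                      trend_model = 'None',
                      data = ricker_data,
                      samples = 250)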


The inclusion of the offset is important if we wish to make predictions on the scale of the observations. This is because no regression coefficient is assigned to this variable (its coefficient is fixed at 1), so if you wish to manually make predictions you'll need to take care to include the offset. It is returned as an attribute of the linear predictor matrix when we call predict.gam on the underlying mgcv model:

lpmatrix <- predict(model1$mgcv_model, type = 'lpmatrix')
attr(lpmatrix, 'model.offset')
##  [1]  9.152711  7.951559  6.282267  6.522093  9.588777  8.218787  7.855545
##  [8]  5.826000  6.863803 10.098232  8.237479  8.077137  7.847763  9.034796
## [15]  9.539644  7.779049  6.917706  5.424950  8.396155  9.480368  7.585789
## [22]  7.200425  8.330864  9.323669  9.769956  7.467371  7.038784  8.180321
## [29]  9.259131  8.883224  7.073270  8.711114  8.311398  9.026418  8.849371
## [36]  8.347590  8.461680  8.301522  9.028818  9.210340  7.279319  7.313220
## [43]  7.740664  8.732305  7.512071  7.003065  6.556778  7.432484  8.612503
## [50]  7.749322

To make predictions on the outcome scale, we might do something like the following:

# extract posterior estimates of the observation linear predictor betas
betas_post <- as.data.frame(model1, variable = 'betas')

# generate the linear predictor matrix for the fitted data
# (note, you can supply newdata here to predict for new scenarios)
lpmatrix <- predict(model1$mgcv_model, type = 'lpmatrix')

# set up a matrix to store linear predictor values
linpreds <- matrix(NA, nrow = NROW(betas_post), ncol = NROW(lpmatrix))

# loop across all posterior estimates
for(i in 1:NROW(betas_post)){
  
  # multiply covariate values by the beta estimates (including intercepts)
  linpreds[i, ] <- lpmatrix %*% betas_post[i, ] + 
    
    # add the offset, which will be equal to 0 if no offset was
    # supplied to the model
    as.vector(attr(lpmatrix, 'model.offset'))
}

# plot a histogram to see the distribution of these values
hist(linpreds,
     col = 'darkred',
     border = 'white',
     xlab = expression(Linear~predictor~(log(mu))),
     ylab = '',
     yaxt = 'n',
     main = '',
     lwd = 2)

This same type of process underlies many of mvgam’s prediction / plotting functions. For example, to plot partial effects of smooth terms, the plot_mvgam_smooth function will set up a grid of evenly-spaced hypothetical values as newdata to predict the contribution of the smooth term to the linear predictor. It does this by essentially ignoring the contributions of all other predictors (setting their values to 0 in the linear predictor matrix). The same process also underlies the functions used to plot conditional effects and posterior contrasts (you can learn a great deal about how these newdata grids are set up by looking over the documentation in the marginaleffects book and by calling ?marginaleffects::datagrid)
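As a small illustration (a sketch, assuming the default marginaleffects behaviour of holding unspecified variables at representative values), we could build one of these prediction grids ourselves:

# build a hypothetical grid that varies log_spawners over a plausible range
# while any other variables are held at representative values
newdat <- marginaleffects::datagrid(model = model1,
                                    log_spawners = seq(6, 10, length.out = 25))
head(newdat)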

Once these predictions are made for the linear predictor, you could then compute posterior expectations (in this case by exponentiating the values with exp(linpreds) to “undo” the log link). Or you could compute posterior predictions by incorporating the uncertainty in the observation model:

# extract posterior estimates of the overdispersion parameter phi
phi_post <- as.data.frame(model1, variable = 'obs_params')

# set up a matrix to store posterior predictions
ypreds <- matrix(NA, nrow = NROW(betas_post), ncol = NROW(lpmatrix))

# loop across all posterior estimates
for(i in 1:NROW(betas_post)){
  
  # take Negative Binomial draws, being sure to exponentiate the
  # linear predictors and to use the appropriate overdispersion 
  # estimate for the 'size' argument
  ypreds[i, ] <- rnbinom(n = NROW(lpmatrix),
                         mu = exp(linpreds[i, ]),
                         size = phi_post[i,])
}

# plot a histogram to see the distribution of these values
hist(ypreds,
     xlim = c(0, quantile(as.vector(ypreds), probs = 0.95)),
     breaks = 150,
     col = 'darkred',
     border = 'white',
     xlab = 'Posterior predictions',
     ylab = '',
     yaxt = 'n',
     main = '',
     lwd = 2)

mvgam makes this process straightforward using the predict.mvgam, hindcast.mvgam and forecast.mvgam functions. For example, we could replicate the linear prediction calculations using the following:

linpreds2 <- predict(model1, 
                     type = 'link', 
                     process_error = FALSE)
all.equal(linpreds, linpreds2)
## [1] "Attributes: < Length mismatch: comparison on first 1 components >"     
## [2] "Attributes: < Component \"dim\": Mean relative difference: 0.9485714 >"
## [3] "Numeric: lengths (50000, 200) differ"

Notice how we explicitly set process_error to FALSE so that the calculations of the linear predictor will ignore any contributions of dynamic trends. This isn’t applicable in this example, but it becomes important to consider when making predictions from dynamic models. We could also compute expectations using predict.mvgam:

linpreds3 <- predict(model1, type = 'expected', process_error = FALSE)
all.equal(exp(linpreds), linpreds3)
## [1] "Attributes: < Length mismatch: comparison on first 1 components >"     
## [2] "Attributes: < Component \"dim\": Mean relative difference: 0.9485714 >"
## [3] "Numeric: lengths (50000, 200) differ"

And finally we can calculate posterior predictions, which consider uncertainty in the Negative Binomial sampling distribution:

ypreds2 <- predict(model1, type = 'response', process_error = FALSE)
all.equal(ypreds, ypreds2)
## [1] "Attributes: < Length mismatch: comparison on first 1 components >"     
## [2] "Attributes: < Component \"dim\": Mean relative difference: 0.9485714 >"
## [3] "Numeric: lengths (50000, 200) differ"

These values will not be identical due to the random nature of the sampling distribution. If you are computing predictions / expectations for new data, the predict.mvgam function should be your primary choice. But if you merely want to return predictions / expectations for the training data and/or any testing data supplied to the model, you will often be better off using the hindcast.mvgam and forecast.mvgam functions. For example, here are the hindcasts of the expectations, which may differ very slightly from those we computed above due to variation in the floating point arithmetic used by Stan and R:

hc_expectations <- hindcast(model1, type = 'expected')
all.equal(exp(linpreds), hc_expectations$hindcasts$sockeye)
## [1] "Mean relative difference: 1.275639e-05"

As we’ve seen before, these can be plotted in the same general way as other mvgam plotting functions:

plot(hc_expectations)

The model summary is shown below, which indicates there are no major sampling issues to worry about

summary(model1)
## GAM formula:
## recruits ~ offset(log_spawners)
## 
## Family:
## negative binomial
## 
## Link function:
## log
## 
## Trend model:
## None
## 
## N series:
## 1 
## 
## N timepoints:
## 50 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 750; warmup = 500; thin = 1 
## Total post-warmup draws = 1000
## 
## 
## Observation dispersion parameter estimates:
##        2.5% 50% 97.5% Rhat n_eff
## phi[1]  1.2 1.7   2.3    1   615
## 
## GAM coefficient (beta) estimates:
##             2.5%  50% 97.5% Rhat n_eff
## (Intercept) 0.61 0.82   1.1 1.01   573
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 1000 iterations ended with a divergence (0%)
## 0 of 1000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
## 
## Samples were drawn using NUTS(diag_e) at Tue May 07 11:49:43 AM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)

Note though that the hindcast.mvgam and forecast.mvgam functions simply extract any predictions that were made in the original model, so will always include uncertainty in any underlying dynamic components.

Posterior predictive checks

Posterior predictive checks are useful for understanding how well your model can capture particular features of your observed data. In mvgam, the pp_check() function is particularly helpful as it relies on the very broad range of checking functions that are available in bayesplot. For example, we can plot a density kernel of our observed counts and overlay simulated kernels from our model. These simulated kernels are generated using predict.mvgam(), so they are informative for all mvgam models (regardless of whether or not the model included dynamic components).

pp_check(model1, type = 'dens_overlay', ndraws = 25)
## Warning in pp_check.mvgam(model1, type = "dens_overlay", ndraws = 25): NA
## responses are not shown in 'pp_check'.

The model generally tends to capture the shape of the observed data. We can confirm this behaviour by inspecting a probability integral transform (PIT) ECDF plot to again compare observed vs predicted values. For a well-calibrated model, the empirical CDF of the PIT values (in dark red) should fall within the uncertainty envelope (in light red) expected under a standard uniform distribution.

pp_check(model1, type = 'pit_ecdf')
## Using all posterior draws for ppc type 'pit_ecdf' by default.
## Warning in pp_check.mvgam(model1, type = "pit_ecdf"): NA responses are not
## shown in 'pp_check'.

The pattern suggests the model is fitting the data well, which we can further confirm by comparing observed vs simulated CDFs:

pp_check(model1, type = 'ecdf_overlay')
## Using 10 posterior draws for ppc type 'ecdf_overlay' by default.
## Warning in pp_check.mvgam(model1, type = "ecdf_overlay"): NA responses are not
## shown in 'pp_check'.

pp_check() even accepts custom functions. For example, if we want to know whether our model simulates counts that are < 1000 as frequently as they occur in the observed data, we can use:

prop_under1000 <- function(x){
  length(which(x < 1000)) / length(x)
}
pp_check(model1, type = "stat", stat = "prop_under1000")
## Using all posterior draws for ppc type 'stat' by default.
## Warning in pp_check.mvgam(model1, type = "stat", stat = "prop_under1000"): NA
## responses are not shown in 'pp_check'.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are many other types of posterior predictive checks that can be implemented, and it is highly recommended that you design some to suit your own specific needs (i.e. what aspect(s) of the data are you most concerned about your model being able to adequately capture?). See ?mvgam::pp_check for details
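For instance, because any R function that returns a single summary statistic can be supplied, a quick check of whether the model reproduces the largest observed count might look like this:

# compare the distribution of maximum simulated counts against the
# maximum observed count
pp_check(model1, type = 'stat', stat = 'max')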

Plot Dunn-Smyth residuals and posterior hindcasts to further inspect model fit

plot(model1, type = 'residuals')

plot(model1, type = 'forecast')

Here it is worth quickly revisiting the uncertainty in these predicted values, and how that differs from uncertainty in our expected values (calculated above as hc_expectations). Remember that our goal is to produce a forecast distribution because we do not know what the future holds and we can therefore consider it to be a random variable drawn from some probability distribution. Often we visualise forecasts as lines and ribbons to show particular summaries of this forecast distribution, but it can be easier to understand where these summaries come from if we instead view realizations from this distribution (i.e. random draws). For example, here are 5 realizations from our expected hindcast values…

plot(hc_expectations, realisations = TRUE, n_realisations = 5)

…and here are 20 realizations:

plot(hc_expectations, realisations = TRUE, n_realisations = 20)

Note how there isn’t much uncertainty in these values because all uncertainty is coming from variation in the intercept term \(\alpha\). In sharp contrast, there is larger uncertainty in our hindcast predictions because these predictions also consider variation in the Negative Binomial sampling distribution (including uncertainty in the overdispersion parameter \(\phi\)). As above, here are 5 realizations for the hindcast predictions…

hc_predictions <- hindcast(model1, type = 'response')
plot(hc_predictions, realisations = TRUE, n_realisations = 5)

…and here are 20 realizations:

plot(hc_predictions, realisations = TRUE, n_realisations = 20)

predict vs hindcast

Back to the modelling strategy. Although the model is trying to capture the dispersion in the data (the summary above shows that the \(\phi\) parameter is estimated to be very small, indicating a large amount of overdispersion after accounting for the offset), there are still some patterns in the residuals, including substantial autocorrelation at a lag of 1. We need a dynamic model to capture this autocorrelation, so next we add an AR1 latent residual process to the model
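Before doing so, the lag-1 autocorrelation can be confirmed directly from the residuals (a minimal sketch, assuming residuals() returns a posterior summary matrix with an 'Estimate' column, as in other mvgam summaries):

# compute the autocorrelation function of the median Dunn-Smyth residuals
resids <- residuals(model1)
acf(resids[, 'Estimate'], na.action = na.pass, main = '')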

model2 <- mvgam(recruits ~ offset(log_spawners),
              family = nb(),
              trend_model = AR(),
              data = model_data)

Because this model has a dynamic trend component, it is worth quickly revisiting the differences between the predict.mvgam function and the hindcast.mvgam / forecast.mvgam functions. First, we investigate how ignoring the uncertainty in the trend component changes the predictions. In this step, we calculate predictions for the observed time period in two ways: first including the uncertainty from the latent trend, and second ignoring this uncertainty. Plotting both of these together shows how our prediction uncertainty changes when we include the uncertainty in the dynamic trend component. A crucial detail to understand here is that the predict.mvgam function does not account for the actual 'times' of the data; rather, it assumes the dynamic process has reached stationarity and simply samples from the following distribution for the trend: \(z_t \sim \text{Normal}(0, \sigma_{error})\)
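To make that assumption concrete, here is a minimal sketch of what drawing a stationary trend value looks like, assuming the latent trend parameters can be extracted with variable = 'trend_params' and that the process standard deviation is labelled sigma[1] (as in the model summary shown later):

# extract posterior draws of the latent trend parameters
trend_pars <- as.data.frame(model2, variable = 'trend_params')

# under the stationarity assumption described above, a trend value is
# simply drawn from a zero-centred Normal with the estimated process sd
z_draws <- rnorm(NROW(trend_pars), mean = 0, sd = trend_pars[, 'sigma[1]'])
hist(z_draws, col = 'darkred', border = 'white',
     main = '', xlab = 'Simulated stationary trend values')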

# calculate predictions for the training data by including the dynamic process
# uncertainty
ypreds <- predict(model2, type = 'response', 
                  process_error = TRUE, summary = FALSE)

# calculate 95% empirical quantiles of predictions
cred <- sapply(1:NCOL(ypreds),
                   function(n) quantile(ypreds[,n],
                                        probs = c(0.025, 0.975), na.rm = T))
pred_vals <- 1:NCOL(ypreds)

# graphical plotting parameters
layout(matrix(1:2, ncol = 2))
par(mar = c(2.5, 2.3, 2, 2),
    oma = c(1, 1, 0, 0),
    mgp = c(1.5, 0.5, 0))
    
# plot credible intervals    
plot(1, type = "n", bty = 'L',
     xlab = 'Time',
     ylab = 'predict.mvgam predictions',
     xlim = c(min(pred_vals), max(pred_vals)),
     ylim = c(min(cred, na.rm = T),
              max(cred, na.rm = T)))
polygon(c(pred_vals, rev(pred_vals)), c(cred[1,], rev(cred[2,])),
        col = 'darkred', border = NA)

# now repeat, but this time ignore uncertainty in the dynamic component
ypreds2 <- predict(model2, type = 'response', 
                   process_error = FALSE, summary = FALSE)
cred2 <- sapply(1:NCOL(ypreds2),
               function(n) quantile(ypreds2[,n],
                                    probs = c(0.025, 0.975), na.rm = T))

# overlay these predictions as a lighter colour
polygon(c(pred_vals, rev(pred_vals)), c(cred2[1,], rev(cred2[2,])),
        col = "#DCBCBC", border = NA)

# overlay the actual observations as well
points(model_data$recruits, pch = 16, cex = 0.8, col = 'white')
points(model_data$recruits, pch = 16, cex = 0.65)
box(bty = 'l', lwd = 2)

# now plot the actual hindcasts, which did not assume the process was 
# stationary but instead modelled its dynamics over the training time.
# use the same y-axis limits to see how these plots differ
plot_mvgam_fc(model2, ylim = c(min(cred, na.rm = T),
                             max(cred, na.rm = T)),
              ylab = 'Hindcast predictions')

layout(1)

As you can see, the predictions differ. The hindcast.mvgam and forecast.mvgam functions use the time component of the data, so it is crucial that any newdata fed to the forecast.mvgam function follows on sequentially from the data that was used to fit the model (this is not internally checked by the package because it might be a headache to do so when data are not supplied in a specific time-order). But the predictions from these functions might not be as useful for scenario-planning because any scenarios would then be required to have an explicit time dimension. Sometimes we’d rather get a sense of how predictions might change on average if some external conditions changed, for example if temperature were to increase by x degrees or if we were to sample from a new group for a random effect. For exploring these types of scenarios, the predict.mvgam function is more useful. This is the function that is used in any calls to plot_predictions, for example.
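For example, a quick conditional plot of how predicted recruits change across a range of log_spawners values could be produced with the following (a sketch; the grid of conditioning values is chosen automatically by marginaleffects):

# conditional predictions across a grid of log_spawners values,
# ignoring the time dimension of the data
plot_predictions(model2, condition = 'log_spawners', type = 'response')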

As an illustration of some of the many uses for the predict.mvgam function, here we compute how many more recruits this model would expect to see if there were 5,000 spawners recorded vs only 3,000. We can take full advantage of the avg_comparisons function from the marginaleffects package to do this (see ?marginaleffects::avg_comparisons for details)

# take draws from the model that compute the average comparison between
# spawners recorded at 5,000 vs 3,000 (be sure to log first!)
post_contrasts <- avg_comparisons(model2, 
                        variables = list(log_spawners = log(c(3000, 5000)))) %>%
  posteriordraws() 

# use the resulting posterior draw object to plot a density of the 
# posterior contrasts
post_contrasts %>%
  ggplot(aes(x = draw)) +
  
  # use the stat_halfeye function from tidybayes for a nice visual
  stat_halfeye(fill = "#C79999") +
  labs(x = "(spawners = 5,000) − (spawners = 3,000)", y = "Density", 
       title = "Average posterior contrast") +
  theme_classic()

Exercises

  1. Plot a histogram of model1’s posterior intercept coefficients \(\alpha\). Use ?mvgam::as.data.frame.mvgam for guidance
  2. Plot a scatterplot of recruits vs log_spawners using the supplied model_data. Take a few notes on whether you think our primary modelling assumption, that log(recruits) scales linearly with log_spawners, is justified.
Check here if you’re having trouble extracting the intercept posterior
# Replace the ? with the correct value(s)
# check names of variables that can be extracted
vars_available <- variables(model1)

# we are interested in the observation model coefficients
vars_available$observation_betas

# extract the intercepts using the as.data.frame.mvgam function
# you will have two options for extracting this variable. first, you
# can use the alias
alphas <- as.data.frame(model1, variable = ?)

# alternatively, use the original name
alphas <- as.data.frame(model1, variable = ?)

# plot a histogram
hist(alphas[,1], xlab = expression(alpha),
     main = '', col = 'darkred', border = 'white')

Comparing dynamic models

Now on to the topic of model comparisons. Plot Dunn-Smyth residuals for this dynamic model to see that there is no more autocorrelation

plot(model2, type = 'residuals')

However we must be cautious interpreting these residuals because the AR1 trend model is extremely flexible. The in-sample fits can sometimes be nearly perfect, meaning the residuals will very often look excellent! We can confirm this behaviour by inspecting hindcasts

plot(model2, type = 'forecast')

But PPCs can still give valid insights into model inadequacy:

pp_check(model2, type = 'dens_overlay')
## Using 10 posterior draws for ppc type 'dens_overlay' by default.
## Warning in pp_check.mvgam(model2, type = "dens_overlay"): NA responses are not
## shown in 'pp_check'.

pp_check(model2, type = 'pit_ecdf')
## Using all posterior draws for ppc type 'pit_ecdf' by default.
## Warning in pp_check.mvgam(model2, type = "pit_ecdf"): NA responses are not
## shown in 'pp_check'.

These plots show that the model still tends to predict well. But there are some more worrying and pressing matters we can address with this model. Inspect the summary to see what I mean:

summary(model2)
## GAM formula:
## recruits ~ offset(log_spawners)
## 
## Family:
## negative binomial
## 
## Link function:
## log
## 
## Trend model:
## AR1
## 
## N series:
## 1 
## 
## N timepoints:
## 50 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 1000; warmup = 500; thin = 1 
## Total post-warmup draws = 2000
## 
## 
## Observation dispersion parameter estimates:
##        2.5% 50% 97.5% Rhat n_eff
## phi[1]  4.3  16   310 1.07    64
## 
## GAM coefficient (beta) estimates:
##                2.5% 50% 97.5% Rhat n_eff
## (Intercept) -0.0077 0.6   2.1 1.09    32
## 
## Latent trend parameter AR estimates:
##          2.5%  50% 97.5% Rhat n_eff
## ar1[1]   0.28 0.63  0.95 1.03   106
## sigma[1] 0.53 0.72  0.94 1.01   271
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhats above 1.05 found for 43 parameters
##  *Diagnose further to investigate why the chains have not mixed
## 56 of 2000 iterations ended with a divergence (2.8%)
##  *Try running with larger adapt_delta to remove the divergences
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## Chain 4: E-FMI = 0.0674
##  *E-FMI below 0.2 indicates you may need to reparameterize your model
## 
## Samples were drawn using NUTS(diag_e) at Tue May 07 11:50:47 AM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)

The HMC sampler is telling us there are clear issues when trying to implement both a flexible latent trend AND an overdispersion parameter. Both of these components are competing to capture the extra dispersion in the observations, leading to low effective sample sizes and poor convergence for the overdispersion parameter \(\phi\). It turns out to be very challenging to estimate them both without making some strong adjustments to the model. We now have several options:

  1. Use containment priors to restrict one or both of the processes (i.e. pull the overdispersion parameter towards very large values to make the observations behave like a Poisson a priori; a sketch of how the relevant priors could be inspected is shown after this list)
  2. Keep the Negative Binomial observation model and use a smooth Gaussian Process for the trend
  3. Drop the Negative Binomial observation model (using a Poisson instead) and use a flexible Autoregressive trend
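As a brief sketch of how option 1 could be explored: the get_mvgam_priors function lists the priors that can be modified for a given model specification, and the parameter names shown in its output are what you would use when supplying updated (e.g. containment) priors through mvgam's priors argument.

# inspect which priors can be changed for this model specification
test_priors <- get_mvgam_priors(recruits ~ offset(log_spawners),
                                family = nb(),
                                trend_model = AR(),
                                data = model_data)
test_priors

# an edited version of this data.frame (for example, concentrating the
# overdispersion parameter on large values) could then be supplied to
# mvgam() via the 'priors' argument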

Here we will illustrate the latter two options. First, we try the Negative Binomial observation model with a smooth Gaussian Process for the trend

model3 <- mvgam(recruits ~ offset(log_spawners),
              family = nb(),
              trend_model = 'GP',
              data = model_data)

Next we use a Poisson observation model with an AR1 trend

model4 <- mvgam(recruits ~ offset(log_spawners),
              family = poisson(),
              trend_model = AR(),
              data = model_data)

Neither of these models has any major issues with sampling, which is a good sign!

summary(model3)
## GAM formula:
## recruits ~ offset(log_spawners)
## 
## Family:
## negative binomial
## 
## Link function:
## log
## 
## Trend model:
## GP
## 
## N series:
## 1 
## 
## N timepoints:
## 50 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 1000; warmup = 500; thin = 1 
## Total post-warmup draws = 2000
## 
## 
## Observation dispersion parameter estimates:
##        2.5% 50% 97.5% Rhat n_eff
## phi[1]  1.4 2.2   3.2    1  1047
## 
## GAM coefficient (beta) estimates:
##             2.5%  50% 97.5% Rhat n_eff
## (Intercept) 0.19 0.74   1.3    1  1067
## 
## Latent trend marginal deviation (alpha) and length scale (rho) estimates:
##             2.5%  50% 97.5% Rhat n_eff
## alpha_gp[1] 0.11 0.55   1.1    1   705
## rho_gp[1]   1.10 2.80  18.0    1   607
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 13 of 2000 iterations ended with a divergence (0.65%)
##  *Try running with larger adapt_delta to remove the divergences
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
## 
## Samples were drawn using NUTS(diag_e) at Tue May 07 11:51:57 AM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
summary(model4)
## GAM formula:
## recruits ~ offset(log_spawners)
## 
## Family:
## poisson
## 
## Link function:
## log
## 
## Trend model:
## AR1
## 
## N series:
## 1 
## 
## N timepoints:
## 50 
## 
## Status:
## Fitted using Stan 
## 4 chains, each with iter = 1000; warmup = 500; thin = 1 
## Total post-warmup draws = 2000
## 
## 
## GAM coefficient (beta) estimates:
##              2.5%  50% 97.5% Rhat n_eff
## (Intercept) 0.074 0.49   1.1 1.02   246
## 
## Latent trend parameter AR estimates:
##          2.5%  50% 97.5% Rhat n_eff
## ar1[1]   0.30 0.57  0.86    1   824
## sigma[1] 0.63 0.77  0.98    1  1306
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 32 of 2000 iterations saturated the maximum tree depth of 12 (1.6%)
##  *Run with max_treedepth set to a larger value to avoid saturation
## E-FMI indicated no pathological behavior
## 
## Samples were drawn using NUTS(diag_e) at Tue May 07 11:53:11 AM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)

Hindcasts and trend estimates are very different for the two models:

plot(model3, type = 'trend')

plot(model4, type = 'trend')

plot(model3, type = 'forecast')

plot(model4, type = 'forecast')

Approximate Leave-One-Out CV

Here the extreme flexibility of the AR1 trend is apparent in the hindcasts for model4. Hopefully this gives you more of an idea about why we can't put too much faith in residual or fitted-value diagnostics for these flexible dynamic models, which means in-sample fits cannot guide reasonable comparisons of these two models. It would be more appropriate to use 1-step-ahead residuals for interrogation, but computing these is extremely computationally demanding. How then can we compare models? Using loo_compare(), we can ask which model would likely give better predictions for new data within the training set. This approach to approximating leave-one-out prediction error, which uses Pareto Smoothed Importance Sampling (PSIS), is widely used for comparing Bayesian models and is generally interpreted in the same way that AIC comparisons can be interpreted. Here I use a modified loo() function that will be implemented in the next CRAN release of mvgam to compute in-sample LOO values (don't worry about these technicalities):

loo_hcs = function(x, ...){
  x$series_names <- levels(x$obs_data$series)
  logliks <- logLik(x, 
                    linpreds = predict(x,
                                       newdata = x$obs_data,
                                       type = 'link',
                                       summary = FALSE,
                                       process_error = TRUE),
                    newdata = x$obs_data,
                    family_pars = mvgam:::extract_family_pars(x))
  logliks <- logliks[,!apply(logliks, 2, function(x) all(!is.finite(x)))]
  releffs <- loo::relative_eff(exp(logliks),
                               chain_id = sort(rep(1:x$model_output@sim$chains,
                                                   (NROW(logliks) /
                                                      x$model_output@sim$chains))))
  loo::loo(logliks, r_eff = releffs, ...)
}
loo::loo_compare(list(model3 = loo_hcs(model3), 
                      model4 = loo_hcs(model4)))
##        elpd_diff  se_diff   
## model3        0.0        0.0
## model4 -7095298.8  1057034.9

The Gaussian Process / Negative Binomial model is strongly favoured here, yielding much higher values of the Expected Log Predictive Density (ELPD; also known as the log score). This is one useful metric for choosing among models, but we’d also like to compare them on their abilities to predict future out of sample data.

Approximate Leave-Future-Out CV

mvgam includes functions that facilitate leave-future-out comparisons. Time series models are often evaluated using an expanding window training technique, where the model is initially trained on some subset of data from \(t = 1\) to \(t = n_{train}\) and then used to produce forecasts for the next \(h\) time steps \((t = n_{train} + 1, \dots, n_{train} + h)\). In the next iteration, the training data are expanded by a single time point and the process is repeated. This is obviously computationally challenging for Bayesian time series models, as the number of refits can be very large.

mvgam uses an approximation based on importance sampling. Briefly, we refit the model using the first \(min_t\) observations to perform a single exact \(h\)-step-ahead forecast, which is evaluated against the out-of-sample observations at times \(min_t + 1, \dots, min_t + h\) using the ELPD, as with the LOO comparisons above. Next, we approximate each successive round of expanding-window forecasts by moving forward one step at a time (for \(i\) in \(1:N_{evaluations}\)) and re-weighting draws from the model's posterior predictive distribution using PSIS. In each iteration \(i\), PSIS weights are obtained for all observations that would have been included in the model had we re-fit it. If these importance ratios are stable, we consider the approximation adequate and use the re-weighted posterior's forecast to evaluate the next holdout set of testing observations \(((min_t + i + 1):(min_t + i + h))\). This is similar to particle filtering, where forecasts are updated in light of new data by re-weighting posterior draws using importance weights. At some point, however, the variability of the importance ratios becomes too large and importance sampling becomes unreliable, which is indicated by the estimated shape parameter k of the generalized Pareto distribution crossing a specified threshold pareto_k_threshold. Only then do we refit the model using all observations up to the time of the failure, restart the process, and iterate forward until the next refit is triggered. The process, implemented in the lfo_cv function, is computationally much more efficient because only a fraction of the evaluations typically requires a refit (the algorithm is described in detail by Bürkner et al. 2020).

Run the approximator for the first of these two models (the Gaussian Process / Negative Binomial model3), setting min_t = 20 and using a forecast horizon of fc_horizon = 3

lfo_gp <- lfo_cv(model3, min_t = 20, 
                 pareto_k_threshold = 0.5,
                 fc_horizon = 3, n_cores = 4)

The S3 plotting function for these lfo_cv objects will show the Pareto-k values and ELPD values over the evaluation time points. For the Pareto k plot, a dashed red line indicates the specified threshold chosen for triggering model refits. For the ELPD plot, a dashed red line indicates the bottom 10% quantile of ELPD values. Points below this threshold may represent outliers that were more difficult to forecast

plot(lfo_gp)

The top plot (the Pareto k plot) indicates that we only needed one additional refit to give good approximate predictions for the Gaussian Process / Negative Binomial model

lfo_ar <- lfo_cv(model4, min_t = 20, 
                 pareto_k_threshold = 0.5,
                 fc_horizon = 3, n_cores = 4)
plot(lfo_ar)

The Poisson / AR model, in contrast, needed to be refit many times to ensure the stability of importance-weighted predictions. We would generally expect more refits for this model as the underlying trend has less “memory” than the GP trend, so its forecasts could become irrelevant more quickly as a result.

Compare ELPDs for the two models at all evaluation timepoints where observations weren't missing by plotting the differences in their scores. This plot will give us an indication of whether one model consistently gives better h-step-ahead forecasts when trained on successively more data

# determine which timepoints were not missing in the evaluation set
nonmiss_evalpoints <- which(!is.na(model_data$recruits[26:NROW(model_data)]))

# plot the GP ELPD values minus the AR ELPD values
# values will be > 0 if the GP model gave better predictions
# (ELPD is better when larger)
gp_minus_ar <- lfo_gp$elpds[nonmiss_evalpoints] - lfo_ar$elpds[nonmiss_evalpoints]
plot(gp_minus_ar,
     ylab = 'GP - AR ELPD',
     xlab = 'Time point',
     pch = 16, 
     bty = 'l',
     cex = 1,
     col = 'white',
     xaxt = 'n')
axis(side = 1, at = nonmiss_evalpoints,
     labels = lfo_gp$eval_timepoints[nonmiss_evalpoints])
abline(h = 0, lty = 'dashed', lwd = 2)

# show all ELPD comparisons in black
points(gp_minus_ar, pch = 16, cex = 0.85)

# highlight the comparisons for when the GP predictions were better
points(ifelse(gp_minus_ar > 0, gp_minus_ar, NA), 
       pch = 16, cex = 0.85, col = 'darkred')

The Gaussian Process / Negative Binomial model tends to give better probabilistic predictions throughout the leave-future-out exercise. This result strongly agrees with the results from the leave-one-out approximator above, which makes model selection easy in this case. In-sample residuals for the Gaussian Process / Negative Binomial model look fairly good as well, suggesting we should definitely stick with this model going forward:

plot(model3, type = 'residuals')

Exact Leave-Future-Out CV

If we have enough computational power and our model is simple / small enough, it can often be worthwhile to compute exact leave-future-out forecasts. The update.mvgam function makes this relatively simple to do in mvgam, as all we need to do is supply the new testing and training datasets and the model will be updated. Below I show how to do this for our chosen model (model3) by splitting on timepoint 36 to create training and testing data.

data_train = model_data %>%
  dplyr::filter(time <= 36)
data_test = model_data %>%
  dplyr::filter(time > 36)

Now refit the model using the update.mvgam function. Note that you can supply the argument lfo = TRUE if you wish to skip the computation of residuals and lighten the final model. This can be helpful if all you plan to do is compare forecasts, as it will make repeated calls to the update function faster. See ?mvgam::update.mvgam for more details

model3_exact <- update(model3, 
                       data = data_train,
                       newdata = data_test,
                       lfo = TRUE)

Once this model is finished, we are ready to compute exact forecast evaluations. The first step in this process is to create an object of class mvgam_forecast using the forecast.mvgam function:

fc <- forecast(model3_exact)

We've seen this class before when calling the hindcast.mvgam or forecast.mvgam functions, but we haven't looked in-depth at the other ways we can use objects of this class. A primary purpose is to readily allow forecast evaluations for each series in the data, using a variety of possible scoring functions. See ?mvgam::score.mvgam_forecast to view the types of scores that are available. For these data, which have very large counts, a useful scoring metric is the Continuous Ranked Probability Score (CRPS). A CRPS value is similar to what we might get if we calculated a weighted absolute error using the full forecast distribution.

scores <- score(fc, score = 'crps')
scores$sockeye
##        score in_interval interval_width eval_horizon score_type
## 1  6022.7278           0            0.9            1       crps
## 2  3560.9737           1            0.9            2       crps
## 3  6604.2062           1            0.9            3       crps
## 4  6794.8664           1            0.9            4       crps
## 5   881.5366           1            0.9            5       crps
## 6  1757.0302           1            0.9            6       crps
## 7  2622.2656           1            0.9            7       crps
## 8         NA          NA            0.9            8       crps
## 9         NA          NA            0.9            9       crps
## 10        NA          NA            0.9           10       crps
## 11        NA          NA            0.9           11       crps
## 12        NA          NA            0.9           12       crps
## 13        NA          NA            0.9           13       crps
## 14        NA          NA            0.9           14       crps

The returned data.frame now shows the CRPS score for each evaluation in the testing data, along with some other useful information about the fit of the forecast distribution. In particular, we are given a logical value (1s and 0s) telling us whether the true value was within a pre-specified credible interval (i.e. the coverage of the forecast distribution). The default interval width is 0.9, so we would hope that the values in the in_interval column take a 1 approximately 90% of the time. This value can be changed if you wish to compute different coverages, say using a 60% interval:

scores <- score(fc, score = 'crps', interval_width = 0.6)
scores$sockeye
##        score in_interval interval_width eval_horizon score_type
## 1  6022.7278           0            0.6            1       crps
## 2  3560.9737           0            0.6            2       crps
## 3  6604.2062           0            0.6            3       crps
## 4  6794.8664           1            0.6            4       crps
## 5   881.5366           1            0.6            5       crps
## 6  1757.0302           0            0.6            6       crps
## 7  2622.2656           0            0.6            7       crps
## 8         NA          NA            0.6            8       crps
## 9         NA          NA            0.6            9       crps
## 10        NA          NA            0.6           10       crps
## 11        NA          NA            0.6           11       crps
## 12        NA          NA            0.6           12       crps
## 13        NA          NA            0.6           13       crps
## 14        NA          NA            0.6           14       crps
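As a quick calibration check, the empirical coverage over the non-missing horizons can be compared against the chosen interval width:

# proportion of non-missing test observations falling inside the 60% interval
mean(scores$sockeye$in_interval, na.rm = TRUE)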

Although we would prefer to use distributional forecast evaluations, point evaluations are also possible using outputs from the forecast.mvgam function. For example, we can calculate the Mean Absolute Percentage Error (MAPE), which is a commonly used metric to evaluate point forecasts when the scales of the time series vary quite a lot:

# extract the true values from the test set
truth <- fc$test_observations$sockeye

# calculate mean prediction values for each column in the forecast matrix
mean_fc <- apply(fc$forecasts$sockeye, 2, mean)

# calculate the absolute percentage errors; note that here the errors are
# scaled by the forecast means, and that we only use the first 7 observations
# in the test set because the rest are NA
p <- abs(100*((truth[1:7] - mean_fc[1:7]) / mean_fc[1:7]))

# take the mean of the absolute percentage errors
mean(p)
## [1] 73.3368

Finally, we can look at one more type of score. The lfo_cv function we used above for approximate leave-future-out CV can currently only use the ELPD to score forecasts. This is fine as the ELPD is a strictly proper scoring rule that can be applied to any distributional forecast, but there is currently no straightforward way to calculate it for an object of class mvgam_forecast like our fc object. We can, however, calculate it for all data that was fed to the model (in this case the training and testing data) using the logLik.mvgam function. For example, we can calculate the predictive density for our refit like this:

# calculate predictive density values for each posterior draw for the
# training and testing observations
log_scores <- logLik(model3_exact, n_cores = 2)
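As a quick sketch (assuming the usual layout of one row per posterior draw and one column per observation), these scores can be summarised per observation to see which timepoints the model found hardest to predict:

# dimensions: posterior draws x observations (training plus testing)
dim(log_scores)

# posterior mean log score for each observation; lower values indicate
# observations that were more surprising to the model
obs_scores <- apply(log_scores, 2, mean, na.rm = TRUE)
plot(obs_scores, type = 'b', bty = 'l',
     xlab = 'Observation (time)', ylab = 'Mean log score')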

This function returns the log(probability) of the observation given the distributional assumptions (i.e. that the truth is distributed Negative Binomial with a mean centred around the predicted expectation and overdispersion given by the estimated \(\phi\) parameter). We can illustrate how this works with a simple example. Assume our model gave a posterior expectation of 21 for a particular timepoint. The model has also estimated an overdispersion parameter of 5, and the true observed value at this timepoint is 39. The log score is then calculated using the correct density function (dnbinom in this case; see ?dnbinom for details):

dnbinom(x = 39, mu = 21, size = 5, log = TRUE)
## [1] -4.849416

We would get a higher score (i.e. a better score) if our model’s expectation was closer to the truth:

dnbinom(x = 39, mu = 41, size = 5, log = TRUE)
## [1] -3.860454

In contrast, we would get a lower score (i.e. a worse score) if the expectation was further from the truth:

dnbinom(x = 39, mu = 12, size = 5, log = TRUE)
## [1] -7.979571

The logLik.mvgam function does this for us, computing the log(probability) scores using the correct distribution for the model and returning a matrix of scores (one for each posterior draw * observation combination). The ELPD for each out-of-sample observation is then calculated by averaging these scores over the posterior draws. This option is available in mvgam as well, but note that you must use type = 'link' when producing forecasts so that expectations can be calculated. For this reason, interval coverages are not calculated when using type = 'elpd':

fc_linpreds <- forecast(model3_exact, type = 'link')
scores <- score(fc_linpreds, score = 'elpd')
scores$sockeye
##         score eval_horizon score_type
## 1  -10.330861            1       elpd
## 2   -9.608175            2       elpd
## 3  -10.263423            3       elpd
## 4  -10.411050            4       elpd
## 5   -8.482237            5       elpd
## 6   -8.875313            6       elpd
## 7   -9.331130            7       elpd
## 8          NA            8       elpd
## 9          NA            9       elpd
## 10         NA           10       elpd
## 11         NA           11       elpd
## 12         NA           12       elpd
## 13         NA           13       elpd
## 14         NA           14       elpd

The pattern in these scores reflects the pattern from the crps scores above (remember that for ELPD, the goal is to achieve higher scores).
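A quick way to visualise this correspondence (a sketch, reusing the fc object created above) is to plot the two score types against each other for the non-missing horizons; because CRPS is negatively oriented and ELPD positively oriented, better-forecast horizons should sit towards low CRPS and high ELPD:

# recompute CRPS for the response-scale forecasts and extract the ELPDs
crps_vals <- score(fc, score = 'crps')$sockeye$score
elpd_vals <- scores$sockeye$score

# horizons that were harder to forecast should score worse on both metrics
plot(crps_vals, elpd_vals,
     pch = 16, bty = 'l',
     xlab = 'CRPS (lower is better)',
     ylab = 'ELPD (higher is better)')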

Exercises

  1. Calculate the Root Mean Squared Error (RMSE) for posterior mean predictions from the fc object using the formula provided by Hyndman and Athanasopoulos
  2. Roll the training data one timepoint forward and refit model3 (i.e. split the data on timepoint 37 rather than 36). Compare CRPS values from this model to the previous refit to see whether the scores for timepoints 38-43 have changed when conditioning on the extra information in timepoint 37
  3. Consider watching the below video by Rob Hyndman on evaluating distributional forecasts

Session Info

sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## time zone: Australia/Brisbane
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.5.0               tidybayes_3.0.6            
##  [3] mvgam_1.1.0                 insight_0.19.7             
##  [5] marginaleffects_0.17.0.9002 brms_2.21.0                
##  [7] Rcpp_1.0.12                 mgcv_1.8-42                
##  [9] nlme_3.1-162                dplyr_1.1.4                
## [11] downloadthis_0.3.3         
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     svUnit_1.0.6         farver_2.1.1        
##  [4] scoringRules_1.1.1   loo_2.6.0            fastmap_1.1.1       
##  [7] tensorA_0.36.2.1     digest_0.6.34        timechange_0.2.0    
## [10] mime_0.12            lifecycle_1.0.4      StanHeaders_2.26.28 
## [13] magrittr_2.0.3       posterior_1.5.0      compiler_4.3.1      
## [16] rlang_1.1.3          sass_0.4.8           tools_4.3.1         
## [19] utf8_1.2.4           yaml_2.3.8           collapse_2.0.7      
## [22] data.table_1.14.10   knitr_1.45           labeling_0.4.3      
## [25] bridgesampling_1.1-2 pkgbuild_1.4.3       plyr_1.8.9          
## [28] abind_1.4-5          klippy_0.0.0.9500    withr_3.0.0         
## [31] purrr_1.0.2          grid_4.3.1           stats4_4.3.1        
## [34] fansi_1.0.6          colorspace_2.1-0     inline_0.3.19       
## [37] scales_1.3.0         MASS_7.3-60          cli_3.6.2           
## [40] mvtnorm_1.2-4        rmarkdown_2.25       generics_0.1.3      
## [43] RcppParallel_5.1.7   rstudioapi_0.15.0    reshape2_1.4.4      
## [46] cachem_1.0.8         rstan_2.32.3         stringr_1.5.1       
## [49] splines_4.3.1        bayesplot_1.10.0     assertthat_0.2.1    
## [52] parallel_4.3.1       matrixStats_1.2.0    base64enc_0.1-3     
## [55] vctrs_0.6.5          Matrix_1.6-4         jsonlite_1.8.8      
## [58] arrayhelpers_1.1-0   ggdist_3.3.1         jquerylib_0.1.4     
## [61] tidyr_1.3.1          glue_1.7.0           codetools_0.2-19    
## [64] distributional_0.3.2 lubridate_1.9.3      stringi_1.8.3       
## [67] gtable_0.3.4         QuickJSR_1.0.9       munsell_0.5.0       
## [70] tibble_3.2.1         pillar_1.9.0         htmltools_0.5.7     
## [73] Brobdingnag_1.2-9    R6_2.5.1             evaluate_0.23       
## [76] lattice_0.21-8       highr_0.10           backports_1.4.1     
## [79] bslib_0.6.1          rstantools_2.3.1.1   bsplus_0.1.4        
## [82] zip_2.3.0            coda_0.19-4.1        gridExtra_2.3       
## [85] checkmate_2.3.1      xfun_0.42            fs_1.6.3            
## [88] pkgconfig_2.0.3