The data and modelling objects created in this notebook can be downloaded directly to save computational time.
Users who wish to complete the exercises can download a small
template R
script. Assuming you have already downloaded the
data objects above, this script will load all data objects so that the
steps used to create them are not necessary to tackle the exercises.
This tutorial relates to content covered in Lecture 1 and Lecture 2, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:
library(dplyr)
library(mvgam)
library(gratia)
library(ggplot2); theme_set(theme_classic())
library(marginaleffects)
library(zoo)
library(viridis)
We will work with time series of rodent captures from the Portal Project, a long-term monitoring study based near the town of Portal, Arizona. Researchers have been operating a standardized set of baited traps within 24 experimental plots at this site since the 1970s. Sampling follows the lunar monthly cycle, with observations occurring on average about 28 days apart. However, missing observations do occur due to difficulties accessing the site (weather events, COVID disruptions etc.). You can read about the sampling protocol in this preprint by Ernest et al. on bioRxiv.
All data from the Portal Project are made openly available in
near real-time so that they can provide maximum benefit to scientific
research and outreach (a set of open-source software tools make data readily
accessible). These data are extremely rich, containing monthly
counts of rodent captures for >20 species. But rather than accessing
the raw data, we will use some data that I have already processed and
put into a simple, usable form
data("portal_data")
As the data come pre-loaded with mvgam
, you can read a
little about it in the help page using ?portal_data
. Before
working with data, it is important to inspect how the data are
structured. There are various ways to inspect data in R
; I
typically find the glimpse
function in
dplyr
to be useful for understanding how variables are
structured
dplyr::glimpse(portal_data)
## Rows: 199
## Columns: 10
## $ moon <int> 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 3…
## $ DM <int> 10, 14, 9, NA, 15, NA, NA, 9, 5, 8, NA, 14, 7, NA, NA, 9…
## $ DO <int> 6, 8, 1, NA, 8, NA, NA, 3, 3, 4, NA, 3, 8, NA, NA, 3, NA…
## $ PP <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 1…
## $ OT <int> 2, 0, 1, NA, 1, NA, NA, 1, 0, 0, NA, 2, 1, NA, NA, 1, NA…
## $ year <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,…
## $ mintemp <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16…
## $ precipitation <dbl> 37.8, 8.7, 43.5, 23.9, 0.9, 1.4, 20.3, 91.0, 60.5, 25.2,…
## $ ndvi <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1…
We will analyse time series of captures for one specific rodent species, the Desert Pocket Mouse Chaetodipus penicillatus. This species is interesting in that it goes into a kind of “hibernation” during the colder months, leading to very low captures during the winter period
Manipulating the data into a ‘long’ format is necessary for modelling
in mvgam
, mgcv
and brms
(see the
formatting data for use in mvgam
vignette
for more details). By ‘long’ format, we mean that each
series x time
observation needs to have its own entry in
the data.frame
or list
object that we wish to
use as data for modelling. A simple example can be viewed by simulating
data using sim_mvgam()
.
data <- sim_mvgam(n_series = 4, T = 24)
head(data$data_train, 12)
## y season year series time
## 1 1 1 1 series_1 1
## 2 0 1 1 series_2 1
## 3 2 1 1 series_3 1
## 4 1 1 1 series_4 1
## 5 2 2 1 series_1 2
## 6 3 2 1 series_2 2
## 7 5 2 1 series_3 2
## 8 1 2 1 series_4 2
## 9 5 3 1 series_1 3
## 10 1 3 1 series_2 3
## 11 7 3 1 series_3 3
## 12 0 3 1 series_4 3
Notice how we have four different time series in these simulated
data, but we do not spread these out into different columns. Rather,
there is only a single column for the outcome variable, labelled
y
in these simulated data. We also must supply a variable
labelled time
to ensure the modelling software knows how to
arrange the time series when building models. This setup still allows us
to formulate multivariate time series models, as we will see in later
tutorials. Below are the steps needed to shape our
portal_data
object into the correct form. First, we create
a time
variable, select the column representing counts of
our target species (PP
), and select some predictors
portal_data %>%
# mvgam requires a 'time' variable be present in the data to index
# the temporal observations. This is especially important when tracking
# multiple time series. In the Portal data, the 'moon' variable indexes the
# lunar monthly timestep of the trapping sessions
dplyr::mutate(time = moon - (min(moon)) + 1) %>%
# We can also provide a more informative name for the outcome variable, which
# is counts of the 'PP' species (Chaetodipus penicillatus) across all control
# plots
dplyr::mutate(count = PP) %>%
# The other requirement for mvgam is a 'series' variable, which needs to be a
# factor variable to index which time series each row in the data belongs to.
# Again, this is more useful when you have multiple time series in the data
dplyr::mutate(series = as.factor('PP')) %>%
# Select the variables of interest to keep in the model_data
dplyr::select(series, year, time, count, mintemp, ndvi) -> model_data
The data now contain six variables:
series, a factor indexing which time series each observation belongs to
year, the year of sampling
time, the indicator of which time step each observation belongs to
count, the response variable representing the number of captures of the species PP in each sampling observation
mintemp, the monthly average minimum temperature at each time step
ndvi, the monthly average Normalized Difference Vegetation Index at each time step
Check the data structure again
dplyr::glimpse(model_data)
## Rows: 199
## Columns: 6
## $ series <fct> PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP…
## $ year <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ time <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ count <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 13, NA,…
## $ mintemp <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16.520, …
## $ ndvi <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1.76132…
summary(model_data)
## series year time count mintemp
## PP:199 Min. :2004 Min. : 1.0 Min. : 0.00 Min. :-24.000
## 1st Qu.:2008 1st Qu.: 50.5 1st Qu.: 2.50 1st Qu.: -3.884
## Median :2012 Median :100.0 Median :12.00 Median : 2.130
## Mean :2012 Mean :100.0 Mean :15.14 Mean : 3.504
## 3rd Qu.:2016 3rd Qu.:149.5 3rd Qu.:24.00 3rd Qu.: 12.310
## Max. :2020 Max. :199.0 Max. :65.00 Max. : 18.140
## NA's :36
## ndvi
## Min. :0.2817
## 1st Qu.:1.0741
## Median :1.3501
## Mean :1.4709
## 3rd Qu.:1.8178
## Max. :3.9126
##
We have some NA
s in our response variable
count
. Visualize the data as a heatmap to get a sense of
where these are distributed (NA
s are shown as red bars)
image(is.na(t(model_data %>%
dplyr::arrange(dplyr::desc(time)))), axes = F,
col = c('grey80', 'darkred'))
axis(3, at = seq(0,1, len = NCOL(model_data)), labels = colnames(model_data))
These observations will be thrown out by most modelling packages in
R
. But as you will see when we work through the tutorials,
mvgam
keeps these in the data so that predictions can be
automatically returned for the full dataset.
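As a quick optional check (plain base R, nothing mvgam-specific), we can also count the missing values in each column of the modelling data:
# count the number of missing values in each column
colSums(is.na(model_data))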
Exercises
1. Plot the count, mintemp and ndvi variables as histograms to get a better sense of the types of values they can take. See ?hist for guidance.
2. Calculate what proportion of observations in count are zeros. See ?which and ?== for guidance.
If you need a hint for the second exercise, here is one way to calculate the proportion of zeros in count:
# the which function is useful for finding indices in a variable that fit a certain pattern
which(model_data$count == 0)
# to see how many zeros there are, calculate the length of this set of indices
length(which(model_data$count == 0))
# to calculate proportion of observations that are zero, divide this by the total number of observations
length(which(model_data$count == 0)) / length(model_data$count)
“The first thing to do in any data analysis task is to plot the data. Graphs enable many features of the data to be visualised, including patterns, unusual observations, changes over time, and relationships between variables” (Hyndman and Athanasopoulos; Forecasting: Principles and Practice (3rd ed))
A time series is a set of observations taken sequentially in time. When working with time series, the first step (after inspecting the data structure) is usually to plot observations over time, as either a line plot or scatterplot
plot(x = model_data$time, y = model_data$count, type = 'l',
xlab = 'Time', ylab = 'Counts', bty = 'l', col = 'darkred',
lwd = 2)
This plot reveals several important features including: missing data, evidence of repeated seasonal cycles, lower bounding at zero and long-term undulations. Most of these features make it challenging to model series like this using conventional time series / forecasting software. But before we discuss modelling approaches, let’s build some more visuals to look for other key features
Lag plots are often used to inspect temporal autocorrelation at a
variety of lags for a univariate time series. While correlation measures
the extent of a linear relationship between two variables,
autocorrelations measure linear relationships between lagged values of a
time series. Ordinarily we’d want to plot our outcome of interest
(count
), but the default lag.plot
can’t handle
missing values (more on that below). But we can inspect correlations
among NDVI values (using the ndvi
variable) and lagged
versions of ndvi
to see evidence of strong positive
autocorrelation at several lags
lag.plot(model_data$ndvi, lags = 4, pch = 16)
This function takes lagged versions of the variable ndvi and plots the lagged value (i.e. the value of the series at time \(t-lag\)) vs the observed value at time \(t\). We can replicate this process for a single lag so it is more obvious what is happening under the hood
ndvi_lag1 <- diff(model_data$ndvi, lag = 1)
plot(ndvi_lag1, model_data$ndvi[2:NROW(model_data)],
xlab = expression("NDVI"["t-1"]),
ylab = expression("NDVI"["t"]), pch = 16, bty = 'l')
A Pearson’s correlation test confirms that there is a strong positive
relationship between current ndvi
and what
ndvi
was one month earlier
cor.test(ndvi_lag1, model_data$ndvi[2:NROW(model_data)])
##
## Pearson's product-moment correlation
##
## data: ndvi_lag1 and model_data$ndvi[2:NROW(model_data)]
## t = 6.736, df = 196, p-value = 1.765e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3130538 0.5403426
## sample estimates:
## cor
## 0.4335688
But unfortunately the lag plot will fail when the variable of
interest has missing values, which our outcome of interest
(count
) has plenty of. Rather than using a lag plot, we can
instead plot the autocorrelation function at a variety of lags. These
plots are often used for deciding which moving average order
to choose when building autoregressive models
acf(model_data$count,
na.action = na.pass,
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = 'Counts of PP trappings')
Several features are evident from this plot: First, we observe positive autocorrelation at small lags. When time series show a trend, autocorrelations for small lags will often be large and positive because observations nearby in time have similar values. So the ACF suggests there is evidence of a trend, as autocorrelations tend to decrease only slowly as the lags increase. Second, we see evidence of seasonality in captures. When data are seasonal, the autocorrelations will often be larger for the seasonal lags (at multiples of the seasonal period) than for other lags. We know this species shows evidence of torpor in the winter months, so it is not surprising to see a strong annual pattern in these autocorrelations.
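If you would like the numbers behind this plot, one option (a small addition of my own, using only base R) is to store the ACF object and inspect the estimated autocorrelations directly; note that the first element corresponds to lag 0:
# store the ACF estimates rather than plotting them
acf_vals <- acf(model_data$count, na.action = na.pass,
                lag.max = 26, plot = FALSE)
# autocorrelations at the (roughly) seasonal lags of 12 and 24 lunar months;
# the + 1 offset is needed because the first element is lag 0
acf_vals$acf[c(12, 24) + 1]
# compare these against an intermediate lag (lag 6)
acf_vals$acf[6 + 1]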
The partial autocorrelation function shows something similar, but has controlled for the short-term lagged correlations. These plots are often used for selecting orders of autoregressive models. For example, if we choose an AR2 model then we hope our AR2 parameter will capture remaining autocorrelations at lag 2 when controlling for any autocorrelations at lag 1.
pacf(model_data$count,
na.action = na.pass,
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = 'Counts of PP trappings')
Both of these plots suggest there is temporal autocorrelation
present, as well as strong seasonality. We can view the ACF and some
other useful features of the data with one of mvgam
’s
primary functions, plot_mvgam_series
(see
?plot_mvgam_series
for details):
plot_mvgam_series(data = model_data, y = 'count')
Another standard visualisation is Seasonal-Trend decomposition (STL). It is
often helpful to split a time series into several components, each
representing an underlying pattern, to inspect features of the data that
may not be immediately apparent. STL decomposes a time series into
seasonal, trend and irregular components using loess
smoothing (see ?stl
for details). But just to get to a
point where we can use this with most ecological series, we have to deal
with missing observations. The usual practice is to impute these in some
way. We can use the na.approx
function from the
zoo
package for this, which will use linear interpolation
to impute missing values. This requires that we first convert the
count
observations to a time series object of class
ts
. We set the frequency to 12
as this is
close enough to what we have with our lunar monthly observation
intervals. We will not be using the imputed data for any modelling, so
please don’t worry too much if you aren’t up to speed on how linear
interpolation works. This is purely for visualization purposes
# First convert the counts to a time series object of class ts
count_ts <- ts(model_data$count, frequency = 12)
# Impute NAs using linear interpolation
interp_series <- zoo::na.approx(count_ts)
Now we can perform the decomposition. This requires us to specify the seasonal window, which controls the span in lags over which the seasonal loess will operate. Smaller values allow for more rapid changes. A key advantage of this method is that the seasonal component is allowed to change over time. But a drawback is that this parameter must be chosen by the analyst, and changing its value can give different results. We will use a 5-year seasonal window in this first example
plot(stl(interp_series, s.window = 5))
The strong seasonality is obvious in this plot, as is the evidence of
an undulating trend. There is also a lot of variation in the
remainder
component, which demonstrates that the
seasonality and trend alone cannot capture all of the variation in the
observations over time. The bars on the right-hand side of these plots are scale bars that help us understand the relative magnitudes of each component. But what happens if we change the seasonal window and
force it to be stable through time?
plot(stl(interp_series, s.window = 'periodic'))
Here the remainder
component is even more important, and
the trend has become less smooth. Try varying the s.window
argument to see how the conclusions about component contributions will
change. For more information on decomposition, see this chapter in Applied Time Series Analysis for Fisheries and
Environmental Sciences, by Holmes, Scheuerell and Ward
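If you want a rough, programmatic way to make that comparison, the short sketch below (my own addition, reusing the stl() call from above) refits the decomposition for a few seasonal windows and reports the variance of the remainder component under each:
# compare how much variation is left in the remainder component
# as the seasonal window changes
for(sw in list(5, 11, 'periodic')){
  decomp <- stl(interp_series, s.window = sw)
  cat('s.window =', sw,
      '| variance of remainder =',
      round(var(decomp$time.series[, 'remainder']), 2), '\n')
}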
Exercises
1. Calculate cross-correlations between count and ndvi at lags of 1 - 15 months. This procedure is often used to help identify lags of the x-variable that might be useful predictors of the outcome. Note, you will need to use na.action = na.pass so that the missing values in count don't cause problems. See ?ccf for details about calculating cross-correlations using type = 'correlation'. Which lags of ndvi would you consider to be the most likely candidates for predicting our outcome count?
2. Plot an STL decomposition for the ndvi variable.
If you need a hint for the ccf calculation, here is a template to complete:
# fill in the "?" with the correct variable / value
ccf(x = ?,
y = ?,
na.action = ?,
lag.max = ?,
# compute correlations at each lag
type = 'correlation',
# not necessary but makes for a prettier plot
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = expression(paste(italic(Cor),"(",ndvi[lag],",",count, ")")))
Now onto some modelling! Our first task will be to fit a Generalized
Linear Model (GLM) that can capture the features of our
count
observations (integer data, lower bound at zero,
missing values) while also attempting to model temporal variation. We
will use a GLM with Poisson observations, a log link function and random
(hierarchical) intercepts for year
. This will allow us to
capture our prior belief that, although each year is unique, having been
sampled from the same population of effects, all years are connected and
thus might contain valuable information about one another. This will
capitalize on the partial pooling properties of hierarchical models:
Hierarchical (random) effects offer many advantages when
modelling data with grouping structures (e.g. multiple species,
locations, years etc…). But before we fit the model, we will need to
convert year
to a factor so that we can use a random effect
basis in mvgam
. See ?smooth.terms
and
?smooth.construct.re.smooth.spec
for details about the
re
basis construction that is used by both
mvgam
and mgcv
model_data %>%
# Convert 'year' to a factor variable
dplyr::mutate(year = factor(year)) -> model_data
levels(model_data$year)
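As a quick sanity check, we can also count how many year levels (and therefore how many yearly random intercepts) the re basis will need to estimate; this matches the 17 random effect coefficients that appear in the Stan code below:
# how many yearly intercepts will the 're' basis estimate?
nlevels(model_data$year)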
We are now ready for our first mvgam
model. The syntax
will be familiar to users who have built models with mgcv
.
But for a refresher, see ?mvgam_formulae
and the examples
in ?gam.models
. Random effects can be specified using the
s()
wrapper with the re
basis. Note that we
can also suppress the primary intercept using the usual R
formula syntax - 1
. mvgam
has a number of
possible observation families that can be used, see
?mvgam_families
for more information. We will use
Stan
as the fitting engine, which deploys Hamiltonian Monte
Carlo (HMC) for full Bayesian inference. By default, 4 HMC chains will
be run using a warmup of 500 iterations and collecting 250 posterior
samples from each chain. Ordinarily we would take more posterior samples
per chain but 225 give us an object that will be easier to download for
completing the exercises. Interested users should consult the Stan
user’s guide for more information
about the software and the enormous variety of models that can be
tackled with HMC.
model1 <- mvgam(count ~ s(year, bs = 're') - 1,
family = poisson(),
data = model_data,
samples = 250)
With any model fitted in mvgam
, the Stan
code can be viewed using code()
. This is helpful if you
aren’t sure what priors are used by default:
code(model1)
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## int<lower=0> n_nonmissing; // number of nonmissing observations
## array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
##
## // random effect variances
## vector<lower=0>[1] sigma_raw;
##
## // random effect means
## vector[1] mu_raw;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## b[1 : 17] = mu_raw[1] + b_raw[1 : 17] * sigma_raw[1];
## }
## model {
## // prior for random effect population variances
## sigma_raw ~ student_t(3, 0, 2.5);
##
## // prior for random effect population means
## mu_raw ~ std_normal();
##
## // prior (non-centred) for s(year)...
## b_raw[1 : 17] ~ std_normal();
## {
## // likelihood functions
## flat_ys ~ poisson_log_glm(flat_xs, 0.0, b);
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## array[n, n_series] int ypred;
##
## // posterior predictions
## eta = X * b;
## for (s in 1 : n_series) {
## mus[1 : n, s] = eta[ytimes[1 : n, s]];
## ypred[1 : n, s] = poisson_log_rng(mus[1 : n, s]);
## }
## }
Once the model has finished, the first step is to inspect the
summary()
to ensure no major diagnostic warnings have been
produced and to summarise posterior effects
summary(model1)
## GAM formula:
## count ~ s(year, bs = "re") - 1
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 199
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 725; warmup = 500; thin = 1
## Total post-warmup draws = 900
##
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## s(year).1 1.80 2.0 2.3 1.00 1114
## s(year).2 2.50 2.7 2.9 1.00 1193
## s(year).3 3.00 3.1 3.2 1.00 1161
## s(year).4 3.10 3.3 3.4 1.00 1138
## s(year).5 1.90 2.1 2.3 1.00 1146
## s(year).6 1.50 1.8 2.0 1.00 1864
## s(year).7 1.80 2.0 2.3 1.00 1290
## s(year).8 2.80 3.0 3.1 1.00 1398
## s(year).9 3.10 3.3 3.4 1.00 1216
## s(year).10 2.60 2.8 2.9 1.00 1293
## s(year).11 3.00 3.1 3.2 1.00 1469
## s(year).12 3.10 3.2 3.3 1.00 1173
## s(year).13 2.00 2.2 2.5 1.00 1005
## s(year).14 2.50 2.6 2.8 1.00 1611
## s(year).15 2.00 2.2 2.4 1.00 1080
## s(year).16 1.90 2.1 2.3 1.00 1399
## s(year).17 -0.22 1.1 1.9 1.01 275
##
## GAM group-level estimates:
## 2.5% 50% 97.5% Rhat n_eff
## mean(s(year)) 2.00 2.40 2.7 1.02 111
## sd(s(year)) 0.45 0.67 1.1 1.03 113
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(year) 14.2 17 25255 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 900 iterations ended with a divergence (0%)
## 0 of 900 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Tue Apr 23 10:43:31 AM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
We can get some sense of the variation in yearly intercepts from the
summary, but it is easier to understand them by plotting. Plot posterior
distributions of the temporal random effects using
plot.mvgam()
with type = 're'
. See
?plot.mvgam
for more details
plot(model1, type = 're')
A better way, however, to quickly plot all key effects in an
mvgam
model is to use conditional_effects()
,
which can give quick overviews of estimated effects on either the link
scale or the response scale (see ?conditional_effects.mvgam
for details)
conditional_effects(model1)
How do these yearly intercepts translate into time-varying
predictions? To understand this, we can plot posterior hindcasts from
this model for the training period using plot.mvgam
with
type = 'forecast'
plot(model1, type = 'forecast')
If you wish to extract these hindcasts for other downstream analyses,
the hindcast()
function can be used. This will return an
object of class mvgam_forecast
. In the
hindcasts
slot, a matrix of predictions will be returned
for each series in the data (only one series in our example):
hc <- hindcast(model1)
str(hc)
## List of 15
## $ call :Class 'formula' language count ~ s(year, bs = "re") - 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## $ trend_call : NULL
## $ family : chr "poisson"
## $ trend_model : chr "None"
## $ drift : logi FALSE
## $ use_lv : logi FALSE
## $ fit_engine : chr "stan"
## $ type : chr "response"
## $ series_names : chr "PP"
## $ train_observations:List of 1
## ..$ PP: int [1:199] 0 1 2 NA 10 NA NA 16 18 12 ...
## $ train_times : num [1:199] 1 2 3 4 5 6 7 8 9 10 ...
## $ test_observations : NULL
## $ test_times : NULL
## $ hindcasts :List of 1
## ..$ PP: num [1:900, 1:199] 9 12 5 9 11 5 11 5 5 6 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:199] "ypred[1,1]" "ypred[2,1]" "ypred[3,1]" "ypred[4,1]" ...
## $ forecasts : NULL
## - attr(*, "class")= chr "mvgam_forecast"
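Because the hindcasts slot is simply a matrix of posterior draws (rows) by timepoints (columns), it can be summarised with base R functions; for example, a minimal sketch to compute the posterior median and 95% interval at each timepoint:
# posterior median and 95% credible interval of the hindcast
# at every timepoint (columns of the draws matrix)
hc_summary <- apply(hc$hindcasts$PP, MARGIN = 2,
                    FUN = quantile, probs = c(0.025, 0.5, 0.975))
# rows are the three quantiles; columns are the 199 timepoints
dim(hc_summary)
hc_summary[, 1:5]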
In any regression, a key question is whether the residuals show any patterns that could indicate un-modelled variation. For GLMs, we can use a modified residual called the Dunn-Smyth, or randomized quantile, residual. Inspect Dunn-Smyth residuals from
the model using plot()
with
type = 'residuals'
plot(model1, type = 'residuals')
Exercises
1. The Dunn-Smyth residuals are stored in the fitted model object as a matrix of dimension n_draws x n_timepoints. Calculate posterior median residuals per time point and plot these as a line plot (hint, use ?apply and either ?median or ?quantile for guidance). Add a dashed horizontal line at 0 (hint, use ?abline for guidance). Do these residuals look worrisome to you?
If you need a hint, here is a template to complete:
# fill in the "?" with the correct variable / value
# extract residuals for our rodent species of interest
model1_resids <- model1$resids$PP
# check the dimensions of the residuals matrix
dim(model1_resids)
# calculate posterior median residuals per timepoint
median_resids <- apply(model1_resids,
MARGIN = ?,
FUN = ?)
# plot median residuals over time
plot(median_resids,
type = 'l',
# not necessary but makes for a prettier plot
lwd = 2,
col = 'darkred',
bty = 'l',
ylab = 'Median residual',
xlab = 'Time')
# add a horizontal dashed line at zero
abline(? = 0, lty = 'dashed')
These random effects do not have a sense of “time”. Each yearly
intercept is not restricted to be similar to the previous yearly
intercept. This drawback is evident when we predict for a new year. To
do this, we repeat the exercise above but this time split the data into
training and testing sets before re-running the model. We can then
supply the test set as newdata
. For splitting, we will make
use of the filter
function from dplyr
model_data %>%
dplyr::filter(time <= 160) -> data_train
model_data %>%
dplyr::filter(time > 160) -> data_test
model1b <- mvgam(count ~ s(year, bs = 're') - 1,
family = poisson(),
data = data_train,
newdata = data_test)
Repeating the plots above gives insight into how the model’s hierarchical formulation provides all structure needed to sample values for un-modelled years
plot(model1b, type = 'forecast')
## Out of sample DRPS:
## 182.91865725
Here you can see that the predictions do not capture the temporal variation in the test set.
As with the hindcast()
function, we can use
forecast()
to automatically extract forecast predictions.
This also returns an object of class mvgam_forecast
, but
now it will contain hindcasts and forecasts for each series in the
data:
fc <- forecast(model1b)
str(fc)
## List of 16
## $ call :Class 'formula' language count ~ s(year, bs = "re") - 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## $ trend_call : NULL
## $ family : chr "poisson"
## $ family_pars : NULL
## $ trend_model : chr "None"
## $ drift : logi FALSE
## $ use_lv : logi FALSE
## $ fit_engine : chr "stan"
## $ type : chr "response"
## $ series_names : Factor w/ 1 level "PP": 1
## $ train_observations:List of 1
## ..$ PP: int [1:160] 0 1 2 NA 10 NA NA 16 18 12 ...
## $ train_times : num [1:160] 1 2 3 4 5 6 7 8 9 10 ...
## $ test_observations :List of 1
## ..$ PP: int [1:39] NA 0 0 10 3 14 18 NA 28 46 ...
## $ test_times : num [1:39] 161 162 163 164 165 166 167 168 169 170 ...
## $ hindcasts :List of 1
## ..$ PP: num [1:2000, 1:160] 9 7 7 11 6 1 6 5 11 7 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:160] "ypred[1,1]" "ypred[2,1]" "ypred[3,1]" "ypred[4,1]" ...
## $ forecasts :List of 1
## ..$ PP: num [1:2000, 1:39] 9 7 9 4 9 5 14 9 6 8 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:39] "ypred[161,1]" "ypred[162,1]" "ypred[163,1]" "ypred[164,1]" ...
## - attr(*, "class")= chr "mvgam_forecast"
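As with the hindcasts, the forecasts slot is just a matrix of posterior draws that can be summarised directly; and, assuming your installed version of mvgam provides the score() method for mvgam_forecast objects, the out-of-sample DRPS reported in the plot above can also be computed by hand:
# posterior median forecast for each out-of-sample timepoint
apply(fc$forecasts$PP, MARGIN = 2, FUN = median)

# score the forecast distribution against the held-out test observations
# (assumes a score() method exists for mvgam_forecast objects)
score(fc, score = 'drps')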
We nearly always wish to include predictor variables that may explain
some variation in our observations. Here, we will update the model from
above by including a parametric (fixed) effect of ndvi
.
This code also shows how you can start modifying priors, in this case by
using a standard normal prior for the ndvi
effect:
model2 <- mvgam(count ~ s(year, bs = 're') +
ndvi - 1,
family = poisson(),
# Standard normal prior on the NDVI coefficient
priors = prior(normal(0, 1), class = b),
data = data_train,
newdata = data_test)
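If you are not sure which priors can be changed in this way, mvgam's get_mvgam_priors() function (sketched here; see ?get_mvgam_priors in your installed version) lists the default priors for a model specification without fitting it:
# inspect the default priors that could be modified for this model
get_mvgam_priors(count ~ s(year, bs = 're') + ndvi - 1,
                 family = poisson(),
                 data = data_train)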
This model estimates the yearly random intercepts as before, plus a linear effect (on the log scale) of the ndvi value at each timepoint \(t\). Inspect the summary
summary(model2)
## GAM formula:
## count ~ ndvi + s(year, bs = "re") - 1
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 160
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 1000; warmup = 500; thin = 1
## Total post-warmup draws = 2000
##
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## ndvi 0.32 0.39 0.46 1 1816
## s(year).1 1.10 1.40 1.70 1 2023
## s(year).2 1.80 2.00 2.20 1 2246
## s(year).3 2.20 2.40 2.60 1 2289
## s(year).4 2.30 2.50 2.70 1 1863
## s(year).5 1.20 1.40 1.70 1 2307
## s(year).6 1.00 1.30 1.50 1 2716
## s(year).7 1.10 1.40 1.70 1 2537
## s(year).8 2.10 2.30 2.50 1 2309
## s(year).9 2.70 2.90 3.00 1 2053
## s(year).10 2.00 2.20 2.40 1 2149
## s(year).11 2.30 2.40 2.60 1 1787
## s(year).12 2.50 2.70 2.80 1 1877
## s(year).13 1.40 1.60 1.80 1 2518
## s(year).14 0.54 2.00 3.30 1 1675
## s(year).15 0.58 2.00 3.30 1 1583
## s(year).16 0.46 2.00 3.30 1 1384
## s(year).17 0.61 2.00 3.30 1 1628
##
## GAM group-level estimates:
## 2.5% 50% 97.5% Rhat n_eff
## mean(s(year)) 1.60 2.00 2.3 1.00 515
## sd(s(year)) 0.42 0.61 1.0 1.01 421
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(year) 11.1 17 2937 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Tue Apr 23 10:46:52 AM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
Look again at the estimated effects, but this time on the link scale
conditional_effects(model2, type = 'link')
This plot indicates a positive effect of ndvi
on
log(counts)
. But given that our model used a nonlinear link
function, it can still be difficult to fully understand what
relationship our model is estimating between a predictor and the
response. We can use
conditional_effects(model2, type = 'response')
to again
see these on the response scale, or we can use
plot_predictions()
for more customisable plots:
plot_predictions(model2,
condition = "ndvi",
# include the observed count values
# as points, and show rugs for the observed
# ndvi and count values on the axes
points = 0.5, rug = TRUE) +
labs(y = 'Predicted count', x = 'NDVI')
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
Now it is easier to get a sense of the positive effect of
ndvi
. But has this predictor made an impact on
predictions?
plot(model2, type = 'forecast', newdata = data_test)
## Out of sample DRPS:
## 151.1333405
These predictions certainly look different. Are the residual diagnostics improved?
plot(model2, type = 'residuals')
Exercises
1. Plot the relationship between ndvi and our outcome count.
2. Update the model to include mintemp as a second predictor. Be sure to first check whether there is a strong pairwise correlation between mintemp and ndvi (hint, use ?cor.test for guidance).
3. Is there evidence for an effect of mintemp on log(counts)? Or does inclusion of mintemp lead to different inference on the effect of ndvi compared to when mintemp was not in the model?
We will now incorporate a spline of time
in place of the
temporal yearly random effects. Nonlinear splines are commonly viewed as
variations of random effects in which coefficients that control the
shape of the spline are drawn from a joint, penalized distribution. This
strategy is often used in time series analysis to capture smooth
temporal variation. The lectures (particularly lecture 2) have gone into detail about smoothing
splines and basis functions, but it is worth quickly re-iterating some
of this detail here. When we construct smoothing splines, the workhorse
package mgcv
will calculate a set of basis functions that
will collectively control the shape and complexity of the resulting
spline. It is often helpful to visualize these basis functions to get a
better sense of how splines work. We’ll create a set of 6 basis
functions to represent possible variation in the effect of
time
on our outcome. This can easily be done using
functions in the gratia
package:
basis_funs <- basis(s(time, bs = 'tp', k = 6),
data = model_data)
Once the functions are created, the next step is to plot them
draw(basis_funs) +
labs(y = 'Basis function value', x = 'Time')
Upping the potential complexity of the smooth function, by increasing
k
(to 10 in this example), leads to a larger diversity of
basis functions:
draw(basis(s(time, bs = 'tp', k = 10),
data = model_data)) +
labs(y = 'Basis function value', x = 'Time')
To see what is happening under the hood, these basis functions can also be constructed manually using mgcv and a few base R routines:
# Set up the smooth object that is needed to predict values of
# basis functions
temp_gam <- gam(count ~ s(time, bs = 'tp', k = 6),
data = model_data,
family = poisson(), fit = FALSE)
# We can understand how this penalty matrix is used to
# translate values of `time` into basis functions by
# generating a sequence of `time` values for sampling
# prior function realisations
cov_seq <- seq(min(model_data$time),
max(model_data$time),
length.out = 500)
summary(cov_seq)
# Generate the linear predictor matrix for this
# sequence of `time` values
Xp <- mgcv::PredictMat(temp_gam$smooth[[1]],
data.frame(time = cov_seq))
str(Xp)
# This linear predictor contains 5 columns, one for each
# function; the sequences of values in the columns represent
# the actual basis functions over the range of simulated
# `time` values
par(mar = c(4,4,2,1.5))
plot(1, type = 'n',
ylim = range(Xp),
xlim = range(cov_seq),
bty = 'l',
ylab = 'Basis function value',
xlab = 'Time')
cols <- viridis::viridis(n = NCOL(Xp), alpha = 0.8)
for(i in 1:NCOL(Xp)){
lines(x = cov_seq, y = Xp[,i], lwd = 3, col = 'white')
lines(x = cov_seq, y = Xp[,i], lwd = 2.5, col = cols[i])
}
box(bty = 'l', lwd = 2)
In addition to constructing basis functions, mgcv
also
creates a penalty matrix \(S\) that
works to constrain the spline’s wiggliness. When fitting a GAM, we must
estimate smoothing parameters (\(\lambda\)) to penalize these matrices,
resulting in constrained basis coefficients and smoother functions that
are less likely to overfit the data. This is the key to fitting GAMs in
a Bayesian framework, as we can jointly estimate the \(\lambda\)’s using informative priors to
prevent overfitting and expand the complexity of models we can tackle.
To see this in practice, we can now fit a model that replaces the yearly
random effects with a smooth function of time
. We will need
a reasonably complex function (large k
) to try and
accommodate the temporal variation in our observations. Following some
useful advice by Gavin Simpson, we will use a
b-spline basis for the temporal smooth. Because we no longer have
intercepts for each year, we also retain the primary intercept term in
this model (there is no -1
in the formula now):
model3 <- mvgam(count ~ s(time, bs = 'bs', k = 15) +
ndvi +
mintemp,
family = poisson(),
# Standard normal priors on all fixed effect coefficients
priors = prior(normal(0, 1), class = b),
data = data_train,
newdata = data_test)
The spline of time is built as a weighted sum of basis functions, with the basis and its penalty matrix constructed by mgcv. The weights \((\beta_{smooth})\) are drawn from a penalized multivariate normal distribution in which the precision matrix \((\Omega)\) is multiplied by a smoothing penalty \((\lambda)\). If \(\lambda\) becomes large, this acts to squeeze the covariances among the weights \((\beta_{smooth})\), leading to a less wiggly spline. Note that sometimes there are multiple smoothing penalties that contribute to the covariance matrix, but I am only showing one here for simplicity.
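To make the roles of \(S\) and \(\lambda\) more concrete, here is a small illustration of my own that reuses the temp_gam object created earlier: the penalty enters the model through the quadratic form \(\lambda \beta_{smooth}^{T} S \beta_{smooth}\), so coefficient vectors that describe wigglier functions are penalized more heavily for any given \(\lambda\):
# extract the penalty matrix S for the smooth of time constructed earlier
S <- temp_gam$smooth[[1]]$S[[1]]

# two example coefficient vectors: one with equal weights on all basis
# functions, and one whose weights flip sign from one function to the next
beta_flat   <- rep(0.5, NCOL(S))
beta_wiggly <- rep(c(2, -2), length.out = NCOL(S))

# the wiggliness penalty lambda * t(beta) %*% S %*% beta (here with lambda = 1);
# larger values of this quadratic form correspond to wigglier functions
t(beta_flat) %*% S %*% beta_flat
t(beta_wiggly) %*% S %*% beta_wiggly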
View the summary
summary(model3)
## GAM formula:
## count ~ s(time, bs = "bs", k = 15) + ndvi + mintemp
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 160
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 1000; warmup = 500; thin = 1
## Total post-warmup draws = 2000
##
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## (Intercept) 2.200 2.300 2.400 1.00 1161
## ndvi -0.130 -0.050 0.028 1.00 1176
## mintemp 0.061 0.066 0.072 1.00 3081
## s(time).1 -1.700 -0.730 0.240 1.00 649
## s(time).2 0.086 0.920 1.900 1.01 355
## s(time).3 -0.350 0.620 1.600 1.00 398
## s(time).4 1.800 2.700 3.700 1.01 352
## s(time).5 -1.200 -0.230 0.760 1.01 409
## s(time).6 -0.690 0.180 1.200 1.01 366
## s(time).7 -1.400 -0.390 0.650 1.00 429
## s(time).8 0.480 1.400 2.300 1.00 376
## s(time).9 0.800 1.700 2.700 1.01 348
## s(time).10 -0.330 0.540 1.500 1.01 383
## s(time).11 0.630 1.500 2.600 1.01 352
## s(time).12 0.470 1.300 2.200 1.00 387
## s(time).13 -1.600 -0.830 0.082 1.00 500
## s(time).14 -6.100 -3.000 -0.220 1.01 441
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(time) 8.37 14 644 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Tue Apr 23 10:47:50 AM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
The summary above now contains posterior estimates for the smoothing
parameters as well as basis coefficients for effect of
time
. One quick way to get a sense of how “nonlinear” a
function might be is to compare the reference degrees of
freedom (Ref.df
) to the estimated effective
degrees of freedom edf
in the summary output. Larger
values of edf
indicate more wiggly terms, while the number
of basis functions supplied in the k
argument controls the
Ref.df
, i.e. the maximum possible edf
. When
the edf
is well below the Ref.df
, this
suggests the function is being shrunk down towards a smoother shape. But
if edf
is large (approaching Ref.df
), that
suggests the model needs all of those basis functions to capture
nonlinearities in the data. These values, alongside the estimated
smoothing parameters, are also useful for generating an approximate
significance test to ask if the function is statistically ‘different’
from a linear or flat function. Low p-values for this test also
suggest the function is capturing a lot of variation in the data. You
can look at the Details section in
?summary.gam
for more information about these tests.
Clearly we can see that our function of time
is highly
nonlinear here, as it has a relatively high edf
and a low
p-value.
Plot the effects as before, using
conditional_effects()
conditional_effects(model3)
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
We can also view some realizations of the underlying
time
function (here, each line is a different potential
curve drawn from the posterior of all possible curves):
plot(model3, type = 'smooths', realisations = TRUE,
n_realisations = 30)
A useful task when modelling with GAMs is to identify where the function is changing most rapidly. To address this, we can plot
estimated 1st derivative (i.e. the slope) of the spline over a sequence
of time
values:
plot(model3, type = 'smooths', derivatives = TRUE)
Here, values >0 indicate the function was
increasing at that time point, while values <0
indicate
the function was declining. The most rapid declines appear to have been happening around timepoint 50 and again toward the end of the training
period, for example. And as usual, we can use
marginaleffects
routines to get a similar plot:
plot_slopes(model3, variables = 'time',
condition = 'time',
type = 'link') +
geom_hline(yintercept = 0, linetype = 'dashed')
Comparing models with loo()
Working with brms
or mvgam
objects can be
daunting because we don’t have anova()
or
AIC()
methods that ecologists are used to for model
comparison. But we have approximate leave-one-out cross-validation,
which works in a similar way. Using loo_compare()
on
multiple models will ask which model would be expected to do a better
job of predicting left-out observations within the training
period. A larger value is better
loo_compare(model2, model3)
## elpd_diff se_diff
## model3 0.0 0.0
## model2 -399.4 62.6
Clearly the nonlinear smooth of time
(in
model3
) gives a better in-sample fit than the random
effects of year
(model2
).
Exercises
1. Change the ndvi effect to a smooth function of ndvi using the s() argument (hint, see ?s for guidance on how to define smooth terms in GAM formulae using mgcv syntax). Inspect the resulting function estimates for both ndvi and time, and consider using some of the strategies outlined in this blog post on interpreting nonlinear effects in GAMs with mgcv and marginaleffects to understand the smooths in more depth.
2. Try increasing k in the smooth of time to a larger number, something like 50. Do you get any sense that there are problems with fitting this model? Do any of your conclusions change?
sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## time zone: Australia/Brisbane
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.6.4 viridisLite_0.4.2
## [3] zoo_1.8-12 ggplot2_3.5.0
## [5] gratia_0.8.9.11 mvgam_1.1.0
## [7] insight_0.19.7 marginaleffects_0.17.0.9002
## [9] brms_2.21.0 Rcpp_1.0.12
## [11] mgcv_1.8-42 nlme_3.1-162
## [13] dplyr_1.1.4 downloadthis_0.3.3
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 inline_0.3.19 rlang_1.1.3
## [4] magrittr_2.0.3 matrixStats_1.2.0 compiler_4.3.1
## [7] loo_2.6.0 vctrs_0.6.5 stringr_1.5.1
## [10] pkgconfig_2.0.3 fastmap_1.1.1 backports_1.4.1
## [13] labeling_0.4.3 utf8_1.2.4 cmdstanr_0.7.1
## [16] rmarkdown_2.25 pracma_2.4.4 ps_1.7.5
## [19] nloptr_2.0.3 purrr_1.0.2 xfun_0.42
## [22] cachem_1.0.8 jsonlite_1.8.8 collapse_2.0.7
## [25] highr_0.10 parallel_4.3.1 R6_2.5.1
## [28] bslib_0.6.1 stringi_1.8.3 StanHeaders_2.26.28
## [31] bsplus_0.1.4 lubridate_1.9.3 jquerylib_0.1.4
## [34] assertthat_0.2.1 rstan_2.32.3 knitr_1.45
## [37] smooth_4.0.0 klippy_0.0.0.9500 base64enc_0.1-3
## [40] mvnfast_0.2.8 bayesplot_1.10.0 Matrix_1.6-4
## [43] splines_4.3.1 timechange_0.2.0 tidyselect_1.2.1
## [46] rstudioapi_0.15.0 abind_1.4-5 yaml_2.3.8
## [49] scoringRules_1.1.1 codetools_0.2-19 processx_3.8.3
## [52] pkgbuild_1.4.3 lattice_0.21-8 tibble_3.2.1
## [55] withr_3.0.0 bridgesampling_1.1-2 posterior_1.5.0
## [58] coda_0.19-4.1 evaluate_0.23 RcppParallel_5.1.7
## [61] zip_2.3.0 texreg_1.39.3 pillar_1.9.0
## [64] tensorA_0.36.2.1 checkmate_2.3.1 stats4_4.3.1
## [67] distributional_0.3.2 generics_0.1.3 rstantools_2.3.1.1
## [70] munsell_0.5.0 scales_1.3.0 greybox_2.0.0
## [73] xtable_1.8-4 glue_1.7.0 tools_4.3.1
## [76] data.table_1.14.10 fs_1.6.3 mvtnorm_1.2-4
## [79] grid_4.3.1 ggokabeito_0.1.0 tidyr_1.3.1
## [82] QuickJSR_1.0.9 colorspace_2.1-0 patchwork_1.2.0
## [85] cli_3.6.2 fansi_1.0.6 Brobdingnag_1.2-9
## [88] gtable_0.3.4 sass_0.4.8 digest_0.6.34
## [91] farver_2.1.1 htmltools_0.5.7 lifecycle_1.0.4
## [94] httr_1.4.7 statmod_1.5.0 mime_0.12
## [97] MASS_7.3-60