The data and modelling objects created in this notebook can be downloaded directly to save computational time.
Users who wish to complete the exercises can download a small
template R
script. Assuming you have already downloaded the
data objects above, this script will load all data objects so that the
steps used to create them are not necessary to tackle the exercises.
This tutorial relates to content covered in Lecture 2 and Lecture 3, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:
library(dplyr)
library(mvgam)
library(brms)
library(marginaleffects)
library(tidybayes)
library(ggplot2); theme_set(theme_classic())
We will continue with time series of rodent captures from the Portal
Project, a
long-term monitoring study based near the town of Portal, Arizona,
with particular focus on the time series of captures for the Desert
Pocket Mouse Chaetodipus penicillatus. Data manipulations will
proceed in a similar way as in Tutorial 1, except that we now use a lagged version
of ndvi
(at a lag of 12 lunar months; see
?dplyr::lag
for details) to give you a sense of how you can
create and use lagged predictors in time series models
data("portal_data")
portal_data %>%
# mvgam requires a 'time' variable be present in the data to index
# the temporal observations. This is especially important when tracking
# multiple time series. In the Portal data, the 'moon' variable indexes the
# lunar monthly timestep of the trapping sessions
dplyr::mutate(time = moon - (min(moon)) + 1) %>%
# We can also provide a more informative name for the outcome variable, which
# is counts of the 'PP' species (Chaetodipus penicillatus) across all control
# plots
dplyr::mutate(count = PP) %>%
# The other requirement for mvgam is a 'series' variable, which needs to be a
# factor variable to index which time series each row in the data belongs to.
# Again, this is more useful when you have multiple time series in the data
dplyr::mutate(series = as.factor('PP')) %>%
# Create a lagged version of ndvi to use as a predictor in place of the
# contemporaneous ndvi we have been using previously. Note, the data must be
# in the correct order for this to work properly
dplyr::arrange(time) %>%
dplyr::mutate(ndvi_lag12 = dplyr::lag(ndvi, 12)) %>%
# Select the variables of interest to keep in the model_data
dplyr::select(series, time, count,
mintemp, ndvi_lag12) -> model_data
The data now contain five variables:
- series, a factor indexing which time series each observation belongs to
- time, the indicator of which time step each observation belongs to
- count, the response variable representing the number of captures of the species PP in each sampling observation
- mintemp, the monthly average minimum temperature at each time step
- ndvi_lag12, the lagged monthly average Normalized Difference Vegetation Index at each time step
Using lagged versions of predictors is a standard practice in time
series analysis / forecasting. But unfortunately the software we are
using cannot easily handle missing values in the predictors. Because
we’ve calculated lags of ndvi
, the first 12 rows of our
data now have missing values for this predictor:
head(model_data, 14)
## series time count mintemp ndvi_lag12
## 1 PP 1 0 -9.710 NA
## 2 PP 2 1 -5.924 NA
## 3 PP 3 2 -0.220 NA
## 4 PP 4 NA 1.931 NA
## 5 PP 5 10 6.568 NA
## 6 PP 6 NA 11.590 NA
## 7 PP 7 NA 14.370 NA
## 8 PP 8 16 16.520 NA
## 9 PP 9 18 12.490 NA
## 10 PP 10 12 9.210 NA
## 11 PP 11 NA 0.845 NA
## 12 PP 12 3 -7.080 NA
## 13 PP 13 2 -7.550 1.465889
## 14 PP 14 NA -3.633 1.558507
Visualize the data as a heatmap to get a sense of where these missing values are distributed (NAs are shown as red bars in the plot below):
image(is.na(t(model_data %>%
dplyr::arrange(dplyr::desc(time)))), axes = F,
col = c('grey80', 'darkred'))
axis(3, at = seq(0,1, len = NCOL(model_data)), labels = colnames(model_data))
For now we will simply omit these missing values and restart the time index at 1:
model_data %>%
dplyr::filter(!is.na(ndvi_lag12)) %>%
dplyr::mutate(time = time - min(time) + 1) -> model_data_trimmed
head(model_data_trimmed)
As in the previous tutorial, we split the data into training and testing subsets so that we can inspect forecast behaviours, but this time we use a shorter validation period:
model_data_trimmed %>%
dplyr::filter(time <= 162) -> data_train
model_data_trimmed %>%
dplyr::filter(time > 162) -> data_test
Our last tutorial ended with a model that included a smooth function
of time
. The lectures have illustrated why this might be a
bad idea if one of our goals is to forecast. We will produce forecasts
from this model to show why this can be problematic. Start by refitting
the model from Tutorial 1, including the data_test
object as newdata
to ensure the model automatically
produces forecasts.
model0 <- mvgam(count ~ s(time, bs = 'bs', k = 15) +
ndvi_lag12 +
mintemp,
family = poisson(),
data = data_train,
newdata = data_test)
Inspect forecasts from this model, which are not great
plot(model0, type = 'forecast', newdata = data_test)
## Out of sample DRPS:
## 310.32655475
What happens if we change k
slightly by using 12 basis
functions rather than 15?
model0b <- mvgam(count ~ s(time, bs = 'bs', k = 12) +
ndvi_lag12 +
mintemp,
family = poisson(),
data = data_train,
newdata = data_test)
Forecasts from this model are, somehow, even more ridiculous!
plot(model0b, type = 'forecast', newdata = data_test)
## Out of sample CRPS:
## 7178.942387
Why is this happening? The forecasts are driven almost entirely by variation in the temporal spline, which is extrapolating linearly forever beyond the training data. Any slight wiggles near the boundary will result in wildly different forecasts. To visualize this, we can plot the extrapolated temporal functions for the two models. Here for the first model, with 15 basis functions:
plot_predictions(model0, by = 'time',
newdata = datagrid(time = 1:max(data_test$time)),
type = 'link') +
geom_vline(xintercept = max(data_train$time),
linetype = 'dashed')
And for the second model, with fewer basis functions (12). The difference is clear: this model reacts to a slight upward shift in the smooth toward the boundary, which sends its predictions heading to infinity:
plot_predictions(model0b, by = 'time',
newdata = datagrid(time = 1:max(data_test$time)),
type = 'link') +
geom_vline(xintercept = max(data_train$time),
linetype = 'dashed')
Try changing the value of k and seeing how the forecasts change. This should give you more confidence that extrapolating from penalized smooths can be unpredictable and difficult to control. There are some strategies to help rein in this behaviour (once again this advice comes via a blog post by Gavin Simpson), but results will still be unpredictable and sensitive to choices such as k or the placement of knots. But if we shouldn't use smooth functions of time for forecasting, what other options do we have?
Exercises:
1. Refit model0 using a B-spline of time with multiple penalties (see ?b.spline for guidance); a template is given below
2. Refit model0 using a cubic regression spline of time in place of the B-spline (bs = 'cr') and compare the resulting forecasts (a starter sketch is shown after the template code below)
)# Replace the ? with the correct value(s)
# Using ?b.spline you will see examples in the Description and Details sections about how these splines can handle multiple penalties
model0 <- mvgam(count ~ s(time, bs = 'bs', k = 15,
m = c(?)) +
ndvi_lag12 +
mintemp,
family = poisson(),
data = data_train,
newdata = data_test)
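For the second exercise, a possible starting point is shown here (a sketch; the model0c name is just an illustrative placeholder, and everything other than the basis is left unchanged from model0):
# swap in a cubic regression spline basis for the function of time
model0c <- mvgam(count ~ s(time, bs = 'cr', k = 15) +
                   ndvi_lag12 +
                   mintemp,
                 family = poisson(),
                 data = data_train,
                 newdata = data_test)
plot(model0c, type = 'forecast', newdata = data_test)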
One useful strategy to deal with this is to fit models that incorporate autocorrelated residual processes within the
GAM. If our model’s residuals are autocorrelated, our estimates of
uncertainty around other regression parameters (such as the \(\beta\) coefficients) can be biased and
misleading. Failing to address this may lead to false inferences and
perhaps to poor forecasts. Fortunately there are ways we can tackle this
by modelling the autocorrelation in our model’s errors. This can be done
using mgcv
functionality, as the blog post above explains.
But we will use brms
here as it is more flexible and
provides an interface that is more consistent with
mvgam
.
brms
First, we will fit a similar model to those used in Tutorial 1 to see how brms works.
The interface is very familiar, relying on the usual R
formula syntax (see ?brmsformula
for details and examples).
But there are some key differences between brms
and
mvgam
. For example, brms
can handle a much
larger diversity of response distributions (see ?brmsfamily
for details). But for now, we will stick with a Poisson model similar to the one used previously (a nonlinear function of ndvi_lag12 and a parametric, linear function of mintemp):
model1 <- brm(count ~
s(ndvi_lag12, k = 9) +
mintemp,
family = poisson(),
# set a mildly informative prior on beta coefficients
prior = set_prior("normal(0, 2)", class = "b"),
data = data_train,
backend = 'cmdstanr',
iter = 1000,
cores = 4)
Use summary()
to ensure no major diagnostic warnings
have been produced and inspect posterior distributions for key
parameters. You may notice some warnings of divergent transitions for
this model, which should give you some immediate concern about the
validity of the model’s inferences
summary(model1)
## Warning: Parts of the model have not converged (some Rhats are > 1.05). Be
## careful when analysing the results! We recommend running more iterations and/or
## setting stronger priors.
## Warning: There were 118 divergent transitions after warmup. Increasing
## adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Family: poisson
## Links: mu = log
## Formula: count ~ s(ndvi_lag12, k = 9) + mintemp
## Data: data_train (Number of observations: 134)
## Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
## total post-warmup draws = 2000
##
## Smoothing Spline Hyperparameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(sndvi_lag12_1) 0.35 0.36 0.02 1.50 1.08 40 34
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2.39 0.03 2.33 2.45 1.01 920 931
## mintemp 0.07 0.00 0.06 0.07 1.01 860 872
## sndvi_lag12_1 -0.22 0.50 -1.67 0.80 1.06 47 24
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
For any brms
model, we can inspect the underlying
Stan
code using the stancode
function:
stancode(model1)
Stan
code
## // generated with brms 2.21.0
## functions {
## }
## data {
## int<lower=1> N; // total number of observations
## array[N] int Y; // response variable
## int<lower=1> K; // number of population-level effects
## matrix[N, K] X; // population-level design matrix
## int<lower=1> Kc; // number of population-level effects after centering
## // data for splines
## int Ks; // number of linear effects
## matrix[N, Ks] Xs; // design matrix for the linear effects
## // data for spline 1
## int nb_1; // number of bases
## array[nb_1] int knots_1; // number of knots
## // basis function matrices
## matrix[N, knots_1[1]] Zs_1_1;
## int prior_only; // should the likelihood be ignored?
## }
## transformed data {
## matrix[N, Kc] Xc; // centered version of X without an intercept
## vector[Kc] means_X; // column means of X before centering
## for (i in 2:K) {
## means_X[i - 1] = mean(X[, i]);
## Xc[, i - 1] = X[, i] - means_X[i - 1];
## }
## }
## parameters {
## vector[Kc] b; // regression coefficients
## real Intercept; // temporary intercept for centered predictors
## vector[Ks] bs; // unpenalized spline coefficients
## // parameters for spline 1
## // standardized penalized spline coefficients
## vector[knots_1[1]] zs_1_1;
## vector<lower=0>[nb_1] sds_1; // SDs of penalized spline coefficients
## }
## transformed parameters {
## // penalized spline coefficients
## vector[knots_1[1]] s_1_1;
## real lprior = 0; // prior contributions to the log posterior
## // compute penalized spline coefficients
## s_1_1 = sds_1[1] * zs_1_1;
## lprior += normal_lpdf(b | 0, 2);
## lprior += student_t_lpdf(Intercept | 3, 2.6, 2.5);
## lprior += normal_lpdf(bs | 0, 2);
## lprior += student_t_lpdf(sds_1 | 3, 0, 2.5)
## - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## }
## model {
## // likelihood including constants
## if (!prior_only) {
## // initialize linear predictor term
## vector[N] mu = rep_vector(0.0, N);
## mu += Intercept + Xs * bs + Zs_1_1 * s_1_1;
## target += poisson_log_glm_lpmf(Y | Xc, mu, b);
## }
## // priors including constants
## target += lprior;
## target += std_normal_lpdf(zs_1_1);
## }
## generated quantities {
## // actual population-level intercept
## real b_Intercept = Intercept - dot_product(means_X, b);
## }
Of course we still cannot interpret the smooth effects using the
posterior summary. We will need to make targeted plots to do that.
Fortunately brms
has a similar
conditional_effects()
function to mvgam
so
this is simple to do:
plot(conditional_effects(model1),
theme = theme_classic(),
mean = FALSE,
rug = TRUE,
ask = FALSE)
tidybayes
A notable difference between brms and mvgam is that the former does not automatically return predictions. These need to be computed after model fitting, which also means that residuals must be computed manually. To inspect residual diagnostics we can enlist the help of the tidybayes package to produce randomized quantile residuals. More information on this procedure can be found in this tidybayes vignette. Do not worry if it is difficult to understand; it is not of major relevance to the learning objectives of the course (and if you use mvgam, this will be done for you automatically anyway)
data_train %>%
# calculate posterior predictions
add_predicted_draws(model1) %>%
# use predictions to compute randomized quantile residuals
summarise(
p_lower = mean(.prediction < count),
p_upper = mean(.prediction <= count),
p_residual = runif(1,
max(p_lower, 0.00001),
min(p_upper, 0.99999)),
z_residual = qnorm(p_residual),
.groups = "drop_last"
) -> residual_data
Perform the usual residual diagnostics. For example, below are residual ACF, pACF and Q-Q plots, all of which show that the model’s predictions do not adequately match the features of the observed data
acf(residual_data$z_residual,
na.action = na.pass,
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = 'Dunn-Smyth residuals')
pacf(residual_data$z_residual,
na.action = na.pass,
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = 'Dunn-Smyth residuals')
ggplot(residual_data, aes(sample = z_residual)) +
geom_qq() +
geom_abline(col = 'darkred') +
theme_classic()
## Warning: Removed 56 rows containing non-finite outside the scale range
## (`stat_qq()`).
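The comparisons below use two further models, model2 and model3, which extend model1 by adding a latent AR1 process to soak up this residual autocorrelation. A minimal sketch of how such models can be specified is given here; the exact priors, iteration settings and other arguments behind the saved model2 and model3 objects are assumptions, so treat this as a guide rather than the definitive fitting code:
# In brms, an AR1 process can be added with the ar() term; cov = TRUE asks
# brms to place the AR process on latent residuals, which is needed for
# families (such as the Poisson) that have no natural residuals
model2 <- brm(count ~
                s(ndvi_lag12, k = 9) +
                mintemp +
                ar(time = time, p = 1, cov = TRUE),
              family = poisson(),
              prior = set_prior("normal(0, 2)", class = "b"),
              data = data_train,
              backend = 'cmdstanr',
              iter = 1000,
              cores = 4)

# In mvgam, the equivalent strategy is to ask for an AR1 latent trend
model3 <- mvgam(count ~
                  s(ndvi_lag12, k = 9) +
                  mintemp,
                family = poisson(),
                trend_model = AR(p = 1),
                priors = prior(normal(0, 2), class = b),
                data = data_train,
                newdata = data_test)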
We can compare inferences from the two models to see that they are similar (although the effects are plotted in a different order):
plot(conditional_effects(model2), ask = FALSE)
conditional_effects(model3)
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
Now compare estimates for the residual AR1 parameters. First, extract posterior draws for the AR1 parameter from the two models, which can be done using the as.data.frame method:
ar1_ests <- rbind(data.frame(
ar1 = as.data.frame(model3, variable = 'ar1[1]')$`ar1[1]`,
model = 'mvgam'),
data.frame(
ar1 = as.data.frame(model2, variable = 'ar[1]')$`ar[1]`,
model = 'brms'))
We can use ggplot2
to compare these estimates, which
shows that they are also very similar:
ggplot(data = ar1_ests, aes(x = ar1, fill = model)) +
geom_histogram(aes(y = after_stat(density)),
bins = 40, col = 'white') +
facet_grid(rows = 'model') +
theme_bw() +
theme(legend.position = 'none') +
scale_fill_manual(values = c('darkblue', 'darkred')) +
labs(x = expression(ar1[error]), y = 'Frequency') +
xlim(c(0,1))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
As you may remember from the lectures, the larger this AR1 parameter becomes (in absolute value), the more a time series tends toward nonstationarity. We can look at some example functions to investigate this behaviour in more detail here
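In equation form, the AR1 process simulated below (with a fixed error SD) is simply: \[z_t \sim \text{Normal}(ar1 * z_{t-1}, \sigma_{error})\]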
R
code used to
produce these simulations
# Define a function to simulate AR1 processes with a fixed error variance
simulate_ar1 = function(ar1 = 0.5, N = 50){
# simulate the initial value of the series
init <- rnorm(1, mean = 0, sd = 0.25)
# create an empty vector to store the time series values
states <- vector(length = N)
# set the first value of the states as the initial value
states[1] <- init
# loop over remaining time points and fill in the AR1 process
for(t in 2:N){
states[t] <- rnorm(1, mean = ar1 * states[t - 1],
sd = 0.25)
}
return(states)
}
# plot a single AR1 series with an AR1 parameter = 0.5
plot(x = 1:50, y = simulate_ar1(ar1 = 0.5),
type = 'l',
lwd = 2, col = 'darkred',
ylab = expression(f(Time)),
xlab = 'Time', bty = 'l',
ylim = c(-2,2),
main = expression(AR[1]~'='~0.5))
# overlay an additional 20 possible series using different colours
for(i in 1:20){
lines(simulate_ar1(ar1 = 0.5),
col = sample(c("#DCBCBC",
"#C79999",
"#B97C7C",
"#A25050",
"#7C0000"), 1),
lwd = 2)
}
box(bty = 'l', lwd = 2)
This plot shows how an AR1 process with a smallish AR1 parameter (0.5) approaches stationarity fairly quickly. What happens if we increase this parameter toward the types of values that our models have estimated? The code to produce these simulations is the same as above, except the AR1 parameter was changed to 0.95.
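If you would like to reproduce that second set of simulations yourself, the only change needed is the value passed to the simulate_ar1() function defined in the block above:
# plot a single AR1 series with an AR1 parameter = 0.95
plot(x = 1:50, y = simulate_ar1(ar1 = 0.95),
     type = 'l',
     lwd = 2, col = 'darkred',
     ylab = expression(f(Time)),
     xlab = 'Time', bty = 'l',
     ylim = c(-2,2),
     main = expression(AR[1]~'='~0.95))
# overlay an additional 20 possible series using different colours
for(i in 1:20){
  lines(simulate_ar1(ar1 = 0.95),
        col = sample(c("#DCBCBC",
                       "#C79999",
                       "#B97C7C",
                       "#A25050",
                       "#7C0000"), 1),
        lwd = 2)
}
box(bty = 'l', lwd = 2)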
The variance of this particular series grows into the future for a
much longer time than did the series we simulated above. This is useful
for getting a sense about the relative stability of the underlying
dynamic process. We can enforce a Random Walk so that the dynamic
process will not be stationary (note, a Random Walk trend can only be
fit in mvgam
; there is currently no way of using such a
process in brms
, that I’m aware of):
model3b <- mvgam(count ~
s(ndvi_lag12, k = 9) +
mintemp,
family = poisson(),
data = data_train,
newdata = data_test,
priors = prior(normal(0, 2), class = b),
trend_model = RW())
The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ log(\lambda_t) & = f(\boldsymbol{ndvi})_t + \beta_{mintemp} * \boldsymbol{mintemp}_t + z_t \\ z_t & \sim \text{Normal}(z_{t-1}, \sigma_{error}) \\ \sigma_{error} & \sim \text{Exponential}(2) \\ f(\boldsymbol{ndvi}) & = \sum_{k=1}^{K}b_k * \beta_{smooth,k} \\ \beta_{smooth} & \sim \text{MVNormal}(0, (\Omega * \lambda)^{-1}) \\ \beta_{mintemp} & \sim \text{Normal}(0, 2) \end{align*}\]
Note the similarity to model3 above. The only difference is that we no longer estimate the AR1 parameter, so the mean of the latent process at time \(t\) is simply the value of the latent process at time \(t-1\). Plotting the trend estimates for the AR1 and Random Walk models gives you further insight into the different assumptions they make about the temporal evolution of the system:
layout(matrix(1:2, nrow = 2))
plot(model3, type = 'trend', newdata = data_test, main = 'AR1 trend')
plot(model3b, type = 'trend', newdata = data_test, main = 'RW trend')
Exercises:
1. Compare estimates of the latent error standard deviation between the two models (in mvgam, this parameter is called sigma[1], while in brms, it is called sderr)
2. Compare estimates of the mintemp coefficient between the two models (in mvgam, this parameter is called mintemp, while in brms, it is called b_mintemp)
3. Inspect Dunn-Smyth residuals from the mvgam model (using plot.mvgam() with type = 'residuals'). Does this model seem to capture the relevant features of the autocorrelated observed data?

Another way to capture dynamic errors in ecological time series is with a Gaussian Process:
“A Gaussian process defines a probability distribution over functions; in other words every sample from a Gaussian Process is an entire function from the covariate space X to the real-valued output space.” (Betancourt; Robust Gaussian Process Modeling)
In many time series analyses, we believe the power of past observations to explain current data is a function of how long ago we observed those past observations. Gaussian Processes (GPs) use covariance functions (often called ‘kernels’) that allow this to happen by specifying how correlations depend on the difference, in time, between observations. For our purposes we will rely on the squared exponential kernel, which depends on \(\rho\) (often called the length scale parameter) to control how quickly correlations decay as a function of time. The functions also depend on a parameter \(\alpha\), which controls how much the GP term contributes to the linear predictor. Below we simulate from such a GP to get an idea of the kinds of functional shapes that can be produced as you vary the length scale \(\rho\):
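In notation, the squared exponential kernel used in these simulations (and in the models below) computes the covariance between function values at times \(t\) and \(t'\) as: \[k(t, t') = \alpha^2 \exp\left(-\frac{(t - t')^2}{2\rho^2}\right)\] so larger values of \(\rho\) mean that correlations decay more slowly as the time difference grows.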
R
code used to
produce these simulations
# set a seed for reproducibility
set.seed(2222)
# simulate from a fairly large length-scale parameter
rho <- 10
# set the alpha parameter to 1 for all simulations
alpha <- 1
# simulate functions that span 50 time points
times <- 1:50
# generate a sequence of prediction values
draw_seq <- seq(min(times), max(times), length.out = 100)
# construct the temporal covariance matrix
Sigma <- alpha^2 *
exp(-0.5 * ((outer(draw_seq, draw_seq, "-") / rho) ^ 2)) +
diag(1e-12, length(draw_seq))
# draw a single realization from the GP distribution
gp_vals <- MASS::mvrnorm(n = 1,
mu = rep(0, length(draw_seq)),
Sigma = Sigma)
# plot the single GP draw
plot(x = draw_seq, y = gp_vals, type = 'l',
lwd = 2, col = 'darkred',
ylab = expression(f(Time)),
xlab = 'Time', bty = 'l',
ylim = c(-3,3),
main = expression(alpha~'='~1*'; '~rho~'='~10))
# overlay an additional 10 possible functions using different colours
for(i in 1:10){
draw <- MASS::mvrnorm(n = 1, mu = rep(0, length(draw_seq)),
Sigma = Sigma)
lines(x = draw_seq, y = draw, lwd = 3.5, col = 'white')
lines(x = draw_seq, y = draw,
col = sample(c("#DCBCBC",
"#C79999",
"#B97C7C",
"#A25050",
"#7C0000"), 1),
lwd = 2.5)
}
box(bty = 'l', lwd = 2)
This plot shows that the temporal autocorrelations have a long ‘memory’, i.e. they tend to change slowly over time. In contrast, let’s see what happens when we simulate from a GP with a shorter length scale parameter (the same code as above was used, except the length scale was changed to 4)
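To reproduce these shorter length scale simulations, only the value of rho changes (a sketch that re-uses the draw_seq and alpha objects created in the simulation block above):
# simulate from a much shorter length-scale parameter
rho <- 4
# rebuild the temporal covariance matrix with the new length scale
Sigma <- alpha^2 *
  exp(-0.5 * ((outer(draw_seq, draw_seq, "-") / rho) ^ 2)) +
  diag(1e-12, length(draw_seq))
# draw and plot a single realization from this wigglier GP
gp_vals <- MASS::mvrnorm(n = 1,
                         mu = rep(0, length(draw_seq)),
                         Sigma = Sigma)
plot(x = draw_seq, y = gp_vals, type = 'l',
     lwd = 2, col = 'darkred',
     ylab = expression(f(Time)),
     xlab = 'Time', bty = 'l',
     ylim = c(-3,3),
     main = expression(alpha~'='~1*'; '~rho~'='~4))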
These functions are much ‘wigglier’ and can be useful for capturing temporal dynamics with short-term ‘memory’. A big advantage of GPs is that they can easily handle data that are irregularly sampled, much like splines can (and in contrast to autoregressive processes like we’ve been using so far). But although GPs may look similar to splines in the functions they return, they are fundamentally different. In particular, a spline has local knowledge, meaning it has no principled way of understanding how the function itself is evolving across time. But a GP directly models how the function changes over time, which means it is better able to generate out-of-sample predictions.
brms
We can use GPs in both brms
and mvgam
. In
fact, both packages capitalize on some advances that allow GPs to be
approximated using Hilbert space basis functions, which considerably speed up computation at little cost to
accuracy or prediction performance. You can incorporate GPs of time
(or of any other covariate) using the gp()
function (see
?gp
for details).
Up to this point we’ve used a Poisson model and have been able to capture excess dispersion with autoregressive error terms. But a GP trend will not readily capture this overdispersion because it evolves smoothly. We’ll instead switch to a Negative Binomial model. This can be helpful for our data as we want to capture overdispersion in the counts while estimating a smooth underlying dynamic trend with the GP function. Read more about the Negative Binomial and its widespread use in ecological modelling in this recent paper.
model4 <- brm(count ~
s(ndvi_lag12, k = 9) +
mintemp +
# use a GP of time with squared exponential kernel;
# employ Hilbert space approximation
# for faster computation, with 20 basis functions
# used for the approximation; set scale = FALSE so
# the rho parameter is more interpretable
gp(time, c = 5/4, k = 20, scale = FALSE),
# use a Negative Binomial observation model
family = negbinomial(),
prior = set_prior("normal(0, 2)", class = "b"),
data = data_train,
backend = 'cmdstanr',
iter = 1000,
cores = 4)
The latent dynamic process now evolves as a GP. The Negative Binomial
distribution also has a second parameter, \(\phi\), which controls any extra
dispersion. A summary of this model now returns posterior summaries of
the GP function through the parameters sdgp(gptime)
(the
\(\alpha\) covariance parameter) and
lscale(gptime)
(the length scale or \(\rho\) covariance parameter). It also
contains the overdispersion parameter (\(\phi\)), labelled as shape
.
The smaller this value, the more support there is for overdispersion in
the observations
summary(model4)
## Warning: There were 6 divergent transitions after warmup. Increasing
## adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Family: negbinomial
## Links: mu = log; shape = identity
## Formula: count ~ s(ndvi_lag12, k = 9) + mintemp + gp(time, c = 5/4, k = 20, scale = FALSE)
## Data: data_train (Number of observations: 134)
## Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
## total post-warmup draws = 2000
##
## Smoothing Spline Hyperparameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(sndvi_lag12_1) 0.48 0.50 0.02 1.80 1.00 1072 822
##
## Gaussian Process Hyperparameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sdgp(gptime) 1.14 0.44 0.55 2.26 1.00 800 1013
## lscale(gptime) 5.85 3.53 1.65 14.85 1.00 1625 1306
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2.15 0.32 1.56 2.84 1.00 1122 705
## mintemp 0.10 0.01 0.08 0.12 1.00 2584 1632
## sndvi_lag12_1 -0.67 0.96 -2.45 1.35 1.00 1274 1098
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape 1.91 0.33 1.34 2.66 1.00 1701 1332
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Plot conditional effects from this model, on the response scale
plot(conditional_effects(model4),
theme = theme_classic(),
mean = FALSE,
rug = TRUE,
ask = FALSE)
The last plot shows that the GP has clearly captured nonlinear, smooth variation over time.
mvgam
We can fit a similar model in mvgam
using the
nb()
family:
model5 <- mvgam(count ~
s(ndvi_lag12, k = 9) +
mintemp +
gp(time, c = 5/4, k = 20, scale = FALSE),
family = nb(),
data = data_train,
newdata = data_test,
priors = prior(normal(0, 2), class = b))
This model’s summary also now contains posterior summaries of the GP parameters and the Negative Binomial overdispersion parameter
summary(model5)
## GAM formula:
## count ~ s(ndvi_lag12, k = 9) + mintemp + gp(time, c = 5/4, k = 20,
## scale = FALSE)
##
## Family:
## negative binomial
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 162
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 1000; warmup = 500; thin = 1
## Total post-warmup draws = 2000
##
##
## Observation dispersion parameter estimates:
## 2.5% 50% 97.5% Rhat n_eff
## phi[1] 1.5 2.1 3 1 2623
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## (Intercept) 1.500 2.20000 3.100 1.04 90
## mintemp 0.077 0.09600 0.120 1.00 1912
## s(ndvi_lag12).1 -0.100 0.00920 0.130 1.00 900
## s(ndvi_lag12).2 -0.240 0.00120 0.240 1.00 645
## s(ndvi_lag12).3 -0.053 -0.00180 0.048 1.00 1495
## s(ndvi_lag12).4 -0.130 -0.00150 0.120 1.00 635
## s(ndvi_lag12).5 -0.063 0.00085 0.067 1.00 642
## s(ndvi_lag12).6 -0.110 -0.00100 0.110 1.00 644
## s(ndvi_lag12).7 -0.670 -0.00460 0.660 1.00 625
## s(ndvi_lag12).8 -0.310 -0.11000 0.095 1.00 937
## gp(time).1 -12.000 -0.25000 8.500 1.04 85
## gp(time).2 -7.000 0.14000 8.800 1.01 364
## gp(time).3 -7.300 0.81000 10.000 1.00 1308
## gp(time).4 -6.400 0.20000 6.900 1.01 1115
## gp(time).5 -9.000 -0.90000 5.200 1.00 1240
## gp(time).6 -8.500 -0.94000 6.000 1.00 586
## gp(time).7 -6.400 0.56000 6.600 1.00 2194
## gp(time).8 -6.100 2.50000 8.800 1.01 269
## gp(time).9 -6.300 -0.16000 8.600 1.01 295
## gp(time).10 -11.000 -3.90000 2.600 1.00 1833
## gp(time).11 -5.800 0.18000 7.900 1.01 299
## gp(time).12 -2.400 3.50000 9.400 1.00 2436
## gp(time).13 -5.800 0.42000 6.100 1.00 2156
## gp(time).14 -7.700 -0.98000 4.800 1.00 2343
## gp(time).15 -8.900 -2.60000 3.500 1.00 490
## gp(time).16 -7.800 -2.00000 3.400 1.00 1403
## gp(time).17 -1.400 3.60000 11.000 1.00 1600
## gp(time).18 -2.500 1.90000 8.600 1.00 877
## gp(time).19 -5.000 -1.10000 3.300 1.00 2149
## gp(time).20 -0.840 2.90000 8.200 1.00 1390
##
## GAM gp term marginal deviation (alpha) and length scale (rho) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## alpha_gp(time) 0.57 1.2 2.7 1.01 363
## rho_gp(time) 1.60 4.6 16.0 1.03 135
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(ndvi_lag12) 0.88 8 2.68 0.99
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 23 of 2000 iterations ended with a divergence (1.15%)
## *Try running with larger adapt_delta to remove the divergences
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 27 3:44:30 PM 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
And plot conditional effects:
conditional_effects(model5)
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
As we saw in Tutorial 1, it can also be useful to understand first derivatives of smooth functions to get a sense of when (in history) they were increasing or decreasing. We can view the first derivative estimates for this model’s smooth trend using the following command:
plot_slopes(model5, variable = 'time', condition = 'time',
type = 'link') +
labs(y = 'First derivative (slope) of linear predictor') +
geom_hline(yintercept = 0, linetype = 'dashed')
These two models use slightly different priors and parameterisations for the smooth functions, but they should be fairly equivalent. We can check that by looking at their LOO diagnostics
loo(model4)
##
## Computed from 2000 by 134 log-likelihood matrix
##
## Estimate SE
## elpd_loo -469.0 14.1
## p_loo 12.1 1.9
## looic 938.1 28.3
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 133 99.3% 236
## (0.5, 0.7] (ok) 1 0.7% 232
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
loo(model5)
##
## Computed from 2000 by 134 log-likelihood matrix
##
## Estimate SE
## elpd_loo -468.4 14.0
## p_loo 12.5 1.9
## looic 936.8 28.1
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
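Since both models were fit to the same 134 observations, we can also compare them head-to-head with loo_compare(), which ranks models by their expected log predictive densities (a sketch; the exact elpd difference will vary from run to run):
loo_compare(loo(model4), loo(model5))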
Of course we can also look at forecasts from the mvgam
model:
plot(model5, type = 'forecast')
## Out of sample DRPS:
## 307.72798425
These look much better than the spline extrapolations we saw at the beginning of this tutorial. We can also see how the GP is extrapolating on the link scale
plot_predictions(model5, by = 'time',
newdata = datagrid(time = 1:max(data_test$time)),
type = 'link') +
geom_vline(xintercept = max(data_train$time),
linetype = 'dashed')
This is much better extrapolation behaviour compared to splines.

Another big advantage of GPs is that their parameters are more directly interpretable. For instance, the length scale of a GP is related to the stationarity (or stability) of the system. We can plot how covariance is expected to change over time to understand this stability. Those of you who have worked with variograms or semi-variograms before when analysing spatial data will be used to visualizing how semivariance decays over distance. Here we will visualize how covariance decays over time differences (i.e. how far apart in time do two observations need to be before the GP's estimate at one no longer depends on the other?). We can define a few helper functions to generate these plots, which will accept posterior estimates of GP parameters from a fitted mvgam object, compute GP covariance kernels, calculate posterior quantiles and then plot them:
R
function
plot_kernels
, used to plot GP kernels
# Compute the covariance kernel for a given draw
# of GP alpha and rho parameters
quad_kernel = function(rho, alpha, times){
covs <- alpha ^ 2 *
exp(-0.5 * ((times / rho) ^ 2))
return(covs)
}
# Compute kernels for each posterior draw
plot_kernels = function(gp_ests, max_time = 50){
# set up an empty matrix to store the kernel estimates
# for each posterior draw
draw_seq <- seq(0, max_time, length.out = 100)
kernels <- matrix(NA, nrow = NROW(gp_ests), ncol = 100)
# use a for-loop to estimate the kernel for each draw using
# our custom quad_kernel() function
for(i in 1:NROW(gp_ests)){
kernels[i,] <- quad_kernel(rho = gp_ests$`rho_gp_time_`[i],
alpha = gp_ests$`alpha_gp_time_`[i],
times = draw_seq)
}
# Calculate posterior empirical quantiles of the kernels
probs <- c(0.05, 0.2, 0.5, 0.8, 0.95)
cred <- sapply(1:NCOL(kernels),
function(n) quantile(kernels[,n],
probs = probs))
# Plot the kernel uncertainty estimates
pred_vals <- draw_seq
plot(1, xlim = c(0, max_time), ylim = c(0, max(cred)), type = 'n',
xlab = 'Time difference', ylab = 'Covariance',
bty = 'L')
box(bty = 'L', lwd = 2)
polygon(c(pred_vals, rev(pred_vals)), c(cred[1,], rev(cred[5,])),
col = "#DCBCBC", border = NA)
polygon(c(pred_vals, rev(pred_vals)), c(cred[2,], rev(cred[4,])),
col = "#B97C7C", border = NA)
lines(pred_vals, cred[3,], col = "#7C0000", lwd = 2.5)
}
Extract posterior estimates of the GP parameters using the
as.data.frame
function that we’ve been using previously.
You can then plot the posterior kernel estimates for model 5 to
visualize how temporal error covariance is estimated to change as two
time points are further apart. To do this, all we need to supply is the
posterior estimates for the GP parameters in the form of a
data.frame
:
gp_ests <- as.data.frame(model5, variable = c('rho_gp_time_',
'alpha_gp_time_'))
plot_kernels(gp_ests = gp_ests, max_time = 40)
This can be helpful to understand how long the temporal ‘memory’ of a given GP is, which also relates to how smooth its underlying functions are.
But one potential downside is that the length scale parameter \(\rho\) is fundamentally challenging to
identify, particularly when estimating a latent Gaussian Process.
For both the mvgam
and brms
models, our
estimates of this parameter are somewhat wide:
gp_rhos <- rbind(data.frame(
rho = as.data.frame(model5, variable = 'rho_gp_time_')$`rho_gp_time_`,
model = 'mvgam'),
data.frame(
rho = as.data.frame(model4, variable = 'lscale_gptime')$lscale_gptime,
model = 'brms'))
ggplot(data = gp_rhos, aes(x = rho, fill = model)) +
geom_histogram(aes(y = after_stat(density)),
bins = 40, col = 'white') +
facet_grid(rows = 'model') +
theme_bw() +
theme(legend.position = 'none') +
scale_fill_manual(values = c('darkblue', 'darkred')) +
labs(x = expression(rho[GP]), y = 'Frequency') +
xlim(c(0,24))
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
This doesn’t cause any issues in our example, but sometimes this lack of identifiability can cause problems. Imagine a scenario where a very short length scale leads to overly wiggly functions that compete with other temporal processes we’d like to estimate. Often in practice it can be useful to place a fairly restrictive prior that imposes the kind of smoothness we might expect to see. If you are interested in this, I recommend you see this useful (and detailed) case study on GPs by Michael Betancourt.
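As an illustration of what such a restrictive prior could look like in brms, we might first inspect the default priors with get_prior() and then supply a more informative prior for the GP length scale (class lscale). The model4b name and the particular inverse-gamma values below are hypothetical choices for illustration only:
# inspect the default priors, including the GP length scale (class 'lscale')
get_prior(count ~ s(ndvi_lag12, k = 9) + mintemp +
            gp(time, c = 5/4, k = 20, scale = FALSE),
          family = negbinomial(),
          data = data_train)

# refit with a more restrictive length-scale prior that favours
# smoother, longer-memory GP functions
model4b <- brm(count ~
                 s(ndvi_lag12, k = 9) +
                 mintemp +
                 gp(time, c = 5/4, k = 20, scale = FALSE),
               family = negbinomial(),
               prior = c(set_prior("normal(0, 2)", class = "b"),
                         set_prior("inv_gamma(4, 40)", class = "lscale")),
               data = data_train,
               backend = 'cmdstanr',
               iter = 1000,
               cores = 4)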
Exercises:
1. Fit a model that uses a penalized B-spline of time (bs = 'bs') and a Negative Binomial family in mvgam for comparisons. Plot the 1st derivative of this temporal spline and describe how (or if) it differs from that of model5
2. Compare the two models using loo_compare(). Is either model favoured over the other?

sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22621)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## time zone: Australia/Brisbane
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.5.0 tidybayes_3.0.6
## [3] mvgam_1.1.0 insight_0.19.7
## [5] marginaleffects_0.17.0.9002 brms_2.21.0
## [7] Rcpp_1.0.12 mgcv_1.8-42
## [9] nlme_3.1-162 dplyr_1.1.4
## [11] downloadthis_0.3.3
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 svUnit_1.0.6 farver_2.1.1
## [4] scoringRules_1.1.1 loo_2.6.0 fastmap_1.1.1
## [7] tensorA_0.36.2.1 digest_0.6.34 timechange_0.2.0
## [10] mime_0.12 lifecycle_1.0.4 StanHeaders_2.26.28
## [13] processx_3.8.3 magrittr_2.0.3 posterior_1.5.0
## [16] compiler_4.3.1 rlang_1.1.3 sass_0.4.8
## [19] tools_4.3.1 utf8_1.2.4 yaml_2.3.8
## [22] collapse_2.0.7 data.table_1.14.10 knitr_1.45
## [25] labeling_0.4.3 bridgesampling_1.1-2 pkgbuild_1.4.3
## [28] plyr_1.8.9 cmdstanr_0.7.1 abind_1.4-5
## [31] klippy_0.0.0.9500 withr_3.0.0 purrr_1.0.2
## [34] grid_4.3.1 stats4_4.3.1 fansi_1.0.6
## [37] colorspace_2.1-0 inline_0.3.19 scales_1.3.0
## [40] MASS_7.3-60 cli_3.6.2 mvtnorm_1.2-4
## [43] rmarkdown_2.25 generics_0.1.3 RcppParallel_5.1.7
## [46] rstudioapi_0.15.0 reshape2_1.4.4 cachem_1.0.8
## [49] rstan_2.32.3 stringr_1.5.1 splines_4.3.1
## [52] bayesplot_1.10.0 assertthat_0.2.1 parallel_4.3.1
## [55] matrixStats_1.2.0 base64enc_0.1-3 vctrs_0.6.5
## [58] Matrix_1.6-4 jsonlite_1.8.8 arrayhelpers_1.1-0
## [61] ggdist_3.3.1 jquerylib_0.1.4 tidyr_1.3.1
## [64] glue_1.7.0 ps_1.7.5 codetools_0.2-19
## [67] distributional_0.3.2 lubridate_1.9.3 stringi_1.8.3
## [70] gtable_0.3.4 QuickJSR_1.0.9 munsell_0.5.0
## [73] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.7
## [76] Brobdingnag_1.2-9 R6_2.5.1 evaluate_0.23
## [79] lattice_0.21-8 highr_0.10 backports_1.4.1
## [82] bslib_0.6.1 rstantools_2.3.1.1 bsplus_0.1.4
## [85] zip_2.3.0 coda_0.19-4.1 gridExtra_2.3
## [88] checkmate_2.3.1 xfun_0.42 fs_1.6.3
## [91] pkgconfig_2.0.3