Specify correlated residual processes in mvgam

Set up latent correlated multivariate Gaussian residual processes in mvgam. This function does not evaluate its arguments – it exists purely to help set up a model with particular error processes

Usage

ZMVN(unit = time, gr = NA, subgr = series)

Arguments

unit

The unquoted name of the variable that represents the unit of analysis in data over which latent residuals should be correlated. This variable should be either a numeric or integer variable in the supplied data. Defaults to time to be consistent with other functionalities in mvgam, though note that the data need not be time series in this case. See examples below for further details and explanations

gr

An optional grouping variable, which must be a factor in the supplied data, for setting up hierarchical residual correlation structures. If specified, this will automatically set up a model where the residual correlations for a specific level of gr are modelled hierarchically:

\(\Omega_{group} = p\Omega_{global} + (1 - p)\Omega_{group, local}\),

where \(\Omega_{global}\) is a global correlation matrix, \(\Omega_{group, local}\) is a local deviation correlation matrix, and \(p\) is a weighting parameter controlling how strongly the local correlation matrix \(\Omega_{group}\) is shrunk towards the global correlation matrix \(\Omega_{global}\). If gr is supplied then subgr must also be supplied

subgr

A subgrouping factor variable specifying which element in data represents the different observational units. Defaults to series to be consistent with other functionalities in mvgam, though note that the data need not be time series in this case

Models that use the hierarchical correlations (by supplying a value for gr) should not include a series element in data. Rather, this element will be created internally based on the supplied variables for gr and subgr

For example, if you are modelling counts for a group of species (labelled as species in the data) across sampling sites (labelled as site in the data) in three different geographical regions (labelled as region), and you would like the residuals to be correlated within regions, then you should specify unit = site, gr = region, and subgr = species

Internally, mvgam() will appropriately order the data by unit (in this case, by site) and create the series element for the data using something like:

series = as.factor(paste0(group, '_', subgroup))

Value

An object of class mvgam_trend, which contains a list of arguments to be interpreted by the parsing functions in mvgam

Examples

# \donttest{
# Simulate counts of four species over ten sampling locations
site_dat <- data.frame(
  site = rep(1:10, 4),
  species = as.factor(sort(rep(letters[1:4], 10))),
  y = c(NA, rpois(39, 3))
)
head(site_dat)
#>   site species  y
#> 1    1       a NA
#> 2    2       a  4
#> 3    3       a  4
#> 4    4       a  1
#> 5    5       a  1
#> 6    6       a  2

# Set up a correlated residual (i.e. Joint Species Distribution) model
trend_model <- ZMVN(unit = site, subgr = species)
mod <- mvgam(
  y ~ species,
  trend_model = ZMVN(unit = site, subgr = species),
  data = site_dat,
  chains = 2,
  silent = 2
)

# Inspect the estimated species-species residual covariances
mcmc_plot(mod, variable = 'Sigma', regex = TRUE, type = 'hist')
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


# A hierarchical correlation example
Sigma <- matrix(
  c(1, -0.4, 0.5,
    -0.4, 1, 0.3,
    0.5, 0.3, 1),
  byrow = TRUE,
  nrow = 3
)

make_site_dat <- function(...) {
  errors <- mgcv::rmvn(
    n = 30,
    mu = c(0.6, 0.8, 1.8),
    V = Sigma
  )
  site_dat <- do.call(rbind, lapply(1:3, function(spec) {
    data.frame(
      y = rpois(30, lambda = exp(errors[, spec])),
      species = paste0('species', spec),
      site = 1:30
    )
  }))
  site_dat
}

site_dat <- rbind(
  make_site_dat() %>%
    dplyr::mutate(group = 'group1'),
  make_site_dat() %>%
    dplyr::mutate(group = 'group2')
) %>%
  dplyr::mutate(
    species = as.factor(species),
    group = as.factor(group)
  )

# Fit the hierarchical correlated residual model
mod <- mvgam(
  y ~ species,
  trend_model = ZMVN(unit = site, gr = group, subgr = species),
  data = site_dat
)
#> Compiling Stan program using cmdstanr
#> 
#> Start sampling
#> Running MCMC with 4 chains, at most 3 in parallel...
#> 
#> Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 2 finished in 2.3 seconds.
#> Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
#> Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 1 finished in 3.5 seconds.
#> Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
#> Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
#> Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
#> Chain 3 finished in 3.9 seconds.
#> Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
#> Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
#> Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
#> Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
#> Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
#> Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
#> Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
#> Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
#> Chain 4 finished in 2.9 seconds.
#> 
#> All 4 chains finished successfully.
#> Mean chain execution time: 3.2 seconds.
#> Total execution time: 5.5 seconds.
#> 

# Inspect the estimated species-species residual covariances
mcmc_plot(mod, variable = 'Sigma', regex = TRUE, type = 'hist')
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# }