The **ref.ICAR** package performs objective Bayesian
analysis using a reference prior proposed by Keefe et al. (2019). The package provides an approach for
modeling spatially correlated areal data, using an intrinsic conditional
autoregressive (ICAR) component on a vector of spatial random effects
with a reference prior for all model parameters. Ferreira et al. (2021) developed faster MCMC sampling for the
ICAR model with the reference prior, and Porter et al. (2023) developed objective Bayesian model
selection for the model based on fractional Bayes factors.

**ref.ICAR** can be used to analyze areal data
over a contiguous region, given a shapefile or
neighborhood matrix together with the data. The functions implemented by
**ref.ICAR** are summarized below.

`shape.H()`

Takes a file path to a shapefile and forms the appropriate neighborhood matrix for analysis with the ICAR reference prior model.

`ref.MCMC()`

Implements the posterior sampling algorithm proposed by Keefe et al. (2019). Generates MCMC chains for parameter and regional inferences.

`ref.summary()`

Provides posterior inferences for the model parameters, \(\tau\), \(\boldsymbol{\beta}\), and \(\sigma^2\). Includes posterior medians, highest posterior density (HPD) intervals, and acceptance rates for each parameter.

`reg.summary()`

Provides fitted posterior values and summaries for each subregion in the areal data set. Includes posterior medians and HPD intervals by region.

`ref.plot()`

Outputs trace plots for the model parameters, \(\tau\), \(\boldsymbol{\beta}\), and \(\sigma^2\).

`ref.analysis()`

Performs an analysis by sequentially implementing each of the functions above. This function produces plots and a list containing MCMC chains, parameter estimates, regional estimates, and sampling acceptance rates.

`probs.icar()`

Performs simultaneous, objective Bayesian model selection for covariates and spatial model structure for areal data. This function provides posterior model probabilities for all candidate ICAR models and ordinary linear models (OLMs).

The model implemented by **ref.ICAR** is summarized
below. \[\begin{equation}
\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\theta} + \boldsymbol{\phi}
\end{equation}\]

where

\(\mathbf{Y}\) is an \(n\times1\) vector for the response variable, where \(n\) corresponds to the number of regions in the shapefile. The current version of the package does not allow for missing data.

\(X\) is a matrix of covariates. This can include a vector \(\textbf{1}_n\) for an intercept, and additional columns corresponding to quantitative predictors.

\(\boldsymbol{\beta}\) is the \(p\times1\) vector of fixed effect regression coefficients, where \(p\) corresponds to the number of columns in \(X\).

\(\boldsymbol{\theta}\) is an \(n\times1\) vector of independent and normally distributed unstructured random effects defined with mean 0 and variance \(\sigma^2\).

\(\boldsymbol{\phi}\) is an \(n\times1\) vector of spatial random effects that is assigned an intrinsic CAR prior with the sum-zero constraint \(\sum_{i=1}^n \phi_i=0\) (Keefe et al. 2018).

The model assumes a signal-to-noise ratio parameterization for the variance components of the random components of the model, so \(\sigma^2\) and \(\tau\) are used as below. \[\boldsymbol{\phi} \sim \mathrm{N}\bigg(\textbf{0},\frac{\sigma^2}{\tau}\Sigma_{\phi}\bigg)\]

The parameter \(\tau\) controls the strength of spatial dependence, and given the neighborhood structure, \(\Sigma_{\phi}\) is a fixed matrix. Specifically, \(\Sigma_{\phi}\) is the Moore-Penrose inverse of \(H\), where the neighborhood matrix \(H\) is an \(n\times n\) symmetric matrix constructed as follows. \[\begin{equation} (H)_{ij} = \begin{cases} h_i & \text{if } i=j \\ -g_{ij} & \text{if } i\in N_j \\ 0 & \text{otherwise}, \end{cases} \end{equation}\]

where \(g_{ij}=1\) if subregions \(i\) and \(j\) are neighbors, \(g_{ij}=0\) if subregions \(i\) and \(j\) are not neighbors, and \(h_i\) is the number of neighbors of subregion \(i\). Therefore, the neighborhood matrix \(H\) is an \(n\times n\) symmetric matrix whose diagonal elements equal the number of neighbors of each subregion in the data, and whose off-diagonal elements equal \(-1\) if the corresponding subregions are neighbors and \(0\) otherwise.
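As a small illustration (not part of the package), consider a hypothetical chain of three regions in which region 2 neighbors regions 1 and 3. The matrix \(H\) can be built directly from the adjacency matrix, and its Moore-Penrose inverse gives \(\Sigma_{\phi}\); the use of `MASS::ginv()` here is one convenient choice, not necessarily what the package does internally.

```r
# Hypothetical 3-region chain: region 2 neighbors regions 1 and 3.
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)  # adjacency matrix

# Diagonal = neighbor counts; off-diagonal = -1 for neighboring pairs
H <- diag(rowSums(W)) - W
H
#      [,1] [,2] [,3]
# [1,]    1   -1    0
# [2,]   -1    2   -1
# [3,]    0   -1    1

Sig_phi <- MASS::ginv(H)  # Moore-Penrose inverse of H gives Sigma_phi
```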

Provided a path to a shapefile, the `shape.H()` function
in **ref.ICAR** constructs \(H\) as specified above, and checks for
symmetry and contiguous regions (i.e., no islands) prior to analysis. The
functions `shape.H()` and `ref.analysis()`
require a file path to a shapefile. If a user wants to analyze areal
data without a corresponding shapefile (e.g., neuroimaging), they will
need to construct \(H\) as above and
use this \(H\) in `ref.MCMC()`.
`ref.plot()`, `ref.summary()`, and
`reg.summary()` can then be used with the MCMC chains
obtained from `ref.MCMC()`. Additionally, if a user performs
analysis without `ref.analysis()`, the regions corresponding
to data values in \(X\) and \(y\) must match the region order in \(H\); otherwise inferences will be matched
to incorrect regions.

Consider an example of areal data over the contiguous United States.
Figure 1 represents the average SAT scores reported in 1999 for each of
the contiguous United States and Washington D.C. This example will
explore these data and use the **ref.ICAR** package to fit
a model to the response, Verbal SAT scores, considering spatial
dependence and a single covariate, percent of eligible students that
took the SAT in each state in 1999. These data were analyzed in
*Hierarchical Modeling and Analysis for Spatial Data* (Banerjee et al. 2014). The data are available
online at https://www.counterpointstat.com/hierarchical-modeling-and-analysis-for-spatial-data.html.
We make them available in the **ref.ICAR** package with
permission from the authors. The shapefile is available at http://www.arcgis.com/home/item.html?id=f7f805eb65eb4ab787a0a3e1116ca7e5.

These data and the accompanying shapefile are attached to the
**ref.ICAR** package. The files can be loaded into R as
shown below. The `st_read()` function from the package
**sf** is used to read the shapefile.

```
library(sf)
system.path <- system.file("extdata", "us.shape48.shp", package = "ref.ICAR",
mustWork = TRUE)
shp.layer <- gsub(".shp", "", basename(system.path))
shp.path <- dirname(system.path)
us.shape48 <- st_read(dsn = path.expand(shp.path), layer = shp.layer,
quiet = TRUE)
```

The SAT data can be loaded into R from **ref.ICAR**
using `read.table()`.

```
library(utils)
data.path <- system.file("extdata", "states-sats48.txt", package = "ref.ICAR",
mustWork = TRUE)
sats48 <- read.table(data.path, header = TRUE)
us.shape48$verbal <- sats48$VERBAL
us.shape48$percent <- sats48$PERCENT
```

Now that the shapefile and data are loaded, the observed data can be plotted as a choropleth map (Figure 1). This map illustrates the spatial dependence to be analyzed by the model. The Midwestern states and Utah exhibit the highest average SAT scores, and overall, neighboring states have similar average scores.

```
library(ggplot2)
library(classInt)
library(dplyr)
breaks_qt <- classIntervals(c(min(us.shape48$verbal) - 1e-05,
us.shape48$verbal), n = 7, style = "quantile")
us.shape48_sf <- mutate(us.shape48, score_cat = cut(verbal, breaks_qt$brks))
ggplot(us.shape48_sf) + geom_sf(aes(fill = score_cat)) + scale_fill_brewer(palette = "OrRd") +
labs(title = "Plot of observed \n verbal SAT scores") + theme_bw() +
theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank(),
axis.title = element_text(size = 25, face = "bold"),
plot.title = element_text(face = "bold", size = 25, hjust = 0.5)) +
guides(fill = guide_legend("Verbal score"))
```

Similarly, the covariate, percent of eligible students taking the SAT, can be plotted over the contiguous United States. These data exhibit a seemingly inverse relationship to the SAT scores; lower percentages of students take the SAT in the Midwest.

```
breaks_qt <- classIntervals(c(min(us.shape48$percent) - 1e-05,
us.shape48$percent), n = 7, style = "quantile")
us.shape48_sf <- mutate(us.shape48, pct_cat = cut(percent, breaks_qt$brks))
ggplot(us.shape48_sf) + geom_sf(aes(fill = pct_cat)) + scale_fill_brewer(palette = "OrRd") +
labs(title = "Plot of observed \n percent SAT takers") +
theme_bw() + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank(),
axis.title = element_text(size = 25, face = "bold"), plot.title = element_text(face = "bold",
size = 25, hjust = 0.5)) + guides(fill = guide_legend("Percent taking"))
```

Employing the functions in **ref.ICAR**, the
`shape.H()` function first takes the path to the shapefile
(obtained above) and returns a list of two objects. This list contains
the neighborhood matrix \(H\) and an `sf`
object corresponding to the shapefile, to be used by the remaining
functions.
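A call along the following lines constructs the neighborhood matrix; the list component names `H` and `map` are taken from the package documentation, so treat this as a sketch rather than verbatim vignette code.

```r
# Construct the neighborhood matrix from the shapefile path obtained above;
# shape.H() returns a list containing H and the map object read from the shapefile
hmat.results <- shape.H(system.path)
H <- hmat.results$H
```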

```
## Reading layer `us.shape48' from data source
## `/private/var/folders/zf/nsyh0fy90678j3c41vf9hg7r0000gn/T/RtmpPB1YHQ/Rinstc23461c7c44a/ref.ICAR/extdata'
## using driver `ESRI Shapefile'
## Simple feature collection with 49 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7328 ymin: 24.95638 xmax: -66.96927 ymax: 49.37173
## Geodetic CRS: Clarke 1866
```


The response and covariates, \(Y\) and \(X\), must be defined before fitting the model. The response, \(Y\), is the Verbal SAT score. \(X\) has two columns corresponding to an intercept and the predictor, percent of eligible students taking the SAT in 1999.
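A minimal way to define these objects from the data loaded above (a sketch; the vignette does not show this step explicitly):

```r
# Response: verbal SAT scores; design matrix: intercept and percent taking the SAT
Y <- us.shape48$verbal
X <- cbind(1, us.shape48$percent)
```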

Then sampling can be performed using `ref.MCMC()`. The
default starting values are used below, with more MCMC iterations and a
longer burn-in than the defaults. The sampling in `ref.MCMC()` is
based on developments by Ferreira et al. (2021), who express the spatial hierarchical
model in the spectral domain to obtain the faster Spectral Gibbs Sampler
(SGS). Previous versions of **ref.ICAR** implemented the
Spectral Decomposition of the Precision (SDP) algorithm proposed by
Keefe et al. (2019). See Ferreira et al.
(2021) for an outline of the algorithm and
computational comparisons.

```
library(mvtnorm)
set.seed(3456)
ref.SAT <- ref.MCMC(y = Y, X = X, H = H, iters = 15000, burnin = 5000,
verbose = FALSE)
names(ref.SAT)
```

```
## [1] "MCMCchain" "tauc.MCMC" "sigma2.MCMC" "beta.MCMC"
## [5] "phi.MCMC" "accept.phi" "accept.sigma2" "accept.tauc"
```

The object `ref.SAT` contains MCMC chains for each of the parameters in
the model \(\mathbf{Y}= X\boldsymbol{\beta}+\boldsymbol{\theta}+\boldsymbol{\phi}\),
using the signal-to-noise ratio parameterization. From these, the function
`ref.plot()` creates trace plots for each parameter so that
convergence can be assessed visually.
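A call to `ref.plot()` along the following lines produces the trace plots; the argument names here follow the package documentation and should be treated as a sketch.

```r
# Trace plots for tau, beta, and sigma^2 from the sampled chains
ref.plot(ref.SAT$MCMCchain, X, burnin = 5000, num.reg = length(Y))
```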

The remaining components of the analysis are the functions for
parameter and regional inferences. The function
`ref.summary()` provides posterior medians and intervals for
the model parameters \(\boldsymbol{\beta}\), \(\tau\), and \(\sigma^2\). The function
`reg.summary()` provides medians and highest posterior
density intervals for the fitted \(y\)
values for each subregion in the data.

```
library(coda)
summary.params <- ref.summary(MCMCchain = ref.SAT$MCMCchain,
tauc.MCMC = ref.SAT$tauc.MCMC, sigma2.MCMC = ref.SAT$sigma2.MCMC,
beta.MCMC = ref.SAT$beta.MCMC, phi.MCMC = ref.SAT$phi.MCMC,
accept.phi = ref.SAT$accept.phi, accept.sigma2 = ref.SAT$accept.sigma2,
accept.tauc = ref.SAT$accept.tauc, iters = 15000, burnin = 5000)
names(summary.params)
```

```
## [1] "beta.median" "beta.hpd" "tauc.median" "tauc.hpd"
## [5] "sigma2.median" "sigma2.hpd" "tauc.accept" "sigma2.accept"
```

```
## $beta.median
## [1] 575.461382 -1.142777
##
## $beta.hpd
## lower upper
## var1 568.028129 582.5163300
## var2 -1.334464 -0.9487258
##
## $tauc.median
## [1] 0.08581395
##
## $tauc.hpd
## lower upper
## 0.0003402439 0.5212723440
##
## $sigma2.median
## [1] 31.2014
##
## $sigma2.hpd
## lower upper
## 0.1604447 96.0084162
##
## $tauc.accept
## [1] 0.3436667
##
## $sigma2.accept
## [1] 0.3436667
```

The posterior medians for \(\beta_0\) and \(\beta_1\) are 575.461 and \(-1.143\), respectively. Additionally, the HPD interval for \(\beta_1\) does not include \(0\), which indicates that as the percent of eligible students taking the SAT increases, the average Verbal SAT score tends to decrease. The posterior median of \(\tau\) is 0.086, with HPD interval from 0.0003 to 0.5213.

```
summary.region <- reg.summary(ref.SAT$MCMCchain, X, Y, burnin = 5000)
us.shape48$verbalfits <- summary.region$reg.medians
breaks_qt <- classIntervals(c(min(us.shape48$verbalfits) - 1e-05,
us.shape48$verbalfits), n = 7, style = "quantile")
us.shape48_sf <- mutate(us.shape48, reg_cat = cut(verbalfits,
breaks_qt$brks))
ggplot(us.shape48_sf) + geom_sf(aes(fill = reg_cat)) + scale_fill_brewer(palette = "OrRd") +
labs(title = "Plot of fitted \n verbal SAT scores") + theme_bw() +
theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank(),
axis.title = element_text(size = 25, face = "bold"),
plot.title = element_text(face = "bold", size = 25, hjust = 0.5)) +
guides(fill = guide_legend("Region medians"))
```

Finally, the function `ref.analysis()` in
**ref.ICAR** performs the entire reference analysis,
including:

- obtaining the neighborhood matrix, \(H\),
- running the MCMC chains,
- producing parameter trace plots,
- producing parameter summaries, and
- producing regional summaries.

`ref.analysis()` requires the following user inputs: \(X\), \(y\), a path to a shapefile, a vector of
region names corresponding to the values in \(X\), and a vector of region names
corresponding to the values in the response \(y\). The region names for \(X\) and \(y\) must match and are required because
`ref.analysis()` reorders the data according to the region
order in the shapefile. This ensures that the data values match the
correct entries in the neighborhood matrix \(H\); otherwise the analysis might map predicted
values to incorrect regions. If the provided shapefile does not have a
specified NAME column, the user will be asked to also provide a vector
of names corresponding to the shapefile. This vector is called `shp.reg.names` in the
documentation and function arguments; its default value is `NULL`.

```
### The SAT scores and percent of students are already
### arranged by state alphabetically
x.reg.names <- us.shape48$NAME
y.reg.names <- us.shape48$NAME
set.seed(3456)
par(mfrow = c(2, 2))
sat.analysis <- ref.analysis(system.path, X, Y, x.reg.names,
y.reg.names, shp.reg.names = NULL, iters = 15000, burnin = 5000,
verbose = FALSE, tauc.start = 0.1, beta.start = -1, sigma2.start = 0.1,
step.tauc = 0.5, step.sigma2 = 0.5)
```

```
## Reading layer `us.shape48' from data source
## `/private/var/folders/zf/nsyh0fy90678j3c41vf9hg7r0000gn/T/RtmpPB1YHQ/Rinstc23461c7c44a/ref.ICAR/extdata'
## using driver `ESRI Shapefile'
## Simple feature collection with 49 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7328 ymin: 24.95638 xmax: -66.96927 ymax: 49.37173
## Geodetic CRS: Clarke 1866
```

```
## [1] "H" "MCMC" "beta.median" "beta.hpd"
## [5] "tauc.median" "tauc.hpd" "sigma2.median" "sigma2.hpd"
## [9] "tauc.accept" "sigma2.accept" "fit.dist" "reg.medians"
## [13] "reg.hpd"
```

Porter et al. (2023) developed
objective Bayesian model selection for simultaneous selection of
covariates and spatial model structure for areal data. Since the joint
reference prior on model parameters is improper (Keefe et al. 2019), fractional Bayes factor
methodology is used to approximate Bayes factors and obtain valid
posterior model probabilities for all candidate ICAR models and OLMs
from the provided candidate covariates. See Porter et al. (2023) for the method details and simulation
results, including the minimal training size for the fractional Bayes
factor that is recommended for this approach. The following is a code
example that uses case study data seen in Porter et al. (2023). The data are available with the
**spdep** package, which is imported by
**ref.ICAR**. The outcome of interest is the residential
crime rate across the 49 neighborhoods of Columbus, Ohio. The five
candidate predictors include average housing value, average household
income, amount of open space in each neighborhood, the number of housing
units without available plumbing, and distance from the Columbus
business district. Reading the data into R using `st_read()`
creates an object that includes the response variable and the
candidate covariates.

```
library(pracma)
library(stats)
library(spdep)
# read in the data as contained in the spdep package
columbus <- st_read(system.file("shapes/columbus.shp", package = "spData")[1],
quiet = TRUE)
```

As in the previous example, we can plot the response variable, residential crime rate, over the geographic region to visualize which of the 49 neighborhoods have the highest observed crime rates.

```
breaks_qt <- classIntervals(c(min(columbus$CRIME) - 1e-05, columbus$CRIME),
n = 7, style = "quantile")
columbus_sf <- mutate(columbus, crime_cat = cut(CRIME, breaks_qt$brks))
ggplot(columbus_sf) + geom_sf(aes(fill = crime_cat)) + scale_fill_brewer(palette = "OrRd") +
labs(title = "Plot of observed \n neighborhood crime rates") +
theme_bw() + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank(),
axis.title = element_text(size = 25, face = "bold"), plot.title = element_text(face = "bold",
size = 25, hjust = 0.5)) + guides(fill = guide_legend("Neighborhood \n crime rate"))
```

Upon reading in the data, we can begin to fit each of the candidate models for the data based on combinations of the covariates and whether or not the model contains ICAR random effects. We consider all possible OLMs and ICAR models from the 5 candidate covariates, resulting in a model space of size \(2 \times 2^5=64\). Each of the candidate ICAR models uses the same neighborhood matrix \(H\), based on the 49 subregions, which we can define as follows.
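The neighbor structure summarized below can be obtained with `poly2nb()` from **spdep**; the object name `columbus.listw` matches the code that follows, and the default queen contiguity is assumed here.

```r
# Build the neighbor list for the 49 Columbus neighborhoods
columbus.listw <- poly2nb(columbus)  # queen contiguity by default
summary(columbus.listw)
```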

```
## Neighbour list object:
## Number of regions: 49
## Number of nonzero links: 236
## Percentage nonzero weights: 9.829238
## Average number of links: 4.816327
## Link number distribution:
##
## 2 3 4 5 6 7 8 9 10
## 5 9 12 5 9 3 4 1 1
## 5 least connected regions:
## 1 6 42 46 47 with 2 links
## 1 most connected region:
## 20 with 10 links
```

```
W <- nb2mat(columbus.listw, style = "B")
Dmat <- diag(apply(W, 1, sum))
num.reg <- length(columbus$CRIME)
H <- Dmat - W
H <- (H + t(H))/2
rownames(H) <- NULL
isSymmetric(H)  # check that the neighborhood matrix is symmetric before proceeding
```

`## [1] TRUE`

```
# spectral quantities for use in model selection
H.spectral <- eigen(H, symmetric = TRUE)
Q <- H.spectral$vectors
eigH <- H.spectral$values
phimat <- diag(1/sqrt(eigH[1:(num.reg - 1)]))
Sig_phi <- matrix(0, num.reg, num.reg)  # initialize
for (i in 1:(num.reg - 1)) {
total <- (1/(eigH[i])) * Q[, i] %*% t(Q[, i])
Sig_phi <- Sig_phi + total
}
# define response and design matrix
Y <- columbus$CRIME
X <- cbind(1, columbus$HOVAL, columbus$INC, columbus$OPEN,
columbus$PLUMB, columbus$DISCBD)
b <- (ncol(X) + 1)/num.reg # specify the minimal training size for this example
# perform model selection
columbus.select <- probs.icar(Y = Y, X = X, H = H, H.spectral = H.spectral,
Sig_phi = Sig_phi, b = b, verbose = FALSE)
# print the model with highest posterior model
# probability
columbus.select$probs.mat[which.max(columbus.select$probs.mat[,
1]), ]
```

```
## model prob model type model form
## 19 0.1193955 Independent Y ~ Intercept + X1 + X2 + X5
```

```
# print vector of posterior inclusion probabilities for
# each covariate
post.include.cov <- matrix(NA, nrow = 1, ncol = ncol(X) - 1)
labels <- c(rep(NA, ncol(X) - 1))
for (i in 1:(ncol(X) - 1)) {
labels[i] <- paste("X", i, sep = "")
}
colnames(post.include.cov) <- labels
for (j in 1:(ncol(X) - 1)) {
post.include.cov[, j] <- sum(columbus.select$probs.mat[grep(paste("X",
j, sep = ""), columbus.select$probs.mat$"model form"),
1])
}
post.include.cov
```

```
## X1 X2 X3 X4 X5
## [1,] 0.7454222 0.9238743 0.3009956 0.4312049 0.9273156
```

As an extension of the objective Bayesian model selection for spatial
ICAR models, the R package **GLMMselect** uses fractional
Bayes factor methodology to simultaneously select fixed effects and
random effects in generalized linear mixed models (GLMMs) in which the
covariance structure of the random effects is the product of an unknown
scalar and a known positive semidefinite matrix.
**GLMMselect** (https://CRAN.R-project.org/package=GLMMselect) can
currently be used for model selection for Poisson and Bernoulli data,
based on the methodology in Xu et al. (2023).

Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014), *Hierarchical
modeling and analysis for spatial data, second edition*, Chapman;
Hall/CRC. https://doi.org/10.1201/b17115.

Ferreira, M. A. R., Porter, E. M., and Franck, C. T. (2021), “Fast
and scalable computations for Gaussian hierarchical models
with intrinsic conditional autoregressive spatial random
effects,” *Computational Statistics and Data Analysis*,
162, 107264. https://doi.org/10.1016/j.csda.2021.107264.

Keefe, M. J., Ferreira, M. A. R., and Franck, C. T. (2018), “On
the formal specification of sum-zero constrained intrinsic conditional
autoregressive models,” *Spatial Statistics*, Elsevier
BV, 24, 54–65. https://doi.org/10.1016/j.spasta.2018.03.007.

Keefe, M. J., Ferreira, M. A. R., and Franck, C. T. (2019),
“Objective Bayesian analysis for
Gaussian hierarchical models with intrinsic conditional
autoregressive priors,” *Bayesian Analysis*, International
Society for Bayesian Analysis, 14, 181–209. https://doi.org/10.1214/18-BA1107.

Porter, E. M., Franck, C. T., and Ferreira, M. A. R. (2023),
“Objective Bayesian model selection for spatial
hierarchical models with intrinsic conditional autoregressive
priors,” *Bayesian Analysis*, International Society for
Bayesian Analysis, 1, 1–27. https://doi.org/10.1214/23-BA1375.

Xu, S., Ferreira, M. A. R., Porter, E. M., and Franck, C. T. (2023),
“Bayesian model selection for generalized linear mixed
models,” *Biometrics*. https://doi.org/10.1111/biom.13896.