The MiRKAT package (v1.1.0) contains functions that test global associations between the microbiota and different types of phenotypes, such as univariate continuous or binary phenotypes, survival (censored time-to-event) outcomes, longitudinal data, multivariate, and structured phenotypes. For all these effects, the microbiome community effect is modeled nonparametrically through a kernel function, which can incorporate the phylogenetic tree information.

Four additional functions have beeen added to the package in version 1.1.0. They are MiRKAT.R (robust linear regression), MiRKAT.Q (quantile regression), GLMMMiRKAT (for correlated outcome data), and CSKAT (an alternative to GLMMMiRKAT for correlated Gaussian outcomes, utilizing the Davies approximation to calculate the p-value computationally). In addition, measures of effect size (the KRV statistic and R-squared) have been added to most functions.

The following packages are required for functions and examples in the MiRKAT package: MASS, GUniFrac, CompQuadForm, quantreg, PearsonDS, propr, lme4, Matrix, permute, survival, and stats. All of these required packages are available on CRAN.

We begin by demonstrating the use of the Microbiome Regression-Based Kernel Association Test (MiRKAT) for binary and continuous traits.

We use the throat microbiome data (Charlson et al 2010) from the package GUniFrac to demonstrate our methods. Data are available for 60 subjects, of whom 28 were smokers and 32 were nonsmokers. Microbiome data were collected from right and left nasopharynx and oropharynx regions to form an OTU table with 856 OTUs. We want to evaluate whether smoking is associated with differences in microbiome composition in the upper respiratory tract, taking into consideration additional covariates including gender and antibiotic use within 3 months. We also use simulated data based on the throat dataset to demonstrate the usage of MiRKAT-S, MMiRKAT, MiRKAT.R, MiRKAT.Q, and KRV.

```
# Note: may want to rarefy the data to the minimum sequencing depth for unweighted (presence/absence) distances
## otu.tab.rff <- Rarefy(throat.otu.tab)$otu.tab.rff
unifracs <- GUniFrac(throat.otu.tab, throat.tree, alpha = c(0, 0.5, 1))$unifracs
D.weighted = unifracs[,,"d_1"]
D.unweighted = unifracs[,,"d_UW"]
D.generalized = unifracs[,,"d_0.5"]
D.BC = as.matrix(vegdist(throat.otu.tab, method="bray"))
```

The D2K function in MiRKAT converts distance matrices to kernel matrices via the transformation \[K = -\frac{1}{2} \left(I - \frac{1 1'}{n}\right) D^2 \left(I - \frac{1 1'}{n}\right)\] Here, \(I\) is the identity matrix and \(1\) is an \(n\)-vector of ones. To ensure that \(K\) is positive semi-definite, we replace negative eigenvalues with zero. That is, we perform an eigenvalue decomposition \(K = U \Lambda U\), where \(\Lambda = \text{diag}(\lambda_1,...,\lambda_n)\), and then reconstruct the kernel matrix using the non-negative eigenvalues \(\Lambda^* = \text{diag}(\text{max}(\lambda_1,0),...,\text{max}(\lambda_n,0))\) so that \(K = U \Lambda^* U\).

```
MiRKAT(y = Smoker, X = covar, Ks = K.weighted, out_type = "D",
method = "davies", returnKRV = TRUE, returnR2 = TRUE)
```

```
## $indivP
## [1] 0.005306786
##
## $KRV
## [1] 0.002884312
##
## $R2
## [1] 0.01757917
```

“Method” indicates which method the function should use to compute the **kernel-specific p-value**. “davies” represents an exact method that computes the p-value by inverting the characteristic function of the mixture chi-square distribution. We adopt an exact variance component test because most of the studies concerning microbiome compositions have modest sample size. “permutation” represents a residual permutation approach. “moment” represents an approximation method that matches the first two moments. When out_type=“C” (continuous outcome y), the “moment” method is the Satterthwaite approximation. When out_type = “D” (dichotomous outcome), the “moment” method is the small sample adjustment in Lee et al (2012). When sample size is modest (\(n<100\) for continuous or \(n<200\) for dichotomous outcome), the “moment” method can be inflated at very small size (such as \(\alpha\) = 0.001), although the type I error at \(\alpha\) = 0.05 is usually sustained. Therefore, we suggest using the Davies or permutation approaches for such situations.

If returnKRV = TRUE and/or returnR2 = TRUE, the function also returns measures of effect size KRV and \(R^2\) (respectively). The KRV statistic is as described in However, we caution that these measures of effect size remain very small (on the order of 1e-4 or smaller for the KRV statistic, and generally \(<0.1\) for R-squared) even when the majority of the variability in the outcome is explainable by the microbiome. Hence this should be used for internal comparisons rather than external discussions of, e.g., percent variability explained.

How to choose an appropriate distance matrix and kernel for testing is a difficult question. However, it is important, since the distance matrix used to generate the kernel strongly affects the power of our tests. In particular, MiRKAT has highest power when the form of association between the microbiota and the outcome assumed by the kernel matches the true form of association. Poor choice of kernel will lead to reduced power, although the type I error will be preserved.

In the case of the UniFrac families and the Bray-Curtis dissimilarity, the factors at play are (1) the abundance of the associated taxa and (2) whether closely related taxa (phylogenetically) tend to be related or not related to the outcome as a group. For example, the following are some of the distance metrics that have been used for studies of the microbiome:

Distance | Phylogeny? | Abundance? | Other notes | References |

Unweighted UniFrac | Yes | No | 1 | |

Weighted UniFrac | Yes | Yes | 2 | |

Generalized UniFrac | Yes | (Yes) | Parameter alpha defines extent to which abundance is taken into account | 3 |

Jaccard | No | No | 1 - (taxa in both)/(taxa in either); typically presence/absence, but can be extended to an abundance-weighted version | 4,5 |

Bray-Curtis | No | Yes | Similar to Jaccard, but uses counts | 6 |

In the table above, “Yes” indicates the distance or dissimilarity metric has the feature; “(Yes)” indicates that the feature is present either in some variations of the metric or is present to some extent; and “No” indicates that the feature is not present.

Depending on which of these characteristics are expected to be present in a particular study (based on prior knowledge or intuition), an appropriate distance or dissimilarity can be selected. If the study is exploratory and strong protection of type 1 error is not needed, several distance metrics can be explored. Depending on which one(s) are highly significant, some information can be gained about the nature of any association between the microbiota and the outcome.

We provide an omnibus test that takes into account multiple kernels simultaneously. The method is robust in the sense that it has substantial power gain compared to when an improper kernel is used, and has little power loss compared to when the best kernel is used.

```
Ks = list(K.weighted = K.weighted, K.unweighted = K.unweighted, K.BC = K.BC)
MiRKAT(y = Smoker, Ks = Ks, X = covar, out_type = "D",
method = "davies", returnKRV = TRUE, returnR2 = TRUE)
```

```
## $p_values
## K.weighted K.unweighted K.BC
## 0.005306786 0.003801015 0.002287489
##
## $omnibus_p
## [1] 0.0075
##
## $KRV
## K.weighted K.unweighted K.BC
## 0.0028843118 0.0010877663 0.0006235506
##
## $R2
## K.weighted K.unweighted K.BC
## 0.01757917 0.01981955 0.02403491
```

The only difference between using a continuous and dichotomous outcome variable is the value of the argument “out_type”, as shown below.

```
y <- rnorm(nrow(K.weighted))
MiRKAT(y = y, Ks = Ks, X = covar, out_type = "C", nperm = 999,
method = "davies", returnKRV = TRUE, returnR2 = TRUE)
```

```
## $indivP
## [1] 0.4401194 0.3623498 0.2494742
##
## $omnibus_p
## [1] 0.401
##
## $KRV
## K.weighted K.unweighted K.BC
## 0.0008987168 0.0006520430 0.0002601146
##
## $R2
## K.weighted K.unweighted K.BC
## 0.001706709 0.007121559 0.004182427
```

This function outputs p-values for association using each single kernel and an omnibus p-value considering all kernels. The omnibus p-value is obtained through residual permutation where the minimum p-values from each of the individual tests are used as test statistics. As before, the “method” option only indicates the method that is used for generating individual kernel p-values.

We again use Charlson’s throat microbiome data to demonstrate the use of MiRKAT-S. Data loading and preparation are the same as in the previous section. Because the original dataset has a binary phenotype (smoking) rather than a measure of censored time to event outcomes, we consider smoking status and gender as covariates and generate null outcome data from the Exponential distribution. Specifically, we generate survival times as \(S \sim \text{Exponential}(1 + I(\text{smoke}) + I(\text{male}))\), and censoring times as \(C \sim \text{Exponential}(0.75)\). Then the observed outcome measures are observation time \(T = \text{min}(S, C)\) and an indicator variable for whether the event was observed, \(\Delta = I(S \leq C)\). That is, when delta = 1, the corresponding “obstime” is the survival time, and when delta = 0, the corresponding observation is censored and “obstime” is the time of censoring. This simulation procedure results in approximately 33% censoring.

```
# Simulate outcomes
# Here, outcome is associated with covariates but unassociated with microbiota
# Approximately 33% censoring
SurvTime <- rexp(60, (1 + Smoker + Male))
CensTime <- rexp(60, 0.75)
Delta <- as.numeric(SurvTime <= CensTime )
ObsTime <- pmin(SurvTime, CensTime)
```

The p-value for the test may be generated using permutation or Davies’ exact method. Davies’ exact method, which computes the p-value based on a mixture of chi-square distributions, is used when “perm = F”. We use a small-sample correction to account for the modest sample sizes and sparse OTU count matrices that often result from studies of the microbiome.

```
# Davies' exact method
MiRKATS(obstime = ObsTime, delta = Delta, X = cbind(Smoker, Male, anti), Ks = Ks,
perm = F, returnKRV = T, returnR2 = T)
```

```
## $p_values
## K.weighted K.unweighted K.BC
## 0.07778749 0.47512860 0.08902296
##
## $omnibus_p
## [1] 0.1361361
##
## $KRV
## K.weighted K.unweighted K.BC
## 0.0016499209 0.0008857490 0.0003264481
##
## $R2
## K.weighted K.unweighted K.BC
## 0.005752281 0.013141469 0.006587602
```

Using “perm = T” indicates that a permutation p-value should be calculated. Overall, permutation is recommended when the sample size is small, as Davies’ method may be slightly anti-conservative with very small sample sizes. MiRKAT-S will generate a warning when permutation is not used for sample sizes \(n \leq 50\). “nperm” indicates the number of permutations to perform to generate the p-value (default = 1000)

```
# Permutation
MiRKATS(obstime = ObsTime, delta = Delta, X = cbind(Smoker, Male, anti), Ks = Ks,
perm = T, returnKRV = T, returnR2 = T)
```

```
## $p_values
## K.weighted K.unweighted K.BC
## 0.07707708 0.45145145 0.09409409
##
## $omnibus_p
## [1] 0.1341341
##
## $KRV
## K.weighted K.unweighted K.BC
## 0.0016499209 0.0008857490 0.0003264481
##
## $R2
## K.weighted K.unweighted K.BC
## 0.005752281 0.013141469 0.006587602
```

The omnibus p-value is similar to that of MiRKAT, in which residual permutation is used to create a null distribution of p-values, and the minimum p-value from the original MiRKAT-S analysis is tested against this distribution as the test statistic of the omnibus test. The omnibus test is described further at https://github.com/hk1785/OMiSA (Koh 2018, DOI: https://doi.org/10.1186/s12864-018-4599-8).

MMiRKAT is designed to test the association between the microbiome and a low-dimensional multivariate continuous outcome. For structured or high dimensional outcomes, the kernel RV test (KRV) is recommended instead, as demonstrated in the next section.

We again use Charlson’s throat microbiome data to demonstrate the use of MMiRKAT. We generate multivariate outcomes \(Y \sim N_{3n}(0, I_{3n})\).

MMiRKAT uses a small-sample correction procedure for p-value calculation. There is no omnibus test in MMiRKAT; only kernel-specific p-values and, optionally, the KRV and R-squared statistics are returned.

```
set.seed(1)
n <- nrow(throat.otu.tab)
Y <- matrix(rnorm(n*3, 0, 1), n, 3)
MMiRKAT(Y = Y, X = cbind(Male, anti), Ks = Ks,
returnKRV = TRUE, returnR2 = TRUE)
```

```
## $p_values
## K.weighted K.unweighted K.BC
## 0.1073553 0.2126433 0.2922759
##
## $KRV
## K.weighted K.unweighted K.BC
## 0.0012759596 0.0007062962 0.0002528821
##
## $R2
## K.weighted K.unweighted K.BC
## 0.01036837 0.02518368 0.01191402
```

KRV (kernel RV coefficient) is designed to evaluate the association between microbiome composition and a structured, potentially high-dimensional phenotype, such as gene expression of a set of genes which are functionally related. The KRV statistic can capture nonlinear associations and complex relationships within the individual data types and between the complex multivariate phenotype and microbiome composition by measuring general dependency.

Two kernels are involved in KRV statistics: one kernel for microbiome composition, which can be obtained by transforming the distance metrics as in the previous sections, and one kernel capturing similarities in the phenotypes. As an alternative, KRV can also take as input a kernel matrix for microbiome composition, a multivariate phenotype y, and a set of additional covariates X. In this case, a multivariate regression is carried out, and then the residuals of this regression are subsequently used to construct a Gaussian or linear kernel matrix.

```
## $p_values
## K.weighted K.unweighted K.BC
## 0.6231597 0.4956971 0.6321714
##
## $omnibus_p
## [1] 0.6134599
##
## $KRV
## K.weighted K.unweighted K.BC
## 1.713514e-04 1.226831e-04 4.210276e-05
##
## $R2
## K.weighted K.unweighted K.BC
## 0.03958771 0.16086562 0.06991840
```

```
## $p_values
## K.weighted K.unweighted K.BC
## 0.6228336 0.5955235 0.6499288
##
## $omnibus_p
## [1] 0.6423704
##
## $KRV
## NULL
##
## $R2
## NULL
```

An omnibus test has also been implemented for the KRV function. This omnibus test is different from the other omnibus tests in the package, as it creates an “omnibus kernel matrix” and then performs the KRV analysis on that omnibus kernel. The omnibus kernel matrix is constructed via the second formulation in Zhan et al (2017), an unsupervised weighted combination of kernels that the authors found to have the best performance across a range of simulation settings. Specifically, the omnibus kernel is defined as \[K_{om} = \sum_{i=1}^m \frac{K_i}{tr(K_i)}.\] The p-value from this single omnibus kernel is returned as the omnibus p-value for KRV.

MiRKAT is sensitive to strong skewness or outliers in continuous outcome variables, but many biological features have non-Gaussian distributions. To accommodate such features, rather than using OLS, robust MiRKAT uses a robust regression method. Specifically, a kernel matrix for the outcome is created using a linear kernel based on the residuals from robust linear regression (MiRKAT-R) or quantile regression (MiRKAT-Q) of the outcome on non-microbiome covariates, then related to the microbiome kernel using the KRV test. Note that if there are no additional covariates, MiRKAT-R and MiRKAT-Q simplify to MiRKAT.

We additionally note that quantile regression has lower power than robust regression in every scenario considered, and hence recommend use of MiRKAT-R in most circumstances. Both functions call the KRV function once an outcome kernel matrix has been created using quantile or robust regression, so the omnibus tests for MiRKAT.Q and MiRKAT.R are the same as for KRV.

```
## $p_values
## K.weighted K.unweighted K.BC
## 0.5502230 0.9116325 0.3288858
##
## $omnibus_p
## [1] 0.5506275
##
## $KRV
## NULL
##
## $R2
## NULL
```