Author: Ville-Petteri Mäkinen
Version: 1.0.7

Abstract: Allspice is a lightweight machine learning tool that was initially developed for RNA profiling of acute lymphoplastic leukemia, but it can be used for any problem where multiple classes need to be identified from multi-dimensional data. The classifier contains optimized mean profiles for the classes (centroids) as observed in the training data, and new samples are matched to these centroids using the shortest Euclidean distance. Allspice produces both numerical and visual output of the classification results and flags samples with mixed features from multiple classes or atypical values.

Citation: Ville-Petteri Mäkinen, Jacqueline Rehn, James Breen, David Yeung, Deborah L White (2022) Multi-cohort transcriptomic subtyping of B-cell acute lymphoblastic leukemia, Int J Mol Sci 2022, 23(9), 4574, https://doi.org/10.3390/ijms23094574

Resources: Additional assets were created from tissue-specific RNA-seq data from the Genotype-Tissue Expression Project v8 [Lonsdale J et al. Nat Genet 45, 580–585 (2013) https://doi.org/10.1038/ng.2653], from lymphoblastoid cell lines [Lappalainen et al. Nature 2013 September 26; 501(7468): 506–511. doi:10.1038/nature12531], from pediatric T-cell leukemia samples [Drobna-Śledzińska et al. ArrayExpress E-MTAB-11759, accessed 2022], and from 11 EPT and 134 T-cell ALL samples (South Australian Health and Medical Research Institute, 2022, unpublished).

image/svg+xml UnadjustedRNA data AdjustedRNA data UnadjustedRNA data AdjustedRNA data mRNA-seq Subtypes Drivers Tissues

1 Getting started

You can install Allspice using the standard procedure:

# Install the package from a remote repository.
install.packages("Allspice")

To activate and list the library functions, type

# Activate the library.
library("Allspice")
packageVersion("Allspice")
## [1] '1.0.7'
ls("package:Allspice")
##  [1] "assemble<-"      "asset"           "bcellALL"        "classifier"     
##  [5] "classify"        "configuration"   "configuration<-" "covariates<-"   
##  [9] "export"          "information"     "nomenclature<-"  "normalize"      
## [13] "predictions"     "profiles<-"      "report"          "scores"         
## [17] "standardize"     "visuals<-"

Each Allspice function comes with a help page that contains a code example, such as

# Access function documentation (not shown in vignette).
? Allspice::classifier

You can run all the code examples in one go by typing

# Run all code examples (not shown in vignette).
fn <- system.file("examples.R", package = "Allspice")
source(fn)

2 Dataset

To run this tutorial, we will use simulated data that mimics B-cell acute lymphoblast leukemia samples:

# Generate gene RNA read counts.
simu <- bcellALL(300)
print(simu$counts[1:5,1:6])
##        patient1 patient2 patient3 patient4 patient5 patient6
## NFYA       3198     3214     2125     2137     3250     3676
## STPG1        47       48      299       55       29       98
## ENPP4       586       38       75      196     1863       55
## SEMA3F      527       87      198       69     1707      132
## LAP3        863     1838     3187     2192     1149     2312
print(simu$metadata[1:6,])
##          MALE AGE    SUBTYPE
## patient1    0  12 ETV6.RUNX1
## patient2    0  27         Ph
## patient3    1  27      KMT2A
## patient4    1  34  PAX5.p80r
## patient5    1  29 ETV6.RUNX1
## patient6    1  50     ZNF384

The output contains the unadjusted read counts as would be produced by a standard RNA-seq pipeline and patients’ sex and age are also simulated. Please note that these data are artificial and meant only for demonstration purposes; they do not reflect the complex biology of ALL accurately.

The read counts must be stored in a matrix that has the samples organized into columns and the genes are listed as rows. For the meta-data, patients can be listed in rows as that is more convenient in most situations.

3 Classify samples

3.1 Predict labels

The first step is to create a new classifier:

# Set up a classifier for genetic B-cell ALL subtypes.
cls <- classifier()
## 
## ALL subtype
##     configuration -> 164 bytes
##     categories -> 984 bytes
##     reference -> 335,109 bytes
##     covariates -> 103 bytes
##     nomenclature -> 325,618 bytes
##     centroids -> 15,212 bytes
##     coefficients -> 3,150 bytes
## 
## Driver genes
##     configuration -> 165 bytes
##     categories -> 1,461 bytes
##     reference -> 333,812 bytes
##     covariates -> 102 bytes
##     nomenclature -> 324,271 bytes
##     centroids -> 22,764 bytes
##     coefficients -> 4,796 bytes
## 
## Source tissue
##     configuration -> 166 bytes
##     categories -> 2,619 bytes
##     reference -> 251,540 bytes
##     covariates -> 105 bytes
##     nomenclature -> 242,601 bytes
##     centroids -> 42,622 bytes
##     coefficients -> 9,144 bytes

By default, the command creates an object that contains three models: i) ALL subtype prediction, ii) driver gene prediction and iii) tissue profiling to make sure the RNA data are suitable for analyses.

The next step is to import the covariates (e.g. sex and age) into the classifier. If there are no covariate data available, this step can be skipped. The default classifier uses MALE and AGE as column headings, see the format of the metadata in the first section of the vignette. The names of covariates are also available as

# List covariates.
info <- information(cls)
print(info$covariates[,c("ASSET","TITLE","COVAR")])
##   ASSET         TITLE COVAR
## 1     1   ALL subtype  MALE
## 2     1   ALL subtype   AGE
## 3     2  Driver genes  MALE
## 4     2  Driver genes   AGE
## 5     3 Source tissue  MALE
## 6     3 Source tissue   AGE

To set the covariates, use the command

# Set covariates.
covariates(cls) <- simu$metadata

Missing values or incomplete data are allowed, however, a warning will be produced when the subtypes are predicted. Also, it is important to set the covariates before loading the RNA read counts.

Once the covariates have been set, the RNA-seq read counts are next:

# Load RNA-seq profiles.
profiles(cls) <- simu$counts

The above command pre-processes the data and predicts the subtypes based on the combined information from covariates and read counts. To get the results, use the command

# Prediction results.
pred <- predictions(cls)
primary <- pred[[1]]
print(primary[1:6,c("LABEL","FREQ","PROX","EXCL")])
##                   LABEL  FREQ  PROX  EXCL
## patient1    ETV6::RUNX1 0.977 0.997 0.998
## patient2 Ph (BCR::ABL1) 0.788 0.951 0.824
## patient3          KMT2A 0.996 1.000 0.997
## patient4      PAX5 P80R 0.841 0.984 0.965
## patient5    ETV6::RUNX1 0.993 0.999 1.000
## patient6         ZNF384 0.992 1.000 1.000

The output is a list of data frames, where each element contains the results generated by a classification asset (see the section on how to create your own classifier for more information on assets). Here, we are interested in the ALL subtype which is the primary asset in the classifier.

Please use the tabs at the top to navigate to the next page.

3.2 Metrics

Here are selected results from the previous page again:

# Prediction results.
ambig <- which(primary$CATEG == "Ambiguous")
uncla <- which(primary$CATEG == "Unclassified")
rows <- unique(c(1:5, ambig[1], uncla[1]))
print(primary[rows,c("LABEL","MATCH","FREQ","PROX","EXCL")])
##                    LABEL      MATCH     FREQ  PROX  EXCL
## patient1     ETV6::RUNX1 ETV6.RUNX1 9.77e-01 0.997 0.998
## patient2  Ph (BCR::ABL1)         Ph 7.88e-01 0.951 0.824
## patient3           KMT2A      KMT2A 9.96e-01 1.000 0.997
## patient4       PAX5 P80R  PAX5.p80r 8.41e-01 0.984 0.965
## patient5     ETV6::RUNX1 ETV6.RUNX1 9.93e-01 0.999 1.000
## patient17      Ambiguous   PAX5.alt 6.20e-01 0.940 0.463
## patient41   Unclassified        HLF 2.27e-05 0.117 0.220

The data frame includes additional information besides the most plausible ALL subtype LABEL. The ALL subtype that most resembles the patient’s profile is defined as the best-matching subtype MATCH. For most cases, LABEL will refer to the same subtype as MATCH, but failure to pass quality checks and the rarity of a subtype may sometimes lead to conflicting results.

The next column named FREQ tells the frequency of the predicted subtype given the observed RNA profile and covariates. For example, if FREQ is 0.9, then you would expect to see 9 out of 10 patients that share this data profile to have that predicted ALL subtype. This metric depends on the training set that was used when the classification asset was created. Thus rare subtypes are likely to get lower frequency estimates even if the RNA profile would be a close match.

The frequency metric does not capture mutual exclusivity because it is evaluated independently for each subtype. For example, Allspice allows patients to have more than one subtype in parallel when the statistical model is trained. In this scenario, parallel subtypes may all get high frequencies, of which the highest would be chosen as the predicted subtype. To identify samples with mixed transcriptional characteristics, the results also includes a probabilistic estimate for exclusivity in the column EXCL that tells how likely it is that a sample comes from a single ALL subtype.

PROX (proximity) and EXCL (exclusivity) are used as quality metrics. By default, Allspice labels any samples with <50% proximity as ‘Unclassified’. Of those that pass the probability threshold, samples with <50% exclusivity are labelled as ‘Ambiguous’. The proximity measure is conceptually the same as subtype frequency, however, any differences in subtype prevalence are adjusted away.

You may have noticed that the subtype labels and the matched subtype names are slightly different. This is because text labels with non-alphanumeric characters may be automatically changed by R when used as row or column names in data structures, thus the subtype identities are made safe compared to the human readable labels. You can access subtype labels by typing:

# Access subtype labelling information.
info <- information(cls)
print(info$categories[1:5,c("ASSET","TITLE","CATEG","LABEL")])
##   ASSET       TITLE      CATEG       LABEL
## 1     1 ALL subtype   BCL2_MYC    BCL2/MYC
## 2     1 ALL subtype  CDX2.high CDX2 hi-exp
## 3     1 ALL subtype      CRLF2       CRLF2
## 4     1 ALL subtype       DUX4        DUX4
## 5     1 ALL subtype ETV6.RUNX1 ETV6::RUNX1

See also the Iris data example in the section on creating new classifiers. It contains instructions on how to set the labels for a classification asset.

Please use the tabs at the top to navigate to the next page.

3.3 Visual report

Detailed classification results can be visualised on a case-by-case basis. In the first example, we first identify a patient sample that passed quality control and did not show mixed characteristics:

# Select successful classification.
rows <- which((primary$CATEG != "Ambiguous") & (primary$FREQ > 0.9))
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
##                LABEL      MATCH  FREQ  PROX  EXCL
## patient1 ETV6::RUNX1 ETV6.RUNX1 0.977 0.997 0.998
# Show patient report.
report(cls, name = rows[1], file = NULL)