Author: Ville-Petteri Mäkinen
Version: 1.0.7
Abstract: Allspice is a lightweight machine learning tool that was initially developed for RNA profiling of acute lymphoplastic leukemia, but it can be used for any problem where multiple classes need to be identified from multi-dimensional data. The classifier contains optimized mean profiles for the classes (centroids) as observed in the training data, and new samples are matched to these centroids using the shortest Euclidean distance. Allspice produces both numerical and visual output of the classification results and flags samples with mixed features from multiple classes or atypical values.
Citation: Ville-Petteri Mäkinen, Jacqueline Rehn, James Breen, David Yeung, Deborah L White (2022) Multi-cohort transcriptomic subtyping of B-cell acute lymphoblastic leukemia, Int J Mol Sci 2022, 23(9), 4574, https://doi.org/10.3390/ijms23094574
Resources: Additional assets were created from tissue-specific RNA-seq data from the Genotype-Tissue Expression Project v8 [Lonsdale J et al. Nat Genet 45, 580–585 (2013) https://doi.org/10.1038/ng.2653], from lymphoblastoid cell lines [Lappalainen et al. Nature 2013 September 26; 501(7468): 506–511. doi:10.1038/nature12531], from pediatric T-cell leukemia samples [Drobna-Śledzińska et al. ArrayExpress E-MTAB-11759, accessed 2022], and from 11 EPT and 134 T-cell ALL samples (South Australian Health and Medical Research Institute, 2022, unpublished).
You can install Allspice using the standard procedure:
# Install the package from a remote repository.
install.packages("Allspice")
To activate and list the library functions, type
# Activate the library.
library("Allspice")
packageVersion("Allspice")
## [1] '1.0.7'
ls("package:Allspice")
## [1] "assemble<-" "asset" "bcellALL" "classifier"
## [5] "classify" "configuration" "configuration<-" "covariates<-"
## [9] "export" "information" "nomenclature<-" "normalize"
## [13] "predictions" "profiles<-" "report" "scores"
## [17] "standardize" "visuals<-"
Each Allspice function comes with a help page that contains a code example, such as
# Access function documentation (not shown in vignette).
? Allspice::classifier
You can run all the code examples in one go by typing
# Run all code examples (not shown in vignette).
fn <- system.file("examples.R", package = "Allspice")
source(fn)
To run this tutorial, we will use simulated data that mimics B-cell acute lymphoblast leukemia samples:
# Generate gene RNA read counts.
simu <- bcellALL(300)
print(simu$counts[1:5,1:6])
## patient1 patient2 patient3 patient4 patient5 patient6
## NFYA 3198 3214 2125 2137 3250 3676
## STPG1 47 48 299 55 29 98
## ENPP4 586 38 75 196 1863 55
## SEMA3F 527 87 198 69 1707 132
## LAP3 863 1838 3187 2192 1149 2312
print(simu$metadata[1:6,])
## MALE AGE SUBTYPE
## patient1 0 12 ETV6.RUNX1
## patient2 0 27 Ph
## patient3 1 27 KMT2A
## patient4 1 34 PAX5.p80r
## patient5 1 29 ETV6.RUNX1
## patient6 1 50 ZNF384
The output contains the unadjusted read counts as would be produced by a standard RNA-seq pipeline and patients’ sex and age are also simulated. Please note that these data are artificial and meant only for demonstration purposes; they do not reflect the complex biology of ALL accurately.
The read counts must be stored in a matrix that has the samples organized into columns and the genes are listed as rows. For the meta-data, patients can be listed in rows as that is more convenient in most situations.
The first step is to create a new classifier:
# Set up a classifier for genetic B-cell ALL subtypes.
cls <- classifier()
##
## ALL subtype
## configuration -> 164 bytes
## categories -> 984 bytes
## reference -> 335,109 bytes
## covariates -> 103 bytes
## nomenclature -> 325,618 bytes
## centroids -> 15,212 bytes
## coefficients -> 3,150 bytes
##
## Driver genes
## configuration -> 165 bytes
## categories -> 1,461 bytes
## reference -> 333,812 bytes
## covariates -> 102 bytes
## nomenclature -> 324,271 bytes
## centroids -> 22,764 bytes
## coefficients -> 4,796 bytes
##
## Source tissue
## configuration -> 166 bytes
## categories -> 2,619 bytes
## reference -> 251,540 bytes
## covariates -> 105 bytes
## nomenclature -> 242,601 bytes
## centroids -> 42,622 bytes
## coefficients -> 9,144 bytes
By default, the command creates an object that contains three models: i) ALL subtype prediction, ii) driver gene prediction and iii) tissue profiling to make sure the RNA data are suitable for analyses.
The next step is to import the covariates (e.g. sex and age) into the classifier. If there are no covariate data available, this step can be skipped. The default classifier uses MALE and AGE as column headings, see the format of the metadata in the first section of the vignette. The names of covariates are also available as
# List covariates.
info <- information(cls)
print(info$covariates[,c("ASSET","TITLE","COVAR")])
## ASSET TITLE COVAR
## 1 1 ALL subtype MALE
## 2 1 ALL subtype AGE
## 3 2 Driver genes MALE
## 4 2 Driver genes AGE
## 5 3 Source tissue MALE
## 6 3 Source tissue AGE
To set the covariates, use the command
# Set covariates.
covariates(cls) <- simu$metadata
Missing values or incomplete data are allowed, however, a warning will be produced when the subtypes are predicted. Also, it is important to set the covariates before loading the RNA read counts.
Once the covariates have been set, the RNA-seq read counts are next:
# Load RNA-seq profiles.
profiles(cls) <- simu$counts
The above command pre-processes the data and predicts the subtypes based on the combined information from covariates and read counts. To get the results, use the command
# Prediction results.
pred <- predictions(cls)
primary <- pred[[1]]
print(primary[1:6,c("LABEL","FREQ","PROX","EXCL")])
## LABEL FREQ PROX EXCL
## patient1 ETV6::RUNX1 0.977 0.997 0.998
## patient2 Ph (BCR::ABL1) 0.788 0.951 0.824
## patient3 KMT2A 0.996 1.000 0.997
## patient4 PAX5 P80R 0.841 0.984 0.965
## patient5 ETV6::RUNX1 0.993 0.999 1.000
## patient6 ZNF384 0.992 1.000 1.000
The output is a list of data frames, where each element contains the results generated by a classification asset (see the section on how to create your own classifier for more information on assets). Here, we are interested in the ALL subtype which is the primary asset in the classifier.
Please use the tabs at the top to navigate to the next page.
Here are selected results from the previous page again:
# Prediction results.
ambig <- which(primary$CATEG == "Ambiguous")
uncla <- which(primary$CATEG == "Unclassified")
rows <- unique(c(1:5, ambig[1], uncla[1]))
print(primary[rows,c("LABEL","MATCH","FREQ","PROX","EXCL")])
## LABEL MATCH FREQ PROX EXCL
## patient1 ETV6::RUNX1 ETV6.RUNX1 9.77e-01 0.997 0.998
## patient2 Ph (BCR::ABL1) Ph 7.88e-01 0.951 0.824
## patient3 KMT2A KMT2A 9.96e-01 1.000 0.997
## patient4 PAX5 P80R PAX5.p80r 8.41e-01 0.984 0.965
## patient5 ETV6::RUNX1 ETV6.RUNX1 9.93e-01 0.999 1.000
## patient17 Ambiguous PAX5.alt 6.20e-01 0.940 0.463
## patient41 Unclassified HLF 2.27e-05 0.117 0.220
The data frame includes additional information besides the most plausible ALL subtype LABEL. The ALL subtype that most resembles the patient’s profile is defined as the best-matching subtype MATCH. For most cases, LABEL will refer to the same subtype as MATCH, but failure to pass quality checks and the rarity of a subtype may sometimes lead to conflicting results.
The next column named FREQ tells the frequency of the predicted subtype given the observed RNA profile and covariates. For example, if FREQ is 0.9, then you would expect to see 9 out of 10 patients that share this data profile to have that predicted ALL subtype. This metric depends on the training set that was used when the classification asset was created. Thus rare subtypes are likely to get lower frequency estimates even if the RNA profile would be a close match.
The frequency metric does not capture mutual exclusivity because it is evaluated independently for each subtype. For example, Allspice allows patients to have more than one subtype in parallel when the statistical model is trained. In this scenario, parallel subtypes may all get high frequencies, of which the highest would be chosen as the predicted subtype. To identify samples with mixed transcriptional characteristics, the results also includes a probabilistic estimate for exclusivity in the column EXCL that tells how likely it is that a sample comes from a single ALL subtype.
PROX (proximity) and EXCL (exclusivity) are used as quality metrics. By default, Allspice labels any samples with <50% proximity as ‘Unclassified’. Of those that pass the probability threshold, samples with <50% exclusivity are labelled as ‘Ambiguous’. The proximity measure is conceptually the same as subtype frequency, however, any differences in subtype prevalence are adjusted away.
You may have noticed that the subtype labels and the matched subtype names are slightly different. This is because text labels with non-alphanumeric characters may be automatically changed by R when used as row or column names in data structures, thus the subtype identities are made safe compared to the human readable labels. You can access subtype labels by typing:
# Access subtype labelling information.
info <- information(cls)
print(info$categories[1:5,c("ASSET","TITLE","CATEG","LABEL")])
## ASSET TITLE CATEG LABEL
## 1 1 ALL subtype BCL2_MYC BCL2/MYC
## 2 1 ALL subtype CDX2.high CDX2 hi-exp
## 3 1 ALL subtype CRLF2 CRLF2
## 4 1 ALL subtype DUX4 DUX4
## 5 1 ALL subtype ETV6.RUNX1 ETV6::RUNX1
See also the Iris data example in the section on creating new classifiers. It contains instructions on how to set the labels for a classification asset.
Please use the tabs at the top to navigate to the next page.
Detailed classification results can be visualised on a case-by-case basis. In the first example, we first identify a patient sample that passed quality control and did not show mixed characteristics:
# Select successful classification.
rows <- which((primary$CATEG != "Ambiguous") & (primary$FREQ > 0.9))
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
## LABEL MATCH FREQ PROX EXCL
## patient1 ETV6::RUNX1 ETV6.RUNX1 0.977 0.997 0.998
# Show patient report.
report(cls, name = rows[1], file = NULL)