calcRDI - Calculate repertoire distances

Description

Calculate repertoire distances from a matrix of vdjCounts

Usage

calcRDI(vdjCounts, distMethod = c("euclidean", "cor"), subsample = TRUE,
nIter = 100, constScale = TRUE, units = c("lfc", "pct"), ...)

Arguments

vdjCounts
a matrix of repertoire counts, as created by calcVDJcounts
distMethod
one of c("euclidean", "cor"), determining how the distance is calculated from the matrix of VDJ counts. See Details.
subsample
logical; if TRUE, all repertoires are subsampled to be of equal size.
nIter
integer; the number of times the subsampling is repeated. Only used if subsample is TRUE.
constScale
logical; if TRUE, vdjCounts will be scaled such that the sum of each column will be equal to 500 counts (an arbitrary constant). Otherwise, the columns will be scaled to the average count of all the columns.
units
one of "lfc" or "pct". This determines the method used for transforming the repertoire counts. See Details.
...
additional parameters; these are ignored by this function.
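
The effect of subsample and constScale can be pictured with a small base-R sketch. This is not the calcRDI implementation: the toy matrix m, the choice to subsample down to the smallest repertoire, and the helper code are all assumptions made for illustration. The function itself repeats the subsampling nIter times and averages the resulting distances.

```r
set.seed(1)

# Toy count matrix (made-up values): rows are genes, columns are repertoires.
# matrix() fills column-wise, so each line of c() below is one repertoire.
m <- matrix(c(50, 30, 20,
              400, 400, 200),
            nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("small", "large")))

# Assumption: "equal size" means the size of the smallest repertoire
n_min <- min(colSums(m))

# Subsample each repertoire to n_min sequences without replacement
sub <- apply(m, 2, function(counts) {
  # expand the counts into one gene index per sequence, then draw n_min of them
  drawn <- sample(rep(seq_along(counts), counts), n_min)
  tabulate(drawn, nbins = length(counts))
})
rownames(sub) <- rownames(m)

# Scale every column to sum to 500, mirroring constScale = TRUE
scaled <- sweep(sub, 2, colSums(sub), "/") * 500
```

After this, every column of sub holds n_min sequences and every column of scaled sums to 500, so the repertoires are comparable regardless of their original depth.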

Value

A dissimilarity structure containing distances between repertoires, averaged across all subsample runs. In addition to the standard attributes in a dist object, three additional attributes are defined as follows:

ngenes
integers, the number of genes in each column of "genes" that were included in at least one repertoire.
nseq
integer, the number of sequences used after subsampling the repertoires. If subsample=FALSE, this is not defined.

Details

There are two options for distance methods, "euclidean" and "cor". "euclidean" refers to standard Euclidean distance and is the standard for the RDI measure described in Bolen et al. (Bioinformatics, 2016). In contrast, "cor" refers to a correlation-based distance metric, where the distance is defined as (1 - correlation) between each pair of columns of vdjCounts.
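
As a rough illustration of the two distance methods, the following base-R sketch computes both on a small made-up count matrix. It is not the calcRDI implementation, which also subsamples, scales, and transforms the counts; the matrix m and its values are assumptions for the example, and Pearson correlation is assumed for "cor".

```r
# Toy count matrix (made-up values): rows are genes, columns are repertoires.
# matrix() fills column-wise, so each line of c() below is one repertoire.
m <- matrix(c(10, 20, 30,
              12, 18, 30,
              30, 20, 10),
            nrow = 3,
            dimnames = list(c("g1", "g2", "g3"), c("repA", "repB", "repC")))

# Euclidean distance between repertoires; dist() compares rows, hence t()
d_euc <- dist(t(m), method = "euclidean")

# Correlation-based distance: 1 - correlation between each pair of columns
d_cor <- as.dist(1 - cor(m))
```

Here repA and repC have perfectly opposite gene usage, so their correlation-based distance is 2, the largest value this metric can take.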

The units parameter determines the transformation function for the repertoire counts. If units='lfc' (default), the arcsinh transformation is applied to the count matrix, resulting in a distance metric that scales with the average log fold change of each gene. In contrast, units='pct' applies no transformation to the count matrix, and distances are instead proportional to the average percent change of each gene. Note that "units" is a bit of a misnomer, as the distance metric doesn't actually represent the true log-fold or percent change in the repertoires. To estimate these quantities, refer to the rdiModel and convertRDI functions.
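
To see why the arcsinh transformation gives distances that track fold change, consider the base-R sketch below. It is only an illustration on made-up counts (the matrix m is an assumption), not the calcRDI code path. For large x, asinh(x) is approximately log(2x), so a doubling of a large count shifts the transformed value by about log(2) regardless of the starting count.

```r
# Toy counts (made-up values): both genes double from repA to repB
m <- matrix(c(1, 100,
              2, 200),
            nrow = 2,
            dimnames = list(c("g1", "g2"), c("repA", "repB")))

# units = "lfc": arcsinh-transform the counts, then take Euclidean distance
d_lfc <- dist(t(asinh(m)))

# units = "pct": no transformation; the high-count gene dominates the distance
d_pct <- dist(t(m))

# Per-gene contribution in each space
diff_lfc <- asinh(m[, "repB"]) - asinh(m[, "repA"])
diff_pct <- m[, "repB"] - m[, "repA"]
```

In the transformed space both genes contribute differences of similar magnitude, whereas in the untransformed space the difference for g2 is 100 times that of g1.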

Examples

#create genes
genes = sample(letters, 10000, replace=TRUE)

#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
                      cellType = sample(c("B","T"), 10000, replace=TRUE)
)

##generate repertoire counts
cts = calcVDJcounts(genes,seqAnnot)

##calculate RDI 
d = calcRDI(cts)

##calculate RDI in percent space
d_pct = calcRDI(cts,units="pct")

##convert RDI to actual 'lfc' estimates and compare
dtrue = convertRDI(d)$pred
plot(d, dtrue)
