calcRDI - Calculate repertoire distances

## Description¶

Calculate repertoire distances from a matrix of vdjCounts

## Usage¶

``````calcRDI(vdjCounts, distMethod = c("euclidean", "cor"), subsample = TRUE,
nIter = 100, constScale = TRUE, units = c("lfc", "pct"), ...)
``````

## Arguments¶

vdjCounts
a matrix of repertoire counts, as created by calcVDJCounts
distMethod
one of c(“euclidean”,”cor”) determining how to calculate the distance from the matrix of vdj counts. See Details.
subsample
logical; if true, all repertoires are subsampled to be equal size.
nIter
value defining how many times the subsampling should be repeated. Only used if subsample is TRUE.
constScale
logical; if `TRUE`, vdjCounts will be scaled such that the sum of each column will be equal to 500 counts (an arbitrary constant). Otherwise, the columns will be scaled to the average count of all the columns.
units
One of “lfc” or “pct”. This determines the method used for transforming the repertoire counts. See Details.
additional parameters; these are ignored by this function.

## Value¶

A dissimilarity structure containing distances between repertoires, averaged across each subsampe run. In addition to the standard attributes in a dist object, three additional attributes are defined as follows:

 ngenes integers, the number of genes in each column of “genes” that were included in at least one repertoire. nseq integer, the number of sequences used after subsampling the repertoires. If `subsample=FALSE`, this is not defined.

## Details¶

There are two options for distance methods, “euclidean” and “cor”. Euclidean refers to standard euclidean distance, and is the standard for the RDI measure described in (Bolen et al. Bioinformatics 2016). In contrast, cor refers to a correlation-based distance metric, where the distance is defined as `(1-correlation)` between each column of vdjCounts.

The `units` parameter is used to determine the transformation function for the repertoire counts. If `units='lfc'` (default), then the arcsinh transformation is applied to the count matrix, resulting in a distance metric which will scale with the average log fold change of each gene. In contrast, `units='pct'` will result in no transformation of the count matrix, and distances will be proportional to the average percent change of each gene, instead. Note that “units” is a bit of a misnomer, as the distance metric doesn’t actually represent the true log-fold or percent change in the repertoires. In order to actually estimate these parameters, refer to the rdiModel and convertRDI functions.

## Examples¶

``````#create genes
genes = sample(letters, 10000, replace=TRUE)

#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
cellType = sample(c("B","T"), 10000, replace=TRUE)
)
##generate repertoire counts
cts = calcVDJcounts(genes,seqAnnot)

##calculate RDI
d = calcRDI(cts)

##calculate RDI in percent space
d_pct = calcRDI(cts,units="pct")

##convert RDI to actual 'lfc' estimates and compare
dtrue = convertRDI(d)\$pred
plot(d, dtrue)
``````