calcRDI - Calculate repertoire distances
Calculate repertoire distances from a matrix of vdjCounts
calcRDI(vdjCounts, distMethod = c("euclidean", "cor"), subsample = TRUE, nIter = 100, constScale = TRUE, units = c("lfc", "pct"), ...)
- a matrix of repertoire counts, as created by calcVDJCounts
- one of c(“euclidean”,”cor”) determining how to calculate the distance from the matrix of vdj counts. See Details.
- logical; if true, all repertoires are subsampled to be equal size.
- value defining how many times the subsampling should be repeated. Only used if subsample is TRUE.
- logical; if
TRUE, vdjCounts will be scaled such that the sum of each column will be equal to 500 counts (an arbitrary constant). Otherwise, the columns will be scaled to the average count of all the columns.
- One of “lfc” or “pct”. This determines the method used for transforming the repertoire counts. See Details.
- additional parameters; these are ignored by this function.
A dissimilarity structure containing distances between repertoires, averaged across each subsampe run. In addition to the standard attributes in a dist object, three additional attributes are defined as follows:
|integers, the number of genes in each column of “genes” that were included in at least one repertoire.|
| integer, the number of sequences used after subsampling
the repertoires. If
There are two options for distance methods, “euclidean” and “cor”. Euclidean refers to
standard euclidean distance, and is the standard for the RDI measure described in
(Bolen et al. Bioinformatics 2016). In contrast, cor refers to a correlation-based
distance metric, where the distance is defined as
(1-correlation) between each
column of vdjCounts.
units parameter is used to determine the transformation function for the
repertoire counts. If
units='lfc' (default), then the arcsinh transformation
is applied to the count matrix, resulting in a distance metric which
will scale with the average log fold change of each gene. In contrast,
units='pct' will result in no transformation of the count matrix, and distances
will be proportional to the average percent change of each gene, instead. Note that
“units” is a bit of a misnomer, as the distance metric doesn’t actually represent the
true log-fold or percent change in the repertoires. In order to actually estimate
these parameters, refer to the rdiModel and convertRDI
#create genes genes = sample(letters, 10000, replace=TRUE) #create sequence annotations seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE), cellType = sample(c("B","T"), 10000, replace=TRUE) ) ##generate repertoire counts cts = calcVDJcounts(genes,seqAnnot) ##calculate RDI d = calcRDI(cts) ##calculate RDI in percent space d_pct = calcRDI(cts,units="pct") ##convert RDI to actual 'lfc' estimates and compare dtrue = convertRDI(d)$pred plot(d, dtrue)