calcVDJcounts - Calculate repertoire counts


Create count matrices for a set of repertoires


calcVDJcounts(genes, seqAnnot, select = NULL, combine = NULL,
vdjDrop = NULL, splitBy = NULL, simplifyNames = TRUE,
splitCommas = FALSE, ...)


vector (or matrix) containing the gene calls for each sequence. If genes is a matrix, counts will be calculated for each column of ‘genes’, and the resulting count matrices will be concatenated. See Details.
matrix containing repertoire annotations.
a list containing definitions of which repertoires to use. See Details.
a list defining repertoires that should be grouped together. See Details.
a list specifying specific genes to exclude from analysis. See Details.
the columns in seqAnnot to use for splitting repertoires. Default is to use all columns in seqAnnot.
logical; if true, any columns of seqAnnot where all selected (and collapsed) sequences share the same value will not be used to make the names of sequenceCodes.
logical; if true, seqAnnot is assumed to contain. comma-separated lists of sequence annotations, which will be split up before generating sequence codes. Note: setting this to TRUE will make processing much slower.
additional parameters; these are ignored by this function.


A matrix where each row represents a gene, and each column represents a repertoire.


In most cases, genes will be a single vector or one-column matrix. However, there are some cases where a row of seqAnnot corresponds to two (or more) genes (e.g. the V and J gene segments of a single immune sequence). Rather than make multiple rows for each gene, the calcVDJcounts function provides the option to provide a multi-column matrix for genes. The counts for each column will be tallied separately, and are then concatenated.

To ensure equal variance across all repertoires, the default RDI metric uses subsampling to ensure that all repertoires have the same number of sequences. The default RDI metric subsamples all repertoires to the size of the smallest repertoire, which may result in a loss of power for comparisons between larger repertoires. In order to increase power for various tests, it is often useful to only calculate the repertoire counts for a subset of the repertoires in seqAnnot. This can be done by using the select and combine parameters to specify which repertoires to include in the analysis.

Both parameters are lists containing entries with the same name as one of the columns of seqAnnot. For select, each entry is a vector defining which values to include (e.g., to include only Visit 1 and 3, you might specify select=list(visit=c("V1","V3")), where the 'visit' column in seqAnnot contains the values "V1","V2", and "V3"). In this case, any rows of genes and seqAnnot that come from a repertoire not specified in select will be discarded. By default, if a select code is not specified for a column in seqAnnot, all values from that column will be included.

The combine parameter works in a similar fashion, but instead of a vector describing which parameters to include, you can specify a vector of regular expressions, and any values of the seqAnnot column that match the regular expression will be combined into a single repertoire (e.g. to combine visits 1 and 3 into a single repertoire, you might specify combine=list(visit="V[13]")).

The vdjDrop parameter is also useful for limiting sequences. Like select and combine, this is a named list, with entries corresponding to the columns of genes. Each entry of vdjDrop is a vector of gene segment names to remove from the analysis. All sequences containing those genes are removed from the analysis before subsampling.

Once unwanted rows have been removed, the columns of seqAnnot are concatenated to generate “repertoire” labels for each row. The repertoire labels are then used to split the rows of genes, and gene prevalence is tallied within a repertoire. By default, columns of seqAnnot that are constant after subsetting will not be included in the label. However, this can be controlled by the simplifyNames parameter. If simplifyNames is FALSE, all columns of seqAnnot are included when generating labels.


#create genes
genes = sample(letters, 10000, replace=TRUE)

#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
visit = sample(c("V1","V2","V3"), 10000, replace=TRUE),
cellType = sample(c("B","T"), 10000, replace=TRUE)

##generate repertoire counts for all repertoires
cts = calcVDJcounts(genes,seqAnnot) 

##Only include visit 1
cts = calcVDJcounts(genes,seqAnnot, select=list(visit="V1"))

## Just T cell repertoires, combining visit 1 and 3 together, and dropping visit 2
cts = calcVDJcounts(genes,seqAnnot, 
select=list(cellType="T", visit=c("V1","V3")),