calcVDJcounts - Calculate repertoire counts
Create count matrices for a set of repertoires
calcVDJcounts(genes, seqAnnot, select = NULL, combine = NULL, vdjDrop = NULL, splitBy = NULL, simplifyNames = TRUE, splitCommas = FALSE, ...)
- vector (or matrix) containing the gene calls for each sequence. If genes is a matrix, counts will be calculated for each column of ‘genes’, and the resulting count matrices will be concatenated. See Details.
- matrix containing repertoire annotations.
- a list containing definitions of which repertoires to use. See Details.
- a list defining repertoires that should be grouped together. See Details.
- a list specifying specific genes to exclude from analysis. See Details.
- the columns in seqAnnot to use for splitting repertoires. Default is to use all columns in seqAnnot.
- logical; if true, any columns of seqAnnot where all selected (and collapsed) sequences share the same value will not be used to make the names of sequenceCodes.
- logical; if true, seqAnnot is assumed to contain. comma-separated lists of sequence annotations, which will be split up before generating sequence codes. Note: setting this to TRUE will make processing much slower.
- additional parameters; these are ignored by this function.
A matrix where each row represents a gene, and each column represents a repertoire.
In most cases,
genes will be a single vector or one-column matrix. However,
there are some cases where a row of
seqAnnot corresponds to two (or more) genes
(e.g. the V and J gene segments of a single immune sequence). Rather than make multiple
rows for each gene, the
calcVDJcounts function provides the option to provide
a multi-column matrix for
genes. The counts for each column will be tallied
separately, and are then concatenated.
To ensure equal variance across all repertoires, the default RDI metric uses
subsampling to ensure that all repertoires have the same number of sequences. The
default RDI metric subsamples all repertoires to the size of the smallest repertoire,
which may result in a loss of power for comparisons between larger repertoires.
In order to increase power for various tests, it is often useful to only calculate
the repertoire counts for a subset of the repertoires in seqAnnot. This can be done by
combine parameters to specify which
repertoires to include in the analysis.
Both parameters are lists containing entries
with the same name as one of the columns of seqAnnot. For
select, each entry is
a vector defining which values to include (e.g., to include only Visit 1 and 3, you
select=list(visit=c("V1","V3")), where the
in seqAnnot contains the values
"V3"). In this
case, any rows of
seqAnnot that come from a repertoire not
select will be discarded. By default, if a
select code is
not specified for a column in
seqAnnot, all values from that column will be
combine parameter works in a similar fashion, but instead of a vector
describing which parameters to include, you can specify a vector of regular
expressions, and any values of the
seqAnnot column that match the regular
expression will be combined into a single repertoire (e.g. to combine visits 1 and 3
into a single repertoire, you might specify
vdjDrop parameter is also useful for limiting sequences. Like
combine, this is a named list, with entries corresponding to
the columns of
genes. Each entry of
vdjDrop is a vector of gene segment
names to remove from the analysis. All sequences containing those genes are removed
from the analysis before subsampling.
Once unwanted rows have been removed, the columns of
seqAnnot are concatenated
to generate “repertoire” labels for each row. The repertoire labels are then used
to split the rows of
genes, and gene prevalence is tallied within a repertoire.
By default, columns of
seqAnnot that are constant after subsetting will not be
included in the label. However, this can be controlled by the
simplifyNames is FALSE, all columns of
included when generating labels.
#create genes genes = sample(letters, 10000, replace=TRUE) #create sequence annotations seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE), visit = sample(c("V1","V2","V3"), 10000, replace=TRUE), cellType = sample(c("B","T"), 10000, replace=TRUE) ) ##generate repertoire counts for all repertoires cts = calcVDJcounts(genes,seqAnnot) ##Only include visit 1 cts = calcVDJcounts(genes,seqAnnot, select=list(visit="V1")) ## Just T cell repertoires, combining visit 1 and 3 together, and dropping visit 2 cts = calcVDJcounts(genes,seqAnnot, select=list(cellType="T", visit=c("V1","V3")), combine=list(visit="V"))