rdiModel - RDI Models
Generate models equating RDI values to true differences in underlying prevalence values
rdiModel(n, ngenes = NULL, baseVects = NULL, nIter = 50, nSample = 20, units = c("lfc", "pct"), constScale = TRUE)
- the repertoire size.
- numeric vector indicating the number of genes in each chain.
baseVectsis not provided, this parameter will be used to generate a base prevalence vector for each of the genes.
- A vector or list of vectors representing the total prevalence of each gene (for each chain) in the dataset. Differential datasets will be created from alterations of this vector. If not provided, a base vector will be randomly . generated at each subsample step containing the number of genes specified by ngenes.
- The number of iterations (i.e. number of datasets to generate).
- The number of samples to generate for each subsample. Each sample will have a different true fold change, but the same starting vector
- String; either “lfc” or “pct”, depending on what transform was used in the original RDI calculations. See Details.
- logical; if
TRUE, vdjCounts will be scaled such that the sum of each column will be equal to 500 (an arbitrary constant). Otherwise, the columns will be scaled to the average count of all the columns.
A list containing three objects:
| an object of class
| an object of class
This method uses simulated sequencing datasets to estimate the RDI values for datasets with a known true deviation.
Briefly, a baseline probability vector (either randomly generated or supplied by the
baseVects parameter) is randomly perturbed, and the difference between the
baseline vector and the perturbed vector is calculated. Then,
datasets of size n are randomly drawn from both the baseline vector and the perturbed
vector, and the RDI distance between all datasets calculated. This process is repeated
nIter times, resulting in a dataset of RDI values and matched true differences.
A set of spline models is then fit to the data: one from RDI to true difference, and
another from true difference to RDI value, allowing for bi-directional conversions.
If a baseline probability vector is not provided, one will be generated from an empirical model of gene segment prevalence. However, for best performance, this is not recommended. Estimates of true fold change is very sensitive to the distribution of features in your count dataset, and it is important that your baseline vector match your overall dataset as accurately as possible. The best baseline vector is almost always the average feature prevalence across all repertoires in a dataset, although manually generated baseline vectors may also work well.
The units used for the RDI model should always match the units used to generate your RDI values. For more details on units, refer to the details of calcRDI.
#create genes genes = sample(letters, 10000, replace=TRUE) #create sequence annotations seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE)) #calculate RDI d = rdi(genes, seqAnnot) ##create a "baseVect" with the same probability as our features ##since we sampled uniformly, the base vector has equal probability baseVect = rep(1/length(letters),length(letters)) ##generate an RDI model m = rdiModel(attr(d, "nseq"), baseVects=baseVect) ##plot the spline model plot(m$fit, xlab="log fold change",ylab="RDI",type='l')
##convert RDI to log fold change mean = predict(m$rev.fit, d)$y mean[mean<0] = 0