Use getSpikeInNFs
to obtain the spike-in normalization factors, or
spikeInNormGRanges
to return the input GRanges objects with their
readcounts spike-in normalized.
getSpikeInNFs(
dataset.gr,
si_pattern = NULL,
si_names = NULL,
method = c("SRPMC", "SNR", "RPM"),
batch_norm = TRUE,
ctrl_pattern = NULL,
ctrl_names = NULL,
field = "score",
sample_names = NULL,
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L)
)
spikeInNormGRanges(
dataset.gr,
si_pattern = NULL,
si_names = NULL,
method = c("SRPMC", "SNR", "RPM"),
batch_norm = TRUE,
ctrl_pattern = NULL,
ctrl_names = NULL,
field = "score",
sample_names = NULL,
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L)
)
A GRanges object, or (more typically) a list of GRanges objects.
A regular expression that matches spike-in chromosomes. Can
be used in addition to, or as an alternative to si_names
.
A character vector giving the names of the spike-in
chromosomes. Can be used in addition to, or as an alternative to
si_pattern
.
One of the shown methods, which generate normalization factors for converting raw readcounts into "Spike-in normalized Reads Per Million mapped in Control" (the default), "Spike-in Normalized Read counts", or "Reads Per Million mapped". See descriptions below.
A logical indicating if batch normalization should be used
(TRUE
by default). See descriptions below. If batch normalization is
used, sample names must end with "rep#", wherein "#" is one or more
characters (usually a number) giving the replicate. If this is not the
case, users can use the sample_names
argument to make the names
conform.
A regular expression that matches negative control sample names.
A character vector giving the names of the negative control
samples. Can be used as an alternative to ctrl_pattern
.
The metadata field in dataset.gr
that contains raw
readcounts. If each range is an individual read, set field = NULL
.
An optional character vector that can be used to rename
the samples in dataset.gr
. Intended use is if dataset.gr
is
an unnamed list, or if batch_norm = TRUE
but the sample names don't
conform to the required naming scheme.
Logical indicating if ranges in dataset.gr
should
be treated as descriptions of single molecules (FALSE
), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE
). See
getCountsByRegions
.
The number of cores to use for computations.
A numeric vector of normalization factors for each sample in
dataset.gr
. Normalization factors are to be applied by
multiplication.
This is the default spike-in normalization method, as its meaning is the most portable and generalizable. Experimental Reads Per Spike-in read (RPS) are calculated for each sample, \(i\):
$$RPS_i=\frac{experimental\_reads_i}{ spikein\_reads_i}$$
RPS for each sample is divided by RPS for the negative control, which measures the change in total material vs. the negative control. This global adjustment is applied to standard RPM normalization for each sample:
$$NF_i=\frac{RPS_i}{RPS_{control}} \cdot \frac{1 x 10^6}{experimental\_reads_i}$$
Thus, the negative control(s) are simply RPM-normalized, while the other conditions are in equivalent, directly-comparable units ("Reads Per Million mapped reads in a negative control").
If batch_norm = TRUE
(the default), all negative controls will be
RPM-normalized, and the global changes in material for all other samples
are calculated within each batch (vs. the negative control within
the same batch).
If batch_norm = FALSE
, all samples are compared to the average RPS
of the negative controls. This method can only be justified if batch has
less effect on RPS than other sources of variation.
If batch_norm = FALSE
, these
normalization factors act to scale down the readcounts in each sample to
make the spike-in read counts match the sample with the lowest number of
spike-in reads:
$$NF_i=\frac{min(spikein\_reads)}{spikein\_reads_i}$$
If batch_norm = TRUE
, such normalization factors are calculated
within each batch, but a final batch (replicate) adjustment is performed
that results in the negative controls having the same normalized
readcounts. In this way, the negative controls are used to adjust the
normalized readcounts of their entire replicate. Just as when
batch_norm = FALSE
, one of the normalization factors will be
1
, while the rest will be <1
.
One use for these normalization factors is for normalizing-by-subsampling;
see subsampleBySpikeIn
.
A simple convenience wrapper for calculating normalization factors for RPM normalization:
$$NF_i=\frac{1 x 10^6}{experimental\_reads_i}$$
If spike-in reads are present, they're removed before the normalization factors are calculated.
#--------------------------------------------------#
# Make list of dummy GRanges
#--------------------------------------------------#
gr1_rep1 <- GRanges(seqnames = c("chr1", "chr2", "spikechr1", "spikechr2"),
ranges = IRanges(start = 1:4, width = 1),
strand = "+")
gr2_rep2 <- gr2_rep1 <- gr1_rep2 <- gr1_rep1
# set readcounts
score(gr1_rep1) <- c(1, 1, 1, 1) # 2 exp + 2 spike = 4 total
score(gr2_rep1) <- c(2, 2, 1, 1) # 4 exp + 2 spike = 6 total
score(gr1_rep2) <- c(1, 1, 2, 1) # 2 exp + 3 spike = 5 total
score(gr2_rep2) <- c(4, 4, 2, 2) # 8 exp + 4 spike = 12 total
grl <- list(gr1_rep1, gr2_rep1,
gr1_rep2, gr2_rep2)
names(grl) <- c("gr1_rep1", "gr2_rep1",
"gr1_rep2", "gr2_rep2")
grl
#> $gr1_rep1
#> GRanges object with 4 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 1
#> [2] chr2 2 + | 1
#> [3] spikechr1 3 + | 1
#> [4] spikechr2 4 + | 1
#> -------
#> seqinfo: 4 sequences from an unspecified genome; no seqlengths
#>
#> $gr2_rep1
#> GRanges object with 4 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 2
#> [2] chr2 2 + | 2
#> [3] spikechr1 3 + | 1
#> [4] spikechr2 4 + | 1
#> -------
#> seqinfo: 4 sequences from an unspecified genome; no seqlengths
#>
#> $gr1_rep2
#> GRanges object with 4 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 1
#> [2] chr2 2 + | 1
#> [3] spikechr1 3 + | 2
#> [4] spikechr2 4 + | 1
#> -------
#> seqinfo: 4 sequences from an unspecified genome; no seqlengths
#>
#> $gr2_rep2
#> GRanges object with 4 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 4
#> [2] chr2 2 + | 4
#> [3] spikechr1 3 + | 2
#> [4] spikechr2 4 + | 2
#> -------
#> seqinfo: 4 sequences from an unspecified genome; no seqlengths
#>
#--------------------------------------------------#
# Get RPM NFs
#--------------------------------------------------#
# can use the names of all spike-in chromosomes
getSpikeInNFs(grl, si_names = c("spikechr1", "spikechr2"),
method = "RPM", ncores = 1)
#> [1] 500000 250000 500000 125000
# or use a regular expression that matches the spike-in chromosome names
grep("spike", as.vector(seqnames(gr1_rep1)))
#> [1] 3 4
getSpikeInNFs(grl, si_pattern = "spike", method = "RPM", ncores = 1)
#> [1] 500000 250000 500000 125000
#--------------------------------------------------#
# Get simple spike-in NFs ("SNR")
#--------------------------------------------------#
# without batch normalization, NFs make all spike-in readcounts match
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
method = "SNR", batch_norm = FALSE, ncores = 1)
#> [1] 1.0000000 1.0000000 0.6666667 0.5000000
# with batch normalization, controls will have the same normalized counts;
# other samples are normalized to have same spike-in reads as their matched
# control
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
method = "SNR", batch_norm = TRUE, ncores = 1)
#> [1] 1.00 1.00 1.00 0.75
#--------------------------------------------------#
# Get spike-in NFs with more meaningful units ("RPMC")
#--------------------------------------------------#
# compare to raw RPM NFs above; takes into account spike-in reads;
# units are directly comparable to the negative controls
# with batch normalization, these negative controls are the same, as they
# have the same number of non-spike-in readcounts (they're simply RPM)
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1)
#> [1] 500000 500000 500000 375000
# batch_norm = FALSE, the average reads-per-spike-in for the negative
# controls are used to calculate all NFs; unless the controls have the exact
# same ratio of non-spike-in to spike-in reads, nothing is precisely RPM
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
batch_norm = FALSE, ncores = 1)
#> [1] 6e+05 6e+05 4e+05 3e+05
#--------------------------------------------------#
# Apply NFs to the GRanges
#--------------------------------------------------#
spikeInNormGRanges(grl, si_pattern = "spike", ctrl_pattern = "gr1",
ncores = 1)
#> $gr1_rep1
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 5e+05
#> [2] chr2 2 + | 5e+05
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>
#> $gr2_rep1
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 1e+06
#> [2] chr2 2 + | 1e+06
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>
#> $gr1_rep2
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 5e+05
#> [2] chr2 2 + | 5e+05
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>
#> $gr2_rep2
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] chr1 1 + | 1500000
#> [2] chr2 2 + | 1500000
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>