Calculating spike-in normalization factors

Use getSpikeInNFs to obtain the spike-in normalization factors, or spikeInNormGRanges to return the input GRanges objects with their readcounts spike-in normalized.

getSpikeInNFs(
  dataset.gr,
  si_pattern = NULL,
  si_names = NULL,
  method = c("SRPMC", "SNR", "RPM"),
  batch_norm = TRUE,
  ctrl_pattern = NULL,
  ctrl_names = NULL,
  field = "score",
  sample_names = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

spikeInNormGRanges(
  dataset.gr,
  si_pattern = NULL,
  si_names = NULL,
  method = c("SRPMC", "SNR", "RPM"),
  batch_norm = TRUE,
  ctrl_pattern = NULL,
  ctrl_names = NULL,
  field = "score",
  sample_names = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dataset.gr: A GRanges object, or (more typically) a list of GRanges objects.
si_pattern: A regular expression that matches spike-in chromosomes. Can be used in addition to, or as an alternative to si_names.
si_names: A character vector giving the names of the spike-in chromosomes. Can be used in addition to, or as an alternative to si_pattern.
method: One of the shown methods, which generate normalization factors for converting raw readcounts into "Spike-in normalized Reads Per Million mapped in Control" (the default), "Spike-in Normalized Read counts", or "Reads Per Million mapped". See descriptions below.
batch_norm: A logical indicating if batch normalization should be used (TRUE by default). See descriptions below. If batch normalization is used, sample names must end with "rep#", wherein "#" is one or more characters (usually a number) giving the replicate. If this is not the case, users can use the sample_names argument to make the names conform.
ctrl_pattern: A regular expression that matches negative control sample names.
ctrl_names: A character vector giving the names of the negative control samples. Can be used as an alternative to ctrl_pattern.
field: The metadata field in dataset.gr that contains raw readcounts. If each range is an individual read, set field = NULL.
sample_names: An optional character vector that can be used to rename the samples in dataset.gr. Intended use is if dataset.gr is an unnamed list, or if batch_norm = TRUE but the sample names don't conform to the required naming scheme.
expand_ranges: Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). See getCountsByRegions.
ncores: The number of cores to use for computations.

Value

A numeric vector of normalization factors for each sample in

dataset.gr. Normalization factors are to be applied by multiplication.

Spike-in normalized Reads Per Million mapped in Control (SRPMC)

This is the default spike-in normalization method, as its meaning is the most portable and generalizable. Experimental Reads Per Spike-in read (RPS) are calculated for each sample, $i$:

$$RPS_i=\frac{experimental\_reads_i}{ spikein\_reads_i}$$

RPS for each sample is divided by RPS for the negative control, which measures the change in total material vs. the negative control. This global adjustment is applied to standard RPM normalization for each sample:

$$NF_i=\frac{RPS_i}{RPS_{control}} \cdot \frac{1 x 10^6}{experimental\_reads_i}$$

Thus, the negative control(s) are simply RPM-normalized, while the other conditions are in equivalent, directly-comparable units ("Reads Per Million mapped reads in a negative control").

If batch_norm = TRUE (the default), all negative controls will be RPM-normalized, and the global changes in material for all other samples are calculated within each batch (vs. the negative control within the same batch).

If batch_norm = FALSE, all samples are compared to the average RPS of the negative controls. This method can only be justified if batch has less effect on RPS than other sources of variation.

Spike-in Normalized Reads (SNR)

If batch_norm = FALSE, these normalization factors act to scale down the readcounts in each sample to make the spike-in read counts match the sample with the lowest number of spike-in reads:

$$NF_i=\frac{min(spikein\_reads)}{spikein\_reads_i}$$

If batch_norm = TRUE, such normalization factors are calculated within each batch, but a final batch (replicate) adjustment is performed that results in the negative controls having the same normalized readcounts. In this way, the negative controls are used to adjust the normalized readcounts of their entire replicate. Just as when batch_norm = FALSE, one of the normalization factors will be 1, while the rest will be <1.

One use for these normalization factors is for normalizing-by-subsampling; see subsampleBySpikeIn.

Reads Per Million mapped reads (RPM)

A simple convenience wrapper for calculating normalization factors for RPM normalization:

$$NF_i=\frac{1 x 10^6}{experimental\_reads_i}$$

If spike-in reads are present, they're removed before the normalization factors are calculated.

Author

Mike DeBerardine

Examples

#--------------------------------------------------#
# Make list of dummy GRanges
#--------------------------------------------------#
gr1_rep1 <- GRanges(seqnames = c("chr1", "chr2", "spikechr1", "spikechr2"),
                    ranges = IRanges(start = 1:4, width = 1),
                    strand = "+")
gr2_rep2 <- gr2_rep1 <- gr1_rep2 <- gr1_rep1

# set readcounts
score(gr1_rep1) <- c(1, 1, 1, 1) # 2 exp + 2 spike = 4 total
score(gr2_rep1) <- c(2, 2, 1, 1) # 4 exp + 2 spike = 6 total
score(gr1_rep2) <- c(1, 1, 2, 1) # 2 exp + 3 spike = 5 total
score(gr2_rep2) <- c(4, 4, 2, 2) # 8 exp + 4 spike = 12 total

grl <- list(gr1_rep1, gr2_rep1,
            gr1_rep2, gr2_rep2)
names(grl) <- c("gr1_rep1", "gr2_rep1",
                "gr1_rep2", "gr2_rep2")

grl
#> $gr1_rep1
#> GRanges object with 4 ranges and 1 metadata column:
#>        seqnames    ranges strand |     score
#>           <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]      chr1         1      + |         1
#>   [2]      chr2         2      + |         1
#>   [3] spikechr1         3      + |         1
#>   [4] spikechr2         4      + |         1
#>   -------
#>   seqinfo: 4 sequences from an unspecified genome; no seqlengths
#> 
#> $gr2_rep1
#> GRanges object with 4 ranges and 1 metadata column:
#>        seqnames    ranges strand |     score
#>           <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]      chr1         1      + |         2
#>   [2]      chr2         2      + |         2
#>   [3] spikechr1         3      + |         1
#>   [4] spikechr2         4      + |         1
#>   -------
#>   seqinfo: 4 sequences from an unspecified genome; no seqlengths
#> 
#> $gr1_rep2
#> GRanges object with 4 ranges and 1 metadata column:
#>        seqnames    ranges strand |     score
#>           <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]      chr1         1      + |         1
#>   [2]      chr2         2      + |         1
#>   [3] spikechr1         3      + |         2
#>   [4] spikechr2         4      + |         1
#>   -------
#>   seqinfo: 4 sequences from an unspecified genome; no seqlengths
#> 
#> $gr2_rep2
#> GRanges object with 4 ranges and 1 metadata column:
#>        seqnames    ranges strand |     score
#>           <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]      chr1         1      + |         4
#>   [2]      chr2         2      + |         4
#>   [3] spikechr1         3      + |         2
#>   [4] spikechr2         4      + |         2
#>   -------
#>   seqinfo: 4 sequences from an unspecified genome; no seqlengths
#> 

#--------------------------------------------------#
# Get RPM NFs
#--------------------------------------------------#

# can use the names of all spike-in chromosomes
getSpikeInNFs(grl, si_names = c("spikechr1", "spikechr2"),
              method = "RPM", ncores = 1)
#> [1] 500000 250000 500000 125000

# or use a regular expression that matches the spike-in chromosome names
grep("spike", as.vector(seqnames(gr1_rep1)))
#> [1] 3 4

getSpikeInNFs(grl, si_pattern = "spike", method = "RPM", ncores = 1)
#> [1] 500000 250000 500000 125000


#--------------------------------------------------#
# Get simple spike-in NFs ("SNR")
#--------------------------------------------------#

# without batch normalization, NFs make all spike-in readcounts match
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              method = "SNR", batch_norm = FALSE, ncores = 1)
#> [1] 1.0000000 1.0000000 0.6666667 0.5000000

# with batch normalization, controls will have the same normalized counts;
# other samples are normalized to have same spike-in reads as their matched
# control
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              method = "SNR", batch_norm = TRUE, ncores = 1)
#> [1] 1.00 1.00 1.00 0.75

#--------------------------------------------------#
# Get spike-in NFs with more meaningful units ("RPMC")
#--------------------------------------------------#

# compare to raw RPM NFs above; takes into account spike-in reads;
# units are directly comparable to the negative controls

# with batch normalization, these negative controls are the same, as they
# have the same number of non-spike-in readcounts (they're simply RPM)
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1)
#> [1] 500000 500000 500000 375000

# batch_norm = FALSE, the average reads-per-spike-in for the negative
# controls are used to calculate all NFs; unless the controls have the exact
# same ratio of non-spike-in to spike-in reads, nothing is precisely RPM
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              batch_norm = FALSE, ncores = 1)
#> [1] 6e+05 6e+05 4e+05 3e+05

#--------------------------------------------------#
# Apply NFs to the GRanges
#--------------------------------------------------#

spikeInNormGRanges(grl, si_pattern = "spike", ctrl_pattern = "gr1",
                   ncores = 1)
#> $gr1_rep1
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]     chr1         1      + |     5e+05
#>   [2]     chr2         2      + |     5e+05
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $gr2_rep1
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]     chr1         1      + |     1e+06
#>   [2]     chr2         2      + |     1e+06
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $gr1_rep2
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]     chr1         1      + |     5e+05
#>   [2]     chr2         2      + |     5e+05
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $gr2_rep2
#> GRanges object with 2 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]     chr1         1      + |   1500000
#>   [2]     chr2         2      + |   1500000
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>