R/bootstrap_counts.R
    bootstrap-signal-by-position.RdThese functions perform bootstrap subsampling of mean readcounts at different
positions within regions of interest (metaSubsample), or, in the more
general case of metaSubsampleMatrix, column means of a matrix are
bootstrapped by sampling the rows. Mean signal counts can be calculated at
base-pair resolution, or over larger bins.
metaSubsample(
  dataset.gr,
  regions.gr,
  binsize = 1L,
  first.output.xval = 1L,
  sample.name = deparse(substitute(dataset.gr)),
  n.iter = 1000L,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  field = "score",
  NF = NULL,
  remove.empty = FALSE,
  blacklist = NULL,
  zero_blacklisted = FALSE,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)
metaSubsampleMatrix(
  counts.mat,
  binsize = 1L,
  first.output.xval = 1L,
  sample.name = NULL,
  n.iter = 1000L,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  NF = 1L,
  remove.empty = FALSE,
  ncores = getOption("mc.cores", 2L)
)A GRanges object in which signal is contained in metadata
(typically in the "score" field), or a list of such GRanges objects.
A GRanges object containing intervals over which to metaplot. All ranges must have the same width.
The size of bin (in basepairs, or number of columns for
metaSubsampleMatrix) to use for counting signal. Especially
important for counting signal over large or sparse regions.
The relative start position of the first bin, e.g.
if regions.gr begins at 50 bases upstream of the TSS, set
first.output.xval = -50. This number only affects the x-values that
are returned, which are provided as a convenience.
Defaults to the name of the input dataset. This is
included in the output as a convenience, as it allows row-binding outputs
from different samples. If length(field) > 1 and the default
sample.name is left, the sample names will be inferred from the
field names.
Number of random subsampling iterations to perform. Default is
1000.
The proportion of the ranges in regions.gr (e.g.
the proportion of genes) or the proportion of rows in counts.mat to
sample in each iteration. The default is 0.1 (10 percent).
The lower and upper quantiles of subsampled signal means
to return. The defaults, 0.125 and 0.875 (i.e. the 12.5th and
85.5th percentiles) return a 75 percent confidence interval about the
bootstrapped mean.
One or more metadata fields of dataset.gr to be counted.
An optional normalization factor by which to multiply the counts.
If given, length(NF) must be equal to length(field).
A logical indicating whether regions
(metaSubsample) or rows (metaSubsampleMatrix) without signal
should be removed from the analysis. Not recommended if using multiple
fields, as the gene lists will no longer be equivalent.
An optional GRanges object containing regions that should be excluded from signal counting.
When set to FALSE (the default), signal at
blacklisted sites is ignored from computations. If set to TRUE,
signal at blacklisted sites will be treated as equal to zero. For
bootstrapping, the default behavior of ignoring signal at blacklisted sites
should almost always be kept.
Logical indicating if ranges in dataset.gr should
be treated as descriptions of single molecules (FALSE), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE). See 
getCountsByRegions.
Number of cores to use for computations.
A matrix over which to bootstrap column means by subsampling its rows. Typically, a matrix of readcounts with rows for genes and columns for positions within those genes.
A dataframe containing x-values, means, lower quantiles, upper quantiles, and the sample name (as a convenience for row-binding multiple of these dataframes). If a list of GRanges is given as input, or if multiple fields are given, a single, combined dataframe is returned containing data for all fields/datasets.
data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts
# for each transcript, use promoter-proximal region from TSS to +100
pr <- promoters(txs_dm6_chr4, 0, 100)
#--------------------------------------------------#
# Bootstrap average signal in each 5 bp bin across all transcripts,
# and get confidence bands for middle 30% of bootstrapped means
#--------------------------------------------------#
set.seed(11)
df <- metaSubsample(PROseq, pr, binsize = 5,
                    lower = 0.35, upper = 0.65,
                    ncores = 1)
df[1:10, ]
#>     x       mean      lower      upper sample.name
#> 1   3 0.03529412 0.02941176 0.04705882      PROseq
#> 2   8 0.05882353 0.04117647 0.07058824      PROseq
#> 3  13 0.07647059 0.05294118 0.10000000      PROseq
#> 4  18 0.07058824 0.05294118 0.09029412      PROseq
#> 5  23 0.11764706 0.09411765 0.15294118      PROseq
#> 6  28 0.09411765 0.08235294 0.10588235      PROseq
#> 7  33 0.17058824 0.14117647 0.21764706      PROseq
#> 8  38 0.63529412 0.46470588 0.92176471      PROseq
#> 9  43 0.48235294 0.37058824 0.61176471      PROseq
#> 10 48 0.83529412 0.66470588 1.11382353      PROseq
#--------------------------------------------------#
# Plot bootstrapped means with confidence intervals
#--------------------------------------------------#
plot(mean ~ x, df, type = "l", main = "PROseq Signal",
     ylab = "Mean + 30% CI", xlab = "Distance from TSS")
polygon(c(df$x, rev(df$x)), c(df$lower, rev(df$upper)),
        col = adjustcolor("black", 0.1), border = FALSE)
#==================================================#
# Using a matrix as input
#==================================================#
# generate a matrix of counts in each region
countsmat <- getCountsByPositions(PROseq, pr)
dim(countsmat)
#> [1] 339 100
#--------------------------------------------------#
# bootstrap average signal in 10 bp bins across all transcripts
#--------------------------------------------------#
set.seed(11)
 df <- metaSubsampleMatrix(countsmat, binsize = 10,
                           sample.name = "PROseq",
                           ncores = 1)
df[1:10, ]
#>       x      mean     lower     upper sample.name
#> 1   5.5 0.4411765 0.2352941  1.117647      PROseq
#> 2  15.5 0.7941176 0.3823529  1.882353      PROseq
#> 3  25.5 1.1176471 0.6764706  2.088235      PROseq
#> 4  35.5 4.2205882 1.9411765 12.091912      PROseq
#> 5  45.5 6.8529412 3.2352941 12.882353      PROseq
#> 6  55.5 9.5588235 5.7022059 14.617647      PROseq
#> 7  65.5 4.9411765 3.2058824  7.000000      PROseq
#> 8  75.5 5.4558824 2.0294118  9.827206      PROseq
#> 9  85.5 3.0882353 2.0882353  4.827206      PROseq
#> 10 95.5 2.5882353 1.5294118  4.003676      PROseq
#--------------------------------------------------#
# the same, using a normalization factor, and changing the x-values
#--------------------------------------------------#
set.seed(11)
df <- metaSubsampleMatrix(countsmat, binsize = 10,
                          first.output.xval = 0, NF = 0.75,
                          sample.name = "PROseq", ncores = 1)
df[1:10, ]
#>       x      mean     lower      upper sample.name
#> 1   4.5 0.3308824 0.1764706  0.8382353      PROseq
#> 2  14.5 0.5955882 0.2867647  1.4117647      PROseq
#> 3  24.5 0.8382353 0.5073529  1.5661765      PROseq
#> 4  34.5 3.1654412 1.4558824  9.0689338      PROseq
#> 5  44.5 5.1397059 2.4264706  9.6617647      PROseq
#> 6  54.5 7.1691176 4.2766544 10.9632353      PROseq
#> 7  64.5 3.7058824 2.4044118  5.2500000      PROseq
#> 8  74.5 4.0919118 1.5220588  7.3704044      PROseq
#> 9  84.5 2.3161765 1.5661765  3.6204044      PROseq
#> 10 94.5 1.9411765 1.1470588  3.0027574      PROseq