R/bootstrap_counts.R
bootstrap-signal-by-position.Rd
These functions perform bootstrap subsampling of mean readcounts at different
positions within regions of interest (metaSubsample
), or, in the more
general case of metaSubsampleMatrix
, column means of a matrix are
bootstrapped by sampling the rows. Mean signal counts can be calculated at
base-pair resolution, or over larger bins.
metaSubsample(
dataset.gr,
regions.gr,
binsize = 1L,
first.output.xval = 1L,
sample.name = deparse(substitute(dataset.gr)),
n.iter = 1000L,
prop.sample = 0.1,
lower = 0.125,
upper = 0.875,
field = "score",
NF = NULL,
remove.empty = FALSE,
blacklist = NULL,
zero_blacklisted = FALSE,
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L)
)
metaSubsampleMatrix(
counts.mat,
binsize = 1L,
first.output.xval = 1L,
sample.name = NULL,
n.iter = 1000L,
prop.sample = 0.1,
lower = 0.125,
upper = 0.875,
NF = 1L,
remove.empty = FALSE,
ncores = getOption("mc.cores", 2L)
)
A GRanges object in which signal is contained in metadata
(typically in the "score"
field), or a list of such GRanges objects.
A GRanges object containing intervals over which to metaplot. All ranges must have the same width.
The size of bin (in basepairs, or number of columns for
metaSubsampleMatrix
) to use for counting signal. Especially
important for counting signal over large or sparse regions.
The relative start position of the first bin, e.g.
if regions.gr
begins at 50 bases upstream of the TSS, set
first.output.xval = -50
. This number only affects the x-values that
are returned, which are provided as a convenience.
Defaults to the name of the input dataset. This is
included in the output as a convenience, as it allows row-binding outputs
from different samples. If length(field) > 1
and the default
sample.name
is left, the sample names will be inferred from the
field names.
Number of random subsampling iterations to perform. Default is
1000
.
The proportion of the ranges in regions.gr
(e.g.
the proportion of genes) or the proportion of rows in counts.mat
to
sample in each iteration. The default is 0.1
(10 percent).
The lower and upper quantiles of subsampled signal means
to return. The defaults, 0.125
and 0.875
(i.e. the 12.5th and
85.5th percentiles) return a 75 percent confidence interval about the
bootstrapped mean.
One or more metadata fields of dataset.gr
to be counted.
An optional normalization factor by which to multiply the counts.
If given, length(NF)
must be equal to length(field)
.
A logical indicating whether regions
(metaSubsample
) or rows (metaSubsampleMatrix
) without signal
should be removed from the analysis. Not recommended if using multiple
fields, as the gene lists will no longer be equivalent.
An optional GRanges object containing regions that should be excluded from signal counting.
When set to FALSE
(the default), signal at
blacklisted sites is ignored from computations. If set to TRUE
,
signal at blacklisted sites will be treated as equal to zero. For
bootstrapping, the default behavior of ignoring signal at blacklisted sites
should almost always be kept.
Logical indicating if ranges in dataset.gr
should
be treated as descriptions of single molecules (FALSE
), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE
). See
getCountsByRegions
.
Number of cores to use for computations.
A matrix over which to bootstrap column means by subsampling its rows. Typically, a matrix of readcounts with rows for genes and columns for positions within those genes.
A dataframe containing x-values, means, lower quantiles, upper quantiles, and the sample name (as a convenience for row-binding multiple of these dataframes). If a list of GRanges is given as input, or if multiple fields are given, a single, combined dataframe is returned containing data for all fields/datasets.
data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts
# for each transcript, use promoter-proximal region from TSS to +100
pr <- promoters(txs_dm6_chr4, 0, 100)
#--------------------------------------------------#
# Bootstrap average signal in each 5 bp bin across all transcripts,
# and get confidence bands for middle 30% of bootstrapped means
#--------------------------------------------------#
set.seed(11)
df <- metaSubsample(PROseq, pr, binsize = 5,
lower = 0.35, upper = 0.65,
ncores = 1)
df[1:10, ]
#> x mean lower upper sample.name
#> 1 3 0.03529412 0.02941176 0.04705882 PROseq
#> 2 8 0.05882353 0.04117647 0.07058824 PROseq
#> 3 13 0.07647059 0.05294118 0.10000000 PROseq
#> 4 18 0.07058824 0.05294118 0.09029412 PROseq
#> 5 23 0.11764706 0.09411765 0.15294118 PROseq
#> 6 28 0.09411765 0.08235294 0.10588235 PROseq
#> 7 33 0.17058824 0.14117647 0.21764706 PROseq
#> 8 38 0.63529412 0.46470588 0.92176471 PROseq
#> 9 43 0.48235294 0.37058824 0.61176471 PROseq
#> 10 48 0.83529412 0.66470588 1.11382353 PROseq
#--------------------------------------------------#
# Plot bootstrapped means with confidence intervals
#--------------------------------------------------#
plot(mean ~ x, df, type = "l", main = "PROseq Signal",
ylab = "Mean + 30% CI", xlab = "Distance from TSS")
polygon(c(df$x, rev(df$x)), c(df$lower, rev(df$upper)),
col = adjustcolor("black", 0.1), border = FALSE)
#==================================================#
# Using a matrix as input
#==================================================#
# generate a matrix of counts in each region
countsmat <- getCountsByPositions(PROseq, pr)
dim(countsmat)
#> [1] 339 100
#--------------------------------------------------#
# bootstrap average signal in 10 bp bins across all transcripts
#--------------------------------------------------#
set.seed(11)
df <- metaSubsampleMatrix(countsmat, binsize = 10,
sample.name = "PROseq",
ncores = 1)
df[1:10, ]
#> x mean lower upper sample.name
#> 1 5.5 0.4411765 0.2352941 1.117647 PROseq
#> 2 15.5 0.7941176 0.3823529 1.882353 PROseq
#> 3 25.5 1.1176471 0.6764706 2.088235 PROseq
#> 4 35.5 4.2205882 1.9411765 12.091912 PROseq
#> 5 45.5 6.8529412 3.2352941 12.882353 PROseq
#> 6 55.5 9.5588235 5.7022059 14.617647 PROseq
#> 7 65.5 4.9411765 3.2058824 7.000000 PROseq
#> 8 75.5 5.4558824 2.0294118 9.827206 PROseq
#> 9 85.5 3.0882353 2.0882353 4.827206 PROseq
#> 10 95.5 2.5882353 1.5294118 4.003676 PROseq
#--------------------------------------------------#
# the same, using a normalization factor, and changing the x-values
#--------------------------------------------------#
set.seed(11)
df <- metaSubsampleMatrix(countsmat, binsize = 10,
first.output.xval = 0, NF = 0.75,
sample.name = "PROseq", ncores = 1)
df[1:10, ]
#> x mean lower upper sample.name
#> 1 4.5 0.3308824 0.1764706 0.8382353 PROseq
#> 2 14.5 0.5955882 0.2867647 1.4117647 PROseq
#> 3 24.5 0.8382353 0.5073529 1.5661765 PROseq
#> 4 34.5 3.1654412 1.4558824 9.0689338 PROseq
#> 5 44.5 5.1397059 2.4264706 9.6617647 PROseq
#> 6 54.5 7.1691176 4.2766544 10.9632353 PROseq
#> 7 64.5 3.7058824 2.4044118 5.2500000 PROseq
#> 8 74.5 4.0919118 1.5220588 7.3704044 PROseq
#> 9 84.5 2.3161765 1.5661765 3.6204044 PROseq
#> 10 94.5 1.9411765 1.1470588 3.0027574 PROseq