These functions perform bootstrap subsampling of mean readcounts at different positions within regions of interest (metaSubsample), or, in the more general case of metaSubsampleMatrix, column means of a matrix are bootstrapped by sampling the rows. Mean signal counts can be calculated at base-pair resolution, or over larger bins.

metaSubsample(
  dataset.gr,
  regions.gr,
  binsize = 1L,
  first.output.xval = 1L,
  sample.name = deparse(substitute(dataset.gr)),
  n.iter = 1000L,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  field = "score",
  NF = NULL,
  remove.empty = FALSE,
  blacklist = NULL,
  zero_blacklisted = FALSE,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

metaSubsampleMatrix(
  counts.mat,
  binsize = 1L,
  first.output.xval = 1L,
  sample.name = NULL,
  n.iter = 1000L,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  NF = 1L,
  remove.empty = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dataset.gr

A GRanges object in which signal is contained in metadata (typically in the "score" field), or a list of such GRanges objects.

regions.gr

A GRanges object containing intervals over which to metaplot. All ranges must have the same width.

binsize

The size of bin (in basepairs, or number of columns for metaSubsampleMatrix) to use for counting signal. Especially important for counting signal over large or sparse regions.

first.output.xval

The relative start position of the first bin, e.g. if regions.gr begins at 50 bases upstream of the TSS, set first.output.xval = -50. This number only affects the x-values that are returned, which are provided as a convenience.

sample.name

Defaults to the name of the input dataset. This is included in the output as a convenience, as it allows row-binding outputs from different samples. If length(field) > 1 and the default sample.name is left, the sample names will be inferred from the field names.

n.iter

Number of random subsampling iterations to perform. Default is 1000.

prop.sample

The proportion of the ranges in regions.gr (e.g. the proportion of genes) or the proportion of rows in counts.mat to sample in each iteration. The default is 0.1 (10 percent).

lower, upper

The lower and upper quantiles of subsampled signal means to return. The defaults, 0.125 and 0.875 (i.e. the 12.5th and 85.5th percentiles) return a 75 percent confidence interval about the bootstrapped mean.

field

One or more metadata fields of dataset.gr to be counted.

NF

An optional normalization factor by which to multiply the counts. If given, length(NF) must be equal to length(field).

remove.empty

A logical indicating whether regions (metaSubsample) or rows (metaSubsampleMatrix) without signal should be removed from the analysis. Not recommended if using multiple fields, as the gene lists will no longer be equivalent.

blacklist

An optional GRanges object containing regions that should be excluded from signal counting.

zero_blacklisted

When set to FALSE (the default), signal at blacklisted sites is ignored from computations. If set to TRUE, signal at blacklisted sites will be treated as equal to zero. For bootstrapping, the default behavior of ignoring signal at blacklisted sites should almost always be kept.

expand_ranges

Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). See getCountsByRegions.

ncores

Number of cores to use for computations.

counts.mat

A matrix over which to bootstrap column means by subsampling its rows. Typically, a matrix of readcounts with rows for genes and columns for positions within those genes.

Value

A dataframe containing x-values, means, lower quantiles, upper quantiles, and the sample name (as a convenience for row-binding multiple of these dataframes). If a list of GRanges is given as input, or if multiple fields are given, a single, combined dataframe is returned containing data for all fields/datasets.

Author

Mike DeBerardine

Examples

data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts

# for each transcript, use promoter-proximal region from TSS to +100
pr <- promoters(txs_dm6_chr4, 0, 100)

#--------------------------------------------------#
# Bootstrap average signal in each 5 bp bin across all transcripts,
# and get confidence bands for middle 30% of bootstrapped means
#--------------------------------------------------#

set.seed(11)
df <- metaSubsample(PROseq, pr, binsize = 5,
                    lower = 0.35, upper = 0.65,
                    ncores = 1)
df[1:10, ]
#>     x       mean      lower      upper sample.name
#> 1   3 0.03529412 0.02941176 0.04705882      PROseq
#> 2   8 0.05882353 0.04117647 0.07058824      PROseq
#> 3  13 0.07647059 0.05294118 0.10000000      PROseq
#> 4  18 0.07058824 0.05294118 0.09029412      PROseq
#> 5  23 0.11764706 0.09411765 0.15294118      PROseq
#> 6  28 0.09411765 0.08235294 0.10588235      PROseq
#> 7  33 0.17058824 0.14117647 0.21764706      PROseq
#> 8  38 0.63529412 0.46470588 0.92176471      PROseq
#> 9  43 0.48235294 0.37058824 0.61176471      PROseq
#> 10 48 0.83529412 0.66470588 1.11382353      PROseq

#--------------------------------------------------#
# Plot bootstrapped means with confidence intervals
#--------------------------------------------------#

plot(mean ~ x, df, type = "l", main = "PROseq Signal",
     ylab = "Mean + 30% CI", xlab = "Distance from TSS")
polygon(c(df$x, rev(df$x)), c(df$lower, rev(df$upper)),
        col = adjustcolor("black", 0.1), border = FALSE)



#==================================================#
# Using a matrix as input
#==================================================#

# generate a matrix of counts in each region
countsmat <- getCountsByPositions(PROseq, pr)
dim(countsmat)
#> [1] 339 100

#--------------------------------------------------#
# bootstrap average signal in 10 bp bins across all transcripts
#--------------------------------------------------#

set.seed(11)
 df <- metaSubsampleMatrix(countsmat, binsize = 10,
                           sample.name = "PROseq",
                           ncores = 1)
df[1:10, ]
#>       x      mean     lower     upper sample.name
#> 1   5.5 0.4411765 0.2352941  1.117647      PROseq
#> 2  15.5 0.7941176 0.3823529  1.882353      PROseq
#> 3  25.5 1.1176471 0.6764706  2.088235      PROseq
#> 4  35.5 4.2205882 1.9411765 12.091912      PROseq
#> 5  45.5 6.8529412 3.2352941 12.882353      PROseq
#> 6  55.5 9.5588235 5.7022059 14.617647      PROseq
#> 7  65.5 4.9411765 3.2058824  7.000000      PROseq
#> 8  75.5 5.4558824 2.0294118  9.827206      PROseq
#> 9  85.5 3.0882353 2.0882353  4.827206      PROseq
#> 10 95.5 2.5882353 1.5294118  4.003676      PROseq

#--------------------------------------------------#
# the same, using a normalization factor, and changing the x-values
#--------------------------------------------------#

set.seed(11)
df <- metaSubsampleMatrix(countsmat, binsize = 10,
                          first.output.xval = 0, NF = 0.75,
                          sample.name = "PROseq", ncores = 1)
df[1:10, ]
#>       x      mean     lower      upper sample.name
#> 1   4.5 0.3308824 0.1764706  0.8382353      PROseq
#> 2  14.5 0.5955882 0.2867647  1.4117647      PROseq
#> 3  24.5 0.8382353 0.5073529  1.5661765      PROseq
#> 4  34.5 3.1654412 1.4558824  9.0689338      PROseq
#> 5  44.5 5.1397059 2.4264706  9.6617647      PROseq
#> 6  54.5 7.1691176 4.2766544 10.9632353      PROseq
#> 7  64.5 3.7058824 2.4044118  5.2500000      PROseq
#> 8  74.5 4.0919118 1.5220588  7.3704044      PROseq
#> 9  84.5 2.3161765 1.5661765  3.6204044      PROseq
#> 10 94.5 1.9411765 1.1470588  3.0027574      PROseq