Randomly subsample reads from GRanges dataset — subsampleGRanges • BRGenomics

Random subsampling is performed on reads, not on ranges. Readcounts should be given as a metadata field (usually "score"). This function can also subsample ranges directly if field = NULL, although the sample function can be used instead in that scenario.

subsampleGRanges(
  dataset.gr,
  n = NULL,
  prop = NULL,
  field = "score",
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dataset.gr: A GRanges object in which signal (e.g. readcounts) is contained within metadata, or a list of such GRanges objects.
n, prop: Either the number of reads to subsample (n), or the proportion of total signal to subsample (prop). Either n or prop can be given, but not both. If dataset.gr is a list, or if length(field) > 1, users can supply a vector or list of n or prop values to match the individual datasets, but care should be taken to ensure that a value is given for each and every dataset.
field: The metadata field of dataset.gr that contains readcounts for each position. If each range represents a single read, set field = NULL. If multiple fields are given, and dataset.gr is not a list, then dataset.gr will be treated as a multiplexed GRanges, and each field will be treated as an independent dataset. See mergeGRangesData.
expand_ranges: Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). See getCountsByRegions.
ncores: Number of cores to use for computations. Multicore only used when dataset.gr is a list, or if length(field) > 1.
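
The interplay between n and prop described above can be sketched in base R. This is only an illustration, not the package's code: resolve_n is a hypothetical helper, and both the rounding of prop * total and the reuse of a single value across all datasets are assumptions.

```r
# Hypothetical helper (not part of BRGenomics): turn n or prop into a
# per-dataset number of reads, given the total signal of each dataset.
resolve_n <- function(totals, n = NULL, prop = NULL) {
  # exactly one of n and prop may be supplied
  if (is.null(n) == is.null(prop))
    stop("supply exactly one of n or prop")
  vals <- if (is.null(n)) prop else n
  # assumption: a single value applies to every dataset;
  # otherwise one value is required per dataset
  if (!(length(vals) %in% c(1L, length(totals))))
    stop("supply one value, or one value per dataset")
  if (is.null(n)) round(prop * totals) else rep_len(n, length(totals))
}

resolve_n(73887, prop = 0.1)                 # one dataset, 10% of its signal
resolve_n(c(100, 200), prop = c(0.5, 0.1))   # one proportion per dataset
```

With a total signal of 73887 and prop = 0.1, the helper returns 7389, which is the same total reported for the subsampled PROseq data in the examples.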

Value

A GRanges object identical in format to dataset.gr, but containing a random subset of its data. If field != NULL, the length of the output cannot be known a priori, but the sum of its score can.
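
Why the summed score is predictable while the output length is not can be seen in a small base R sketch. It is a conceptual illustration under assumed data, not the BRGenomics implementation: the score vector and n below are made up, and each position is expanded into one entry per read before sampling.

```r
# Conceptual sketch in base R; not the BRGenomics implementation.
set.seed(1)  # only to make the illustration reproducible

# Hypothetical per-position readcounts (stand-in for a "score" field)
score <- c(5L, 0L, 2L, 1L, 12L)
n <- 4L  # number of reads to subsample

# Expand each position into one entry per read, then sample reads
# without replacement
read_pos <- rep(seq_along(score), times = score)
picked <- read_pos[sample.int(length(read_pos), n)]

# Re-tabulate the sampled reads per position
sub_score <- tabulate(picked, nbins = length(score))

sum(sub_score)      # always equals n
sum(sub_score > 0)  # number of positions retained; depends on the draw
```

The total of sub_score is fixed by n, but how many positions keep at least one read depends on which reads happen to be drawn.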

Use with normalized readcounts

If the metadata field contains normalized readcounts, an attempt will be made to infer the normalization factor based on the lowest signal value found in the specified field.
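
One way such an inference could work is sketched below in base R. This is an assumption about the general idea, not the package's actual code: the lowest nonzero signal value is taken to represent a single read, and the values in norm_score are made up.

```r
# Hypothetical normalized signal: raw counts scaled by a factor of 0.25
norm_score <- c(0.25, 0.75, 0.5, 2.5)

# Assumption: the lowest nonzero value corresponds to exactly one read
nf <- min(norm_score[norm_score > 0])

# Recover integer readcounts, which can then be subsampled read by read
raw_counts <- round(norm_score / nf)

# After subsampling, counts can be rescaled with the same factor
rescaled <- raw_counts * nf
```

A heuristic of this kind gives the wrong factor when no position in the data holds exactly one read, so the recovered counts are worth checking for plausibility.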

Author

Mike DeBerardine

Examples

library(BRGenomics)
data("PROseq") # load included PROseq data

#--------------------------------------------------#
# Sample 10% of the reads of a GRanges with signal coverage
#--------------------------------------------------#

ps_sample <- subsampleGRanges(PROseq, prop = 0.1)

# cannot predict number of ranges (positions) that will be sampled
length(PROseq)
#> [1] 47380
length(ps_sample)
#> [1] 6331

# 1/10th the score is sampled
sum(score(PROseq))
#> [1] 73887
sum(score(ps_sample))
#> [1] 7389

#--------------------------------------------------#
# Sample 10% of ranges (e.g. if each range represents one read)
#--------------------------------------------------#

ps_sample <- subsampleGRanges(PROseq, prop = 0.1, field = NULL)

length(PROseq)
#> [1] 47380
length(ps_sample)
#> [1] 4738

# Alternatively
ps_sample <- sample(PROseq, 0.1 * length(PROseq))
length(ps_sample)
#> [1] 4738