Random subsampling is performed on reads, not on ranges. Readcounts
should be given as a metadata field (usually "score"). This function can also
subsample ranges directly if field = NULL, although the sample function can
be used instead in that scenario.
subsampleGRanges(
dataset.gr,
n = NULL,
prop = NULL,
field = "score",
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L)
)
A GRanges object in which signal (e.g. readcounts) is contained within metadata, or a list of such GRanges objects.
Either the number of reads to subsample (n), or the
proportion of total signal to subsample (prop). Either n or prop
can be given, but not both. If dataset.gr is a list, or if
length(field) > 1, users can supply a vector or list of n or prop
values to match the individual datasets, but care should be taken to
ensure that a value is given for each and every dataset.
The metadata field of dataset.gr that contains readcounts
for each position. If each range represents a single read, set
field = NULL. If multiple fields are given, and dataset.gr is not a list,
then dataset.gr will be treated as a multiplexed GRanges, and each
field will be treated as an independent dataset. See mergeGRangesData.
Logical indicating if ranges in dataset.gr should
be treated as descriptions of single molecules (FALSE), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE). See getCountsByRegions.
Number of cores to use for computations. Multiple cores are only used
when dataset.gr is a list, or if length(field) > 1.
A GRanges object identical in format to dataset.gr, but
containing a random subset of its data. If field != NULL, the length
of the output cannot be known a priori, but the sum of its score can.
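The distinction between sampling reads and sampling ranges can be illustrated with a minimal base-R sketch. This is not the package's implementation; the vector `score` and the variable names below are hypothetical, and plain integer vectors stand in for a GRanges object. It only shows why the total sampled signal is fixed in advance while the number of positions retained is not.

```r
# Illustrative sketch only: plain vectors stand in for a GRanges object.
set.seed(1)

score <- c(5L, 0L, 3L, 2L)   # hypothetical readcounts at 4 positions
n <- 4L                      # number of reads to subsample

# Expand to one entry per read, labelled by the position it came from
reads <- rep(seq_along(score), times = score)

# Sample reads (not positions) without replacement
picked <- reads[sample.int(length(reads), n)]

# Collapse back to per-position counts
sub_score <- tabulate(picked, nbins = length(score))

# Positions that would remain in the output
kept <- which(sub_score > 0)

sum(sub_score)   # always equal to n
length(kept)     # depends on which reads were drawn
```

Because the reads are drawn without replacement, no position can end up with more signal than it started with, and positions with no sampled reads drop out of the result.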
If the metadata field contains normalized readcounts, an attempt will be made to infer the normalization factor based on the lowest signal value found in the specified field.
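One way such an inference could work is sketched below in base R. This is an assumption made for illustration, not a description of the function's internals: it supposes that the smallest non-zero signal value corresponds to a single read, so dividing by it recovers integer readcounts. The vector `norm_score` is hypothetical.

```r
# Illustrative sketch only; assumes the lowest non-zero value equals one read.
norm_score <- c(0.5, 1.5, 0.25, 1.0)   # hypothetical normalized signal

# Inferred normalization factor: the smallest non-zero signal value
nf <- min(norm_score[norm_score > 0])

# Approximate un-normalized readcounts
counts <- round(norm_score / nf)
```

If no position in the data happens to carry exactly one read, an inference of this kind would underestimate the true readcounts, which is presumably why the documentation describes it only as an attempt.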
data("PROseq") # load included PROseq data
#--------------------------------------------------#
# sample 10% of the reads of a GRanges with signal coverage
#--------------------------------------------------#
ps_sample <- subsampleGRanges(PROseq, prop = 0.1)
# cannot predict number of ranges (positions) that will be sampled
length(PROseq)
#> [1] 47380
length(ps_sample)
#> [1] 6331
# 1/10th the score is sampled
sum(score(PROseq))
#> [1] 73887
sum(score(ps_sample))
#> [1] 7389
#--------------------------------------------------#
# Sample 10% of ranges (e.g. if each range represents one read)
#--------------------------------------------------#
ps_sample <- subsampleGRanges(PROseq, prop = 0.1, field = NULL)
length(PROseq)
#> [1] 47380
length(ps_sample)
#> [1] 4738
# Alternatively
ps_sample <- sample(PROseq, 0.1 * length(PROseq))
length(ps_sample)
#> [1] 4738