Get the sum of the signal in dataset.gr
that overlaps each range in
regions.gr
. If expand_regions = FALSE
,
getCountsByRegions
is written to calculate readcounts
overlapping each region, while expand_regions = TRUE
will calculate
"coverage signal" (see details below).
getCountsByRegions(
dataset.gr,
regions.gr,
field = "score",
NF = NULL,
blacklist = NULL,
melt = FALSE,
region_names = NULL,
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L)
)
A GRanges object in which signal is contained in metadata (typically in the "score" field), or a named list of such GRanges objects. If a list is given, a dataframe is returned containing the counts in each region for each dataset.
A GRanges object containing regions of interest.
The metadata field of dataset.gr
to be counted. If
length(field) > 1
, a dataframe is returned containing the counts for
each region in each field. If field
not found in
names(mcols(dataset.gr))
, will default to using all fields found in
dataset.gr
.
An optional normalization factor by which to multiply the counts.
If given, length(NF)
must be equal to length(field)
.
An optional GRanges object containing regions that should be excluded from signal counting.
If melt = TRUE
, a dataframe is returned containing a
column for regions and another column for signal. If multiple datasets are
given (if dataset.gr
is a list or if length(field) > 1
), the
output dataframe is melted to contain a third column indicating the sample
names. (See section on return values below).
If melt = TRUE
, an optional vector of names for
the regions in regions.gr
. If left as NULL
, indices of
regions.gr
are used instead.
Logical indicating if ranges in dataset.gr
should
be treated as descriptions of single molecules (FALSE
), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE
). If the ranges in dataset.gr
do not all have a
width of 1, this option has a substantial effect on the results
returned. (See details).
Multiple cores will only be used if dataset.gr
is a list
of multiple datasets, or if length(field) > 1
.
An atomic vector the same length as regions.gr
containing the
sum of the signal overlapping each range of regions.gr
. If
dataset.gr
is a list of multiple GRanges, or if length(field)
> 1
, a dataframe is returned. If melt = FALSE
(the default),
dataframes have a column for each dataset and a row for each region. If
melt = TRUE
, dataframes contain one column to indicate regions
(either by their indices, or by region_names
, if given), another
column to indicate signal, and a third column containing the sample name
(unless dataset.gr
is a single GRanges object).
expand_ranges = FALSE
In this configuration,
getCountsByRegions
is designed to work with data in which each range
represents one type of molecule, whether it's a single base (e.g. the 5'
ends, 3' ends, or centers of reads) or entire reads (i.e. paired 5' and 3'
ends of reads).
This is in contrast to standard run-length compressed GRanges object, as
imported using rtracklayer::import.bw
,
in which a single range can represent multiple contiguous positions that
share the same signal information.
As an example, a range of covering 10 bp with a score of 2 is treated as 2 reads (each spanning the same 10 bases), not 20 reads.
expand_ranges = TRUE
In this configuration, this function
assumes that ranges in dataset.gr
that cover multiple bases are
compressed representations of multiple adjacent positions that contain the
same signal. This type of representation is typical of "coverage" objects,
including bedGraphs and bigWigs generated by many command line utilities,
but not bigWigs as they are imported by
BRGenomics::import_bigWig
.
As an example, a range covering 10 bp with a score of 2 is treated as representing 20 signal counts, i.e. there are 10 adjacent positions that each contain a signal of 2.
If the data truly represents basepair-resolution coverage, the "coverage signal" is equivalent to readcounts. However, users should consider how they interpret results from whole-read coverage, as the "coverage signal" is determined by both the read counts as well as read lengths.
data("PROseq") # load included PROseq data
data("txs_dm6_chr4") # load included transcripts
counts <- getCountsByRegions(PROseq, txs_dm6_chr4)
length(txs_dm6_chr4)
#> [1] 339
length(counts)
#> [1] 339
head(counts)
#> [1] 1 59 13 126 263 2509
# Assign as metadata to the transcript GRanges
txs_dm6_chr4$PROseq <- counts
txs_dm6_chr4[1:6]
#> GRanges object with 6 ranges and 3 metadata columns:
#> seqnames ranges strand | tx_name gene_id PROseq
#> <Rle> <IRanges> <Rle> | <character> <character> <integer>
#> [1] chr4 879-5039 + | FBtr0346692 FBgn0267363 1
#> [2] chr4 42774-43374 + | FBtr0344900 FBgn0266617 59
#> [3] chr4 44774-46074 + | FBtr0340499 FBgn0265633 13
#> [4] chr4 56497-60974 + | FBtr0333704 FBgn0264617 126
#> [5] chr4 56497-63124 + | FBtr0333705 FBgn0264617 263
#> [6] chr4 69326-101419 + | FBtr0100246 FBgn0085432 2509
#> -------
#> seqinfo: 7 sequences from dm6 genome