Get the sum of the signal in dataset.gr that overlaps each range in regions.gr. If expand_regions = FALSE, getCountsByRegions is written to calculate readcounts overlapping each region, while expand_regions = TRUE will calculate "coverage signal" (see details below).

getCountsByRegions(
  dataset.gr,
  regions.gr,
  field = "score",
  NF = NULL,
  blacklist = NULL,
  melt = FALSE,
  region_names = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dataset.gr

A GRanges object in which signal is contained in metadata (typically in the "score" field), or a named list of such GRanges objects. If a list is given, a dataframe is returned containing the counts in each region for each dataset.

regions.gr

A GRanges object containing regions of interest.

field

The metadata field of dataset.gr to be counted. If length(field) > 1, a dataframe is returned containing the counts for each region in each field. If field not found in names(mcols(dataset.gr)), will default to using all fields found in dataset.gr.

NF

An optional normalization factor by which to multiply the counts. If given, length(NF) must be equal to length(field).

blacklist

An optional GRanges object containing regions that should be excluded from signal counting.

melt

If melt = TRUE, a dataframe is returned containing a column for regions and another column for signal. If multiple datasets are given (if dataset.gr is a list or if length(field) > 1), the output dataframe is melted to contain a third column indicating the sample names. (See section on return values below).

region_names

If melt = TRUE, an optional vector of names for the regions in regions.gr. If left as NULL, indices of regions.gr are used instead.

expand_ranges

Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). If the ranges in dataset.gr do not all have a width of 1, this option has a substantial effect on the results returned. (See details).

ncores

Multiple cores will only be used if dataset.gr is a list of multiple datasets, or if length(field) > 1.

Value

An atomic vector the same length as regions.gr containing the sum of the signal overlapping each range of regions.gr. If

dataset.gr is a list of multiple GRanges, or if length(field) > 1, a dataframe is returned. If melt = FALSE (the default), dataframes have a column for each dataset and a row for each region. If

melt = TRUE, dataframes contain one column to indicate regions (either by their indices, or by region_names, if given), another column to indicate signal, and a third column containing the sample name (unless dataset.gr is a single GRanges object).

expand_ranges = FALSE

In this configuration, getCountsByRegions is designed to work with data in which each range represents one type of molecule, whether it's a single base (e.g. the 5' ends, 3' ends, or centers of reads) or entire reads (i.e. paired 5' and 3' ends of reads).

This is in contrast to standard run-length compressed GRanges object, as imported using rtracklayer::import.bw, in which a single range can represent multiple contiguous positions that share the same signal information.

As an example, a range of covering 10 bp with a score of 2 is treated as 2 reads (each spanning the same 10 bases), not 20 reads.

expand_ranges = TRUE

In this configuration, this function assumes that ranges in dataset.gr that cover multiple bases are compressed representations of multiple adjacent positions that contain the same signal. This type of representation is typical of "coverage" objects, including bedGraphs and bigWigs generated by many command line utilities, but not bigWigs as they are imported by BRGenomics::import_bigWig.

As an example, a range covering 10 bp with a score of 2 is treated as representing 20 signal counts, i.e. there are 10 adjacent positions that each contain a signal of 2.

If the data truly represents basepair-resolution coverage, the "coverage signal" is equivalent to readcounts. However, users should consider how they interpret results from whole-read coverage, as the "coverage signal" is determined by both the read counts as well as read lengths.

Author

Mike DeBerardine

Examples

data("PROseq") # load included PROseq data
data("txs_dm6_chr4") # load included transcripts

counts <- getCountsByRegions(PROseq, txs_dm6_chr4)

length(txs_dm6_chr4)
#> [1] 339
length(counts)
#> [1] 339
head(counts)
#> [1]    1   59   13  126  263 2509

# Assign as metadata to the transcript GRanges
txs_dm6_chr4$PROseq <- counts

txs_dm6_chr4[1:6]
#> GRanges object with 6 ranges and 3 metadata columns:
#>       seqnames       ranges strand |     tx_name     gene_id    PROseq
#>          <Rle>    <IRanges>  <Rle> | <character> <character> <integer>
#>   [1]     chr4     879-5039      + | FBtr0346692 FBgn0267363         1
#>   [2]     chr4  42774-43374      + | FBtr0344900 FBgn0266617        59
#>   [3]     chr4  44774-46074      + | FBtr0340499 FBgn0265633        13
#>   [4]     chr4  56497-60974      + | FBtr0333704 FBgn0264617       126
#>   [5]     chr4  56497-63124      + | FBtr0333705 FBgn0264617       263
#>   [6]     chr4 69326-101419      + | FBtr0100246 FBgn0085432      2509
#>   -------
#>   seqinfo: 7 sequences from dm6 genome