R/signal_counting.R
getCountsByPositions.Rd
Get the sum of the signal in dataset.gr
that overlaps each position
within each range in regions.gr
. If binning is used (i.e. positions
are wider than 1 bp), any function can be used to summarize the signal
overlapping each bin. For a description of the critical difference between
expand_ranges = FALSE
and expand_ranges = TRUE
, see
getCountsByRegions
.
A GRanges object in which signal is contained in metadata (typically in the "score" field), or a named list of such GRanges objects.
A GRanges object containing regions of interest.
Size of bins (in bp) to use for counting within each range of
regions.gr
. Note that counts will not be length-normalized.
If binsize > 1
, the function used to aggregate the signal
within each bin. By default, the signal is summed, but any function
operating on a numeric vector can be used.
A string indicating the output format if the
ranges in regions.gr
have variable widths. By default, an error is
returned. See details below.
The metadata field of dataset.gr
to be counted. If
length(field) > 1
, the output is a list whose elements contain the
output for generated each field. If field
not found in
names(mcols(dataset.gr))
, will default to using all fields found in
dataset.gr
.
An optional normalization factor by which to multiply the counts.
If given, length(NF)
must be equal to length(field)
.
An optional GRanges object containing regions that should be excluded from signal counting.
A logical indicating if NA values should be returned
for blacklisted regions. By default, signal in the blacklisted sites is
ignored, i.e. the reads are excluded. If NA_blacklisted = TRUE
,
those positions are set to NA
in the final output.
A logical indicating if the count matrices should be melted. If
set to TRUE
, a dataframe is returned in containing columns for
"region", "position", and "signal". If dataset.gr
is a list of
multiple GRanges, or if length(field) > 1
, a single dataframe is
returned, which contains an additional column "sample", which contains
individual sample names. If used with multi-width regions.gr
, the
resulting dataframe will only contain positions that are found within each
respective region.
Logical indicating if ranges in dataset.gr
should
be treated as descriptions of single molecules (FALSE
), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE
). See
getCountsByRegions
.
Multiple cores will only be used if dataset.gr
is a list
of multiple datasets, or if length(field) > 1
.
If the widths of all ranges in regions.gr
are equal, a matrix
is returned that contains a row for each region of interest, and a column
for each position (each base if binsize = 1
) within each region. If
dataset.gr
is a list, a parallel list is returned containing a
matrix for each input dataset.
If the input
regions.gr
contains ranges of varying widths, setting
simplify.multi.widths = "list"
will output a list of variable-length
vectors, with each vector corresponding to an individual input region. If
simplify.multi.widths = "pad 0"
or "pad NA"
, the output is a
matrix containing a row for each range in regions.gr
, but the number
of columns is determined by the largest range in regions.gr
. For
each region of interest, columns that correspond to positions outside of
the input range are set, depending on the argument, to 0
or
NA
.
data("PROseq") # load included PROseq data
data("txs_dm6_chr4") # load included transcripts
#--------------------------------------------------#
# counts from 0 to 50 bp after the TSS
#--------------------------------------------------#
txs_pr <- promoters(txs_dm6_chr4, 0, 50) # first 50 bases
countsmat <- getCountsByPositions(PROseq, txs_pr)
countsmat[10:15, 41:50] # show only 41-50 bp after TSS
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0 1 0 0 0 0 0 0 0 0
#> [2,] 1 0 0 4 0 2 2 0 1 3
#> [3,] 1 0 0 4 0 2 2 0 1 3
#> [4,] 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 5 2 1 6 2 0 5
#--------------------------------------------------#
# redo with 10 bp bins from 0 to 100
#--------------------------------------------------#
# column 5 is sums of rows shown above
txs_pr <- promoters(txs_dm6_chr4, 0, 100)
countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10)
countsmat[10:15, ]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0 0 0 0 1 12 19 160 6 12
#> [2,] 2 0 0 21 13 9 6 8 7 0
#> [3,] 2 0 0 21 13 9 6 8 7 0
#> [4,] 4 3 3 0 0 0 2 0 1 0
#> [5,] 2 1 5 1 0 1 0 1 3 0
#> [6,] 0 0 0 3 21 23 5 0 1 0
#--------------------------------------------------#
# same as the above, but with the average signal in each bin
#--------------------------------------------------#
countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10, FUN = mean)
countsmat[10:15, ]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.0 0.0 0.0 0.0 0.1 1.2 1.9 16.0 0.6 1.2
#> [2,] 0.2 0.0 0.0 2.1 1.3 0.9 0.6 0.8 0.7 0.0
#> [3,] 0.2 0.0 0.0 2.1 1.3 0.9 0.6 0.8 0.7 0.0
#> [4,] 0.4 0.3 0.3 0.0 0.0 0.0 0.2 0.0 0.1 0.0
#> [5,] 0.2 0.1 0.5 0.1 0.0 0.1 0.0 0.1 0.3 0.0
#> [6,] 0.0 0.0 0.0 0.3 2.1 2.3 0.5 0.0 0.1 0.0
#--------------------------------------------------#
# standard deviation of signal in each bin
#--------------------------------------------------#
countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10, FUN = sd)
round(countsmat[10:15, ], 1)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.0 0.0 0.0 0.0 0.3 3.8 3.8 38.0 0.8 0.8
#> [2,] 0.4 0.0 0.0 5.3 1.4 0.9 1.0 1.6 1.1 0.0
#> [3,] 0.4 0.0 0.0 5.3 1.4 0.9 1.0 1.6 1.1 0.0
#> [4,] 0.5 0.7 0.7 0.0 0.0 0.0 0.6 0.0 0.3 0.0
#> [5,] 0.4 0.3 0.7 0.3 0.0 0.3 0.0 0.3 0.5 0.0
#> [6,] 0.0 0.0 0.0 0.5 2.4 2.4 1.1 0.0 0.3 0.0