These functions divide up regions of interest according to associated names,
and perform an inter-range operation on them. intersectByGene
returns
the "consensus" segment that is common to all input ranges, and returns no
more than one range per gene. reduceByGene
collapses the input ranges
into one or more non-overlapping ranges that encompass all segments from the
input ranges.
intersectByGene(regions.gr, gene_names)
reduceByGene(regions.gr, gene_names, disjoin = FALSE)
A GRanges object containing regions of interest. If
regions.gr
has the class list
, GRangesList
, or
CompressedGRangesList
, it will be treated as if each list element is
a gene, and the GRanges within are the ranges associated with that gene.
A character vector with the same length as
regions.gr
.
Logical. If disjoin = TRUE
, the output GRanges is
disjoint, and each output range will match a single gene name. If
FALSE
, segments from different genes can overlap.
A GRanges object whose individual ranges are named for the associated gene.
These functions modify regions of interest that have associated names, such that several ranges share the same name, e.g. transcripts with associated gene names. Both functions "combine" the ranges on a gene-by-gene basis.
intersectByGene
For each unique gene, the segment that overlaps all input ranges is returned. If no single range can be constructed that overlaps all input ranges, no range is returned for that gene (i.e. the gene is effectively filtered).
In other words, for all the ranges associated with a gene, the most-downstream start site is selected, and the most upstream end site is selected.
reduceByGene
For each unique gene, the associated ranges are
reduced
to produce one or
more non-overlapping ranges. The output range(s) are effectively a
union
of the input ranges, and cover every input base.
With disjoin = FALSE
, no genomic segment is overlapped by more than
one range of the same gene, but ranges from different genes can
overlap. With disjoin = TRUE
, the output ranges are disjoint, and no
genomic position is overlapped more than once. Any segment that overlaps
more than one gene is removed, but any segment (i.e. any section of an
input range) that overlaps only one gene is still maintained.
A typical use for intersectByGene
is to avoid transcript isoform
selection, as the returned range is found in every isoform.
reduceByGene
can be used to count any and all reads that overlap any
part of a gene's annotation, but without double-counting any of them. With
disjoin = FALSE
, no reads will be double-counted for the same gene,
but the same read can be counted for multiple genes. With disjoin =
TRUE
, no read can be double-counted.
# Make example data:
# Ranges 1 and 2 overlap,
# Ranges 3 and 4 are adjacent
gr <- GRanges(seqnames = "chr1",
ranges = IRanges(start = c(1, 3, 7, 10),
end = c(4, 5, 9, 11)))
gr
#> GRanges object with 4 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 1-4 *
#> [2] chr1 3-5 *
#> [3] chr1 7-9 *
#> [4] chr1 10-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#--------------------------------------------------#
# intersectByGene
#--------------------------------------------------#
intersectByGene(gr, c("A", "A", "B", "B"))
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> A chr1 3-4 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
intersectByGene(gr, c("A", "A", "B", "C"))
#> GRanges object with 3 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> A chr1 3-4 *
#> B chr1 7-9 *
#> C chr1 10-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
gr2 <- gr
end(gr2)[1] <- 10
gr2
#> GRanges object with 4 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 1-10 *
#> [2] chr1 3-5 *
#> [3] chr1 7-9 *
#> [4] chr1 10-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
intersectByGene(gr2, c("A", "A", "B", "C"))
#> GRanges object with 3 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> A chr1 3-5 *
#> B chr1 7-9 *
#> C chr1 10-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
intersectByGene(gr2, c("A", "A", "A", "C"))
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> C chr1 10-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#--------------------------------------------------#
# reduceByGene
#--------------------------------------------------#
# For a given gene, overlapping/adjacent ranges are combined;
# gaps result in multiple ranges for that gene
gr
#> GRanges object with 4 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 1-4 *
#> [2] chr1 3-5 *
#> [3] chr1 7-9 *
#> [4] chr1 10-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
reduceByGene(gr, c("A", "A", "A", "A"))
#> GRanges object with 2 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> A chr1 1-5 *
#> A chr1 7-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# With disjoin = FALSE, ranges from different genes can overlap
gnames <- c("A", "B", "B", "B")
reduceByGene(gr, gnames)
#> GRanges object with 3 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> A chr1 1-4 *
#> B chr1 3-5 *
#> B chr1 7-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# With disjoin = TRUE, segments overlapping >1 gene are removed as well
reduceByGene(gr, gnames, disjoin = TRUE)
#> GRanges object with 3 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> A chr1 1-2 *
#> B chr1 5 *
#> B chr1 7-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Will use one more example to demonstrate how all
# unambiguous segments are identified and returned
gr2
#> GRanges object with 4 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 1-10 *
#> [2] chr1 3-5 *
#> [3] chr1 7-9 *
#> [4] chr1 10-11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
gnames
#> [1] "A" "B" "B" "B"
reduceByGene(gr2, gnames, disjoin = TRUE)
#> GRanges object with 3 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> A chr1 1-2 *
#> A chr1 6 *
#> B chr1 11 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#--------------------------------------------------#
# reduceByGene, then aggregate counts by gene
#--------------------------------------------------#
# Consider if you did getCountsByRegions on the last output,
# you can aggregate those counts according to the genes
gr2_redux <- reduceByGene(gr2, gnames, disjoin = TRUE)
counts <- c(5, 2, 3) # if these were the counts-by-regions
aggregate(counts ~ names(gr2_redux), FUN = sum)
#> names(gr2_redux) counts
#> 1 A 7
#> 2 B 3
# even more convenient if using a melted dataframe
df <- data.frame(gene = names(gr2_redux),
reads = counts)
aggregate(reads ~ gene, df, FUN = sum)
#> gene reads
#> 1 A 7
#> 2 B 3
# can be extended to multiple samples
df <- rbind(df, df)
df$sample <- rep(c("s1", "s2"), each = 3)
df$reads[4:6] <- c(3, 1, 2)
df
#> gene reads sample
#> 1 A 5 s1
#> 2 A 2 s1
#> 3 B 3 s1
#> 4 A 3 s2
#> 5 A 1 s2
#> 6 B 2 s2
aggregate(reads ~ sample*gene, df, FUN = sum)
#> sample gene reads
#> 1 s1 A 7
#> 2 s2 A 4
#> 3 s1 B 3
#> 4 s2 B 2