Intersect or reduce ranges according to gene names

These functions divide up regions of interest according to associated names, and perform an inter-range operation on them. intersectByGene returns the "consensus" segment that is common to all input ranges, and returns no more than one range per gene. reduceByGene collapses the input ranges into one or more non-overlapping ranges that encompass all segments from the input ranges.

intersectByGene(regions.gr, gene_names)

reduceByGene(regions.gr, gene_names, disjoin = FALSE)

Arguments

regions.gr: A GRanges object containing regions of interest. If regions.gr has the class list, GRangesList, or CompressedGRangesList, it will be treated as if each list element is a gene, and the GRanges within are the ranges associated with that gene.
gene_names: A character vector with the same length as regions.gr.
disjoin: Logical. If disjoin = TRUE, the output GRanges is disjoint, and each output range will match a single gene name. If FALSE, segments from different genes can overlap.

Value

A GRanges object whose individual ranges are named for the associated gene.

Details

These functions modify regions of interest that have associated names, such that several ranges share the same name, e.g. transcripts with associated gene names. Both functions "combine" the ranges on a gene-by-gene basis.

intersectByGene

For each unique gene, the segment that overlaps all input ranges is returned. If no single range can be constructed that overlaps all input ranges, no range is returned for that gene (i.e. the gene is effectively filtered).

In other words, for all the ranges associated with a gene, the most-downstream start site is selected, and the most upstream end site is selected.

reduceByGene

For each unique gene, the associated ranges are reduced to produce one or more non-overlapping ranges. The output range(s) are effectively a union of the input ranges, and cover every input base.

With disjoin = FALSE, no genomic segment is overlapped by more than one range of the same gene, but ranges from different genes can overlap. With disjoin = TRUE, the output ranges are disjoint, and no genomic position is overlapped more than once. Any segment that overlaps more than one gene is removed, but any segment (i.e. any section of an input range) that overlaps only one gene is still maintained.

Typical Uses

A typical use for intersectByGene is to avoid transcript isoform selection, as the returned range is found in every isoform.

reduceByGene can be used to count any and all reads that overlap any part of a gene's annotation, but without double-counting any of them. With disjoin = FALSE, no reads will be double-counted for the same gene, but the same read can be counted for multiple genes. With disjoin = TRUE, no read can be double-counted.

Author

Mike DeBerardine

Examples

# Make example data:
#  Ranges 1 and 2 overlap,
#  Ranges 3 and 4 are adjacent
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(1, 3, 7, 10),
                               end = c(4, 5, 9, 11)))
gr
#> GRanges object with 4 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       1-4      *
#>   [2]     chr1       3-5      *
#>   [3]     chr1       7-9      *
#>   [4]     chr1     10-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

#--------------------------------------------------#
# intersectByGene
#--------------------------------------------------#

intersectByGene(gr, c("A", "A", "B", "B"))
#> GRanges object with 1 range and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   A     chr1       3-4      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

intersectByGene(gr, c("A", "A", "B", "C"))
#> GRanges object with 3 ranges and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   A     chr1       3-4      *
#>   B     chr1       7-9      *
#>   C     chr1     10-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr2 <- gr
end(gr2)[1] <- 10
gr2
#> GRanges object with 4 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      1-10      *
#>   [2]     chr1       3-5      *
#>   [3]     chr1       7-9      *
#>   [4]     chr1     10-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

intersectByGene(gr2, c("A", "A", "B", "C"))
#> GRanges object with 3 ranges and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   A     chr1       3-5      *
#>   B     chr1       7-9      *
#>   C     chr1     10-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

intersectByGene(gr2, c("A", "A", "A", "C"))
#> GRanges object with 1 range and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   C     chr1     10-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

#--------------------------------------------------#
# reduceByGene
#--------------------------------------------------#

# For a given gene, overlapping/adjacent ranges are combined;
#  gaps result in multiple ranges for that gene
gr
#> GRanges object with 4 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       1-4      *
#>   [2]     chr1       3-5      *
#>   [3]     chr1       7-9      *
#>   [4]     chr1     10-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

reduceByGene(gr, c("A", "A", "A", "A"))
#> GRanges object with 2 ranges and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   A     chr1       1-5      *
#>   A     chr1      7-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# With disjoin = FALSE, ranges from different genes can overlap
gnames <- c("A", "B", "B", "B")
reduceByGene(gr, gnames)
#> GRanges object with 3 ranges and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   A     chr1       1-4      *
#>   B     chr1       3-5      *
#>   B     chr1      7-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# With disjoin = TRUE, segments overlapping >1 gene are removed as well
reduceByGene(gr, gnames, disjoin = TRUE)
#> GRanges object with 3 ranges and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   A     chr1       1-2      *
#>   B     chr1         5      *
#>   B     chr1      7-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Will use one more example to demonstrate how all
#  unambiguous segments are identified and returned
gr2
#> GRanges object with 4 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      1-10      *
#>   [2]     chr1       3-5      *
#>   [3]     chr1       7-9      *
#>   [4]     chr1     10-11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

gnames
#> [1] "A" "B" "B" "B"
reduceByGene(gr2, gnames, disjoin = TRUE)
#> GRanges object with 3 ranges and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   A     chr1       1-2      *
#>   A     chr1         6      *
#>   B     chr1        11      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

#--------------------------------------------------#
# reduceByGene, then aggregate counts by gene
#--------------------------------------------------#

# Consider if you did getCountsByRegions on the last output,
#  you can aggregate those counts according to the genes
gr2_redux <- reduceByGene(gr2, gnames, disjoin = TRUE)
counts <- c(5, 2, 3) # if these were the counts-by-regions
aggregate(counts ~ names(gr2_redux), FUN = sum)
#>   names(gr2_redux) counts
#> 1                A      7
#> 2                B      3

# even more convenient if using a melted dataframe
df <- data.frame(gene = names(gr2_redux),
                 reads = counts)
aggregate(reads ~ gene, df, FUN = sum)
#>   gene reads
#> 1    A     7
#> 2    B     3

# can be extended to multiple samples
df <- rbind(df, df)
df$sample <- rep(c("s1", "s2"), each = 3)
df$reads[4:6] <- c(3, 1, 2)
df
#>   gene reads sample
#> 1    A     5     s1
#> 2    A     2     s1
#> 3    B     3     s1
#> 4    A     3     s2
#> 5    A     1     s2
#> 6    B     2     s2

aggregate(reads ~ sample*gene, df, FUN = sum)
#>   sample gene reads
#> 1     s1    A     7
#> 2     s2    A     4
#> 3     s1    B     3
#> 4     s2    B     2