Import bam files — import_bam • BRGenomics

Import single-end or paired-end bam files as GRanges objects, with various processing options. It is highly recommend to index the BAM file first.

import_bam(
  file,
  mapq = 20L,
  revcomp = FALSE,
  shift = 0L,
  trim.to = c("whole", "5p", "3p", "center"),
  ignore.strand = FALSE,
  field = "score",
  paired_end = NULL,
  yieldSize = NA,
  ncores = 1L
)

import_bam_PROseq(
  file,
  mapq = 20L,
  revcomp = TRUE,
  shift = -1L,
  trim.to = "3p",
  ignore.strand = FALSE,
  field = "score",
  paired_end = NULL,
  yieldSize = NA,
  ncores = 1L
)

import_bam_PROcap(
  file,
  mapq = 20L,
  revcomp = FALSE,
  shift = 0L,
  trim.to = "5p",
  ignore.strand = FALSE,
  field = "score",
  paired_end = NULL,
  yieldSize = NA,
  ncores = 1L
)

import_bam_ATACseq(
  file,
  mapq = 20L,
  revcomp = FALSE,
  shift = 0L,
  plus_offset = 4,
  minus_offset = -4,
  trim.to = "5p",
  ignore.strand = TRUE,
  field = "score",
  paired_end = TRUE,
  yieldSize = NA,
  ncores = 1L
)

Arguments

file: Path of a bam file, or a vector of paths.
mapq: Filter reads by a minimum MAPQ score. This is the correct way to filter multi-aligners.
revcomp: Logical indicating if aligned reads should be reverse-complemented.
shift: Either an integer giving the number of bases by which to shift the entire read upstream or downstream, or a pair of integers indicating shifts to be applied to the 5' and 3' ends of the reads, respectively. Shifting is strand-specific, with negative numbers shifting the reads upstream, and positive numbers shiftem them downstream. This option is applied after the revcomp, but before trim.to and ignore.strand options are applied.
trim.to: Option for selecting specific bases from the reads, applied after the revcomp and shift options. By default, the entire read is maintained. Other options are to take only the 5' base, only the 3' base, or the only the center base of the read.
ignore.strand: Logical indicating if the strand information should be discarded. If TRUE, strand information is discarded after revcomp, trim.to, and shift options are applied.
field: Metadata field name to use for readcounts, usually "score". If set to NULL, identical reads (or identical positions if trim.to options applied) are not combined, and the length of the output GRanges will be equal to the number of input reads.
paired_end: Logical indicating if reads should be treated as paired end reads. When set to NULL (the default), the first 100k reads are checked.
yieldSize: The number of bam file records to process simultaneously, e.g. the "chunk size". Setting a higher chunk size will use more memory, which can increase speed if there is enough memory available. If chunking is not necessary, set to NA.
ncores: Number of cores to use for importing bam files. Currently, multicore is only implemented for simultaneously importing multiple bam files. For smaller datasets or machines with higher memory, this can increase performance, but can otherwise lead to substantial performance penalties.
plus_offset: For importing ATAC-seq, the shift to apply to plus strand alignments. By default, plus strand reads are shifted 4 bp downstream.
minus_offset: For importing ATAC-seq, the shift to apply to minus strand alignments. By default, minus strand reads are shifted 4 bp upstream (in terms of the genomic coordinates).

Value

A GRanges object.

ATAC-seq data import

By default, import_bam_ATACseq will shift plus-aligned reads downstream 4 bp, minus-aligned reads upstream 4 bp, and then take the strand-specific start site of the reads before removing strand information and collapsing identical reads. These steps account for the 9bp gap between opposing fragments generated from the same Tn5 reaction, selecting the central base in the 9bp duplication.

While other sources often state that the offset should be +4 on plus strand and -5 on minus strand alignments (or alternatively +5, -4), this does not result in the two positions overlapping. I have verified that this is true based on the expected result of the Tn5 reaction and adapter ligation and sequencing steps, and also using real sequencing data, which confirms that only the +4/-4 shift results in a significant increase in the number positions that overlap. However, these arguments are left to the user if they insist on doing it differently.

Note that the order of operations performed is the same as the order of the associated arguments in the function proper, but not in the argument documentation i.e., the plus_offset and minus_offset arguments are applied after the shift argument and before the trim.to argument.

While this single-base precision analysis of ATAC-seq may be useful in some cases, for most users it is unlikely to be useful. Instead, you might use the plus_offset and minus_offset arguments correctly, but set trim.to = "whole" (and keep ignore.strand = TRUE). This will keep the entire ATAC-seq reads, which is the most common analysis approach. It is also common to use coverage data with ATAC-seq, but this eliminates read count information.

References

Hojoong Kwak, Nicholas J. Fuda, Leighton J. Core, John T. Lis (2013). Precise Maps of RNA Polymerase Reveal How Promoters Direct Initiation and Pausing. Science 339(6122): 950–953. https://doi.org/10.1126/science.1229386

Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, William J. Greenleaf (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, dna-binding proteins and nucleosome position. Nature Methods 10: 1213–1218. https://doi.org/10.1038/nmeth.2688

Author

Mike DeBerardine

Examples

# get local address for included bam file
ps.bam <- system.file("extdata", "PROseq_dm6_chr4.bam",
                      package = "BRGenomics")

#--------------------------------------------------#
# Import entire reads
#--------------------------------------------------#

# Note that PRO-seq reads are sequenced as reverse complement
import_bam(ps.bam, revcomp = TRUE, paired_end = FALSE)
#> GRanges object with 164 ranges and 1 metadata column:
#>         seqnames          ranges strand |     score
#>            <Rle>       <IRanges>  <Rle> | <integer>
#>     [1]     chr4       1270-1296      + |         1
#>     [2]     chr4     41411-41429      + |         1
#>     [3]     chr4     42556-42591      + |         1
#>     [4]     chr4     42559-42589      + |         1
#>     [5]     chr4     42559-42594      + |         3
#>     ...      ...             ...    ... .       ...
#>   [160]     chr4 1307741-1307776      - |         1
#>   [161]     chr4 1316536-1316563      - |         1
#>   [162]     chr4 1318959-1318994      - |         1
#>   [163]     chr4 1319003-1319038      - |         1
#>   [164]     chr4 1319368-1319403      - |         1
#>   -------
#>   seqinfo: 1870 sequences from an unspecified genome

#--------------------------------------------------#
# Import entire reads, 1 range per read
#--------------------------------------------------#

import_bam(ps.bam, revcomp = TRUE, field = NULL,
           paired_end = FALSE)
#> GRanges object with 190 ranges and 0 metadata columns:
#>         seqnames          ranges strand
#>            <Rle>       <IRanges>  <Rle>
#>     [1]     chr4       1270-1296      +
#>     [2]     chr4     41411-41429      +
#>     [3]     chr4     42556-42591      +
#>     [4]     chr4     42559-42589      +
#>     [5]     chr4     42559-42594      +
#>     ...      ...             ...    ...
#>   [186]     chr4 1307741-1307776      -
#>   [187]     chr4 1316536-1316563      -
#>   [188]     chr4 1318959-1318994      -
#>   [189]     chr4 1319003-1319038      -
#>   [190]     chr4 1319368-1319403      -
#>   -------
#>   seqinfo: 1870 sequences from an unspecified genome

#--------------------------------------------------#
# Import PRO-seq reads at basepair-resolution
#--------------------------------------------------#

# the typical manner to import PRO-seq data:
import_bam(ps.bam, revcomp = TRUE, trim.to = "3p",
           paired_end = FALSE)
#> GRanges object with 150 ranges and 1 metadata column:
#>         seqnames    ranges strand |     score
#>            <Rle> <IRanges>  <Rle> | <integer>
#>     [1]     chr4      1296      + |         1
#>     [2]     chr4     41429      + |         1
#>     [3]     chr4     42589      + |         1
#>     [4]     chr4     42591      + |         2
#>     [5]     chr4     42594      + |         5
#>     ...      ...       ...    ... .       ...
#>   [146]     chr4   1307741      - |         1
#>   [147]     chr4   1316536      - |         1
#>   [148]     chr4   1318959      - |         1
#>   [149]     chr4   1319003      - |         1
#>   [150]     chr4   1319368      - |         1
#>   -------
#>   seqinfo: 1870 sequences from an unspecified genome

#--------------------------------------------------#
# Import PRO-seq reads, removing the run-on base
#--------------------------------------------------#

# the best way to import PRO-seq data; removes the
# most 3' base, which was added in the run-on
import_bam(ps.bam, revcomp = TRUE, trim.to = "3p",
           shift = -1, paired_end = FALSE)
#> GRanges object with 150 ranges and 1 metadata column:
#>         seqnames    ranges strand |     score
#>            <Rle> <IRanges>  <Rle> | <integer>
#>     [1]     chr4      1295      + |         1
#>     [2]     chr4     41428      + |         1
#>     [3]     chr4     42588      + |         1
#>     [4]     chr4     42590      + |         2
#>     [5]     chr4     42593      + |         5
#>     ...      ...       ...    ... .       ...
#>   [146]     chr4   1307742      - |         1
#>   [147]     chr4   1316537      - |         1
#>   [148]     chr4   1318960      - |         1
#>   [149]     chr4   1319004      - |         1
#>   [150]     chr4   1319369      - |         1
#>   -------
#>   seqinfo: 1870 sequences from an unspecified genome

#--------------------------------------------------#
# Import 5' ends of PRO-seq reads
#--------------------------------------------------#

# will include bona fide TSSes as well as hydrolysis products
import_bam(ps.bam, revcomp = TRUE, trim.to = "5p",
           paired_end = FALSE)
#> GRanges object with 145 ranges and 1 metadata column:
#>         seqnames    ranges strand |     score
#>            <Rle> <IRanges>  <Rle> | <integer>
#>     [1]     chr4      1270      + |         1
#>     [2]     chr4     41411      + |         1
#>     [3]     chr4     42556      + |         1
#>     [4]     chr4     42559      + |         4
#>     [5]     chr4     42560      + |         4
#>     ...      ...       ...    ... .       ...
#>   [141]     chr4   1307776      - |         1
#>   [142]     chr4   1316563      - |         1
#>   [143]     chr4   1318994      - |         1
#>   [144]     chr4   1319038      - |         1
#>   [145]     chr4   1319403      - |         1
#>   -------
#>   seqinfo: 1870 sequences from an unspecified genome