Import single-end or paired-end bam files as GRanges objects, with various processing options. It is highly recommend to index the BAM file first.
import_bam(
file,
mapq = 20L,
revcomp = FALSE,
shift = 0L,
trim.to = c("whole", "5p", "3p", "center"),
ignore.strand = FALSE,
field = "score",
paired_end = NULL,
yieldSize = NA,
ncores = 1L
)
import_bam_PROseq(
file,
mapq = 20L,
revcomp = TRUE,
shift = -1L,
trim.to = "3p",
ignore.strand = FALSE,
field = "score",
paired_end = NULL,
yieldSize = NA,
ncores = 1L
)
import_bam_PROcap(
file,
mapq = 20L,
revcomp = FALSE,
shift = 0L,
trim.to = "5p",
ignore.strand = FALSE,
field = "score",
paired_end = NULL,
yieldSize = NA,
ncores = 1L
)
import_bam_ATACseq(
file,
mapq = 20L,
revcomp = FALSE,
shift = 0L,
plus_offset = 4,
minus_offset = -4,
trim.to = "5p",
ignore.strand = TRUE,
field = "score",
paired_end = TRUE,
yieldSize = NA,
ncores = 1L
)
Path of a bam file, or a vector of paths.
Filter reads by a minimum MAPQ score. This is the correct way to filter multi-aligners.
Logical indicating if aligned reads should be reverse-complemented.
Either an integer giving the number of bases by which to shift
the entire read upstream or downstream, or a pair of integers indicating
shifts to be applied to the 5' and 3' ends of the reads, respectively.
Shifting is strand-specific, with negative numbers shifting the reads
upstream, and positive numbers shiftem them downstream. This option is
applied after the revcomp
, but before trim.to
and
ignore.strand
options are applied.
Option for selecting specific bases from the reads, applied
after the revcomp
and shift
options. By default, the entire
read is maintained. Other options are to take only the 5' base, only the 3'
base, or the only the center base of the read.
Logical indicating if the strand information should be
discarded. If TRUE
, strand information is discarded after
revcomp
, trim.to
, and shift
options are applied.
Metadata field name to use for readcounts, usually "score". If
set to NULL
, identical reads (or identical positions if
trim.to
options applied) are not combined, and the length of the
output GRanges will be equal to the number of input reads.
Logical indicating if reads should be treated as paired end
reads. When set to NULL
(the default), the first 100k reads are
checked.
The number of bam file records to process simultaneously,
e.g. the "chunk size". Setting a higher chunk size will use more memory,
which can increase speed if there is enough memory available. If chunking
is not necessary, set to NA
.
Number of cores to use for importing bam files. Currently, multicore is only implemented for simultaneously importing multiple bam files. For smaller datasets or machines with higher memory, this can increase performance, but can otherwise lead to substantial performance penalties.
For importing ATAC-seq, the shift to apply to plus strand alignments. By default, plus strand reads are shifted 4 bp downstream.
For importing ATAC-seq, the shift to apply to minus strand alignments. By default, minus strand reads are shifted 4 bp upstream (in terms of the genomic coordinates).
A GRanges object.
By default, import_bam_ATACseq
will
shift plus-aligned reads downstream 4 bp, minus-aligned reads upstream 4 bp,
and then take the strand-specific start site of the reads before removing
strand information and collapsing identical reads. These steps account for
the 9bp gap between opposing fragments generated from the same Tn5 reaction,
selecting the central base in the 9bp duplication.
While other sources often state that the offset should be +4 on plus strand and -5 on minus strand alignments (or alternatively +5, -4), this does not result in the two positions overlapping. I have verified that this is true based on the expected result of the Tn5 reaction and adapter ligation and sequencing steps, and also using real sequencing data, which confirms that only the +4/-4 shift results in a significant increase in the number positions that overlap. However, these arguments are left to the user if they insist on doing it differently.
Note that the order of operations performed is the same as the order of the
associated arguments in the function proper, but not in the argument
documentation i.e., the plus_offset
and minus_offset
arguments
are applied after the shift
argument and before the
trim.to
argument.
While this single-base precision analysis of ATAC-seq may be useful in some
cases, for most users it is unlikely to be useful. Instead, you might use the
plus_offset
and minus_offset
arguments correctly, but set
trim.to = "whole"
(and keep ignore.strand = TRUE
). This will
keep the entire ATAC-seq reads, which is the most common analysis approach.
It is also common to use coverage data with ATAC-seq, but this eliminates
read count information.
Hojoong Kwak, Nicholas J. Fuda, Leighton J. Core, John T. Lis (2013). Precise Maps of RNA Polymerase Reveal How Promoters Direct Initiation and Pausing. Science 339(6122): 950–953. https://doi.org/10.1126/science.1229386
Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, William J. Greenleaf (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, dna-binding proteins and nucleosome position. Nature Methods 10: 1213–1218. https://doi.org/10.1038/nmeth.2688
# get local address for included bam file
ps.bam <- system.file("extdata", "PROseq_dm6_chr4.bam",
package = "BRGenomics")
#--------------------------------------------------#
# Import entire reads
#--------------------------------------------------#
# Note that PRO-seq reads are sequenced as reverse complement
import_bam(ps.bam, revcomp = TRUE, paired_end = FALSE)
#> GRanges object with 164 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] chr4 1270-1296 + | 1
#> [2] chr4 41411-41429 + | 1
#> [3] chr4 42556-42591 + | 1
#> [4] chr4 42559-42589 + | 1
#> [5] chr4 42559-42594 + | 3
#> ... ... ... ... . ...
#> [160] chr4 1307741-1307776 - | 1
#> [161] chr4 1316536-1316563 - | 1
#> [162] chr4 1318959-1318994 - | 1
#> [163] chr4 1319003-1319038 - | 1
#> [164] chr4 1319368-1319403 - | 1
#> -------
#> seqinfo: 1870 sequences from an unspecified genome
#--------------------------------------------------#
# Import entire reads, 1 range per read
#--------------------------------------------------#
import_bam(ps.bam, revcomp = TRUE, field = NULL,
paired_end = FALSE)
#> GRanges object with 190 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr4 1270-1296 +
#> [2] chr4 41411-41429 +
#> [3] chr4 42556-42591 +
#> [4] chr4 42559-42589 +
#> [5] chr4 42559-42594 +
#> ... ... ... ...
#> [186] chr4 1307741-1307776 -
#> [187] chr4 1316536-1316563 -
#> [188] chr4 1318959-1318994 -
#> [189] chr4 1319003-1319038 -
#> [190] chr4 1319368-1319403 -
#> -------
#> seqinfo: 1870 sequences from an unspecified genome
#--------------------------------------------------#
# Import PRO-seq reads at basepair-resolution
#--------------------------------------------------#
# the typical manner to import PRO-seq data:
import_bam(ps.bam, revcomp = TRUE, trim.to = "3p",
paired_end = FALSE)
#> GRanges object with 150 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] chr4 1296 + | 1
#> [2] chr4 41429 + | 1
#> [3] chr4 42589 + | 1
#> [4] chr4 42591 + | 2
#> [5] chr4 42594 + | 5
#> ... ... ... ... . ...
#> [146] chr4 1307741 - | 1
#> [147] chr4 1316536 - | 1
#> [148] chr4 1318959 - | 1
#> [149] chr4 1319003 - | 1
#> [150] chr4 1319368 - | 1
#> -------
#> seqinfo: 1870 sequences from an unspecified genome
#--------------------------------------------------#
# Import PRO-seq reads, removing the run-on base
#--------------------------------------------------#
# the best way to import PRO-seq data; removes the
# most 3' base, which was added in the run-on
import_bam(ps.bam, revcomp = TRUE, trim.to = "3p",
shift = -1, paired_end = FALSE)
#> GRanges object with 150 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] chr4 1295 + | 1
#> [2] chr4 41428 + | 1
#> [3] chr4 42588 + | 1
#> [4] chr4 42590 + | 2
#> [5] chr4 42593 + | 5
#> ... ... ... ... . ...
#> [146] chr4 1307742 - | 1
#> [147] chr4 1316537 - | 1
#> [148] chr4 1318960 - | 1
#> [149] chr4 1319004 - | 1
#> [150] chr4 1319369 - | 1
#> -------
#> seqinfo: 1870 sequences from an unspecified genome
#--------------------------------------------------#
# Import 5' ends of PRO-seq reads
#--------------------------------------------------#
# will include bona fide TSSes as well as hydrolysis products
import_bam(ps.bam, revcomp = TRUE, trim.to = "5p",
paired_end = FALSE)
#> GRanges object with 145 ranges and 1 metadata column:
#> seqnames ranges strand | score
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] chr4 1270 + | 1
#> [2] chr4 41411 + | 1
#> [3] chr4 42556 + | 1
#> [4] chr4 42559 + | 4
#> [5] chr4 42560 + | 4
#> ... ... ... ... . ...
#> [141] chr4 1307776 - | 1
#> [142] chr4 1316563 - | 1
#> [143] chr4 1318994 - | 1
#> [144] chr4 1319038 - | 1
#> [145] chr4 1319403 - | 1
#> -------
#> seqinfo: 1870 sequences from an unspecified genome