This is a convenience function for generating DESeqDataSet
objects,
but this function also adds support for counting reads across non-contiguous
regions.
getDESeqDataSet(
dataset.list,
regions.gr,
sample_names = NULL,
gene_names = NULL,
sizeFactors = NULL,
field = "score",
blacklist = NULL,
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L),
quiet = FALSE
)
An object containing GRanges datasets that can be passed
to getCountsByRegions
,
typically a list of GRanges objects, or a
multiplexed GRanges
object (see
details below).
A GRanges object containing regions of interest.
Names for each dataset in dataset.list
are
required. By default (sample_names = NULL
), if dataset.list
is a list, the names of the list elements are used; for a multiplexed
GRanges object, the field names are used. The names must each contain the
string "_rep#", where "#" is a single character (usually a number)
indicating the replicate. Sample names across different replicates must be
otherwise identical.
An optional character vector giving gene names, or any
other identifier over which reads should be counted. Gene names are
required if counting is to be performed over non-contiguous ranges, i.e. if
any genes have multiple ranges. If supplied, gene names are added to the
resulting DESeqDataSet
object.
DESeq2 sizeFactors
can be optionally applied in to
the DESeqDataSet
object in this function, or they can be applied
later on, either by the user or in a call to getDESeqResults
.
Applying the sizeFactors
later is useful if multiple sets of factors
will be explored, although sizeFactors
can be overwritten at any
time. Note that DESeq2 sizeFactors
are not the same as
normalization factors defined elsewhere in this package. See details below.
Argument passed to getCountsByRegions
. Can be used to
specify fields in a single multiplexed GRanges object, or individual fields
for each GRanges object in dataset.list
.
An optional GRanges object containing regions that should be excluded from signal counting. Use of this argument is distinct from the use of non-contiguous gene regions (see details below), and the two can be used simultaneously. Blacklisting doesn't affect the ranges returned as rowRanges in the output DESeqDataSet object (unlike the use of non-contiguous regions).
Logical indicating if ranges in dataset.gr
should
be treated as descriptions of single molecules (FALSE
), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE
). See
getCountsByRegions
.
Number of cores to use for read counting across all samples. By default, all available cores are used.
If TRUE
, all output messages from call to
DESeqDataSet
will be suppressed.
A DESeqData
object in which rowData
are given as
rowRanges
, which are equivalent to regions.gr
, unless there
are non-contiguous gene regions (see note below). Samples (as seen in
colData
) are factored so that samples are grouped by
replicate
and condition
, i.e. all non-replicate samples are
treated as distinct, and the DESeq2 design = ~condition
.
In DESeq2, genes must be defined
by single, contiguous chromosomal locations. In contrast, this function
allows individual genes to be encompassed by multiple distinct ranges in
regions.gr
. To use non-contiguous gene regions, provide
gene_names
in which some names are duplicated. For each unique gene
in gene_names
, this function will generate counts across all ranges
for that gene, but be aware that it will only keep the largest range for
each gene in the resulting DESeqDataSet
object's rowRanges
.
If the desired use is to blacklist certain sites in a genelist, note that
the blacklist
argument can be used.
DESeq2 sizeFactors
are
sample-specific normalization factors that are applied by division, i.e.
\(counts_{norm,i}=counts_i / sizeFactor_i\). This is in contrast to normalization factors as defined in
this package (and commonly elsewhere), which are applied by multiplication.
Also note that DESeq2's "normalizationFactors
" are not sample
specific, but rather gene specific factors used to correct for
ascertainment bias across different genes (e.g. as might be relevant for
GSEA or Go analysis).
Certain gene names can cause this function to return an error. We've never encountered errors using conventional, systematic naming schemes (e.g. ensembl IDs), but we have seen errors when using Drosophila (Flybase) "symbols". We expect this is due to the unconventional use of non-alphanumeric characters in some Drosophila gene names.
suppressPackageStartupMessages(require(DESeq2))
#> Warning: package ‘matrixStats’ was built under R version 4.1.2
data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts
# divide PROseq data into 6 toy datasets
ps_a_rep1 <- PROseq[seq(1, length(PROseq), 6)]
ps_b_rep1 <- PROseq[seq(2, length(PROseq), 6)]
ps_c_rep1 <- PROseq[seq(3, length(PROseq), 6)]
ps_a_rep2 <- PROseq[seq(4, length(PROseq), 6)]
ps_b_rep2 <- PROseq[seq(5, length(PROseq), 6)]
ps_c_rep2 <- PROseq[seq(6, length(PROseq), 6)]
ps_list <- list(A_rep1 = ps_a_rep1, A_rep2 = ps_a_rep2,
B_rep1 = ps_b_rep1, B_rep2 = ps_b_rep2,
C_rep1 = ps_c_rep1, C_rep2 = ps_c_rep2)
# make flawed dataset (ranges in txs_dm6_chr4 not disjoint)
# this means there is double-counting
# also using discontinuous gene regions, as gene_ids are repeated
dds <- getDESeqDataSet(ps_list,
txs_dm6_chr4,
gene_names = txs_dm6_chr4$gene_id,
quiet = TRUE,
ncores = 1)
dds
#> class: DESeqDataSet
#> dim: 111 6
#> metadata(1): version
#> assays(1): counts
#> rownames(111): FBgn0267363 FBgn0266617 ... FBgn0039924 FBgn0027101
#> rowData names(2): tx_name gene_id
#> colnames(6): A_rep1 A_rep2 ... C_rep1 C_rep2
#> colData names(2): condition replicate