A convenience function for generating DESeqDataSet objects that also adds
support for counting reads across non-contiguous regions.
getDESeqDataSet(
  dataset.list,
  regions.gr,
  sample_names = NULL,
  gene_names = NULL,
  sizeFactors = NULL,
  field = "score",
  blacklist = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L),
  quiet = FALSE
)
An object containing GRanges datasets that can be passed
to getCountsByRegions,
typically a list of GRanges objects, or a
 multiplexed GRanges object (see
details below).
A GRanges object containing regions of interest.
Names for each dataset in dataset.list are
required. By default (sample_names = NULL), if dataset.list
is a list, the names of the list elements are used; for a multiplexed
GRanges object, the field names are used. The names must each contain the
string "_rep#", where "#" is a single character (usually a number)
indicating the replicate. Sample names across different replicates must be
otherwise identical.
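The naming convention above can be checked before calling the function. The following is a minimal sketch in base R, assuming hypothetical sample names; the parsing shown is illustrative only and is not necessarily how the package extracts condition and replicate internally.

```r
# Hypothetical sample names following the "<condition>_rep#" convention
sample_names <- c("A_rep1", "A_rep2", "B_rep1", "B_rep2")

# Every name must contain "_rep" followed by a single replicate character
has_rep <- grepl("_rep.", sample_names)

# Split each name into condition and replicate (illustrative parsing only)
condition <- sub("_rep.", "", sample_names)
replicate <- sub(".*_rep(.).*", "\\1", sample_names)

sample_info <- data.frame(sample = sample_names, condition, replicate)
sample_info
```

If any element of has_rep is FALSE, the corresponding name would need to be fixed before building the DESeqDataSet.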
An optional character vector giving gene names, or any
other identifier over which reads should be counted. Gene names are
required if counting is to be performed over non-contiguous ranges, i.e. if
any genes have multiple ranges. If supplied, gene names are added to the
resulting DESeqDataSet object.
DESeq2 sizeFactors can optionally be applied to
the DESeqDataSet object in this function, or they can be applied
later on, either by the user or in a call to getDESeqResults.
Applying the sizeFactors later is useful if multiple sets of factors
will be explored, although sizeFactors can be overwritten at any
time. Note that DESeq2 sizeFactors are not the same as
normalization factors defined elsewhere in this package. See details below.
Argument passed to getCountsByRegions. Can be used to
specify fields in a single multiplexed GRanges object, or individual fields
for each GRanges object in dataset.list.
An optional GRanges object containing regions that should be excluded from signal counting. Use of this argument is distinct from the use of non-contiguous gene regions (see details below), and the two can be used simultaneously. Blacklisting doesn't affect the ranges returned as rowRanges in the output DESeqDataSet object (unlike the use of non-contiguous regions).
Logical indicating if ranges in dataset.list should
be treated as descriptions of single molecules (FALSE), or if ranges
should be treated as representing multiple adjacent positions with the same
signal (TRUE). See 
getCountsByRegions.
Number of cores to use for read counting across all samples. By default, the value of the mc.cores option is used, or 2 if that option is not set.
If TRUE, all output messages from the call to
DESeqDataSet will be suppressed.
A DESeqDataSet object in which rowData are given as
rowRanges, which are equivalent to regions.gr, unless there
  are non-contiguous gene regions (see note below). Samples (as seen in
colData) are factored so that samples are grouped by
replicate and condition, i.e. all non-replicate samples are
  treated as distinct, and the DESeq2 design = ~condition.
In DESeq2, genes must be defined
  by single, contiguous chromosomal locations. In contrast, this function
  allows individual genes to be encompassed by multiple distinct ranges in
  regions.gr. To use non-contiguous gene regions, provide
  gene_names in which some names are duplicated. For each unique gene
  in gene_names, this function will generate counts across all ranges
  for that gene, but be aware that it will only keep the largest range for
  each gene in the resulting DESeqDataSet object's rowRanges.
  If the goal is to blacklist certain sites in a gene list, note that
  the blacklist argument can be used.
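To illustrate the idea of counting over non-contiguous regions, here is a minimal sketch in base R, assuming a made-up per-range count matrix and hypothetical gene names. It only shows how counts from ranges sharing a gene name can be summed; it is not the package's internal implementation.

```r
# Toy per-range count matrix: 4 ranges x 2 samples (values are made up)
counts_by_range <- matrix(
  c(5, 3, 2, 7,
    6, 1, 4, 9),
  ncol = 2,
  dimnames = list(NULL, c("A_rep1", "A_rep2"))
)

# Hypothetical gene names; "geneX" is covered by two separate ranges
gene_names <- c("geneX", "geneY", "geneX", "geneZ")

# Sum counts over all ranges sharing a gene name
counts_by_gene <- rowsum(counts_by_range, group = gene_names)
counts_by_gene
```

The result has one row per unique gene name, which mirrors why only a single range per gene can be kept in rowRanges.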
DESeq2 sizeFactors are
  sample-specific normalization factors that are applied by division, i.e.
  \(counts_{norm,i}=counts_i / sizeFactor_i\). This is in contrast to normalization factors as defined in
  this package (and commonly elsewhere), which are applied by multiplication.
  Also note that DESeq2's "normalizationFactors" are not
  sample-specific, but rather gene-specific factors used to correct for
  ascertainment bias across different genes (e.g. as might be relevant for
  GSEA or GO analysis).
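The difference between division and multiplication can be made concrete with a small worked example. This is a sketch in base R using made-up counts and hypothetical factors; it assumes that a multiplicative normalization factor is converted to an equivalent size factor by taking its reciprocal, which follows from the formula above.

```r
# Toy count matrix: 2 genes x 3 samples (values are made up)
counts <- matrix(
  c(10, 20,
    30, 60,
    5, 10),
  nrow = 2,
  dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3"))
)

# Hypothetical multiplicative normalization factors, one per sample
norm_factors <- c(s1 = 1, s2 = 0.5, s3 = 2)

# Equivalent size factors are the reciprocals, since they divide the counts
size_factors <- 1 / norm_factors

# Apply each kind of factor column-wise (one factor per sample)
by_division <- sweep(counts, 2, size_factors, "/")
by_multiplication <- sweep(counts, 2, norm_factors, "*")

all.equal(by_division, by_multiplication)
```

Both approaches give the same normalized counts, so passing multiplicative factors directly as sizeFactors without inverting them would scale the samples in the wrong direction.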
Certain gene names can cause this function to return an error. We've never encountered errors using conventional, systematic naming schemes (e.g. Ensembl IDs), but we have seen errors when using Drosophila (FlyBase) "symbols". We expect this is due to the unconventional use of non-alphanumeric characters in some Drosophila gene names.
suppressPackageStartupMessages(require(DESeq2))
data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts
# divide PROseq data into 6 toy datasets
ps_a_rep1 <- PROseq[seq(1, length(PROseq), 6)]
ps_b_rep1 <- PROseq[seq(2, length(PROseq), 6)]
ps_c_rep1 <- PROseq[seq(3, length(PROseq), 6)]
ps_a_rep2 <- PROseq[seq(4, length(PROseq), 6)]
ps_b_rep2 <- PROseq[seq(5, length(PROseq), 6)]
ps_c_rep2 <- PROseq[seq(6, length(PROseq), 6)]
ps_list <- list(A_rep1 = ps_a_rep1, A_rep2 = ps_a_rep2,
                B_rep1 = ps_b_rep1, B_rep2 = ps_b_rep2,
                C_rep1 = ps_c_rep1, C_rep2 = ps_c_rep2)
# make flawed dataset (ranges in txs_dm6_chr4 not disjoint)
#    this means there is double-counting
# also using discontinuous gene regions, as gene_ids are repeated
dds <- getDESeqDataSet(ps_list,
                       txs_dm6_chr4,
                       gene_names = txs_dm6_chr4$gene_id,
                       quiet = TRUE,
                       ncores = 1)
dds
#> class: DESeqDataSet 
#> dim: 111 6 
#> metadata(1): version
#> assays(1): counts
#> rownames(111): FBgn0267363 FBgn0266617 ... FBgn0039924 FBgn0027101
#> rowData names(2): tx_name gene_id
#> colnames(6): A_rep1 A_rep2 ... C_rep1 C_rep2
#> colData names(2): condition replicate