Get DESeqDataSet objects for downstream analysis

This is a convenience function for generating DESeqDataSet objects, but this function also adds support for counting reads across non-contiguous regions.

getDESeqDataSet(
  dataset.list,
  regions.gr,
  sample_names = NULL,
  gene_names = NULL,
  sizeFactors = NULL,
  field = "score",
  blacklist = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L),
  quiet = FALSE
)

Arguments

dataset.list: An object containing GRanges datasets that can be passed to getCountsByRegions, typically a list of GRanges objects, or a multiplexed GRanges object (see details below).
regions.gr: A GRanges object containing regions of interest.
sample_names: Names for each dataset in dataset.list are required. By default (sample_names = NULL), if dataset.list is a list, the names of the list elements are used; for a multiplexed GRanges object, the field names are used. The names must each contain the string "_rep#", where "#" is a single character (usually a number) indicating the replicate. Sample names across different replicates must be otherwise identical.
gene_names: An optional character vector giving gene names, or any other identifier over which reads should be counted. Gene names are required if counting is to be performed over non-contiguous ranges, i.e. if any genes have multiple ranges. If supplied, gene names are added to the resulting DESeqDataSet object.
sizeFactors: DESeq2 sizeFactors can be optionally applied in to the DESeqDataSet object in this function, or they can be applied later on, either by the user or in a call to getDESeqResults. Applying the sizeFactors later is useful if multiple sets of factors will be explored, although sizeFactors can be overwritten at any time. Note that DESeq2 sizeFactors are not the same as normalization factors defined elsewhere in this package. See details below.
field: Argument passed to getCountsByRegions. Can be used to specify fields in a single multiplexed GRanges object, or individual fields for each GRanges object in dataset.list.
blacklist: An optional GRanges object containing regions that should be excluded from signal counting. Use of this argument is distinct from the use of non-contiguous gene regions (see details below), and the two can be used simultaneously. Blacklisting doesn't affect the ranges returned as rowRanges in the output DESeqDataSet object (unlike the use of non-contiguous regions).
expand_ranges: Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). See getCountsByRegions.
ncores: Number of cores to use for read counting across all samples. By default, all available cores are used.
quiet: If TRUE, all output messages from call to DESeqDataSet will be suppressed.

Value

A DESeqData object in which rowData are given as

rowRanges, which are equivalent to regions.gr, unless there are non-contiguous gene regions (see note below). Samples (as seen in

colData) are factored so that samples are grouped by

replicate and condition, i.e. all non-replicate samples are treated as distinct, and the DESeq2 design = ~condition.

Use of non-contiguous gene regions

In DESeq2, genes must be defined by single, contiguous chromosomal locations. In contrast, this function allows individual genes to be encompassed by multiple distinct ranges in regions.gr. To use non-contiguous gene regions, provide gene_names in which some names are duplicated. For each unique gene in gene_names, this function will generate counts across all ranges for that gene, but be aware that it will only keep the largest range for each gene in the resulting DESeqDataSet object's rowRanges. If the desired use is to blacklist certain sites in a genelist, note that the blacklist argument can be used.

A note on DESeq2 sizeFactors

DESeq2 sizeFactors are sample-specific normalization factors that are applied by division, i.e. \(counts_{norm,i}=counts_i / sizeFactor_i\). This is in contrast to normalization factors as defined in this package (and commonly elsewhere), which are applied by multiplication. Also note that DESeq2's "normalizationFactors" are not sample specific, but rather gene specific factors used to correct for ascertainment bias across different genes (e.g. as might be relevant for GSEA or Go analysis).

On gene names and unexpected errors

Certain gene names can cause this function to return an error. We've never encountered errors using conventional, systematic naming schemes (e.g. ensembl IDs), but we have seen errors when using Drosophila (Flybase) "symbols". We expect this is due to the unconventional use of non-alphanumeric characters in some Drosophila gene names.

Author

Mike DeBerardine

Examples

suppressPackageStartupMessages(require(DESeq2))
#> Warning: package ‘matrixStats’ was built under R version 4.1.2
data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts

# divide PROseq data into 6 toy datasets
ps_a_rep1 <- PROseq[seq(1, length(PROseq), 6)]
ps_b_rep1 <- PROseq[seq(2, length(PROseq), 6)]
ps_c_rep1 <- PROseq[seq(3, length(PROseq), 6)]

ps_a_rep2 <- PROseq[seq(4, length(PROseq), 6)]
ps_b_rep2 <- PROseq[seq(5, length(PROseq), 6)]
ps_c_rep2 <- PROseq[seq(6, length(PROseq), 6)]

ps_list <- list(A_rep1 = ps_a_rep1, A_rep2 = ps_a_rep2,
                B_rep1 = ps_b_rep1, B_rep2 = ps_b_rep2,
                C_rep1 = ps_c_rep1, C_rep2 = ps_c_rep2)

# make flawed dataset (ranges in txs_dm6_chr4 not disjoint)
#    this means there is double-counting
# also using discontinuous gene regions, as gene_ids are repeated
dds <- getDESeqDataSet(ps_list,
                       txs_dm6_chr4,
                       gene_names = txs_dm6_chr4$gene_id,
                       quiet = TRUE,
                       ncores = 1)
dds
#> class: DESeqDataSet 
#> dim: 111 6 
#> metadata(1): version
#> assays(1): counts
#> rownames(111): FBgn0267363 FBgn0266617 ... FBgn0039924 FBgn0027101
#> rowData names(2): tx_name gene_id
#> colnames(6): A_rep1 A_rep2 ... C_rep1 C_rep2
#> colData names(2): condition replicate