Generating and Aggregating Data Within N-dimensional Bins

Divide data along one or more dimensions into equally spaced bins, and summarize the datapoints that fall into each of the resulting n-dimensional bins.

binNdimensions(
  dims.df,
  nbins = 10L,
  use_bin_numbers = TRUE,
  ncores = getOption("mc.cores", 2L)
)

aggregateByNdimBins(
  x,
  dims.df,
  nbins = 10L,
  FUN = mean,
  ...,
  ignore.na = TRUE,
  drop = FALSE,
  empty = NA,
  use_bin_numbers = TRUE,
  ncores = getOption("mc.cores", 2L)
)

densityInNdimBins(
  dims.df,
  nbins = 10L,
  use_bin_numbers = TRUE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dims.df: A dataframe containing one or more columns of numerical data for which bins will be generated.
nbins: Either a number giving the number of bins to use for all dimensions (default = 10), or a vector containing the number of bins to use for each dimension of input data given.
use_bin_numbers: A logical indicating if ordinal bin numbers should be returned (TRUE), or if the center value of each bin should be returned in place of its bin number (FALSE). For instance, if the first bin encompasses data from 1 to 3, then 1 is returned when use_bin_numbers = TRUE, and 2 is returned when use_bin_numbers = FALSE.
ncores: Number of cores to use for computations.
x: The name of the dimension in dims.df to aggregate, or a separate numerical vector or dataframe of data to be aggregated. If x is a numerical vector, each value in x corresponds to a row of dims.df, and so length(x) must be equal to nrow(dims.df). Likewise, if x is a dataframe, nrow(x) must equal nrow(dims.df). Supplying a dataframe for x has the advantage of simultaneously aggregating different sets of data, and returning a single dataframe.
FUN: A function to use for aggregating data within each bin.
...: Additional arguments passed to FUN.
ignore.na: Logical indicating if NA values of x should be ignored. Default is TRUE.
drop: A logical indicating if empty bin combinations should be removed from the output. By default (FALSE), all possible combinations of bins are returned, and empty bins contain a value given by empty.
empty: When drop = FALSE, the value returned for empty bins. By default, empty bins return NA. However, in many circumstances (e.g. if FUN = sum), the empty value should be 0.

Value

A dataframe. See Details for the structure returned by each function.

Details

These functions take in data along one or more dimensions, and each dimension is divided into equally spaced bins spanning its minimum value to its maximum value. For instance, if each row of dims.df were a gene, the columns (the different dimensions) would be various quantitative measures of that gene, e.g. expression level, number of exons, length, etc. If plotted in Cartesian coordinates, each gene would be a single datapoint, and each measurement would be a separate dimension.

binNdimensions returns the bin numbers themselves. The output dataframe has the same dimensions as the input dims.df, but each input value has been replaced by its bin number (an integer). If use_bin_numbers = FALSE, the center points of the bins are returned instead of the bin numbers.
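
The binning described above can be approximated in base R. The following is only a sketch under stated assumptions, not the package's implementation: bin_equal_width is a hypothetical helper, the toy data is made up, and details such as how values on a bin edge are assigned may differ from what binNdimensions actually does.

```r
# Hypothetical helper (not part of the package): assign each value of a
# numeric vector to one of `nbins` equal-width bins spanning its range.
bin_equal_width <- function(v, nbins = 10L, use_bin_numbers = TRUE) {
  breaks <- seq(min(v, na.rm = TRUE), max(v, na.rm = TRUE),
                length.out = nbins + 1L)
  # rightmost.closed = TRUE places the maximum value in the last bin
  idx <- findInterval(v, breaks, rightmost.closed = TRUE)
  if (use_bin_numbers) return(idx)
  # otherwise return the midpoint of the bin each value falls into
  centers <- (breaks[-1L] + breaks[-length(breaks)]) / 2
  centers[idx]
}

# Made-up data: two dimensions, eleven observations
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
                  b = c(0, 50, 100, 0, 50, 100, 0, 50, 100, 0, 50))

# One nbins value per dimension: 5 bins for `a`, 2 bins for `b`
toy_bins <- as.data.frame(Map(bin_equal_width, toy, nbins = c(5L, 2L)))
names(toy_bins) <- paste0("bin.", names(toy))
toy_bins
```

With 5 bins over the range 1 to 11, the first bin of `a` spans 1 to 3, so calling the helper with use_bin_numbers = FALSE would return 2 for those values, matching the example given for the use_bin_numbers argument.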

aggregateByNdimBins summarizes some input data x in each combination of bins, i.e. in each n-dimensional bin. Each row of the output dataframe is a unique combination of the input bins (i.e. each n-dimensional bin). The output contains one column for each binned dimension, named after the corresponding column of dims.df with a "bin." prefix (as in the examples), plus one or more columns containing the aggregated data in each n-dimensional bin. If the input x was a vector, the aggregated column is named "value"; if the input x was a dataframe, the column names from x are maintained. If x names a column of dims.df, as in the examples, the aggregated column keeps that name and that column is not used as a binning dimension.
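
To make the shape of this output concrete, here is a minimal base R sketch of the same idea. It is an illustration on made-up bin assignments, not the package's code: it aggregates a vector within each occupied bin combination, then fills in the unoccupied combinations with NA, analogous to drop = FALSE with empty = NA.

```r
# Made-up bin assignments for two dimensions (3 bins each), and a
# vector of values to summarize (one value per row of `bins`)
bins <- data.frame(bin.a = c(1L, 1L, 2L, 3L, 3L),
                   bin.b = c(1L, 1L, 1L, 3L, 3L))
x <- c(10, 20, 5, 7, NA)

# Summarize x within each occupied bin combination; na.rm = TRUE is
# passed on to mean(), comparable to ignoring NA values of x
occupied <- aggregate(data.frame(value = x), by = bins,
                      FUN = mean, na.rm = TRUE)

# Enumerate every possible bin combination, then merge so that empty
# combinations are kept and receive NA
all_bins <- expand.grid(bin.a = 1:3, bin.b = 1:3)
full <- merge(all_bins, occupied, all.x = TRUE)

# Order rows so that the first dimension varies fastest
full <- full[order(full$bin.b, full$bin.a), ]
full
```

Only three of the nine combinations are occupied here, so six rows hold NA. Keeping only the rows of `occupied` would correspond to drop = TRUE.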

densityInNdimBins returns a dataframe like that of aggregateByNdimBins, except that the "value" column contains the number of observations that fall into each n-dimensional bin.
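
Counting observations per n-dimensional bin can likewise be sketched in base R with table(). This uses made-up bin assignments and is meant only to illustrate the kind of output described, not how densityInNdimBins is implemented.

```r
# Made-up bin assignments for two dimensions (3 bins each)
bins <- data.frame(bin.a = c(1L, 1L, 2L, 3L, 3L),
                   bin.b = c(1L, 1L, 1L, 3L, 3L))

# Convert each column to a factor listing all possible bins, so that
# table() also reports a count of 0 for unoccupied combinations
bins_f <- data.frame(bin.a = factor(bins$bin.a, levels = 1:3),
                     bin.b = factor(bins$bin.b, levels = 1:3))

# Cross-tabulate and flatten to one row per bin combination
counts <- as.data.frame(table(bins_f), responseName = "value")
counts
```

The result has one row per bin combination, with the first dimension varying fastest, and the counts sum to the number of input rows.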

Author

Mike DeBerardine

Examples

library(BRGenomics) # provides the functions and datasets used below

data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts

#--------------------------------------------------#
# find counts in promoter, early genebody, and near CPS
#--------------------------------------------------#

pr <- promoters(txs_dm6_chr4, 0, 100)
early_gb <- genebodies(txs_dm6_chr4, 500, 1000, fix.end = "start")
cps <- genebodies(txs_dm6_chr4, -500, 500, fix.start = "end")

df <- data.frame(counts_pr = getCountsByRegions(PROseq, pr),
                 counts_gb = getCountsByRegions(PROseq, early_gb),
                 counts_cps = getCountsByRegions(PROseq, cps))

#--------------------------------------------------#
# divide genes into 20 bins for each measurement
#--------------------------------------------------#

bin3d <- binNdimensions(df, nbins = 20, ncores = 1)

length(txs_dm6_chr4)
#> [1] 339
nrow(bin3d)
#> [1] 339
bin3d[1:6, ]
#>   bin.counts_pr bin.counts_gb bin.counts_cps
#> 1             1             1              1
#> 2             1             5              3
#> 3             1             1              1
#> 4             1             1              3
#> 5             1             1              1
#> 6             5             4              3

#--------------------------------------------------#
# get number of genes in each bin
#--------------------------------------------------#

bin_counts <- densityInNdimBins(df, nbins = 20, ncores = 1)

bin_counts[1:6, ]
#>   bin.counts_pr bin.counts_gb bin.counts_cps value
#> 1             1             1              1   128
#> 2             2             1              1     2
#> 3             3             1              1     0
#> 4             4             1              1     0
#> 5             5             1              1     0
#> 6             9             1              1     0

#--------------------------------------------------#
# get mean cps reads in bins of promoter and genebody reads
#--------------------------------------------------#

bin2d_cps <- aggregateByNdimBins("counts_cps", df, nbins = 20,
                                 ncores = 1)

bin2d_cps[1:6, ]
#>   bin.counts_pr bin.counts_gb counts_cps
#> 1             1             1   27.70395
#> 2             2             1    0.00000
#> 3             3             1         NA
#> 4             4             1         NA
#> 5             5             1         NA
#> 6             9             1         NA

subset(bin2d_cps, is.finite(counts_cps))[1:6, ]
#>    bin.counts_pr bin.counts_gb counts_cps
#> 1              1             1   27.70395
#> 2              2             1    0.00000
#> 9              1             2   64.19231
#> 10             2             2   89.85714
#> 11             3             2   38.50000
#> 14             9             2   79.00000

#--------------------------------------------------#
# get median cps reads for those bins
#--------------------------------------------------#

bin2d_cps_med <- aggregateByNdimBins("counts_cps", df, nbins = 20,
                                     FUN = median, ncores = 1)

bin2d_cps_med[1:6, ]
#>   bin.counts_pr bin.counts_gb counts_cps
#> 1             1             1          3
#> 2             2             1          0
#> 3             3             1         NA
#> 4             4             1         NA
#> 5             5             1         NA
#> 6             9             1         NA

subset(bin2d_cps_med, is.finite(counts_cps))[1:6, ]
#>    bin.counts_pr bin.counts_gb counts_cps
#> 1              1             1        3.0
#> 2              2             1        0.0
#> 9              1             2       67.5
#> 10             2             2       60.0
#> 11             3             2       38.5
#> 14             9             2       79.0