R/ndimensional_binning.R
binNdimensions.Rd
Divide data along different dimensions into equally spaced bins, and summarize the datapoints that fall into any of these n-dimensional bins.
binNdimensions(
dims.df,
nbins = 10L,
use_bin_numbers = TRUE,
ncores = getOption("mc.cores", 2L)
)
aggregateByNdimBins(
x,
dims.df,
nbins = 10L,
FUN = mean,
...,
ignore.na = TRUE,
drop = FALSE,
empty = NA,
use_bin_numbers = TRUE,
ncores = getOption("mc.cores", 2L)
)
densityInNdimBins(
dims.df,
nbins = 10L,
use_bin_numbers = TRUE,
ncores = getOption("mc.cores", 2L)
)
dims.df: A dataframe containing one or more columns of numerical data for which bins will be generated.
nbins: Either a number giving the number of bins to use for all dimensions (default = 10), or a vector containing the number of bins to use for each dimension of input data given.
use_bin_numbers: A logical indicating if ordinal bin numbers should be returned (TRUE), or if in place of the bin number, the center value of that bin should be returned. For instance, if the first bin encompasses data from 1 to 3, with use_bin_numbers = TRUE, a 1 is returned, but when FALSE, 2 is returned.
ncores: Number of cores to use for computations.
x: The name of the dimension in dims.df to aggregate, or a separate numerical vector or dataframe of data to be aggregated. If x is a numerical vector, each value in x corresponds to a row of dims.df, and so length(x) must be equal to nrow(dims.df). Likewise, if x is a dataframe, nrow(x) must equal nrow(dims.df). Supplying a dataframe for x has the advantage of simultaneously aggregating different sets of data, and returning a single dataframe.
FUN: A function to use for aggregating data within each bin.
...: Additional arguments passed to FUN.
ignore.na: Logical indicating if NA values of x should be ignored. Default is TRUE.
drop: A logical indicating if empty bin combinations should be removed from the output. By default (FALSE), all possible combinations of bins are returned, and empty bins contain a value given by empty.
empty: When drop = FALSE, the value returned for empty bins. By default, empty bins return NA. However, in many circumstances (e.g. if FUN = sum), the empty value should be 0.
A dataframe.
These functions take in data along one or more dimensions and, for each
dimension, divide the data into equally spaced bins spanning the minimum
value to the maximum value. For instance, if each row of dims.df were a
gene, the columns (the different dimensions) would be various quantitative
measures of that gene, e.g. expression level, number of exons, length, etc.
If plotted in Cartesian coordinates, each gene would be a single datapoint,
and each measurement would be a separate dimension.
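As a rough illustration of equally spaced binning along a single dimension, the following base R sketch may help. It is not the package's implementation: the data are hypothetical, and the choice of right-closed intervals (the default in cut) is an assumption, since this page does not state which side of each bin is closed.

```r
# Hypothetical data: one quantitative measurement per gene
x <- c(1, 2, 3, 4, 10)
nbins <- 3L

# nbins + 1 equally spaced break points from min(x) to max(x)
breaks <- seq(min(x), max(x), length.out = nbins + 1L)

# Assign each value to a bin; include.lowest keeps min(x) in the first bin.
# Intervals are closed on the right here, which is an assumption.
bin <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
bin
#> [1] 1 1 1 1 3
```

Because the bins span the full range of the data, a single extreme value (10 here) stretches the bins and can leave intermediate bins empty.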
binNdimensions returns the bin numbers themselves. The output dataframe has
the same dimensions as the input dims.df, but each input value has been
replaced by its bin number (an integer). If use_bin_numbers = FALSE, the
center points of the bins are returned instead of the bin numbers.
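The relationship between a bin number and a bin center can be sketched in base R as follows. The bin edges and data are hypothetical and chosen so that the first bin spans 1 to 3; this only illustrates the idea and is not taken from the package source.

```r
# Hypothetical bin edges; the first bin spans 1 to 3
breaks <- c(1, 3, 5, 7)
x <- c(1.5, 2.9, 4, 6.5)

# Ordinal bin number for each value
bin <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
bin
#> [1] 1 1 2 3

# Center of each bin: midpoint of its lower and upper edge
centers <- (breaks[-length(breaks)] + breaks[-1]) / 2

# Replace each bin number with the center of that bin
centers[bin]
#> [1] 2 2 4 6
```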
aggregateByNdimBins summarizes some input data x in each combination of
bins, i.e. in each n-dimensional bin. Each row of the output dataframe is a
unique combination of the input bins (i.e. each n-dimensional bin), and the
output columns are identical to those in dims.df, with the addition of one
or more columns containing the aggregated data in each n-dimensional bin.
If the input x was a vector, the column is named "value"; if the input x
was a dataframe, the column names from x are maintained.
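The shape of this output, with one row per bin combination and a fill value for empty bins, can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's method: the bin assignments, the column names bin.a and bin.b, and the variables nbins and empty are all hypothetical stand-ins.

```r
# Hypothetical bin assignments for three observations in two dimensions
bins <- data.frame(bin.a = c(1L, 1L, 2L), bin.b = c(1L, 1L, 2L))
x <- c(10, 20, 5)   # data to aggregate, one value per row of bins
nbins <- 2L
empty <- NA         # fill value for empty bins; 0 may suit FUN = sum

# Aggregate x within each occupied combination of bins
agg <- aggregate(list(value = x), by = bins, FUN = mean)

# Enumerate every possible combination of bins
grid <- expand.grid(bin.a = seq_len(nbins), bin.b = seq_len(nbins))

# Keep all combinations; unoccupied ones have no match and become NA
out <- merge(grid, agg, by = c("bin.a", "bin.b"), all.x = TRUE)

# Fill empty bins (this simple version would also overwrite any NA
# that FUN itself returned for an occupied bin)
out$value[is.na(out$value)] <- empty

# Order rows so that the first dimension varies fastest
out <- out[order(out$bin.b, out$bin.a), ]
out
```

Dropping the rows where no observations fell, rather than filling them, corresponds to the behavior described for drop = TRUE.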
densityInNdimBins returns a dataframe just like aggregateByNdimBins, except
the "value" column contains the number of observations that fall into each
n-dimensional bin.
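Counting observations per n-dimensional bin can be illustrated with a base R contingency table. The bin assignments and column names below are hypothetical, and the approach is only a sketch of the idea rather than the package's implementation.

```r
# Hypothetical bin assignments for three observations in two dimensions
bins <- data.frame(bin.a = c(1L, 1L, 2L), bin.b = c(1L, 1L, 2L))
nbins <- 2L

# Use factors with explicit levels so that empty bins are counted as 0
tab <- table(bin.a = factor(bins$bin.a, levels = seq_len(nbins)),
             bin.b = factor(bins$bin.b, levels = seq_len(nbins)))

# One row per bin combination, with the count in a "value" column
counts <- as.data.frame(tab, responseName = "value")
counts$value
#> [1] 2 0 0 1
```

Note that the bin columns of counts are factors in this sketch, whereas the example output on this page shows numeric bin columns.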
data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts
#--------------------------------------------------#
# find counts in promoter, early genebody, and near CPS
#--------------------------------------------------#
pr <- promoters(txs_dm6_chr4, 0, 100)
early_gb <- genebodies(txs_dm6_chr4, 500, 1000, fix.end = "start")
cps <- genebodies(txs_dm6_chr4, -500, 500, fix.start = "end")
df <- data.frame(counts_pr = getCountsByRegions(PROseq, pr),
                 counts_gb = getCountsByRegions(PROseq, early_gb),
                 counts_cps = getCountsByRegions(PROseq, cps))
#--------------------------------------------------#
# divide genes into 20 bins for each measurement
#--------------------------------------------------#
bin3d <- binNdimensions(df, nbins = 20, ncores = 1)
length(txs_dm6_chr4)
#> [1] 339
nrow(bin3d)
#> [1] 339
bin3d[1:6, ]
#> bin.counts_pr bin.counts_gb bin.counts_cps
#> 1 1 1 1
#> 2 1 5 3
#> 3 1 1 1
#> 4 1 1 3
#> 5 1 1 1
#> 6 5 4 3
#--------------------------------------------------#
# get number of genes in each bin
#--------------------------------------------------#
bin_counts <- densityInNdimBins(df, nbins = 20, ncores = 1)
bin_counts[1:6, ]
#> bin.counts_pr bin.counts_gb bin.counts_cps value
#> 1 1 1 1 128
#> 2 2 1 1 2
#> 3 3 1 1 0
#> 4 4 1 1 0
#> 5 5 1 1 0
#> 6 9 1 1 0
#--------------------------------------------------#
# get mean cps reads in bins of promoter and genebody reads
#--------------------------------------------------#
bin2d_cps <- aggregateByNdimBins("counts_cps", df, nbins = 20,
                                 ncores = 1)
bin2d_cps[1:6, ]
#> bin.counts_pr bin.counts_gb counts_cps
#> 1 1 1 27.70395
#> 2 2 1 0.00000
#> 3 3 1 NA
#> 4 4 1 NA
#> 5 5 1 NA
#> 6 9 1 NA
subset(bin2d_cps, is.finite(counts_cps))[1:6, ]
#> bin.counts_pr bin.counts_gb counts_cps
#> 1 1 1 27.70395
#> 2 2 1 0.00000
#> 9 1 2 64.19231
#> 10 2 2 89.85714
#> 11 3 2 38.50000
#> 14 9 2 79.00000
#--------------------------------------------------#
# get median cps reads for those bins
#--------------------------------------------------#
bin2d_cps_med <- aggregateByNdimBins("counts_cps", df, nbins = 20,
                                     FUN = median, ncores = 1)
bin2d_cps_med[1:6, ]
#> bin.counts_pr bin.counts_gb counts_cps
#> 1 1 1 3
#> 2 2 1 0
#> 3 3 1 NA
#> 4 4 1 NA
#> 5 5 1 NA
#> 6 9 1 NA
subset(bin2d_cps_med, is.finite(counts_cps))[1:6, ]
#> bin.counts_pr bin.counts_gb counts_cps
#> 1 1 1 3.0
#> 2 2 1 0.0
#> 9 1 2 67.5
#> 10 2 2 60.0
#> 11 3 2 38.5
#> 14 9 2 79.0