Binning

The binning method consists of obtaining an estimate of the PDF using a histogram. The relative frequency at a point \(x_i\) can be calculated from a histogram as:

\[\hat{f}(x_i) = \frac{c_i}{n},\]

where \(\hat{f}\) is an estimate of the relative frequency, \(c_i\) is the number of observations in the same bin as \(x_i\) and \(n\) is the total number of observations.

To get a probability density from this frequency, the width of the bin \(\Delta\) must be accounted for, so the density estimate is calculated as \(\hat{p}(x_i) = \hat{f}(x_i) / \Delta = c_i / (n\Delta)\).
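As a minimal sketch of the histogram density estimate (bin count divided by the number of observations and by the bin width), the following uses plain NumPy rather than the toolbox itself. The sample, the choice of 20 bins and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=1_000)          # illustrative 1D sample

# Uniform binning: 20 bins of equal width delta (an arbitrary choice).
counts, edges = np.histogram(data, bins=20)
n = data.size
delta = edges[1] - edges[0]

rel_freq = counts / n                  # relative frequency c_i / n per bin
density = rel_freq / delta             # density c_i / (n * delta) per bin

# Look up the density of every observation through its bin index.
# The clip keeps the maximum value inside the last bin.
idx = np.clip(np.searchsorted(edges, data, side="right") - 1, 0, counts.size - 1)
p_hat = density[idx]
```

The per-bin `density` array integrates to one over the binned range, which is what separates a density from a raw or relative frequency.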

The ideal \(\Delta\) varies depending on the data, but several “rules of thumb” exist that can be used as guidance. These have been implemented in the UNITE toolbox, building on and following the same notation as numpy’s histogram_bin_edges. See: estimate_ideal_bins.
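To illustrate how the rules of thumb differ, the sketch below calls numpy's `histogram_bin_edges` directly with a few of its method names. It does not call the toolbox function; the sample and the selection of methods are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)               # illustrative 1D sample

# A few of the method names accepted by numpy.histogram_bin_edges.
methods = ["fd", "scott", "sturges", "sqrt"]

n_bins = {}
for m in methods:
    edges = np.histogram_bin_edges(x, bins=m)
    n_bins[m] = len(edges) - 1         # k + 1 edges delimit k bins

for m, k in n_bins.items():
    print(f"{m:8s} -> {k} bins")
```

Different rules can suggest noticeably different bin counts for the same sample, so it is worth checking how sensitive a result is to this choice.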

Having an estimate of density \(\hat{p}(x)\) at \(x\) means that entropy, KL divergence and mutual information can be calculated directly as resubstitution estimates using the following equations:

\[H(X) = -\frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \hat{p}\left ( x_i \right ) \right )}}\]
\[D_{KL}(p||q) = \frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \frac{\hat{p}(x_i)}{\hat{q}(x_i)} \right )}}\]
\[I(X;Y) = \frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \frac{\hat{p}(x_i,y_i)}{\hat{p}(x_i)\hat{p}(y_i)} \right )}}\]
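
The entropy equation can be sketched with plain NumPy as follows. This is an illustration under stated assumptions (a uniform sample on [0, 2], 10 equal-width bins, hypothetical variable names), not the toolbox implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0, size=5_000)  # true differential entropy is log(2) nats

# Histogram density over 10 equal-width bins.
density, edges = np.histogram(x, bins=10, density=True)

# Density of each observation, found through its bin index.
idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, density.size - 1)
p_hat = density[idx]

# Resubstitution estimate: minus the sample mean of the log-density [nats].
h = -np.mean(np.log(p_hat))
print(h, np.log(2.0))
```

The same samples are used both to build the density estimate and to evaluate it, which is what makes this a resubstitution estimate.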
unite_toolbox.bin_estimators.estimate_ideal_bins(data: numpy.ndarray, *, counts: bool | None = True) → dict[str, int | list[float]]

Estimate the number of ideal bins.

Estimates the ideal number of bins for each feature (column) of a 2D data array using different methods. See numpy.histogram_bin_edges for a list of available methods.

Parameters

data : numpy.ndarray

Array of shape (n_samples, d_features)

counts : bool, optional

Whether to return the number of bins (True) or the bin edges (False).

Returns

dict

A dictionary with one key per method. Each value is a list holding, for every feature of the data, the number of bins (if counts=True) or the bin edges (if counts=False).

unite_toolbox.bin_estimators.calc_bin_density(x: numpy.ndarray, data: numpy.ndarray, edges: list[int | list[float]]) → numpy.ndarray

Calculate density using binning.

Calculates the density of every point of the 2D array x within the d-dimensional histogram created from data and edges.

Similar to a lookup operation where the entries in x are replaced by the bin indices in which they would fall given the binning scheme defined in edges. Then the indices are used to “look up” each value of x in the d-dimensional histogram created from data and edges.

Parameters

x : numpy.ndarray

Array of shape (n_samples, d_features)

data : numpy.ndarray

Array of shape (n_samples, d_features)

edges : list or int

A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension.

Returns

fx : numpy.ndarray

Array of shape (n_samples, 1)
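
The lookup idea described above can be sketched with `numpy.histogramdd` and `numpy.searchsorted`. This is not the toolbox code: the sample, the bin counts and the variable names are assumptions, and the query points are taken from the data so that they all fall inside the binned range.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(2_000, 2))     # illustrative 2D sample
x = data[:5]                           # query points inside the data range

# d-dimensional histogram normalised to a density.
hist, edges = np.histogramdd(data, bins=[10, 12], density=True)

# Replace every entry of x by the index of the bin it falls in.
idx = []
for j, e in enumerate(edges):
    i = np.searchsorted(e, x[:, j], side="right") - 1
    idx.append(np.clip(i, 0, len(e) - 2))  # keep the maximum in the last bin

# Use the indices to look up the density of each point.
fx = hist[tuple(idx)].reshape(-1, 1)
```

Points outside the range covered by the edges would need separate handling (for example a density of zero), which this sketch leaves out.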

unite_toolbox.bin_estimators.calc_vol_array(edges: list[numpy.ndarray]) → numpy.ndarray

Calculate the volume of a multidimensional array.

Calculates the volume of each cell of the multidimensional array defined by edges, where edges is a list of arrays. As an example, if edges contains two arrays, this function returns a 2D grid where each element in the grid contains the area of that specific cell. If edges contains three arrays, the returned grid is 3D and each element of the grid is the volume of the cell, and so on.

As this is done to calculate the volume of each bin of a multidimensional histogram, the returned grid can be indexed by the same indices as a histogramdd from NumPy.

Parameters

edges : list[np.ndarray]

List of 1D NumPy arrays

Returns

vol : numpy.ndarray

Array of shape (len(arr0) - 1, len(arr1) - 1, …, len(arrn) - 1)

Example

>>> a = np.array([0.0, 1.0, 3.0, 7.0, 12.0])
>>> b = np.array([4.0, 8.0, 10.0])
>>> calc_vol_array([a, b])
array([[ 4.,  2.],
       [ 8.,  4.],
       [16.,  8.],
       [20., 10.]])
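
One way to obtain such a grid is the outer product of the bin widths along each dimension. The helper name `cell_volumes` below is hypothetical and the code is a sketch of the idea, not the toolbox implementation.

```python
from functools import reduce

import numpy as np


def cell_volumes(edges):
    """Volume of every cell of the grid defined by a list of 1D edge arrays."""
    widths = [np.diff(e) for e in edges]       # bin widths per dimension
    # Chained outer products give one width combination per cell.
    return reduce(np.multiply.outer, widths)


a = np.array([0.0, 1.0, 3.0, 7.0, 12.0])
b = np.array([4.0, 8.0, 10.0])
vol = cell_volumes([a, b])
print(vol.shape)
```

Each additional array in `edges` adds one axis to the result, so the grid keeps the same indexing as the corresponding multidimensional histogram.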
unite_toolbox.bin_estimators.calc_bin_entropy(data: numpy.ndarray, edges: list[int | float] | int) → tuple[float]

Calculate entropy using binning.

Calculates the (joint) entropy of the input data after binning it along each dimension using specified bin edges or number of bins.

Parameters

data : numpy.ndarray

Array of shape (n_samples, d_features)

edges : list or int

A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension. Input can also be a single int and the histogram will be created with the same number of bins for each dimension.

Returns

h : float

The (joint) entropy of the input data after binning.

cf : float

The correction factor due to bin spacing. See Cover & Thomas (2006) Eq. 8.28 ISBN: 978-0-471-24195-9
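
The role of the correction factor can be illustrated for the 1D, equal-width case: the entropy of the binned (discrete) variable plus the logarithm of the bin width approximates the differential entropy. The sketch below assumes a standard normal sample and 30 bins. Whether the toolbox returns the entropy with or without the correction already applied is not stated here, so the two terms are kept separate.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(size=10_000)         # true entropy: 0.5 * log(2 * pi * e) nats

counts, edges = np.histogram(data, bins=30)
delta = edges[1] - edges[0]            # equal bin width

# Entropy of the discretised variable, skipping empty bins.
pk = counts[counts > 0] / data.size
h_discrete = -np.sum(pk * np.log(pk))

# Correction for the bin width.
corr = np.log(delta)

h_diff = h_discrete + corr
print(h_diff, 0.5 * np.log(2 * np.pi * np.e))
```

Without the correction the result depends directly on the bin width: halving the width raises the discrete entropy by roughly log(2).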

unite_toolbox.bin_estimators.calc_uniform_bin_entropy(data: numpy.ndarray, edges: list[list[float]]) → tuple[float]

Alternative method to calculate entropy using binning.

Calculates the (joint) entropy of the input data. Using this method, every data point is substituted by the specific bin it occupies in edges, so the required memory is limited to storing as many entries as there are in data.

NOTE: this only works for uniform binning schemes, as the correction factor for differential entropy is calculated assuming that every bin is of the same size.

Parameters

data : numpy.ndarray

Array of shape (n_samples, d_features)

edges : list or int

A list of length d_features which contains arrays describing the bin edges along each dimension.

Returns

h : float

The (joint) entropy of the input data after binning.

corr_fact : float

The correction factor due to bin spacing. See Cover & Thomas (2006) Eq. 8.28 ISBN: 978-0-471-24195-9
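
A sketch of this idea is to convert every sample to a tuple of bin indices and count only the occupied bins, so the full d-dimensional histogram is never allocated. The code below is an illustration with assumed data, bin counts and variable names, not the toolbox implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(5_000, 2))     # illustrative 2D sample

# Uniform edges per dimension (15 bins each, an arbitrary choice).
edges = [np.linspace(data[:, j].min(), data[:, j].max(), 16) for j in range(2)]

# Substitute every data point by the bin indices it occupies.
idx = np.column_stack([
    np.clip(np.searchsorted(e, data[:, j], side="right") - 1, 0, len(e) - 2)
    for j, e in enumerate(edges)
])

# Count occupied bins only; empty bins never need to be stored.
_, counts = np.unique(idx, axis=0, return_counts=True)
pk = counts / counts.sum()
h_discrete = -np.sum(pk * np.log(pk))

# One correction term is enough because every bin has the same volume.
bin_volume = np.prod([e[1] - e[0] for e in edges])
corr = np.log(bin_volume)
h = h_discrete + corr
```

With non-uniform bins the volume differs from cell to cell, so a single `log(bin_volume)` term would no longer be valid.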

unite_toolbox.bin_estimators.calc_bin_kld(p: numpy.ndarray, q: numpy.ndarray, edges: list[list[float]]) → float

Calculate Kullback-Leibler divergence (relative entropy) using binning.

Calculates the Kullback-Leibler divergence (relative entropy) between p and q [in nats] by approximating both distributions using some binning scheme defined by edges. edges must be able to support the values in q.

Parameters

p : numpy.ndarray

Array of shape (n_samples, d_features)

q : numpy.ndarray

Array of shape (m_samples, d_features)

edges : list or int

A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension.

Returns

kld : float

Kullback-Leibler divergence between p and q [in nats]
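
As an illustration of the binning approach for KL divergence, the sketch below bins two 1D Gaussian samples on a shared set of edges and averages the log-ratio of the two density estimates over the samples of p. The data, the 40 bins and the decision to skip samples whose bin is empty in q are assumptions of this sketch, not a description of the toolbox internals.

```python
import numpy as np

rng = np.random.default_rng(11)
p = rng.normal(loc=0.0, scale=1.0, size=(20_000, 1))
q = rng.normal(loc=1.0, scale=1.0, size=(20_000, 1))

# Shared edges that cover both samples.
lo = min(p.min(), q.min())
hi = max(p.max(), q.max())
edges = np.linspace(lo, hi, 41)

p_hist, _ = np.histogram(p[:, 0], bins=edges, density=True)
q_hist, _ = np.histogram(q[:, 0], bins=edges, density=True)

# Evaluate both density estimates at the samples of p.
idx = np.clip(np.searchsorted(edges, p[:, 0], side="right") - 1, 0, len(edges) - 2)
p_hat = p_hist[idx]
q_hat = q_hist[idx]

# Skip samples whose bin is empty in q (the log-ratio is undefined there).
valid = q_hat > 0
kld = np.sum(np.log(p_hat[valid] / q_hat[valid])) / p.shape[0]
print(kld)   # analytic value for these two Gaussians is 0.5 nats
```

Bins that are occupied in p but empty in q are where the estimate is least reliable, which is why the edges need to be chosen with the support of both samples in mind.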

unite_toolbox.bin_estimators.calc_bin_mutual_information(x: numpy.ndarray, y: numpy.ndarray, edges: list[list[float]]) → float

Calculate mutual information between x and y using binning.

Calculates the mutual information between an array x and an array y. x and y do not necessarily need the same number of samples, as binning is used. This approach builds multivariate histograms for x, y and x-y using the specified edges, and evaluates I(X; Y) in every bin where the density of x-y is not zero. This is a resubstitution estimate.

Parameters

x : numpy.ndarray

Array of shape (n_samples, d1_features)

y : numpy.ndarray

Array of shape (m_samples, d2_features)

edges : list

A list of two lists each containing either integers for the number of bins in each axis or arrays of the edges for the binning scheme of each axis.

Returns

mi : float

Mutual information between x and y [in nats]
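
A minimal sketch of a binned mutual information estimate for two 1D variables follows. It assumes paired samples of equal length, 20 bins per axis and a correlated Gaussian pair, and it uses `numpy.histogram2d` instead of the toolbox function. Because the bin widths cancel in the ratio of joint to marginal densities, bin probabilities can be used directly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
rho = 0.8                              # correlation of the illustrative pair
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Joint histogram and its marginals as bin probabilities.
joint, x_edges, y_edges = np.histogram2d(x, y, bins=20)
p_xy = joint / n
p_x = p_xy.sum(axis=1, keepdims=True)  # shape (20, 1)
p_y = p_xy.sum(axis=0, keepdims=True)  # shape (1, 20)

# Evaluate only the bins where the joint probability is not zero.
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))
print(mi, -0.5 * np.log(1 - rho**2))   # estimate vs. analytic Gaussian value
```

Weighting each occupied bin by its joint probability is equivalent to averaging the log-ratio over the samples, which matches the resubstitution form of the estimate.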

unite_toolbox.bin_estimators.calc_qs_entropy(sample: numpy.ndarray, alpha: float = 0.25, N_k: int = 500, seed: int | None = None) → float

Calculate the 1-D entropy of the input data.

Calculates the 1-D entropy of the input data [in nats] using the quantile spacing (QS) estimator proposed by: Gupta et al. (2021) https://doi.org/10.3390/e23060740

Adapted from: https://github.com/rehsani/Entropy

Parameters

sample : numpy.ndarray

Flat array

alpha : float, optional

Fraction of the instances from the sample used for estimation of entropy (i.e., number of quantile spacings).

N_k : int, optional

Number of sample subsets, used to estimate the sample distribution for each quantile empirically.

seed : int, optional

Random seed. If None, the default is used.

Returns

h : float

Entropy [in nats] estimated from the fraction alpha of the instances