Binning

The binning method consists of obtaining an estimate of the PDF using a histogram. The density at a point \(x_i\) can be estimated from a histogram as:

\[\hat{f}(x_i) = \frac{c_i}{n\Delta},\]

where \(\hat{f}\) is the density estimate, \(c_i\) is the number of observations in the same bin as \(x_i\), \(n\) is the total number of observations and \(\Delta\) is the bin width.

The ratio \(c_i / n\) is the relative frequency (probability mass) of the bin. Dividing it by the bin width \(\Delta\) converts it into a probability density, so the density estimate is \(\hat{p}(x_i) = \hat{f}(x_i) = c_i / (n\Delta)\).
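As a minimal sketch of the histogram density estimate, the snippet below computes \(c_i / (n\Delta)\) with plain NumPy for a small made-up sample. The data, the bin edges and the variable names are illustrative assumptions and are not part of the toolbox.

```python
import numpy as np

# Illustrative data: 8 observations and 4 equal-width bins on [0, 4].
data = np.array([0.2, 0.7, 1.1, 1.5, 1.9, 2.4, 3.3, 3.8])
edges = np.linspace(0.0, 4.0, 5)
delta = edges[1] - edges[0]  # bin width

counts, _ = np.histogram(data, bins=edges)  # c_i for every bin
n = data.size

# Density per bin: relative frequency divided by the bin width.
density = counts / (n * delta)

# Look up the bin of each observation to obtain the density at x_i.
# This simple lookup assumes no observation lies exactly on the last edge.
idx = np.searchsorted(edges, data, side="right") - 1
p_hat = density[idx]

print(density)
print(p_hat)
```

Multiplying each bin density by the bin width and summing gives one, which is a quick sanity check that the result is a density and not a count.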

The ideal \(\Delta\) varies depending on the data, but several “rules-of-thumb” exist that can be used as guidance. These are implemented in the UNITE toolbox by building on numpy’s histogram_bin_edges and following the same notation. See: estimate_ideal_bins.

Having an estimate of the density \(\hat{p}(x)\) at \(x\) allows entropy, KL divergence and mutual information to be calculated directly as resubstitution estimates using the following equations:

\[H(X) = -\frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \hat{p}\left ( x_i \right ) \right )}}\]

\[D_{KL}(p||q) = \frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \frac{\hat{p}(x_i)}{\hat{q}(x_i)} \right )}}\]

\[I(X;Y) = \frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \frac{\hat{p}(x_i,y_i)}{\hat{p}(x_i)\hat{p}(y_i)} \right )}}\]
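The entropy equation above can be sketched in one dimension as follows. The function name `resubstitution_entropy_1d` is hypothetical, and the sketch assumes that the bin edges cover every sample; it is an illustration of the resubstitution idea, not the toolbox implementation.

```python
import numpy as np

def resubstitution_entropy_1d(data, edges):
    """Entropy [nats] as minus the mean log-density at the sample points."""
    counts, edges = np.histogram(data, bins=edges)
    # Histogram density per bin; np.diff handles non-uniform bin widths.
    density = counts / (data.size * np.diff(edges))
    # Bin index of every sample; clip so a value on the last edge
    # stays in the last bin, as np.histogram does.
    idx = np.clip(np.searchsorted(edges, data, side="right") - 1,
                  0, counts.size - 1)
    return -np.mean(np.log(density[idx]))

# Evenly spread points on [0, 2]: the uniform density there is 0.5,
# so the estimate should be close to log(2).
sample = 2.0 * (np.arange(1000) + 0.5) / 1000
print(resubstitution_entropy_1d(sample, np.linspace(0.0, 2.0, 11)))
```

Because every sample falls in a bin that contains at least itself, the logarithm is always finite as long as the edges cover the data.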

unite_toolbox.bin_estimators.estimate_ideal_bins(data: numpy.ndarray, *, counts: bool | None = True) → dict[str, int | list[float]]

Estimate the number of ideal bins.

Estimates the ideal number of bins for each feature (column) of a 2D data array using different methods. See numpy.histogram_bin_edges for a list of available methods.

Parameters

data (numpy.ndarray): Array of shape (n_samples, d_features)
counts (bool, optional): Whether to return the number of bins (True) or the bin edges (False).

Returns

dict: A dictionary with a key for each method. The values are lists with, for each feature of the data, the number of bins (if counts=True) or the bin edges (if counts=False).
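To illustrate the kind of result described above, the sketch below loops over the columns of a 2D array and applies numpy.histogram_bin_edges with a few of its named rules. The helper `ideal_bins_per_feature` and its `methods` argument are hypothetical and only mimic the described return structure; the toolbox function may differ in details.

```python
import numpy as np

def ideal_bins_per_feature(data, methods=("fd", "scott", "sturges", "sqrt"),
                           counts=True):
    """Return {method: [result for each column of data]} (illustrative)."""
    result = {}
    for method in methods:
        per_feature = []
        for column in data.T:
            edges = np.histogram_bin_edges(column, bins=method)
            # Either the number of bins or the edges themselves.
            per_feature.append(len(edges) - 1 if counts else edges.tolist())
        result[method] = per_feature
    return result

rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 2))
print(ideal_bins_per_feature(sample))
```

Rules such as "sturges" and "sqrt" depend only on the number of samples, so they give the same bin count for every feature, while "fd" and "scott" also depend on the spread of each column.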

unite_toolbox.bin_estimators.calc_bin_density(x: numpy.ndarray, data: numpy.ndarray, edges: list[int | list[float]]) → numpy.ndarray

Calculate density using binning.

Calculates the density of every point of the 2D array x within the d-dimensional histogram created from data and edges.

This is similar to a lookup operation: the entries in x are first replaced by the indices of the bins in which they would fall given the binning scheme defined in edges. The indices are then used to “look up” each value of x in the d-dimensional histogram created from data and edges.

Parameters

x (numpy.ndarray): Array of shape (n_samples, d_features)
data (numpy.ndarray): Array of shape (n_samples, d_features)
edges (list or int): A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension.

Returns

fx (numpy.ndarray): Array of shape (n_samples, 1)
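The lookup described above can be sketched with numpy.histogramdd and numpy.searchsorted. The helper `bin_density_lookup` is hypothetical. As an assumption of this sketch, points of x that fall outside the edges are clipped into the outermost bins, which may not match the toolbox behavior.

```python
import numpy as np

def bin_density_lookup(x, data, edges):
    """Density of each row of x in the histogram of data (illustrative)."""
    counts, edges = np.histogramdd(data, bins=edges)
    # Cell volumes from repeated outer products of the per-axis bin widths.
    widths = [np.diff(e) for e in edges]
    vol = widths[0]
    for w in widths[1:]:
        vol = np.multiply.outer(vol, w)
    density = counts / (data.shape[0] * vol)
    # Replace every coordinate of x by its bin index along that axis.
    idx = tuple(
        np.clip(np.searchsorted(e, x[:, j], side="right") - 1, 0, len(e) - 2)
        for j, e in enumerate(edges)
    )
    # Index the d-dimensional histogram with one index array per axis.
    return density[idx].reshape(-1, 1)

data = np.array([[0.5, 0.5], [0.5, 0.5], [1.5, 0.5], [1.5, 1.5]])
edges = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 2.0])]
x = np.array([[0.2, 0.3], [1.2, 1.9], [0.1, 1.5]])
print(bin_density_lookup(x, data, edges))
```

Note that x does not need to come from data: any point inside the edges receives the density of the cell it lands in, which can be zero.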

unite_toolbox.bin_estimators.calc_vol_array(edges: list[numpy.ndarray]) → numpy.ndarray

Calculate the volume of a multidimensional array.

Calculates the volume of each cell of the multidimensional array defined by edges, where edges is a list of arrays. As an example, if edges contains two arrays, this function returns a 2D grid where each element contains the area of that specific cell. If edges contains three arrays, the returned grid is 3D and each element is the volume of the corresponding cell, and so on.

As this is done to calculate the volume of each bin of a multidimensional histogram, the returned grid can be indexed by the same indices as a histogramdd from NumPy.

Parameters

edges (list[np.ndarray]): List of 1D NumPy arrays

Returns

vol (numpy.ndarray): Array of shape (len(arr0) - 1, len(arr1) - 1, …, len(arrn) - 1)

Example

>>> a = np.array([0.0, 1.0, 3.0, 7.0, 12.0])
>>> b = np.array([4.0, 8.0, 10.0])
>>> calc_vol_array([a, b])
array([[ 4.,  2.],
       [ 8.,  4.],
       [16.,  8.],
       [20., 10.]])
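One way to obtain such a grid, shown here as a hedged sketch and not necessarily how the toolbox does it, is to take the bin widths along each axis and combine them with outer products. The name `cell_volumes` is hypothetical.

```python
from functools import reduce

import numpy as np

def cell_volumes(edges):
    """Volume of every cell of the grid defined by a list of edge arrays."""
    widths = [np.diff(e) for e in edges]  # bin widths along each axis
    # Outer products build the d-dimensional grid of width products.
    return reduce(np.multiply.outer, widths)

a = np.array([0.0, 1.0, 3.0, 7.0, 12.0])
b = np.array([4.0, 8.0, 10.0])
print(cell_volumes([a, b]))
```

With the arrays a and b from the example above, the widths are (1, 2, 4, 5) and (4, 2), and their outer product reproduces the 4 by 2 grid shown there.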

unite_toolbox.bin_estimators.calc_bin_entropy(data: numpy.ndarray, edges: list[int | float] | int) → tuple[float]

Calculate entropy using binning.

Calculates the (joint) entropy of the input data after binning it along each dimension using specified bin edges or number of bins.

Parameters

data (numpy.ndarray): Array of shape (n_samples, d_features)
edges (list or int): A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension. Input can also be a single int and the histogram will be created with the same number of bins for each dimension.

Returns

h (float): The (joint) entropy of the input data after binning.
cf (float): The correction factor due to bin spacing. See Cover & Thomas (2006) Eq. 8.28 ISBN: 978-0-471-24195-9
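The relation between the binned (discrete) entropy and the bin-spacing correction can be sketched as below. The helper `binned_entropy` is hypothetical; it returns the discrete entropy and the expected log cell volume separately, so that their sum approximates the differential entropy. Whether the toolbox returns h with or without the correction applied is not stated here, so this split is an assumption of the sketch.

```python
from functools import reduce

import numpy as np

def binned_entropy(data, edges):
    """Return (discrete entropy, correction term), both in nats."""
    counts, edges = np.histogramdd(data, bins=edges)
    vol = reduce(np.multiply.outer, [np.diff(e) for e in edges])
    p = counts / counts.sum()  # probability mass per cell
    nz = p > 0                 # empty cells contribute nothing
    h_discrete = -np.sum(p[nz] * np.log(p[nz]))
    # Expected log-volume; equals log(Delta) when all bins share one width.
    correction = np.sum(p[nz] * np.log(vol[nz]))
    return h_discrete, correction

# Evenly spread points on the square [0, 2] x [0, 2].
g = (np.arange(20) + 0.5) / 10
data = np.column_stack([np.repeat(g, 20), np.tile(g, 20)])
h, cf = binned_entropy(data, [np.linspace(0.0, 2.0, 5)] * 2)
print(h, cf, h + cf)
```

For this evenly spread sample the sum h + cf is close to log(4), the differential entropy of a uniform density on a square of area 4.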

unite_toolbox.bin_estimators.calc_uniform_bin_entropy(data: numpy.ndarray, edges: list[list[float]]) → tuple[float]

Alternative method to calculate entropy using binning.

Calculates the (joint) entropy of the input data. Using this method, every data point is substituted by the specific bin it occupies in edges, which limits the required memory to storing only as many entries as there are in data.

NOTE: this only works for uniform binning schemes, as the correction factor for differential entropy is calculated assuming that every bin is of the same size.

Parameters

data (numpy.ndarray): Array of shape (n_samples, d_features)
edges (list or int): A list of length d_features which contains arrays describing the bin edges along each dimension.

Returns

h (float): The (joint) entropy of the input data after binning.
corr_fact (float): The correction factor due to bin spacing. See Cover & Thomas (2006) Eq. 8.28 ISBN: 978-0-471-24195-9
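The memory-saving idea can be sketched by never building the full d-dimensional histogram: each sample is mapped to its tuple of bin indices, and only the occupied cells are counted. The helper `uniform_bin_entropy` is hypothetical, and the sketch assumes uniform bins along each axis and edges that cover the data.

```python
import numpy as np

def uniform_bin_entropy(data, edges):
    """Return (discrete entropy, correction term) for uniform bins, in nats."""
    # One row of bin indices per sample; memory grows with n, not with
    # the number of cells in the grid.
    idx = np.column_stack([
        np.clip(np.searchsorted(e, data[:, j], side="right") - 1,
                0, len(e) - 2)
        for j, e in enumerate(edges)
    ])
    # Count only the cells that actually contain samples.
    _, counts = np.unique(idx, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_discrete = -np.sum(p * np.log(p))
    # With uniform bins every cell has the same volume, so the correction
    # is the sum of log bin widths over the dimensions.
    correction = sum(np.log(e[1] - e[0]) for e in edges)
    return h_discrete, correction

g = (np.arange(20) + 0.5) / 10
data = np.column_stack([np.repeat(g, 20), np.tile(g, 20)])
print(uniform_bin_entropy(data, [np.linspace(0.0, 2.0, 5)] * 2))
```

The correction term here uses only the first bin width of each axis, which is why the approach is restricted to uniform binning schemes.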

unite_toolbox.bin_estimators.calc_bin_kld(p: numpy.ndarray, q: numpy.ndarray, edges: list[list[float]]) → float

Calculate Kullback-Leibler divergence (relative entropy) using binning.

Calculates the Kullback-Leibler divergence (relative entropy) between p and q [in nats] by approximating both distributions using some binning scheme defined by edges. edges must be able to support the values in q.

Parameters

p (numpy.ndarray): Array of shape (n_samples, d_features)
q (numpy.ndarray): Array of shape (m_samples, d_features)
edges (list or int): A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension.

Returns

kld (float): Kullback-Leibler divergence between p and q [in nats]
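A binned KL divergence can be sketched by histogramming both samples on the same edges and comparing the bin probabilities; because both densities share the same cells, the cell volumes cancel in the ratio. The helper `binned_kld` is hypothetical. It assumes explicit edge arrays (so both histograms use identical cells) and, as a simplification, skips cells where either histogram is empty.

```python
import numpy as np

def binned_kld(p, q, edges):
    """KL divergence [nats] between samples p and q on shared bin edges."""
    cp, _ = np.histogramdd(p, bins=edges)
    cq, _ = np.histogramdd(q, bins=edges)
    pp = cp / cp.sum()  # bin probabilities of p
    pq = cq / cq.sum()  # bin probabilities of q
    # Assumption of this sketch: ignore cells where p or q has no samples.
    # Strictly, a cell with p > 0 and q = 0 makes the divergence infinite.
    mask = (pp > 0) & (pq > 0)
    return np.sum(pp[mask] * np.log(pp[mask] / pq[mask]))

p = np.array([[0.1], [0.2], [0.3], [1.5]])
q = np.array([[0.5], [0.6], [1.2], [1.7]])
edges = [np.array([0.0, 1.0, 2.0])]
print(binned_kld(p, q, edges))
```

In this small example p puts 3/4 of its mass in the first bin and q puts 1/2 there, so the result is 0.75 log(1.5) + 0.25 log(0.5).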

unite_toolbox.bin_estimators.calc_bin_mutual_information(x: numpy.ndarray, y: numpy.ndarray, edges: list[list[float]]) → float

Calculate mutual information between x and y using binning.

Calculates the mutual information between an array x and an array y. Because binning is used, x and y do not necessarily need to have the same number of samples. This approach builds multivariate histograms for x, y and x-y using the specified edges, and evaluates I(X; Y) in every bin where the density of x-y is not zero. This is a resubstitution estimate.

Parameters

x (numpy.ndarray): Array of shape (n_samples, d1_features)
y (numpy.ndarray): Array of shape (m_samples, d2_features)
edges (list): A list of two lists, each containing either integers for the number of bins in each axis or arrays of the edges for the binning scheme of each axis.

Returns

mi (float): Mutual information between x and y [in nats]
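A simplified sketch of a binned mutual information estimate is given below. The helper `binned_mutual_information` is hypothetical. Unlike the description above, it assumes paired samples (x and y with the same number of rows) and derives both marginals from the joint histogram instead of building separate histograms.

```python
import numpy as np

def binned_mutual_information(x, y, edges):
    """Mutual information [nats] between paired samples x and y."""
    edges_x, edges_y = edges
    joint, _ = np.histogramdd(np.hstack([x, y]),
                              bins=list(edges_x) + list(edges_y))
    pxy = joint / joint.sum()
    dx = x.shape[1]
    # Marginals: sum the joint over the axes of the other variable.
    px = pxy.sum(axis=tuple(range(dx, pxy.ndim)))
    py = pxy.sum(axis=tuple(range(dx)))
    independent = np.multiply.outer(px, py)
    nz = pxy > 0  # evaluate only cells where the joint is non-zero
    return np.sum(pxy[nz] * np.log(pxy[nz] / independent[nz]))

x = np.array([[0.5], [0.5], [1.5], [1.5]])
y = x.copy()  # perfectly dependent on x
edges = [[np.array([0.0, 1.0, 2.0])], [np.array([0.0, 1.0, 2.0])]]
print(binned_mutual_information(x, y, edges))
```

For two equally likely bins and y identical to x, the estimate equals log(2); for independent x and y it would be zero.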

unite_toolbox.bin_estimators.calc_qs_entropy(sample: numpy.ndarray, alpha: float = 0.25, N_k: int = 500, seed: int | None = None) → float

Calculate the 1-D entropy of the input data.

Calculates the 1-D entropy of the input data [in nats] using the quantile spacing (QS) estimator proposed by: Gupta et al. (2021) https://doi.org/10.3390/e23060740

Adapted from: https://github.com/rehsani/Entropy

Parameters

sample (numpy.ndarray): Flat array
alpha (float, optional): Percent of the instances from the sample used for estimation of entropy (i.e., number of quantile-spacings).
N_k (int, optional): Number of sample subsets, used to estimate the sample distribution for each quantile empirically.
seed (int, optional): Random seed. If None, the default is used.

Returns

h (float): Entropy [in nats] of ‘alpha’ percent of the instances
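The core idea of a quantile-spacing estimate can be sketched with equal-probability bins: if K spacings each hold probability mass 1/K and have widths \(\Delta_k\), the density in spacing k is \(1 / (K\Delta_k)\) and the entropy is the mean of \(\log(K\Delta_k)\). The helper below is a strongly simplified illustration and not the estimator of Gupta et al. (2021): it takes the quantiles directly from numpy.quantile, omits the resampling over N_k subsets, and replaces alpha with a hypothetical `n_spacings` argument. It also assumes the sample has no ties at the quantile positions.

```python
import numpy as np

def quantile_spacing_entropy(sample, n_spacings=10):
    """Entropy [nats] from equal-probability spacings (simplified sketch)."""
    probs = np.linspace(0.0, 1.0, n_spacings + 1)
    z = np.quantile(sample, probs)  # quantile positions
    dz = np.diff(z)                 # spacing widths, each with mass 1/K
    # Mean over spacings of log(K * width).
    return np.mean(np.log(n_spacings * dz))

sample = 3.0 * np.linspace(0.0, 1.0, 1001)  # evenly spread on [0, 3]
print(quantile_spacing_entropy(sample))
```

For a sample spread evenly over [0, 3] every spacing has width 3/K, so the estimate is close to log(3), the entropy of a uniform density on that interval.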