Binning
The binning method consists of obtaining an estimate of the PDF using a histogram. The frequency at a point \(x_i\) can be calculated from a histogram as:
\[\hat{f}(x_i) = \frac{c_i}{n}\]
where \(\hat{f}\) is an estimate of the frequency, \(c_i\) is the number of observations in the same bin as \(x_i\), \(n\) is the total number of observations and \(\Delta\) is the bin width.
To obtain a probability density from this frequency, the width of the bin must be accounted for, so the density estimate is \(\hat{p}(x) = \hat{f}(x) / \Delta\).
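The relation between counts, frequency and density can be checked with plain NumPy. This is a minimal sketch, assuming a 1D sample and 20 equal-width bins (both chosen only for illustration); it does not use the toolbox itself.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1_000)

# Histogram with 20 equal-width bins (bin count is an arbitrary choice)
counts, edges = np.histogram(data, bins=20)
n = data.size
delta = edges[1] - edges[0]      # uniform bin width

freq = counts / n                # frequency per bin, c_i / n
density = freq / delta           # density per bin, c_i / (n * delta)

# Bin index of every observation; the right-most edge belongs to the last bin
idx = np.clip(np.searchsorted(edges, data, side="right") - 1, 0, counts.size - 1)
p_hat = density[idx]             # density estimate at every x_i
```

The frequencies sum to one, while the densities sum to one only after multiplying by the bin width.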
The ideal \(\Delta\) varies depending on the data, but several “rules-of-thumb” exist that can be used as guidance. These have been implemented in the UNITE toolbox by building on numpy’s histogram_bin_edges and following its notation. See: estimate_ideal_bins.
Having an estimate of the density \(\hat{p}(x)\) at \(x\) allows entropy, KL divergence and mutual information to be calculated directly as resubstitution estimates using the following equations:
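In their usual textbook form, resubstitution estimates average the log-density over the observed samples. The expressions below assume \(n\) samples \(x_i\) drawn from \(p\) (paired with \(y_i\) for mutual information) and natural logarithms; the notation is an assumption and may differ in detail from the toolbox's own.

```latex
\[
\hat{H}(X) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(x_i)
\]
\[
\hat{D}_{KL}(p \,\|\, q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{p}(x_i)}{\hat{q}(x_i)}
\]
\[
\hat{I}(X; Y) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{p}(x_i, y_i)}{\hat{p}(x_i)\,\hat{p}(y_i)}
\]
```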
- unite_toolbox.bin_estimators.estimate_ideal_bins(data: numpy.ndarray, *, counts: bool | None = True) → dict[str, int | list[float]]
Estimate the number of ideal bins.
Estimates the ideal number of bins for each feature (column) of a 2D data array using different methods. See numpy.histogram_bin_edges for a list of available methods.
Parameters
- data : numpy.ndarray
Array of shape (n_samples, d_features)
- counts : bool, optional
Whether to return the number of bins (True) or the bin edges (False).
Returns
- dict
A dictionary with one key per method. Each value is a list with the number of bins for each feature of the data, or with the bin edges for each feature if counts=False.
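The behaviour described above can be approximated with numpy.histogram_bin_edges alone. The sketch below is not the toolbox's implementation: ideal_bins_sketch and the METHODS tuple are hypothetical names, and only a subset of NumPy's methods is included.

```python
import numpy as np

# Subset of the estimators accepted by numpy.histogram_bin_edges
METHODS = ("fd", "doane", "scott", "rice", "sturges", "sqrt")

def ideal_bins_sketch(data, counts=True):
    """Hypothetical helper: bin counts (or edges) per method and per column."""
    result = {}
    for method in METHODS:
        per_feature = []
        for column in data.T:
            edges = np.histogram_bin_edges(column, bins=method)
            # n_bins = n_edges - 1; otherwise keep the edges themselves
            per_feature.append(len(edges) - 1 if counts else edges.tolist())
        result[method] = per_feature
    return result

rng = np.random.default_rng(42)
data = rng.normal(size=(500, 2))
bins = ideal_bins_sketch(data)                # {"fd": [.., ..], ...}
edges = ideal_bins_sketch(data, counts=False)
```

Each key maps to one entry per column, so the result can be compared across methods before committing to a binning scheme.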
- unite_toolbox.bin_estimators.calc_bin_density(x: numpy.ndarray, data: numpy.ndarray, edges: list[int | list[float]]) → numpy.ndarray
Calculate density using binning.
Calculates the density of every point of the 2D array x within the d-dimensional histogram created from data and edges.
This is similar to a lookup operation: the entries in x are replaced by the indices of the bins in which they would fall given the binning scheme defined in edges. The indices are then used to “look up” the density for each row of x in the d-dimensional histogram created from data and edges.
Parameters
- x : numpy.ndarray
Array of shape (n_samples, d_features)
- data : numpy.ndarray
Array of shape (n_samples, d_features)
- edges : list or int
A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension.
Returns
- fx : numpy.ndarray
Array of shape (n_samples, 1)
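The lookup described above can be sketched with numpy.histogramdd and numpy.searchsorted. This is an illustration under assumptions, not the toolbox's code: bin_density_sketch is a hypothetical name, and the choice to return zero density for points outside the histogram range is made here for the example.

```python
import numpy as np

def bin_density_sketch(x, data, edges):
    """Hypothetical helper: histogram density of `data` evaluated at rows of `x`."""
    density, edges = np.histogramdd(data, bins=edges, density=True)
    inside = np.ones(x.shape[0], dtype=bool)
    idx = []
    for d, e in enumerate(edges):
        i = np.searchsorted(e, x[:, d], side="right") - 1
        i[x[:, d] == e[-1]] = len(e) - 2          # right-most edge is in the last bin
        inside &= (i >= 0) & (i < len(e) - 1)     # flag points outside the grid
        idx.append(np.clip(i, 0, len(e) - 2))
    # Look up the density of each row; points outside the grid get 0 (assumption)
    fx = np.where(inside, density[tuple(idx)], 0.0)
    return fx.reshape(-1, 1)

rng = np.random.default_rng(0)
data = rng.normal(size=(2_000, 2))
fx = bin_density_sketch(data, data, [10, 10])
```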
- unite_toolbox.bin_estimators.calc_vol_array(edges: list[numpy.ndarray]) → numpy.ndarray
Calculate the volume of a multidimensional array.
Calculates the volume of each cell of the multidimensional array defined by edges, where edges is a list of arrays. As an example, if edges contains two arrays, this function returns a 2D grid where each element contains the area of that specific cell. If edges contains three arrays, the returned grid is 3D and each element is the volume of the cell, and so on.
As this is done to calculate the volume of each bin of a multidimensional histogram, the returned grid can be indexed by the same indices as a histogramdd from NumPy.
Parameters
- edges : list[np.ndarray]
List of 1D NumPy arrays
Returns
- vol : numpy.ndarray
Array of shape (len(arr0) - 1, len(arr1) - 1, …, len(arrn) - 1)
Example
>>> a = np.array([0.0, 1.0, 3.0, 7.0, 12.0])
>>> b = np.array([4.0, 8.0, 10.0])
>>> calc_vol_array([a, b])
array([[ 4.,  2.],
       [ 8.,  4.],
       [16.,  8.],
       [20., 10.]])
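One way to obtain such a grid is a repeated outer product of the bin widths along each dimension. The following is a sketch of that idea with a hypothetical name (vol_array_sketch); whether the toolbox computes it this way is not confirmed here.

```python
from functools import reduce

import numpy as np

def vol_array_sketch(edges):
    """Hypothetical helper: cell volumes as the outer product of bin widths."""
    widths = [np.diff(e) for e in edges]
    return reduce(np.multiply.outer, widths)

a = np.array([0.0, 1.0, 3.0, 7.0, 12.0])
b = np.array([4.0, 8.0, 10.0])
vol = vol_array_sketch([a, b])   # same inputs as in the example above
```

Because each np.diff result has one entry fewer than its edge array, the grid has the same shape as the counts returned by numpy.histogramdd for the same edges.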
- unite_toolbox.bin_estimators.calc_bin_entropy(data: numpy.ndarray, edges: list[int | float] | int) → tuple[float]
Calculate entropy using binning.
Calculates the (joint) entropy of the input data after binning it along each dimension using specified bin edges or number of bins.
Parameters
- data : numpy.ndarray
Array of shape (n_samples, d_features)
- edges : list or int
A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension. Input can also be a single int and the histogram will be created with the same number of bins for each dimension.
Returns
- h : float
The (joint) entropy of the input data after binning.
- cf : float
The correction factor due to bin spacing. See Cover & Thomas (2006) Eq. 8.28 ISBN: 978-0-471-24195-9
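The role of the correction factor can be illustrated in 1D: the entropy of the binned (discrete) variable plus the expected log bin width approximates the differential entropy. This is a sketch with a hypothetical name (bin_entropy_sketch); the assumption that the two returned values are meant to be added is taken from the Cover & Thomas relation and is not confirmed for the toolbox.

```python
import numpy as np

def bin_entropy_sketch(data, bins):
    """Hypothetical helper: discrete entropy and bin-width correction, in nats."""
    counts, edges = np.histogram(data, bins=bins)
    widths = np.diff(edges)
    mask = counts > 0                          # empty bins contribute nothing
    p = counts[mask] / counts.sum()
    h_discrete = -np.sum(p * np.log(p))
    # Expected log bin width; equals log(delta) when all bins are equal
    correction = np.sum(p * np.log(widths[mask]))
    return h_discrete, correction

rng = np.random.default_rng(1)
sample = rng.normal(size=100_000)
h_discrete, correction = bin_entropy_sketch(sample, 50)
h_diff = h_discrete + correction               # compare with 0.5 * log(2 * pi * e)
```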
- unite_toolbox.bin_estimators.calc_uniform_bin_entropy(data: numpy.ndarray, edges: list[list[float]]) → tuple[float]
Alternative method to calculate entropy using binning.
Calculates the (joint) entropy of the input data. Using this method, every data point is substituted by the specific bin it occupies in edges, which limits the required memory to the number of entries in data.
NOTE: this only works for uniform binning schemes, as the correction factor for differential entropy is calculated assuming that every bin is of the same size.
Parameters
- data : numpy.ndarray
Array of shape (n_samples, d_features)
- edges : list or int
A list of length d_features which contains arrays describing the bin edges along each dimension.
Returns
- h : float
The (joint) entropy of the input data after binning.
- corr_fact : float
The correction factor due to bin spacing. See Cover & Thomas (2006) Eq. 8.28 ISBN: 978-0-471-24195-9
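The substitution idea can be sketched by replacing every row with its tuple of bin indices and counting unique rows, so the full d-dimensional histogram is never allocated. The name uniform_bin_entropy_sketch is hypothetical, and the sketch assumes uniform edges that cover the data; points outside the edges would be clipped into the outermost bins.

```python
import numpy as np

def uniform_bin_entropy_sketch(data, edges):
    """Hypothetical helper: entropy from occupied bins only, uniform edges assumed."""
    n, d = data.shape
    idx = np.empty((n, d), dtype=np.int64)
    log_vol = 0.0
    for j, e in enumerate(edges):
        e = np.asarray(e, dtype=float)
        i = np.searchsorted(e, data[:, j], side="right") - 1
        idx[:, j] = np.clip(i, 0, len(e) - 2)   # bin index along dimension j
        log_vol += np.log(e[1] - e[0])          # same width for every bin
    # Only occupied cells are stored, one count per unique index row
    _, counts = np.unique(idx, axis=0, return_counts=True)
    p = counts / n
    h_discrete = -np.sum(p * np.log(p))
    return h_discrete, log_vol

rng = np.random.default_rng(2)
data = rng.normal(size=(100_000, 2))
edges = [np.linspace(-6.0, 6.0, 41), np.linspace(-6.0, 6.0, 41)]
h_discrete, corr = uniform_bin_entropy_sketch(data, edges)
h_diff = h_discrete + corr                      # compare with log(2 * pi * e)
```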
- unite_toolbox.bin_estimators.calc_bin_kld(p: numpy.ndarray, q: numpy.ndarray, edges: list[list[float]]) → float
Calculate Kullback-Leibler divergence (relative entropy) using binning.
Calculates the Kullback-Leibler divergence (relative entropy) between p and q [in nats] by approximating both distributions using some binning scheme defined by edges. edges must be able to support the values in q.
Parameters
- p : numpy.ndarray
Array of shape (n_samples, d_features)
- q : numpy.ndarray
Array of shape (m_samples, d_features)
- edges : list or int
A list of length d_features which contains arrays describing the bin edges along each dimension or a list of ints describing the number of bins to use in each dimension.
Returns
- kld : float
Kullback-Leibler divergence between p and q [in nats]
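For a shared binning scheme the bin volumes cancel in the density ratio, so the divergence reduces to a sum over bin masses. The 1D sketch below uses a hypothetical name (bin_kld_sketch) and makes one assumption explicit: bins where either histogram is empty are skipped, which avoids infinite terms but may differ from how the toolbox treats them.

```python
import numpy as np

def bin_kld_sketch(p, q, edges):
    """Hypothetical helper: binned KL divergence D(p || q) in nats, 1D samples."""
    p_counts, _ = np.histogram(p, bins=edges)
    q_counts, _ = np.histogram(q, bins=edges)
    p_mass = p_counts / p_counts.sum()
    q_mass = q_counts / q_counts.sum()
    # Skip bins that are empty in either histogram (assumption, see lead-in)
    mask = (p_mass > 0) & (q_mass > 0)
    return float(np.sum(p_mass[mask] * np.log(p_mass[mask] / q_mass[mask])))

rng = np.random.default_rng(3)
p = rng.normal(loc=0.0, size=100_000)
q = rng.normal(loc=1.0, size=100_000)
edges = np.linspace(-6.0, 7.0, 41)
kld = bin_kld_sketch(p, q, edges)   # analytical value for these Gaussians is 0.5
```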
- unite_toolbox.bin_estimators.calc_bin_mutual_information(x: numpy.ndarray, y: numpy.ndarray, edges: list[list[float]]) → float
Calculate mutual information between x and y using binning.
Calculates the mutual information between an array x and an array y. x and y do not necessarily need the same number of samples, as binning is used. This approach builds multivariate histograms for x, y and x-y using the specified edges, and evaluates I(X; Y) in every bin where the density of x-y is not zero. This is a resubstitution estimate.
Parameters
- x : numpy.ndarray
Array of shape (n_samples, d1_features)
- y : numpy.ndarray
Array of shape (m_samples, d2_features)
- edges : list
A list of two lists each containing either integers for the number of bins in each axis or arrays of the edges for the binning scheme of each axis.
Returns
- mi : float
Mutual information between x and y [in nats]
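The evaluation over non-empty joint bins can be sketched for two 1D variables with numpy.histogram2d. This is a simplified illustration with a hypothetical name (bin_mi_sketch): it assumes paired samples of equal length and derives the marginals from the joint histogram, which may not match the toolbox's approach.

```python
import numpy as np

def bin_mi_sketch(x, y, bins):
    """Hypothetical helper: binned mutual information in nats, paired 1D samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bx, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, by)
    mask = p_xy > 0                         # only bins with non-zero joint mass
    ratio = p_xy[mask] / (p_x @ p_y)[mask]
    return float(np.sum(p_xy[mask] * np.log(ratio)))

rng = np.random.default_rng(4)
n, rho = 100_000, 0.8
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
mi = bin_mi_sketch(x, y, 30)                # analytical: -0.5 * log(1 - rho**2)
mi_indep = bin_mi_sketch(x, rng.normal(size=n), 30)
```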
- unite_toolbox.bin_estimators.calc_qs_entropy(sample: numpy.ndarray, alpha: float = 0.25, N_k: int = 500, seed: int | None = None) → float
Calculate the 1-D entropy of the input data.
Calculates the 1-D entropy of the input data [in nats] using the quantile spacing (QS) estimator proposed by: Gupta et al. (2021) https://doi.org/10.3390/e23060740
Adapted from: https://github.com/rehsani/Entropy
Parameters
- sample : numpy.ndarray
Flat array
- alpha : float, optional
Percent of the instances from the sample used to estimate entropy (i.e., the number of quantile spacings).
- N_k : int, optional
Number of sample subsets used to estimate the sample distribution for each quantile empirically.
- seed : int, optional
Random seed. If None, the default is used.
Returns
- h : float
Entropy [in nats] of ‘alpha’ percent of the instances
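The general idea behind a quantile-spacing estimate can be sketched as follows: random subsets give averaged order statistics that act as quantile positions, and the entropy follows from the log of the spacings between them. This is a simplified illustration, not the algorithm of Gupta et al. (2021) nor the toolbox's implementation; qs_entropy_sketch and n_subsets are hypothetical names, and the handling of the end points is an assumption.

```python
import numpy as np

def qs_entropy_sketch(sample, alpha=0.25, n_subsets=200, seed=None):
    """Hypothetical helper: simplified quantile-spacing entropy estimate in nats."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float).ravel()
    n_q = int(np.ceil(alpha * sample.size))       # number of quantile positions
    # Each row is a sorted random subset drawn without replacement
    subsets = np.sort(
        np.stack([rng.choice(sample, size=n_q, replace=False)
                  for _ in range(n_subsets)]),
        axis=1,
    )
    quantiles = subsets.mean(axis=0)              # averaged order statistics
    # Close the range with the sample extremes (assumption of this sketch)
    z = np.concatenate(([sample.min()], quantiles, [sample.max()]))
    spacings = np.diff(z)                         # n_q + 1 spacings of equal mass
    return float(np.mean(np.log((n_q + 1) * spacings)))

rng = np.random.default_rng(7)
sample = rng.normal(size=5_000)
h = qs_entropy_sketch(sample, seed=0)             # compare with 0.5 * log(2 * pi * e)
```

Each spacing is assumed to hold the same probability mass, 1 / (n_q + 1), so the density inside it is approximated by the inverse of (n_q + 1) times its width.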