KDE
KDE-based estimation consists of estimating a PDF using kernels as weights, where the kernel is a non-negative window function. The density \(p(x)\) at a point \(x\) is estimated as:
\[\hat{p}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K(u_i)\]
where:
\[u_i = \frac{(x - x_i)^\intercal \Sigma^{-1} (x - x_i)}{h^{2}}\]
and \(n\) is the total number of samples, \(K\) is a multivariate kernel function, \(x_i = [x_{1,i}, x_{2,i}, \dots, x_{d,i}]^\intercal\) is the \(i\)-th \(d\)-dimensional sample, \(\Sigma\) is the covariance matrix of the samples, and \(h\) is a smoothing parameter (the bandwidth).
The UNITE toolbox uses a multivariate Gaussian kernel by default:
\[K(u) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{u}{2}\right)\]
with Silverman’s bandwidth estimate:
\[h = \left(\frac{n (d + 2)}{4}\right)^{-\frac{1}{d + 4}}\]
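Having an estimate of the density \(\hat{p}(x)\) at \(x\) means that entropy, KL divergence and mutual information can be calculated directly as resubstitution estimates using the following equations:
\[\hat{H}(X) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(x_i)\]
\[\hat{D}_{KL}(p \,\|\, q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{p}(x_i)}{\hat{q}(x_i)}\]
\[\hat{I}(X; Y) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{p}(x_i, y_i)}{\hat{p}(x_i)\,\hat{p}(y_i)}\]
In the KL divergence, the \(x_i\) are the samples drawn from \(p\).
Integral estimates can also be calculated using numerical integration. For example, entropy is estimated as:
\[\hat{H}(X) = -\int \hat{p}(x) \log \hat{p}(x) \, dx\]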
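The estimator described above can be illustrated with a minimal, standard-library sketch restricted to one-dimensional data (\(d = 1\)), where the covariance matrix reduces to the sample variance. The function names `silverman_bandwidth`, `kde_density_1d` and `resubstitution_entropy_1d` are hypothetical and are not part of the UNITE toolbox; the code only assumes a Gaussian kernel with Silverman's rule of thumb.

```python
import math
import random
from typing import Optional, Sequence


def silverman_bandwidth(n: int, d: int = 1) -> float:
    """Silverman's rule of thumb for n samples in d dimensions."""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))


def kde_density_1d(x: float, data: Sequence[float], h: Optional[float] = None) -> float:
    """Gaussian KDE density at x for one-dimensional data (d = 1)."""
    n = len(data)
    if h is None:
        h = silverman_bandwidth(n, d=1)
    mean = sum(data) / n
    var = sum((v - mean) ** 2 for v in data) / (n - 1)  # 1x1 sample covariance
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)  # Gaussian normalisation for d = 1
    # Each sample contributes a kernel evaluated at its scaled squared distance to x.
    kernel_sum = sum(
        norm * math.exp(-0.5 * (x - xi) ** 2 / (var * h * h)) for xi in data
    )
    return kernel_sum / (n * h)


def resubstitution_entropy_1d(data: Sequence[float]) -> float:
    """Average negative log-density evaluated at the samples themselves [nats]."""
    return -sum(math.log(kde_density_1d(xi, data)) for xi in data) / len(data)


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
h_hat = resubstitution_entropy_1d(sample)
h_true = 0.5 * math.log(2.0 * math.pi * math.e)  # analytic entropy of N(0, 1)
print(f"estimate: {h_hat:.3f} nats, analytic: {h_true:.3f} nats")
```

The estimate should land near the analytic value for a standard normal, with some bias from the kernel smoothing and from evaluating the density at the same points used to build it.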
- class unite_toolbox.kde_estimators.KDEMode(value)
Enumeration representing different modes for calculating KDE estimates.
Attributes
- RESUBSTITUTION : str
Calculate an estimate using resubstitution.
- INTEGRAL : str
Calculate an estimate by integrating over the KDE.
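As an illustration of how a string-valued enumeration like this can validate a `mode` argument, here is a small sketch. `KDEModeSketch` and `parse_mode` are hypothetical names, and the member values are an assumption inferred from the `'resubstitution'` default shown in the function signatures; the actual `KDEMode` definition may differ.

```python
from enum import Enum


class KDEModeSketch(str, Enum):
    """Hypothetical stand-in for KDEMode; the member values are assumed."""

    RESUBSTITUTION = "resubstitution"
    INTEGRAL = "integral"


def parse_mode(mode: str) -> KDEModeSketch:
    """Turn a user-supplied string into an enum member, failing loudly on typos."""
    try:
        return KDEModeSketch(mode)
    except ValueError:
        valid = ", ".join(m.value for m in KDEModeSketch)
        raise ValueError(f"mode must be one of: {valid}") from None


selected = parse_mode("integral")
```

Looking the string up through the enum means a misspelled mode raises immediately instead of silently falling through to a default branch.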
- unite_toolbox.kde_estimators.calc_kde_density(x: numpy.ndarray, data: numpy.ndarray, bandwidth: float | None = None) -> numpy.ndarray
Calculate density using KDE.
Evaluates every point (row) of the 2D array x under the KDE fitted to data and returns the estimated density at each of those points.
Parameters
- x : numpy.ndarray
Array of shape (n_samples, d_features)
- data : numpy.ndarray
Array of shape (m_samples, d_features)
- bandwidth : float, optional
Bandwidth of the Gaussian kernel
Returns
- p : numpy.ndarray
Array of shape (n_samples, 1)
- unite_toolbox.kde_estimators.calc_kde_entropy(data: numpy.ndarray, bandwidth: float | None = None, mode: str = 'resubstitution') -> float
Calculate the entropy of a dataset using kernel density estimation (KDE).
Calculates the (joint) entropy of the input data [in nats] by approximating the (joint) density of the distribution using a Gaussian kernel density estimator (KDE). By default the Scott estimate for the bandwidth is used for the Gaussian kernel. This function has two modes: resubstitution and integral.
Parameters
- data : numpy.ndarray
Array of shape (n_samples, n_features)
- bandwidth : float, optional
Bandwidth of the Gaussian kernel
- mode : str, “resubstitution” or “integral”, optional
Method for entropy calculation, defaults to ‘resubstitution’
Returns
- h : float
Entropy of the dataset [in nats]
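To make the difference between the two modes concrete, the following sketch estimates entropy in the integral style: it integrates \(-\hat{p}(x) \log \hat{p}(x)\) over a grid with the trapezoidal rule. It is restricted to one-dimensional data and uses only the standard library; `make_kde_1d` and `integral_entropy_1d` are hypothetical helpers, and the grid size, padding and Silverman bandwidth are assumptions rather than the toolbox's actual settings.

```python
import math
import random
from typing import Callable, Sequence


def make_kde_1d(data: Sequence[float]) -> Callable[[float], float]:
    """Return a Gaussian KDE density function for 1D data (Silverman bandwidth)."""
    n = len(data)
    h = (n * 3.0 / 4.0) ** (-1.0 / 5.0)  # Silverman's rule for d = 1
    mean = sum(data) / n
    var = sum((v - mean) ** 2 for v in data) / (n - 1)
    scale = 1.0 / (n * h * math.sqrt(2.0 * math.pi * var))

    def density(x: float) -> float:
        return scale * sum(
            math.exp(-0.5 * (x - xi) ** 2 / (var * h * h)) for xi in data
        )

    return density


def integral_entropy_1d(
    data: Sequence[float], grid_points: int = 401, pad: float = 3.0
) -> float:
    """Trapezoidal estimate of -integral(p * log p) over a padded grid [nats]."""
    density = make_kde_1d(data)
    mean = sum(data) / len(data)
    sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (len(data) - 1))
    lo, hi = min(data) - pad * sd, max(data) + pad * sd
    step = (hi - lo) / (grid_points - 1)
    total = 0.0
    for k in range(grid_points):
        p = density(lo + k * step)
        f = -p * math.log(p) if p > 0.0 else 0.0  # treat 0 * log(0) as 0
        weight = 0.5 if k in (0, grid_points - 1) else 1.0  # trapezoid end points
        total += weight * f
    return total * step


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(300)]
h_int = integral_entropy_1d(sample)
h_true = 0.5 * math.log(2.0 * math.pi * math.e)  # analytic entropy of N(0, 1)
```

Grid-based integration like this scales poorly with dimension, which is one reason a resubstitution estimate is often the more practical choice for multivariate data.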
- unite_toolbox.kde_estimators.calc_kde_kld(p: numpy.ndarray, q: numpy.ndarray, bandwidth: float | None = None, mode: str = 'resubstitution') -> float
Calculate KLD using KDE.
Calculates the Kullback-Leibler divergence (relative entropy) between two data sets (p and q) [in nats] by approximating both distributions using a Gaussian kernel density estimate (KDE). The divergence is measured between the two estimated densities. The two density estimates are computed independently, so p and q may contain different numbers of samples. This function has two modes: resubstitution and integral.
Parameters
- p : numpy.ndarray
Array of shape (n_samples, d_features)
- q : numpy.ndarray
Array of shape (m_samples, d_features)
- bandwidth : float, optional
Bandwidth of the Gaussian kernel
- mode : str, “resubstitution” or “integral”, optional
Method for the divergence calculation, defaults to ‘resubstitution’
Returns
- kld : float
Kullback-Leibler divergence between p and q [in nats]
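The resubstitution form of this divergence can be sketched for one-dimensional samples as follows. Each data set gets its own KDE, and the log-ratio of the two densities is averaged over the samples of p, which is why the sample counts may differ. `make_kde_1d` and `kde_kld_1d` are hypothetical helpers written with the standard library only, assuming a Gaussian kernel with Silverman's bandwidth.

```python
import math
import random
from typing import Callable, Sequence


def make_kde_1d(data: Sequence[float]) -> Callable[[float], float]:
    """Return a Gaussian KDE density function for 1D data (Silverman bandwidth)."""
    n = len(data)
    h = (n * 3.0 / 4.0) ** (-1.0 / 5.0)  # Silverman's rule for d = 1
    mean = sum(data) / n
    var = sum((v - mean) ** 2 for v in data) / (n - 1)
    scale = 1.0 / (n * h * math.sqrt(2.0 * math.pi * var))

    def density(x: float) -> float:
        return scale * sum(
            math.exp(-0.5 * (x - xi) ** 2 / (var * h * h)) for xi in data
        )

    return density


def kde_kld_1d(p_samples: Sequence[float], q_samples: Sequence[float]) -> float:
    """Resubstitution estimate of D_KL(p || q) in nats."""
    p_hat = make_kde_1d(p_samples)  # density fitted to p only
    q_hat = make_kde_1d(q_samples)  # density fitted to q only
    # Both densities are evaluated at the samples of p, then the log-ratio is averaged.
    return sum(math.log(p_hat(x) / q_hat(x)) for x in p_samples) / len(p_samples)


rng = random.Random(2)
p_samples = [rng.gauss(0.0, 1.0) for _ in range(300)]
q_samples = [rng.gauss(2.0, 1.0) for _ in range(200)]  # different sample count
kld = kde_kld_1d(p_samples, q_samples)
kld_self = kde_kld_1d(p_samples, p_samples)
```

For these two unit-variance Gaussians two units apart the analytic divergence is 2 nats; the KDE estimate is only expected to be in that neighbourhood, while the divergence of a sample from itself is exactly zero.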
- unite_toolbox.kde_estimators.calc_kde_mutual_information(x: numpy.ndarray, y: numpy.ndarray, bandwidth: float | None = None, mode: str = 'resubstitution') -> float
Calculate MI between x and y using KDE.
Calculates the mutual information between x and y [in nats] using KDE. This method uses a multivariate Gaussian kernel, so both x and y can be multivariate. The method evaluates the density at every point of x, of y, and of the joint (x, y), so x and y must have the same number of samples. This function has two modes: resubstitution and integral.
Parameters
- x : numpy.ndarray
Array of shape (n_samples, d1_features)
- y : numpy.ndarray
Array of shape (n_samples, d2_features)
- bandwidth : float, optional
Bandwidth of the Gaussian kernel, “scott” by default
- mode : str, “resubstitution” or “integral”, optional
Method for the mutual information calculation, defaults to ‘resubstitution’
Returns
- mi : float
Mutual information between x and y [in nats]
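The need for equally sized x and y follows from how a resubstitution estimate pairs the samples: the joint density is evaluated at each pair \((x_i, y_i)\) and compared with the product of the two marginal densities. The sketch below shows this for scalar x and scalar y using only the standard library. All function names are hypothetical, and the Gaussian kernel with Silverman's bandwidth (computed separately for \(d = 1\) and \(d = 2\)) is an assumption, not a description of the toolbox's internals.

```python
import math
import random
from typing import Sequence


def silverman(n: int, d: int) -> float:
    """Silverman's rule of thumb for n samples in d dimensions."""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))


def cov(a: Sequence[float], b: Sequence[float]) -> float:
    """Sample covariance of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


def log_density_1d(x: float, data: Sequence[float], var: float, h: float) -> float:
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi * var))
    return math.log(
        norm * sum(math.exp(-0.5 * (x - xi) ** 2 / (var * h * h)) for xi in data)
    )


def log_density_2d(x, y, xs, ys, sxx, syy, sxy, h) -> float:
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 covariance matrix
    norm = 1.0 / (len(xs) * h ** 2 * 2.0 * math.pi * math.sqrt(det))
    total = 0.0
    for xi, yi in zip(xs, ys):
        dx, dy = x - xi, y - yi
        # Squared Mahalanobis distance using the closed-form 2x2 inverse.
        maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        total += math.exp(-0.5 * maha / (h * h))
    return math.log(norm * total)


def kde_mutual_information(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Resubstitution estimate of I(X; Y) in nats for scalar x and y."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of samples")
    n = len(xs)
    sxx, syy, sxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)
    h1, h2 = silverman(n, 1), silverman(n, 2)
    total = 0.0
    for x, y in zip(xs, ys):
        joint = log_density_2d(x, y, xs, ys, sxx, syy, sxy, h2)
        total += joint - log_density_1d(x, xs, sxx, h1) - log_density_1d(y, ys, syy, h1)
    return total / n


rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(300)]
ys = [0.8 * x + 0.6 * rng.gauss(0.0, 1.0) for x in xs]  # correlation of about 0.8
mi = kde_mutual_information(xs, ys)
```

For a bivariate Gaussian with correlation 0.8 the analytic mutual information is \(-\tfrac{1}{2}\log(1 - 0.8^2) \approx 0.51\) nats, which gives a rough reference for the estimate.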