KDE

Kernel density estimation (KDE) approximates a probability density function (PDF) as a sum of kernels centered on the samples, where each kernel is a non-negative window function. The density \(p(x)\) at a point \(x\) is estimated as:

\[\hat{p}(x) = \frac{1}{n}\sum_{i=1}^n K(u)\]

where:

\[u = \frac{\left (x-x_i \right )^\intercal \Sigma^{-1}\left ( x-x_i \right )}{h^2}\]

and \(n\) is the total number of samples, \(K\) is a multivariate kernel function, \(x_i = [x_{1,i}, x_{2,i}, \dots, x_{d,i}]^\intercal\) is the \(i\)-th \(d\)-dimensional sample, \(\Sigma\) is the covariance matrix of the samples, and \(h\) is a smoothing parameter.

The UNITE toolbox uses a multivariate Gaussian kernel by default:

\[K(u)=\frac{1}{\left (2\pi \right )^{d/2} h^d \det{\left ( \Sigma \right )}^{1/2}} e^{-u/2}\]

with Silverman’s bandwidth estimate:

\[h=\left ( \frac{n\left ( d+2 \right )}{4} \right )^{-1/\left ( d+4 \right )}\]
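
The equations above can be sketched in a few lines of NumPy. This is only an illustration of the formulas, not the toolbox's own implementation: the names `silverman_bandwidth` and `gaussian_kde_density` are hypothetical, and the sketch returns a flat array of shape (m,) rather than a column vector.

```python
import numpy as np


def silverman_bandwidth(n, d):
    """Smoothing parameter h for n samples in d dimensions."""
    return (n * (d + 2) / 4) ** (-1 / (d + 4))


def gaussian_kde_density(x, data, h=None):
    """Evaluate a Gaussian KDE built from `data` at the rows of `x`.

    x has shape (m, d); data has shape (n, d). Returns an array of shape (m,).
    """
    n, d = data.shape
    if h is None:
        h = silverman_bandwidth(n, d)
    cov = np.atleast_2d(np.cov(data, rowvar=False))  # sample covariance, (d, d)
    inv_cov = np.linalg.inv(cov)
    diff = x[:, None, :] - data[None, :, :]  # pairwise differences, (m, n, d)
    # u = (x - x_i)^T Sigma^{-1} (x - x_i) / h^2 for every (x, x_i) pair
    u = np.einsum("mnd,de,mne->mn", diff, inv_cov, diff) / h**2
    norm = (2 * np.pi) ** (d / 2) * h**d * np.sqrt(np.linalg.det(cov))
    return np.exp(-u / 2).sum(axis=1) / (n * norm)


rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))  # synthetic 2D standard-normal samples
query = np.array([[0.0, 0.0], [3.0, 3.0]])
p_hat = gaussian_kde_density(query, data)
print(p_hat)  # the value at the origin should be far larger than in the tail
```

The pairwise-difference array has shape (m, n, d), so memory grows with the product of the number of query points and samples. For large inputs the evaluation would need to be chunked.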

Given an estimate of the density \(\hat{p}(x)\) at \(x\), entropy, KL divergence and mutual information can be calculated directly as resubstitution estimates, that is, as averages over the samples themselves, using the following equations:

\[H(X) = -\frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \hat{p}\left ( x_i \right ) \right )}}\]
\[D_{KL}(p||q) = \frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \frac{\hat{p}(x_i)}{\hat{q}(x_i)} \right )}}\]
\[I(X;Y) = \frac{1}{n} \sum_{i=1}^{n} {\log{\left ( \frac{\hat{p}(x_i,y_i)}{\hat{p}(x_i)\hat{p}(y_i)} \right )}}\]
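
As a rough illustration of the resubstitution idea, the sketch below evaluates a Gaussian KDE at its own samples and averages the negative log-density to estimate entropy. The helper names `kde_density` and `resubstitution_entropy` are hypothetical, the bandwidth follows the Silverman formula given above, and the data are synthetic. It is not the toolbox's code.

```python
import numpy as np


def kde_density(x, data):
    """Gaussian KDE of `data` (Silverman bandwidth) evaluated at the rows of `x`."""
    n, d = data.shape
    h = (n * (d + 2) / 4) ** (-1 / (d + 4))
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    diff = x[:, None, :] - data[None, :, :]
    u = np.einsum("mnd,de,mne->mn", diff, np.linalg.inv(cov), diff) / h**2
    norm = (2 * np.pi) ** (d / 2) * h**d * np.sqrt(np.linalg.det(cov))
    return np.exp(-u / 2).sum(axis=1) / (n * norm)


def resubstitution_entropy(data):
    """H(X) in nats: minus the mean log-density at the samples themselves."""
    return -np.mean(np.log(kde_density(data, data)))


rng = np.random.default_rng(42)
samples = rng.normal(size=(1000, 1))  # synthetic draws from N(0, 1)
h_est = resubstitution_entropy(samples)
h_true = 0.5 * np.log(2 * np.pi * np.e)  # analytic entropy of N(0, 1)
print(h_est, h_true)
```

Comparing against the analytic entropy of a standard normal gives a quick sanity check. The two values should be close but not identical, since the estimate depends on the sample and on the smoothing.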

Alternatively, the estimates can be computed by numerically integrating over the estimated density. For example, the integral estimate of entropy is:

\[H(X) = -\int_\mathcal{X} \hat{p}(x) \log \hat{p}(x)\,\text{d}x\]
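
For one-dimensional data, the integral above can be approximated on a regular grid with the trapezoidal rule. The sketch below assumes a 1D Gaussian KDE with the Silverman bandwidth and a grid that extends three units beyond the data. The helper `kde_density_1d` and the grid choice are illustrative assumptions, and the toolbox may integrate differently.

```python
import numpy as np


def kde_density_1d(grid, data):
    """1D Gaussian KDE of `data` (Silverman bandwidth, d = 1) on `grid`."""
    n = data.size
    h = (n * 3 / 4) ** (-1 / 5)
    sigma = data.std(ddof=1)  # in 1D the covariance matrix is just the variance
    z = (grid[:, None] - data[None, :]) / (h * sigma)
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * sigma * np.sqrt(2 * np.pi))


def trapezoid(y, x):
    """Trapezoidal rule, written out to avoid NumPy version differences."""
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))


rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # synthetic draws from N(0, 1)
grid = np.linspace(data.min() - 3, data.max() + 3, 2001)
p = kde_density_1d(grid, data)

mass = trapezoid(p, grid)  # should be close to 1
h_integral = trapezoid(-p * np.log(p), grid)  # entropy in nats
print(mass, h_integral)
```

Checking that the density integrates to roughly one is a cheap way to confirm that the grid covers the support before trusting the entropy value.
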
class unite_toolbox.kde_estimators.KDEMode(value)

Enumeration representing different modes for calculating KDE estimates.

Attributes

RESUBSTITUTION : str

Calculate an estimate using resubstitution.

INTEGRAL : str

Calculate an estimate by integrating over the KDE.

unite_toolbox.kde_estimators.calc_kde_density(x: numpy.ndarray, data: numpy.ndarray, bandwidth: float | None = None) → numpy.ndarray

Calculate density using KDE.

Evaluates, at every point (row) of the 2D array x, the density given by a KDE fitted to data.

Parameters

x : numpy.ndarray

Array of shape (n_samples, d_features)

data : numpy.ndarray

Array of shape (m_samples, d_features)

bandwidth : float, optional

Bandwidth of the Gaussian kernel

Returns

p : numpy.ndarray

Array of shape (n_samples, 1)

unite_toolbox.kde_estimators.calc_kde_entropy(data: numpy.ndarray, bandwidth: float | None = None, mode: str = 'resubstitution') → float

Calculate the entropy of a dataset using kernel density estimation (KDE).

Calculates the (joint) entropy of the input data [in nats] by approximating the (joint) density of the distribution using a Gaussian kernel density estimator (KDE). By default, the Scott estimate for the bandwidth is used for the Gaussian kernel. This function has two modes: resubstitution and integral.

Parameters

data : numpy.ndarray

Array of shape (n_samples, n_features)

bandwidth : float, optional

Bandwidth of the Gaussian kernel

mode : str, “resubstitution” or “integral”, optional

Method for entropy calculation, defaults to ‘resubstitution’

Returns

h : float

Entropy of the dataset [in nats]

unite_toolbox.kde_estimators.calc_kde_kld(p: numpy.ndarray, q: numpy.ndarray, bandwidth: float | None = None, mode: str = 'resubstitution') → float

Calculate KLD using KDE.

Calculates the Kullback-Leibler divergence (relative entropy) between two data sets (p and q) [in nats] by approximating both distributions using a Gaussian kernel density estimate (KDE). The divergence is measured between the two estimated densities. The density estimates are independent of each other, so p and q may contain different numbers of samples. This function has two modes: resubstitution and integral.

Parameters

p : numpy.ndarray

Array of shape (n_samples, d_features)

q : numpy.ndarray

Array of shape (m_samples, d_features)

bandwidth : float, optional

Bandwidth of the Gaussian kernel

mode : str, “resubstitution” or “integral”, optional

Method for KLD calculation, defaults to ‘resubstitution’

Returns

kld : float

Kullback-Leibler divergence between p and q [in nats]
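
The point about differing sample counts can be illustrated with a small resubstitution sketch: each data set gets its own KDE, and both are evaluated at the samples of p only. The helpers `kde_density` and `resubstitution_kld` are hypothetical names, the bandwidth is the Silverman formula from the introduction, and the data are synthetic. This is not a call into the toolbox.

```python
import numpy as np


def kde_density(x, data):
    """Gaussian KDE of `data` (Silverman bandwidth) evaluated at the rows of `x`."""
    n, d = data.shape
    h = (n * (d + 2) / 4) ** (-1 / (d + 4))
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    diff = x[:, None, :] - data[None, :, :]
    u = np.einsum("mnd,de,mne->mn", diff, np.linalg.inv(cov), diff) / h**2
    norm = (2 * np.pi) ** (d / 2) * h**d * np.sqrt(np.linalg.det(cov))
    return np.exp(-u / 2).sum(axis=1) / (n * norm)


def resubstitution_kld(p, q):
    """D_KL(p || q) in nats, averaged over the samples of p."""
    p_hat = kde_density(p, p)  # KDE built from p, evaluated at p
    q_hat = kde_density(p, q)  # KDE built from q, evaluated at p
    return np.mean(np.log(p_hat / q_hat))


rng = np.random.default_rng(1)
p = rng.normal(loc=0.0, size=(800, 1))  # 800 draws from N(0, 1)
q = rng.normal(loc=1.0, size=(500, 1))  # 500 draws from N(1, 1)
kld = resubstitution_kld(p, q)
print(kld)  # the analytic value for these two normals is 0.5 nats
```

Because the average runs over the samples of p, points in the tail of p where q has few samples can inflate the estimate. Only the number of features has to match between the two arrays.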

unite_toolbox.kde_estimators.calc_kde_mutual_information(x: numpy.ndarray, y: numpy.ndarray, bandwidth: float | None = None, mode: str = 'resubstitution') → float

Calculate MI between x and y using KDE.

Calculates the mutual information between x and y [in nats] using KDE. This method uses a multivariate Gaussian kernel, so both x and y can be multivariate. The method evaluates the density at every point of x, of y, and of the joint (x, y) samples; therefore, x and y must have the same number of entries. This function has two modes: resubstitution and integral.

Parameters

x : numpy.ndarray

Array of shape (n_samples, d1_features)

y : numpy.ndarray

Array of shape (n_samples, d2_features)

bandwidth : float, optional

Bandwidth of the Gaussian kernel, “scott” by default

mode : str, “resubstitution” or “integral”, optional

Method for mutual information calculation, defaults to ‘resubstitution’

Returns

mi : float

Mutual information between x and y [in nats]
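
The requirement that x and y have the same number of entries follows from how the joint samples are formed: the two arrays are stacked column-wise, row by row. A minimal resubstitution sketch under that assumption is given below. The helper names are hypothetical, and each density uses its own Silverman bandwidth for its dimensionality, which may differ from what the toolbox does.

```python
import numpy as np


def kde_density(x, data):
    """Gaussian KDE of `data` (Silverman bandwidth) evaluated at the rows of `x`."""
    n, d = data.shape
    h = (n * (d + 2) / 4) ** (-1 / (d + 4))
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    diff = x[:, None, :] - data[None, :, :]
    u = np.einsum("mnd,de,mne->mn", diff, np.linalg.inv(cov), diff) / h**2
    norm = (2 * np.pi) ** (d / 2) * h**d * np.sqrt(np.linalg.det(cov))
    return np.exp(-u / 2).sum(axis=1) / (n * norm)


def resubstitution_mi(x, y):
    """I(X;Y) in nats from joint and marginal KDEs evaluated at the samples."""
    xy = np.hstack([x, y])  # joint samples, shape (n, d1 + d2)
    p_xy = kde_density(xy, xy)
    p_x = kde_density(x, x)
    p_y = kde_density(y, y)
    return np.mean(np.log(p_xy / (p_x * p_y)))


rng = np.random.default_rng(7)
rho = 0.8  # correlation of the synthetic bivariate normal
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1000)
x, y = xy[:, :1], xy[:, 1:]  # keep both as 2D arrays of shape (n, 1)
mi = resubstitution_mi(x, y)
mi_true = -0.5 * np.log(1 - rho**2)  # analytic MI of a bivariate normal
print(mi, mi_true)
```

For a bivariate normal the mutual information has a closed form, so the synthetic example doubles as a rough accuracy check.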