k-NN

Different authors have proposed equations to calculate entropy, KL divergence and mutual information using k-nearest neighbors (k-NN). In principle one could make an estimate of density \(p(x)\) at \(x_i\) using calc_knn_density and then use a resubstitution estimate of a desired quantity but this is not recommended for k-NN-based methods. See more in the tutorial: Non-parametric Density Estimation.

See calc_knn_entropy, calc_knn_kld and calc_knn_mutual_information for details specific for each equation as well as their source.

Distance is a specific topic which requires a few words. Distance is the length between two points \(x\) and \(y\), and is given by a \(p\)-norm function where \(p \geq 1\) as follows:

\[\left\|x - y\right\|_{p} = (|x_{1} - y_{1}|^{p} + |x_{2} - y_{2}|^{p} + \cdots + |x_{n} - y_{n}|^{p})^{\frac{1}{p}}\]

with \(p=2\) suggested for both estimating density, entropy and KL divergence. For mutual information, Kraskov, et al. propose the usage of \(p=\inf\) or the infinite norm.

unite_toolbox.knn_estimators.vol_lp_ball(r: float, d: int, p_norm: float) float

Calculate volume of \(L^p\) ball.

Calculates the volume of a \(L^p\) ball of radius r in d-dimensional space.

\[V_d^p(R) = \frac{(2\,\Gamma(\frac{1}{p}+1))^d}{\Gamma(\frac{d}{p}+1)} \,R^d\]
Where:
  • \(\Gamma\) is the Gamma function.

  • \(p\) is the order of the Minkowski distance.

  • \(d\) is the number of dimensions of the d-ball.

  • \(R\) is the radius of the d-ball.

Parameters

rfloat

radius of the d-ball

dint

dimension of the d-ball

p_normfloat

p (Minkowski) distance. p = 1 is the Manhattan distance, p = 2 is the Euclidian distance, etc. np.inf is the maximum (Chebyshev) distance.

Returns

volfloat

Volume of the L^p ball

unite_toolbox.knn_estimators.calc_knn_density(x: numpy.ndarray, data: numpy.ndarray, k: int = 5, p_norm: float = 2) numpy.ndarray

Calculate the probability density of x using k-NN.

Calculates the probability density of every point in x based on the proximity to the data using k-nearest neighbors and the equation proposed by Wang et al. (2009). Note: x and data must have the same number of d_features. 10.1109/TIT.2009.2016060

\[\hat{p}_k(x_i) = \frac{k}{N-1} \cdot \frac{1}{c_1(d) \cdot \rho^{d}_k(i)}\]
Where:
  • \(k\) is the number of neighbors used.

  • \(N\) is the number of samples in the data.

  • \(c_1(d)\) is the volume of a d-dimensional unit \(L^p\)

    ball.

  • \(d\) is the number of dimensions of the data.

  • \(\rho^{d}_k(i)\) is the distance between a point \(i\) and

    its k-th nearest neighbor.

Parameters

xnumpy.ndarray

Array of shape (n_samples, d_features)

datanumpy.ndarray

Array of shape (n_samples, d_features)

kint, optional

no. of nearest neighbors to use

p_normfloat

p (Minkowski) distance. p = 1 is the Manhattan distance, p = 2 is the Euclidian distance, etc. np.inf is the maximum (Chebyshev) distance.

Returns

pnumpy.ndarray

Array of shape (n_samples, 1)

References

for Multidimensional Densities Via k-Nearest-Neighbor Distances. https://doi.org/10.1109/TIT.2009.2016060

unite_toolbox.knn_estimators.calc_knn_entropy(data: numpy.ndarray, k: int = 1, p_norm: float = 2) float

Calculate the (joint) entropy of the input n-d array.

Calculates the (joint) entropy of the input data [in nats] using the Kozachenko and Leonenko (KL) (1987) estimator which is an approach based on k-nearest neighbors (k-NN). By default, the Euclidean norm distance (p-norm = 2) is used to calculate distances. http://mi.mathnet.ru/ppi797

\[\hat{H}(X) = \psi(N) - \psi(k) + log(c_1(d)) + \frac{d}{N}\sum_{i=1}^{N}\log(\rho^{d}_k(i))\]
Where:
  • \(\psi\) is the digamma function.

  • \(N\) is the number of samples in the data.

  • \(k\) is the number of neighbors used.

  • \(c_1(d)\) is the volume of a d-dimensional unit \(L^p\)

    ball.

  • \(d\) is the number of dimensions of the data.

  • \(\rho^{d}_k(i)\) is the distance between a point \(i\) and

    its k-th nearest neighbor.

Parameters

datanumpy.ndarray

Array of shape (n_samples, d_features)

kint, optional

no. of nearest neighbors to use

p_normfloat

p (Minkowski) distance. p = 1 is the Manhattan distance, p = 2 is the Euclidian distance, etc. np.inf is the maximum (Chebyshev) distance.

Returns

hfloat

Entropy of the data set [in nats]

References

for the entropy of a random vector. https://zbmath.org/?q=an:0633.62005

unite_toolbox.knn_estimators.calc_knn_kld(p: numpy.ndarray, q: numpy.ndarray, k: int = 1, p_norm: float = 2) float

Calculate the Kullback-Leibler divergence between two arrays of data.

Calculates the Kullback-Leibler divergence (relative entropy) between two data sets (p and q) [in nats] using the estimation method proposed by Wang et al. (2009). Both p and q are n-d arrays where d >= 1 which means they can have multiple features. Typically p represents the true distribution while q is the approximate distribution. A different number of total samples in p and q is acceptable, 10.1109/TIT.2009.2016060

\[\hat{D}_{KL\,n,m}(p||q) = \frac{d}{n} \sum_{i=1}^{n} \log \left ( \frac{\nu_k(i)}{\rho_k(i)} \right ) + \log\left (\frac{m}{n-1}\right )\]
Where:
  • \(d\) is the number of dimensions of the data.

  • \(n\) is the number of samples in p.

  • \(\rho_k(i)\) is the distance between \(x_i\) and its k-NN

    in \({x_j}_(j \neq i)\).

  • \(\nu_k(i)\) is the distance between \(x_i\) and its k-NN

    in \(y_j\).

  • \(m\) is the number of points in q.

Parameters

pnumpy.ndarray

Array of shape (n_samples, d_features)

qnumpy.ndarray

Array of shape (m_samples, d_features)

kint, optional

no. of nearest neighbors to use

p_normfloat

p (Minkowski) distance. p = 1 is the Manhattan distance, p = 2 is the Euclidian distance, etc. np.inf is the maximum (Chebyshev) distance.

Returns

kldfloat

Kullback-Leibler divergence between p and q [in nats]

References

for Multidimensional Densities Via k-Nearest-Neighbor Distances. https://doi.org/10.1109/TIT.2009.2016060

unite_toolbox.knn_estimators.calc_knn_mutual_information(x: numpy.ndarray, y: numpy.ndarray, k: int = 15) float

Calculate mutual information between x and y using k-NN.

Estimates the mutual information between x and y [in nats] using the method of estimation proposed Kraskov et al. (2004). Both x and y can have d >= 1 which means they can have multiple features. By default, the maximum norm (p-norm = ∞) and fifteen neighbors (k = 15) are used. 10.1103/PhysRevE.69.066138

\[\hat{I}(X;Y) = \psi(k) - \frac{1}{N} \sum_{i=1}^{N} \mathbb{E} \left[\psi(n_{i,x} + 1) + \psi(n_{i,y} + 1) \right] + \psi(N)\]
Where:
  • \(\psi\) is the digamma function.

  • \(N\) is the number of samples.

  • \(\mathbb{E}\) is the expectation operation.

  • \(n_{i,x}\) and \(n_{i,y}\) are the number of neighbors of

every point within a given radius calculated as the distance to the k-th nearest neighbor in the joint X-Y space.

Parameters

xnumpy.ndarray

Array of shape (n_samples, d_features)

ynumpy.ndarray

Array of shape (n_samples, d_features)

kint, optional

no. of nearest neighbors to use

Returns

mifloat

Mutual information between x and y [in nats]

References

mutual information. https://doi.org/10.1103/PhysRevE.69.066138