API reference

This page documents the functions provided by the lightkde package.

kde_1d(sample_vec, n_x_vec=16384, x_min=None, x_max=None, weight_vec=None, return_bandwidth=False)

Reliable and extremely fast kernel density estimator for a one-dimensional sample.

A Gaussian kernel is assumed and the bandwidth is chosen automatically. Unlike many other implementations, this one does not deteriorate for multimodal densities with widely separated modes, because it never assumes a parametric model for the sample.

Note

  • The elements of sample_vec that fall between x_min and x_max will be treated as the full sample, i.e. the kernel density over [x_min, x_max] will integrate to one.

  • If the search for the optimal bandwidth fails, the function falls back to scipy.stats.gaussian_kde.
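
The normalization described in the first note can be checked numerically. The following is a minimal sketch and not part of lightkde: trapezoid_integral is a hypothetical helper that integrates a density vector over its grid with the trapezoidal rule, and the data below is a stand-in for the output of kde_1d. Applied to the density_vec and x_vec returned by kde_1d, the result would be expected to be close to one, up to discretization error.

```python
def trapezoid_integral(density_vec, x_vec):
    """Integrate density values over a grid with the trapezoidal rule.

    Hypothetical helper for illustration; it is not part of lightkde.
    """
    if len(density_vec) != len(x_vec):
        raise ValueError("density_vec and x_vec must have the same length")
    total = 0.0
    for i in range(len(x_vec) - 1):
        width = x_vec[i + 1] - x_vec[i]
        total += 0.5 * (density_vec[i] + density_vec[i + 1]) * width
    return total


# Stand-in data: a uniform density on [0, 2] has the constant value 0.5.
x_vec = [0.0, 0.5, 1.0, 1.5, 2.0]
density_vec = [0.5] * len(x_vec)
area = trapezoid_integral(density_vec, x_vec)  # 1.0
```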

Parameters
  • sample_vec (Union[numpy.ndarray, list]) – A vector of sample points from which the density estimate is constructed.

  • n_x_vec (int) – The number of grid points (x_vec) used in the uniform discretization of the interval [x_min, x_max]. It should be a power of two; if it is not, it is rounded up to the next power of two, i.e. n_x_vec is set to 2**ceil(log2(n_x_vec)). The default is 2**14 = 16384.

  • x_min (Optional[Union[int, float]]) – The lower boundary of the interval over which the density estimate is constructed.

  • x_max (Optional[Union[int, float]]) – The upper boundary of the interval over which the density estimate is constructed.

  • weight_vec (Optional[Union[numpy.ndarray, list]]) – Weights of the sample points. This must have the same shape as sample_vec. If None (default), the samples are assumed to be equally weighted. Only the values of the elements relative to each other matter, i.e. multiplying weight_vec by a positive scalar does not change the results.

  • return_bandwidth (bool) – If True, the bandwidth used is returned as well.
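
The rounding rule documented for n_x_vec can be sketched as follows. This only mirrors the formula stated above; the function name round_up_to_power_of_two is hypothetical and is not part of lightkde.

```python
from math import ceil, log2


def round_up_to_power_of_two(n_x_vec):
    """Return the smallest power of two that is >= n_x_vec.

    Mirrors the documented formula 2**ceil(log2(n_x_vec)); the function
    name is hypothetical and not part of lightkde.
    """
    if n_x_vec < 1:
        raise ValueError("n_x_vec must be a positive integer")
    return 2 ** ceil(log2(n_x_vec))


# Values that are already powers of two are left unchanged.
grid_sizes = [round_up_to_power_of_two(n) for n in (1000, 1024, 16384, 20000)]
# grid_sizes is [1024, 1024, 16384, 32768]
```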

Raises

ValueError – If weight_vec has at least one negative value.
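
The two documented properties of weight_vec (negative values are rejected, and only relative values matter) can be illustrated with a small sketch. normalize_weights is a hypothetical helper; whether lightkde normalizes the weights in exactly this way internally is an assumption.

```python
def normalize_weights(weight_vec):
    """Validate weights and scale them so that they sum to one.

    Hypothetical helper for illustration; it is not part of lightkde.
    """
    if any(w < 0 for w in weight_vec):
        raise ValueError("weight_vec must not contain negative values")
    total = sum(weight_vec)
    if total <= 0:
        raise ValueError("weight_vec must contain at least one positive value")
    return [w / total for w in weight_vec]


weights_a = normalize_weights([1, 2, 1])
weights_b = normalize_weights([10, 20, 10])  # same weights scaled by 10
# Both are [0.25, 0.5, 0.25]: scaling by a positive factor has no effect.
```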

Warns

Root finding failed (Brent’s method) – Optimal bandwidth finding failed, falling back to the rule-of-thumb bandwidth of scipy.stats.gaussian_kde.

Returns

Kernel densities, a vector of length n_x_vec with the values of the density estimate at the grid points (x_vec).

Kernel density grid (x_vec), a vector of grid points over which the kernel density estimate is computed.

Optimal bandwidth (Gaussian kernel assumed), returned only if return_bandwidth is True.

Return type

Union[Tuple[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, float]]

Examples

import numpy as np
import matplotlib.pyplot as plt
from lightkde import kde_1d

# Example 1: a small sample given as a plain list.
sample_vec = [
    -1.3145, -0.5197, 0.9326, 3.2358, 0.3814,
    -0.3226, 2.1121, 1.1357, 0.4376, -0.0332
]
density_vec, x_vec = kde_1d(sample_vec)

# Example 2: a trimodal sample with widely separated modes.
sample_vec = np.hstack((
    np.random.normal(loc=-8, size=100),
    np.random.normal(loc=-3, size=100),
    np.random.normal(loc=7, size=100),
))
density_vec, x_vec = kde_1d(sample_vec)

# Plot the density estimate of the trimodal sample.
plt.subplots()
plt.plot(x_vec, density_vec)
plt.show()

The KDE bandwidth selection method is outlined in [1]. This implementation is based on that of Daniel B. Smith [2], which in turn is based on the Matlab implementation by Zdravko Botev [3].

References

[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.

[2] https://github.com/Daniel-B-Smith/KDE-for-SciPy/blob/a9982909bbb92a7e243e5fc9a74f957d883f1c5d/kde.py (updated on 6 Feb 2013).

[3] https://nl.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator (updated on 30 Dec 2015).

kde_2d(sample_mx, n_row_mx=256, xy_min=None, xy_max=None, weight_vec=None, return_bandwidth=False)

Fast and accurate bivariate kernel density estimator with a diagonal bandwidth matrix.

The kernel is assumed to be Gaussian. The two bandwidth parameters are chosen optimally without assuming a parametric model for the sample or using any “rules of thumb”. Unlike many other procedures, this one is immune to accuracy failures in the estimation of multimodal densities with widely separated modes.

Parameters
  • sample_mx (Union[numpy.ndarray, list]) – A 2D matrix of sample points from which the density estimate is constructed. The matrix must have two columns that represent the two coordinates (x, y) of the 2D sample points.

  • n_row_mx (int) – The number of grid points along each dimension (the same for rows and columns) at which the density estimate is returned, i.e. the total number of points is n_row_mx**2.

  • xy_min (Optional[Union[numpy.ndarray, Iterable]]) – The lower x and y boundaries of the interval over which the density estimate is constructed.

  • xy_max (Optional[Union[numpy.ndarray, Iterable]]) – The upper x and y boundaries of the interval over which the density estimate is constructed.

  • weight_vec (Optional[Union[numpy.ndarray, list]]) – Weights of the sample points. This must have as many elements as sample_mx has rows; the same weight is applied to both coordinates of a sample point. If None (default), the samples are assumed to be equally weighted. Only the values of the elements relative to each other matter, i.e. multiplying weight_vec by a positive scalar does not change the results.

  • return_bandwidth (bool) – If True, the bandwidth used is returned as well.

Raises

ValueError – If the number of columns in sample_mx is not two. If weight_vec has at least one negative value.
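
The shape requirement on sample_mx can be illustrated with a short sketch. check_sample_mx is a hypothetical helper written for this example; it is not part of lightkde, and the actual error message raised by kde_2d may differ.

```python
import numpy as np


def check_sample_mx(sample_mx):
    """Convert the input to an array and require shape (n_sample, 2).

    Hypothetical helper for illustration; it is not part of lightkde.
    """
    sample_mx = np.asarray(sample_mx, dtype=float)
    if sample_mx.ndim != 2 or sample_mx.shape[1] != 2:
        raise ValueError("sample_mx must be a 2D matrix with two columns (x, y)")
    return sample_mx


# Three (x, y) points given as a nested list: one row per sample point.
sample_mx = check_sample_mx([[0.1, 1.2], [0.4, 0.9], [-0.3, 1.1]])
x_vec, y_vec = sample_mx[:, 0], sample_mx[:, 1]
```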

Returns

Kernel densities, 2D matrix with the values of the density estimate at the grid points formed by x_mx and y_mx.

Kernel density grid (x_mx), the x coordinates of the grid points over which the density estimate is computed in the form of a 2D matrix that is the outcome of np.meshgrid.

Kernel density grid (y_mx), the y coordinates of the grid points over which the density estimate is computed in the form of a 2D matrix that is the outcome of np.meshgrid.

Optimal bandwidth (Gaussian kernel assumed), returned only if return_bandwidth is True.

Return type

Union[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, float]]
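
The layout of the returned grid matrices can be sketched with np.meshgrid directly. This is an illustration under assumptions: the boundary values are made up, and the use of np.linspace with the default "xy" indexing is assumed rather than taken from the lightkde source.

```python
import numpy as np

n_row_mx = 4  # small value for readability; the documented default is 256
xy_min = (0.0, 10.0)  # assumed lower (x, y) boundaries
xy_max = (3.0, 13.0)  # assumed upper (x, y) boundaries

x_vec = np.linspace(xy_min[0], xy_max[0], n_row_mx)
y_vec = np.linspace(xy_min[1], xy_max[1], n_row_mx)

# With the default indexing="xy", x varies along columns and y along rows.
x_mx, y_mx = np.meshgrid(x_vec, y_vec)

# A density matrix on this grid has the same shape as x_mx and y_mx.
density_mx = np.zeros_like(x_mx)
```

Matrices in this form can be passed together to plotting functions such as matplotlib's contourf(x_mx, y_mx, density_mx).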

Note

To ease testing and debugging, the implementation follows [2] very closely, i.e. [2] is assumed to be correct.

References

[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.

[2] https://nl.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation (updated on 30 Dec 2015).