API reference
This page documents the functions provided by the lightkde
package.
- kde_1d(sample_vec, n_x_vec=16384, x_min=None, x_max=None, weight_vec=None, return_bandwidth=False)
Reliable and extremely fast kernel density estimator for a one-dimensional sample.
A Gaussian kernel is assumed and the bandwidth is chosen automatically. Because no parametric model is ever assumed for the sample, the estimate does not deteriorate for multimodal densities with widely separated modes, unlike many other implementations.
Note
The elements of sample_vec that fall between x_min and x_max are treated as the full sample, i.e. the kernel density over [x_min, x_max] integrates to one.
If the search for the optimal bandwidth fails, the function falls back to scipy.stats.gaussian_kde.
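The restriction to [x_min, x_max] described in the note can be pictured as a filtering step applied before the estimation. The following is a minimal sketch using only the standard library. restrict_sample is a hypothetical helper that is not part of lightkde, and both the inclusive boundaries and the reading of None as "no bound" are assumptions made for this illustration.

```python
def restrict_sample(sample_vec, x_min=None, x_max=None):
    """Keep only the sample points inside [x_min, x_max].

    Hypothetical helper for illustration. In this sketch a bound of None
    means "no bound", and the boundaries are inclusive (an assumption).
    """
    lower = float("-inf") if x_min is None else x_min
    upper = float("inf") if x_max is None else x_max
    return [x for x in sample_vec if lower <= x <= upper]


sample_vec = [-1.3145, -0.5197, 0.9326, 3.2358, 0.3814]
kept = restrict_sample(sample_vec, x_min=-1.0, x_max=1.0)

# Only the retained points contribute, so a density estimated over
# [-1, 1] is normalized with respect to these points alone.
print(kept)
```

Points outside the interval do not leak probability mass into it: they are simply ignored, which is why the estimate over the interval still integrates to one.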
- Parameters
sample_vec (Union[numpy.ndarray, list]): A vector of sample points from which the density estimate is constructed.
n_x_vec (int): The number of x_vec points used in the uniform discretization of the interval [x_min, x_max]. n_x_vec has to be a power of two; if it is not, it is rounded up to the next power of two, i.e. n_x_vec is set to 2**ceil(log2(n_x_vec)). The default value is 2**14.
x_min (Optional[Union[int, float]]): The lower boundary of the interval over which the density estimate is constructed.
x_max (Optional[Union[int, float]]): The upper boundary of the interval over which the density estimate is constructed.
weight_vec (Optional[Union[numpy.ndarray, list]]): Weights of the sample points. This must have the same shape as sample_vec. If None (default), the samples are assumed to be equally weighted. Only the values of the elements relative to each other matter, i.e. multiplying weight_vec by a positive scalar does not change the results.
return_bandwidth (bool): Whether to return the bandwidth that was used.
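The rounding rule for n_x_vec can be checked in isolation. The snippet below is a small sketch of the documented formula 2**ceil(log2(n_x_vec)); round_up_to_power_of_two is a hypothetical name chosen for this illustration and is not a lightkde function.

```python
from math import ceil, log2


def round_up_to_power_of_two(n_x_vec):
    """Return the smallest power of two that is >= n_x_vec (for n_x_vec >= 1)."""
    return 2 ** ceil(log2(n_x_vec))


# Values that are already powers of two are left unchanged.
for requested in (1000, 4096, 16384, 20000):
    print(requested, "->", round_up_to_power_of_two(requested))
```

Requesting a grid size just above a power of two therefore nearly doubles the number of grid points, which is worth keeping in mind when memory or run time matters.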
- Raises
ValueError: If weight_vec has at least one negative value.
- Warns
Root finding failed (Brent's method): Optimal bandwidth finding failed, falling back to the rule-of-thumb bandwidth of scipy.stats.gaussian_kde.
- Returns
Kernel densities, a vector of length n_x_vec with the values of the density estimate at the grid points (x_vec).
Kernel density grid (x_vec), a vector of grid points over which the kernel density estimate is computed.
Optimal bandwidth (Gaussian kernel assumed), returned only if return_bandwidth is True.
- Return type
Union[Tuple[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, float]]
Examples
import numpy as np
import matplotlib.pyplot as plt
from lightkde import kde_1d

sample_vec = [
    -1.3145, -0.5197, 0.9326, 3.2358, 0.3814,
    -0.3226, 2.1121, 1.1357, 0.4376, -0.0332,
]
density_vec, x_vec = kde_1d(sample_vec)

sample_vec = np.hstack(
    (
        np.random.normal(loc=-8, size=100),
        np.random.normal(loc=-3, size=100),
        np.random.normal(loc=7, size=100),
    )
)
density_vec, x_vec = kde_1d(sample_vec)
plt.subplots()
plt.plot(x_vec, density_vec)
plt.show()
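A simple sanity check on a returned pair (density_vec, x_vec) is to integrate the density numerically and confirm that the area is close to one. The sketch below uses the composite trapezoidal rule and only the standard library. To stay runnable without lightkde, it substitutes a standard normal density on a uniform grid for the output of kde_1d; trapezoid_integral is a hypothetical helper, and the grid limits are arbitrary choices for the illustration.

```python
from math import exp, pi, sqrt


def trapezoid_integral(y_vec, x_vec):
    """Integrate y over x with the composite trapezoidal rule."""
    return sum(
        0.5 * (y_vec[i] + y_vec[i + 1]) * (x_vec[i + 1] - x_vec[i])
        for i in range(len(x_vec) - 1)
    )


# Stand-in for the output of kde_1d: a standard normal density
# evaluated on a uniform grid with a power-of-two number of points.
n_x_vec = 1024
x_min, x_max = -8.0, 8.0
step = (x_max - x_min) / (n_x_vec - 1)
x_vec = [x_min + i * step for i in range(n_x_vec)]
density_vec = [exp(-0.5 * x * x) / sqrt(2.0 * pi) for x in x_vec]

area = trapezoid_integral(density_vec, x_vec)
print(f"area under the density: {area:.6f}")
```

With an actual estimate, small deviations from one are to be expected because of the discretization, so a tolerance-based comparison is more appropriate than an exact one.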
The KDE bandwidth selection method is outlined in [1]. This implementation is based on that of Daniel B. Smith [2], which in turn is based on the Matlab implementation by Zdravko Botev [3].
References
[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.
[2] https://github.com/Daniel-B-Smith/KDE-for-SciPy/blob/a9982909bbb92a7e243e5fc9a74f957d883f1c5d/kde.py Updated on: 6 Feb 2013.
[3] https://nl.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator Updated on: 30 Dec 2015.
- kde_2d(sample_mx, n_row_mx=256, xy_min=None, xy_max=None, weight_vec=None, return_bandwidth=False)
Fast and accurate state-of-the-art bivariate kernel density estimator with a diagonal bandwidth matrix.
The kernel is assumed to be Gaussian. The two bandwidth parameters are chosen optimally without ever assuming a parametric model for the sample or using any "rules of thumb". Unlike many other procedures, this one is immune to accuracy failures in the estimation of multimodal densities with widely separated modes.
- Parameters
sample_mx (Union[numpy.ndarray, list]): A 2D matrix of sample points from which the density estimate is constructed. The matrix must have two columns that represent the two coordinates (x, y) of the 2D sample points.
n_row_mx (int): Number of grid points along each dimension (the same for rows and columns) where the estimate of the density will be returned, i.e. the total number of points is n_row_mx**2.
xy_min (Optional[Union[numpy.ndarray, Iterable]]): The lower x and y boundaries of the interval over which the density estimate is constructed.
xy_max (Optional[Union[numpy.ndarray, Iterable]]): The upper x and y boundaries of the interval over which the density estimate is constructed.
weight_vec (Optional[Union[numpy.ndarray, list]]): Weights of the sample points. This must have the same number of elements as there are rows in sample_mx; the same weight is applied to both coordinates of a sample point. If None (default), the samples are assumed to be equally weighted. Only the values of the elements relative to each other matter, i.e. multiplying weight_vec by a positive scalar does not change the results.
return_bandwidth (bool): Whether to return the bandwidth that was used.
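One way to see why only the relative size of the weights matters is to normalize them so that they sum to one: any common positive factor then cancels. The sketch below illustrates this with the standard library. normalize_weights is a hypothetical helper, and whether lightkde normalizes weights in exactly this way internally is an assumption.

```python
from math import isclose


def normalize_weights(weight_vec):
    """Rescale non-negative weights so that they sum to one."""
    if any(w < 0 for w in weight_vec):
        raise ValueError("weight_vec must not contain negative values")
    total = sum(weight_vec)
    return [w / total for w in weight_vec]


weights = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0 * w for w in weights]  # same weights times a positive scalar

# After normalization the two weight vectors agree element by element.
same = all(
    isclose(a, b)
    for a, b in zip(normalize_weights(weights), normalize_weights(scaled))
)
print(same)
```

The scalar has to be strictly positive: multiplying by zero would leave no usable weights, and a negative factor would produce the negative values that trigger the documented ValueError.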
- Raises
ValueError: If the number of columns in sample_mx is not two, or if weight_vec has at least one negative value.
- Returns
Kernel densities, a 2D matrix with the values of the density estimate at the grid points formed by x_mx and y_mx.
Kernel density grid (x_mx), the x coordinates of the grid points over which the density estimate is computed, in the form of a 2D matrix that is the outcome of np.meshgrid.
Kernel density grid (y_mx), the y coordinates of the grid points over which the density estimate is computed, in the form of a 2D matrix that is the outcome of np.meshgrid.
Optimal bandwidth (Gaussian kernel assumed), returned only if return_bandwidth is True.
- Return type
Union[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, float]]
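The layout of the returned grid matrices can be illustrated without NumPy. The sketch below builds a pure-Python analogue of np.meshgrid with its default 'xy' indexing; meshgrid_xy is a hypothetical helper, and the assumption that kde_2d relies on the default indexing is not confirmed by this page.

```python
def meshgrid_xy(x_vec, y_vec):
    """Pure-Python analogue of np.meshgrid(x_vec, y_vec) with 'xy' indexing."""
    x_mx = [list(x_vec) for _ in y_vec]        # every row repeats x_vec
    y_mx = [[y] * len(x_vec) for y in y_vec]   # every row holds a single y value
    return x_mx, y_mx


n_row_mx = 4  # tiny grid for illustration; the documented default is 256
x_vec = [0.0, 1.0, 2.0, 3.0]
y_vec = [10.0, 20.0, 30.0, 40.0]
x_mx, y_mx = meshgrid_xy(x_vec, y_vec)

# Entry [i][j] of each matrix refers to the grid point (x_vec[j], y_vec[i]),
# so the grid holds n_row_mx**2 points in total.
n_points = sum(len(row) for row in x_mx)
print(n_points == n_row_mx**2)
```

Under this layout, a density matrix of the same shape can be passed together with x_mx and y_mx directly to contour-style plotting routines that expect meshgrid output.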
Note
To ease testing and debugging, the implementation very closely follows [2], i.e. [2] is assumed to be correct.
References
[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.
[2] https://nl.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation. Updated on: 30 Dec 2015.