lightkde#
A lightning fast, lightweight, and reliable kernel density estimation.
Easy to use, e.g.
density_vec, x_vec = kde_1d(sample_vec=sample)
.Works with 1d and 2d samples.
Works with weighted samples as well.
Based on the MATLAB implementations of Botev: [Botev, 2015], [Botev, 2015].
Installation#
π₯ π£ π₯
pip#
You can install the most recent stable version by using:
pip install lightkde
From the source#
Clone the repository and install the package locally using (run from the root of the repository):
pip install .
pip install -e .[tests,lint_type_checks,docs]
Examples#
π
Note
The examples are selected to show how to use lightkde
and to illustrate some
cases where it outperforms rule-of-thumb bandwidth selection based approaches
like scipy.stats.gaussian_kde
. For many cases, the rule-of-thumb bandwidth
selection approaches work well.
Note
Click here to download the full example code
1D bimodal example#
This example shows how to use lightkde.kde_1d
and how it compares to
scipy.stats.gaussian_kde
for a bimodal univariate case.
Import packages:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, norm
from lightkde import kde_1d
Generate synthetic data from two univariate normal distributions:
np.random.seed(42)
sample = np.hstack((norm.rvs(size=2_000), 0.3 * norm.rvs(size=1_000) + 5))
Estimate kernel density using lightkde
:
density_vec, x_vec = kde_1d(sample_vec=sample)
Estimate kernel density using scipy
:
gkde = gaussian_kde(dataset=sample)
scipy_density_vec = gkde.evaluate(x_vec)
Plot the data against the kernel density estimates:
plt.plot(x_vec, density_vec, "--r", label="lightkde")
plt.plot(x_vec, scipy_density_vec, label="scipy.stats.gaussian_kde")
plt.hist(sample, bins=100, density=True, alpha=0.5, label="data")
plt.legend()
plt.show()

The scipy
method oversmooths the kernel density and it is far
from the histogram of the data that it is expected to follow.
Total running time of the script: ( 0 minutes 1.862 seconds)
Note
Click here to download the full example code
1D truncated unimodal example#
This example shows how to use lightkde.kde_1d
with and without a limit and
how it compares to scipy.stats.gaussian_kde
for a truncated unimodal distribution.
Import packages
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, norm
from lightkde import kde_1d
Generate synthetic data from a univariate normal distribution and truncate it:
np.random.seed(42)
sample = norm.rvs(size=2_000)
sample = sample[sample > 0]
Estimate kernel density using lightkde
:
density_vec_without_x_min, x_vec_without_x_min = kde_1d(sample_vec=sample)
density_vec_with_x_min, x_vec_with_x_min = kde_1d(sample_vec=sample, x_min=0)
Estimate kernel density using scipy
:
gkde = gaussian_kde(dataset=sample)
scipy_density_vec = gkde.evaluate(x_vec_without_x_min)
Plot the data against the kernel density estimates:
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x_vec_with_x_min, density_vec_with_x_min, "--r", label="lightkde; with x_min=0")
ax.plot(
x_vec_without_x_min,
density_vec_without_x_min,
":r",
label="lightkde; without x_min",
)
ax.plot(
x_vec_without_x_min, scipy_density_vec, zorder=1, label="scipy.stats.gaussian_kde"
)
ax.hist(sample, bins=30, density=True, alpha=0.5, label="data")
ax.legend()
plt.show()

When x_min=0
is used, lightkde
approximates well the histogram of the data.
Without this option it behaves similarly to the scipy
method, both extend to a
region without data.
Total running time of the script: ( 0 minutes 1.939 seconds)
Note
Click here to download the full example code
2D bimodal example#
This example shows how to use lightkde.kde_2d
and how it compares to
scipy.stats.gaussian_kde
for a bimodal bivariate case.
Import packages:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal
from lightkde import kde_2d
Generate synthetic data from two univariate normal distributions:
np.random.seed(42)
sample = np.vstack(
(
multivariate_normal.rvs(mean=[0, 0], cov=0.3, size=2000),
multivariate_normal.rvs(
mean=[2, 2], cov=[[0.5, -0.48], [-0.48, 0.5]], size=2000
),
)
)
Estimate kernel density using lightkde
:
density_mx, x_mx, y_mx = kde_2d(sample_mx=sample)
Estimate kernel density using scipy
:
gkde = gaussian_kde(dataset=sample.T)
xy_mx = np.hstack((x_mx.reshape(-1, 1), y_mx.reshape(-1, 1)))
scipy_density_mx = gkde.evaluate(xy_mx.T).reshape(x_mx.shape)
Plot the data against the kernel density estimates:
# pre-process
bins = (30, 30)
data_density_mx, xedges, yedges = np.histogram2d(
sample[:, 0], sample[:, 1], bins=bins, density=True
)
z_min = 0
z_max = np.max(data_density_mx)
x_min, x_max = min(xedges), max(xedges)
y_min, y_max = min(yedges), max(yedges)
# plot
cmap = "afmhot_r"
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 3), sharex="all", sharey="all")
# data
h = ax1.hist2d(
sample[:, 0],
sample[:, 1],
bins=bins,
density=True,
vmin=z_min,
vmax=z_max,
cmap=cmap,
)[-1]
ax1.set_title("data")
# lightkde
ax2.contourf(x_mx, y_mx, density_mx, levels=50, vmin=z_min, vmax=z_max, cmap=cmap)
ax2.set_title("lightkde")
# scipy
ax3.contourf(x_mx, y_mx, scipy_density_mx, levels=50, vmin=z_min, vmax=z_max, cmap=cmap)
ax3.set_xlim(x_min, x_max)
ax3.set_ylim(y_min, y_max)
ax3.set_title("scipy.stats.gaussian_kde")
fig.subplots_adjust(right=0.89)
cbar_ax = fig.add_axes([0.90, 0.15, 0.05, 0.7], aspect=30)
fig.colorbar(h, cax=cbar_ax, label="density")
plt.show()

The scipy
method oversmooths the kernel density and it is far
from the histogram of the data that it is expected to follow.
Total running time of the script: ( 0 minutes 5.966 seconds)
Note
Click here to download the full example code
2D truncated unimodal example#
This example shows how to use lightkde.kde_2d
with and without a limit and
how it compares to scipy.stats.gaussian_kde
for a truncated unimodal bivariate
distribution.
Import packages
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal
from lightkde import kde_2d
Generate synthetic data from a univariate normal distribution and truncate it:
np.random.seed(42)
sample = multivariate_normal.rvs(mean=[0, 0], size=8000)
sample = sample[sample[:, 0] > 0]
Estimate kernel density using lightkde
:
density_mx_without_x_min, x_mx_without_x_min, y_mx_without_x_min = kde_2d(
sample_mx=sample
)
density_mx_with_x_min, x_mx_with_x_min, y_mx_with_x_min = kde_2d(
sample_mx=sample, xy_min=[0, -5]
)
Estimate kernel density using scipy
:
gkde = gaussian_kde(dataset=sample.T)
xy_mx = np.hstack(
(x_mx_without_x_min.reshape(-1, 1), y_mx_without_x_min.reshape(-1, 1))
)
scipy_density_mx = gkde.evaluate(xy_mx.T).reshape(x_mx_without_x_min.shape)
Plot the data against the kernel density estimates:
bins = (30, 30)
data_density_mx, xedges, yedges = np.histogram2d(
sample[:, 0], sample[:, 1], bins=bins, density=True
)
z_min = 0
z_max = np.max(data_density_mx)
x_min, x_max = -0.5, max(xedges)
y_min, y_max = min(yedges), max(yedges)
# plot
cmap = "afmhot_r"
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(
2, 2, figsize=(8, 7), sharex="all", sharey="all"
)
# data
h = ax1.hist2d(
sample[:, 0],
sample[:, 1],
bins=bins,
density=True,
vmin=z_min,
vmax=z_max,
cmap=cmap,
)[-1]
ax1.set_title("data")
# scipy
ax2.contourf(
x_mx_without_x_min,
y_mx_without_x_min,
scipy_density_mx,
levels=50,
vmin=z_min,
vmax=z_max,
cmap=cmap,
)
ax2.set_xlim(x_min, x_max)
ax3.set_ylim(y_min, y_max)
ax2.set_title("scipy.stats.gaussian_kde")
# lightkde
ax3.contourf(
x_mx_with_x_min,
y_mx_with_x_min,
density_mx_with_x_min,
levels=50,
vmin=z_min,
vmax=z_max,
cmap=cmap,
)
ax3.set_title("lightkde; with x_min=0")
ax4.contourf(
x_mx_without_x_min,
y_mx_without_x_min,
density_mx_without_x_min,
levels=50,
vmin=z_min,
vmax=z_max,
cmap=cmap,
)
ax4.set_title("lightkde; without x_min")
fig.subplots_adjust(right=0.80)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7], aspect=50)
fig.colorbar(h, cax=cbar_ax, label="density")
plt.show()

When x_min=0
is used, lightkde
approximates well the histogram of the data.
Total running time of the script: ( 0 minutes 6.354 seconds)
Methods#
π
Overview#
Diffusion based kernel density estimation after [Botev et al., 2010]. This Python implementation is based on Botevβs MATLAB implementations for 1d and 2d samples, [Botev, 2015] and [Botev, 2015] respectively. The 1d python implementation is influenced by [Smith, 2013].
A brief description of the implemented method:
Reliable and extremely fast kernel density estimator for one-dimensional data; Gaussian kernel is assumed and the bandwidth is chosen automatically; Unlike many other implementations, this one is immune to problems caused by multimodal densities with widely separated modes (see example). The estimation does not deteriorate for multimodal densities, because we never assume a parametric model for the data (like those used in rules of thumb).
Differences to the MATLAB implementation#
Extended to be able to handle weighted samples.
Extensive testing (mostly against the MATLAB implementation that is assumed to be correct).
Additional examples (Examples).
API reference#
π
This page documents the functions provided by the lightkde
package.
- kde_1d(sample_vec, n_x_vec=16384, x_min=None, x_max=None, weight_vec=None, return_bandwidth=False)[source]#
Reliable and extremely fast kernel density estimator for one-dimensional sample.
Gaussian kernel is assumed and the bandwidth is chosen automatically. Unlike many other implementations, this one is immune to problems caused by multimodal densities with widely separated modes. The estimation does not deteriorate for multimodal densities, because we never assume a parametric model for the sample.
Note
The elements of
sample_vec
that fall betweenx_min
andx_max
will be treated as the full sample, i.e. the kernel density over[x_min, x_max]
will integrate to one.If the search for finding the optimal bandwidth fails the functions falls back to
scipy.stats.gaussian_kde
.
- Parameters
sample_vec (Union[numpy.ndarray, list]) β A vector of sample points from which the density estimate is constructed.
n_x_vec (int) β The number of
x_vec
points used in the uniform discretization of the interval[x_min, x_max]
.n_x_vec
has to be a power of two. Ifn_x_vec
is not a power of two, thenn_x_vec
is rounded up to the next power of two, i.e.,n_x_vec
is set ton_x_vec=2**ceil(log2(n_x_vec))
; the default value ofn_x_vec
isn_x_vec=2**14
.x_min (Optional[Union[int, float]]) β The lower boundary of the interval over which the density estimate is constructed.
x_max (Optional[Union[int, float]]) β The upper boundary of the interval over which the density estimate is constructed.
weight_vec (Optional[Union[numpy.ndarray, list]]) β Weights of sample points. This must have the same shape as
sample_vec
. IfNone
(default), the samples are assumed to be equally weighted. Only the values of elements relative to each other matter, i.e. multiplyingweight_vec
by a non-negative scalar does not change the results.return_bandwidth (bool) β Should the used bandwidth be returned?
- Raises
ValueError β If
weight_vec
has at least one negative value.- Warns
Root finding failed (Brentβs method) β Optimal bandwidth finding failed, falling back to the rule-of-thumb bandwidth of
scipy.stats.gaussian_kde
.- Returns
Kernel densities, a vector of length
n_x_vec
with the values of the density estimate at the grid points (x_vec
).Kernel density grid (
x_vec
), a vector of grid points over which the kernel density estimate is computed.Optimal bandwidth (Gaussian kernel assumed), returned only if
return_bandwidth
isTrue
.- Return type
Union[Tuple[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, float]]
Examples
import numpy as np import matplotlib.pyplot as plt from lightkde import kde_1d
sample_vec = [ -1.3145, -0.5197, 0.9326, 3.2358, 0.3814, -0.3226, 2.1121, 1.1357, 0.4376, -0.0332 ] density_vec, x_vec = kde_1d(sample_vec)
sample_vec = np.hstack((np.random.normal(loc=-8, size=100), np.random.normal(loc=-3, size=100), np.random.normal(loc=7, size=100))) density_vec, x_vec = kde_1d(sample_vec) plt.subplots() plt.plot(x_vec, density_vec) plt.show()
The kde bandwidth selection method is outlined in [1]. This implementation is based on the implementation of Daniel B. Smith [2] who based his implementation on the Matlab implementation by Zdravko Botev [3].
References
[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.
[2] https://github.com/Daniel-B-Smith/KDE-for-SciPy/blob/a9982909bbb92a7e243e5fc9a74f957d883f1c5d/kde.py # noqa: E501 Updated on: 6 Feb 2013.
[3] https://nl.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator # noqa: E501 Updated on: 30 Dec 2015.
- kde_2d(sample_mx, n_row_mx=256, xy_min=None, xy_max=None, weight_vec=None, return_bandwidth=False)[source]#
Fast and accurate state-of-the-art bivariate kernel density estimator with diagonal bandwidth matrix.
The kernel is assumed to be Gaussian. The two bandwidth parameters are chosen optimally without ever using/assuming a parametric model for the sample_vec or any βrules of thumbβ. Unlike many other procedures, this one is immune to accuracy failures in the estimation of multimodal densities with widely separated modes.
- Parameters
sample_mx (Union[numpy.ndarray, list]) β A 2D matrix of sample_vec from which the density estimate is constructed, the matrix must have two columns that represent the two coordinates (x,y) of the 2D sample_vec.
n_row_mx (int) β Number of points along each dimension (same for columns) where the estimate of the density will be returned, i.e. total number of points is
n_row_x_mx**2
.xy_min (Optional[Union[numpy.ndarray, Iterable]]) β The lower x and y boundaries of the interval over which the density estimate is constructed.
xy_max (Optional[Union[numpy.ndarray, Iterable]]) β The upper x and y boundaries of the interval over which the density estimate is constructed.
weight_vec (Optional[Union[numpy.ndarray, list]]) β Weights of sample points. This must have the same number of elements as rows in
sample_vec
, the same weight is applied to both coordinates of the samesample_vec
point. IfNone
(default), the samples are assumed to be equally weighted. The absolute value of the elements ofweight_vec
does not matter, only the values of elements relative to each other, i.e. multiplyingweight_vec
by a scalar does not change the results.return_bandwidth (bool) β Should the used bandwidth be returned?
- Raises
ValueError β If the number of columns in
sample_mx
is not two. Ifweight_vec
has at least one negative value.- Returns
Kernel densities, 2D matrix with the values of the density estimate at the grid points formed by
x_mx
andy_mx
.Kernel density grid (
x_mx
), the x coordinates of the grid points over which the density estimate is computed in the form of a 2D matrix that is the outcome ofnp.meshgrid
.Kernel density grid (
y_mx
), the x coordinates of the grid points over which the density estimate is computed in the form of a 2D matrix that is the outcome ofnp.meshgrid
.Optimal bandwidth (Gaussian kernel assumed), returned only if
return_bandwidth
isTrue
.- Return type
Union[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, float]]
Note
To ease testing and debugging the implementation very closely follows [2], i.e. [2] is assumed to be correct.
References
[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.
[2] https://nl.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation. # noqa: E501 Updated on: 30 Dec 2015.
References#
- Bot15a
Z.I. Botev. Kernel density estimator (last updated 2015-december-30). 2015. MATLAB Central File Exchange. URL: https://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator.
- Bot15b
Z.I. Botev. Kernel density estimation (last updated 2015-december-30). 2015. MATLAB Central File Exchange. URL: https://www.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation.
- BGK10
Z.I. Botev, J.F. Grotowski, and D.P. Kroese. Kernel density estimation via diffusion. Annals of Statistics, 38(5):2916β2957, 2010.
- Smi13
D.B. Smith. Kde-for-scipy (last updated 2013-february-06). 2013. GitHub. URL: https://github.com/Daniel-B-Smith/KDE-for-SciPy/blob/a9982909bbb92a7e243e5fc9a74f957d883f1c5d/kde.py.
For contributors#
π»
All contributions are welcome, feel free to make a pull request on GitHub.
Semantic versioning and automatic release using python-semantic-release.
ReadTheDocs for documenting.
pytest
for unit testing.