lightkde#

A lightning fast, lightweight, and reliable kernel density estimation.

  • Easy to use, e.g. density_vec, x_vec = kde_1d(sample_vec=sample).

  • Works with 1d and 2d samples.

  • Works with weighted samples as well.

  • Based on the MATLAB implementations of Botev: [Botev, 2015], [Botev, 2015].

alt text

Installation#

πŸ₯š 🐣 πŸ₯

pip#

You can install the most recent stable version by using:

pip install lightkde

From the source#

Clone the repository and install the package locally using (run from the root of the repository):

pip install .
pip install -e .[tests,lint_type_checks,docs]

Examples#

πŸ“–

Note

The examples are selected to show how to use lightkde and to illustrate some cases where it outperforms rule-of-thumb bandwidth selection based approaches like scipy.stats.gaussian_kde. For many cases, the rule-of-thumb bandwidth selection approaches work well.

1D bimodal example

1D bimodal example

1D bimodal example
1D truncated unimodal example

1D truncated unimodal example

1D truncated unimodal example
2D bimodal example

2D bimodal example

2D bimodal example
2D truncated unimodal example

2D truncated unimodal example

2D truncated unimodal example

1D bimodal example#

This example shows how to use lightkde.kde_1d and how it compares to scipy.stats.gaussian_kde for a bimodal univariate case.

Import packages:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, norm

from lightkde import kde_1d

Generate synthetic data from two univariate normal distributions:

np.random.seed(42)
sample = np.hstack((norm.rvs(size=2_000), 0.3 * norm.rvs(size=1_000) + 5))

Estimate kernel density using lightkde:

density_vec, x_vec = kde_1d(sample_vec=sample)

Estimate kernel density using scipy:

gkde = gaussian_kde(dataset=sample)
scipy_density_vec = gkde.evaluate(x_vec)

Plot the data against the kernel density estimates:

plt.plot(x_vec, density_vec, "--r", label="lightkde")
plt.plot(x_vec, scipy_density_vec, label="scipy.stats.gaussian_kde")
plt.hist(sample, bins=100, density=True, alpha=0.5, label="data")
plt.legend()
plt.show()
plot bimodal kde 1d

The scipy method oversmooths the kernel density and it is far from the histogram of the data that it is expected to follow.

Total running time of the script: ( 0 minutes 1.862 seconds)

1D truncated unimodal example#

This example shows how to use lightkde.kde_1d with and without a limit and how it compares to scipy.stats.gaussian_kde for a truncated unimodal distribution.

Import packages

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, norm

from lightkde import kde_1d

Generate synthetic data from a univariate normal distribution and truncate it:

np.random.seed(42)
sample = norm.rvs(size=2_000)
sample = sample[sample > 0]

Estimate kernel density using lightkde:

density_vec_without_x_min, x_vec_without_x_min = kde_1d(sample_vec=sample)
density_vec_with_x_min, x_vec_with_x_min = kde_1d(sample_vec=sample, x_min=0)

Estimate kernel density using scipy:

gkde = gaussian_kde(dataset=sample)
scipy_density_vec = gkde.evaluate(x_vec_without_x_min)

Plot the data against the kernel density estimates:

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x_vec_with_x_min, density_vec_with_x_min, "--r", label="lightkde; with x_min=0")
ax.plot(
    x_vec_without_x_min,
    density_vec_without_x_min,
    ":r",
    label="lightkde; without x_min",
)
ax.plot(
    x_vec_without_x_min, scipy_density_vec, zorder=1, label="scipy.stats.gaussian_kde"
)
ax.hist(sample, bins=30, density=True, alpha=0.5, label="data")
ax.legend()
plt.show()
plot truncated unimodal kde 1d

When x_min=0 is used, lightkde approximates well the histogram of the data. Without this option it behaves similarly to the scipy method, both extend to a region without data.

Total running time of the script: ( 0 minutes 1.939 seconds)

2D bimodal example#

This example shows how to use lightkde.kde_2d and how it compares to scipy.stats.gaussian_kde for a bimodal bivariate case.

Import packages:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

from lightkde import kde_2d

Generate synthetic data from two univariate normal distributions:

np.random.seed(42)
sample = np.vstack(
    (
        multivariate_normal.rvs(mean=[0, 0], cov=0.3, size=2000),
        multivariate_normal.rvs(
            mean=[2, 2], cov=[[0.5, -0.48], [-0.48, 0.5]], size=2000
        ),
    )
)

Estimate kernel density using lightkde:

density_mx, x_mx, y_mx = kde_2d(sample_mx=sample)

Estimate kernel density using scipy:

gkde = gaussian_kde(dataset=sample.T)
xy_mx = np.hstack((x_mx.reshape(-1, 1), y_mx.reshape(-1, 1)))
scipy_density_mx = gkde.evaluate(xy_mx.T).reshape(x_mx.shape)

Plot the data against the kernel density estimates:

# pre-process
bins = (30, 30)
data_density_mx, xedges, yedges = np.histogram2d(
    sample[:, 0], sample[:, 1], bins=bins, density=True
)
z_min = 0
z_max = np.max(data_density_mx)
x_min, x_max = min(xedges), max(xedges)
y_min, y_max = min(yedges), max(yedges)

# plot
cmap = "afmhot_r"
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 3), sharex="all", sharey="all")

# data
h = ax1.hist2d(
    sample[:, 0],
    sample[:, 1],
    bins=bins,
    density=True,
    vmin=z_min,
    vmax=z_max,
    cmap=cmap,
)[-1]
ax1.set_title("data")

# lightkde
ax2.contourf(x_mx, y_mx, density_mx, levels=50, vmin=z_min, vmax=z_max, cmap=cmap)
ax2.set_title("lightkde")

# scipy
ax3.contourf(x_mx, y_mx, scipy_density_mx, levels=50, vmin=z_min, vmax=z_max, cmap=cmap)
ax3.set_xlim(x_min, x_max)
ax3.set_ylim(y_min, y_max)
ax3.set_title("scipy.stats.gaussian_kde")

fig.subplots_adjust(right=0.89)
cbar_ax = fig.add_axes([0.90, 0.15, 0.05, 0.7], aspect=30)
fig.colorbar(h, cax=cbar_ax, label="density")
plt.show()
data, lightkde, scipy.stats.gaussian_kde

The scipy method oversmooths the kernel density and it is far from the histogram of the data that it is expected to follow.

Total running time of the script: ( 0 minutes 5.966 seconds)

2D truncated unimodal example#

This example shows how to use lightkde.kde_2d with and without a limit and how it compares to scipy.stats.gaussian_kde for a truncated unimodal bivariate distribution.

Import packages

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

from lightkde import kde_2d

Generate synthetic data from a univariate normal distribution and truncate it:

np.random.seed(42)
sample = multivariate_normal.rvs(mean=[0, 0], size=8000)
sample = sample[sample[:, 0] > 0]

Estimate kernel density using lightkde:

density_mx_without_x_min, x_mx_without_x_min, y_mx_without_x_min = kde_2d(
    sample_mx=sample
)
density_mx_with_x_min, x_mx_with_x_min, y_mx_with_x_min = kde_2d(
    sample_mx=sample, xy_min=[0, -5]
)

Estimate kernel density using scipy:

gkde = gaussian_kde(dataset=sample.T)
xy_mx = np.hstack(
    (x_mx_without_x_min.reshape(-1, 1), y_mx_without_x_min.reshape(-1, 1))
)
scipy_density_mx = gkde.evaluate(xy_mx.T).reshape(x_mx_without_x_min.shape)

Plot the data against the kernel density estimates:

bins = (30, 30)
data_density_mx, xedges, yedges = np.histogram2d(
    sample[:, 0], sample[:, 1], bins=bins, density=True
)
z_min = 0
z_max = np.max(data_density_mx)
x_min, x_max = -0.5, max(xedges)
y_min, y_max = min(yedges), max(yedges)

# plot
cmap = "afmhot_r"
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(
    2, 2, figsize=(8, 7), sharex="all", sharey="all"
)

# data
h = ax1.hist2d(
    sample[:, 0],
    sample[:, 1],
    bins=bins,
    density=True,
    vmin=z_min,
    vmax=z_max,
    cmap=cmap,
)[-1]
ax1.set_title("data")

# scipy
ax2.contourf(
    x_mx_without_x_min,
    y_mx_without_x_min,
    scipy_density_mx,
    levels=50,
    vmin=z_min,
    vmax=z_max,
    cmap=cmap,
)
ax2.set_xlim(x_min, x_max)
ax3.set_ylim(y_min, y_max)
ax2.set_title("scipy.stats.gaussian_kde")

# lightkde
ax3.contourf(
    x_mx_with_x_min,
    y_mx_with_x_min,
    density_mx_with_x_min,
    levels=50,
    vmin=z_min,
    vmax=z_max,
    cmap=cmap,
)
ax3.set_title("lightkde; with x_min=0")

ax4.contourf(
    x_mx_without_x_min,
    y_mx_without_x_min,
    density_mx_without_x_min,
    levels=50,
    vmin=z_min,
    vmax=z_max,
    cmap=cmap,
)
ax4.set_title("lightkde; without x_min")

fig.subplots_adjust(right=0.80)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7], aspect=50)
fig.colorbar(h, cax=cbar_ax, label="density")
plt.show()
data, scipy.stats.gaussian_kde, lightkde; with x_min=0, lightkde; without x_min

When x_min=0 is used, lightkde approximates well the histogram of the data.

Total running time of the script: ( 0 minutes 6.354 seconds)

Methods#

πŸ“”

Overview#

Diffusion based kernel density estimation after [Botev et al., 2010]. This Python implementation is based on Botev’s MATLAB implementations for 1d and 2d samples, [Botev, 2015] and [Botev, 2015] respectively. The 1d python implementation is influenced by [Smith, 2013].

A brief description of the implemented method:

Reliable and extremely fast kernel density estimator for one-dimensional data; Gaussian kernel is assumed and the bandwidth is chosen automatically; Unlike many other implementations, this one is immune to problems caused by multimodal densities with widely separated modes (see example). The estimation does not deteriorate for multimodal densities, because we never assume a parametric model for the data (like those used in rules of thumb).

[Botev, 2015]

Differences to the MATLAB implementation#

  • Extended to be able to handle weighted samples.

  • Extensive testing (mostly against the MATLAB implementation that is assumed to be correct).

  • Additional examples (Examples).

API reference#

πŸ“

This page documents the functions provided by the lightkde package.

kde_1d(sample_vec, n_x_vec=16384, x_min=None, x_max=None, weight_vec=None, return_bandwidth=False)[source]#

Reliable and extremely fast kernel density estimator for one-dimensional sample.

Gaussian kernel is assumed and the bandwidth is chosen automatically. Unlike many other implementations, this one is immune to problems caused by multimodal densities with widely separated modes. The estimation does not deteriorate for multimodal densities, because we never assume a parametric model for the sample.

Note

  • The elements of sample_vec that fall between x_min and x_max will be treated as the full sample, i.e. the kernel density over [x_min, x_max] will integrate to one.

  • If the search for finding the optimal bandwidth fails the functions falls back to scipy.stats.gaussian_kde.

Parameters
  • sample_vec (Union[numpy.ndarray, list]) – A vector of sample points from which the density estimate is constructed.

  • n_x_vec (int) – The number of x_vec points used in the uniform discretization of the interval [x_min, x_max]. n_x_vec has to be a power of two. If n_x_vec is not a power of two, then n_x_vec is rounded up to the next power of two, i.e., n_x_vec is set to n_x_vec=2**ceil(log2(n_x_vec)); the default value of n_x_vec is n_x_vec=2**14.

  • x_min (Optional[Union[int, float]]) – The lower boundary of the interval over which the density estimate is constructed.

  • x_max (Optional[Union[int, float]]) – The upper boundary of the interval over which the density estimate is constructed.

  • weight_vec (Optional[Union[numpy.ndarray, list]]) – Weights of sample points. This must have the same shape as sample_vec. If None (default), the samples are assumed to be equally weighted. Only the values of elements relative to each other matter, i.e. multiplying weight_vec by a non-negative scalar does not change the results.

  • return_bandwidth (bool) – Should the used bandwidth be returned?

Raises

ValueError – If weight_vec has at least one negative value.

Warns

Root finding failed (Brent’s method) – Optimal bandwidth finding failed, falling back to the rule-of-thumb bandwidth of scipy.stats.gaussian_kde.

Returns

Kernel densities, a vector of length n_x_vec with the values of the density estimate at the grid points (x_vec).

Kernel density grid (x_vec), a vector of grid points over which the kernel density estimate is computed.

Optimal bandwidth (Gaussian kernel assumed), returned only if return_bandwidth is True.

Return type

Union[Tuple[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, float]]

Examples

import numpy as np
import matplotlib.pyplot as plt
from lightkde import kde_1d
sample_vec = [
    -1.3145, -0.5197, 0.9326, 3.2358, 0.3814,
    -0.3226, 2.1121, 1.1357, 0.4376, -0.0332
]
density_vec, x_vec = kde_1d(sample_vec)
sample_vec = np.hstack((np.random.normal(loc=-8, size=100),
    np.random.normal(loc=-3, size=100),
    np.random.normal(loc=7, size=100)))
density_vec, x_vec = kde_1d(sample_vec)

plt.subplots()
plt.plot(x_vec, density_vec)
plt.show()

The kde bandwidth selection method is outlined in [1]. This implementation is based on the implementation of Daniel B. Smith [2] who based his implementation on the Matlab implementation by Zdravko Botev [3].

References

[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.

[2] https://github.com/Daniel-B-Smith/KDE-for-SciPy/blob/a9982909bbb92a7e243e5fc9a74f957d883f1c5d/kde.py # noqa: E501 Updated on: 6 Feb 2013.

[3] https://nl.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator # noqa: E501 Updated on: 30 Dec 2015.

kde_2d(sample_mx, n_row_mx=256, xy_min=None, xy_max=None, weight_vec=None, return_bandwidth=False)[source]#

Fast and accurate state-of-the-art bivariate kernel density estimator with diagonal bandwidth matrix.

The kernel is assumed to be Gaussian. The two bandwidth parameters are chosen optimally without ever using/assuming a parametric model for the sample_vec or any β€œrules of thumb”. Unlike many other procedures, this one is immune to accuracy failures in the estimation of multimodal densities with widely separated modes.

Parameters
  • sample_mx (Union[numpy.ndarray, list]) – A 2D matrix of sample_vec from which the density estimate is constructed, the matrix must have two columns that represent the two coordinates (x,y) of the 2D sample_vec.

  • n_row_mx (int) – Number of points along each dimension (same for columns) where the estimate of the density will be returned, i.e. total number of points is n_row_x_mx**2.

  • xy_min (Optional[Union[numpy.ndarray, Iterable]]) – The lower x and y boundaries of the interval over which the density estimate is constructed.

  • xy_max (Optional[Union[numpy.ndarray, Iterable]]) – The upper x and y boundaries of the interval over which the density estimate is constructed.

  • weight_vec (Optional[Union[numpy.ndarray, list]]) – Weights of sample points. This must have the same number of elements as rows in sample_vec, the same weight is applied to both coordinates of the same sample_vec point. If None (default), the samples are assumed to be equally weighted. The absolute value of the elements of weight_vec does not matter, only the values of elements relative to each other, i.e. multiplying weight_vec by a scalar does not change the results.

  • return_bandwidth (bool) – Should the used bandwidth be returned?

Raises

ValueError – If the number of columns in sample_mx is not two. If weight_vec has at least one negative value.

Returns

Kernel densities, 2D matrix with the values of the density estimate at the grid points formed by x_mx and y_mx.

Kernel density grid (x_mx), the x coordinates of the grid points over which the density estimate is computed in the form of a 2D matrix that is the outcome of np.meshgrid.

Kernel density grid (y_mx), the x coordinates of the grid points over which the density estimate is computed in the form of a 2D matrix that is the outcome of np.meshgrid.

Optimal bandwidth (Gaussian kernel assumed), returned only if return_bandwidth is True.

Return type

Union[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, float]]

Note

To ease testing and debugging the implementation very closely follows [2], i.e. [2] is assumed to be correct.

References

[1] Z. I. Botev, J. F. Grotowski, and D. P. Kroese (2010) Annals of Statistics, Volume 38, Number 5, pages 2916-2957.

[2] https://nl.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation. # noqa: E501 Updated on: 30 Dec 2015.

References#

Bot15a

Z.I. Botev. Kernel density estimator (last updated 2015-december-30). 2015. MATLAB Central File Exchange. URL: https://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator.

Bot15b

Z.I. Botev. Kernel density estimation (last updated 2015-december-30). 2015. MATLAB Central File Exchange. URL: https://www.mathworks.com/matlabcentral/fileexchange/17204-kernel-density-estimation.

BGK10

Z.I. Botev, J.F. Grotowski, and D.P. Kroese. Kernel density estimation via diffusion. Annals of Statistics, 38(5):2916–2957, 2010.

Smi13

D.B. Smith. Kde-for-scipy (last updated 2013-february-06). 2013. GitHub. URL: https://github.com/Daniel-B-Smith/KDE-for-SciPy/blob/a9982909bbb92a7e243e5fc9a74f957d883f1c5d/kde.py.

For contributors#

πŸ’»

All contributions are welcome, feel free to make a pull request on GitHub.