speed up the encoder kde by not calling np.histogram on the hot path by jl33-ai · Pull Request #16 · LorenFrankLab/realtime_decoder

jl33-ai · 2026-06-10T00:17:09Z

Makes the per-spike encoder kde about 3.3x faster: the median goes from 2096 to 635 microseconds at a full 50k mark buffer.

The cost was np.histogram. Given explicit bin edges it cannot assume even spacing, so it has to sort and search the samples against the edges. Our position bins are evenly spaced integers, so a direct index plus np.bincount does the same job much faster.
Also preallocates the kde scratch instead of allocating about a dozen temporary arrays per spike, and fuses the squared distance into a single np.einsum pass.
Same result to within floating point rounding, about 3e-13. Set encoder.mark_kernel.exact_histogram to true to restore the exact original numerics.
Verified by benchmarks/bench_hotpath.py --verify, which matches the old code on every spike of a 400 spike run.
The decoder posterior was measured too (65 to 173 microseconds per bin) and left alone; it is not a bottleneck.
Adds the benchmark and docs/latency_analysis.md, which shows that end-to-end latency is dominated by the time bin width and the delay_samples jitter buffer, not by the compute.
Heads up: the fast path is on by default, so output shifts at the 1e-13 level. Happy to make it opt-in instead if you prefer.
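
For reviewers who want to see the histogram replacement in isolation, here is a minimal sketch, not the code in this PR. The function names, the weights argument, and the assumption that every position lies in [0, num_bins] are illustrative. It compares np.histogram with explicit integer edges against direct indexing plus np.bincount.

```python
import numpy as np


def hist_with_edges(positions, weights, num_bins):
    """Reference path: np.histogram with an explicit array of integer edges."""
    edges = np.arange(num_bins + 1)
    hist, _ = np.histogram(positions, bins=edges, weights=weights)
    return hist


def hist_with_bincount(positions, weights, num_bins):
    """Fast path sketch: the bin is the integer part of the position.

    Assumes 0 <= position <= num_bins for every sample (hypothetical
    precondition; out-of-range values are not handled here).
    """
    # Truncation equals floor for non-negative values; astype makes a copy.
    idx = positions.astype(np.intp)
    # np.histogram closes the last bin on the right, so a position exactly
    # equal to num_bins belongs in the last bin rather than past the end.
    np.minimum(idx, num_bins - 1, out=idx)
    return np.bincount(idx, weights=weights, minlength=num_bins)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    positions = rng.uniform(0, 40, size=50_000)
    weights = rng.random(50_000)
    slow = hist_with_edges(positions, weights, 40)
    fast = hist_with_bincount(positions, weights, 40)
    print(np.max(np.abs(slow - fast)))
```

The two paths add the weights in a different order, so they agree only to within floating-point rounding, which is why the comparison should use a tolerance rather than exact equality.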

The per-spike kde in get_joint_prob was spending about two thirds of its time inside np.histogram. When you give np.histogram an explicit array of bin edges, it cannot assume the bins are evenly spaced, so it has to sort and search the samples against the edges instead of computing each bin directly. Our position bins are always the evenly spaced integers 0 to num_bins, so the bin is just the integer part of the position. Working it out directly and summing with np.bincount does the same thing and is a lot faster.

Two more changes on top of that. I preallocate the scratch arrays the kde uses instead of making a fresh set on every spike, and I fuse the squared distance into a single np.einsum pass instead of squaring and then summing.

On a full 50k mark buffer the median per-spike time goes from about 2096 to 635 microseconds, roughly 3.3 times faster. The result is the same to within floating-point rounding, about 3e-13. The config option encoder.mark_kernel.exact_histogram restores the exact original numerics if you ever need them back.

This PR also adds a small benchmark in benchmarks and a writeup in docs.
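
The preallocation and the fused squared distance can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the KdeScratch class, its field names, and the (capacity, mark_dim) buffer layout are hypothetical.

```python
import numpy as np


class KdeScratch:
    """Hypothetical scratch buffers, allocated once and reused for every spike."""

    def __init__(self, capacity, mark_dim):
        self.diff = np.empty((capacity, mark_dim))
        self.sq_dist = np.empty(capacity)


def squared_distances(marks, spike, scratch):
    """Squared Euclidean distance from one spike's mark to each stored mark.

    Writes into the preallocated buffers and returns a view of
    scratch.sq_dist, so the result is overwritten by the next call.
    """
    n = marks.shape[0]
    diff = scratch.diff[:n]
    sq = scratch.sq_dist[:n]
    # Difference goes straight into the scratch buffer, no temporary array.
    np.subtract(marks, spike, out=diff)
    # One pass: multiply elementwise and sum over the mark dimension.
    np.einsum("ij,ij->i", diff, diff, out=sq)
    return sq


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    marks = rng.normal(size=(50_000, 4))
    spike = rng.normal(size=4)
    scratch = KdeScratch(capacity=50_000, mark_dim=4)
    fused = squared_distances(marks, spike, scratch)
    naive = ((marks - spike) ** 2).sum(axis=1)
    print(np.max(np.abs(fused - naive)))
```

Because the returned array is a view into the scratch buffer, a caller that needs to keep the distances past the next spike has to copy them.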

jl33-ai force-pushed the hotpath-bench-and-kernel-prealloc branch from 963dbd4 to d528ecf Compare June 10, 2026 00:23

jl33-ai changed the title ~~speed up encoder KDE by dropping np.histogram on the hot path~~ speed up the encoder kde by not calling np.histogram on the hot path Jun 10, 2026

jl33-ai force-pushed the hotpath-bench-and-kernel-prealloc branch from d528ecf to b5ea273 Compare June 10, 2026 00:59

jl33-ai force-pushed the hotpath-bench-and-kernel-prealloc branch from b5ea273 to 6a85435 Compare June 10, 2026 01:05

jl33-ai wants to merge 1 commit into LorenFrankLab:main from jl33-ai:hotpath-bench-and-kernel-prealloc
