speed up the encoder kde by not calling np.histogram on the hot path #16

Open
jl33-ai wants to merge 1 commit into LorenFrankLab:main from jl33-ai:hotpath-bench-and-kernel-prealloc
@jl33-ai jl33-ai commented Jun 10, 2026

Contributor

Makes the per-spike encoder kde about 3.3x faster: the median goes from 2096 to 635 microseconds with a full 50k mark buffer.

  • The cost was np.histogram. Given explicit bin edges it does a binary search per sample. Our position bins are evenly spaced integers, so a direct index plus np.bincount does the same job much faster.
  • Also preallocates the kde scratch arrays instead of allocating about a dozen temporary arrays per spike, and fuses the squared distance into a single np.einsum pass.
  • Same result to within floating point rounding, about 3e-13. Set encoder.mark_kernel.exact_histogram to true to restore the exact original numerics.
  • Verified by benchmarks/bench_hotpath.py --verify, which matches the old code on every spike of a 400 spike run.
  • The decoder posterior was measured too (65 to 173 microseconds per bin) and left alone because it is not a bottleneck.
  • Adds the benchmark and docs/latency_analysis.md, which shows end to end latency is dominated by the time bin width and the delay_samples jitter buffer, not the compute.
  • Heads up: the fast path is on by default, so output shifts at the 1e-13 level. Happy to make it opt in instead if you prefer.

@jl33-ai jl33-ai force-pushed the hotpath-bench-and-kernel-prealloc branch from 963dbd4 to d528ecf Compare June 10, 2026 00:23
@jl33-ai jl33-ai changed the title from "speed up encoder KDE by dropping np.histogram on the hot path" to "speed up the encoder kde by not calling np.histogram on the hot path" Jun 10, 2026
@jl33-ai jl33-ai force-pushed the hotpath-bench-and-kernel-prealloc branch from d528ecf to b5ea273 Compare June 10, 2026 00:59
The per spike kde in get_joint_prob was spending about two thirds of its
time inside np.histogram. When you give np.histogram an explicit array of
bin edges it cannot assume the bins are evenly spaced, so it does a binary
search for every sample to find its bin. Our position bins are always the
evenly spaced integers 0 to num_bins, so the bin is just the integer part
of the position. Working it out directly and summing with np.bincount does
the same thing and is a lot faster.
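
A minimal sketch of the idea above, not the code in this PR. The function names, the random test data and the choice of 40 bins are hypothetical, and it assumes every position lies in the range 0 to num_bins. The one subtlety is that np.histogram treats the last bin as closed on the right, so the direct index is clamped to keep a position exactly equal to num_bins in the final bin.

```python
import numpy as np


def binned_weights_histogram(positions, weights, num_bins):
    # Reference path: explicit integer edges 0, 1, ..., num_bins.
    edges = np.arange(num_bins + 1)
    hist, _ = np.histogram(positions, bins=edges, weights=weights)
    return hist


def binned_weights_bincount(positions, weights, num_bins):
    # Fast path: with unit-width bins starting at 0, the bin index is the
    # integer part of the (non-negative) position.
    idx = positions.astype(np.intp)
    # np.histogram closes the last bin on the right, so a position equal to
    # num_bins belongs in bin num_bins - 1 rather than out of range.
    np.minimum(idx, num_bins - 1, out=idx)
    return np.bincount(idx, weights=weights, minlength=num_bins)


# Hypothetical data: 50k positions in [0, num_bins) with random weights.
rng = np.random.default_rng(0)
num_bins = 40
positions = rng.uniform(0, num_bins, size=50_000)
weights = rng.random(50_000)

ref = binned_weights_histogram(positions, weights, num_bins)
fast = binned_weights_bincount(positions, weights, num_bins)
```

The two paths add the weights in a different order, so the comparison should be made with a tolerance (for example np.allclose) rather than exact equality.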

Two more changes on top of that. I preallocate the scratch arrays the kde
uses instead of making a fresh set on every spike, and I fuse the squared
distance into a single np.einsum pass instead of squaring and then summing.
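
The two changes in this paragraph could look roughly like the sketch below. It is an illustration under stated assumptions rather than the PR's implementation: the class name SquaredDistanceScratch, its attributes, and the float64 (n_marks, n_features) layout are all hypothetical.

```python
import numpy as np


class SquaredDistanceScratch:
    """Hypothetical helper: scratch buffers are allocated once and reused per spike."""

    def __init__(self, capacity, n_features):
        self.diff = np.empty((capacity, n_features))
        self.sq_dist = np.empty(capacity)

    def compute(self, marks, spike):
        n = marks.shape[0]
        diff = self.diff[:n]    # view into the preallocated buffer
        out = self.sq_dist[:n]  # view into the preallocated buffer
        # Write the difference into scratch instead of a new temporary.
        np.subtract(marks, spike, out=diff)
        # Fused square-and-sum over the feature axis in a single pass.
        np.einsum("ij,ij->i", diff, diff, out=out)
        # The result is a view that is overwritten by the next call.
        return out


rng = np.random.default_rng(1)
marks = rng.normal(size=(1000, 4))
spike = rng.normal(size=4)

scratch = SquaredDistanceScratch(capacity=50_000, n_features=4)
sq_dist = scratch.compute(marks, spike)
naive = ((marks - spike) ** 2).sum(axis=1)
```

Because the returned array aliases the scratch buffer, a caller that needs the values after the next spike would have to copy them first.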

On a full 50k mark buffer the median per spike time goes from about 2096 to
635 microseconds, roughly 3.3 times faster.
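
For reference, the speedup figure follows directly from the two medians, and a median per-call time can be measured along the lines of the sketch below. The helper median_microseconds is hypothetical and is not the benchmark script added by this PR.

```python
import statistics
import time


def median_microseconds(fn, repeats=200):
    # Time each call separately and report the median, which is less
    # sensitive to occasional slow outliers than the mean.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.median(samples)


# Speedup implied by the reported medians (microseconds per spike).
before_us, after_us = 2096, 635
speedup = before_us / after_us  # about 3.3
```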

The result is the same to within floating point rounding, about 3e-13. The
config option encoder.mark_kernel.exact_histogram restores the exact original
numerics if you ever need them back.

Adds a small benchmark (benchmarks/bench_hotpath.py) and a writeup (docs/latency_analysis.md).
@jl33-ai jl33-ai force-pushed the hotpath-bench-and-kernel-prealloc branch from b5ea273 to 6a85435 Compare June 10, 2026 01:05