fix(onebrc lane B): dispatch SIMD width (U8x64 zmm / U8x32 ymm) instead of hardcoding 32 #636
… stride
The scan hardcoded `array_chunks::<u8, 32>` + `U8x32` throughout, pinning
the delimiter walk to 32-byte `ymm` (AVX2) regardless of target-cpu — so
under `target-cpu=x86-64-v4`/`native` it strided `ymm`, never the 64-byte
`zmm` the AVX-512 build provides. (The probe's `.cargo/config.toml` v3 pin
is a deliberate CI-parity choice; this is about honoring native/v4 when a
run opts into it — "here v4 or native is a must".)
- `SimdByte` = compile-time width alias: `U8x64` under
`cfg(target_feature = "avx512f")`, `U8x32` otherwise. Both are
`ndarray::simd` types (iron rule; no raw intrinsic). `cmpeq_mask`
returns `u64`/`u32` respectively; the set-bit walk was already generic
over the mask width, so the body is unchanged apart from the alias.
- `array_chunks::<u8, { SimdByte::LANES }>` — the const-generic tracks the
dispatched width; `aligned_end`, `pos`, needles, and `from_slice` all key
off `SimdByte::LANES`. No literal stride remains.
- Module + fn docs rewritten to describe the dispatch (64-byte zmm avx512
/ 32-byte ymm avx2) instead of asserting a fixed 32.
- Test `..._straddle_32_byte_block_boundaries` → `..._straddle_block_boundaries`,
now asserts crossing at the dispatched `lane_b::SIMD_LANES` (test-gated
const) instead of a literal `/ 32`; the 68-byte corpus straddles a
boundary at BOTH widths (`long_name` @32, `Vv` @64), so coverage holds
either way.
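The width dispatch and the mask-width-agnostic set-bit walk described in the bullets above can be sketched in plain `std` Rust. This is a minimal illustration under stated assumptions, not the lane-B code: `LANES` stands in for `SimdByte::LANES`, the scalar `cmpeq_mask` stands in for the `ndarray::simd` method of the same name, `chunks_exact` replaces `array_chunks`, and `delimiter_offsets` is a hypothetical name.

```rust
// Hypothetical stand-in for the dispatched width. The real code aliases an
// `ndarray::simd` vector type and reads its `LANES` constant instead.
#[cfg(target_feature = "avx512f")]
pub const LANES: usize = 64;
#[cfg(not(target_feature = "avx512f"))]
pub const LANES: usize = 32;

/// Scalar stand-in for a byte-compare mask: bit `i` is set when
/// `block[i] == needle`. A `u64` can hold the mask at either width.
fn cmpeq_mask(block: &[u8], needle: u8) -> u64 {
    block
        .iter()
        .enumerate()
        .fold(0u64, |m, (i, &b)| m | (u64::from(b == needle) << i))
}

/// Collects the offset of every `needle` byte: full blocks first, then a
/// scalar tail for the bytes past the last full block.
pub fn delimiter_offsets(data: &[u8], needle: u8) -> Vec<usize> {
    let mut out = Vec::new();
    let aligned_end = data.len() - data.len() % LANES;
    for (block_idx, block) in data[..aligned_end].chunks_exact(LANES).enumerate() {
        let mut mask = cmpeq_mask(block, needle);
        // Set-bit walk: record the lowest set bit, then clear it.
        while mask != 0 {
            let bit = mask.trailing_zeros() as usize;
            out.push(block_idx * LANES + bit);
            mask &= mask - 1;
        }
    }
    for (i, &b) in data[aligned_end..].iter().enumerate() {
        if b == needle {
            out.push(aligned_end + i);
        }
    }
    out
}
```

Because every stride and offset is derived from `LANES`, the same body yields identical offsets at either width; only the number of blocks changes.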
Verified both arms: v3 default (U8x32, 32B) and `RUSTFLAGS=-Ctarget-cpu=native`
(U8x64, 64B zmm on this avx512f host) — 16/16 lane-b tests byte-parity with
lane A, clippy `-D warnings` clean (lib + all-targets) on both, fmt clean.
README/FINDINGS narrative on the v3-pin correction is deferred to the
parallel session's §5.5 to avoid clobbering its in-flight README edits.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 45623c2993
#[cfg(target_feature = "avx512f")]
use ndarray::simd::U8x64 as SimdByte;
Require AVX512BW before selecting `U8x64`
On targets that advertise `avx512f` without `avx512bw` (for example `-Ctarget-cpu=knl` or a manual `-Ctarget-feature=+avx512f` build), this alias selects `U8x64`, and the scan later calls the byte `cmpeq_mask`. In the ndarray fork that method is implemented with the AVX-512 byte-compare intrinsic, which needs AVX512BW, so lane B can execute an unsupported instruction instead of falling back to the 32-byte path. Please gate the 64-byte alias on both `avx512f` and `avx512bw`.
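One way the suggested gate could look, as a sketch only: the constants below stand in for the `ndarray::simd` aliases so the snippet compiles without the fork, and `SIMD_LANES` and `selected_width` are illustrative names rather than the crate's API. In the real module the same `cfg(all(...))` predicate would presumably sit on the `use ndarray::simd::U8x64 as SimdByte;` line, with its negation on the `U8x32` arm.

```rust
// 64-byte arm only when BOTH features are enabled at compile time, so an
// avx512f-only target (e.g. `-Ctarget-cpu=knl`) falls back to 32 bytes.
#[cfg(all(target_feature = "avx512f", target_feature = "avx512bw"))]
pub const SIMD_LANES: usize = 64;
#[cfg(not(all(target_feature = "avx512f", target_feature = "avx512bw")))]
pub const SIMD_LANES: usize = 32;

/// Reports which arm this build selected, using the same predicate.
pub fn selected_width() -> &'static str {
    if cfg!(all(target_feature = "avx512f", target_feature = "avx512bw")) {
        "64-byte zmm"
    } else {
        "32-byte ymm"
    }
}
```

Using the exact negation of the predicate on the fallback arm keeps the two arms mutually exclusive, so exactly one definition exists in any build.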
Follow-up on merged #635. The lane-B delimiter scan hardcoded `array_chunks::<u8, 32>` + `U8x32` throughout, pinning the walk to 32-byte `ymm` (AVX2) regardless of target-cpu, so under `target-cpu=x86-64-v4`/`native` it never strided the 64-byte `zmm` the AVX-512 build provides. (The probe's `.cargo/config.toml` v3 pin stays; it's a deliberate CI-parity choice. This only makes lane B honor native/v4 when a run opts in.)
Change
- `SimdByte` = compile-time width alias: `U8x64` under `cfg(target_feature = "avx512f")`, `U8x32` otherwise. Both are `ndarray::simd` types (the "all SIMD from `ndarray::simd`" iron rule; no raw intrinsic). `cmpeq_mask` returns `u64`/`u32` respectively; the set-bit walk was already generic over the mask width, so the body is unchanged apart from the alias.
- `array_chunks::<u8, { SimdByte::LANES }>`: the const-generic tracks the dispatched width; `aligned_end`, `pos`, needles, and `from_slice` all key off `SimdByte::LANES`. No literal stride remains.
- Module + fn docs rewritten to describe the dispatch (64-byte `zmm` avx512 / 32-byte `ymm` avx2) instead of asserting a fixed 32.
- Test `..._straddle_32_byte_block_boundaries` → `..._straddle_block_boundaries`, now asserting the crossing at the dispatched `lane_b::SIMD_LANES` (test-gated const) instead of `/ 32`. The 68-byte corpus straddles a boundary at both widths (`long_name` @32, `Vv` @64), so cross-block-carry coverage holds either way.
Verification
Both arms, from the crate dir (`onebrc-probe` builds standalone):
- v3 default (`U8x32`, 32-byte `ymm`): 16/16 lane-b tests byte-parity with lane A; clippy `-D warnings` clean (lib + all-targets); fmt clean.
- `RUSTFLAGS=-Ctarget-cpu=native` (`U8x64`, 64-byte `zmm` on an `avx512f` host): 16/16; clippy clean (all-targets).
`README`/`FINDINGS` narrative on the v3-pin correction is intentionally left to the parallel session's §5.5 to avoid clobbering its in-flight edits.
🤖 Generated with Claude Code
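The straddle assertion in the renamed test boils down to a block-index comparison. A minimal sketch of that check follows, assuming a hypothetical helper `straddles_block`; the byte ranges in the usage note are illustrative and are not the offsets of the real 68-byte corpus.

```rust
/// Returns true when the non-empty byte range `start..end` (end exclusive)
/// begins in one `lanes`-byte block and ends in a later one.
pub fn straddles_block(start: usize, end: usize, lanes: usize) -> bool {
    assert!(start < end && lanes > 0, "range must be non-empty, lanes > 0");
    // Compare the block index of the first byte with that of the last byte.
    start / lanes != (end - 1) / lanes
}
```

For example, a field occupying bytes `28..37` straddles a boundary at 32 lanes but not at 64, while a two-byte field at `63..65` straddles at both widths. A corpus containing one field of each kind therefore exercises the cross-block carry whichever width the build dispatches to.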
https://claude.ai/code/session_01MLBnPuScZy6w9di2QEjsXM