Merged
20 changes: 20 additions & 0 deletions .claude/board/EPIPHANIES.md
@@ -1,3 +1,23 @@
## 2026-07-02 — E-1BRC-GRIDLAKE-SWEETSPOT-1: the 64×64 gridlake SoA is the measured sweet spot — the batch pipeline at tile scale equals the best streamed topology while carrying the double-WAL
**Status:** FINDING (measured, onebrc-probe lane J t7; closes the operator's four follow-up questions and the t4→t7 kanban-update arc)

Lane J parameterized lane I with the operator's questions as knobs (grid × sink-lanes × registry). t7 @4 cores, same-session refs G(1)=46.3 / H=40.5 / F=70.1: **J(gridlake 4096, 1 lane, no registry) = 46.2–46.3 Mrows/s — equal to the best streamed ownership topology, above the lazy-owner orchestrator, with the strongest witness (double-WAL both ends, 312 msgs).** The four answers: (1) t6's ~20 vs H's 39.4 decomposes into registry RESIDENCY (knob-isolated: ON halves steady state net of spawn — t6's CONJECTURE promoted to FINDING) + the L2-busting 64K-cell table/memo working set; (2) the cache WAS wrong — matching the batch SoA to the 64×64 gridlake tile (4096 cells ≈ 80 KB integer; the literal 4096×BF16=16 KB pair is ndarray #227's proven VDPBF16PS tier) recovers it completely; (3) 1 sink lane suffices, 8 free, 64 over-lanes — lane count scales with per-batch APPLY work, never with data or address space; (4) orchestration-with-occupancy is NOT the sweet spot when ownership can live as the index-aligned guarantee table — H's lazy mechanisms stay right only when fine-grained ownership must be actors. Composed recipe: **64×64 gridlake batch SoA + codebook CAM addressing + 1–8 sink lane pairs + whole-table double-cast + flush cache; ownership = index-aligned guarantee table, no standing per-cell registry; BF16 planes per #227 when tile-GEMM lands.** Tables: crates/onebrc-probe/README.md §5.7.
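
The working-set sizes quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, not the probe's code: the 20-byte cell layout (`min: i16`, `max: i16`, `sum: i64`, `count: u64`) is an assumption chosen because it reproduces the stated "4096 cells ≈ 80 KB", and all names are hypothetical.

```rust
/// ASSUMED cell layout: min i16 + max i16 + sum i64 + count u64 = 20 bytes.
/// The source only states "4096 cells ≈ 80 KB integer"; the fields are a guess.
const ASSUMED_CELL_BYTES: usize = 2 + 2 + 8 + 8;

/// A BF16 value occupies two bytes.
const BF16_BYTES: usize = 2;

/// Total bytes for a square grid of `side × side` cells.
const fn table_bytes(side: usize, bytes_per_cell: usize) -> usize {
    side * side * bytes_per_cell
}

/// 64×64 gridlake tile, integer aggregates.
pub fn gridlake_integer_bytes() -> usize {
    table_bytes(64, ASSUMED_CELL_BYTES)
}

/// Two BF16 planes over the same 64×64 tile.
pub fn gridlake_bf16_pair_bytes() -> usize {
    2 * table_bytes(64, BF16_BYTES)
}

/// The 64K-cell table (256×256) under the same assumed cell layout.
pub fn full_grid_integer_bytes() -> usize {
    table_bytes(256, ASSUMED_CELL_BYTES)
}
```

Under that assumption the tile is 80 KiB and the BF16 pair is 16 KiB, while the 64K-cell table is 1.25 MiB before any memo is counted, which is consistent with the "L2-busting" reading but does not prove it.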
## 2026-07-02 — E-1BRC-BATCH-PIPELINE-1: the operator's batch-pipeline spec measured — 312 messages total, double-WAL on both ends, flush cache interleaves; the remaining cost is residency, not architecture
**Status:** FINDING (measured, onebrc-probe lane I t6; completes the t4→t5→t6 kanban-update arc; operator spec implemented verbatim)

The spec: all 65536 mailboxes UPFRONT; two fixed aligned indices (mailbox idx == SoA row idx — ownership = the `row_owner[i]==i` binding + write-on-behalf, never a message path); codebook-minted identity → direct CAM addressing (no probe/compare in the hot loop; ~400 global mint locks per run, total); whole-table DOUBLE-CASTS (one Arc per 64K-row batch to BOTH the mailbox-ownership-guarantee sink and the Lance row-address sink); flush cache so flushing and reindex-next interleave. t6 measured: **312 messages total** (156 batches × 2 ends — vs ~63K flat t4a, ~2.6K orchestrated t5: message count tracks BATCHES, independent of occupancy AND address-space size); flush_cache peak 2–3 tables/worker (the worker never waits for a flush); ownership==lance==156 journals + 156 version ticks asserted (the double-WAL: replayable from either end — the full W1b batch-writer shape, stronger than G/H's ownership-only witness); 64K spawn = 1.1–2.7 s one-time standing infrastructure (17–40 µs/actor); steady state ~20–22 Mrows/s ≈ ½ of the 1-owner streamed G(1) 43 — attribution CONJECTURE: resident-64K-actor memory footprint on a 4-core container + two serialized sinks. RULING: **the batch pipeline wins the messaging war outright and carries the strongest witness story; the standing 64K registry is affordable infrastructure; the remaining optimization surface is residency footprint, not architecture.** Composition across the arc: H's lazy activation (registry need not pre-exist) and I's whole-table double-cast + flush cache (when it does) are complementary — the shared invariant is that producers NEVER address fine-grained owners directly. Tables: crates/onebrc-probe/README.md §5.6.
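
The whole-table double-cast and its message count can be illustrated with standard-library channels. This is a sketch under stated assumptions: `BatchTable` and `run_double_cast` are hypothetical names, `std::sync::mpsc` stands in for the actor mailboxes, and the two sinks are drained on the calling thread rather than run as actors.

```rust
use std::sync::mpsc;
use std::sync::Arc;

/// Hypothetical batch table: one slot per SoA row.
pub struct BatchTable {
    pub batch_id: u64,
    pub sums: Vec<i64>,
}

/// Casts every batch to BOTH sinks as one shared `Arc` and returns
/// (ownership journal length, lance journal length, total messages sent).
pub fn run_double_cast(batches: u64, rows: usize) -> (usize, usize, usize) {
    let (own_tx, own_rx) = mpsc::channel::<Arc<BatchTable>>();
    let (lance_tx, lance_rx) = mpsc::channel::<Arc<BatchTable>>();
    let mut sent = 0usize;

    for batch_id in 0..batches {
        // One allocation per batch; both sinks share it through the Arc.
        let table = Arc::new(BatchTable { batch_id, sums: vec![0; rows] });
        own_tx.send(Arc::clone(&table)).expect("ownership sink alive");
        lance_tx.send(table).expect("lance sink alive");
        sent += 2;
    }
    // Dropping the senders lets the receiver iterators terminate.
    drop(own_tx);
    drop(lance_tx);

    // Each end journals every batch it receives; the two journals must agree.
    let own_journal: Vec<u64> = own_rx.iter().map(|t| t.batch_id).collect();
    let lance_journal: Vec<u64> = lance_rx.iter().map(|t| t.batch_id).collect();
    assert_eq!(own_journal, lance_journal);

    (own_journal.len(), lance_journal.len(), sent)
}
```

With 156 batches the sketch sends 312 messages regardless of `rows`, which is the property the entry records: message count follows batches, not occupancy or address-space size.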
## 2026-07-02 — E-1BRC-ORCHESTRATION-SWEETSPOT-1: the sweet spot is the orchestration tier itself — lazy activation + ahead-firing batching flatten the ownership curve (23× recovery at the 64K end)
**Status:** FINDING (measured, onebrc-probe lane H t5; completes the E-1BRC-KANBAN-UPDATE-1 / E-1BRC-OWNER-GRANULARITY-1 arc; operator: "the 65536 mailboxes had no Orchestration at all — find the sweet spot")

t4a's 20× cliff was the FLAT topology: 64K eager spawns, ~63K owner-addressed casts, producers addressing owners directly. Lane H interposes the planner/kanban-executor domain's own two mechanisms — **lazy activation** (router tier spawns an owner only on first traffic: live mailboxes track OCCUPANCY ~413, never the 64K address space) and the **ahead-firing batch writer** (routers buffer per-owner entries, fire batched Applys at batch_k) — over lane G's UNCHANGED one-mailbox-per-SoA substrate, witness discipline intact (owner journals == router casts). t5 medians @4 cores: H(16) 42.2 / H(256) 36.8 / H(4096) 40.2 / **H(65536) 39.4 vs same-session flat 1.7 — 23× recovery, within ~9% of the best coarse topology (G(1) 43.2; F reference 81.7)**. The ruling that closes the arc: **orchestration FLATTENS the granularity curve — ownership granularity becomes a semantic choice (per-tile addressability, per-owner WAL), not a performance gamble. Fine-grained mailbox-as-owner is viable IF AND ONLY IF producers never address owners directly: the router/delegation tier is a LOAD-BEARING part of the kanban-update architecture, and flat fan-out to fine owners is the measured 20× anti-pattern.** graph-flow (rs-graph-llm) remains the OUTER loop by design — task-granularity persisted-cursor orchestration; per-morsel it would measure storage latency, and its in-container build is blocked by the pre-existing burn 403 (W3b). Tables: crates/onebrc-probe/README.md §5.5.
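
The two mechanisms named above, lazy activation and ahead-firing batching, can be sketched without any actor runtime. Everything here is illustrative: `LazyRouter` is a hypothetical type, a `HashMap` entry stands in for a spawned owner, and clearing a buffer stands in for sending a batched Apply.

```rust
use std::collections::HashMap;

/// Hypothetical router tier: an owner exists only after its first entry
/// arrives, and it is addressed only in batches of `batch_k`.
pub struct LazyRouter {
    batch_k: usize,
    /// Per-owner pending entries; the key set is the set of live owners.
    buffers: HashMap<u32, Vec<i64>>,
    /// Number of batched Apply messages fired so far.
    pub applies_fired: usize,
}

impl LazyRouter {
    pub fn new(batch_k: usize) -> Self {
        LazyRouter { batch_k, buffers: HashMap::new(), applies_fired: 0 }
    }

    /// Buffers one entry; fires a batch once the owner's buffer is full.
    pub fn route(&mut self, owner: u32, value: i64) {
        // Lazy activation: the owner is created on first traffic.
        let buf = self.buffers.entry(owner).or_default();
        buf.push(value);
        if buf.len() >= self.batch_k {
            buf.clear(); // stand-in for casting one batched Apply
            self.applies_fired += 1;
        }
    }

    /// Fires whatever is still buffered at end of input.
    pub fn flush(&mut self) {
        for buf in self.buffers.values_mut() {
            if !buf.is_empty() {
                buf.clear();
                self.applies_fired += 1;
            }
        }
    }

    /// Live owners track occupancy, not the size of the address space.
    pub fn live_owners(&self) -> usize {
        self.buffers.len()
    }
}
```

Routing 413 distinct owner ids out of a 64K address space leaves 413 live owners, and the number of Apply messages is the entry count divided by `batch_k` plus any final partial batches.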
## 2026-07-02 — E-1BRC-OWNER-GRANULARITY-1: one mailbox per SoA (operator correction) — the ownership curve is a plateau then a 20× cliff; Morton tile GROUPING is what makes mailbox-as-owner viable
**Status:** FINDING (measured, onebrc-probe lane G t4a; corrects the framing of E-1BRC-KANBAN-UPDATE-1 — the numbers there stand, the topology language was inverted)

Operator: "I thought we spawn one ractor mailbox per SoA?" — ratified: that IS the canon, and lane G's "sharding the 64K SoA" framing was an ownership inversion (the code's owners were always independent, but each allocated a full 64K-slot table, making the fine-grained end unrunnable). Reworked: each owner's actor State is its OWN `OwnerSoa` sized to its tile span — one mailbox = one SoA, verbatim — unlocking the full granularity sweep including the literal 64K-concurrent-SoAs end. Medians @4 cores: G(1) 43.4 / G(16) 30.3 / G(256) 35.9 / G(4096) 18.3 / **G(65536 = one mailbox per tile) 2.1 Mrows/s — a 20× collapse** vs one owner (64K spawns paid in-run; cast fragmentation ~150→~63K messages as each morsel's ~413 stations scatter to ~413 owners; 64K mailbox tasks on 4 cores). The completed ruling: **the ownership-granularity curve is a plateau (1–256 owners, ~30–43 Mrows/s, topology noise-dominated) then a cliff; one-mailbox-per-semantic-cell is architecturally clean and measurably catastrophic at OLAP arrival rates. Morton tile GROUPING is not an optimization detail — it is the mechanism that makes mailbox-as-owner viable: the mailbox is the OWNER boundary, the tile is the ADDRESS boundary, and they must never be conflated 1:1 under load.** Owners' memory now ∝ span (the collapse is scheduling+messaging, not memory). Tables: crates/onebrc-probe/README.md §5.4a.
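
The distinction between the tile as ADDRESS boundary and the mailbox as OWNER boundary can be shown with a Morton prefix route. This is a sketch, assuming the 64K address space is a 256×256 grid with 8-bit coordinates (the source does not state the coordinate widths); `morton16` and `owner_of` are hypothetical names.

```rust
/// Interleaves two 8-bit coordinates into a 16-bit Morton (Z-order) code.
/// ASSUMPTION: the 64K cells form a 256×256 grid.
pub fn morton16(x: u8, y: u8) -> u16 {
    // Spreads the 8 bits of `v` into the even bit positions of a u16.
    fn spread(v: u8) -> u16 {
        let mut v = v as u16;
        v = (v | (v << 4)) & 0x0F0F;
        v = (v | (v << 2)) & 0x3333;
        v = (v | (v << 1)) & 0x5555;
        v
    }
    spread(x) | (spread(y) << 1)
}

/// Owner index = the top `log2(owners)` bits of the Morton code, so every
/// cell in one tile routes to the same owner. `owners` must be a power of two.
pub fn owner_of(code: u16, owners: u32) -> u32 {
    assert!(owners.is_power_of_two() && owners <= 65_536);
    let prefix_bits = owners.trailing_zeros();
    if prefix_bits == 0 {
        0
    } else {
        (code as u32) >> (16 - prefix_bits)
    }
}
```

At 256 owners each owner covers a 16×16 tile of 256 cells; at 65536 owners the prefix is the whole code and every cell becomes its own owner, which is the 1:1 conflation the ruling warns against.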
## 2026-07-02 — E-1BRC-KANBAN-UPDATE-1: the kanban-update write path measured — 0.54× at morsel granularity, the tax is all boundary, and ownership must not shard below contention
**Status:** FINDING (measured, onebrc-probe lane G t4, same recipe corpus as E-1BRC-ADDRESSING-1; tables `crates/onebrc-probe/README.md` §5.4; operator-requested follow-up "compare morton and the kanban vs without / 64k concurrent SoA vs Morton tile ... when using kanban update")

Lane G holds lane F's Morton-tile 64K SoA as OWNED state behind shard mailbox actors: workers pre-reduce 64K-row morsels (#227's morsel size; ndarray rebased onto master to sit on its merged Morton/morsel probe), cast dirty entries prefix-routed to owners, every applied batch witnessed with a KanbanMove (journal==casts asserted). t4 medians @4 cores: **F 79.5 (private merge, no witness) / G 43.0 @1 shard / 39.9 @4 / 36.0 @16 (one thrash collapse to 11.7); workers=3 strictly worse.** Three rulings for the architecture: (1) **kanban update costs ~0.54×** at morsel granularity and the tax decomposes entirely into boundary costs (Arc corpus copy, blocking+async oversubscription, per-morsel messaging) — the witness itself is ~free (lane E) — buying live bounded-staleness state, witnessed replayable writes, single-writer safety, bounded worker memory; (2) **do not shard ownership below contention** — at ~400 groups ONE mailbox absorbs all apply work and every extra shard is pure scheduling overhead; shard count scales with owner WORK, never with rows; (3) **the Morton prefix ROUTE is free as a mechanism** (G@4 within ~7% of G@1 before thrash) — tile-sharding stays the right tool, its trigger is owner-side contention (high cardinality / heavy per-entry work). W2d consequence: private-merge when the product is one final answer; pay the ~2× only when the product IS the live/witnessed/owned state — and the 550 ms Libet budget is untouched either way.
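
The witness discipline (every applied batch journaled, journal length equal to cast count) reduces to a small invariant. The sketch below is illustrative only: the `KanbanMove` fields, the `Owner` type, and `apply_all` are assumptions, and a plain loop stands in for the mailbox.

```rust
/// Hypothetical stand-in for the witness record; the real fields are not
/// shown in the source.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct KanbanMove {
    pub batch_seq: u64,
    pub entries: usize,
}

/// Single-writer owner: holds the state and the journal of applied batches.
pub struct Owner {
    pub sums: Vec<i64>,
    pub journal: Vec<KanbanMove>,
}

impl Owner {
    pub fn new(slots: usize) -> Self {
        Owner { sums: vec![0; slots], journal: Vec::new() }
    }

    /// Applies one batch of (slot, delta) entries and witnesses it.
    pub fn apply(&mut self, batch_seq: u64, dirty: &[(usize, i64)]) {
        for &(slot, delta) in dirty {
            self.sums[slot] += delta;
        }
        self.journal.push(KanbanMove { batch_seq, entries: dirty.len() });
    }
}

/// Casts every batch to one owner and checks journal == casts.
pub fn apply_all(batches: &[Vec<(usize, i64)>], slots: usize) -> (Owner, usize) {
    let mut owner = Owner::new(slots);
    let mut casts = 0usize;
    for (seq, batch) in batches.iter().enumerate() {
        owner.apply(seq as u64, batch);
        casts += 1;
    }
    assert_eq!(owner.journal.len(), casts);
    (owner, casts)
}
```

The journal append is one small push per batch, which fits the entry's reading that the cost sits at the boundary rather than in the witness.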
## 2026-07-02 — E-1BRC-ADDRESSING-1: addressing-is-aggregation measured — route-and-write is 3× the classic map; the Morton dress costs ~10%
**Status:** FINDING (measured, onebrc-probe t0–t3, recipe corpus rows=10000000 seed=42 sha256=f1853caa…5691, 4-core container; tables in `crates/onebrc-probe/README.md` §5–5.3)

117 changes: 117 additions & 0 deletions .claude/v3/INTEGRATION-PLAN.md
@@ -591,3 +591,120 @@ where the win lives (B was 1.06×). All six lanes A–F + R now measured
on one regenerable recipe corpus. Board: E-1BRC-ADDRESSING-1. The probe
is COMPLETE; follow-ups (100M container-scale run, high-cardinality
corpus, SWAR parse, mmap) are priced and parked in README §1/§5.3.

#### Addendum-13 status update (2026-07-02, t4 — lane G, operator follow-up)

Operator: "compare morton and the kanban vs without — if 64k concurrent
SoA vs Morton tile can help us understand the pros and cons of our
architecture when using kanban update." Lane G SHIPPED (feature
`lane-g`): the lane-F Morton-tile 64K SoA as OWNED state behind shard
mailbox actors — prefix-routed morsel casts (64K rows, #227's morsel
size, clear-by-undo extraction), every applied batch witnessed with a
KanbanMove, journal==casts asserted. ndarray checkout rebased onto
master (#227 merged — its Morton scatter/morsel probe is the sibling
reference). t4 medians: F 79.5 / G(1 shard) 43.0 / G(4) 39.9 /
G(16) 36.0 (one thrash collapse 11.7) / G(workers=3) strictly worse.
**Ledger: kanban update = 0.54× at morsel granularity, and the tax is
all boundary (corpus copy + oversubscription + messaging), not the
witness (lane E: journal ~free). It buys live bounded-staleness state,
witnessed replayable writes, single-writer safety, bounded worker
memory. Do NOT shard ownership below contention — at ~400 groups one
mailbox absorbs everything; shards scale with owner WORK, never with
rows; the Morton prefix route itself is free (G(4)≈G(1)).** Tables +
full readings: crates/onebrc-probe/README.md §5.4. Board follow-up
appended to E-1BRC-ADDRESSING-1 thread as E-1BRC-KANBAN-UPDATE-1.

#### Addendum-13 status update (2026-07-02, t4a — topology corrected, curve completed)

Operator correction ratified: **one ractor mailbox per SoA** (canon).
Lane G reworked — each owner's State is its OWN `OwnerSoa` sized to its
tile span (no full-64K tables per owner, no "sharded one SoA" framing);
flush grouping made sort-based (no dense per-shard vecs at 64K owners);
parity test extended to 4096 mailboxes. Full ownership-granularity curve
@4 workers, medians: G(1) 43.4 / G(16) 30.3 / G(256) 35.9 / G(4096) 18.3
/ **G(65536, one mailbox per tile) 2.1 — a 20× collapse** (spawn ×64K +
cast fragmentation ~150→~63K + 64K tasks on 4 cores). **Ruling: the
ownership plateau spans 1–256 owners; Morton tile GROUPING is what makes
mailbox-as-owner viable — mailbox = OWNER boundary, tile = ADDRESS
boundary, never conflate 1:1 under load.** README §5.4a; board
E-1BRC-KANBAN-UPDATE-1 correction appended as E-1BRC-OWNER-GRANULARITY-1.
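
The sort-based flush grouping mentioned above can be sketched as follows. It is one plausible reading, not the lane-G code: `group_by_owner` is a hypothetical function, and the point is only that memory scales with the number of dirty entries rather than with a dense vector per possible owner.

```rust
/// Groups (owner, entry) pairs by owner using a sort plus run detection.
/// No per-owner vector is allocated up front, so a 64K-owner address space
/// costs nothing unless an owner actually has dirty entries.
pub fn group_by_owner(mut dirty: Vec<(u32, i64)>) -> Vec<(u32, Vec<i64>)> {
    // Stable sort keeps each owner's entries in arrival order.
    dirty.sort_by_key(|&(owner, _)| owner);

    let mut groups: Vec<(u32, Vec<i64>)> = Vec::new();
    for (owner, entry) in dirty {
        // Extend the current run if this entry belongs to the same owner.
        if let Some((last, entries)) = groups.last_mut() {
            if *last == owner {
                entries.push(entry);
                continue;
            }
        }
        // Otherwise start a new run.
        groups.push((owner, vec![entry]));
    }
    groups
}
```

Each resulting group corresponds to one cast to one owner.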

#### Addendum-13 status update (2026-07-02, t5 — orchestration sweet spot, operator follow-up)

Operator: "the 65536 mailboxes had no Orchestration at all — find the
sweet spot with rs-graph-llm or lance-graph-planner + kanban update."
Lane H SHIPPED (feature `lane-h`): router tier with LAZY owner
activation (live mailboxes track occupancy ~413, never the 64K address
space) + AHEAD-FIRING batched delivery (batch_k=64) over lane G's
unchanged one-mailbox-per-SoA substrate; witness discipline preserved
(owner journals == router casts asserted). graph-flow stays the OUTER
loop (task-granularity cursor; burn-submodule 403 blocks in-container
builds anyway) — the in-loop mechanisms are the planner/kanban-executor
domain's own. t5 medians @4 cores: H(16) 42.2 / H(256) 36.8 / H(4096)
40.2 / **H(65536) 39.4 vs flat 1.7 same-session — 23× recovery, within
~9% of G(1)=43.2; F=81.7.** RULING: orchestration FLATTENS the
granularity curve — the sweet spot is not a shard count, it is the
orchestration tier itself; fine-grained mailbox-as-owner is viable iff
producers never address owners directly (the ahead-firing batch-writer
is load-bearing, not an optimization; flat fan-out = the measured 20×
anti-pattern). README §5.5; board E-1BRC-ORCHESTRATION-SWEETSPOT-1.

#### Addendum-13 status update (2026-07-02, t6 — lane I, operator batch-pipeline spec)

Operator spec implemented verbatim as lane I (feature `lane-i`): all
65536 mailboxes UPFRONT (standing ownership registry; spawn measured
separately: 1.1–2.7 s, 17–40 µs/actor); two fixed aligned indices
(mailbox idx == SoA row idx — ownership guarantee is the
`row_owner[i]==i` binding + write-on-behalf, never a message path);
codebook-minted identity → direct CAM addressing (no probe in the hot
loop; worker-local memo, ~400 global mint locks total); whole-table
DOUBLE-CASTS (one Arc per 64K-row batch to BOTH the ownership-guarantee
sink and the Lance row-address sink — 312 messages total vs 63K flat /
2.6K orchestrated); flush cache interleaving flush and refill (peak 2–3
tables/worker, worker never waits). Both ends journal every batch
(ownership==lance==156 asserted) + one DatasetVersion tick per batch —
the full double-WAL the W1b batch writer needs. t6: steady state ~20–22
Mrows/s (≈½ of G(1) 43 — residency-footprint attribution CONJECTURE);
total incl. spawn 3.2–6.1. RULING: the batch pipeline wins the
messaging war outright (messages ∝ batches, independent of occupancy
AND address space); the standing 64K registry is affordable
infrastructure; the remaining surface is residency, not architecture.
README §5.6; board E-1BRC-BATCH-PIPELINE-1.

#### Addendum-13 status update (2026-07-02, t7 — lane J knob matrix, PROBE ARC COMPLETE)

Lane J (feature `lane-j`) parameterizes lane I with the operator's four
follow-up questions as knobs: grid (4096 gridlake vs 65536), sink lanes
(1/8/64), registry (on/off). t7 @4 cores, same-session refs G(1)=46.3 /
H=40.5 / F=70.1: **J(gridlake 4096, 1 lane, no registry) = 46.2–46.3 —
the measured sweet spot: equals the best streamed topology while
carrying the double-WAL.** Registry ON halves steady state net of spawn
(t6 residency CONJECTURE → FINDING); grid 65536 → 40 (L2-busting
table+memo); lanes 1≈8, 64 over-lanes (apply work is O(dirty) —
lanes scale with APPLY work, never data). The composed recipe: 64×64
gridlake batch SoA + codebook CAM + 1–8 lane pairs + whole-table
double-cast + flush cache; ownership as the index-aligned guarantee
table, NOT a standing per-cell actor registry; BF16 planes per ndarray
#227's proven VDPBF16PS tier as the tile-GEMM continuation. README


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Markdownlint MD018: missing space after #.

Likely triggered by an inline #227-style reference sitting at the start of a wrapped line, which markdown parses as an ATX heading attempt.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 688-688: No space after hash on atx style heading

(MD018, no-missing-space-atx)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/v3/INTEGRATION-PLAN.md at line 688, The Markdown in
INTEGRATION-PLAN.md is triggering MD018 because a wrapped line starts with an
inline `#227-style` reference that looks like a heading marker. Update the
affected prose so the line no longer begins with a bare # reference, or reflow
the surrounding text in the section containing the `#227` reference to keep the
hash from appearing at the start of a line. Use the nearby markdown content in
the integration plan to locate and adjust the wrapped sentence.

Source: Linters/SAST tools

§5.7; board E-1BRC-GRIDLAKE-SWEETSPOT-1.

#### Addendum-13 status update (2026-07-02, consolidation — findings/commentary split, 8 presets, simd_ops wiring)

Operator-requested consolidation SHIPPED: (1) `crates/onebrc-probe/FINDINGS.md`
— the AGNOSTIC record (environment, methods, all t0–t7 tables, all 11
invariants WITH their code, reproduction commands; zero interpretation);
(2) `crates/onebrc-probe/COMMENTARY.md` — this session's prime stored
SEPARATELY (readings, rulings executed, composed recipe, flagged
uncertainty, suggested lab sweeps) so another session can analyze the
findings from its own angle; (3) `src/presets.rs` (feature `presets`) —
the 8 batching methods frozen as named presets (map-private-merge /
grid-private-merge / stream-single-owner / orchestrated-lazy-owners /
batch-64k-registry / gridlake / gridlake-8-lanes / batch-64k-no-registry)
sharing one signature + one parity harness (`all_presets_agree_with_lane_a`
— every preset byte-identical to lane A); (4) honest answer to the simd
question: lane B had used ONLY `U8x32::cmpeq_mask`; NOW also routes the
stride walk through `ndarray::simd::array_chunks` (simd_ops.rs, the
non-overlapping walker; `array_windows` is the overlapping GEMM sibling,
deliberately unused); `simd_soa.rs::SoaBytes` remains an OPEN follow-up
(natural carrier for vectorized sink sweeps + batch tables). Note: probe
target/debug purged mid-round (disk full at 100%); gates re-run green.
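
The "one signature + one parity harness" arrangement can be sketched as a table of function pointers checked against a reference. This is a hypothetical reduction: the `Preset` signature, the two implementations, and `all_presets_agree` are assumptions for illustration; only the preset names `map-private-merge` and `grid-private-merge` come from the text, and the real harness is `all_presets_agree_with_lane_a`.

```rust
use std::collections::BTreeMap;

/// ASSUMED shared signature: (key, value) rows in, per-key sums out,
/// sorted by key. The real signature in src/presets.rs is not shown.
pub type Preset = fn(&[(u16, i64)]) -> Vec<(u16, i64)>;

/// Reference implementation: ordered-map aggregation.
pub fn map_private_merge(rows: &[(u16, i64)]) -> Vec<(u16, i64)> {
    let mut sums: BTreeMap<u16, i64> = BTreeMap::new();
    for &(key, value) in rows {
        *sums.entry(key).or_insert(0) += value;
    }
    sums.into_iter().collect()
}

/// Dense-grid implementation: the key is the slot index.
pub fn grid_private_merge(rows: &[(u16, i64)]) -> Vec<(u16, i64)> {
    let mut sums = vec![0i64; 65_536];
    let mut seen = vec![false; 65_536];
    for &(key, value) in rows {
        sums[key as usize] += value;
        seen[key as usize] = true;
    }
    (0..65_536usize)
        .filter(|&i| seen[i])
        .map(|i| (i as u16, sums[i]))
        .collect()
}

/// Named presets sharing the one signature.
pub const PRESETS: [(&str, Preset); 2] = [
    ("map-private-merge", map_private_merge as Preset),
    ("grid-private-merge", grid_private_merge as Preset),
];

/// Parity check: every preset must reproduce the first preset's output.
pub fn all_presets_agree(rows: &[(u16, i64)]) -> bool {
    let reference = (PRESETS[0].1)(rows);
    PRESETS.iter().all(|(_, preset)| preset(rows) == reference)
}
```

Adding a preset then means adding one row to the table; the parity check covers it without further test code.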