feat: proving-scoped zk-alloc arena allocator (benchmark feature)#412
feat: proving-scoped zk-alloc arena allocator (benchmark feature)#412MegaRedHand wants to merge 8 commits into
Conversation
Bump the multisig dependency to commit 0520822 and switch the git remote from the renamed leanMultisig repo to leanVM. Pin transitive Plonky3 to 3f67d136 (the rev leanVM 0520822 locks against); the floating Plonky3 HEAD pulled in a newer rev that requires the unstable maybe_uninit_slice feature and fails to build on stable.
…ure) Add an off-by-default zk-alloc feature enabling leanVM's bump-arena allocator for XMSS aggregation without destabilizing a long-running node. Installing leanVM's ZkAllocator directly as the global allocator would corrupt unrelated threads: begin_phase() resets every thread's slab, so any long-lived buffer the tokio/p2p/storage/actor threads allocate during a phase is silently overwritten on the next reset. Instead, ScopedZkAlloc is a dispatcher global allocator that routes to the arena only on flagged proving threads; everything else always uses the system allocator. Proving threads are the global rayon pool's workers, flagged at construction via ThreadPoolBuilder::start_handler + build_global (leanVM's prover is the pool's only user in ethlambda). Two earlier designs failed and are documented in src/zk_alloc.rs: per-phase rayon::broadcast flagging deadlocks the pool's sleep accounting, and a dedicated install() pool crashes once its crossbeam internals are first allocated inside a phase (UAF after the next slab reset). The global pool works because setup_prover grows those internals in System memory before any phase exists. All proving is serialized behind a global mutex (begin_phase nesting between the spawn_blocking aggregation worker and actor-thread block building would otherwise panic), and proofs are serialized to bytes after end_phase, while still holding the lock, so outputs land in System and survive the next reset. Tradeoffs (benchmark build only): drops jemalloc and the /debug/pprof heap endpoints (ZkAllocator::dealloc hardcodes System for non-arena frees), and zk_alloc::init() asserts the runtime core count matches the build host, so the binary must be built on the machine it runs on. Validated: crypto round-trip tests with ScopedZkAlloc as the test-binary allocator, and a 4-node binary-mode devnet soak of 1019 slots (all nodes on the same head root, finalized=1008, zero panics; 1053 aggregations, p50=203ms / p90=283ms through the arena).
🤖 Claude Code ReviewI have enough context from the diff to write the review. Let me compile my findings. PR Review: feat: proving-scoped zk-alloc arena allocator (benchmark feature)OverviewThis PR integrates leanVM's Correctness & Safety1. Panic safety: In // zk_alloc.rs
pub(crate) fn prove<T, F>(&self, produce: F) -> T {
begin_phase();
struct EndOnDrop;
impl Drop for EndOnDrop {
fn drop(&mut self) { end_phase(); }
}
let _end = EndOnDrop; // ← initialized AFTER begin_phase
produce()
}If Fix: initialize pub(crate) fn prove<T, F>(&self, produce: F) -> T {
struct EndOnDrop;
impl Drop for EndOnDrop {
fn drop(&mut self) { end_phase(); }
}
let _end = EndOnDrop; // ← guard is live before begin_phase
begin_phase();
produce()
}2. unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
if new_size <= layout.size() {
return ptr; // ← no-op for System allocations
}
...
}For pointers allocated via 3. unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
unsafe { ZkAllocator.dealloc(ptr, layout) };
}This relies entirely on 4. // lib.rs, zk-alloc path of prove_type1
zk_alloc::init_arena_rayon_pool(); // ← Once, so nearly free after startup
let session = zk_alloc::ArenaSession::begin();
ensure_prover_ready();The call in Design Notes5. Proving serialization is global, not per-call-site
6. Rayon pool ordering is subtly fragile The safety argument depends on 7. Mutex poison recovery let lock = PROVING.lock().unwrap_or_else(|poison| poison.into_inner());After a prover panic, the recovered lock is immediately used to start another phase. If Code Quality
SummaryOne real correctness issue (Item 1 — Automated review by Claude (Anthropic) · sonnet · custom prompt |
🤖 Codex Code Review
No consensus/fork-choice/state-transition logic was modified here, and I didn’t see a direct correctness regression in those areas from this diff. I could not run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Greptile SummaryThis PR introduces an off-by-default
Confidence Score: 3/5Safe to merge as a benchmark-only feature (off by default), but the arena phase management in zk_alloc.rs has two fragile spots worth addressing before widening use. The allocator dispatch logic and proving serialization are well-thought-out and backed by a 68-minute devnet soak. Two concerns sit in ArenaSession::prove: proof serialization reads arena memory after end_phase() has been called, relying on an undocumented leanVM invariant that slab resets only happen at begin_phase(); and if begin_phase() itself panics, EndOnDrop is never constructed, leaving ARENA_ACTIVE potentially stuck active. Neither is a current defect given the tested leanVM version and the benchmark-only scope, but both create silent failure modes if leanVM internals change or an unexpected panic occurs in begin_phase(). crates/common/crypto/src/zk_alloc.rs — specifically the ArenaSession::prove method drop ordering and the gap between begin_phase() and EndOnDrop construction.
|
| Filename | Overview |
|---|---|
| crates/common/crypto/src/zk_alloc.rs | New file: implements ScopedZkAlloc dispatcher allocator and ArenaSession. Core fragility: end_phase() is called inside prove() via EndOnDrop before the returned proof value is serialized by callers; correctness depends on end_phase() not invalidating arena memory. Also: if begin_phase() itself panics, EndOnDrop is never created and end_phase() is never called, leaving ARENA_ACTIVE potentially set. |
| crates/common/crypto/src/lib.rs | Refactors all prover call sites into prove_type1/prove_type2 helpers that manage the ArenaSession lifecycle; non-zk-alloc paths are unchanged. The compress_*_to_byte_list calls occur after end_phase() but while the PROVING lock is held. |
| bin/ethlambda/src/main.rs | Adds zk-alloc startup block: installs ScopedZkAlloc as global allocator, calls init_allocator() and init_arena_rayon_pool() early in main. Feature-gating for jemalloc/malloc_conf is correct. |
| bin/ethlambda/Cargo.toml | Adds optional ethlambda-crypto dependency and zk-alloc feature gate; clean and correct. |
| crates/common/crypto/Cargo.toml | Adds optional rayon dependency behind zk-alloc feature; minimal and correct. |
Sequence Diagram
sequenceDiagram
participant Main as main.rs
participant CryptoLib as crypto/lib.rs
participant ZkAlloc as zk_alloc.rs
participant Rayon as Global Rayon Pool
participant LeanVM as lean_multisig
Main->>ZkAlloc: init_allocator()
Main->>ZkAlloc: init_arena_rayon_pool()
ZkAlloc->>Rayon: "build_global() + start_handler sets USE_ARENA=true"
Note over Main,LeanVM: Per proof call
Main->>CryptoLib: aggregate_signatures / merge / split
CryptoLib->>ZkAlloc: ArenaSession::begin() lock PROVING mutex
CryptoLib->>LeanVM: ensure_prover_ready() outside phase System memory
CryptoLib->>ZkAlloc: session.prove(closure)
ZkAlloc->>LeanVM: "begin_phase() ARENA_ACTIVE=true"
ZkAlloc->>LeanVM: "produce() rayon workers USE_ARENA=true arena allocs"
LeanVM-->>ZkAlloc: LMType1/LMType2 arena memory
Note over ZkAlloc: EndOnDrop drop calls end_phase() HERE
ZkAlloc-->>CryptoLib: LMType1 arena ptr phase already ended
CryptoLib->>LeanVM: compress reads arena memory post end_phase
LeanVM-->>CryptoLib: ByteList512KiB System memory
Note over CryptoLib,ZkAlloc: session drops PROVING mutex released
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
crates/common/crypto/src/zk_alloc.rs:175-187
**Post-`end_phase()` arena read is an undocumented leanVM invariant**
`EndOnDrop` calls `end_phase()` when `prove()` returns — before `compress_type1_to_byte_list` / `compress_type2_to_byte_list` runs in the callers. Those functions read the `LMType1`/`LMType2` value, which may contain pointers into arena memory. Safety depends entirely on `end_phase()` not freeing or invalidating arena memory (only `begin_phase()` resetting the slab does). This invariant is internal to `lean_multisig` and not part of any published API contract. A future leanVM release that flushes or frees slabs in `end_phase()` would silently turn proof serialization into a use-after-free.
A structurally safer alternative is to move `begin_phase()`/`end_phase()` into `ArenaSession`'s constructor and `Drop` impl respectively, so the phase lifetime matches the session lifetime. Compression then runs before the session (and the phase) ends, making the arena-memory liveness invariant explicit in the code structure rather than an assumption about upstream behaviour.
### Issue 2 of 2
crates/common/crypto/src/zk_alloc.rs:174-187
**`begin_phase()` panic leaves `end_phase()` uncalled**
`_end = EndOnDrop` is only created *after* `begin_phase()` returns. If `begin_phase()` itself panics (e.g. upstream assertion on nested phases), `EndOnDrop` is never constructed, so `end_phase()` is never called. The panic unwinds through `prove_type1`, drops `session`, and poisons the `PROVING` mutex. The next `ArenaSession::begin()` recovers the poison and proceeds — but if `begin_phase()` had already set `ARENA_ACTIVE = true` before panicking, the arena switch is now permanently stuck active. Any allocation on a flagged rayon worker from that point on will silently land in the (unreset) arena.
Moving the `EndOnDrop` guard to before `begin_phase()` is called — or initialising `_end` first and then calling `begin_phase()` inside its constructor — would close this window.
Reviews (1): Last reviewed commit: "feat(crypto): proving-scoped zk-alloc ar..." | Re-trigger Greptile
| begin_phase(); | ||
| // Guarantees `end_phase` runs even if the prover panics, so the global | ||
| // arena switch is never left stuck active. leanVM's `end_phase` also | ||
| // flushes the global pool's injector (its job blocks may live in arena). | ||
| struct EndOnDrop; | ||
| impl Drop for EndOnDrop { | ||
| fn drop(&mut self) { | ||
| end_phase(); | ||
| } | ||
| } | ||
| let _end = EndOnDrop; | ||
| produce() | ||
| } |
There was a problem hiding this comment.
Post-
end_phase() arena read is an undocumented leanVM invariant
EndOnDrop calls end_phase() when prove() returns — before compress_type1_to_byte_list / compress_type2_to_byte_list runs in the callers. Those functions read the LMType1/LMType2 value, which may contain pointers into arena memory. Safety depends entirely on end_phase() not freeing or invalidating arena memory (only begin_phase() resetting the slab does). This invariant is internal to lean_multisig and not part of any published API contract. A future leanVM release that flushes or frees slabs in end_phase() would silently turn proof serialization into a use-after-free.
A structurally safer alternative is to move begin_phase()/end_phase() into ArenaSession's constructor and Drop impl respectively, so the phase lifetime matches the session lifetime. Compression then runs before the session (and the phase) ends, making the arena-memory liveness invariant explicit in the code structure rather than an assumption about upstream behaviour.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/common/crypto/src/zk_alloc.rs
Line: 175-187
Comment:
**Post-`end_phase()` arena read is an undocumented leanVM invariant**
`EndOnDrop` calls `end_phase()` when `prove()` returns — before `compress_type1_to_byte_list` / `compress_type2_to_byte_list` runs in the callers. Those functions read the `LMType1`/`LMType2` value, which may contain pointers into arena memory. Safety depends entirely on `end_phase()` not freeing or invalidating arena memory (only `begin_phase()` resetting the slab does). This invariant is internal to `lean_multisig` and not part of any published API contract. A future leanVM release that flushes or frees slabs in `end_phase()` would silently turn proof serialization into a use-after-free.
A structurally safer alternative is to move `begin_phase()`/`end_phase()` into `ArenaSession`'s constructor and `Drop` impl respectively, so the phase lifetime matches the session lifetime. Compression then runs before the session (and the phase) ends, making the arena-memory liveness invariant explicit in the code structure rather than an assumption about upstream behaviour.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| { | ||
| begin_phase(); | ||
| // Guarantees `end_phase` runs even if the prover panics, so the global | ||
| // arena switch is never left stuck active. leanVM's `end_phase` also | ||
| // flushes the global pool's injector (its job blocks may live in arena). | ||
| struct EndOnDrop; | ||
| impl Drop for EndOnDrop { | ||
| fn drop(&mut self) { | ||
| end_phase(); | ||
| } | ||
| } | ||
| let _end = EndOnDrop; | ||
| produce() | ||
| } |
There was a problem hiding this comment.
begin_phase() panic leaves end_phase() uncalled
_end = EndOnDrop is only created after begin_phase() returns. If begin_phase() itself panics (e.g. upstream assertion on nested phases), EndOnDrop is never constructed, so end_phase() is never called. The panic unwinds through prove_type1, drops session, and poisons the PROVING mutex. The next ArenaSession::begin() recovers the poison and proceeds — but if begin_phase() had already set ARENA_ACTIVE = true before panicking, the arena switch is now permanently stuck active. Any allocation on a flagged rayon worker from that point on will silently land in the (unreset) arena.
Moving the EndOnDrop guard to before begin_phase() is called — or initialising _end first and then calling begin_phase() inside its constructor — would close this window.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/common/crypto/src/zk_alloc.rs
Line: 174-187
Comment:
**`begin_phase()` panic leaves `end_phase()` uncalled**
`_end = EndOnDrop` is only created *after* `begin_phase()` returns. If `begin_phase()` itself panics (e.g. upstream assertion on nested phases), `EndOnDrop` is never constructed, so `end_phase()` is never called. The panic unwinds through `prove_type1`, drops `session`, and poisons the `PROVING` mutex. The next `ArenaSession::begin()` recovers the poison and proceeds — but if `begin_phase()` had already set `ARENA_ACTIVE = true` before panicking, the arena switch is now permanently stuck active. Any allocation on a flagged rayon worker from that point on will silently land in the (unreset) arena.
Moving the `EndOnDrop` guard to before `begin_phase()` is called — or initialising `_end` first and then calling `begin_phase()` inside its constructor — would close this window.
How can I resolve this? If you propose a fix, please make it concise.
🗒️ Description / Motivation
Stacked on #408. leanVM 0520822 ships
zk-alloc, a bump-arena allocator that speeds up proving — but it does nothing unless the binary installs it as the global allocator, and installing it naively corrupts a live node:begin_phase()resets every thread's slab, so long-lived buffers allocated by tokio/p2p/RocksDB/actor threads during a phase are silently overwritten by the next reset.This PR adds an off-by-default
zk-allocfeature that scopes the arena to proving only, so a node can run it long enough to benchmark.What Changed
ScopedZkAlloc(crates/common/crypto/src/zk_alloc.rs): a dispatcher#[global_allocator]that routes allocations to leanVM's arena only on flagged proving threads; every other thread always uses System.deallocdefers toZkAllocator's address-based check;reallocis reimplemented to avoid leaking arena memory onto non-proving threads.ThreadPoolBuilder::start_handler+build_global, before leanVM'ssetup_proverbuilds the pool unflagged). Two failed designs are documented in the module: per-phaserayon::broadcastflagging deadlocks the pool, and a dedicatedinstall()pool segfaults once its crossbeam internals are first allocated inside a phase.begin_phasenesting between the spawn_blocking aggregation worker and actor-thread block building); proof bytes serialized afterend_phase, while the lock is held, so outputs land in System.--features zk-allocswaps jemalloc forScopedZkAllocand runsinit_allocator()+init_arena_rayon_pool()at startup. Default build is unchanged.Tradeoffs (benchmark build only)
/debug/pprofheap endpoints are unavailable (ZkAllocator::deallochardcodes System for non-arena frees).zk_alloc::init()asserts runtime core count == build-host core count → build on the machine you run on (binary-mode devnets, not Docker).Validation
ScopedZkAllocinstalled as the test-binary allocator (real arena phases): 5/5 pass.justified=1014 / finalized=1008, zero panics / zero ERROR lines, RSS stable at ~2.7–3.0 GiB/node. 1053 aggregations through the arena: p50 203 ms, p90 283 ms, p99 1.1 s (max 3.0 s = cold first proof incl. setup).A jemalloc-baseline run of the same topology is the natural follow-up to quantify the speedup.
Side finding (upstream, not addressed here)
handle_tickschedules the next tick afteron_tickreturns, relative to its entry interval — any duty exceeding the 800 ms interval (block building proves inline, ~1.3 s) silently skips the next interval, so a proposing node produces no attestations that slot. Single-node devnets can never justify; multi-node devnets lose the proposer's attestations every slot. Deserves a separate issue.