transport: byte-budget the bulk gossip lane drain — fix outbound-phase block starvation (coverage decay)#280

Merged: ch4r10t33r merged 1 commit into main from fix/gossip-bulk-drain-budget on Jun 28, 2026
@ch4r10t33r

Diagnosed directly from the live Loki logs. v0.2.53 (flood gate) fixed the inbound saturation (inbound_streams 134ms→~0) and the req/resp collapse (RawAppStreamSlotsFull 300→0/5m), but the SLOW drive iter breakdown then showed the cost moved to outbound = 655–822ms:

total=719ms outbound=700 inbound_streams=2
pending-stream-send queue full (33553835 bytes) on stream_id=1   ← 33MB of blocks on the gossip stream
connection lost ... cwnd=25MB lost=0 srtt=18379ms               ← over-buffered, not collapsed
priority outbox cap (1024) hit                                  ← attestations dropped

Root cause

drainGossipLane drained the whole bulk lane into zquic each tick, so a multi-MB block built a ~33 MB pending backlog. Encrypting that backlog dominated the outbound drive phase and starved the priority (attestation) lane, so the priority outbox hit its cap, attestations were dropped, and coverage decayed to 21. Connections also over-buffered to multi-second srtt, which led to 60s no-ACK teardown churn.

Fix

Per-tick byte_budget on drainGossipLane: bulk (block) lane budgeted at 256 KiB; priority (attestation) lane unbounded. A block is fed over several ticks instead of monopolizing a lap, so the priority lane drains every tick. Budget is checked only at clean frame boundaries, preserving the one-frame-mid-flight /meshsub invariant. Recipients that miss a gossiped block re-sync via req/resp (fixed in v0.2.53).
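
The budgeted drain described above can be sketched roughly as follows. This is an illustrative Python model, not the zig-libp2p code: `drain_lane`, `tick`, and the deque-based lanes are hypothetical names, frames are treated as atomic (the real code tracks a partial-frame flag), and the zero-budget handling is only one plausible reading of the guard mentioned in the review.

```python
from collections import deque

BULK_BUDGET_BYTES = 256 * 1024  # 256 KiB, the bulk-lane budget named in this PR


def drain_lane(lane, sink, byte_budget=None):
    """Move whole frames from `lane` to `sink`; return bytes moved this tick.

    byte_budget=None means unbounded (the priority lane). The budget is
    tested only between frames, so a frame is never split. Requiring
    sent > 0 before breaking means a non-empty lane always sends at least
    one frame, even with a zero budget (assumed forward-progress guard).
    """
    sent = 0
    while lane:
        if byte_budget is not None and sent > 0 and sent >= byte_budget:
            break  # clean frame boundary: stop here and resume next tick
        frame = lane.popleft()
        sink.append(frame)
        sent += len(frame)
    return sent


def tick(priority, bulk, sink):
    """One drive iteration: priority lane first and unbounded, bulk lane budgeted."""
    p = drain_lane(priority, sink)
    b = drain_lane(bulk, sink, BULK_BUDGET_BYTES)
    return p, b
```

Under these assumptions, a 2 MiB block cut into 64 KiB frames leaves the bulk lane in 8 ticks of 4 frames each, and the priority lane is emptied on every one of those ticks.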

Adversarial review (done)

No wire desync (break confined to frame boundaries), no wedge false-positive (progressing bulk resets the wedge timer), priority lane unaffected, forward progress guaranteed (incl. a bb==0 guard). Verdict: ship.
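
To illustrate the wedge-timer point, here is a minimal sketch, assuming a lane counts as wedged only when it has pending data and has sent nothing for a timeout. `WedgeTimer`, `observe`, and the timeout value are hypothetical and not taken from zig-libp2p.

```python
class WedgeTimer:
    """Flags a lane as wedged only after `timeout_s` seconds with pending
    data and zero bytes sent. Any progress resets the timer."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.last_progress = now

    def observe(self, bytes_sent, pending, now):
        """Record one tick; return True if the lane should be treated as wedged."""
        if bytes_sent > 0 or not pending:
            # A budgeted bulk lane that moved any bytes this tick lands here,
            # so slow-but-steady draining is never reported as a wedge.
            self.last_progress = now
            return False
        return now - self.last_progress >= self.timeout_s
```

A bulk lane that sends one budget's worth of bytes per tick keeps resetting `last_progress`, so spreading a block over many ticks does not by itself trip the detector.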

Build clean; 504/506 tests. Pure zig-libp2p.

…e block starvation

v0.2.53 (flood gate) fixed inbound saturation (inbound_streams 134ms->~0) but
the live SLOW drive iter then showed outbound=655-822ms: the bulk lane drained
the WHOLE block backlog into zquic's pending queue each tick (33MB pending on
the persistent /meshsub stream), whose encryption dominated the outbound drive
phase and starved the priority (attestation) lane -> priority outbox cap (1024)
hit -> attestations dropped -> coverage decay to 21. Conns also over-buffered
(srtt 4-18s, bif 4MB) -> 60s no-ACK teardown churn.

Fix: drainGossipLane gains a per-tick byte_budget; the bulk (block) lane is
budgeted (gossip_bulk_drain_budget_bytes=256KiB) so a multi-MB block is fed over
several ticks instead of monopolizing one lap; the priority (attestation) lane
stays unbounded and drains every tick. Budget checked only at clean frame
boundaries (partial_flag clear) so the one-frame-mid-flight /meshsub invariant
is preserved. Recipients that miss a gossiped block re-sync via req/resp (fixed
in v0.2.53). Adversarially reviewed: no wire desync, no wedge false-positive,
priority unaffected, progress guaranteed. Build clean; 504/506 tests.

ch4r10t33r merged commit 21d097a into main on Jun 28, 2026
3 of 4 checks passed