transport: byte-budget the bulk gossip lane drain — fix outbound-phase block starvation (coverage decay) #280
Merged
…e block starvation

v0.2.53 (flood gate) fixed inbound saturation (inbound_streams 134ms->~0), but the live SLOW drive iter then showed outbound=655-822ms: the bulk lane drained the WHOLE block backlog into zquic's pending queue each tick (33MB pending on the persistent /meshsub stream), whose encryption dominated the outbound drive phase and starved the priority (attestation) lane -> priority outbox cap (1024) hit -> attestations dropped -> coverage decay to 21. Conns also over-buffered (srtt 4-18s, bif 4MB) -> 60s no-ACK teardown churn.

Fix: drainGossipLane gains a per-tick byte_budget; the bulk (block) lane is budgeted (gossip_bulk_drain_budget_bytes=256KiB) so a multi-MB block is fed over several ticks instead of monopolizing one lap; the priority (attestation) lane stays unbounded and drains every tick. Budget checked only at clean frame boundaries (partial_flag clear) so the one-frame-mid-flight /meshsub invariant is preserved. Recipients that miss a gossiped block re-sync via req/resp (fixed in v0.2.53).

Adversarially reviewed: no wire desync, no wedge false-positive, priority unaffected, progress guaranteed. Build clean; 504/506 tests.
Diagnosed directly from the live Loki logs. v0.2.53 (flood gate) fixed the inbound saturation (inbound_streams 134ms→~0) and the req/resp collapse (RawAppStreamSlotsFull 300→0/5m), but the SLOW drive iter breakdown then showed the cost moved to outbound = 655–822ms.

Root cause

drainGossipLane drained the whole bulk lane into zquic each tick, so a multi-MB block built a ~33 MB pending backlog whose encryption dominated the outbound drive phase → starved the priority (attestation) lane → attestations dropped → coverage decay to 21; and conns over-buffered to multi-second srtt → 60s no-ACK teardown churn.

Fix
Per-tick byte_budget on drainGossipLane: bulk (block) lane budgeted at 256 KiB; priority (attestation) lane unbounded. A block is fed over several ticks instead of monopolizing a lap, so the priority lane drains every tick. Budget is checked only at clean frame boundaries, preserving the one-frame-mid-flight /meshsub invariant. Recipients that miss a gossiped block re-sync via req/resp (fixed in v0.2.53).

Adversarial review (done)
No wire desync (break confined to frame boundaries), no wedge false-positive (progressing bulk resets the wedge timer), priority lane unaffected, forward progress guaranteed (incl. a bb==0 guard). Verdict: ship.

Build clean; 504/506 tests. Pure zig-libp2p.
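
For reviewers who want the shape of the budgeted drain without opening the diff, here is a minimal sketch in Python (not the Zig from this PR). It only illustrates the behaviour described under Fix: a per-tick byte budget that is consulted at clean frame boundaries and never while a frame is mid-flight. `Lane`, `drain_lane`, the `write` callback, and the "always start at least one frame per tick" rule are hypothetical choices made for this sketch; only the 256 KiB figure comes from the PR, and the actual `bb==0` guard may work differently.

```python
from collections import deque

BULK_BUDGET = 256 * 1024  # bulk (block) lane budget from the PR
UNBOUNDED = None          # priority (attestation) lane: no budget


class Lane:
    """Hypothetical stand-in for one gossip lane."""

    def __init__(self):
        self.frames = deque()  # whole frames not yet started on the wire
        self.partial = None    # unsent tail of a frame that is mid-flight


def drain_lane(lane, write, byte_budget):
    """Feed frames to write() for one tick; return bytes handed over.

    write(data) returns how many bytes the transport accepted, which
    may be fewer than len(data) when the transport is full.
    """
    sent = 0
    while True:
        if lane.partial is None:
            # Clean frame boundary: the only place the budget may stop us.
            if not lane.frames:
                break
            # "sent > 0" is this sketch's progress guard: even a zero
            # budget still lets one frame start each tick.
            if byte_budget is not None and sent > 0 and sent >= byte_budget:
                break
            lane.partial = lane.frames.popleft()
        n = write(lane.partial)
        sent += n
        if n < len(lane.partial):
            # Transport is full mid-frame. Keep the tail so the next tick
            # resumes this frame before anything else is written.
            lane.partial = lane.partial[n:]
            break
        lane.partial = None
    return sent
```

In this sketch the budget is soft: a frame that starts under budget is always completed, so a tick can overshoot by up to one frame. Calling `drain_lane(priority_lane, write, UNBOUNDED)` before `drain_lane(bulk_lane, write, BULK_BUDGET)` on each tick gives the ordering the description relies on, assuming the bulk backlog consists of frames that are each smaller than the budget.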