fix(loops): survive cold-start sandbox acquisition (don't re-POST-thrash) by drewstone · Pull Request #185 · tangle-network/agent-runtime

drewstone · 2026-06-06T21:25:43Z

Problem

Prod runs with 0 warm boxes (the ContainerPool floor is disabled in deploy; that is a separate orchestrator fix), so every acquisition cold-scales from zero. The SDK create request times out (~30s) before the orchestrator finishes provisioning the named box. The recovery path scanned list() once, missed the still-provisioning row, then re-POSTed a fresh cold provision on every backoff. Each re-POST restarted the same wall, so acquisition never converged within the 600s budget and failed with `could not acquire a running sandbox within budget`. This blocked the eyes-present self-improvement proof.

Fix (`src/runtime/sandbox-acquire.ts`)

After a retryable create error, poll list() across a short window (appearScans=5 × pollMs) for the named box to appear, and attach to it: the orchestrator usually accepted the create, and the row shows up seconds later. Only re-create if the box genuinely never appears (true rollback). This turns a cold acquire from a hard failure into a (slower) success. It is defense-in-depth that can unblock batch runs without touching the prod orchestrator.
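The poll-then-attach recovery described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/runtime/sandbox-acquire.ts`: the `Sandbox`, `SandboxClient`, and `AcquireOptions` shapes, the `acquireSandbox` name, the `maxCreates` cap (a stand-in for the wall-clock budget), and the 2000 ms default for `pollMs` are all assumptions. Only `appearScans=5`, `pollMs`, and the error text come from the PR. The sketch returns the row as soon as it appears and leaves out any wait for the box to reach a running state.

```typescript
// Hypothetical shapes for illustration; the real SDK types may differ.
interface Sandbox {
  name: string;
  status: "provisioning" | "running";
}

interface SandboxClient {
  create(name: string): Promise<Sandbox>;
  list(): Promise<Sandbox[]>;
}

interface AcquireOptions {
  appearScans?: number; // list() scans after a retryable create error
  pollMs?: number; // delay between scans (default here is an assumption)
  maxCreates?: number; // assumed stand-in for the wall-clock budget
  isRetryable?: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests can use a fake clock
}

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function acquireSandbox(
  client: SandboxClient,
  name: string,
  opts: AcquireOptions = {},
): Promise<Sandbox> {
  const {
    appearScans = 5,
    pollMs = 2000,
    maxCreates = 3,
    isRetryable = () => true,
    sleep = realSleep,
  } = opts;

  for (let attempt = 0; attempt < maxCreates; attempt++) {
    try {
      return await client.create(name);
    } catch (err) {
      if (!isRetryable(err)) throw err;
    }
    // The create request may have timed out after the orchestrator accepted it,
    // so look for the named row across a short window before re-POSTing.
    for (let scan = 0; scan < appearScans; scan++) {
      const boxes = await client.list();
      const found = boxes.find((box) => box.name === name);
      if (found) return found; // attach instead of provisioning again
      if (scan < appearScans - 1) await sleep(pollMs);
    }
    // The box never appeared (treated as a rollback): fall through and re-create.
  }
  throw new Error("could not acquire a running sandbox within budget");
}
```

Because the first scan runs immediately after the failed create, a box that is already listed is returned without any sleep, which matches the single-scan happy path. Injecting `sleep` keeps the window testable without real delays.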

Validation

tsc clean; loops suite 143/143 (acquire 9/9). The happy path (box on first scan) is byte-identical to the prior single-scan behavior. Updated the one timing-pinned test to a budget that fits the new scan window (fake clock, so still instant).
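To illustrate why a timing-pinned test needs a budget at least as large as the scan window, here is a small sketch with a fake clock. `FakeClock`, `scanWindowMs`, and `pollUntil` are hypothetical names, not the helpers used in the loops suite, and the window formula assumes one sleep of `pollMs` after every scan.

```typescript
// Hypothetical fake clock: sleeping advances virtual time and returns immediately.
class FakeClock {
  private t = 0;
  now(): number {
    return this.t;
  }
  async sleep(ms: number): Promise<void> {
    this.t += ms;
  }
}

// Worst-case virtual time one appear window consumes (assumed: a sleep after each scan).
function scanWindowMs(appearScans: number, pollMs: number): number {
  return appearScans * pollMs;
}

// Poll `check` every `pollMs` until it passes or the budget is exhausted.
async function pollUntil(
  clock: FakeClock,
  budgetMs: number,
  pollMs: number,
  check: () => boolean,
): Promise<boolean> {
  const deadline = clock.now() + budgetMs;
  while (clock.now() <= deadline) {
    if (check()) return true;
    await clock.sleep(pollMs);
  }
  return false;
}
```

With `appearScans = 5` and `pollMs = 1000`, a box that only shows up on the fifth check is missed under a 3000 ms budget but found under a budget of `scanWindowMs(5, 1000)`. In both cases the test finishes without waiting, since only virtual time advances.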

Scope / the durable companion

This is the kernel-side half. The primary fix is the orchestrator warm-box floor (enable + host-aware ContainerPool in agent-dev-container deploy.yml/generate-env.sh) so acquisitions hit a ready box instead of cold-provisioning. That's a HIGH-risk prod deploy (shipping Part A alone in multi-host → trips broken → silently 0 warm), so it is staged separately, not in this PR.

…light box, don't re-POST-thrash)

On a cold scale-from-zero (0 warm boxes), the SDK create request times out (~30s) before the orchestrator finishes provisioning the NAMED box. The recovery scanned list() ONCE, missing the still-provisioning row, then re-POSTed a fresh cold provision every backoff, restarting the same wall and never converging within the 600s budget ('could not acquire a running sandbox within budget').

Fix: after a retryable create error, poll list() across a short window (appearScans=5 × pollMs) for the named box to APPEAR, and attach to it: the orchestrator usually accepted the create and the row shows up seconds later. Only re-create if it truly never appears (genuine rollback). Turns a cold acquire from a hard failure into a (slower) success without touching the orchestrator. Defense-in-depth for the warm-pool-disabled prod regime; the durable fix is the ContainerPool warm-box floor (separate, orchestrator-side).

- typecheck clean; loops suite 143/143 (acquire 9/9). Default behavior preserved when the box appears on the first scan (the prior single-scan happy path).

…de-dup steer-firewall (#187)

* fix(runtime): recover orphaned read-retry + provision-retry hardenings (reconciled with #185)
* refactor(bench): extract shared stats.mts + buildRunRecordFromAttempts; configurable worker provider (dedup the gate zoo)
* refactor(runtime): de-dup steer-firewall to one site; drop dead analyst-driver-hook; document canonical atom

drewstone merged commit 7245096 into main Jun 6, 2026
1 check passed

drewstone mentioned this pull request Jun 6, 2026

chore(deep-clean): recover orphaned hardenings + dedup bench gates + de-dup steer-firewall #187

Merged

drewstone merged 1 commit into main from fix/sandbox-acquire-coldstart
