fix(loops): survive cold-start sandbox acquisition (don't re-POST-thrash)#185

Merged: drewstone merged 1 commit into main from fix/sandbox-acquire-coldstart on Jun 6, 2026
@drewstone
Contributor

Problem

Prod runs with 0 warm boxes (the ContainerPool floor is disabled in deploy; that is a separate orchestrator fix), so every acquisition cold-scales from zero. The SDK create request times out (~30s) before the orchestrator finishes provisioning the named box. The recovery path scanned list() once, missed the still-provisioning row, then re-POSTed a fresh cold provision on every backoff. Each re-POST restarted the same wall, so acquisition never converged within the 600s budget and failed with 'could not acquire a running sandbox within budget'. This blocked the eyes-present self-improvement proof.

Fix (src/runtime/sandbox-acquire.ts)

After a retryable create error, poll list() across a short window (appearScans=5 × pollMs) for the named box to appear, and attach to it — the orchestrator usually accepted the create and the row shows up seconds later. Only re-create if it genuinely never appears (true rollback). Turns a cold acquire from a hard failure into a (slower) success — defense-in-depth that can unblock batch runs without touching the prod orchestrator.
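
The recovery flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/runtime/sandbox-acquire.ts: the `SandboxClient` interface, the `AcquireOptions` shape, and the `isRetryable`, `sleep`, and `now` hooks are all hypothetical names introduced here, and the sketch attaches as soon as the named row appears rather than waiting for it to reach a running state. Only `appearScans`, `pollMs`, the use of list(), and the error message come from this PR.

```typescript
// Hypothetical shapes for illustration; the real SDK types may differ.
interface Sandbox {
  name: string;
  status: string;
}

interface SandboxClient {
  create(name: string): Promise<Sandbox>;
  list(): Promise<Sandbox[]>;
}

interface AcquireOptions {
  appearScans: number; // list() scans to wait for the named box to appear
  pollMs: number; // delay between scans
  budgetMs: number; // overall acquisition budget
  isRetryable: (err: unknown) => boolean; // assumed classifier for create errors
  sleep: (ms: number) => Promise<void>; // injectable so tests can use a fake clock
  now: () => number; // injectable clock
}

async function acquireSandbox(
  client: SandboxClient,
  name: string,
  opts: AcquireOptions,
): Promise<Sandbox> {
  const deadline = opts.now() + opts.budgetMs;

  while (opts.now() < deadline) {
    try {
      return await client.create(name);
    } catch (err) {
      // Non-retryable errors surface immediately.
      if (!opts.isRetryable(err)) throw err;
    }

    // The create request may have timed out client-side while the
    // orchestrator accepted it. Scan list() across a short window for the
    // named box before issuing another create.
    for (let scan = 0; scan < opts.appearScans; scan++) {
      const boxes = await client.list();
      const found = boxes.find((box) => box.name === name);
      if (found) return found; // attach to the in-flight box
      await opts.sleep(opts.pollMs);
    }
    // The box never appeared within the window: treat it as a rollback and
    // fall through to re-create on the next loop iteration.
  }

  throw new Error("could not acquire a running sandbox within budget");
}
```

Because the first scan runs before any sleep, a box that is already listed is returned on that scan, which matches the prior single-scan happy path. Injecting `sleep` and `now` lets a test advance a fake clock instead of waiting in real time.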

Validation

  • tsc clean; loops suite 143/143 (acquire 9/9). Happy path (box on first scan) byte-identical to the prior single-scan. Updated the one timing-pinned test to a budget that fits the new scan-window (fake clock → still instant).

Scope / the durable companion

This is the kernel-side half. The primary fix is the orchestrator warm-box floor (enable + host-aware ContainerPool in agent-dev-container deploy.yml/generate-env.sh) so acquisitions hit a ready box instead of cold-provisioning. That's a HIGH-risk prod-deploy (ship Part A alone in multi-host → trips broken → silently 0 warm) — staged separately, not in this PR.

…light box, don't re-POST-thrash)

On a cold scale-from-zero (0 warm boxes), the SDK create request times out (~30s)
before the orchestrator finishes provisioning the NAMED box. The recovery scanned
list() ONCE — missing the still-provisioning row — then re-POSTed a fresh cold
provision every backoff, restarting the same wall and never converging within the
600s budget ('could not acquire a running sandbox within budget').

Fix: after a retryable create error, poll list() across a short window
(appearScans=5 × pollMs) for the named box to APPEAR, and attach to it — the
orchestrator usually accepted the create and the row shows up seconds later. Only
re-create if it truly never appears (genuine rollback). Turns a cold acquire from
a hard failure into a (slower) success without touching the orchestrator.

Defense-in-depth for the warm-pool-disabled prod regime; the durable fix is the
ContainerPool warm-box floor (separate, orchestrator-side).

- typecheck clean; loops suite 143/143 (acquire 9/9). Default behavior preserved
  when the box appears on the first scan (the prior single-scan happy path).
@drewstone drewstone merged commit 7245096 into main Jun 6, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 6, 2026
…de-dup steer-firewall (#187)

* fix(runtime): recover orphaned read-retry + provision-retry hardenings (reconciled with #185)

* refactor(bench): extract shared stats.mts + buildRunRecordFromAttempts; configurable worker provider (dedup the gate zoo)

* refactor(runtime): de-dup steer-firewall to one site; drop dead analyst-driver-hook; document canonical atom