Skip to content

refactor(runtime): openSandboxRun — one tested run/stream/resume seam, two benchmark callers converged#177

Merged
drewstone merged 2 commits into
mainfrom
feat/unified-coding-run-facade
Jun 6, 2026
Merged

refactor(runtime): openSandboxRun — one tested run/stream/resume seam, two benchmark callers converged#177
drewstone merged 2 commits into
mainfrom
feat/unified-coding-run-facade

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

@drewstone drewstone commented Jun 6, 2026

What

openSandboxRun (renamed from openCodingRun) is the one harness- and domain-agnostic seam for running an agent in a sandbox over a persistent artifact: run it, stream it, resume the same server-side session across turns. It is a thin facade over already-hardened code — acquireSandbox (cold-start/502 recovery), buildBackendOptions (harness = backend.type), createSandboxLineage (start mints a session; resume continues it with a fail-loud assertSessionLive).

The one genuinely-new piece is Deliverable<Out>: it widens the pure OutputAdapter.parse(events) to also admit a post-turn read off the box FS — the structural gap that made the bench gates hand-roll box.fs.read, because a large produced file (a git diff, a generated document) truncates in the chat stream and a pure events-parser can't reach the workspace.

Why renamed (answering "is this needed / does it generalize / one per domain?")

It is not coding-specific and there is no per-domain copy. The domain lives entirely in the caller-supplied Deliverable<Out> (events-parse vs artifact-read). To prove that concretely this PR converges two different benchmark callers onto the one facade:

  • bench/src/commit0-gate.mts#runShot — commit0 Python repos, diff at /tmp/solution.patch
  • bench/src/worker.ts#solveShot — SWE-bench GitHub issues, diff at /tmp/solution.patch

Each previously hand-rolled the identical new Sandbox + acquireSandbox + stream-loop + try/read/catch + finally/delete dance. Both now call openSandboxRun(client, { agentRun, signal }, deliverable) and differ only in their Deliverable + prompt. Both also gain resume for free. That is the net-surface-reduction + generalization argument in one diff: research/tax/legal would each be a new Deliverable, not a new function.

Hardening (the PR's prior HIGH: untested public API)

  • box passed into settle() → the start-before-handle invariant is type-level, not call-order discipline (removes the handle! non-null assertion).
  • second start() throws (use resume()); resume()/getters before start() throw.
  • abort-aware artifact read (re-checks the signal before touching the box FS).
  • @experimental on every export.

Tests — tests/runtime/sandbox-run.test.ts (11)

events and artifact deliverables (the generalization proof), readError surfaced without throwing (fault ≠ empty deliverable), resume reuses box+session, fail-loud on a reaped session, and every lifecycle guardrail above.

Also

  • router-client: the "exhausted retries" terminal is now genuinely reachable — retryable statuses always continue to the loop's attempt bound, non-retryable fails fast immediately (was the dead post-loop throw the reviewer flagged). Dropped the dead readFile import.

Verification

pnpm run lint · pnpm run typecheck · pnpm run build · pnpm test (669 pass) · bench: tsc --noEmit — all green.

Deliberately NOT in scope

  • The worker-zoo collapse onto runExperiment + lifting routerChatWithUsage into the package — that's the one-flow-worker consolidation, its own PR.
  • worker.ts#solveShot preserves its exact create options (behavior-preserving refactor). If solve-one ever hits a ProviderAuthError, it likely needs the same OPENAI_* box env the commit0 path sets — flagged, not silently changed here.

…am/resume seam

The bench gates were each reinventing raw `new Sandbox` + ad-hoc streamPrompt and
hitting stream-drops/502s the runtime's own sandbox layer already handles — the root
of the "sandbox keeps breaking" thrash. This adds the ONE generalized seam (a ~120-LOC
facade, NOT a new layer) over the existing hardened pieces, plus the gate hardening
that rides the shared client.

- src/runtime/coding-run.ts: `openCodingRun<Out>(client, {agentRun, signal}, deliverable)`
  over acquireSandbox (cold-start/502 recovery) + buildBackendOptions (harness = the
  backend.type dial: opencode/codex/claude-code/kimi-code) + createSandboxLineage
  (start mints a session; resume = continue with fail-loud assertSessionLive). The one
  new type, `Deliverable<Out>` = {events}|{artifact,path}, widens the pure OutputAdapter
  to read a WORKSPACE-RELATIVE file off the box — the real fix for diffs that truncate
  in the chat stream (absolute /tmp reads fail "outside allowed roots"). Fault-isolated
  (TurnResult.readError), no custom replay (SDK owns it), no fork verb (probe-gated, absent).
- router-client.ts: retry transient 429/5xx with backoff + auto-handle the "temperature 1
  only" 400 (kimi-k2.6) — once, shared by every gate (no per-gate retry loops).
- aec-gate.ts: per-attempt fault isolation — infra failures EXCLUDED from stats (retried,
  never scored 0, never crash the multi-model run).
- commit0-gate.mts: untimed agentic rollouts (a tight cap was killing opencode mid-work)
  + a local cli-bridge fallback for when the sandbox is down.

Live smoke (workflow): provision + stream + resume on one box all work now; resume is real.
Verified: root typecheck/build clean; bench typecheck clean.
@tangletools
Copy link
Copy Markdown
Contributor

❌ Needs Work — 3357302c

Readiness 35/100 · Confidence 70/100 · 15 findings (1 high, 5 medium, 9 low)

deepseek glm aggregate
Readiness 35 53 35
Confidence 70 70 70
Correctness 35 53 35
Security 35 53 35
Testing 35 53 35
Architecture 35 53 35

Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH Zero test coverage for new public API module — src/runtime/coding-run.ts

The 129-line coding-run.ts module has no test file anywhere in the repo. It is a new top-level runtime export (re-exported from src/runtime/index.ts:148-154) with zero test coverage. The underlying sandbox-lineage has tests, but the facade's own behavior — start/resume ordering, close teardown, deliverable dispatch (events vs artifact), readError surfacing on FS failures, and edge cases (double start, resume before start) — is untested. Impact: regressions on this module won't be caught; the public API surface has no contract verification.

Other

🟠 MEDIUM sh() timeout=0 passed to spawn when SHOT_TIMEOUT_MS unset for local backend — bench/src/commit0-gate.mts

Line 213: timeoutMs: cfg.timeoutMs where cfg.timeoutMs is 0 for the default local backend (line 261). The sh() helper passes opts.timeoutMs to Node's spawn({ timeout }) when truthy (line 160: ...(opts.timeoutMs ? { timeout: opts.timeoutMs } : {})), so 0 correctly skips the timeout — the ternary guard handles this. No bug, but worth noting the intent is 'untimed' and the implementation is correct via

🟠 MEDIUM router-client.ts retry loop body already consumed on non-ok response — bench/src/router-client.ts

Lines 40-48: When res.ok is false, res.text() is called (line 48). If the temperature-1 path is hit (line 50), the loop continues and re-fetches correctly. But if a 429/5xx occurs on attempt 4 (the last), the check attempt < 4 (line 54) is false, so it falls through to `throw n

🟠 MEDIUM Fragile non-null assertion on mutable closure variable handle! — src/runtime/coding-run.ts

Line 94: raw = await handle!.box.fs.read(deliverable.path) uses a non-null assertion on handle which is a let variable in closure scope initialized to undefined. The assertion is correct today because settle() is only called from start() (after handle = r.handle is assigned on L118) and resume() (after the if (!handle) throw guard on L122). However, settle is defined at L85 BEFORE handle is ever assigned. The invariant is call-order discipline, not a type-level guard — if anyone adds a future call path to settle before handle is assigned, it silently crashes. Fix: pass handle as a parameter to settle or extract the handle read into t

🟠 MEDIUM No guard against calling start() twice — silent session loss — src/runtime/coding-run.ts

Line 117: handle = r.handle unconditionally overwrites the handle variable. Calling start() twice silently replaces the first session's handle (the second call creates a new box and session via lineage.start(), overwriting the closure-scoped handle). The first box remains tracked in the lineage's internal owned array and will be torn down on close(), but the caller loses the reference and cannot resume() it. A second start() should either throw (e.g. if (handle) throw new Error('CodingRun already started')) or be documented as idempotent/overwriting.

🟠 MEDIUM Zero test coverage for new public API — src/runtime/coding-run.ts

No test files reference openCodingRun, CodingRun, Deliverable, or TurnResult anywhere in the repo. This is a 129-line public entrypoint that orchestrates sandbox lifecycle (probe → lineage → start → settle → teardown) with non-trivial closure-over-mutable-state and an artifact-read-with-fallback path. At minimum: (1) start-then-settle-events path, (2) start-then-settle-artifact-read-success, (3) artifact-read-failure sets readError, (4) resume-before-start throws, (5) box/sessionId getters before start throw, (6) close-without-start is safe.

🟡 LOW Dead import: readFile unused in commit0-gate.mts — bench/src/commit0-gate.mts

Line 27: readFile is imported from node:fs/promises but never referenced in the file body. Remove it from the import to keep the dependency surface clean.

🟡 LOW commit0-gate local backend default model differs from sandbox backend — bench/src/commit0-gate.mts

Line 251: model = process.env.WORKER_MODEL ?? (backend === 'local' ? 'kimi-for-coding/kimi-k2-thinking' : 'gpt-4.1'). The two backends default to different models. This is intentional (local uses opencode's own auth; sandbox uses the router which has OpenAI), but it means the two backends are NOT interchangeable for the same experiment — switching backends also switches models unless WORKER_MODEL is explicitly set. No fix needed — the env var override is the escape hatch.

🟡 LOW runShotLocal sh() for git diff uses bash -c with interpolated paths — bench/src/commit0-gate.mts

Line 216: JSON.stringify provides safe escaping for shell injection (wraps in double-quotes, escapes internal quotes/backslashes). The staging is intentional — it captures agent edits to tracked files under srcDir. No security issue, but the git add side-effect (staging) modifies the index state. Acceptable for a benchmark harness. No fix needed.

🟡 LOW No test coverage for retry/fallback logic across all three files — bench/src/router-client.ts

None of router-client.ts, aec-gate.mts, or commit0-gate.mts have dedicated test files. The retry backoff, temperature-400 workaround, infraError propagation, and local-backend path (runShotLocal, including shell-command diff extraction) have zero automated test coverage. The benchmarks these scripts drive are their own validation, but a single transient-429 simulation or a local-backend smoke test would raise confidence that the two-tier retry and exclusion logic don't interact badly. Corpus-replay downstream tolerates missing score/valid fields (they're already optional), so the payload correctness risk is low — but the interaction risk (e.g. 15 total retries when both layers retry the same failure) is untested.

🟡 LOW Unreachable throw after retry loop in routerChatWithUsage — bench/src/router-client.ts

Line 60: throw new Error(\${lastErr} (exhausted retries)`)is dead code. The loop runs forattempt0–4 (5 total). The retryable-status path gates onattempt < 4, and the temperature-400 path fires once (temperature becomes 1). On iteration 4 (the 5th), neither continuepath is available — any non-ok status falls through to the in-loopthrowon [line 58](https://github.com/tangle-network/agent-runtime/blob/3357302c11c450631ecc55690b7bb312c554ef02/bench/src/router-client.ts#L58). The post-loopthrow` is never reached. Simplest fix: remove it (the in-loop throw already handles every non-ok exit).

🟡 LOW Artifact FS read not abort-aware — src/runtime/coding-run.ts

Line 94: raw = await handle!.box.fs.read(deliverable.path) does not receive the AbortSignal. If the caller's signal fires while the FS read is pending, the read continues until the platform timeout. The stream collection (for await in settle) IS signal-aware via the lineage's streamPrompt({ signal }), so only the post-drain FS read is affected. Low severity — sandbox FS reads are fast and platform-bounded. Fix: pass signal to read if the SDK supports it, or wrap in withTimeout(read, signal).

🟡 LOW Missing @experimental tags on public types — src/runtime/coding-run.ts

Every other type in src/runtime/ (SandboxLineageHandle, SandboxCapabilities, Driver, Iteration, etc.) carries an @experimental JSDoc tag. The new Deliverable, TurnResult, CodingRun, OpenCodingRunOptions, and openCodingRun do not. This is inconsistent with the rest of the runtime surface. Add @experimental to match repo convention.

🟡 LOW Type-hole cast as AgentRunSpec in start() — src/runtime/coding-run.ts

Line 117: options.agentRun as AgentRunSpec<unknown>. The OpenCodingRunOptions.agentRun is typed as AgentRunSpec<string> but lineage.start() expects AgentRunSpec<unknown>. The comment justifies that lineage.start only uses spec.profile and spec.sandboxOverrides (not taskToPrompt), so the cast is safe in practice. However, the type hole means if lineage.start() ever grows to use the task type parameter, it silently compiles with a wrong assumption. Fix: narrow the lineage contract or pass a derived subset of fields.

🟡 LOW User-controlled file path passed to box.fs.read() without client-side sanitization — src/runtime/coding-run.ts

Line 94: deliverable.path from Deliverable.kind === 'artifact' is passed directly to box.fs.read() with no client-side validation. The docs (L16-17) note the platform blocks absolute paths and enforces workspace-relative roots. However, there is no defense-in-depth client-side check for path traversal patterns (e.g. ../, ~, null bytes). If the platform's enforcement has a gap, this becomes an escape vector. Impact is low because the sandbox platform is the trust boundary, but client-side validation is cheap insurance.


tangletools · 2026-06-06T16:36:16Z · trace

Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ 1 Blocking Finding — 3357302c

Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T16:36:16Z · immutable trace

… converge two benchmark callers

The facade is harness- and domain-agnostic over `Deliverable<Out>`, not
coding-specific — rename it (CodingRun→SandboxRun, openCodingRun→openSandboxRun,
OpenCodingRunOptions→OpenSandboxRunOptions) so the name matches the seam. The
domain lives only in the caller's Deliverable; there is no per-domain copy.

Harden the public API (was untested, the PR's HIGH finding):
- pass `box` into `settle()` so the start-before-handle invariant is type-level,
  not call-order discipline (kills the `handle!` non-null assertion);
- guard against a second `start()` (use `resume()` to continue);
- fail loud on `resume()`/getter access before `start()`;
- abort-aware artifact read (re-check the signal before touching the box FS);
- `@experimental` on every export.

Add tests/runtime/sandbox-run.test.ts (11): events AND artifact deliverables
(the generalization proof — same facade, two output shapes), readError surfaced
without throwing (fault ≠ empty deliverable), resume reuses box+session, fail-loud
on a reaped session, and the lifecycle guardrails above.

Converge the two raw-sandbox bench callers onto it — commit0-gate#runShot and
worker.ts#solveShot — removing the duplicated `new Sandbox` + acquireSandbox +
stream-loop + try/read/catch + finally/delete dance from each. Two benchmarks
(commit0 Python repos, SWE-bench GitHub issues) on one facade, differing only in
Deliverable + prompt: the concrete answer to "does it generalize or is it one per
domain". Both callers also gain resume for free.

router-client: make the "exhausted retries" terminal genuinely reachable
(retryable statuses always continue to the loop's attempt bound; non-retryable
fails fast immediately) and drop the dead `readFile` import in commit0-gate.
@drewstone drewstone changed the title feat(runtime): openCodingRun — one hardened harness-agnostic run/stream/resume seam refactor(runtime): openSandboxRun — one tested run/stream/resume seam, two benchmark callers converged Jun 6, 2026
@drewstone drewstone merged commit b5225db into main Jun 6, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants