refactor(runtime): openSandboxRun — one tested run/stream/resume seam, two benchmark callers converged#177
Conversation
…am/resume seam
The bench gates were each reinventing raw `new Sandbox` + ad-hoc streamPrompt and
hitting stream-drops/502s the runtime's own sandbox layer already handles — the root
of the "sandbox keeps breaking" thrash. This adds the ONE generalized seam (a ~120-LOC
facade, NOT a new layer) over the existing hardened pieces, plus the gate hardening
that rides the shared client.
- src/runtime/coding-run.ts: `openCodingRun<Out>(client, {agentRun, signal}, deliverable)`
over acquireSandbox (cold-start/502 recovery) + buildBackendOptions (harness = the
backend.type dial: opencode/codex/claude-code/kimi-code) + createSandboxLineage
(start mints a session; resume = continue with fail-loud assertSessionLive). The one
new type, `Deliverable<Out>` = {events}|{artifact,path}, widens the pure OutputAdapter
to read a WORKSPACE-RELATIVE file off the box — the real fix for diffs that truncate
in the chat stream (absolute /tmp reads fail "outside allowed roots"). Fault-isolated
(TurnResult.readError), no custom replay (SDK owns it), no fork verb (probe-gated, absent).
- router-client.ts: retry transient 429/5xx with backoff + auto-handle the "temperature 1
only" 400 (kimi-k2.6) — once, shared by every gate (no per-gate retry loops).
- aec-gate.ts: per-attempt fault isolation — infra failures EXCLUDED from stats (retried,
never scored 0, never crash the multi-model run).
- commit0-gate.mts: untimed agentic rollouts (a tight cap was killing opencode mid-work)
+ a local cli-bridge fallback for when the sandbox is down.
Live smoke (workflow): provision + stream + resume on one box all work now; resume is real.
Verified: root typecheck/build clean; bench typecheck clean.
❌ Needs Work —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 35 | 53 | 35 |
| Confidence | 70 | 70 | 70 |
| Correctness | 35 | 53 | 35 |
| Security | 35 | 53 | 35 |
| Testing | 35 | 53 | 35 |
| Architecture | 35 | 53 | 35 |
Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH Zero test coverage for new public API module — src/runtime/coding-run.ts
The 129-line coding-run.ts module has no test file anywhere in the repo. It is a new top-level runtime export (re-exported from src/runtime/index.ts:148-154) with zero test coverage. The underlying sandbox-lineage has tests, but the facade's own behavior — start/resume ordering, close teardown, deliverable dispatch (events vs artifact), readError surfacing on FS failures, and edge cases (double start, resume before start) — is untested. Impact: regressions on this module won't be caught; the public API surface has no contract verification.
Other
🟠 MEDIUM sh() timeout=0 passed to spawn when SHOT_TIMEOUT_MS unset for local backend — bench/src/commit0-gate.mts
Line 213:
timeoutMs: cfg.timeoutMswhere cfg.timeoutMs is 0 for the default local backend (line 261). Thesh()helper passesopts.timeoutMsto Node'sspawn({ timeout })when truthy (line 160:...(opts.timeoutMs ? { timeout: opts.timeoutMs } : {})), so 0 correctly skips the timeout — the ternary guard handles this. No bug, but worth noting the intent is 'untimed' and the implementation is correct via
🟠 MEDIUM router-client.ts retry loop body already consumed on non-ok response — bench/src/router-client.ts
Lines 40-48: When
res.okis false,res.text()is called (line 48). If the temperature-1 path is hit (line 50), the loop continues and re-fetches correctly. But if a 429/5xx occurs on attempt 4 (the last), the checkattempt < 4(line 54) is false, so it falls through to `throw n
🟠 MEDIUM Fragile non-null assertion on mutable closure variable handle! — src/runtime/coding-run.ts
Line 94:
raw = await handle!.box.fs.read(deliverable.path)uses a non-null assertion onhandlewhich is aletvariable in closure scope initialized toundefined. The assertion is correct today becausesettle()is only called fromstart()(afterhandle = r.handleis assigned on L118) andresume()(after theif (!handle) throwguard on L122). However,settleis defined at L85 BEFOREhandleis ever assigned. The invariant is call-order discipline, not a type-level guard — if anyone adds a future call path tosettlebeforehandleis assigned, it silently crashes. Fix: passhandleas a parameter tosettleor extract thehandleread into t
🟠 MEDIUM No guard against calling start() twice — silent session loss — src/runtime/coding-run.ts
Line 117:
handle = r.handleunconditionally overwrites thehandlevariable. Callingstart()twice silently replaces the first session's handle (the second call creates a new box and session vialineage.start(), overwriting the closure-scopedhandle). The first box remains tracked in the lineage's internalownedarray and will be torn down onclose(), but the caller loses the reference and cannotresume()it. A secondstart()should either throw (e.g.if (handle) throw new Error('CodingRun already started')) or be documented as idempotent/overwriting.
🟠 MEDIUM Zero test coverage for new public API — src/runtime/coding-run.ts
No test files reference
openCodingRun,CodingRun,Deliverable, orTurnResultanywhere in the repo. This is a 129-line public entrypoint that orchestrates sandbox lifecycle (probe → lineage → start → settle → teardown) with non-trivial closure-over-mutable-state and an artifact-read-with-fallback path. At minimum: (1) start-then-settle-events path, (2) start-then-settle-artifact-read-success, (3) artifact-read-failure sets readError, (4) resume-before-start throws, (5) box/sessionId getters before start throw, (6) close-without-start is safe.
🟡 LOW Dead import: readFile unused in commit0-gate.mts — bench/src/commit0-gate.mts
Line 27:
readFileis imported fromnode:fs/promisesbut never referenced in the file body. Remove it from the import to keep the dependency surface clean.
🟡 LOW commit0-gate local backend default model differs from sandbox backend — bench/src/commit0-gate.mts
Line 251:
model = process.env.WORKER_MODEL ?? (backend === 'local' ? 'kimi-for-coding/kimi-k2-thinking' : 'gpt-4.1'). The two backends default to different models. This is intentional (local uses opencode's own auth; sandbox uses the router which has OpenAI), but it means the two backends are NOT interchangeable for the same experiment — switching backends also switches models unless WORKER_MODEL is explicitly set. No fix needed — the env var override is the escape hatch.
🟡 LOW runShotLocal sh() for git diff uses bash -c with interpolated paths — bench/src/commit0-gate.mts
Line 216: JSON.stringify provides safe escaping for shell injection (wraps in double-quotes, escapes internal quotes/backslashes). The staging is intentional — it captures agent edits to tracked files under srcDir. No security issue, but the
git addside-effect (staging) modifies the index state. Acceptable for a benchmark harness. No fix needed.
🟡 LOW No test coverage for retry/fallback logic across all three files — bench/src/router-client.ts
None of
router-client.ts,aec-gate.mts, orcommit0-gate.mtshave dedicated test files. The retry backoff, temperature-400 workaround, infraError propagation, and local-backend path (runShotLocal, including shell-command diff extraction) have zero automated test coverage. The benchmarks these scripts drive are their own validation, but a single transient-429 simulation or a local-backend smoke test would raise confidence that the two-tier retry and exclusion logic don't interact badly. Corpus-replay downstream tolerates missing score/valid fields (they're already optional), so the payload correctness risk is low — but the interaction risk (e.g. 15 total retries when both layers retry the same failure) is untested.
🟡 LOW Unreachable throw after retry loop in routerChatWithUsage — bench/src/router-client.ts
Line 60:
throw new Error(\${lastErr} (exhausted retries)`)is dead code. The loop runs forattempt0–4 (5 total). The retryable-status path gates onattempt < 4, and the temperature-400 path fires once (temperature becomes 1). On iteration 4 (the 5th), neithercontinuepath is available — any non-ok status falls through to the in-loopthrowon [line 58](https://github.com/tangle-network/agent-runtime/blob/3357302c11c450631ecc55690b7bb312c554ef02/bench/src/router-client.ts#L58). The post-loopthrow` is never reached. Simplest fix: remove it (the in-loop throw already handles every non-ok exit).
🟡 LOW Artifact FS read not abort-aware — src/runtime/coding-run.ts
Line 94:
raw = await handle!.box.fs.read(deliverable.path)does not receive the AbortSignal. If the caller's signal fires while the FS read is pending, the read continues until the platform timeout. The stream collection (for awaitin settle) IS signal-aware via the lineage's streamPrompt({ signal }), so only the post-drain FS read is affected. Low severity — sandbox FS reads are fast and platform-bounded. Fix: pass signal to read if the SDK supports it, or wrap inwithTimeout(read, signal).
🟡 LOW Missing @experimental tags on public types — src/runtime/coding-run.ts
Every other type in src/runtime/ (SandboxLineageHandle, SandboxCapabilities, Driver, Iteration, etc.) carries an
@experimentalJSDoc tag. The newDeliverable,TurnResult,CodingRun,OpenCodingRunOptions, andopenCodingRundo not. This is inconsistent with the rest of the runtime surface. Add@experimentalto match repo convention.
🟡 LOW Type-hole cast as AgentRunSpec in start() — src/runtime/coding-run.ts
Line 117:
options.agentRun as AgentRunSpec<unknown>. TheOpenCodingRunOptions.agentRunis typed asAgentRunSpec<string>butlineage.start()expectsAgentRunSpec<unknown>. The comment justifies thatlineage.startonly usesspec.profileandspec.sandboxOverrides(nottaskToPrompt), so the cast is safe in practice. However, the type hole means iflineage.start()ever grows to use the task type parameter, it silently compiles with a wrong assumption. Fix: narrow the lineage contract or pass a derived subset of fields.
🟡 LOW User-controlled file path passed to box.fs.read() without client-side sanitization — src/runtime/coding-run.ts
Line 94:
deliverable.pathfromDeliverable.kind === 'artifact'is passed directly tobox.fs.read()with no client-side validation. The docs (L16-17) note the platform blocks absolute paths and enforces workspace-relative roots. However, there is no defense-in-depth client-side check for path traversal patterns (e.g.../,~, null bytes). If the platform's enforcement has a gap, this becomes an escape vector. Impact is low because the sandbox platform is the trust boundary, but client-side validation is cheap insurance.
tangletools · 2026-06-06T16:36:16Z · trace
tangletools
left a comment
There was a problem hiding this comment.
❌ 1 Blocking Finding — 3357302c
Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 5 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-06T16:36:16Z · immutable trace
… converge two benchmark callers The facade is harness- and domain-agnostic over `Deliverable<Out>`, not coding-specific — rename it (CodingRun→SandboxRun, openCodingRun→openSandboxRun, OpenCodingRunOptions→OpenSandboxRunOptions) so the name matches the seam. The domain lives only in the caller's Deliverable; there is no per-domain copy. Harden the public API (was untested, the PR's HIGH finding): - pass `box` into `settle()` so the start-before-handle invariant is type-level, not call-order discipline (kills the `handle!` non-null assertion); - guard against a second `start()` (use `resume()` to continue); - fail loud on `resume()`/getter access before `start()`; - abort-aware artifact read (re-check the signal before touching the box FS); - `@experimental` on every export. Add tests/runtime/sandbox-run.test.ts (11): events AND artifact deliverables (the generalization proof — same facade, two output shapes), readError surfaced without throwing (fault ≠ empty deliverable), resume reuses box+session, fail-loud on a reaped session, and the lifecycle guardrails above. Converge the two raw-sandbox bench callers onto it — commit0-gate#runShot and worker.ts#solveShot — removing the duplicated `new Sandbox` + acquireSandbox + stream-loop + try/read/catch + finally/delete dance from each. Two benchmarks (commit0 Python repos, SWE-bench GitHub issues) on one facade, differing only in Deliverable + prompt: the concrete answer to "does it generalize or is it one per domain". Both callers also gain resume for free. router-client: make the "exhausted retries" terminal genuinely reachable (retryable statuses always continue to the loop's attempt bound; non-retryable fails fast immediately) and drop the dead `readFile` import in commit0-gate.
What
openSandboxRun(renamed fromopenCodingRun) is the one harness- and domain-agnostic seam for running an agent in a sandbox over a persistent artifact: run it, stream it, resume the same server-side session across turns. It is a thin facade over already-hardened code —acquireSandbox(cold-start/502 recovery),buildBackendOptions(harness =backend.type),createSandboxLineage(startmints a session;resumecontinues it with a fail-loudassertSessionLive).The one genuinely-new piece is
Deliverable<Out>: it widens the pureOutputAdapter.parse(events)to also admit a post-turn read off the box FS — the structural gap that made the bench gates hand-rollbox.fs.read, because a large produced file (a git diff, a generated document) truncates in the chat stream and a pure events-parser can't reach the workspace.Why renamed (answering "is this needed / does it generalize / one per domain?")
It is not coding-specific and there is no per-domain copy. The domain lives entirely in the caller-supplied
Deliverable<Out>(events-parse vs artifact-read). To prove that concretely this PR converges two different benchmark callers onto the one facade:bench/src/commit0-gate.mts#runShot— commit0 Python repos, diff at/tmp/solution.patchbench/src/worker.ts#solveShot— SWE-bench GitHub issues, diff at/tmp/solution.patchEach previously hand-rolled the identical
new Sandbox+acquireSandbox+ stream-loop +try/read/catch+finally/deletedance. Both now callopenSandboxRun(client, { agentRun, signal }, deliverable)and differ only in theirDeliverable+ prompt. Both also gainresumefor free. That is the net-surface-reduction + generalization argument in one diff: research/tax/legal would each be a newDeliverable, not a new function.Hardening (the PR's prior HIGH: untested public API)
boxpassed intosettle()→ the start-before-handle invariant is type-level, not call-order discipline (removes thehandle!non-null assertion).start()throws (useresume());resume()/getters beforestart()throw.@experimentalon every export.Tests —
tests/runtime/sandbox-run.test.ts(11)events and artifact deliverables (the generalization proof),
readErrorsurfaced without throwing (fault ≠ empty deliverable),resumereuses box+session, fail-loud on a reaped session, and every lifecycle guardrail above.Also
router-client: the "exhausted retries" terminal is now genuinely reachable — retryable statuses always continue to the loop's attempt bound, non-retryable fails fast immediately (was the dead post-loop throw the reviewer flagged). Dropped the deadreadFileimport.Verification
pnpm run lint·pnpm run typecheck·pnpm run build·pnpm test(669 pass) ·bench: tsc --noEmit— all green.Deliberately NOT in scope
runExperiment+ liftingrouterChatWithUsageinto the package — that's the one-flow-worker consolidation, its own PR.worker.ts#solveShotpreserves its exact create options (behavior-preserving refactor). Ifsolve-oneever hits aProviderAuthError, it likely needs the sameOPENAI_*box env the commit0 path sets — flagged, not silently changed here.