feat(bench): full pilot matrix snapshot + client timeout (#101) by Connorrmcd6 · Pull Request #115 · Connorrmcd6/surface

Connorrmcd6 · 2026-06-13T18:52:31Z

Commits the first full Surface agent-impact pilot as a reproducible snapshot, and hardens the runner against the hang that bit us mid-run. Closes #101 (and finishes milestone "Empirical validation of Surface").

The result

All 11 scenarios × C0–C3 × N=10 across haiku / sonnet / opus — 1320 calls, 0 errors, $13.98.
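The call count follows directly from the matrix dimensions. A minimal sketch of that arithmetic, assuming placeholder scenario ids (the PR states only that there are 11) and that every cell is run for every model:

```python
from itertools import product

# Hypothetical scenario ids; only the count (11) comes from the PR.
scenarios = [f"scenario_{i:02d}" for i in range(1, 12)]
conditions = ["C0", "C1", "C2", "C3"]
models = ["haiku", "sonnet", "opus"]
trials = range(10)  # N=10 repeats per cell

# One tuple per API call in the full matrix.
matrix = list(product(scenarios, conditions, models, trials))
calls_per_scenario = len(matrix) // len(scenarios)

print(len(matrix))         # 1320
print(calls_per_scenario)  # 120
```

The per-scenario figure of 120 is the same number the dataset integrity check relies on.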

Cascade family (the drifted dependency is hidden — the agent knows it only by doc):

Model	C1 stale (succ / misled)	C2 fresh	C3 surf report	H1 (C2−C1)	H3 (C3−C1)
haiku	0% / 100%	100%	90%	+100pp	+90pp
sonnet	0% / 100%	100%	100%	+100pp	+100pp
opus	0% / 100%	100%	100%	+100pp	+100pp

A stale doc about code the agent can't see makes every model wrong 100% of the time — and a more capable model is not more robust (opus = haiku here). Kills the "just use a better model" objection.
Fresh docs → 100% on all three; the surf report recovers the loss (full on sonnet/opus, 90% on haiku).
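The H1 and H3 columns are plain differences in percentage points against the stale-doc baseline C1. A small sketch that recomputes them from the success rates in the table above (the dictionary layout and function name are illustrative, not the harness's own):

```python
# Cascade-family success rates (percent), copied from the table above.
success = {
    "haiku": {"C1": 0, "C2": 100, "C3": 90},
    "sonnet": {"C1": 0, "C2": 100, "C3": 100},
    "opus": {"C1": 0, "C2": 100, "C3": 100},
}


def effects(rates):
    """Return (H1, H3) in percentage points relative to the stale-doc baseline C1."""
    return rates["C2"] - rates["C1"], rates["C3"] - rates["C1"]


for model, rates in success.items():
    h1, h3 = effects(rates)
    print(f"{model}: H1={h1:+d}pp H3={h3:+d}pp")
```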

Comprehension family (code visible): success sits near the 100% ceiling, but a stale doc still costs 57–107 extra output tokens per model. This is the wasted-token tax of rot you can see.

Two axes of damage: hidden dependency → correctness collapse; visible dependency → silent token tax. Full CIs in report.md / summary.json.

What's in the snapshot

bench/results/2026-06-13-pilot-full-matrix/ (force-added past .gitignore): raw.jsonl (1320 rows) · summary.json · report.md · success_{haiku,sonnet,opus}.png · run.json · PROVENANCE.md.

Harness fix (same PR)

The original run hung on a single API request that never returned — the client had no wall-clock timeout, so one bad request stalled the whole matrix. Added a 120s per-request timeout + retries; a hung call now fails over to an error row and the run continues. The completed rows were preserved and the unfinished scenarios re-run under the timeout — see PROVENANCE.md for the honest two-run accounting (same prompts/grading, clean merge, every scenario exactly 120 rows).
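The fail-over behaviour can be sketched roughly as below. This is not the harness's code: `send` is a hypothetical zero-argument callable that performs one request, the row keys are invented for illustration, and the 120s limit itself is assumed to be enforced by the client, which raises when it expires.

```python
import time


def run_call(send, retries=2, backoff_s=0.0):
    """Try `send()` up to 1 + retries times and always return a result row.

    `send` is a hypothetical callable that raises (for example TimeoutError)
    when the client's per-request timeout expires. The row keys are
    illustrative, not the harness's actual schema.
    """
    last_error = None
    for attempt in range(1 + retries):
        try:
            return {"output": send(), "error": None, "attempts": attempt + 1}
        except Exception as exc:  # sketch only: real code would catch narrower types
            last_error = f"{type(exc).__name__}: {exc}"
            time.sleep(backoff_s)
    # Every attempt failed: record an error row so the matrix can continue.
    return {"output": None, "error": last_error, "attempts": 1 + retries}


def always_hangs():
    # Stand-in for a request that exceeds the per-request timeout.
    raise TimeoutError("no response within 120s")


rows = [run_call(lambda: "ok"), run_call(always_hangs)]
print([row["error"] for row in rows])
```

The point of returning a row instead of raising is that one bad request costs a single cell of the matrix, not the whole run.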

Verification

cargo fmt --all --check clean · clippy -D warnings clean · cargo test --all all suites ok (bench is outside the Rust workspace; run for cleanliness).
Dataset integrity: 11 scenarios × 120 rows each, 0 duplicates, 0 errors (asserted during merge).
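A check along these lines could back the integrity claim. It is a sketch under assumptions: the field names (`scenario`, `condition`, `model`, `trial`, `error`) are hypothetical and may not match raw.jsonl, and the demo data is synthetic.

```python
import json
from collections import Counter


def check_integrity(lines, n_scenarios=11, rows_per_scenario=120):
    """Assert the merged dataset is complete, duplicate-free and error-free."""
    rows = [json.loads(line) for line in lines if line.strip()]
    # Hypothetical key identifying one call in the matrix.
    keys = [(r["scenario"], r["condition"], r["model"], r["trial"]) for r in rows]
    assert len(set(keys)) == len(keys), "duplicate rows after merge"
    assert all(r.get("error") is None for r in rows), "error rows present"
    per_scenario = Counter(r["scenario"] for r in rows)
    assert len(per_scenario) == n_scenarios, "missing or extra scenarios"
    assert set(per_scenario.values()) == {rows_per_scenario}, "uneven scenario counts"
    return len(rows)


# Synthetic stand-in for raw.jsonl with the same shape as the pilot matrix.
demo = [
    json.dumps({"scenario": s, "condition": c, "model": m, "trial": t, "error": None})
    for s in range(11)
    for c in ("C0", "C1", "C2", "C3")
    for m in ("haiku", "sonnet", "opus")
    for t in range(10)
]
print(check_integrity(demo))  # 1320
```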

Closes #101

🤖 Generated with Claude Code

Commit the first full Surface agent-impact pilot: all 11 scenarios × C0–C3 × N=10 across haiku/sonnet/opus (1320 calls, 0 errors, ~$13.98). Force-added past .gitignore as a curated results/<snapshot>/ with raw.jsonl, summary.json, report.md, per-model plots, and a PROVENANCE.md.
Headline: in the cascade family (the drifted dependency is hidden from the agent) a stale doc yields 0% success / 100% misled on *every* model, a fresh doc yields 100%, and the surf report recovers to 90% (haiku) / 100% (sonnet, opus). H1 = +100pp flat across the capability range, i.e. a more capable model is not more robust to rot when it cannot see the code. The comprehension family hits the success ceiling, but a stale doc still costs 57–107 extra output tokens per model.
Also harden the Anthropic client with a 120s per-request timeout + retries. The original matrix run hung on a single request that never returned (the SDK had no wall-clock cap short enough for a long unattended run), stalling the whole matrix; the completed rows were preserved and the unfinished scenarios were re-run under the timeout (see PROVENANCE.md). This prevents one bad request from taking the run hostage again.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

report.md is otherwise pure auto-generated tables; prepend a human read of the results so the committed snapshot is self-interpreting for the blog write-up — the two-axis story (hidden dependency -> correctness collapse, fresh docs and the surf report both fix it; visible code -> token tax), the flat +100pp H1 across haiku/sonnet/opus (a smarter model is not more rot-resistant), and the caveats. Marked as authored, since re-running surface_bench.report would regenerate the tables below it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…101) Replace the report's auto-generated tables with a complete, self-contained report so the committed snapshot is readable end to end (and a ready base for the blog): overview, hypotheses (H1/H2/H3), methodology (conditions, the two scenario families, models/grading/metrics, the 11 scenarios), the exact prompts given to agents (system prompt, user-turn structure, a worked C0-C3 example), results (per-family per-model tables with CIs + spend), interpretation, the learnings (the framing pivot, the hidden-dependency insight, the authoring-leak and timeout bugs we caught), and future work. Machine-readable metrics remain in summary.json. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…#101) report-eli5.md restates the pilot's findings without computer-science jargon for a non-technical AI user — a "handyman + building manual" analogy, the four paperwork conditions, the visible-vs-hidden distinction, the headline (stale docs break every model 100% of the time and a smarter model doesn't help, but flagging the rot fixes it), and the learnings/next-steps. Points to report.md for the numbers and methodology. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Section 4.2's prompt-structure template put triple-backtick fences inside a triple-backtick block, which mis-pairs: the ## lines rendered as headings, the <...> placeholders were eaten as HTML, and the final fence opened an unclosed block swallowing the rest of the doc. Wrap the template in a four-backtick fence so the inner fences are literal. report-eli5.md verified clean (no fences); tables in both checked for consistent columns. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
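The fix generalises: an outer fence has to be longer than any backtick run in the content it wraps, since a shorter inner fence cannot close it. A minimal sketch of choosing that fence length automatically; the helper names and the template text are made up for illustration and are not from report.md.

```python
import re


def fence_for(content, minimum=3):
    """Return a backtick fence longer than any backtick run inside `content`."""
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    return "`" * max(minimum, longest + 1)


def wrap(content):
    """Wrap `content` in a fence that its own inner fences cannot close."""
    fence = fence_for(content)
    return f"{fence}\n{content}\n{fence}"


inner = "`" * 3  # a literal triple-backtick fence inside the template
# Hypothetical template text standing in for the real prompt structure.
template = f"## System\n{inner}\n<system prompt>\n{inner}"
print(fence_for(template) == "`" * 4)  # True
```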

…101) Per review, the plain-English companion now calls the assistant a "model" (or "agent"), matching how the technical report names haiku/sonnet/opus, with a one-line gloss up top so a non-technical reader isn't lost. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

#101) The ELI5 version read as condescending. Rename report-eli5.md -> report-summary.md and rewrite in a clear, plain register for a smart non-technical reader: drop the extended handyman metaphor and cutesy section titles, keep the substance, define "tokens" inline, avoid statistics vocabulary. Still uses "model"/"agent", not "AI". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Decision-maker framing of Surface's value, kept honest by separating measured from assumed. Token savings from keeping docs fresh are real but small (~$0.30–$1.60 per 1,000 tasks, and a floor since this is single-shot); in the hidden-code case stale docs are even slightly *cheaper* on tokens, so token accounting understates the value. The dominant term is avoided wrong work: without Surface the model shipped a wrong change on 100% of tasks that relied on a drifted, unverifiable dependency vs ~0% with it — expressed as a transparent formula (task volume × exposure share × failure-rate drop × remediation cost) with a clearly-labelled illustrative example. report.md gets a full §7 (sections renumbered); report-summary.md gets a plain-language "What it's worth". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
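The avoided-wrong-work formula is a straight product of four terms. The sketch below shows its shape only: every input value is invented for illustration and none comes from the report, apart from the failure-rate drop, which mirrors the roughly 100% to 0% change described above.

```python
def avoided_wrong_work(task_volume, exposure_share, failure_rate_drop, remediation_cost):
    """task volume x exposure share x failure-rate drop x remediation cost."""
    return task_volume * exposure_share * failure_rate_drop * remediation_cost


# ILLUSTRATIVE inputs only; these are assumptions, not measured values.
value = avoided_wrong_work(
    task_volume=1_000,       # tasks in the period (assumed)
    exposure_share=0.05,     # share relying on a drifted, unverifiable dependency (assumed)
    failure_rate_drop=1.0,   # ~100% wrong without Surface vs ~0% with it
    remediation_cost=50.0,   # cost to catch and fix one wrong change (assumed)
)
print(f"${value:,.2f}")  # $2,500.00
```

Because the terms multiply, the result scales linearly with each one, so the exposure share and the remediation cost are the inputs most worth measuring before relying on the number.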

…101) The old plots were one bar chart per model averaged across all scenarios, which blended the two families and buried the headline (cascade's 0%->100% diluted under the comprehension ceiling). Rewrite maybe_plot to be standalone-readable: split by family (cascade on success, comprehension on output tokens), plain-English condition labels (No docs / Stale docs / Fresh docs (Surface) / Stale + Surface report), value annotations, capability-ordered models, and self-explanatory titles. Emits overview.png (two-panel summary) and cascade_success.png (headline chart); benefits every future run too. Regenerate the snapshot figures, embed overview.png in report.md and report-summary.md so the reports are visual, update file references, and drop the old success_<model>.png charts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Connorrmcd6 and others added 9 commits June 13, 2026 20:52

Connorrmcd6 merged commit 7edfeff into main Jun 13, 2026
5 checks passed

Connorrmcd6 merged 9 commits into main from chore/101-matrix-snapshot
