feat(bench): full pilot matrix snapshot + client timeout (#101) #115
Merged
Commit the first full Surface agent-impact pilot: all 11 scenarios × C0–C3 × N=10 across haiku/sonnet/opus (1320 calls, 0 errors, ~$13.98). Force-added past .gitignore as a curated results/<snapshot>/ with raw.jsonl, summary.json, report.md, per-model plots, and a PROVENANCE.md.

Headline: in the cascade family (the drifted dependency is hidden from the agent) a stale doc yields 0% success / 100% misled on *every* model, a fresh doc 100%, and the surf report recovers to 90% (haiku) / 100% (sonnet, opus). H1 = +100pp flat across the capability range, i.e. a more capable model is not more robust to rot when it cannot see the code. The comprehension family ceilings on success but a stale doc still costs +57–107 extra output tokens per model.

Also harden the Anthropic client with a 120s per-request timeout + retries. The original matrix run hung on a single request that never returned (the SDK had no wall-clock cap short enough for a long unattended run), stalling the whole matrix; the completed rows were preserved and the unfinished scenarios re-run under the timeout (see PROVENANCE.md). Prevents one bad request from taking the run hostage again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
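The PR page does not show the client change itself, so the following is only a minimal sketch of the idea described above: cap each call with a wall-clock timeout, retry a bounded number of times, and fall back to an error row instead of blocking the run. The name `call_with_timeout` and the row fields are hypothetical, and the sketch uses only the standard library rather than the actual SDK options.

```python
import concurrent.futures


def call_with_timeout(fn, timeout_s=120.0, max_retries=2):
    """Run fn() under a wall-clock cap; return a result row or an error row.

    Hypothetical helper: the names and row shape are illustrative only.
    """
    last_error = "unknown"
    for attempt in range(max_retries + 1):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn)
        try:
            value = future.result(timeout=timeout_s)
            return {"ok": True, "value": value, "attempts": attempt + 1}
        except concurrent.futures.TimeoutError:
            last_error = "timeout"
        except Exception as exc:  # any other failure also becomes an error row
            last_error = f"{type(exc).__name__}: {exc}"
        finally:
            # Do not wait for a hung worker; the abandoned thread keeps
            # running in the background until fn() itself returns.
            pool.shutdown(wait=False)
    return {"ok": False, "error": last_error, "attempts": max_retries + 1}
```

A thread cannot be killed from outside, so this wrapper only stops the caller from waiting. A real client would more likely rely on the HTTP library's own request timeout so the underlying connection is actually closed.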
report.md is otherwise pure auto-generated tables; prepend a human read of the results so the committed snapshot is self-interpreting for the blog write-up — the two-axis story (hidden dependency -> correctness collapse, fresh docs and the surf report both fix it; visible code -> token tax), the flat +100pp H1 across haiku/sonnet/opus (a smarter model is not more rot-resistant), and the caveats. Marked as authored, since re-running surface_bench.report would regenerate the tables below it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…101) Replace the report's auto-generated tables with a complete, self-contained report so the committed snapshot is readable end to end (and a ready base for the blog): overview, hypotheses (H1/H2/H3), methodology (conditions, the two scenario families, models/grading/metrics, the 11 scenarios), the exact prompts given to agents (system prompt, user-turn structure, a worked C0-C3 example), results (per-family per-model tables with CIs + spend), interpretation, the learnings (the framing pivot, the hidden-dependency insight, the authoring-leak and timeout bugs we caught), and future work. Machine-readable metrics remain in summary.json. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#101) report-eli5.md restates the pilot's findings without computer-science jargon for a non-technical AI user — a "handyman + building manual" analogy, the four paperwork conditions, the visible-vs-hidden distinction, the headline (stale docs break every model 100% of the time and a smarter model doesn't help, but flagging the rot fixes it), and the learnings/next-steps. Points to report.md for the numbers and methodology. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Section 4.2's prompt-structure template put triple-backtick fences inside a triple-backtick block, which mis-pairs: the ## lines rendered as headings, the <...> placeholders were eaten as HTML, and the final fence opened an unclosed block swallowing the rest of the doc. Wrap the template in a four-backtick fence so the inner fences are literal. report-eli5.md verified clean (no fences); tables in both checked for consistent columns. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
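The fence fix described above follows a general rule: an outer fence has to be longer than any run of backticks it contains. A small sketch of that rule, assuming a hypothetical helper `wrap_in_fence` (not part of the repository), could look like this:

```python
import re


def wrap_in_fence(text: str, info: str = "") -> str:
    """Wrap text in a backtick fence longer than any backtick run inside it."""
    # Longest run of consecutive backticks anywhere in the text (0 if none).
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    # A fence needs at least 3 backticks and must out-length any inner run.
    fence = "`" * max(3, longest + 1)
    return f"{fence}{info}\n{text}\n{fence}"


# Build the inner fence programmatically so this example nests cleanly itself.
inner = "`" * 3
template = f"## System\n{inner}\n<system prompt>\n{inner}"
wrapped = wrap_in_fence(template, "text")
print(wrapped.splitlines()[0])
```

Counting every backtick run, not only those at the start of a line, is stricter than necessary but keeps the helper simple and safe.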
…101) Per review, the plain-English companion now calls the assistant a "model" (or "agent"), matching how the technical report names haiku/sonnet/opus, with a one-line gloss up top so a non-technical reader isn't lost. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
#101) The ELI5 version read as condescending. Rename report-eli5.md -> report-summary.md and rewrite in a clear, plain register for a smart non-technical reader: drop the extended handyman metaphor and cutesy section titles, keep the substance, define "tokens" inline, avoid statistics vocabulary. Still uses "model"/"agent", not "AI". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Decision-maker framing of Surface's value, kept honest by separating measured from assumed. Token savings from keeping docs fresh are real but small (~$0.30–$1.60 per 1,000 tasks, and a floor since this is single-shot); in the hidden-code case stale docs are even slightly *cheaper* on tokens, so token accounting understates the value. The dominant term is avoided wrong work: without Surface the model shipped a wrong change on 100% of tasks that relied on a drifted, unverifiable dependency vs ~0% with it — expressed as a transparent formula (task volume × exposure share × failure-rate drop × remediation cost) with a clearly-labelled illustrative example. report.md gets a full §7 (sections renumbered); report-summary.md gets a plain-language "What it's worth". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
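The formula named in this commit (task volume × exposure share × failure-rate drop × remediation cost) can be sketched as a single function. Every input below except the failure-rate drop is a made-up illustrative value, not a number taken from report.md; the function name is also hypothetical.

```python
def avoided_rework_cost(task_volume, exposure_share, failure_rate_drop, remediation_cost):
    """Expected cost of wrong work avoided, per the four-factor formula."""
    return task_volume * exposure_share * failure_rate_drop * remediation_cost


# Illustrative assumptions only (not measured values from the pilot):
task_volume = 1000        # tasks in the period being costed
exposure_share = 0.05     # share of tasks relying on a drifted, hidden dependency
remediation_cost = 50.0   # dollars to find and undo one wrong change

# From the pilot: 100% wrong without Surface vs ~0% with it.
failure_rate_drop = 1.0

value = avoided_rework_cost(task_volume, exposure_share, failure_rate_drop, remediation_cost)
print(f"avoided wrong work per {task_volume} tasks: ${value:,.2f}")
```

Under these assumed inputs the avoided-rework term is several orders of magnitude above the measured token savings of roughly $0.30–$1.60 per 1,000 tasks, which is the commit's point about token accounting understating the value.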
…101) The old plots were one bar chart per model averaged across all scenarios, which blended the two families and buried the headline (cascade's 0%->100% diluted under the comprehension ceiling). Rewrite maybe_plot to be standalone-readable: split by family (cascade on success, comprehension on output tokens), plain-English condition labels (No docs / Stale docs / Fresh docs (Surface) / Stale + Surface report), value annotations, capability-ordered models, and self-explanatory titles. Emits overview.png (two-panel summary) and cascade_success.png (headline chart); benefits every future run too. Regenerate the snapshot figures, embed overview.png in report.md and report-summary.md so the reports are visual, update file references, and drop the old success_<model>.png charts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Commits the first full Surface agent-impact pilot as a reproducible snapshot, and hardens the runner against the hang that bit us mid-run. Closes #101 (and finishes milestone "Empirical validation of Surface").
The result
All 11 scenarios × C0–C3 × N=10 across haiku / sonnet / opus — 1320 calls, 0 errors, $13.98.
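The call count follows directly from the matrix shape: 11 scenarios × 4 conditions × 3 models × 10 trials = 1320, or 120 rows per scenario. A minimal sketch of that arithmetic, with placeholder scenario names and hypothetical row field names (the real raw.jsonl schema is not shown here):

```python
from collections import Counter
from itertools import product

SCENARIOS = [f"s{i:02d}" for i in range(1, 12)]  # 11 placeholder scenario names
CONDITIONS = ["C0", "C1", "C2", "C3"]
MODELS = ["haiku", "sonnet", "opus"]
N = 10  # trials per (scenario, condition, model) cell


def expected_cells():
    """Enumerate every call the full matrix should contain."""
    return [
        {"scenario": s, "condition": c, "model": m, "trial": t}
        for s, c, m, t in product(SCENARIOS, CONDITIONS, MODELS, range(N))
    ]


def rows_per_scenario(rows):
    """Count rows by scenario, e.g. to confirm a merged run is complete."""
    return Counter(row["scenario"] for row in rows)


cells = expected_cells()
counts = rows_per_scenario(cells)
print(len(cells), set(counts.values()))
```

The same per-scenario count could be applied to rows loaded from a results file to confirm that a merge of two partial runs left no scenario short or duplicated.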
Cascade family (the drifted dependency is hidden; the agent knows it only through the doc): a stale doc yields 0% success / 100% misled on every model, a fresh doc yields 100%, and the surf report recovers the loss (full on sonnet/opus, 90% on haiku).
Comprehension family (code visible): success ceilings near 100%, but a stale doc still costs +57–107 extra output tokens per model, the wasted-token tax of rot you can see.
Two axes of damage: hidden dependency → correctness collapse; visible dependency → silent token tax. Full CIs in report.md / summary.json.
What's in the snapshot
bench/results/2026-06-13-pilot-full-matrix/ (force-added past .gitignore): raw.jsonl (1320 rows) · summary.json · report.md · success_{haiku,sonnet,opus}.png · run.json · PROVENANCE.md.
Harness fix (same PR)
The original run hung on a single API request that never returned: the client had no wall-clock timeout, so one bad request stalled the whole matrix. Added a 120s per-request timeout + retries; a hung call now fails over to an error row and the run continues. The completed rows were preserved and the unfinished scenarios re-run under the timeout; see PROVENANCE.md for the honest two-run accounting (same prompts/grading, clean merge, every scenario exactly 120 rows).
Verification
cargo fmt --all --check clean · clippy -D warnings clean · cargo test --all all suites ok (bench is outside the Rust workspace; run for cleanliness).
Closes #101
🤖 Generated with Claude Code