feat(bench): full pilot matrix snapshot + client timeout (#101) #115

Merged: Connorrmcd6 merged 9 commits into main from chore/101-matrix-snapshot on Jun 13, 2026

@Connorrmcd6

Commits the first full Surface agent-impact pilot as a reproducible snapshot, and hardens the runner against the hang that bit us mid-run. Closes #101 (and finishes milestone "Empirical validation of Surface").

The result

All 11 scenarios × C0–C3 × N=10 across haiku / sonnet / opus — 1320 calls, 0 errors, $13.98.

Cascade family (the drifted dependency is hidden — the agent knows it only by doc):

| Model | C1 stale (succ / misled) | C2 fresh | C3 surf report | H1 (C2−C1) | H3 (C3−C1) |
| --- | --- | --- | --- | --- | --- |
| haiku | 0% / 100% | 100% | 90% | +100pp | +90pp |
| sonnet | 0% / 100% | 100% | 100% | +100pp | +100pp |
| opus | 0% / 100% | 100% | 100% | +100pp | +100pp |

  • A stale doc about code the agent can't see makes every model wrong 100% of the time — and a more capable model is not more robust (opus = haiku here). Kills the "just use a better model" objection.
  • Fresh docs → 100% on all three; the surf report recovers the loss (full on sonnet/opus, 90% on haiku).
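
The H1 and H3 columns are plain differences in success rate, expressed in percentage points. A minimal sketch of that arithmetic using the cascade numbers from the table; the dictionary layout and the `effect_pp` helper are illustrative and are not taken from the bench code:

```python
# Cascade-family success rates (%) per condition, copied from the table.
CASCADE_SUCCESS = {
    "haiku": {"C1": 0, "C2": 100, "C3": 90},
    "sonnet": {"C1": 0, "C2": 100, "C3": 100},
    "opus": {"C1": 0, "C2": 100, "C3": 100},
}


def effect_pp(rates: dict, treated: str, baseline: str = "C1") -> int:
    """Success-rate difference in percentage points (hypothetical helper)."""
    return rates[treated] - rates[baseline]


for model, rates in CASCADE_SUCCESS.items():
    h1 = effect_pp(rates, "C2")  # fresh doc vs stale doc
    h3 = effect_pp(rates, "C3")  # stale doc + surf report vs stale doc
    print(f"{model}: H1 = {h1:+d}pp, H3 = {h3:+d}pp")
```

The confidence intervals are not reproduced here; those live in report.md / summary.json.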

Comprehension family (code visible): success ceilings near 100%, but a stale doc still costs +57–107 extra output tokens per model — the wasted-token tax of rot you can see.

Two axes of damage: hidden dependency → correctness collapse; visible dependency → silent token tax. Full CIs in report.md / summary.json.

What's in the snapshot

bench/results/2026-06-13-pilot-full-matrix/ (force-added past .gitignore): raw.jsonl (1320 rows) · summary.json · report.md · success_{haiku,sonnet,opus}.png · run.json · PROVENANCE.md.

Harness fix (same PR)

The original run hung on a single API request that never returned — the client had no wall-clock timeout, so one bad request stalled the whole matrix. Added a 120s per-request timeout + retries; a hung call now fails over to an error row and the run continues. The completed rows were preserved and the unfinished scenarios re-run under the timeout — see PROVENANCE.md for the honest two-run accounting (same prompts/grading, clean merge, every scenario exactly 120 rows).
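
A rough sketch of the "fail over to an error row" behaviour described above. It deliberately does not use the Anthropic SDK: `request` stands in for a single API call that raises when its timeout elapses, and `call_or_error_row`, the row keys, and the retry count of 2 are all assumptions made for illustration rather than the harness's actual code.

```python
from typing import Callable


def call_or_error_row(
    request: Callable[[float], str],
    *,
    timeout_s: float = 120.0,  # per-request cap, matching the 120s in this PR
    retries: int = 2,          # assumed value; the PR does not state the count
) -> dict:
    """Try the request up to retries + 1 times; never raise, always return a row."""
    last_error = ""
    for attempt in range(retries + 1):
        try:
            output = request(timeout_s)
            return {"ok": True, "output": output, "attempts": attempt + 1, "error": None}
        except Exception as exc:  # timeout or transport failure
            last_error = f"{type(exc).__name__}: {exc}"
    # All attempts failed: record an error row so the matrix keeps going.
    return {"ok": False, "output": None, "attempts": retries + 1, "error": last_error}


def hung_request(timeout_s: float) -> str:
    """Stub for a call that never returns within the timeout."""
    raise TimeoutError(f"no response within {timeout_s:.0f}s")


flaky_calls = []


def flaky_request(timeout_s: float) -> str:
    """Stub that hangs once, then succeeds on the retry."""
    flaky_calls.append(timeout_s)
    if len(flaky_calls) < 2:
        raise TimeoutError("first attempt hung")
    return "done"


hung_row = call_or_error_row(hung_request)
flaky_row = call_or_error_row(flaky_request)
print(hung_row)
print(flaky_row)
```

Returning a row instead of raising is what lets a long unattended run finish: a bad request costs at most `(retries + 1) × timeout_s` of wall-clock time and shows up in the data as an error.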

Verification

  • `cargo fmt --all --check` clean · `clippy -D warnings` clean · `cargo test --all` all suites ok (bench sits outside the Rust workspace, so these were run only for cleanliness).
  • Dataset integrity: 11 scenarios × 120 rows each, 0 duplicates, 0 errors (asserted during merge).
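
The integrity claim follows from the matrix shape: 4 conditions × N=10 × 3 models = 120 rows per scenario, and 11 × 120 = 1320 in total. A minimal sketch of such a check, assuming each row of raw.jsonl carries `scenario`, `model`, `condition`, `trial`, and `error` fields; those key names and the `check_integrity` helper are hypothetical, and the rows are synthesised so the sketch runs on its own:

```python
from collections import Counter

EXPECTED_PER_SCENARIO = 4 * 10 * 3  # C0-C3 x N=10 x three models = 120


def check_integrity(rows: list, n_scenarios: int = 11) -> None:
    """Raise AssertionError if the merged rows violate the stated invariants."""
    keys = [(r["scenario"], r["model"], r["condition"], r["trial"]) for r in rows]
    assert len(keys) == len(set(keys)), "duplicate rows"
    assert all(r.get("error") is None for r in rows), "error rows present"
    per_scenario = Counter(r["scenario"] for r in rows)
    assert len(per_scenario) == n_scenarios, "wrong scenario count"
    assert set(per_scenario.values()) == {EXPECTED_PER_SCENARIO}, "uneven scenarios"


# Synthetic stand-in for raw.jsonl (field names are assumptions).
rows = [
    {"scenario": f"s{s:02d}", "model": m, "condition": c, "trial": t, "error": None}
    for s in range(11)
    for m in ("haiku", "sonnet", "opus")
    for c in ("C0", "C1", "C2", "C3")
    for t in range(10)
]

check_integrity(rows)
print(len(rows))  # 11 * 120 = 1320
```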

Closes #101

🤖 Generated with Claude Code

Connorrmcd6 and others added 9 commits June 13, 2026 20:52
Commit the first full Surface agent-impact pilot: all 11 scenarios × C0–C3 ×
N=10 across haiku/sonnet/opus (1320 calls, 0 errors, ~$13.98). Force-added past
.gitignore as a curated results/<snapshot>/ with raw.jsonl, summary.json,
report.md, per-model plots, and a PROVENANCE.md.

Headline: in the cascade family (the drifted dependency is hidden from the
agent) a stale doc yields 0% success / 100% misled on *every* model, a fresh doc
100%, and the surf report recovers to 90% (haiku) / 100% (sonnet, opus) — H1 =
+100pp flat across the capability range, i.e. a more capable model is not more
robust to rot when it cannot see the code. The comprehension family hits the
success ceiling, but a stale doc still costs +57–107 extra output tokens per model.

Also harden the Anthropic client with a 120s per-request timeout + retries. The
original matrix run hung on a single request that never returned (the SDK had no
wall-clock cap short enough for a long unattended run), stalling the whole
matrix; the completed rows were preserved and the unfinished scenarios re-run
under the timeout (see PROVENANCE.md). Prevents one bad request from taking the
run hostage again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
report.md is otherwise pure auto-generated tables; prepend a human read of the
results so the committed snapshot is self-interpreting for the blog write-up —
the two-axis story (hidden dependency -> correctness collapse, fresh docs and
the surf report both fix it; visible code -> token tax), the flat +100pp H1
across haiku/sonnet/opus (a smarter model is not more rot-resistant), and the
caveats. Marked as authored, since re-running surface_bench.report would
regenerate the tables below it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…101)

Replace the report's auto-generated tables with a complete, self-contained
report so the committed snapshot is readable end to end (and a ready base for
the blog): overview, hypotheses (H1/H2/H3), methodology (conditions, the two
scenario families, models/grading/metrics, the 11 scenarios), the exact prompts
given to agents (system prompt, user-turn structure, a worked C0-C3 example),
results (per-family per-model tables with CIs + spend), interpretation, the
learnings (the framing pivot, the hidden-dependency insight, the authoring-leak
and timeout bugs we caught), and future work. Machine-readable metrics remain in
summary.json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#101)

report-eli5.md restates the pilot's findings without computer-science jargon for
a non-technical AI user — a "handyman + building manual" analogy, the four
paperwork conditions, the visible-vs-hidden distinction, the headline (stale docs
break every model 100% of the time and a smarter model doesn't help, but flagging
the rot fixes it), and the learnings/next-steps. Points to report.md for the
numbers and methodology.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Section 4.2's prompt-structure template put triple-backtick fences inside a
triple-backtick block, which mis-pairs: the ## lines rendered as headings, the
<...> placeholders were eaten as HTML, and the final fence opened an unclosed
block swallowing the rest of the doc. Wrap the template in a four-backtick fence
so the inner fences are literal. report-eli5.md verified clean (no fences);
tables in both checked for consistent columns.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
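
The four-backtick fix in the commit above works because a fenced block is closed only by a fence at least as long as the one that opened it. A small sketch of choosing an outer fence programmatically; `fence_for` and `wrap_literal` are hypothetical helpers, not part of the report tooling, and picking a fence longer than any backtick run in the body is a conservative simplification:

```python
import re


def fence_for(body: str, minimum: int = 3) -> str:
    """Return a backtick fence longer than any backtick run inside body."""
    longest = max((len(run) for run in re.findall(r"`+", body)), default=0)
    return "`" * max(minimum, longest + 1)


def wrap_literal(body: str) -> str:
    """Wrap body so that any fences inside it render literally."""
    fence = fence_for(body)
    return f"{fence}\n{body}\n{fence}"


inner = "`" * 3  # a triple-backtick fence, built without typing one literally
template = f"## Task\n{inner}\n<code>\n{inner}"
wrapped = wrap_literal(template)
print(len(wrapped.splitlines()[0]))  # length of the outer fence
```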
…101)

Per review, the plain-English companion now calls the assistant a "model" (or
"agent"), matching how the technical report names haiku/sonnet/opus, with a
one-line gloss up top so a non-technical reader isn't lost.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
#101)

The ELI5 version read as condescending. Rename report-eli5.md ->
report-summary.md and rewrite in a clear, plain register for a smart
non-technical reader: drop the extended handyman metaphor and cutesy section
titles, keep the substance, define "tokens" inline, avoid statistics vocabulary.
Still uses "model"/"agent", not "AI".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Decision-maker framing of Surface's value, kept honest by separating measured
from assumed. Token savings from keeping docs fresh are real but small
(~$0.30–$1.60 per 1,000 tasks, and a floor since this is single-shot); in the
hidden-code case stale docs are even slightly *cheaper* on tokens, so token
accounting understates the value. The dominant term is avoided wrong work:
without Surface the model shipped a wrong change on 100% of tasks that relied on
a drifted, unverifiable dependency vs ~0% with it — expressed as a transparent
formula (task volume × exposure share × failure-rate drop × remediation cost)
with a clearly-labelled illustrative example. report.md gets a full §7 (sections
renumbered); report-summary.md gets a plain-language "What it's worth".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
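
The formula named in the commit above (task volume × exposure share × failure-rate drop × remediation cost) can be sketched as follows. Only the failure-rate drop of roughly 1.0 (100% wrong without Surface vs ~0% with it) comes from the pilot; the task volume, exposure share, and remediation cost below are invented purely to show the arithmetic and are not the report's illustrative example.

```python
def avoided_cost(
    task_volume: float,
    exposure_share: float,
    failure_rate_drop: float,
    remediation_cost: float,
) -> float:
    """Expected cost of wrong work avoided, per the four-term product."""
    return task_volume * exposure_share * failure_rate_drop * remediation_cost


example = avoided_cost(
    task_volume=1_000,      # hypothetical: tasks in the period
    exposure_share=0.05,    # hypothetical: share relying on a drifted, hidden dependency
    failure_rate_drop=1.0,  # measured in the cascade family: 100% -> ~0%
    remediation_cost=50.0,  # hypothetical: cost to catch and fix one wrong change
)
print(f"avoided cost: ${example:,.0f}")
```

Because the terms multiply, the result is only as trustworthy as the least certain input, which is why the commit separates the measured term from the assumed ones.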
…101)

The old plots were one bar chart per model averaged across all scenarios, which
blended the two families and buried the headline (cascade's 0%->100% diluted
under the comprehension ceiling). Rewrite maybe_plot to be standalone-readable:
split by family (cascade on success, comprehension on output tokens), plain-
English condition labels (No docs / Stale docs / Fresh docs (Surface) / Stale +
Surface report), value annotations, capability-ordered models, and self-
explanatory titles. Emits overview.png (two-panel summary) and cascade_success.png
(headline chart); benefits every future run too.

Regenerate the snapshot figures, embed overview.png in report.md and
report-summary.md so the reports are visual, update file references, and drop the
old success_<model>.png charts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Connorrmcd6 Connorrmcd6 merged commit 7edfeff into main Jun 13, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

Full matrix run + blog-ready snapshot