feat(bench): cascade scenario family + de-bias the harness (#100) #114
Merged
The first pilot smoke test (haiku×2, recorded in #113) hit a success-rate ceiling: under stale docs the model was 14/14 correct and 0% misled, so the Surface effect was flat zero. Two causes: the system prompt told the model "the source code is the ground truth" (i.e. ignore the docs), and the single-file "comprehension" framing let the model just read the drifted function. Neither reflects real context rot.

De-bias the harness:

* Neutralize the system prompt: declare no precedence between docs and code.
* Add hidden-dependency support: meta.toml `hidden_paths` lists code/ files that stay present for grading (so surf seals a real divergence and the grader runs against them) but are withheld from the prompt. prompts.py skips them when rendering the codebase; nothing else changes.

Add a "cascade" scenario family modelling real rot: the agent edits a visible function whose correctness depends on a hidden dependency it knows only through that dependency's doc. A stale doc propagates into a wrong cascaded edit; in C3 the surf report's new_code is the only window onto the truth. Graders derive the expected value from the real hidden dependency, so the test stays honest.

  cascade-quota-batcher-code    hidden limiter capacity (<,<=)       code, py
  cascade-retry-budget-code     hidden retry attempt cap (3->5)      code, py
  cascade-access-policy-code    hidden allow-list -> block-list      code, py
  cascade-page-size-ts-code     hidden default page size (50->25)    code, ts

Validated on haiku×2 (~$0.50): on every cascade, C2 (fresh) and C3 (surf report) succeed 2/2 while C0/C1 fail. H1 and H3 demonstrated, sanity gate passed. The 7 comprehension scenarios remain as a secondary sub-suite. Full pilot (N>=10) is the next step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop the "you are an expert software engineer assisting a teammate" persona: it's artificial and primes diligent, skeptical behaviour, biasing against the stale-doc effect we're measuring. Single-shot prompting like this best mirrors how people actually use Claude (paste/tag some files, maybe a doc, ask for the change), so the system prompt is now a thin, neutral, persona-free frame with no docs-vs-code precedence.

Re-smoked the cascades on haiku×2 under the new prompt: sanity gate still passes (C2/C3 succeed, C0/C1 fail on every scenario).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fixes the pilot ceiling found in the first smoke test (#113) and adds the realistic "cascade" scenario family that is the bench's intended headline.
Why
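The first smoke run (haiku×2, #113) showed a success-rate ceiling: under stale docs the model was 14/14 correct, 0% misled, so the Surface effect was flat zero. Two causes:

* The system prompt told the model "the source code is the ground truth" (i.e. ignore the docs).
* The single-file "comprehension" framing let the model just read the drifted function.

Neither reflects real context rot.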
What changed
Harness (small, backward-compatible):
* Neutralized system prompt: it declares no precedence between docs and code.
* `meta.toml` `hidden_paths` lists `code/` files that stay present for grading (so `surf` seals a real divergence and the grader runs against them) but are withheld from the prompt. `prompts._render_code` skips them; `run.py` / `metrics.py` / `grade_code.py` unchanged.

New "cascade" family: the agent edits a visible function whose correctness depends on a hidden dependency it knows only through that dependency's doc. A stale doc propagates into a wrong cascaded edit; in C3 the `surf` report's `new_code` is the only window onto the truth. Graders derive the expected value from the real hidden dependency, so the test stays honest.

| Scenario | Hidden dependency change | Kind, language |
| --- | --- | --- |
| `cascade-quota-batcher-code` | limiter capacity (`<` → `<=`) | code, py |
| `cascade-retry-budget-code` | retry attempt cap (3 → 5) | code, py |
| `cascade-access-policy-code` | allow-list → block-list | code, py |
| `cascade-page-size-ts-code` | default page size (50 → 25) | code, ts |

The 7 comprehension scenarios from #99 remain as a secondary sub-suite (still evidence for H2).
Verification
Rust gates (bench is outside the workspace, so unaffected — run for cleanliness):
* `cargo fmt --all --check`: clean
* `cargo clippy --all-targets --all-features -- -D warnings`: clean
* `cargo test --all`: all suites pass (72/10/38/25/34, 0 failed)

Bench:

* `uv run python tools/author.py --all`: all 11 scenarios seal; each stale hub genuinely diverges.
* `uv run python -m surface_bench.run --models mock`: full offline pipeline runs every cell, no crash; confirms `hidden_paths` rendering.
* Grader check: a correct answer grades ok/not-misled and a simulated stale answer grades not-ok/misled (both `python3` and `node --test` paths, under `uv run`), with the dependency confirmed absent from the prompt.

Paid smoke (haiku×2, ~$0.50): on all four cascades, C2 (fresh) and C3 (surf report) succeed 2/2 while C0/C1 fail. H1 and H3 demonstrated, sanity gate passed.
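To make the ok/misled grading and the sanity gate concrete, here is a hedged sketch using the retry-cap scenario's values (stale 3, real 5). The `Grade` type, the `grade` and `sanity_gate` functions, and the exact gate rule (every C2/C3 trial succeeds, every C0/C1 trial fails) are assumptions inferred from the wording above, not the bench's actual code. In the real graders the true value is derived from the hidden dependency rather than passed in as a constant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    ok: bool      # answer matches the value in the real hidden dependency
    misled: bool  # answer is wrong and matches the value the stale doc still states


def grade(answer: int, true_value: int, stale_value: int) -> Grade:
    """Classify one trial against the true and stale values."""
    ok = answer == true_value
    return Grade(ok=ok, misled=(not ok) and answer == stale_value)


def sanity_gate(results: dict[str, list[Grade]]) -> bool:
    """Assumed gate: all C2/C3 trials succeed and all C0/C1 trials fail."""
    informed = all(g.ok for cond in ("C2", "C3") for g in results[cond])
    uninformed = all(not g.ok for cond in ("C0", "C1") for g in results[cond])
    return informed and uninformed


TRUE_CAP, STALE_CAP = 5, 3  # retry attempt cap: the doc still says 3, the code says 5

# Simulated answers for two trials per condition (illustrative, not measured data).
results = {
    "C0": [grade(3, TRUE_CAP, STALE_CAP)] * 2,
    "C1": [grade(3, TRUE_CAP, STALE_CAP)] * 2,
    "C2": [grade(5, TRUE_CAP, STALE_CAP)] * 2,
    "C3": [grade(5, TRUE_CAP, STALE_CAP)] * 2,
}
print(sanity_gate(results))
```

Keeping `misled` separate from `not ok` distinguishes an answer that follows the stale doc from one that is wrong for some unrelated reason.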
A nice incidental: an early C3 "miss" turned out to be the harness correctly catching a leaked doc-trust instruction in my own task text. That is now fixed, and it shows the bench is sensitive to such leaks.
Next
Full pilot at `--trials 10` (user-run, billed) once this lands.

Closes #100
🤖 Generated with Claude Code