feat(bench): cascade scenario family + de-bias the harness (#100) by Connorrmcd6 · Pull Request #114 · Connorrmcd6/surface

Connorrmcd6 · 2026-06-13T15:29:59Z

Fixes the pilot ceiling found in the first smoke test (#113) and adds the realistic "cascade" scenario family that is the bench's intended headline.

Why

The first smoke run (haiku×2, #113) showed a success-rate ceiling: under stale docs the model was 14/14 correct, 0% misled, so the Surface effect was flat zero. Two causes:

The system prompt told the model "the source code is the ground truth" — i.e. ignore the docs, exactly the behaviour that nullifies a stale-doc effect.
The single-file comprehension framing (drifted code + contradicting doc side-by-side) lets the model just read the one function. That isn't how real context rot bites.

What changed

Harness (small, backward-compatible):

Neutralized the system prompt — declares no precedence between docs and code.
Hidden-dependency support — meta.toml hidden_paths lists code/ files that stay present for grading (so surf seals a real divergence and the grader runs against them) but are withheld from the prompt. prompts._render_code skips them; run.py/metrics.py/grade_code.py unchanged.

New "cascade" family — the agent edits a visible function whose correctness depends on a hidden dependency it knows only through that dependency's doc. A stale doc propagates into a wrong cascaded edit; in C3 the surf report's new_code is the only window onto the truth. Graders derive the expected value from the real hidden dependency, so the test stays honest.

| Scenario | Lang | Hidden dependency drift |
| --- | --- | --- |
| `cascade-quota-batcher-code` | py | limiter capacity (`<`→`<=`) |
| `cascade-retry-budget-code` | py | retry attempt cap (3→5) |
| `cascade-access-policy-code` | py | allow-list → block-list |
| `cascade-page-size-ts-code` | ts | default page size (50→25) |
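To make the cascade idea concrete, here is a minimal sketch loosely modelled on the quota-batcher drift (`<` to `<=`). Every name in it (`admits`, `real_capacity`, the two candidate edits, `grade`) is hypothetical and the real scenario files may look quite different. The point it tries to show is that the grader probes the real hidden dependency for the expected value instead of hard-coding it, and separately flags an answer that matches the stale doc.

```python
def admits(in_flight: int, limit: int) -> bool:
    """Hypothetical hidden dependency, after the drift from `<` to `<=`."""
    return in_flight <= limit


def real_capacity(limit: int) -> int:
    """Derive the capacity by probing the real dependency."""
    n = 0
    while admits(n + 1, limit):
        n += 1
    return n


def batch_size_from_stale_doc(limit: int) -> int:
    """Candidate edit that trusts a doc still describing the strict `<` check."""
    return limit - 1


def batch_size_from_current_code(limit: int) -> int:
    """Candidate edit that reflects the current `<=` behaviour."""
    return limit


def grade(candidate, limit: int = 10) -> dict[str, bool]:
    """Grade a candidate: ok if it matches reality, misled if it matches the stale doc."""
    expected = real_capacity(limit)   # taken from the real hidden dependency
    stale_value = limit - 1           # what the stale doc would imply
    got = candidate(limit)
    return {"ok": got == expected, "misled": got == stale_value}


correct_result = grade(batch_size_from_current_code)
stale_result = grade(batch_size_from_stale_doc)
```

Because `expected` is computed from `admits` itself, changing the hidden dependency again would move the grader's target with it.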

The 7 comprehension scenarios from #99 remain as a secondary sub-suite (still evidence for H2).

Verification

Rust gates (bench is outside the workspace, so unaffected — run for cleanliness):

cargo fmt --all --check — clean
cargo clippy --all-targets --all-features -- -D warnings — clean
cargo test --all — all suites pass (72/10/38/25/34, 0 failed)

Bench:

uv run python tools/author.py --all — all 11 scenarios seal; each stale hub genuinely diverges.
uv run python -m surface_bench.run --models mock — full offline pipeline runs every cell, no crash; confirms hidden_paths rendering.
For each cascade: a simulated correct answer grades ok/not-misled and a simulated stale answer grades not-ok/misled (both python3 and node --test paths, under uv run), with the dependency confirmed absent from the prompt.

Paid smoke (haiku×2, ~$0.50): on all four cascades, C2 (fresh) and C3 (surf report) succeed 2/2 while C0/C1 fail — H1 and H3 demonstrated, sanity gate passed:

| | C0 | C1 | C2 | C3 |
| --- | --- | --- | --- | --- |
| ok (of 2) | 0 | 0 | 2 | 2 |
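The pattern in this table can be checked mechanically. The sketch below assumes the sanity gate is the strict rule reported here (every C2 and C3 run succeeds, every C0 and C1 run fails); the bench's actual gate criterion is not shown in this PR, and `sanity_gate` is a hypothetical name.

```python
def sanity_gate(ok_counts: dict[str, int], runs: int) -> bool:
    """Return True when C2/C3 succeed on every run and C0/C1 fail on every run."""
    return (
        ok_counts["C2"] == runs
        and ok_counts["C3"] == runs
        and ok_counts["C0"] == 0
        and ok_counts["C1"] == 0
    )


# Counts from the table above (2 runs per condition).
cascade_counts = {"C0": 0, "C1": 0, "C2": 2, "C3": 2}
gate_passed = sanity_gate(cascade_counts, runs=2)
```

A ceiling like the one in the first smoke run, where every condition succeeds, would fail this check.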

A nice incidental: an early C3 "miss" turned out to be the harness correctly catching a leaked doc-trust instruction in my own task text — fixed, and it proves the bench is sensitive.

The first pilot smoke test (haiku×2, recorded in #113) hit a success-rate ceiling: under stale docs the model was 14/14 correct and 0% misled, so the Surface effect was flat zero. Two causes: the system prompt told the model "the source code is the ground truth" (i.e. ignore the docs), and the single-file "comprehension" framing let the model just read the drifted function. Neither reflects real context rot.

De-bias the harness:

* Neutralize the system prompt: declare no precedence between docs and code.
* Add hidden-dependency support: meta.toml `hidden_paths` lists code/ files that stay present for grading (so surf seals a real divergence and the grader runs against them) but are withheld from the prompt. prompts.py skips them when rendering the codebase; nothing else changes.

Add a "cascade" scenario family modelling real rot: the agent edits a visible function whose correctness depends on a hidden dependency it knows only through that dependency's doc. A stale doc propagates into a wrong cascaded edit; in C3 the surf report's new_code is the only window onto the truth. Graders derive the expected value from the real hidden dependency, so the test stays honest.

  cascade-quota-batcher-code   hidden limiter capacity (<,<=)        code, py
  cascade-retry-budget-code    hidden retry attempt cap (3->5)       code, py
  cascade-access-policy-code   hidden allow-list -> block-list       code, py
  cascade-page-size-ts-code    hidden default page size (50->25)     code, ts

Validated on haiku×2 (~$0.50): on every cascade C2 (fresh) and C3 (surf report) succeed 2/2 while C0/C1 fail. H1 and H3 demonstrated, sanity gate passed. The 7 comprehension scenarios remain as a secondary sub-suite. Full pilot (N>=10) is the next step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Drop the "you are an expert software engineer assisting a teammate" persona: it's artificial and primes diligent, skeptical behaviour, biasing against the stale-doc effect we're measuring. Single-shot prompting like this best mirrors how people actually use Claude (paste/tag some files, maybe a doc, ask for the change), so the system prompt is now a thin, neutral, persona-free frame with no docs-vs-code precedence.

Re-smoked the cascades on haiku×2 under the new prompt: sanity gate still passes (C2/C3 succeed, C0/C1 fail on every scenario).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Connorrmcd6 and others added 2 commits June 13, 2026 17:29

Connorrmcd6 merged commit 3b47674 into main Jun 13, 2026
5 checks passed

Connorrmcd6 merged 2 commits into main from chore/100-cascade-family
