feat(bench): cascade scenario family + de-bias the harness (#100) #114

Merged: Connorrmcd6 merged 2 commits into main from chore/100-cascade-family on Jun 13, 2026

Conversation

@Connorrmcd6

Owner

Fixes the pilot ceiling found in the first smoke test (#113) and adds the realistic "cascade" scenario family that is the bench's intended headline.

Why

The first smoke run (haiku×2, #113) showed a success-rate ceiling: under stale docs the model was 14/14 correct, 0% misled, so the Surface effect was flat zero. Two causes:

  1. The system prompt told the model "the source code is the ground truth" — i.e. ignore the docs, exactly the behaviour that nullifies a stale-doc effect.
  2. The single-file comprehension framing (drifted code + contradicting doc side-by-side) lets the model just read the one function. That isn't how real context rot bites.

What changed

Harness (small, backward-compatible):

  • Neutralized the system prompt — declares no precedence between docs and code.
  • Hidden-dependency support: meta.toml `hidden_paths` lists code/ files that stay present for grading (so surf seals a real divergence and the grader runs against them) but are withheld from the prompt. prompts._render_code skips them; run.py, metrics.py, and grade_code.py are unchanged.
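
A minimal sketch of the hidden-path filtering, assuming the scenario's code files arrive as a path-to-source mapping. `render_code`, its arguments, and the two file names are hypothetical stand-ins for illustration, not the actual `prompts._render_code` signature or scenario layout.

```python
from typing import Iterable, Mapping


def render_code(files: Mapping[str, str], hidden_paths: Iterable[str] = ()) -> str:
    """Render code files into prompt text, withholding hidden dependencies.

    Hypothetical stand-in for prompts._render_code: `files` maps a relative
    path to its source, `hidden_paths` mirrors the list in meta.toml.
    """
    hidden = set(hidden_paths)
    sections = []
    for path in sorted(files):
        if path in hidden:
            # Still on disk for sealing and grading, just not shown to the model.
            continue
        sections.append(f"### {path}\n{files[path]}")
    return "\n\n".join(sections)


# Illustrative file names and contents only.
files = {
    "code/batcher.py": "def batch(items): ...",
    "code/limiter.py": "CAPACITY = 8",
}
prompt = render_code(files, hidden_paths=["code/limiter.py"])
```

With an empty `hidden_paths` the function renders every file, which is the sense in which the change is backward-compatible for the existing scenarios.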

New "cascade" family — the agent edits a visible function whose correctness depends on a hidden dependency it knows only through that dependency's doc. A stale doc propagates into a wrong cascaded edit; in C3 the surf report's new_code is the only window onto the truth. Graders derive the expected value from the real hidden dependency, so the test stays honest.
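
To illustrate how a grader can derive the expected value from the real dependency rather than a hard-coded constant, here is a sketch loosely modelled on the retry-cap scenario. The `hidden_dep` object, the `MAX_ATTEMPTS` name, the timeout-budget task, and the `grade` function are all assumptions made for the example, not the bench's actual grader.

```python
import types

# Hypothetical hidden dependency: the real code now caps retries at 5,
# while its stale doc still says 3 (the 3→5 drift).
hidden_dep = types.SimpleNamespace(MAX_ATTEMPTS=5)
STALE_DOC_VALUE = 3


def grade(answer_fn, per_attempt_timeout: float = 2.0) -> dict:
    """Grade a candidate edit against the live dependency value."""
    got = answer_fn(per_attempt_timeout)
    # Expected value is computed from the dependency as it exists at grading time.
    expected = per_attempt_timeout * hidden_dep.MAX_ATTEMPTS
    # The value an agent would produce if it trusted the stale doc.
    stale = per_attempt_timeout * STALE_DOC_VALUE
    return {"ok": got == expected, "misled": got == stale and got != expected}


correct = grade(lambda t: t * 5)  # edit consistent with the real code
stale = grade(lambda t: t * 3)    # edit consistent with the stale doc
```

Because `expected` is read from the dependency itself, a later change to the hidden value would not require touching the grader.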

| Scenario | Lang | Hidden dependency drift |
| --- | --- | --- |
| cascade-quota-batcher-code | py | limiter capacity (<,<=) |
| cascade-retry-budget-code | py | retry attempt cap (3→5) |
| cascade-access-policy-code | py | allow-list → block-list |
| cascade-page-size-ts-code | ts | default page size (50→25) |

The 7 comprehension scenarios from #99 remain as a secondary sub-suite (still evidence for H2).

Verification

Rust gates (bench is outside the workspace, so unaffected — run for cleanliness):

  • cargo fmt --all --check — clean
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo test --all — all suites pass (72/10/38/25/34, 0 failed)

Bench:

  • uv run python tools/author.py --all — all 11 scenarios seal; each stale hub genuinely diverges.
  • uv run python -m surface_bench.run --models mock — full offline pipeline runs every cell, no crash; confirms hidden_paths rendering.
  • For each cascade: a simulated correct answer grades ok/not-misled and a simulated stale answer grades not-ok/misled (both python3 and node --test paths, under uv run), with the dependency confirmed absent from the prompt.

Paid smoke (haiku×2, ~$0.50): on all four cascades, C2 (fresh) and C3 (surf report) succeed 2/2 while C0/C1 fail — H1 and H3 demonstrated, sanity gate passed:

| | C0 | C1 | C2 | C3 |
| --- | --- | --- | --- | --- |
| ok (of 2) | 0 | 0 | 2 | 2 |
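
For reference, a per-condition row like this can be tallied from raw trial records roughly as follows. The `(condition, ok)` record shape and the `ok_counts` helper are assumptions for the sketch, not the format `metrics.py` actually uses, and the records are illustrative ones shaped like a single cascade's smoke result.

```python
CONDITIONS = ("C0", "C1", "C2", "C3")


def ok_counts(trials):
    """Count successful trials per condition; trials are (condition, ok) pairs."""
    counts = {c: 0 for c in CONDITIONS}
    for condition, ok in trials:
        counts[condition] += int(ok)
    return [counts[c] for c in CONDITIONS]


# Two trials per cell; C2 and C3 succeed, C0 and C1 fail.
trials = [(c, c in ("C2", "C3")) for c in CONDITIONS for _ in range(2)]
row = ok_counts(trials)
```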

A nice incidental: an early C3 "miss" turned out to be the harness correctly catching a leaked doc-trust instruction in my own task text — fixed, and it proves the bench is sensitive.

Next

Full pilot at --trials 10 (user-run, billed) once this lands.

Closes #100

🤖 Generated with Claude Code

Connorrmcd6 and others added 2 commits June 13, 2026 17:29
The first pilot smoke test (haiku×2, recorded in #113) hit a success-rate
ceiling: under stale docs the model was 14/14 correct and 0% misled, so the
Surface effect was flat zero. Two causes — the system prompt told the model
"the source code is the ground truth" (i.e. ignore the docs), and the
single-file "comprehension" framing let the model just read the drifted
function. Neither reflects real context rot.

De-bias the harness:
  * Neutralize the system prompt — declare no precedence between docs and code.
  * Add hidden-dependency support: meta.toml `hidden_paths` lists code/ files
    that stay present for grading (so surf seals a real divergence and the
    grader runs against them) but are withheld from the prompt. prompts.py
    skips them when rendering the codebase; nothing else changes.

Add a "cascade" scenario family modelling real rot: the agent edits a visible
function whose correctness depends on a hidden dependency it knows only through
that dependency's doc. A stale doc propagates into a wrong cascaded edit; in C3
the surf report's new_code is the only window onto the truth. Graders derive the
expected value from the real hidden dependency, so the test stays honest.

  cascade-quota-batcher-code   hidden limiter capacity (<,<=)      code, py
  cascade-retry-budget-code    hidden retry attempt cap (3->5)     code, py
  cascade-access-policy-code   hidden allow-list -> block-list     code, py
  cascade-page-size-ts-code    hidden default page size (50->25)   code, ts

Validated on haiku×2 (~$0.50): on every cascade C2 (fresh) and C3 (surf report)
succeed 2/2 while C0/C1 fail — H1 and H3 demonstrated, sanity gate passed. The
7 comprehension scenarios remain as a secondary sub-suite. Full pilot (N>=10)
is the next step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop the "you are an expert software engineer assisting a teammate" persona —
it's artificial and primes diligent, skeptical behaviour, biasing against the
stale-doc effect we're measuring. Single-shot prompting like this best mirrors
how people actually use Claude (paste/tag some files, maybe a doc, ask for the
change), so the system prompt is now a thin, neutral, persona-free frame with no
docs-vs-code precedence.

Re-smoked the cascades on haiku×2 under the new prompt: sanity gate still passes
(C2/C3 succeed, C0/C1 fail on every scenario).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Connorrmcd6 Connorrmcd6 merged commit 3b47674 into main Jun 13, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

Pilot run + calibration