chore(bench): author the full seed suite + uv harness env (#99) by Connorrmcd6 · Pull Request #112 · Connorrmcd6/surface

Connorrmcd6 · 2026-06-13T14:29:52Z

Authors the full bench seed suite for milestone 11 (#99) and pins the harness env so the pilot (#100) runs unambiguously.

What changed

Five new scenarios (suite is now 7, spanning the drift archetypes surf check fires on, in both QA and code-edit form, across Python and TypeScript):

| Scenario | Lang | Type | Drift archetype |
| --- | --- | --- | --- |
| `ratelimit-window-code` | py | code | comparison flip `<`→`<=` (admits N+1) |
| `access-invert-qa` | py | QA | allow-list inverted into a block-list |
| `retry-budget-code` | py | code | changed constant (attempt cap 3→5) |
| `pagination-ts-code` | ts | code | changed constant (page size 50→25) |
| `dropped-await-qa` | py | QA | dropped `await` → fire-and-forget write |

Each is sealed by the real `surf` binary via tools/author.py, so the stale hub genuinely diverges and surf_report.json is authentic `surf check --format json` output; the C3 context the agent sees is not hand-mocked. Each scenario is authored so that the correct answer is non-obvious from a quick code read (a flipped operator, an inverted membership test, a changed literal, a missing `await`); otherwise the doc would carry no weight in the agent's answer.
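To make the first archetype concrete, here is a minimal sketch of how a `<` to `<=` flip admits one extra request per window. The `WindowLimiter` class and `admitted_in_burst` helper are hypothetical names invented for illustration; they are not the code in `ratelimit-window-code`.

```python
from collections import deque


class WindowLimiter:
    """Hypothetical sliding-window limiter, for illustration only."""

    def __init__(self, limit: int, window_s: float, inclusive: bool = False) -> None:
        self.limit = limit
        self.window_s = window_s
        self.inclusive = inclusive  # True models the drifted `<=` comparison
        self._hits = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self._hits and now - self._hits[0] >= self.window_s:
            self._hits.popleft()
        count = len(self._hits)
        # The whole drift is this one operator: `<` caps at N, `<=` caps at N+1.
        admitted = count <= self.limit if self.inclusive else count < self.limit
        if admitted:
            self._hits.append(now)
        return admitted


def admitted_in_burst(limiter: WindowLimiter, attempts: int) -> int:
    """Count how many of `attempts` same-instant requests are admitted."""
    return sum(limiter.allow(now=0.0) for _ in range(attempts))


if __name__ == "__main__":
    print(admitted_in_burst(WindowLimiter(3, 60.0), 10))
    print(admitted_in_burst(WindowLimiter(3, 60.0, inclusive=True), 10))
```

A doc that still says "at most N per window" reads as plausible next to either version, which is why a quick code read does not settle the answer.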

Harness hardening so the pilot env is unambiguous:

* grade_code runs `python`/`python3` grader commands under sys.executable, so the hidden checks inherit the harness interpreter (uv venv / plain venv / system) instead of gambling on which `python3` is on PATH.
* Adopt uv for the harness env with a committed uv.lock (reproducible spend), and document that Node ≥ 22.18 is required for the TypeScript scenario's `node --test` type-stripping (no npm install / tsc step).
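The interpreter pinning described in the first item could look roughly like the sketch below. This is an assumption about the approach, not the actual `grade_code` implementation; `pin_interpreter` and `run_grader` are hypothetical names.

```python
import shlex
import subprocess
import sys
from typing import List, Optional

# Command names that should resolve to the harness's own interpreter.
PYTHON_ALIASES = {"python", "python3"}


def pin_interpreter(command: str) -> List[str]:
    """Split a grader command and swap a leading python alias for sys.executable."""
    argv = shlex.split(command)
    if argv and argv[0] in PYTHON_ALIASES:
        argv[0] = sys.executable
    return argv


def run_grader(command: str, cwd: Optional[str] = None) -> bool:
    """Run the pinned command; a zero exit status counts as a pass (assumed convention)."""
    result = subprocess.run(
        pin_interpreter(command), cwd=cwd, capture_output=True, text=True
    )
    return result.returncode == 0
```

Commands that do not start with a Python alias, such as `node --test`, pass through unchanged, so they still depend on whichever `node` is on PATH.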

Verification

Bench is outside the Rust workspace, so a Python/scenario change can't affect the Rust gates — ran them anyway to prove the tree is clean:

* `cargo fmt --all --check`: clean
* `cargo clippy --all-targets --all-features -- -D warnings`: clean (no warnings)
* `cargo test --all`: 34 passed, 0 failed

Bench gates:

* `uv run python tools/author.py --all`: all 7 scenarios seal; each stale hub genuinely diverges (author.py fails otherwise)
* `uv run python -m surface_bench.run --models mock`: full offline pipeline runs all 280 cells (7×4×10), no crash
* For every new scenario, a simulated correct answer grades ok/not-misled and a simulated misled (stale-doc) answer grades not-ok/misled; verified for both the `python3` and `node --test` graders under `uv run`.
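The 280-cell figure is the product 7×4×10. The PR only states that the first factor is the 7 scenarios; the sketch below assumes the 4 is a set of context conditions (labelled C0 to C3 here, by analogy with the "C3 context" mentioned above) and the 10 is a repetition count. Both readings, and all names, are assumptions for illustration.

```python
from itertools import product
from typing import List, Tuple

N_SCENARIOS = 7   # stated in the PR
N_CONDITIONS = 4  # assumption: context conditions C0..C3
N_REPEATS = 10    # assumption: repetitions per (scenario, condition) pair


def enumerate_cells() -> List[Tuple[int, str, int]]:
    """Return every (scenario index, condition label, repeat index) cell."""
    conditions = [f"C{i}" for i in range(N_CONDITIONS)]
    return list(product(range(N_SCENARIOS), conditions, range(N_REPEATS)))


if __name__ == "__main__":
    cells = enumerate_cells()
    print(len(cells))  # 7 * 4 * 10
```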

Closes #99

🤖 Generated with Claude Code

Add five new bench scenarios so the suite now spans seven across the drift archetypes `surf check` fires on, with both QA and code-edit variants and a TypeScript polyglot story:

    ratelimit-window-code  comparison flip `<` -> `<=` (admits N+1)   code, py
    access-invert-qa       allow-list inverted into a block-list      qa, py
    retry-budget-code      changed constant (attempt cap 3 -> 5)      code, py
    pagination-ts-code     changed constant (page size 50 -> 25)      code, ts
    dropped-await-qa       dropped `await` -> fire-and-forget write   qa, py

Each is sealed by the real `surf` binary via tools/author.py, so the stale hub genuinely diverges and surf_report.json is authentic `surf check` output; the C3 context the agent sees is not hand-mocked. Every grader was checked to map a simulated correct answer to ok and a misled (stale-doc) answer to misled.

Also harden how the harness runs, so the pilot env is unambiguous:

* grade_code runs `python`/`python3` grader commands under sys.executable, so the hidden checks inherit the harness interpreter (uv venv, plain venv, or system) instead of gambling on which `python3` is on PATH.
* Adopt uv for the harness env with a committed uv.lock (reproducible spend), and document that node>=22.18 is required for the TypeScript scenario's `node --test` type-stripping (no npm install / tsc step).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Connorrmcd6 merged commit b222984 into main Jun 13, 2026
5 checks passed

Connorrmcd6 merged 1 commit into main from chore/99-seed-suite
