chore(bench): author the full seed suite + uv harness env (#99) #112
Merged
Add five new bench scenarios so the suite now spans seven across the drift
archetypes `surf check` fires on, with both QA and code-edit variants and a
TypeScript polyglot story:
  ratelimit-window-code  comparison flip `<` -> `<=` (admits N+1)     code, py
  access-invert-qa       allow-list inverted into a block-list        qa, py
  retry-budget-code      changed constant (attempt cap 3 -> 5)        code, py
  pagination-ts-code     changed constant (page size 50 -> 25)        code, ts
  dropped-await-qa       dropped `await` -> fire-and-forget write     qa, py
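To make the "admits N+1" drift concrete, here is a minimal sketch of a comparison flip of this kind. It is not the scenario's actual code: `admits_strict`, `admits_inclusive`, and `admitted` are hypothetical names chosen for illustration only.

```python
from typing import Callable


def admits_strict(count_in_window: int, limit: int) -> bool:
    """Documented behavior (assumed): admit while fewer than `limit` were counted."""
    return count_in_window < limit


def admits_inclusive(count_in_window: int, limit: int) -> bool:
    """Drifted behavior: `<` flipped to `<=`, so one extra request gets through."""
    return count_in_window <= limit


def admitted(check: Callable[[int, int], bool], limit: int, attempts: int) -> int:
    """Count how many of `attempts` requests in one window pass `check`."""
    count = 0
    for _ in range(attempts):
        if check(count, limit):
            count += 1
    return count


print(admitted(admits_strict, limit=3, attempts=10))     # 3
print(admitted(admits_inclusive, limit=3, attempts=10))  # 4, i.e. N+1
```

A doc that still describes the strict check would mislead a reader about the window's real capacity, which is the kind of divergence these scenarios are built around.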
Each is sealed by the real `surf` binary via tools/author.py, so the stale hub
genuinely diverges and surf_report.json is authentic `surf check` output — the
C3 context the agent sees is not hand-mocked. Every grader was checked to map a
simulated correct answer to ok and a misled (stale-doc) answer to misled.
Also harden how the harness runs, so the pilot env is unambiguous:
* grade_code runs `python`/`python3` grader commands under sys.executable, so
the hidden checks inherit the harness interpreter (uv venv, plain venv, or
system) instead of gambling on which `python3` is on PATH.
* Adopt uv for the harness env with a committed uv.lock (reproducible spend),
and document that node>=22.18 is required for the TypeScript scenario's
`node --test` type-stripping (no npm install / tsc step).
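The interpreter pinning described in the first bullet could look roughly like the sketch below. This is an assumption about the approach, not the real `grade_code` implementation; `resolve_grader_command` is a hypothetical helper name.

```python
import shlex
import sys


def resolve_grader_command(command: str) -> list[str]:
    """Split a grader command and pin a leading python/python3 to this interpreter.

    Hypothetical helper: other commands (for example `node --test`) pass
    through unchanged and are still resolved via PATH.
    """
    argv = shlex.split(command)
    if argv and argv[0] in ("python", "python3"):
        # Reuse the interpreter running the harness (uv venv, plain venv, or system).
        argv[0] = sys.executable
    return argv


print(resolve_grader_command("python3 -m pytest -q hidden_tests"))
print(resolve_grader_command("node --test"))
```

The resulting list can be passed straight to `subprocess.run`, so the hidden checks see the same installed packages as the harness itself.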
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Authors the full bench seed suite for milestone 11 (#99) and pins the harness env so the pilot (#100) runs unambiguously.
What changed
Five new scenarios (the suite is now 7, spanning the drift archetypes `surf check` fires on, in both QA and code-edit form, across Python and TypeScript):

* `ratelimit-window-code`: comparison flip `<` → `<=` (admits N+1); code, py
* `access-invert-qa`: allow-list inverted into a block-list; qa, py
* `retry-budget-code`: changed constant (attempt cap 3 → 5); code, py
* `pagination-ts-code`: changed constant (page size 50 → 25); code, ts
* `dropped-await-qa`: dropped `await` → fire-and-forget write; qa, py

Each is sealed by the real `surf` binary via `tools/author.py`, so the stale hub genuinely diverges and `surf_report.json` is authentic `surf check --format json` output: the C3 context the agent sees is not hand-mocked. Each is authored so the correct answer is non-obvious from a quick code read (a flipped operator, an inverted membership test, a changed literal, a missing `await`); otherwise the doc would carry no weight.

Harness hardening so the pilot env is unambiguous:

* `grade_code` runs `python`/`python3` grader commands under `sys.executable`, so the hidden checks inherit the harness interpreter (uv venv / plain venv / system) instead of gambling on which `python3` is on `PATH`.
* Adopt uv for the harness env with a committed `uv.lock` (reproducible spend), and document that Node ≥ 22.18 is required for the TypeScript scenario's `node --test` type-stripping (no `npm install` / `tsc` step).

Verification
Bench is outside the Rust workspace, so a Python/scenario change can't affect the Rust gates — ran them anyway to prove the tree is clean:
* `cargo fmt --all --check`: clean
* `cargo clippy --all-targets --all-features -- -D warnings`: clean (no warnings)
* `cargo test --all`: 34 passed, 0 failed

Bench gates:

* `uv run python tools/author.py --all`: all 7 scenarios seal; each stale hub genuinely diverges (`author.py` fails otherwise)
* `uv run python -m surface_bench.run --models mock`: full offline pipeline runs all 280 cells (7×4×10), no crash
* Every grader: a simulated correct answer grades ok/not-misled and a simulated misled (stale-doc) answer grades not-ok/misled; verified for both the `python3` and `node --test` graders under `uv run`.

Closes #99
🤖 Generated with Claude Code