chore(bench): author the full seed suite + uv harness env (#99) #112

Merged
Connorrmcd6 merged 1 commit into main from chore/99-seed-suite on Jun 13, 2026

@Connorrmcd6

Authors the full bench seed suite for milestone 11 (#99) and pins the harness env so the pilot (#100) runs unambiguously.

What changed

Five new scenarios (suite is now 7, spanning the drift archetypes surf check fires on, in both QA and code-edit form, across Python and TypeScript):

Scenario               Lang  Type  Drift archetype
ratelimit-window-code  py    code  comparison flip < → <= (admits N+1)
access-invert-qa       py    QA    allow-list inverted into a block-list
retry-budget-code      py    code  changed constant (attempt cap 3→5)
pagination-ts-code     ts    code  changed constant (page size 50→25)
dropped-await-qa       py    QA    dropped await → fire-and-forget write
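
To make the ratelimit-window-code archetype concrete, here is a minimal sketch of how flipping `<` to `<=` admits one request too many per window. The function names and the `flipped` switch are hypothetical and are not taken from the scenario's code.

```python
def admits(count_in_window: int, limit: int, *, flipped: bool = False) -> bool:
    """Return True if one more request may be admitted.

    count_in_window is the number of requests already admitted in the
    current window. Hypothetical helper, not the scenario's real code.
    """
    if flipped:
        return count_in_window <= limit  # drifted comparison: admits request N+1
    return count_in_window < limit       # documented comparison: at most N


def admitted_total(limit: int, attempts: int, *, flipped: bool = False) -> int:
    """Count how many of `attempts` back-to-back requests get through."""
    admitted = 0
    for _ in range(attempts):
        if admits(admitted, limit, flipped=flipped):
            admitted += 1
    return admitted
```

With a limit of 3 and ten attempts, the documented comparison admits 3 requests and the flipped one admits 4, which is the kind of difference a quick read of the code can miss.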

Each is sealed by the real surf binary via tools/author.py, so the stale hub genuinely diverges and surf_report.json is authentic surf check --format json output; the C3 context the agent sees is not hand-mocked. Each scenario is authored so that the correct answer is non-obvious from a quick code read (a flipped operator, an inverted membership test, a changed literal, a missing await); otherwise the doc would carry no weight.
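
As an illustration of why a missing await is easy to overlook, the sketch below contrasts an awaited write with a fire-and-forget one. It assumes the drifted code schedules the write as an asyncio task; the `save` coroutine and the dict-backed store are hypothetical stand-ins, not the dropped-await-qa scenario's code.

```python
import asyncio


async def save(store: dict, key: str, value: str) -> None:
    """Pretend write that yields to the event loop once before committing."""
    await asyncio.sleep(0)
    store[key] = value


async def write_awaited(store: dict) -> bool:
    await save(store, "k", "v")  # the write completes before we continue
    return "k" in store


async def write_fire_and_forget(store: dict) -> bool:
    task = asyncio.create_task(save(store, "k", "v"))  # scheduled, not awaited
    visible = "k" in store  # checked before the task has had a chance to run
    await task              # cleanup only, so the sketch exits without warnings
    return visible
```

Under these assumptions the awaited version sees the write immediately, while the fire-and-forget version returns before the write is visible, even though the two call sites differ by a single keyword.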

Harness hardening so the pilot env is unambiguous:

  • grade_code runs python/python3 grader commands under sys.executable, so the hidden checks inherit the harness interpreter (uv venv / plain venv / system) instead of gambling on which python3 is on PATH.
  • Adopt uv for the harness env with a committed uv.lock (reproducible spend), and document that Node ≥ 22.18 is required for the TypeScript scenario's node --test type-stripping (no npm install / tsc step).
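
The interpreter pinning described in the first bullet could look roughly like the sketch below. This is an assumption about the shape of the change, not the actual grade_code implementation; `resolve_grader_argv` and `run_grader` are hypothetical names.

```python
import shlex
import subprocess
import sys
from typing import List, Optional


def resolve_grader_argv(command: str) -> List[str]:
    """Swap a leading python/python3 for the interpreter running the harness."""
    argv = shlex.split(command)
    if argv and argv[0] in ("python", "python3"):
        argv[0] = sys.executable  # same interpreter as the harness itself
    return argv


def run_grader(command: str, cwd: Optional[str] = None) -> subprocess.CompletedProcess:
    """Run a grader command and capture its output (hypothetical helper)."""
    return subprocess.run(
        resolve_grader_argv(command),
        cwd=cwd,
        capture_output=True,
        text=True,
        check=False,  # a failing grader is a result, not a harness error
    )
```

Commands that do not start with `python` or `python3`, such as `node --test`, pass through unchanged in this sketch, so only the Python graders are tied to the harness interpreter.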

Verification

Bench is outside the Rust workspace, so a Python/scenario change can't affect the Rust gates — ran them anyway to prove the tree is clean:

  • cargo fmt --all --check — clean
  • cargo clippy --all-targets --all-features -- -D warnings — clean (no warnings)
  • cargo test --all: 34 passed, 0 failed

Bench gates:

  • uv run python tools/author.py --all — all 7 scenarios seal; each stale hub genuinely diverges (author.py fails otherwise)
  • uv run python -m surface_bench.run --models mock — full offline pipeline runs all 280 cells (7×4×10), no crash
  • For every new scenario, a simulated correct answer grades ok/not-misled and a simulated misled (stale-doc) answer grades not-ok/misled — verified for both the python3 and node --test graders under uv run.
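
The ok/misled check in the last bullet can be pictured with a small sketch based on the retry-budget-code constant (attempt cap 3→5). It assumes the code now uses 5 while the stale doc still says 3; the `Verdict` type and `grade_attempt_cap` function are hypothetical, not the bench's real grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    ok: bool      # answer matches the current code
    misled: bool  # answer matches the stale doc instead


def grade_attempt_cap(answer: int, *, current: int = 5, stale: int = 3) -> Verdict:
    """Grade an answer about the retry attempt cap.

    current is the value assumed to be in the code; stale is the value the
    drifted doc is assumed to still claim.
    """
    return Verdict(ok=answer == current, misled=answer == stale)
```

A simulated correct answer of 5 grades as ok and not misled, a stale-doc answer of 3 grades as not ok and misled, and any other value is simply wrong without being attributed to the doc.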

Closes #99

🤖 Generated with Claude Code

Add five new bench scenarios so the suite now spans seven across the drift
archetypes `surf check` fires on, with both QA and code-edit variants and a
TypeScript polyglot story:

  ratelimit-window-code  comparison flip `<` -> `<=` (admits N+1)        code, py
  access-invert-qa       allow-list inverted into a block-list           qa,   py
  retry-budget-code      changed constant (attempt cap 3 -> 5)           code, py
  pagination-ts-code     changed constant (page size 50 -> 25)           code, ts
  dropped-await-qa       dropped `await` -> fire-and-forget write        qa,   py

Each is sealed by the real `surf` binary via tools/author.py, so the stale hub
genuinely diverges and surf_report.json is authentic `surf check` output — the
C3 context the agent sees is not hand-mocked. Every grader was checked to map a
simulated correct answer to ok and a misled (stale-doc) answer to misled.

Also harden how the harness runs, so the pilot env is unambiguous:

  * grade_code runs `python`/`python3` grader commands under sys.executable, so
    the hidden checks inherit the harness interpreter (uv venv, plain venv, or
    system) instead of gambling on which `python3` is on PATH.
  * Adopt uv for the harness env with a committed uv.lock (reproducible spend),
    and document that node>=22.18 is required for the TypeScript scenario's
    `node --test` type-stripping (no npm install / tsc step).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Connorrmcd6 merged commit b222984 into main on Jun 13, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

Author full seed suite (incl. one TypeScript scenario)