chore(bench): Surface agent-impact benchmark scaffold #111
Merged
, #98) Stand up bench/ — a Python, provider-agnostic harness that measures how much documentation *accuracy* changes agent task performance, the gap Surface exists to protect. Lives outside the Rust workspace; the core stays no-network and the bench only consumes the surf binary's output. It compares four context conditions over the same code + task (C0 code-only, C1 stale doc, C2 fresh doc, C3 stale doc + a genuine `surf check --format json` report), graded deterministically. Scenarios carry a complexity `tier` so the report shows the effect as a gradient rather than a single number. Metrics: success rate, misled rate, output-token cost, and estimated dollar spend. Includes the harness (models/prompts/runner/graders/metrics/report), the tools/author.py helper that seals hub hashes and emits the real divergence with the surf binary, and two authored scenarios: a T0 local case and a T2 security-premise case. Verified end-to-end offline via a mock model. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
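The metrics named above (success rate, misled rate, output-token cost, reported per complexity tier) could be aggregated roughly as follows. This is a minimal sketch, not the harness's actual code: `RunResult`, its field names, and `summarize` are hypothetical, and the meaning of "misled" (the answer followed the stale doc's claim) is an assumption. The demo rows are illustrative placeholders, not benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    # Hypothetical record shape; the real harness schema may differ.
    condition: str      # "C0", "C1", "C2", or "C3"
    tier: str           # scenario complexity tier, e.g. "T0" or "T2"
    success: bool       # the deterministic grader accepted the answer
    misled: bool        # assumed meaning: the answer followed the stale doc
    output_tokens: int  # tokens the model produced for this run


def summarize(results):
    """Group runs by (tier, condition) and compute per-cell rates."""
    cells = defaultdict(list)
    for r in results:
        cells[(r.tier, r.condition)].append(r)

    summary = {}
    for key, runs in sorted(cells.items()):
        n = len(runs)
        summary[key] = {
            "n": n,
            "success_rate": sum(r.success for r in runs) / n,
            "misled_rate": sum(r.misled for r in runs) / n,
            "mean_output_tokens": sum(r.output_tokens for r in runs) / n,
        }
    return summary


if __name__ == "__main__":
    # Illustrative placeholder rows only.
    demo = [
        RunResult("C1", "T0", success=False, misled=True, output_tokens=100),
        RunResult("C1", "T0", success=True, misled=False, output_tokens=200),
        RunResult("C2", "T0", success=True, misled=False, output_tokens=50),
    ]
    for (tier, cond), row in summarize(demo).items():
        print(tier, cond, row)
```

Keying each cell by `(tier, condition)` keeps the tier gradient visible in the report instead of collapsing every scenario into one number per condition.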
First milestone deliverable for Empirical validation of Surface (#11): a standardized, reproducible, provider-agnostic benchmark that measures how much documentation accuracy changes an agent's task performance — the gap Surface exists to protect.
Why
Surface's agent-facing pitch ("trustworthy context matters; rot is the failure mode") is currently asserted, not measured. This harness quantifies the delta between an agent working from fresh docs vs rotted docs, using drift of exactly the kind `surf check` catches.

What changed
- New top-level `bench/` (Python, outside the Rust workspace; the core stays no-network/deterministic, and the bench only consumes the `surf` binary's output).
- Four context conditions: `C0` code-only · `C1` stale doc · `C2` fresh doc · `C3` stale doc + a genuine `surf check --format json` report.
- Deterministic grading (`VERDICT:` line; code-edit via hidden tests), so the primary metric has no LLM-judge noise.
- […] `config.toml`).
- `tools/author.py` seals hub hashes and emits the genuine divergence report with the real `surf` binary; the C3 context is Surface's actual output, not a mock.
- Two authored scenarios: `refresh-single-use-qa` (T0, local) and `refresh-replay-premise-qa` (T2, a security conclusion built on a stale premise).

No Rust source touched.
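The four conditions and the `VERDICT:`-line grading can be sketched as below. This is an illustration under stated assumptions, not the code in `bench/`: the function names `build_context` and `grade_verdict`, the section headings in the prompt, the choice to grade the last `VERDICT:` line, and the case-insensitive comparison are all hypothetical.

```python
import re

CONDITIONS = ("C0", "C1", "C2", "C3")

# Matches a line of the form "VERDICT: <token>"; the exact format is assumed.
VERDICT_RE = re.compile(r"^VERDICT:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def build_context(condition, code, stale_doc, fresh_doc, surf_report_json):
    """Assemble the context for one condition (prompt layout is assumed)."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")

    parts = ["## Code\n" + code]
    if condition in ("C1", "C3"):
        parts.append("## Documentation\n" + stale_doc)
    elif condition == "C2":
        parts.append("## Documentation\n" + fresh_doc)
    if condition == "C3":
        # C3 adds the report text produced by `surf check --format json`.
        parts.append("## surf check report\n" + surf_report_json)
    return "\n\n".join(parts)


def grade_verdict(model_output, expected):
    """Deterministically grade a QA answer from its VERDICT: line."""
    matches = VERDICT_RE.findall(model_output)
    if not matches:
        return False  # no verdict line counts as a failure
    # Assumption: the last VERDICT line is the model's final answer.
    return matches[-1].lower() == expected.lower()


if __name__ == "__main__":
    ctx = build_context("C3", "fn refresh() {}", "old doc", "new doc", "{}")
    print(ctx)
    print(grade_verdict("Some reasoning.\nVERDICT: SAFE", "safe"))
```

Because the grader is a pure string check, re-running it on the same model output always gives the same score, which is what keeps LLM-judge noise out of the primary metric.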
Verification
- `cargo fmt --all --check` → OK
- `cargo clippy --all-targets --all-features -- -D warnings` → clean
- `cargo test --all` → 34 passed, 0 failed
- […] `tools/author.py` (genuine `kind: changed` divergence).

Follow-on work (full seed suite incl. a TS scenario, pilot, full run) tracked under #99–#101.
Closes #95
Closes #96
Closes #97
Closes #98
🤖 Generated with Claude Code