chore(bench): Surface agent-impact benchmark scaffold by Connorrmcd6 · Pull Request #111 · Connorrmcd6/surface

Connorrmcd6 · 2026-06-13T12:23:28Z

First milestone deliverable for Empirical validation of Surface (#11): a standardized, reproducible, provider-agnostic benchmark that measures how much documentation accuracy changes an agent's task performance — the gap Surface exists to protect.

Why

Surface's agent-facing pitch ("trustworthy context matters; rot is the failure mode") is currently asserted, not measured. This harness quantifies the delta between an agent working from fresh docs vs rotted docs, using drift of exactly the kind surf check catches.

What changed

New top-level bench/ (Python, outside the Rust workspace — the core stays no-network/deterministic; the bench only consumes the surf binary's output).

Four conditions, same code + task, only the doc block differs: C0 code-only · C1 stale doc · C2 fresh doc · C3 stale doc + a genuine surf check --format json report.
Complexity tiers on each scenario, so the report shows the Surface effect as a gradient (it should grow as re-deriving truth from code gets expensive) rather than one number.
Deterministic grading — QA via a structured VERDICT: line, code-edit via hidden tests — so the primary metric has no LLM-judge noise.
Metrics: success rate, misled rate (asserted the stale claim), output-token cost, and estimated dollar spend (token usage × per-model prices in config.toml).
tools/author.py seals hub hashes and emits the genuine divergence report with the real surf binary — the C3 context is Surface's actual output, not a mock.
Two authored scenarios: refresh-single-use-qa (T0, local) and refresh-replay-premise-qa (T2, a security conclusion built on a stale premise).
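The metrics listed above can be sketched as a small aggregation over graded trials. This is an illustrative sketch, not the code in bench/: the `Trial` and `Price` types, the outcome labels, and the per-million-token price units are assumptions, and the actual layout of config.toml is not shown in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One graded run (hypothetical shape, not the harness's real record)."""
    condition: str      # "C0", "C1", "C2", or "C3"
    tier: str           # complexity tier, e.g. "T0" or "T2"
    outcome: str        # assumed labels: "correct", "misled", "wrong", "unparseable"
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Price:
    """Per-model prices, assumed here to be USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def estimated_spend(trials, price):
    """Token usage multiplied by per-model prices, in USD."""
    total = sum(
        t.input_tokens * price.input_per_mtok
        + t.output_tokens * price.output_per_mtok
        for t in trials
    )
    return total / 1_000_000


def summarize(trials, price):
    """Group by (tier, condition) so the effect reads as a gradient across tiers."""
    groups = defaultdict(list)
    for t in trials:
        groups[(t.tier, t.condition)].append(t)

    summary = {}
    for key, group in sorted(groups.items()):
        n = len(group)
        summary[key] = {
            "n": n,
            "success_rate": sum(t.outcome == "correct" for t in group) / n,
            "misled_rate": sum(t.outcome == "misled" for t in group) / n,
            "mean_output_tokens": sum(t.output_tokens for t in group) / n,
            "est_spend_usd": estimated_spend(group, price),
        }
    return summary
```

Keying the summary on (tier, condition) rather than condition alone is what lets a report compare, say, C1 against C3 within each tier instead of collapsing everything into one number.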

No Rust source touched.

Verification

cargo fmt --all --check → OK
cargo clippy --all-targets --all-features -- -D warnings → clean
cargo test --all → 34 passed, 0 failed
Python pipeline exercised offline via the mock model: run → grade → metrics → report, with the report rendering the gradient, token, and spend sections. QA grader self-tested across four cases: correct, misled, mentions-stale-term-but-correct, and unparseable. Both scenarios' artifacts regenerated by tools/author.py (genuine kind: changed divergence).
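The four self-test cases for the QA grader can be illustrated with a minimal sketch of deterministic grading from a structured VERDICT: line. The exact line format, the `Outcome` names, and the `grade_qa` signature are assumptions made for illustration; the grader in bench/ may differ.

```python
import re
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    MISLED = "misled"            # the verdict asserts the stale claim
    WRONG = "wrong"              # parseable, but neither expected nor stale
    UNPARSEABLE = "unparseable"  # no VERDICT: line found


# Assumed format: a line of the form "VERDICT: <answer>".
VERDICT_RE = re.compile(r"^\s*VERDICT:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def grade_qa(answer: str, expected: str, stale: str) -> Outcome:
    """Grade only the structured verdict, never the free-form reasoning."""
    matches = VERDICT_RE.findall(answer)
    if not matches:
        return Outcome.UNPARSEABLE
    # If the model emits several verdict lines, take the last as its final answer.
    verdict = matches[-1].strip().lower()
    if verdict == expected.strip().lower():
        return Outcome.CORRECT
    if verdict == stale.strip().lower():
        return Outcome.MISLED
    return Outcome.WRONG
```

Because only the verdict line is compared, an answer that quotes the stale term while rejecting it still grades as correct, which is the mentions-stale-term-but-correct case.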

Follow-on work (full seed suite incl. a TS scenario, pilot, full run) tracked under #99–#101.

Closes #95
Closes #96
Closes #97
Closes #98

🤖 Generated with Claude Code

, #98)

Stand up bench/: a Python, provider-agnostic harness that measures how much documentation *accuracy* changes agent task performance, the gap Surface exists to protect. Lives outside the Rust workspace; the core stays no-network and the bench only consumes the surf binary's output.

It compares four context conditions over the same code + task (C0 code-only, C1 stale doc, C2 fresh doc, C3 stale doc + a genuine `surf check --format json` report), graded deterministically. Scenarios carry a complexity `tier` so the report shows the effect as a gradient rather than a single number. Metrics: success rate, misled rate, output-token cost, and estimated dollar spend.

Includes the harness (models/prompts/runner/graders/metrics/report), the tools/author.py helper that seals hub hashes and emits the real divergence with the surf binary, and two authored scenarios: a T0 local case and a T2 security-premise case. Verified end-to-end offline via a mock model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Connorrmcd6 merged commit 16523df into main Jun 13, 2026
5 checks passed

Connorrmcd6 deleted the chore/surface-validation-bench branch June 13, 2026 12:34


