chore(bench): Surface agent-impact benchmark scaffold #111

Merged
Connorrmcd6 merged 1 commit into main from chore/surface-validation-bench on Jun 13, 2026
@Connorrmcd6

Owner

First milestone deliverable for Empirical validation of Surface (#11): a standardized, reproducible, provider-agnostic benchmark that measures how much documentation accuracy changes an agent's task performance — the gap Surface exists to protect.

Why

Surface's agent-facing pitch ("trustworthy context matters; rot is the failure mode") is currently asserted, not measured. This harness quantifies the delta between an agent working from fresh docs vs rotted docs, using drift of exactly the kind surf check catches.

What changed

New top-level bench/ (Python, outside the Rust workspace — the core stays no-network/deterministic; the bench only consumes the surf binary's output).

  • Four conditions, same code + task, only the doc block differs: C0 code-only · C1 stale doc · C2 fresh doc · C3 stale doc + a genuine surf check --format json report.
  • Complexity tiers on each scenario, so the report shows the Surface effect as a gradient (it should grow as re-deriving truth from code gets expensive) rather than one number.
  • Deterministic grading — QA via a structured VERDICT: line, code-edit via hidden tests — so the primary metric has no LLM-judge noise.
  • Metrics: success rate, misled rate (asserted the stale claim), output-token cost, and estimated dollar spend (token usage × per-model prices in config.toml).
  • tools/author.py seals hub hashes and emits the genuine divergence report with the real surf binary — the C3 context is Surface's actual output, not a mock.
  • Two authored scenarios: refresh-single-use-qa (T0, local) and refresh-replay-premise-qa (T2, a security conclusion built on a stale premise).
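As a rough illustration of the four conditions, the sketch below assembles one prompt per condition from the same code and task, varying only the doc block. The `Scenario` fields, the `build_context` helper, and the section labels are hypothetical names chosen for this example, not the harness's actual API, and the report string is a placeholder rather than real `surf check --format json` output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for one scenario's inputs."""
    code: str
    task: str
    stale_doc: str
    fresh_doc: str
    surf_report_json: str  # placeholder for the surf binary's JSON report


def build_context(s: Scenario, condition: str) -> str:
    """Return the prompt for one condition; code and task never change."""
    doc_blocks = {
        "C0": [],                                   # code only
        "C1": [("DOC", s.stale_doc)],               # stale doc
        "C2": [("DOC", s.fresh_doc)],               # fresh doc
        "C3": [("DOC", s.stale_doc),                # stale doc + report
               ("SURF CHECK REPORT", s.surf_report_json)],
    }
    if condition not in doc_blocks:
        raise ValueError(f"unknown condition: {condition!r}")
    parts = [("CODE", s.code)] + doc_blocks[condition] + [("TASK", s.task)]
    return "\n\n".join(f"## {label}\n{body}" for label, body in parts)
```

Keeping the code and task fixed across all four prompts means any difference in outcome can be attributed to the doc block alone.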

No Rust source touched.

Verification

  • cargo fmt --all --check: OK
  • cargo clippy --all-targets --all-features -- -D warnings: clean
  • cargo test --all: 34 passed, 0 failed
  • Python pipeline exercised offline via the mock model: run → grade → metrics → report renders the gradient, token, and spend sections; QA grader self-tested across correct / misled / mentions-stale-term-but-correct / unparseable; both scenarios' artifacts regenerated by tools/author.py (genuine kind: changed divergence).
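A minimal sketch of how a deterministic QA grader could cover the four self-test cases above. The exact `VERDICT:` line format, the `grade_qa` name, and the extra `incorrect` bucket are assumptions for illustration; the actual grader in bench/ may differ.

```python
import re

# Assumed format: a line of the form "VERDICT: <answer>" somewhere in the reply.
VERDICT_RE = re.compile(
    r"^[ \t]*VERDICT:[ \t]*(.+?)[ \t]*$", re.IGNORECASE | re.MULTILINE
)


def grade_qa(response: str, correct: str, stale: str) -> str:
    """Grade only the final VERDICT line, never the free-text reasoning."""
    verdicts = VERDICT_RE.findall(response)
    if not verdicts:
        return "unparseable"
    verdict = verdicts[-1].lower()
    if verdict == correct.lower():
        return "correct"
    if verdict == stale.lower():
        return "misled"  # the agent asserted the stale claim
    return "incorrect"
```

Because only the verdict line is compared, a reply that mentions the stale term while reasoning its way to the right answer is still graded `correct`.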

Follow-on work (full seed suite incl. a TS scenario, pilot, full run) tracked under #99 to #101.

Closes #95
Closes #96
Closes #97
Closes #98

🤖 Generated with Claude Code


Stand up bench/ — a Python, provider-agnostic harness that measures how much
documentation *accuracy* changes agent task performance, the gap Surface exists
to protect. Lives outside the Rust workspace; the core stays no-network and the
bench only consumes the surf binary's output.

It compares four context conditions over the same code + task (C0 code-only,
C1 stale doc, C2 fresh doc, C3 stale doc + a genuine `surf check --format json`
report), graded deterministically. Scenarios carry a complexity `tier` so the
report shows the effect as a gradient rather than a single number. Metrics:
success rate, misled rate, output-token cost, and estimated dollar spend.
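
The per-condition metrics listed above could be aggregated roughly as follows. This is a sketch under stated assumptions: the `Run` record, the `summarize` helper, and the per-million-token price parameters are hypothetical, and the real harness reads its per-model prices from config.toml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one graded model run."""
    condition: str      # "C0" .. "C3"
    outcome: str        # e.g. "correct", "misled", "unparseable"
    input_tokens: int
    output_tokens: int


def summarize(runs, usd_per_mtok_in: float, usd_per_mtok_out: float) -> dict:
    """Group runs by condition and compute rates, token cost, and spend."""
    by_condition = {}
    for run in runs:
        by_condition.setdefault(run.condition, []).append(run)

    summary = {}
    for condition, group in sorted(by_condition.items()):
        n = len(group)
        tokens_in = sum(r.input_tokens for r in group)
        tokens_out = sum(r.output_tokens for r in group)
        summary[condition] = {
            "n": n,
            "success_rate": sum(r.outcome == "correct" for r in group) / n,
            "misled_rate": sum(r.outcome == "misled" for r in group) / n,
            "mean_output_tokens": tokens_out / n,
            # Estimated spend: token usage times assumed per-million prices.
            "est_usd": (tokens_in * usd_per_mtok_in
                        + tokens_out * usd_per_mtok_out) / 1_000_000,
        }
    return summary
```

Comparing `success_rate` and `misled_rate` between C1 and C2, per tier, gives the gradient the report is meant to show.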

Includes the harness (models/prompts/runner/graders/metrics/report), the
tools/author.py helper that seals hub hashes and emits the real divergence with
the surf binary, and two authored scenarios: a T0 local case and a T2
security-premise case. Verified end-to-end offline via a mock model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Connorrmcd6 merged commit 16523df into main on Jun 13, 2026
5 checks passed
@Connorrmcd6 deleted the chore/surface-validation-bench branch on June 13, 2026 12:34
Development

Successfully merging this pull request may close these issues.

  • Metrics + reporting
  • Deterministic graders: code + QA
  • Harness core: models, prompts, runner
  • Bench scaffold: scenario format + reference scenario
