Skip to content

docs(plan): eval authoring restructure — promptfoo superset (DRAFT)#1594

Merged
christso merged 45 commits into
mainfrom
plan/promptfoo-aligned-eval-restructure
Jul 2, 2026
Merged

docs(plan): eval authoring restructure — promptfoo superset (DRAFT)#1594
christso merged 45 commits into
mainfrom
plan/promptfoo-aligned-eval-restructure

Conversation

@christso

@christso christso commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

Draft design plan (no code changes) for re-aligning AgentV's eval authoring format with promptfoo — keeping snake_case — and borrowing runner/analytics from Margin-Lab/evals and transcripts/agentic-graders from vercel-labs/agent-eval.

Plan doc: docs/plans/promptfoo-aligned-eval-restructure.md

North-star

AgentV's eval contract is a strict superset of (snake_cased) promptfoo: any promptfoo config, mechanically snake_cased, is a valid AgentV eval with equivalent semantics; AgentV accepts more on top (bare-string asserts, workspace, gate, agentic judges, multi-turn, …).

Governing decisions (§2, resolved)

  • Prefer promptfoo naming where functionally equivalent; keep AgentV only where semantics are better (gate > scalar threshold; repeat: {count, strategy, early_exit} > repeat: int; workspace/repo materialization; agentic judge).
  • Hard deprecation (major version): removed keys (assertions, composite, eval_cases, grader name-as-metric, ${{ ENV }}) hard-error; one-shot codemod migrates.
  • targets re-canonicalized as first-class system-under-test. Verified against promptfoo source + history: promptfoo targets is a 2024-05 redteam alias for the canonical providers. AgentV elevates targets (fits agent-eval) and demotes provider/apiId to the backend-kind vocabulary. It does not accept a top-level providers alias (that would overload AgentV's existing backend provider/sub-provider term); the promptfoo importer rewrites top-level providers:targets: instead.
  • Templating: {{ var }} (nunjucks, eval-time) + ${ENV} (docker/k8s style, config-time, replaces ${{ ENV }}) — no sigil collision.
  • Optional test id: split stable derived test_id (identity/filter/compare) from display label (iddescription → vars → Test #n).
  • Bare-string assert shorthand kept (better semantics: N criteria → one judge call), desugars to a batched llm-rubric. One-way compat (promptfoo ⊆ AgentV) is the design.
  • LLM-judge types consolidated into one promptfoo-named llm-rubric (folds AgentV's rubrics + agentic llm-grader in as optional fields).

Borrows

  • margin-lab: instance-as-scheduling-unit, lease/heartbeat worker pool, infra-only retry, pass@k analytics via one pure Build().
  • vercel-agent-eval: two-layer transcript (raw + normalized canonical tool_name enum + inlined summary), evidence-by-path agentic judge, judge pinning.

Open (implementation sequencing, non-blocking — §8)

  1. Which exotic assertion types ship working vs accepted-but-stubbed at launch.
  2. Redteam: accept-and-ignore the redteam: key to preserve superset parsing vs hard carve-out.
  3. javascript/python desugar to code-grader vs distinct types.

Review asks

  • Confirm the §2 contract decisions.
  • Weigh in on the three §8 sequencing calls (leanings noted in the doc).

🤖 Generated with Claude Code


Implementation hand-off (beads)

Tracked under epic av-kfik. Workers have no context of the design conversation — read docs/plans/promptfoo-aligned-eval-restructure.md (this PR) + docs/adr/0015 first; each bead is self-contained and cites its plan section.

Ready now (no blockers):

  • av-kfik.1 (P0) — Write authoring-contract + output-contract ADRs (anchor)
  • av-kfik.3 (P1) — Group A: delete deprecated back-compat aliases (independent, parallel)

Unlocked after the schema anchor (av-kfik.2.1):

  • av-kfik.2 (P0) — Snake_cased promptfoo eval schema (Zod)
  • av-kfik.4 — Templating: nunjucks vars + ${ENV} config env ← .2
  • av-kfik.5 — prompts × targets matrix + instance expansion ← .2
  • av-kfik.6 — Re-canonicalize targets; provider=backend; registry promptfoo shape ← .2
  • av-kfik.7 — Assertion vocabulary + llm-rubric consolidation + grader execution ← .2
  • av-kfik.8 — Two-layer transcript + canonical tool_name enum + summary ← .2
  • av-kfik.9 — Datasets: file:// loading + __expected DSL ← .2
  • av-kfik.10 — Runner: worker pool + reset workspace pool + --rerun-failed.5
  • av-kfik.11 — Grading contract: assertion_results/summary/verdict/score + agentic judge ← .7
  • av-kfik.12 — Output/artifact contract: queryable summary.json + .internal/ + timing→metrics merge ← .1, .5
  • av-kfik.13 — Multi-turn: split execution/evaluation; drop window_size (ADR-0015) ← .4, .7
  • av-kfik.14 — Extensions/workspace/agent-rules slice (PR docs(plans): add promptfoo-compatible extensions plan #1592 + amendments A1–A6) ← .2, .4
  • av-kfik.15 — Hard-deprecation codemod for existing eval files ← .2, .6, .7, .13, .12
  • av-kfik.16 — Docs + examples + live provider/grader dogfood ← .10, .11, .12, .14

Use bd ready to find unblocked work; bd show <id> for the full task spec + acceptance.

…uperset

Draft design plan (for review, no code changes) to re-align AgentV's eval
authoring format with promptfoo, keeping snake_case, and borrowing runner/
analytics from Margin-Lab/evals and transcripts/agentic-graders from
vercel-labs/agent-eval.

Governing decisions captured:
- AgentV eval contract is a strict SUPERSET of (snake_cased) promptfoo.
- Prefer promptfoo naming where equivalent; keep AgentV only where semantics
  are better (gate, repeat block, workspace, agentic judge).
- Hard deprecation (major version): removed keys error, codemod migrates.
- `targets` re-canonicalized as first-class SUT (verified: promptfoo `targets`
  is a 2024-05 redteam alias for the canonical `providers`); `provider`
  demoted to backend-kind field; `providers` accepted as input alias.
- `{{ var }}` nunjucks for eval-time vars; `${ENV}` (docker/k8s style) for
  config-time env, replacing `${{ ENV }}`.
- Optional test `id`: split stable derived `test_id` from display label.
- Bare-string `assert` shorthand kept, desugars to batched `llm-rubric`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 1, 2026

Copy link
Copy Markdown

Deploying agentv with  Cloudflare Pages  Cloudflare Pages

Latest commit: 0bff279
Status: ✅  Deploy successful!
Preview URL: https://2020d8e0.agentv.pages.dev
Branch Preview URL: https://plan-promptfoo-aligned-eval.agentv.pages.dev

View logs

christso and others added 3 commits July 2, 2026 09:07
…gn lifecycle with promptfoo

- No new top-level `workspace:` block. Repo/fixture spec rides as dataset
  `vars` (file://-loadable), consumed by a built-in, auto-registered,
  overridable `agentv:workspace` extension. Matches vercel (fixture=case)
  and margin (image=case): workspace is part of the dataset.
- Single lifecycle surface = promptfoo `extensions`
  (beforeAll/afterAll/beforeEach/afterEach). Hard-remove `on_run_complete`
  (= afterAll), `preprocessors`, and `workspace.hooks`.
- Isolation = the hook name in the extension reference (verified promptfoo
  mechanism, evaluatorHelpers.ts:633 EXTENSION_HOOK_NAMES):
  `agentv:workspace:beforeAll` = shared, `:beforeEach` = per-case.
- Built-in extension internally borrows margin (git/docker materialization +
  mirror cache) and vercel (per-case fixture copy, path handed to target),
  validates the vars.workspace shape, writes materialized path back to vars.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bprocess, code-grader power tool

Verified against promptfoo source: javascript assertions run in-process
(new Function / dynamic import), only python shells out (PythonShell). AgentV
matches: `javascript` in-process (easier on Bun — imports .ts directly),
`python` subprocess, `code-grader` stays the subprocess power tool for
workspace-cwd / arbitrary-language / isolation cases. Do not desugar
`javascript` into `code-grader` (loses in-process speed). Resolves old §8.3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Hard churn confirmed OK (not yet in production) — clean break, no aliases.
- Simplify: collapse tests[].input into prompts + vars (one way to express
  what's sent to the target); input_files stays as prompt-content.
- Runner: laptop-first, no DB store. Instance expansion + pass@k + infra-only
  retry (margin ideas), but a simple worker pool; resumability via index.jsonl
  (--rerun-failed); workspace = shared or reset-based pool, not per-instance
  containers.
- nunjucks confirmed to meet inline text+var mixing; {{var}} for vars,
  ${ENV} for env.
- Redteam + unimplemented exotic assertions: treated as unrecognized fields
  (future scope), not stubbed. `similar` in (needs embeddings provider).
- test_id: layered identity — content-hash (content), author tag/metadata
  (governance/trend, Dashboard keys on this), description/vars (display).
- LLM judge type name = `llm-rubric`; default grading adopts skeptical
  evidence-by-path judge (opt-out via explicit prompt); documented what changes.
- New §9: PR #1592 is the extensions/workspace implementation slice — reasonable,
  mergeable after 4 amendments (chiefly vars.workspace + built-in agentv: scheme).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso added a commit that referenced this pull request Jul 1, 2026
…ation, vars.workspace, built-in scheme)

Adds an Amendments section overriding the original where they conflict, per
the wider promptfoo-superset restructure (PR #1594):
- A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a
  reset-based workspace pool; remove the isolation config knob.
- A2: per-case workspace spec lives in dataset vars.workspace.
- A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://.
- A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]).
- A5: ADR 0014 must note the incoming superseding ADR (assert / input removal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso and others added 9 commits July 2, 2026 09:37
…#1592 amended

- §5.3: llm-rubric reuses existing EvaluationScore grading.json shape
  (score + verdict + assertions[{text,passed,evidence}]); risk is mapping the
  judge verdict to per-criterion assertions+evidence, keeping evidence in
  grading.json, and score/verdict consistency. No new grading contract.
- §9: record that PR #1592 was amended to the agreed model (A1-A5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tension -> agent_rules

- §5.3: grading.json originates from agentskills (assertion_results[{text,passed,
  evidence}] + summary counts; no overall verdict). Per-assertion already matches
  agentskills exactly — keep. Overall stays a STRING verdict('pass'|'fail'|'skip')
  + fractional score (AgentV superset; boolean can't express skip/fractional).
  Align array key -> assertion_results and add summary counts. llm-rubric maps
  verdict->per-criterion assertion_results.
- §9: rename agentv:skills -> agentv:agent_rules (stages skills + hooks + agents +
  rules), package agent-rules, context agent_rules_paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nscript/metrics); cleanup inventory; agent-rules kebab

- §6.0: canonical output = split bundle, queryable-on-filesystem, no DB. Best of
  each: margin-lab rich jq-queryable summary.json (Summary shape) + index.jsonl;
  vercel two-layer transcript + tool_name enum + inlined transcript_summary for
  metrics; agentskills grading.json; promptfoo EvaluateSummaryV3 export only.
- §10: verified dead-code/back-compat cleanup inventory (group A delete-now,
  group B removed-by-restructure, group C dedup). Hard deprecation, no aliases.
- Naming: identifier tokens are kebab (agentv:agent-rules), data fields snake
  (agent_rules_paths). Fixed agent_rules -> agent-rules for the scheme.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ternal/ bundle layout (§6.0.1)

- index stays JSONL (append/stream/query; enables --rerun-failed), not index.json.
- Fix drift: file index.jsonl is referenced as manifest_path today; rename ->
  index_path. Reserve "manifest"/bundle.json for the frozen-config file only;
  summary.json is the queryable aggregate.
- Adopt margin-lab's internal/ folder, dot-prefixed as .internal/ (AgentV's
  "dot = skip discovery" convention): index.jsonl/progress.json/events.jsonl/
  bundle.json move under .internal/; run root stays clean (summary.json + per-case
  dirs). Cross-run .indexes/.cache at results root are a separate scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce of truth); keep index.jsonl name

- No first-class promptfoo EvaluateSummaryV3 / single-file export — the split
  bundle is the single source of truth (YAGNI). A promptfoo-shaped file can be
  generated on demand if ever needed; not shipped/maintained. named_scores/
  derived_metrics still live inside the split rows (Dashboard), not a file.
- Naming: keep index.jsonl (JSONL, not index.json). Reject manifest.jsonl
  (reserved for bundle.json/run manifest). rows.jsonl is the only fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…split duplication)

Prior discussion: ADR-0011 & 0012 both defined metrics.json AND timing.json as
separate per-attempt sidecars; both are written today. But timing.json
(TimingArtifact) already carries total_tokens/cost_usd/token_usage/usage_sources
(perf metrics, not just timing), overlapping metrics.json (trace execution
metrics). No reference splits them (agentskills folds tokens+duration into one
file; vercel/margin keep one per-attempt blob). Decision: one metrics.json with
sections (duration/tokens/cost always; execution/trajectory when a trace exists);
drop timing.json + timing_path, keep single metrics_path. Output-contract ADR
supersedes the 0011/0012 split. Added to §6.0.1 and §10 Group C.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rgets` instead

AgentV already uses provider + apiId (provider + sub-provider) as the backend
vocabulary, so a top-level `providers`-as-SUT alias would overload the term.
Canonical SUT key is `targets` only; the promptfoo importer rewrites top-level
`providers:` -> `targets:` (mechanical, alongside camel->snake). Superset holds
via the importer, not a live alias; provider/apiId/sub-provider stay
unambiguously "backend." Updated §0/§1.1/§2.a.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r); multi-turn provenance + keep evaluation layer

- §0/§1.1/§2.a: superset is a design property, not a shipped importer. New evals
  authored in snake_case directly. providers->targets + camel->snake is a one-off
  conversion/codemod, not a runtime alias. Distinct from the hard-deprecation
  codemod (which IS built, migrates AgentV's own files).
- §3/§2.i: multi-turn has real provenance (agentv#1053, researched in
  agentevals/agentevals-research/research/findings/multiturn-conversation-eval
  against inspect-ai, google-adk, ragas). on_turn_failure <- inspect-ai
  state.completed; per-conversation aggregation is a deliberate gap-fill (inspect/
  ragas/promptfoo aggregate only across epochs, not within a conversation). Split
  execution (promptfoo _conversation) from evaluation (keep AgentV's per-turn
  assertions + aggregation + on_turn_failure). Drop only window_size (no pedigree).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…size→_conversation migration

- New ADR-0015: split conversation execution (promptfoo _conversation/sessions)
  from evaluation (keep per-turn assertions + cross-turn aggregation +
  on_turn_failure, provenance inspect-ai/google-adk/agentv#1053). Drop
  window_size. Includes the concrete window_size -> _conversation template
  mapping (loop.revindex <= N; system outside the loop) for the codemod.
- Plan §3: window_size rationale = redundant in the _conversation model (author
  windows in the template), points to ADR-0015.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso and others added 14 commits July 2, 2026 10:38
Materialization (resolver + git-mirror cache) is always-on and makes even
fresh-per-case fast — clone speed is decoupled from reuse. Pooling = reuse +
quick-reset between cases, a perf optimization mainly for local evals (amortize
expensive setup); CI prefers fresh-per-case + mirror cache. Three isolation
levels: shared / pooled / fresh-per-case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rkspace resolver adapters

Layered gate: deterministic CI (merge gate) + live PR-679 CargoWise dogfood
(real provider+grader, blocking) + optional promptfoo-parity diff. Workspace
acquisition = two resolver adapters unified under 'agentv workspace deps':
local git-mirror (~/projects/WiseTechGlobal/CargoWise, 7.3G, dev/offline) and a
snapshot-download adapter (port WTG download-release-deps.ts: per-year .git
tarballs from a release, shared checkout+symlink, LFS skip) for CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… / partial+sparse+shallow / snapshot-download

Not 'download vs shallow clone': ordered fastest-first — (1) local mirror via
alternates/reference (zero transfer, dev), (2) direct partial+sparse+shallow
clone (CI default: --filter=blob:none --sparse --depth 1 + workspace.repos.sparse;
less transfer than a per-year .git tarball, no producer), (3) snapshot-download
adapter (CI fallback: many-commit amortization or sha-fetch/LFS blockers).
Resolver selects per environment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m acquisition (SWE-bench/margin/Harbor)

Per AgentV research repo-provisioning-schema-design.md: eval declares provenance
only (repo/commit/sparse/ancestor); acquisition is a harness resolver in machine
config, keyed on repo, ordered backends: local-auto-adopt(--reference) ->
mirror-cache(--reference) -> snapshot -> remote -> (future) docker-image
(SWE-bench/margin, same identity key). --reference gives shallow-speed + full
history, retiring depth/filter. Remove tangled per-repo type/resolve/depth/filter/
resolver fields. Supersedes the §11 acquisition-perf framing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…upersede 0013; fold Inspect AI

- ADR-0016: promptfoo-superset authoring contract (assert/llm-rubric/targets/prompts+
  vars/nunjucks+${ENV}/hard-deprecation/metric/test_id identity/gate/repeat/workspace
  provenance). Supersedes both 0013 ADRs (marked in their Status).
- ADR-0017: output/artifact contract (best-of-each split bundle, margin queryable
  summary + index.jsonl, vercel transcript/metrics, agentskills grading + verdict/score,
  .internal/, index_path, timing->metrics merge, pass@k Build) + workspace resolver
  (provenance vs acquisition, ordered backends, --reference workhorse, future docker-image
  = native SWE-bench). Includes FAIL_TO_PASS/PASS_TO_PASS note (code-grader, no new primitive).
- §11.1: fold Inspect AI (4th confirmation of provenance-vs-acquisition; image/build/x-local).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…op the SDK recipe suggestion

At run time the two lists collapse to 'run these named tests; pass iff all pass' —
too domain-specific for a primitive and needs no dedicated recipe; a workspace-cwd
code-grader (exit code = verdict) covers it. Lists are data the command consumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ENV}); avoids clobbering runtime shell ${VAR}

Env vars use nunjucks {{ env.VAR }} rendered at config-load time (promptfoo-native,
load.ts:336), not a ${ENV} sigil. One templating engine; phase-separated by render
pass + env namespace. Key correctness win: {{ env.VAR }} does not collide with
runtime shell ${VAR} in CLI-target commands (those pass through to the shell).
Codemod ${{ X }} -> {{ env.X }}. Updated §2.f, ADR-0016 pt7, beads .4/.15.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…clarative workspace.repos field, not an extension

- Mark ADR-0015/0016/0017 Accepted. Numbering: 0014(extensions,#1592) / 0015(multi-turn)
  / 0016(authoring) / 0017(output+resolver) — no collision.
- Repo provisioning is a declarative workspace.repos field the harness materializes
  BEFORE hooks (ADR-0016 pt10, ADR-0017): reverses the workspace-as-extension direction
  (all 4 benchmark frameworks treat provisioning as harness-core; promptfoo has no
  workspace to align with). Extensions only for non-provisioning setup (agent-rules,
  custom hooks). isolation is a workspace field. Kept: workspace.repos/isolation/docker/
  template; removed workspace.hooks -> extensions.
- Plan §2.l/§3/§0 updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e hatch); correct over-absolute 'not an extension'

Provenance stays a declarative field and the common case runs before hooks
(unchanged), but acquisition is extensible per the plugins-over-builtins guardrail:
register a custom acquisition backend (recommended) or use a beforeAll escape hatch;
the built-in acquisition may itself be a swappable plugin. Invariants unchanged
(declarative provenance, acquisition-before-hooks, built-ins ship).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… add field-vs-resolver orthogonality framing

- ADR-0017: locked workspace schema (repos: path/repo/commit(SHA, base_commit alias)/
  sparse/ancestor; isolation fresh|pooled|shared; template; docker). Name 'workspace'
  chosen for durability (CI GITHUB_WORKSPACE/margin/git; not sandbox/environment/testbed).
  Never in schema: acquisition (harness config) + hooks (extensions) — that's what keeps
  it durable. Added field(what)-vs-resolver(how) orthogonality + package.json analogy.
- Plan §2.l: replaced stale vars.workspace/extension example with the locked field shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…B; adopt provenance field; audit/anti-reward-hacking future scope

exploitbench confirms: split filesystem run-tree = source of truth, SQLite is a
derived rebuildable view (import/export bijection) not required, image pinned by
sha256 digest, config_snapshot=bundle.json. Borrow: (1) provenance field on result
rows (native/mock/replay/imported_*) — adopt; (2) eval-integrity/anti-reward-hacking
(read-only grader container, audit re-grade + red-flag scan + model-identity check)
— future scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n; categorize by suite AND tags/experiment

Confirms ADR-0009/0012: one <run_id> bundle per CLI invocation across all suite
YAMLs (never per-suite timestamp folders); runtime_source.kind=multi_eval.
Identity = eval_path+test_id (uuid-suffixed dir) so overlapping case IDs across
suites don't collide; suite/name are display/grouping metadata not routing.
Categorize by BOTH axes on each index row: suite (structural) + tags/experiment
(semantic/campaign); experiment = run bucket, suite = intra-run group.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…un-N -> sample-N; experiment is a tag not a bucket

- Cross-run index: rebuildable .agentv/results/.indexes/runs.jsonl (one row per
  run), derived from */summary.json, not source of truth; per-run index.jsonl stays.
  JSONL not index.json.
- Repeat folder run-N -> sample-N (margin/pass@k; de-conflicts with run_id).
  sample_index=repeats, retry_index=infra retries.
- experiment = reserved-by-convention tag (Dashboard default compare key), NOT a
  bucket/field/storage path — continues ADR-0006/0009/0013 demotion. One grouping
  mechanism (tags); experiment/suite are conventional keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ser-set default grouping key

Not even a default compare key. Tag keys sort alphabetically; the default cross-run
grouping/compare key is a user preference (any tag), AgentV blesses none.
--experiment X = sugar for --tag experiment=X. Completes the experiment demotion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso and others added 3 commits July 2, 2026 14:33
…me carry-forward, not a general bijective import

margin can rehydrate a run's completed-work from its run-dir for RESUME
(LoadProgressSnapshot + loadSavedResumeBundle + carryForwardLocalCases) but has
no general import that rebuilds the multi-run query DB from files (memory store
ephemeral, Postgres persists independently). AgentV follows exploitbench's model
(fs = source of truth, .indexes/*.jsonl derived/rebuildable; --rerun-failed reads
index.jsonl, no store to rehydrate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Progress update on Group A cleanup / grader artifact wording:

  • Pushed 5df2530f to fix the prior CI failures: Biome formatting plus regenerated skills-data/agentv-eval-writer/references/eval.schema.json.
  • Updated ADR-0017 to clarify that per-criterion/per-aspect rows are generic AgentV grader assertion evidence, not g-eval-specific. All graders can emit assertions[]; multi-aspect graders emit multiple rows; grading.json keeps both flat assertions[] and nested graders[].assertions[].
  • Local verification passed: bun run lint; bun test packages/core/test/evaluation/validation/eval-schema-sync.test.ts; bun test apps/cli/test/commands/eval/artifact-writer.test.ts; bun --filter @agentv/core test (2107 pass).

GitHub CI is running on head 5df2530f.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Progress update on Group A alias cleanup:

  • Pushed 90b662b3 removing deprecated exported aliases: EvalCase, OutputMessage, PASS_THRESHOLD, and getAgentvHome.
  • Dropped the special results_by_project config deprecation path; it now falls through as an ordinary unexpected field. Dashboard docs now point directly to projects[].results.
  • Removed the stale required_min_score row from the eval-writer rubric reference card.

Local verification passed: bun run lint; bun --filter @agentv/core typecheck; bun --filter @agentv/core build; focused paths/config/grader/parser tests; bun run typecheck.

GitHub CI is green on head 90b662b3.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Progress checkpoint on head 76178419:

  • Removed the programmatic TypeScript expected_output convenience alias from evaluate() inputs; expectedOutput remains the TS API, while YAML/wire expected_output remains the first-class golden/reference answer.
  • Added hard-deprecation checks for inline tests and conversation turns that still pass expected_output to evaluate().
  • Tightened ADR-0017 wording for the owner clarification: artifact assertion rows are the generic AgentV grader evidence channel. Every grader returns assertions[]; deterministic graders usually emit one row, while multi-aspect graders emit one row per criterion/aspect. This is not g-eval-specific.

Local verification:

  • bunx biome format --write docs/adr/0017-output-artifact-and-workspace-resolver-contract.md packages/core/src/evaluation/evaluate.ts packages/core/test/evaluation/evaluate-enhanced.test.ts packages/core/test/evaluation/evaluate-programmatic-api.test.ts
  • bun test packages/core/test/evaluation/evaluate-enhanced.test.ts packages/core/test/evaluation/evaluate-programmatic-api.test.ts (25 pass)
  • bun --filter @agentv/core typecheck
  • bun --filter @agentv/core build
  • bun run lint
  • bun run typecheck

CI is running for head 76178419.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

CI follow-up: head 76178419 is green (Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages all SUCCESS).

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Progress checkpoint on head 97a95244:

  • Removed built-in target provider alias sprawl from Group A: azure-openai, google, google-gemini, codex-cli, copilot, copilot_sdk, pi, claude-code, cc-mirror, bedrock, and vertex no longer canonicalize to built-in providers.
  • Kept canonical providers: azure, gemini, codex, copilot-cli, copilot-sdk, pi-coding-agent, pi-cli, claude/claude-cli/claude-sdk.
  • Preserved discovered-provider fallback for .agentv/providers/<kind>.ts; removed aliases now validate as unknown provider names instead of built-in aliases.
  • Removed stale cc-mirror target docs and updated Copilot docs to use provider: copilot-cli.

Local verification:

  • bunx biome format --write apps/web/src/content/docs/docs/targets/coding-agents.mdx packages/core/src/evaluation/providers/targets.ts packages/core/src/evaluation/providers/types.ts packages/core/src/evaluation/validation/targets-validator.ts packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts
  • bun test packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts (77 pass)
  • bun --filter @agentv/core typecheck
  • bun --filter @agentv/core build
  • bun run lint
  • bun run typecheck

CI is running for head 97a95244.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

CI follow-up: head 97a95244 is green (Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages all SUCCESS).

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Follow-up clarification pushed in cee6a8d.\n\n- Tightened ADR-0016 and the restructure plan to state that artifact assertion rows are the generic AgentV grader contract, not a g-eval-specific behavior.\n- Clarified the terminology: every grader returns assertions[]; deterministic graders usually emit one row; multi-aspect graders emit one row per authored check/result unit; structured g-eval emits one row per criterion because criteria are one such multi-aspect unit.\n- Verification: git diff --check.\n- GitHub checks are green on cee6a8d: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Progress update for Group A cleanup: pushed d74a7da.\n\n- Removed target-level log_format/log_output_format alias support and now reject those fields with stream_log guidance.\n- Kept canonical stream_log and wired it into Copilot/Pi/Claude logger format resolution; stream_log: raw maps to JSON event logs, stream_log: summary maps to summary logs, and stream_log: false disables those stream loggers.\n- Updated target validation so provider-specific legacy log_format settings no longer pass as known settings.\n- Local verification: bun test packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts; bun --filter @agentv/core typecheck; bun --filter @agentv/core build; bun run lint; bun run typecheck; git diff --check.\n- GitHub checks are green on d74a7da: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Progress update for av-kfik.3: pushed e27b6aa0 (chore(evals): remove numeric required thresholds).

What changed:

  • Removed the legacy required: 0.7 threshold shim from grader parsing and orchestration. required is now boolean-only; custom thresholds use required: true + min_score.
  • Updated core/programmatic/SDK types, eval validation schema, generated eval.schema.json, docs, examples, and the eval-writer skill reference.
  • Numeric required values now fail parser loading with migration guidance and validator warnings point to min_score.

Local verification:

  • bun test packages/core/test/evaluation/loaders/grader-parser.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/sdk/test/grader-helpers.test.ts (285 pass)
  • bun --filter @agentv/core typecheck
  • bun --filter @agentv/sdk typecheck
  • bun --filter @agentv/sdk build
  • bun run lint
  • bun run typecheck
  • git diff --check
  • jq empty skills-data/agentv-eval-writer/references/eval.schema.json

GitHub CI on head e27b6aa0 is green: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all succeeded.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Small follow-up for av-kfik.3: pushed eba1442e (chore(targets): align azure api_format removal error).

This just aligns the runtime Azure target error with the validator/docs wording: Azure api_format now consistently says the field has been removed. api_format remains supported for the non-Azure target types that still need it, including OpenAI/Codex/Copilot local-proxy configs.

Local verification:

  • bun test packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts (80 pass)
  • bun --filter @agentv/core typecheck
  • bun run lint
  • git diff --check

GitHub CI on head eba1442e is green: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all succeeded.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Live dogfood update for av-kfik.3 at head eba1442e.

  • Ran a 1-case live eval through the local OpenAI-compatible endpoint using base_url=http://127.0.0.1:10531/v1, provider: codex / api_format: responses for the agent target, and provider: openai / api_format: chat for the grader target. Env refs were used in YAML for base URL, API key, and model; command-level values were LOCAL_OPENAI_PROXY_API_KEY=dummy and LOCAL_OPENAI_PROXY_MODEL=gpt-5.4-mini.
  • Command: bun apps/cli/src/cli.ts eval run /tmp/agentv-av-kfik3-dogfood/dogfood.eval.yaml --targets /tmp/agentv-av-kfik3-dogfood/targets.yaml --target local-codex-agent --workers 1
  • Result: PASS, 1/1, score 0.90. Run bundle: .agentv/results/2026-07-02T07-47-33-344Z.
  • Artifact check: grading.json.assertions[] has 2 rows matching the two authored score-range criteria, and grading.json.graders[0].assertions[] has the same 2 rows. index.jsonl.scores[0].type is llm-grader, with scores[0].assertions[] also carrying the 2 rows and details.raw_scores showing boolean-required=8, min-score-threshold=10.
  • Conclusion: assertion/artifact rows are the generic AgentV grader evidence contract, not G-Eval-specific. Score-range rubrics emit one row per criterion; deterministic and code graders emit whatever assertion rows their checks/scripts return.

Private evidence pushed to agentv-private:evidence/av-kfik3-live-dogfood-2026-07-02 at commit aa7f3a5.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

av-kfik.2 schema/validation slice pushed in 8b261b2.\n\nChanges:\n- Zod eval schema accepts snake_cased promptfoo-shaped prompts/targets/default_test/assert/scenarios/derived_metrics/output_path/env/nunjucks_filters/extensions/evaluate_options fields.\n- Tests can omit id and use prompts/provider_output; canonical assert is accepted at suite/default_test/test levels.\n- Validator keeps top-level providers as an unknown-field warning, not a live alias for targets.\n- Regenerated skills-data/agentv-eval-writer/references/eval.schema.json.\n\nVerification:\n- bun test packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts\n- bun --filter @agentv/core typecheck\n- bun --filter @agentv/core lint\n- bun run lint\n- git diff --check\n\nBoundary: runtime parser/execution for promptfoo assert/g-eval remains in av-kfik.7+ follow-up work; this commit is schema/validation only.

@christso

christso commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

av-kfik.1 ADR polish pushed in 916eccd.\n\nChange: removed the contradictory duplicate standalone Accepted line from both superseded ADR-0013 files, so the status blocks now clearly read Accepted-then-superseded by ADR-0016.\n\nVerification:\n- bun run lint\n- git diff --check\n\nNo runtime impact.

@christso christso marked this pull request as ready for review July 2, 2026 08:48
@christso christso merged commit fd2a22b into main Jul 2, 2026
8 checks passed
@christso christso deleted the plan/promptfoo-aligned-eval-restructure branch July 2, 2026 08:48
christso added a commit that referenced this pull request Jul 2, 2026
* docs(plans): add promptfoo-compatible extensions plan

Entire-Checkpoint: dd08c8dc0d47

* docs(plans): amend extensions plan to agreed model (hook-derived isolation, vars.workspace, built-in scheme)

Adds an Amendments section overriding the original where they conflict, per
the wider promptfoo-superset restructure (PR #1594):
- A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a
  reset-based workspace pool; remove the isolation config knob.
- A2: per-case workspace spec lives in dataset vars.workspace.
- A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://.
- A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]).
- A5: ADR 0014 must note the incoming superseding ADR (assert / input removal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plans): rename skills extension -> agent_rules; clarify grading contract origin

- A6: the staging extension covers skills + hooks + subagents + rules, so rename
  to agentv:agent_rules (package agent-rules, context agent_rules_paths). skills
  is one kind of agent rule, not the extension name.
- A4: note grading.json originates from agentskills (assertion_results + summary);
  AgentV adds string verdict (pass/fail/skip) + fractional score as a superset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plans): use kebab agent-rules for the scheme identifier (agent_rules_paths field stays snake_case)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plans): amend A7 — repo provisioning is a declarative field, not an extension (per ADR-0016/0017)

Narrows this PR: do NOT move repo materialization into an extension. Repo
acquisition stays harness-core (declarative workspace.repos field + resolver,
materialized before hooks). Extensions cover only non-provisioning setup
(agent-rules, custom hooks). Names the superseding ADRs (0016 authoring, 0017
output/resolver) referenced in A5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(plans): remove trailing whitespace from promptfoo plan

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant