docs(plan): eval authoring restructure — promptfoo superset (DRAFT)#1594
Conversation
…uperset
Draft design plan (for review, no code changes) to re-align AgentV's eval
authoring format with promptfoo, keeping snake_case, and borrowing runner/
analytics from Margin-Lab/evals and transcripts/agentic-graders from
vercel-labs/agent-eval.
Governing decisions captured:
- AgentV eval contract is a strict SUPERSET of (snake_cased) promptfoo.
- Prefer promptfoo naming where equivalent; keep AgentV only where semantics
are better (gate, repeat block, workspace, agentic judge).
- Hard deprecation (major version): removed keys error, codemod migrates.
- `targets` re-canonicalized as first-class SUT (verified: promptfoo `targets`
is a 2024-05 redteam alias for the canonical `providers`); `provider`
demoted to backend-kind field; `providers` accepted as input alias.
- `{{ var }}` nunjucks for eval-time vars; `${ENV}` (docker/k8s style) for
config-time env, replacing `${{ ENV }}`.
- Optional test `id`: split stable derived `test_id` from display label.
- Bare-string `assert` shorthand kept, desugars to batched `llm-rubric`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Deploying agentv with
|
| Latest commit: |
0bff279
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://2020d8e0.agentv.pages.dev |
| Branch Preview URL: | https://plan-promptfoo-aligned-eval.agentv.pages.dev |
…gn lifecycle with promptfoo - No new top-level `workspace:` block. Repo/fixture spec rides as dataset `vars` (file://-loadable), consumed by a built-in, auto-registered, overridable `agentv:workspace` extension. Matches vercel (fixture=case) and margin (image=case): workspace is part of the dataset. - Single lifecycle surface = promptfoo `extensions` (beforeAll/afterAll/beforeEach/afterEach). Hard-remove `on_run_complete` (= afterAll), `preprocessors`, and `workspace.hooks`. - Isolation = the hook name in the extension reference (verified promptfoo mechanism, evaluatorHelpers.ts:633 EXTENSION_HOOK_NAMES): `agentv:workspace:beforeAll` = shared, `:beforeEach` = per-case. - Built-in extension internally borrows margin (git/docker materialization + mirror cache) and vercel (per-case fixture copy, path handed to target), validates the vars.workspace shape, writes materialized path back to vars. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bprocess, code-grader power tool Verified against promptfoo source: javascript assertions run in-process (new Function / dynamic import), only python shells out (PythonShell). AgentV matches: `javascript` in-process (easier on Bun — imports .ts directly), `python` subprocess, `code-grader` stays the subprocess power tool for workspace-cwd / arbitrary-language / isolation cases. Do not desugar `javascript` into `code-grader` (loses in-process speed). Resolves old §8.3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Hard churn confirmed OK (not yet in production) — clean break, no aliases.
- Simplify: collapse tests[].input into prompts + vars (one way to express
what's sent to the target); input_files stays as prompt-content.
- Runner: laptop-first, no DB store. Instance expansion + pass@k + infra-only
retry (margin ideas), but a simple worker pool; resumability via index.jsonl
(--rerun-failed); workspace = shared or reset-based pool, not per-instance
containers.
- nunjucks confirmed to meet inline text+var mixing; {{var}} for vars,
${ENV} for env.
- Redteam + unimplemented exotic assertions: treated as unrecognized fields
(future scope), not stubbed. `similar` in (needs embeddings provider).
- test_id: layered identity — content-hash (content), author tag/metadata
(governance/trend, Dashboard keys on this), description/vars (display).
- LLM judge type name = `llm-rubric`; default grading adopts skeptical
evidence-by-path judge (opt-out via explicit prompt); documented what changes.
- New §9: PR #1592 is the extensions/workspace implementation slice — reasonable,
mergeable after 4 amendments (chiefly vars.workspace + built-in agentv: scheme).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation, vars.workspace, built-in scheme) Adds an Amendments section overriding the original where they conflict, per the wider promptfoo-superset restructure (PR #1594): - A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a reset-based workspace pool; remove the isolation config knob. - A2: per-case workspace spec lives in dataset vars.workspace. - A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://. - A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]). - A5: ADR 0014 must note the incoming superseding ADR (assert / input removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#1592 amended - §5.3: llm-rubric reuses existing EvaluationScore grading.json shape (score + verdict + assertions[{text,passed,evidence}]); risk is mapping the judge verdict to per-criterion assertions+evidence, keeping evidence in grading.json, and score/verdict consistency. No new grading contract. - §9: record that PR #1592 was amended to the agreed model (A1-A5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tension -> agent_rules
- §5.3: grading.json originates from agentskills (assertion_results[{text,passed,
evidence}] + summary counts; no overall verdict). Per-assertion already matches
agentskills exactly — keep. Overall stays a STRING verdict('pass'|'fail'|'skip')
+ fractional score (AgentV superset; boolean can't express skip/fractional).
Align array key -> assertion_results and add summary counts. llm-rubric maps
verdict->per-criterion assertion_results.
- §9: rename agentv:skills -> agentv:agent_rules (stages skills + hooks + agents +
rules), package agent-rules, context agent_rules_paths.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nscript/metrics); cleanup inventory; agent-rules kebab - §6.0: canonical output = split bundle, queryable-on-filesystem, no DB. Best of each: margin-lab rich jq-queryable summary.json (Summary shape) + index.jsonl; vercel two-layer transcript + tool_name enum + inlined transcript_summary for metrics; agentskills grading.json; promptfoo EvaluateSummaryV3 export only. - §10: verified dead-code/back-compat cleanup inventory (group A delete-now, group B removed-by-restructure, group C dedup). Hard deprecation, no aliases. - Naming: identifier tokens are kebab (agentv:agent-rules), data fields snake (agent_rules_paths). Fixed agent_rules -> agent-rules for the scheme. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ternal/ bundle layout (§6.0.1) - index stays JSONL (append/stream/query; enables --rerun-failed), not index.json. - Fix drift: file index.jsonl is referenced as manifest_path today; rename -> index_path. Reserve "manifest"/bundle.json for the frozen-config file only; summary.json is the queryable aggregate. - Adopt margin-lab's internal/ folder, dot-prefixed as .internal/ (AgentV's "dot = skip discovery" convention): index.jsonl/progress.json/events.jsonl/ bundle.json move under .internal/; run root stays clean (summary.json + per-case dirs). Cross-run .indexes/.cache at results root are a separate scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce of truth); keep index.jsonl name - No first-class promptfoo EvaluateSummaryV3 / single-file export — the split bundle is the single source of truth (YAGNI). A promptfoo-shaped file can be generated on demand if ever needed; not shipped/maintained. named_scores/ derived_metrics still live inside the split rows (Dashboard), not a file. - Naming: keep index.jsonl (JSONL, not index.json). Reject manifest.jsonl (reserved for bundle.json/run manifest). rows.jsonl is the only fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…split duplication) Prior discussion: ADR-0011 & 0012 both defined metrics.json AND timing.json as separate per-attempt sidecars; both are written today. But timing.json (TimingArtifact) already carries total_tokens/cost_usd/token_usage/usage_sources (perf metrics, not just timing), overlapping metrics.json (trace execution metrics). No reference splits them (agentskills folds tokens+duration into one file; vercel/margin keep one per-attempt blob). Decision: one metrics.json with sections (duration/tokens/cost always; execution/trajectory when a trace exists); drop timing.json + timing_path, keep single metrics_path. Output-contract ADR supersedes the 0011/0012 split. Added to §6.0.1 and §10 Group C. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rgets` instead AgentV already uses provider + apiId (provider + sub-provider) as the backend vocabulary, so a top-level `providers`-as-SUT alias would overload the term. Canonical SUT key is `targets` only; the promptfoo importer rewrites top-level `providers:` -> `targets:` (mechanical, alongside camel->snake). Superset holds via the importer, not a live alias; provider/apiId/sub-provider stay unambiguously "backend." Updated §0/§1.1/§2.a. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r); multi-turn provenance + keep evaluation layer - §0/§1.1/§2.a: superset is a design property, not a shipped importer. New evals authored in snake_case directly. providers->targets + camel->snake is a one-off conversion/codemod, not a runtime alias. Distinct from the hard-deprecation codemod (which IS built, migrates AgentV's own files). - §3/§2.i: multi-turn has real provenance (agentv#1053, researched in agentevals/agentevals-research/research/findings/multiturn-conversation-eval against inspect-ai, google-adk, ragas). on_turn_failure <- inspect-ai state.completed; per-conversation aggregation is a deliberate gap-fill (inspect/ ragas/promptfoo aggregate only across epochs, not within a conversation). Split execution (promptfoo _conversation) from evaluation (keep AgentV's per-turn assertions + aggregation + on_turn_failure). Drop only window_size (no pedigree). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…size→_conversation migration - New ADR-0015: split conversation execution (promptfoo _conversation/sessions) from evaluation (keep per-turn assertions + cross-turn aggregation + on_turn_failure, provenance inspect-ai/google-adk/agentv#1053). Drop window_size. Includes the concrete window_size -> _conversation template mapping (loop.revindex <= N; system outside the loop) for the codemod. - Plan §3: window_size rationale = redundant in the _conversation model (author windows in the template), points to ADR-0015. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Materialization (resolver + git-mirror cache) is always-on and makes even fresh-per-case fast — clone speed is decoupled from reuse. Pooling = reuse + quick-reset between cases, a perf optimization mainly for local evals (amortize expensive setup); CI prefers fresh-per-case + mirror cache. Three isolation levels: shared / pooled / fresh-per-case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rkspace resolver adapters Layered gate: deterministic CI (merge gate) + live PR-679 CargoWise dogfood (real provider+grader, blocking) + optional promptfoo-parity diff. Workspace acquisition = two resolver adapters unified under 'agentv workspace deps': local git-mirror (~/projects/WiseTechGlobal/CargoWise, 7.3G, dev/offline) and a snapshot-download adapter (port WTG download-release-deps.ts: per-year .git tarballs from a release, shared checkout+symlink, LFS skip) for CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… / partial+sparse+shallow / snapshot-download Not 'download vs shallow clone': ordered fastest-first — (1) local mirror via alternates/reference (zero transfer, dev), (2) direct partial+sparse+shallow clone (CI default: --filter=blob:none --sparse --depth 1 + workspace.repos.sparse; less transfer than a per-year .git tarball, no producer), (3) snapshot-download adapter (CI fallback: many-commit amortization or sha-fetch/LFS blockers). Resolver selects per environment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…m acquisition (SWE-bench/margin/Harbor) Per AgentV research repo-provisioning-schema-design.md: eval declares provenance only (repo/commit/sparse/ancestor); acquisition is a harness resolver in machine config, keyed on repo, ordered backends: local-auto-adopt(--reference) -> mirror-cache(--reference) -> snapshot -> remote -> (future) docker-image (SWE-bench/margin, same identity key). --reference gives shallow-speed + full history, retiring depth/filter. Remove tangled per-repo type/resolve/depth/filter/ resolver fields. Supersedes the §11 acquisition-perf framing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…upersede 0013; fold Inspect AI
- ADR-0016: promptfoo-superset authoring contract (assert/llm-rubric/targets/prompts+
vars/nunjucks+${ENV}/hard-deprecation/metric/test_id identity/gate/repeat/workspace
provenance). Supersedes both 0013 ADRs (marked in their Status).
- ADR-0017: output/artifact contract (best-of-each split bundle, margin queryable
summary + index.jsonl, vercel transcript/metrics, agentskills grading + verdict/score,
.internal/, index_path, timing->metrics merge, pass@k Build) + workspace resolver
(provenance vs acquisition, ordered backends, --reference workhorse, future docker-image
= native SWE-bench). Includes FAIL_TO_PASS/PASS_TO_PASS note (code-grader, no new primitive).
- §11.1: fold Inspect AI (4th confirmation of provenance-vs-acquisition; image/build/x-local).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…op the SDK recipe suggestion At run time the two lists collapse to 'run these named tests; pass iff all pass' — too domain-specific for a primitive and needs no dedicated recipe; a workspace-cwd code-grader (exit code = verdict) covers it. Lists are data the command consumes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ENV}); avoids clobbering runtime shell ${VAR}
Env vars use nunjucks {{ env.VAR }} rendered at config-load time (promptfoo-native,
load.ts:336), not a ${ENV} sigil. One templating engine; phase-separated by render
pass + env namespace. Key correctness win: {{ env.VAR }} does not collide with
runtime shell ${VAR} in CLI-target commands (those pass through to the shell).
Codemod ${{ X }} -> {{ env.X }}. Updated §2.f, ADR-0016 pt7, beads .4/.15.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…clarative workspace.repos field, not an extension - Mark ADR-0015/0016/0017 Accepted. Numbering: 0014(extensions,#1592) / 0015(multi-turn) / 0016(authoring) / 0017(output+resolver) — no collision. - Repo provisioning is a declarative workspace.repos field the harness materializes BEFORE hooks (ADR-0016 pt10, ADR-0017): reverses the workspace-as-extension direction (all 4 benchmark frameworks treat provisioning as harness-core; promptfoo has no workspace to align with). Extensions only for non-provisioning setup (agent-rules, custom hooks). isolation is a workspace field. Kept: workspace.repos/isolation/docker/ template; removed workspace.hooks -> extensions. - Plan §2.l/§3/§0 updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e hatch); correct over-absolute 'not an extension' Provenance stays a declarative field and the common case runs before hooks (unchanged), but acquisition is extensible per the plugins-over-builtins guardrail: register a custom acquisition backend (recommended) or use a beforeAll escape hatch; the built-in acquisition may itself be a swappable plugin. Invariants unchanged (declarative provenance, acquisition-before-hooks, built-ins ship). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… add field-vs-resolver orthogonality framing - ADR-0017: locked workspace schema (repos: path/repo/commit(SHA, base_commit alias)/ sparse/ancestor; isolation fresh|pooled|shared; template; docker). Name 'workspace' chosen for durability (CI GITHUB_WORKSPACE/margin/git; not sandbox/environment/testbed). Never in schema: acquisition (harness config) + hooks (extensions) — that's what keeps it durable. Added field(what)-vs-resolver(how) orthogonality + package.json analogy. - Plan §2.l: replaced stale vars.workspace/extension example with the locked field shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…B; adopt provenance field; audit/anti-reward-hacking future scope exploitbench confirms: split filesystem run-tree = source of truth, SQLite is a derived rebuildable view (import/export bijection) not required, image pinned by sha256 digest, config_snapshot=bundle.json. Borrow: (1) provenance field on result rows (native/mock/replay/imported_*) — adopt; (2) eval-integrity/anti-reward-hacking (read-only grader container, audit re-grade + red-flag scan + model-identity check) — future scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n; categorize by suite AND tags/experiment Confirms ADR-0009/0012: one <run_id> bundle per CLI invocation across all suite YAMLs (never per-suite timestamp folders); runtime_source.kind=multi_eval. Identity = eval_path+test_id (uuid-suffixed dir) so overlapping case IDs across suites don't collide; suite/name are display/grouping metadata not routing. Categorize by BOTH axes on each index row: suite (structural) + tags/experiment (semantic/campaign); experiment = run bucket, suite = intra-run group. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…un-N -> sample-N; experiment is a tag not a bucket - Cross-run index: rebuildable .agentv/results/.indexes/runs.jsonl (one row per run), derived from */summary.json, not source of truth; per-run index.jsonl stays. JSONL not index.json. - Repeat folder run-N -> sample-N (margin/pass@k; de-conflicts with run_id). sample_index=repeats, retry_index=infra retries. - experiment = reserved-by-convention tag (Dashboard default compare key), NOT a bucket/field/storage path — continues ADR-0006/0009/0013 demotion. One grouping mechanism (tags); experiment/suite are conventional keys. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ser-set default grouping key Not even a default compare key. Tag keys sort alphabetically; the default cross-run grouping/compare key is a user preference (any tag), AgentV blesses none. --experiment X = sugar for --tag experiment=X. Completes the experiment demotion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…me carry-forward, not a general bijective import margin can rehydrate a run's completed-work from its run-dir for RESUME (LoadProgressSnapshot + loadSavedResumeBundle + carryForwardLocalCases) but has no general import that rebuilds the multi-run query DB from files (memory store ephemeral, Postgres persists independently). AgentV follows exploitbench's model (fs = source of truth, .indexes/*.jsonl derived/rebuildable; --rerun-failed reads index.jsonl, no store to rehydrate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Progress update on Group A cleanup / grader artifact wording:
GitHub CI is running on head |
|
Progress update on Group A alias cleanup:
Local verification passed: GitHub CI is green on head |
|
Progress checkpoint on head
Local verification:
CI is running for head |
|
CI follow-up: head |
|
Progress checkpoint on head
Local verification:
CI is running for head |
|
CI follow-up: head |
|
Follow-up clarification pushed in cee6a8d.\n\n- Tightened ADR-0016 and the restructure plan to state that artifact assertion rows are the generic AgentV grader contract, not a g-eval-specific behavior.\n- Clarified the terminology: every grader returns assertions[]; deterministic graders usually emit one row; multi-aspect graders emit one row per authored check/result unit; structured g-eval emits one row per criterion because criteria are one such multi-aspect unit.\n- Verification: git diff --check.\n- GitHub checks are green on cee6a8d: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages. |
|
Progress update for Group A cleanup: pushed d74a7da.\n\n- Removed target-level log_format/log_output_format alias support and now reject those fields with stream_log guidance.\n- Kept canonical stream_log and wired it into Copilot/Pi/Claude logger format resolution; stream_log: raw maps to JSON event logs, stream_log: summary maps to summary logs, and stream_log: false disables those stream loggers.\n- Updated target validation so provider-specific legacy log_format settings no longer pass as known settings.\n- Local verification: bun test packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts; bun --filter @agentv/core typecheck; bun --filter @agentv/core build; bun run lint; bun run typecheck; git diff --check.\n- GitHub checks are green on d74a7da: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages. |
|
Progress update for av-kfik.3: pushed What changed:
Local verification:
GitHub CI on head |
|
Small follow-up for av-kfik.3: pushed This just aligns the runtime Azure target error with the validator/docs wording: Azure Local verification:
GitHub CI on head |
|
Live dogfood update for av-kfik.3 at head
Private evidence pushed to |
|
av-kfik.2 schema/validation slice pushed in 8b261b2.\n\nChanges:\n- Zod eval schema accepts snake_cased promptfoo-shaped prompts/targets/default_test/assert/scenarios/derived_metrics/output_path/env/nunjucks_filters/extensions/evaluate_options fields.\n- Tests can omit id and use prompts/provider_output; canonical assert is accepted at suite/default_test/test levels.\n- Validator keeps top-level providers as an unknown-field warning, not a live alias for targets.\n- Regenerated skills-data/agentv-eval-writer/references/eval.schema.json.\n\nVerification:\n- bun test packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts\n- bun --filter @agentv/core typecheck\n- bun --filter @agentv/core lint\n- bun run lint\n- git diff --check\n\nBoundary: runtime parser/execution for promptfoo assert/g-eval remains in av-kfik.7+ follow-up work; this commit is schema/validation only. |
|
av-kfik.1 ADR polish pushed in 916eccd.\n\nChange: removed the contradictory duplicate standalone |
* docs(plans): add promptfoo-compatible extensions plan Entire-Checkpoint: dd08c8dc0d47 * docs(plans): amend extensions plan to agreed model (hook-derived isolation, vars.workspace, built-in scheme) Adds an Amendments section overriding the original where they conflict, per the wider promptfoo-superset restructure (PR #1594): - A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a reset-based workspace pool; remove the isolation config knob. - A2: per-case workspace spec lives in dataset vars.workspace. - A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://. - A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]). - A5: ADR 0014 must note the incoming superseding ADR (assert / input removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plans): rename skills extension -> agent_rules; clarify grading contract origin - A6: the staging extension covers skills + hooks + subagents + rules, so rename to agentv:agent_rules (package agent-rules, context agent_rules_paths). skills is one kind of agent rule, not the extension name. - A4: note grading.json originates from agentskills (assertion_results + summary); AgentV adds string verdict (pass/fail/skip) + fractional score as a superset. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plans): use kebab agent-rules for the scheme identifier (agent_rules_paths field stays snake_case) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plans): amend A7 — repo provisioning is a declarative field, not an extension (per ADR-0016/0017) Narrows this PR: do NOT move repo materialization into an extension. Repo acquisition stays harness-core (declarative workspace.repos field + resolver, materialized before hooks). Extensions cover only non-provisioning setup (agent-rules, custom hooks). Names the superseding ADRs (0016 authoring, 0017 output/resolver) referenced in A5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style(plans): remove trailing whitespace from promptfoo plan --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Draft design plan (no code changes) for re-aligning AgentV's eval authoring format with promptfoo — keeping
snake_case— and borrowing runner/analytics from Margin-Lab/evals and transcripts/agentic-graders from vercel-labs/agent-eval.Plan doc:
docs/plans/promptfoo-aligned-eval-restructure.mdNorth-star
AgentV's eval contract is a strict superset of (snake_cased) promptfoo: any promptfoo config, mechanically snake_cased, is a valid AgentV eval with equivalent semantics; AgentV accepts more on top (bare-string asserts,
workspace,gate, agentic judges, multi-turn, …).Governing decisions (§2, resolved)
gate> scalar threshold;repeat: {count, strategy, early_exit}>repeat: int; workspace/repo materialization; agentic judge).assertions,composite,eval_cases, gradername-as-metric,${{ ENV }}) hard-error; one-shot codemod migrates.targetsre-canonicalized as first-class system-under-test. Verified against promptfoo source + history: promptfootargetsis a 2024-05 redteam alias for the canonicalproviders. AgentV elevatestargets(fits agent-eval) and demotesprovider/apiIdto the backend-kind vocabulary. It does not accept a top-levelprovidersalias (that would overload AgentV's existing backendprovider/sub-provider term); the promptfoo importer rewrites top-levelproviders:→targets:instead.{{ var }}(nunjucks, eval-time) +${ENV}(docker/k8s style, config-time, replaces${{ ENV }}) — no sigil collision.id: split stable derivedtest_id(identity/filter/compare) from display label (id→description→ vars →Test #n).assertshorthand kept (better semantics: N criteria → one judge call), desugars to a batchedllm-rubric. One-way compat (promptfoo ⊆ AgentV) is the design.llm-rubric(folds AgentV'srubrics+ agenticllm-graderin as optional fields).Borrows
Build().tool_nameenum + inlined summary), evidence-by-path agentic judge, judge pinning.Open (implementation sequencing, non-blocking — §8)
redteam:key to preserve superset parsing vs hard carve-out.javascript/pythondesugar tocode-gradervs distinct types.Review asks
🤖 Generated with Claude Code
Implementation hand-off (beads)
Tracked under epic
av-kfik. Workers have no context of the design conversation — readdocs/plans/promptfoo-aligned-eval-restructure.md(this PR) +docs/adr/0015first; each bead is self-contained and cites its plan section.Ready now (no blockers):
av-kfik.1(P0) — Write authoring-contract + output-contract ADRs (anchor)av-kfik.3(P1) — Group A: delete deprecated back-compat aliases (independent, parallel)Unlocked after the schema anchor (
av-kfik.2←.1):av-kfik.2(P0) — Snake_cased promptfoo eval schema (Zod)av-kfik.4— Templating: nunjucks vars +${ENV}config env ←.2av-kfik.5— prompts × targets matrix + instance expansion ←.2av-kfik.6— Re-canonicalize targets; provider=backend; registry promptfoo shape ←.2av-kfik.7— Assertion vocabulary +llm-rubricconsolidation + grader execution ←.2av-kfik.8— Two-layer transcript + canonicaltool_nameenum + summary ←.2av-kfik.9— Datasets:file://loading +__expectedDSL ←.2av-kfik.10— Runner: worker pool + reset workspace pool +--rerun-failed←.5av-kfik.11— Grading contract:assertion_results/summary/verdict/score+ agentic judge ←.7av-kfik.12— Output/artifact contract: queryablesummary.json+.internal/+ timing→metrics merge ←.1,.5av-kfik.13— Multi-turn: split execution/evaluation; dropwindow_size(ADR-0015) ←.4,.7av-kfik.14— Extensions/workspace/agent-rules slice (PR docs(plans): add promptfoo-compatible extensions plan #1592 + amendments A1–A6) ←.2,.4av-kfik.15— Hard-deprecation codemod for existing eval files ←.2,.6,.7,.13,.12av-kfik.16— Docs + examples + live provider/grader dogfood ←.10,.11,.12,.14Use
bd readyto find unblocked work;bd show <id>for the full task spec + acceptance.