docs(plan): eval authoring restructure — promptfoo superset (DRAFT) by christso · Pull Request #1594 · EntityProcess/agentv

christso · 2026-07-01T22:47:54Z

Summary

Draft design plan (no code changes) for re-aligning AgentV's eval authoring format with promptfoo — keeping snake_case — and borrowing runner/analytics from Margin-Lab/evals and transcripts/agentic-graders from vercel-labs/agent-eval.

Plan doc: docs/plans/promptfoo-aligned-eval-restructure.md

North-star

AgentV's eval contract is a strict superset of (snake_cased) promptfoo: any promptfoo config, mechanically snake_cased, is a valid AgentV eval with equivalent semantics; AgentV accepts more on top (bare-string asserts, workspace, gate, agentic judges, multi-turn, …).

Governing decisions (§2, resolved)

Prefer promptfoo naming where functionally equivalent; keep AgentV only where semantics are better (gate > scalar threshold; repeat: {count, strategy, early_exit} > repeat: int; workspace/repo materialization; agentic judge).
Hard deprecation (major version): removed keys (assertions, composite, eval_cases, grader name-as-metric, ${{ ENV }}) hard-error; one-shot codemod migrates.
targets re-canonicalized as first-class system-under-test. Verified against promptfoo source + history: promptfoo targets is a 2024-05 redteam alias for the canonical providers. AgentV elevates targets (fits agent-eval) and demotes provider/apiId to the backend-kind vocabulary. It does not accept a top-level providers alias (that would overload AgentV's existing backend provider/sub-provider term); the promptfoo importer rewrites top-level providers: → targets: instead.
Templating: {{ var }} (nunjucks, eval-time) + ${ENV} (docker/k8s style, config-time, replaces ${{ ENV }}) — no sigil collision.
Optional test id: split stable derived test_id (identity/filter/compare) from display label (id → description → vars → Test #n).
Bare-string assert shorthand kept (better semantics: N criteria → one judge call), desugars to a batched llm-rubric. One-way compat (promptfoo ⊆ AgentV) is the design.
LLM-judge types consolidated into one promptfoo-named llm-rubric (folds AgentV's rubrics + agentic llm-grader in as optional fields).
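The display-label fallback in the optional test id decision (id → description → vars → Test #n) can be sketched as below. This is a minimal illustration, not the plan's implementation: the `TestCase` shape, the `displayLabel` name, and rendering `vars` as compact JSON are all assumptions.

```typescript
// Hypothetical shape: only the fields the label fallback looks at.
interface TestCase {
  id?: string;
  description?: string;
  vars?: Record<string, unknown>;
}

// Resolve a display label in the order id -> description -> vars -> "Test #n".
// `index` is the zero-based position of the test in the suite.
function displayLabel(test: TestCase, index: number): string {
  if (test.id) return test.id;
  if (test.description) return test.description;
  if (test.vars && Object.keys(test.vars).length > 0) {
    // Assumption: vars are shown as compact JSON; the plan does not fix a format.
    return JSON.stringify(test.vars);
  }
  return `Test #${index + 1}`;
}
```

The label is display-only; the stable derived test_id used for identity, filtering, and comparison is a separate value and is not computed here.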

Borrows

margin-lab: instance-as-scheduling-unit, lease/heartbeat worker pool, infra-only retry, pass@k analytics via one pure Build().
vercel-agent-eval: two-layer transcript (raw + normalized canonical tool_name enum + inlined summary), evidence-by-path agentic judge, judge pinning.

Open (implementation sequencing, non-blocking — §8)

Which exotic assertion types ship working vs accepted-but-stubbed at launch.
Redteam: accept-and-ignore the redteam: key to preserve superset parsing vs hard carve-out.
javascript/python desugar to code-grader vs distinct types.

Review asks

Confirm the §2 contract decisions.
Weigh in on the three §8 sequencing calls (leanings noted in the doc).

🤖 Generated with Claude Code

Implementation hand-off (beads)

Tracked under epic av-kfik. Workers have no context of the design conversation — read docs/plans/promptfoo-aligned-eval-restructure.md (this PR) + docs/adr/0015 first; each bead is self-contained and cites its plan section.

Ready now (no blockers):

av-kfik.1 (P0) — Write authoring-contract + output-contract ADRs (anchor)
av-kfik.3 (P1) — Group A: delete deprecated back-compat aliases (independent, parallel)

Unlocked after the schema anchor (av-kfik.2 ← .1):

av-kfik.2 (P0) — Snake_cased promptfoo eval schema (Zod)
av-kfik.4 — Templating: nunjucks vars + ${ENV} config env ← .2
av-kfik.5 — prompts × targets matrix + instance expansion ← .2
av-kfik.6 — Re-canonicalize targets; provider=backend; registry promptfoo shape ← .2
av-kfik.7 — Assertion vocabulary + llm-rubric consolidation + grader execution ← .2
av-kfik.8 — Two-layer transcript + canonical tool_name enum + summary ← .2
av-kfik.9 — Datasets: file:// loading + __expected DSL ← .2
av-kfik.10 — Runner: worker pool + reset workspace pool + --rerun-failed ← .5
av-kfik.11 — Grading contract: assertion_results/summary/verdict/score + agentic judge ← .7
av-kfik.12 — Output/artifact contract: queryable summary.json + .internal/ + timing→metrics merge ← .1, .5
av-kfik.13 — Multi-turn: split execution/evaluation; drop window_size (ADR-0015) ← .4, .7
av-kfik.14 — Extensions/workspace/agent-rules slice (PR docs(plans): add promptfoo-compatible extensions plan #1592 + amendments A1–A6) ← .2, .4
av-kfik.15 — Hard-deprecation codemod for existing eval files ← .2, .6, .7, .13, .12
av-kfik.16 — Docs + examples + live provider/grader dogfood ← .10, .11, .12, .14

Use bd ready to find unblocked work; bd show <id> for the full task spec + acceptance.
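To make the arrows in the list above concrete, here is a small sketch of what "unblocked" means for this epic. It is not how bd computes readiness; the `blockedBy` table is transcribed from the list (suffix only, so "2" means av-kfik.2) and `readyBeads` is a hypothetical helper.

```typescript
// Blockers per bead, transcribed from the hand-off list.
const blockedBy: Record<string, string[]> = {
  "1": [], "2": ["1"], "3": [],
  "4": ["2"], "5": ["2"], "6": ["2"], "7": ["2"], "8": ["2"], "9": ["2"],
  "10": ["5"], "11": ["7"], "12": ["1", "5"], "13": ["4", "7"], "14": ["2", "4"],
  "15": ["2", "6", "7", "13", "12"], "16": ["10", "11", "12", "14"],
};

// A bead is ready when it is not yet done and every one of its blockers is done.
function readyBeads(done: Set<string>): string[] {
  return Object.entries(blockedBy)
    .filter(([id, deps]) => !done.has(id) && deps.every((dep) => done.has(dep)))
    .map(([id]) => id);
}
```

With nothing completed this yields beads 1 and 3, matching the "Ready now" group; completing bead 1 unlocks bead 2, which in turn unlocks beads 4 through 9.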

…uperset Draft design plan (for review, no code changes) to re-align AgentV's eval authoring format with promptfoo, keeping snake_case, and borrowing runner/analytics from Margin-Lab/evals and transcripts/agentic-graders from vercel-labs/agent-eval. Governing decisions captured:
- AgentV eval contract is a strict SUPERSET of (snake_cased) promptfoo.
- Prefer promptfoo naming where equivalent; keep AgentV only where semantics are better (gate, repeat block, workspace, agentic judge).
- Hard deprecation (major version): removed keys error, codemod migrates.
- `targets` re-canonicalized as first-class SUT (verified: promptfoo `targets` is a 2024-05 redteam alias for the canonical `providers`); `provider` demoted to backend-kind field; `providers` accepted as input alias.
- `{{ var }}` nunjucks for eval-time vars; `${ENV}` (docker/k8s style) for config-time env, replacing `${{ ENV }}`.
- Optional test `id`: split stable derived `test_id` from display label.
- Bare-string `assert` shorthand kept, desugars to batched `llm-rubric`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-07-01T22:49:02Z

Deploying agentv with Cloudflare Pages

Latest commit:	`0bff279`
Status:	✅ Deploy successful!
Preview URL:	https://2020d8e0.agentv.pages.dev
Branch Preview URL:	https://plan-promptfoo-aligned-eval.agentv.pages.dev

…gn lifecycle with promptfoo
- No new top-level `workspace:` block. Repo/fixture spec rides as dataset `vars` (file://-loadable), consumed by a built-in, auto-registered, overridable `agentv:workspace` extension. Matches vercel (fixture=case) and margin (image=case): workspace is part of the dataset.
- Single lifecycle surface = promptfoo `extensions` (beforeAll/afterAll/beforeEach/afterEach). Hard-remove `on_run_complete` (= afterAll), `preprocessors`, and `workspace.hooks`.
- Isolation = the hook name in the extension reference (verified promptfoo mechanism, evaluatorHelpers.ts:633 EXTENSION_HOOK_NAMES): `agentv:workspace:beforeAll` = shared, `:beforeEach` = per-case.
- Built-in extension internally borrows margin (git/docker materialization + mirror cache) and vercel (per-case fixture copy, path handed to target), validates the vars.workspace shape, writes materialized path back to vars.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…bprocess, code-grader power tool Verified against promptfoo source: javascript assertions run in-process (new Function / dynamic import), only python shells out (PythonShell). AgentV matches: `javascript` in-process (easier on Bun — imports .ts directly), `python` subprocess, `code-grader` stays the subprocess power tool for workspace-cwd / arbitrary-language / isolation cases. Do not desugar `javascript` into `code-grader` (loses in-process speed). Resolves old §8.3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Hard churn confirmed OK (not yet in production) — clean break, no aliases.
- Simplify: collapse tests[].input into prompts + vars (one way to express what's sent to the target); input_files stays as prompt-content.
- Runner: laptop-first, no DB store. Instance expansion + pass@k + infra-only retry (margin ideas), but a simple worker pool; resumability via index.jsonl (--rerun-failed); workspace = shared or reset-based pool, not per-instance containers.
- nunjucks confirmed to meet inline text+var mixing; {{var}} for vars, ${ENV} for env.
- Redteam + unimplemented exotic assertions: treated as unrecognized fields (future scope), not stubbed. `similar` in (needs embeddings provider).
- test_id: layered identity — content-hash (content), author tag/metadata (governance/trend, Dashboard keys on this), description/vars (display).
- LLM judge type name = `llm-rubric`; default grading adopts skeptical evidence-by-path judge (opt-out via explicit prompt); documented what changes.
- New §9: PR #1592 is the extensions/workspace implementation slice — reasonable, mergeable after 4 amendments (chiefly vars.workspace + built-in agentv: scheme).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ation, vars.workspace, built-in scheme) Adds an Amendments section overriding the original where they conflict, per the wider promptfoo-superset restructure (PR #1594):
- A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a reset-based workspace pool; remove the isolation config knob.
- A2: per-case workspace spec lives in dataset vars.workspace.
- A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://.
- A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]).
- A5: ADR 0014 must note the incoming superseding ADR (assert / input removal).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…#1592 amended
- §5.3: llm-rubric reuses existing EvaluationScore grading.json shape (score + verdict + assertions[{text,passed,evidence}]); risk is mapping the judge verdict to per-criterion assertions+evidence, keeping evidence in grading.json, and score/verdict consistency. No new grading contract.
- §9: record that PR #1592 was amended to the agreed model (A1-A5).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tension -> agent_rules
- §5.3: grading.json originates from agentskills (assertion_results[{text,passed,evidence}] + summary counts; no overall verdict). Per-assertion already matches agentskills exactly — keep. Overall stays a STRING verdict('pass'|'fail'|'skip') + fractional score (AgentV superset; boolean can't express skip/fractional). Align array key -> assertion_results and add summary counts. llm-rubric maps verdict->per-criterion assertion_results.
- §9: rename agentv:skills -> agentv:agent_rules (stages skills + hooks + agents + rules), package agent-rules, context agent_rules_paths.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nscript/metrics); cleanup inventory; agent-rules kebab
- §6.0: canonical output = split bundle, queryable-on-filesystem, no DB. Best of each: margin-lab rich jq-queryable summary.json (Summary shape) + index.jsonl; vercel two-layer transcript + tool_name enum + inlined transcript_summary for metrics; agentskills grading.json; promptfoo EvaluateSummaryV3 export only.
- §10: verified dead-code/back-compat cleanup inventory (group A delete-now, group B removed-by-restructure, group C dedup). Hard deprecation, no aliases.
- Naming: identifier tokens are kebab (agentv:agent-rules), data fields snake (agent_rules_paths). Fixed agent_rules -> agent-rules for the scheme.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ternal/ bundle layout (§6.0.1)
- index stays JSONL (append/stream/query; enables --rerun-failed), not index.json.
- Fix drift: file index.jsonl is referenced as manifest_path today; rename -> index_path. Reserve "manifest"/bundle.json for the frozen-config file only; summary.json is the queryable aggregate.
- Adopt margin-lab's internal/ folder, dot-prefixed as .internal/ (AgentV's "dot = skip discovery" convention): index.jsonl/progress.json/events.jsonl/bundle.json move under .internal/; run root stays clean (summary.json + per-case dirs). Cross-run .indexes/.cache at results root are a separate scope.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ce of truth); keep index.jsonl name
- No first-class promptfoo EvaluateSummaryV3 / single-file export — the split bundle is the single source of truth (YAGNI). A promptfoo-shaped file can be generated on demand if ever needed; not shipped/maintained. named_scores/derived_metrics still live inside the split rows (Dashboard), not a file.
- Naming: keep index.jsonl (JSONL, not index.json). Reject manifest.jsonl (reserved for bundle.json/run manifest). rows.jsonl is the only fallback.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…split duplication) Prior discussion: ADR-0011 & 0012 both defined metrics.json AND timing.json as separate per-attempt sidecars; both are written today. But timing.json (TimingArtifact) already carries total_tokens/cost_usd/token_usage/usage_sources (perf metrics, not just timing), overlapping metrics.json (trace execution metrics). No reference splits them (agentskills folds tokens+duration into one file; vercel/margin keep one per-attempt blob). Decision: one metrics.json with sections (duration/tokens/cost always; execution/trajectory when a trace exists); drop timing.json + timing_path, keep single metrics_path. Output-contract ADR supersedes the 0011/0012 split. Added to §6.0.1 and §10 Group C. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rgets` instead AgentV already uses provider + apiId (provider + sub-provider) as the backend vocabulary, so a top-level `providers`-as-SUT alias would overload the term. Canonical SUT key is `targets` only; the promptfoo importer rewrites top-level `providers:` -> `targets:` (mechanical, alongside camel->snake). Superset holds via the importer, not a live alias; provider/apiId/sub-provider stay unambiguously "backend." Updated §0/§1.1/§2.a. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…r); multi-turn provenance + keep evaluation layer
- §0/§1.1/§2.a: superset is a design property, not a shipped importer. New evals authored in snake_case directly. providers->targets + camel->snake is a one-off conversion/codemod, not a runtime alias. Distinct from the hard-deprecation codemod (which IS built, migrates AgentV's own files).
- §3/§2.i: multi-turn has real provenance (agentv#1053, researched in agentevals/agentevals-research/research/findings/multiturn-conversation-eval against inspect-ai, google-adk, ragas). on_turn_failure <- inspect-ai state.completed; per-conversation aggregation is a deliberate gap-fill (inspect/ragas/promptfoo aggregate only across epochs, not within a conversation). Split execution (promptfoo _conversation) from evaluation (keep AgentV's per-turn assertions + aggregation + on_turn_failure). Drop only window_size (no pedigree).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…size→_conversation migration
- New ADR-0015: split conversation execution (promptfoo _conversation/sessions) from evaluation (keep per-turn assertions + cross-turn aggregation + on_turn_failure, provenance inspect-ai/google-adk/agentv#1053). Drop window_size. Includes the concrete window_size -> _conversation template mapping (loop.revindex <= N; system outside the loop) for the codemod.
- Plan §3: window_size rationale = redundant in the _conversation model (author windows in the template), points to ADR-0015.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
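The window_size -> _conversation mapping keeps the last N turns by guarding the template loop with loop.revindex <= N, where revindex counts down to 1 on the final iteration, and emits the system message outside the loop. The sketch below reproduces that selection in plain TypeScript rather than in a nunjucks template; the `Turn` shape, the function names, and the "User:/Assistant:" rendering are assumptions for illustration only.

```typescript
// Hypothetical turn shape; real _conversation entries may carry more fields.
interface Turn {
  input: string;
  output: string;
}

// Keep only the last `n` turns. For zero-based index i in a list of length L,
// the template's loop.revindex equals L - i, so the guard is (L - i) <= n.
function windowTurns(conversation: Turn[], n: number): Turn[] {
  return conversation.filter((_, i) => conversation.length - i <= n);
}

// Render a prompt with the system message outside the windowed loop.
function renderPrompt(system: string, conversation: Turn[], n: number, question: string): string {
  const history = windowTurns(conversation, n).map(
    (t) => `User: ${t.input}\nAssistant: ${t.output}`,
  );
  return [system, ...history, `User: ${question}`].join("\n");
}
```

Because the system message never passes through the window, it survives regardless of N, which is the point of keeping it outside the loop.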

Materialization (resolver + git-mirror cache) is always-on and makes even fresh-per-case fast — clone speed is decoupled from reuse. Pooling = reuse + quick-reset between cases, a perf optimization mainly for local evals (amortize expensive setup); CI prefers fresh-per-case + mirror cache. Three isolation levels: shared / pooled / fresh-per-case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rkspace resolver adapters Layered gate: deterministic CI (merge gate) + live PR-679 CargoWise dogfood (real provider+grader, blocking) + optional promptfoo-parity diff. Workspace acquisition = two resolver adapters unified under 'agentv workspace deps': local git-mirror (~/projects/WiseTechGlobal/CargoWise, 7.3G, dev/offline) and a snapshot-download adapter (port WTG download-release-deps.ts: per-year .git tarballs from a release, shared checkout+symlink, LFS skip) for CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… / partial+sparse+shallow / snapshot-download Not 'download vs shallow clone': ordered fastest-first — (1) local mirror via alternates/reference (zero transfer, dev), (2) direct partial+sparse+shallow clone (CI default: --filter=blob:none --sparse --depth 1 + workspace.repos.sparse; less transfer than a per-year .git tarball, no producer), (3) snapshot-download adapter (CI fallback: many-commit amortization or sha-fetch/LFS blockers). Resolver selects per environment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…m acquisition (SWE-bench/margin/Harbor) Per AgentV research repo-provisioning-schema-design.md: eval declares provenance only (repo/commit/sparse/ancestor); acquisition is a harness resolver in machine config, keyed on repo, ordered backends: local-auto-adopt(--reference) -> mirror-cache(--reference) -> snapshot -> remote -> (future) docker-image (SWE-bench/margin, same identity key). --reference gives shallow-speed + full history, retiring depth/filter. Remove tangled per-repo type/resolve/depth/filter/resolver fields. Supersedes the §11 acquisition-perf framing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
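The ordered-backends idea (try each acquisition backend in turn, first success wins) might look roughly like the sketch below. The `AcquisitionBackend` interface and `resolveWorkspace` function are hypothetical names; the commit only fixes the ordering and that the resolver is keyed on repo.

```typescript
// Hypothetical backend contract: return a local path, or null if this
// backend cannot provide the repo.
interface AcquisitionBackend {
  name: string;
  acquire(repo: string): string | null;
}

// Walk the backends in their configured order and stop at the first hit.
function resolveWorkspace(
  repo: string,
  backends: AcquisitionBackend[],
): { backend: string; path: string } {
  for (const backend of backends) {
    const path = backend.acquire(repo);
    if (path !== null) return { backend: backend.name, path };
  }
  throw new Error(`no acquisition backend could provide ${repo}`);
}
```

The eval file itself would only name the repo and commit; which backend served the request stays a machine-config concern.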

…upersede 0013; fold Inspect AI
- ADR-0016: promptfoo-superset authoring contract (assert/llm-rubric/targets/prompts+vars/nunjucks+${ENV}/hard-deprecation/metric/test_id identity/gate/repeat/workspace provenance). Supersedes both 0013 ADRs (marked in their Status).
- ADR-0017: output/artifact contract (best-of-each split bundle, margin queryable summary + index.jsonl, vercel transcript/metrics, agentskills grading + verdict/score, .internal/, index_path, timing->metrics merge, pass@k Build) + workspace resolver (provenance vs acquisition, ordered backends, --reference workhorse, future docker-image = native SWE-bench). Includes FAIL_TO_PASS/PASS_TO_PASS note (code-grader, no new primitive).
- §11.1: fold Inspect AI (4th confirmation of provenance-vs-acquisition; image/build/x-local).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…op the SDK recipe suggestion At run time the two lists collapse to 'run these named tests; pass iff all pass' — too domain-specific for a primitive and needs no dedicated recipe; a workspace-cwd code-grader (exit code = verdict) covers it. Lists are data the command consumes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ENV}); avoids clobbering runtime shell ${VAR} Env vars use nunjucks {{ env.VAR }} rendered at config-load time (promptfoo-native, load.ts:336), not a ${ENV} sigil. One templating engine; phase-separated by render pass + env namespace. Key correctness win: {{ env.VAR }} does not collide with runtime shell ${VAR} in CLI-target commands (those pass through to the shell). Codemod ${{ X }} -> {{ env.X }}. Updated §2.f, ADR-0016 pt7, beads .4/.15. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
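The codemod step `${{ X }}` -> `{{ env.X }}` described above is a mechanical rewrite. A minimal sketch follows; the function name and the identifier pattern are assumptions, and the real codemod may handle more cases (defaults, quoting) than this does.

```typescript
// Rewrite legacy `${{ NAME }}` references to nunjucks `{{ env.NAME }}`.
// Shell-style `${NAME}` is deliberately left untouched so that runtime
// shell expansion in CLI-target commands keeps working.
function migrateEnvRefs(source: string): string {
  return source.replace(
    /\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    "{{ env.$1 }}",
  );
}
```

Leaving `${NAME}` alone is what preserves the correctness win the commit calls out: those references pass through to the shell instead of being consumed at config-load time.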

…clarative workspace.repos field, not an extension
- Mark ADR-0015/0016/0017 Accepted. Numbering: 0014(extensions,#1592) / 0015(multi-turn) / 0016(authoring) / 0017(output+resolver) — no collision.
- Repo provisioning is a declarative workspace.repos field the harness materializes BEFORE hooks (ADR-0016 pt10, ADR-0017): reverses the workspace-as-extension direction (all 4 benchmark frameworks treat provisioning as harness-core; promptfoo has no workspace to align with). Extensions only for non-provisioning setup (agent-rules, custom hooks). isolation is a workspace field. Kept: workspace.repos/isolation/docker/template; removed workspace.hooks -> extensions.
- Plan §2.l/§3/§0 updated to match.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e hatch); correct over-absolute 'not an extension' Provenance stays a declarative field and the common case runs before hooks (unchanged), but acquisition is extensible per the plugins-over-builtins guardrail: register a custom acquisition backend (recommended) or use a beforeAll escape hatch; the built-in acquisition may itself be a swappable plugin. Invariants unchanged (declarative provenance, acquisition-before-hooks, built-ins ship). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… add field-vs-resolver orthogonality framing
- ADR-0017: locked workspace schema (repos: path/repo/commit(SHA, base_commit alias)/sparse/ancestor; isolation fresh|pooled|shared; template; docker). Name 'workspace' chosen for durability (CI GITHUB_WORKSPACE/margin/git; not sandbox/environment/testbed). Never in schema: acquisition (harness config) + hooks (extensions) — that's what keeps it durable. Added field(what)-vs-resolver(how) orthogonality + package.json analogy.
- Plan §2.l: replaced stale vars.workspace/extension example with the locked field shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…B; adopt provenance field; audit/anti-reward-hacking future scope exploitbench confirms: split filesystem run-tree = source of truth, SQLite is a derived rebuildable view (import/export bijection) not required, image pinned by sha256 digest, config_snapshot=bundle.json. Borrow: (1) provenance field on result rows (native/mock/replay/imported_*) — adopt; (2) eval-integrity/anti-reward-hacking (read-only grader container, audit re-grade + red-flag scan + model-identity check) — future scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…n; categorize by suite AND tags/experiment Confirms ADR-0009/0012: one <run_id> bundle per CLI invocation across all suite YAMLs (never per-suite timestamp folders); runtime_source.kind=multi_eval. Identity = eval_path+test_id (uuid-suffixed dir) so overlapping case IDs across suites don't collide; suite/name are display/grouping metadata not routing. Categorize by BOTH axes on each index row: suite (structural) + tags/experiment (semantic/campaign); experiment = run bucket, suite = intra-run group. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…un-N -> sample-N; experiment is a tag not a bucket
- Cross-run index: rebuildable .agentv/results/.indexes/runs.jsonl (one row per run), derived from */summary.json, not source of truth; per-run index.jsonl stays. JSONL not index.json.
- Repeat folder run-N -> sample-N (margin/pass@k; de-conflicts with run_id). sample_index=repeats, retry_index=infra retries.
- experiment = reserved-by-convention tag (Dashboard default compare key), NOT a bucket/field/storage path — continues ADR-0006/0009/0013 demotion. One grouping mechanism (tags); experiment/suite are conventional keys.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ser-set default grouping key Not even a default compare key. Tag keys sort alphabetically; the default cross-run grouping/compare key is a user preference (any tag), AgentV blesses none. --experiment X = sugar for --tag experiment=X. Completes the experiment demotion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
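The statement that --experiment X is sugar for --tag experiment=X can be illustrated with a small normalization step. This is a sketch under assumptions: the `CliArgs` shape, the helper names, and the key=value parsing rules are hypothetical, not the CLI's actual argument handling.

```typescript
// Hypothetical parsed CLI arguments: repeated --tag values plus optional --experiment.
interface CliArgs {
  tag: string[];
  experiment?: string;
}

// Fold --experiment into the tag list, then parse every key=value pair.
function resolveTags(args: CliArgs): Record<string, string> {
  const pairs = [...args.tag];
  if (args.experiment !== undefined) pairs.push(`experiment=${args.experiment}`);
  const tags: Record<string, string> = {};
  for (const pair of pairs) {
    const eq = pair.indexOf("=");
    if (eq <= 0) throw new Error(`expected key=value, got "${pair}"`);
    tags[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return tags;
}

// Tag keys sort alphabetically; no key is privileged.
function sortedTagKeys(tags: Record<string, string>): string[] {
  return Object.keys(tags).sort();
}
```

After normalization `experiment` is just another key, so any grouping or compare preference can treat it exactly like a user-defined tag.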

…me carry-forward, not a general bijective import margin can rehydrate a run's completed-work from its run-dir for RESUME (LoadProgressSnapshot + loadSavedResumeBundle + carryForwardLocalCases) but has no general import that rebuilds the multi-run query DB from files (memory store ephemeral, Postgres persists independently). AgentV follows exploitbench's model (fs = source of truth, .indexes/*.jsonl derived/rebuildable; --rerun-failed reads index.jsonl, no store to rehydrate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

christso · 2026-07-02T06:28:13Z

Progress update on Group A cleanup / grader artifact wording:

Pushed 5df2530f to fix the prior CI failures: Biome formatting plus regenerated skills-data/agentv-eval-writer/references/eval.schema.json.
Updated ADR-0017 to clarify that per-criterion/per-aspect rows are generic AgentV grader assertion evidence, not g-eval-specific. All graders can emit assertions[]; multi-aspect graders emit multiple rows; grading.json keeps both flat assertions[] and nested graders[].assertions[].
Local verification passed: bun run lint; bun test packages/core/test/evaluation/validation/eval-schema-sync.test.ts; bun test apps/cli/test/commands/eval/artifact-writer.test.ts; bun --filter @agentv/core test (2107 pass).

GitHub CI is running on head 5df2530f.

christso · 2026-07-02T06:38:19Z

Progress update on Group A alias cleanup:

Pushed 90b662b3 removing deprecated exported aliases: EvalCase, OutputMessage, PASS_THRESHOLD, and getAgentvHome.
Dropped the special results_by_project config deprecation path; it now falls through as an ordinary unexpected field. Dashboard docs now point directly to projects[].results.
Removed the stale required_min_score row from the eval-writer rubric reference card.

Local verification passed: bun run lint; bun --filter @agentv/core typecheck; bun --filter @agentv/core build; focused paths/config/grader/parser tests; bun run typecheck.

GitHub CI is green on head 90b662b3.

christso · 2026-07-02T06:46:44Z

Progress checkpoint on head 76178419:

Removed the programmatic TypeScript expected_output convenience alias from evaluate() inputs; expectedOutput remains the TS API, while YAML/wire expected_output remains the first-class golden/reference answer.
Added hard-deprecation checks for inline tests and conversation turns that still pass expected_output to evaluate().
Tightened ADR-0017 wording for the owner clarification: artifact assertion rows are the generic AgentV grader evidence channel. Every grader returns assertions[]; deterministic graders usually emit one row, while multi-aspect graders emit one row per criterion/aspect. This is not g-eval-specific.

Local verification:

bunx biome format --write docs/adr/0017-output-artifact-and-workspace-resolver-contract.md packages/core/src/evaluation/evaluate.ts packages/core/test/evaluation/evaluate-enhanced.test.ts packages/core/test/evaluation/evaluate-programmatic-api.test.ts
bun test packages/core/test/evaluation/evaluate-enhanced.test.ts packages/core/test/evaluation/evaluate-programmatic-api.test.ts (25 pass)
bun --filter @agentv/core typecheck
bun --filter @agentv/core build
bun run lint
bun run typecheck

CI is running for head 76178419.

christso · 2026-07-02T06:48:44Z

CI follow-up: head 76178419 is green (Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages all SUCCESS).

christso · 2026-07-02T06:58:11Z

Progress checkpoint on head 97a95244:

Removed built-in target provider alias sprawl from Group A: azure-openai, google, google-gemini, codex-cli, copilot, copilot_sdk, pi, claude-code, cc-mirror, bedrock, and vertex no longer canonicalize to built-in providers.
Kept canonical providers: azure, gemini, codex, copilot-cli, copilot-sdk, pi-coding-agent, pi-cli, claude/claude-cli/claude-sdk.
Preserved discovered-provider fallback for .agentv/providers/<kind>.ts; removed aliases now validate as unknown provider names instead of built-in aliases.
Removed stale cc-mirror target docs and updated Copilot docs to use provider: copilot-cli.

Local verification:

bunx biome format --write apps/web/src/content/docs/docs/targets/coding-agents.mdx packages/core/src/evaluation/providers/targets.ts packages/core/src/evaluation/providers/types.ts packages/core/src/evaluation/validation/targets-validator.ts packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts
bun test packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts (77 pass)
bun --filter @agentv/core typecheck
bun --filter @agentv/core build
bun run lint
bun run typecheck

CI is running for head 97a95244.

christso · 2026-07-02T07:00:32Z

CI follow-up: head 97a95244 is green (Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages all SUCCESS).

christso · 2026-07-02T07:09:14Z

Follow-up clarification pushed in cee6a8d.

- Tightened ADR-0016 and the restructure plan to state that artifact assertion rows are the generic AgentV grader contract, not a g-eval-specific behavior.
- Clarified the terminology: every grader returns assertions[]; deterministic graders usually emit one row; multi-aspect graders emit one row per authored check/result unit; structured g-eval emits one row per criterion because criteria are one such multi-aspect unit.
- Verification: git diff --check.
- GitHub checks are green on cee6a8d: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages.

christso · 2026-07-02T07:21:48Z

Progress update for Group A cleanup: pushed d74a7da.

- Removed target-level log_format/log_output_format alias support and now reject those fields with stream_log guidance.
- Kept canonical stream_log and wired it into Copilot/Pi/Claude logger format resolution; stream_log: raw maps to JSON event logs, stream_log: summary maps to summary logs, and stream_log: false disables those stream loggers.
- Updated target validation so provider-specific legacy log_format settings no longer pass as known settings.
- Local verification: bun test packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts; bun --filter @agentv/core typecheck; bun --filter @agentv/core build; bun run lint; bun run typecheck; git diff --check.
- GitHub checks are green on d74a7da: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, Cloudflare Pages.
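The stream_log mapping described in this update (raw to JSON event logs, summary to summary logs, false to disabled) can be sketched as a small resolver. The type and function names below are hypothetical and do not reflect the actual code in packages/core; behaviour when stream_log is omitted is not stated in the comment and is not modelled.

```typescript
// Values the comment lists for the canonical stream_log setting.
type StreamLog = "raw" | "summary" | false;

// Hypothetical logger format labels.
type LoggerFormat = "json" | "summary";

// Returns the logger format to use, or null when stream logging is disabled.
function resolveLoggerFormat(streamLog: StreamLog): LoggerFormat | null {
  if (streamLog === false) return null;
  return streamLog === "raw" ? "json" : "summary";
}
```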

christso · 2026-07-02T07:36:17Z

Progress update for av-kfik.3: pushed e27b6aa0 (chore(evals): remove numeric required thresholds).

What changed:

Removed the legacy required: 0.7 threshold shim from grader parsing and orchestration. required is now boolean-only; custom thresholds use required: true + min_score.
Updated core/programmatic/SDK types, eval validation schema, generated eval.schema.json, docs, examples, and the eval-writer skill reference.
Numeric required values now fail parser loading with migration guidance and validator warnings point to min_score.
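A rough sketch of the boolean-only `required` rule follows: numeric values are rejected with migration guidance, and a custom threshold is expressed as required: true plus min_score. The `parseRequired` name, the input shape, and the error wording are assumptions; the real parser lives in the grader loader and may differ.

```typescript
// Hypothetical normalized result of parsing the two fields.
interface RequiredSetting {
  required: boolean;
  min_score?: number;
}

// Reject the legacy numeric form; accept boolean `required` with optional `min_score`.
function parseRequired(raw: { required?: unknown; min_score?: unknown }): RequiredSetting {
  if (typeof raw.required === "number") {
    throw new Error(
      `required: ${raw.required} is no longer supported; ` +
        `use required: true with min_score: ${raw.required}`,
    );
  }
  const setting: RequiredSetting = { required: raw.required === true };
  if (typeof raw.min_score === "number") setting.min_score = raw.min_score;
  return setting;
}
```

A legacy `required: 0.7` would therefore migrate to `required: true` with `min_score: 0.7`.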

Local verification:

bun test packages/core/test/evaluation/loaders/grader-parser.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/sdk/test/grader-helpers.test.ts (285 pass)
bun --filter @agentv/core typecheck
bun --filter @agentv/sdk typecheck
bun --filter @agentv/sdk build
bun run lint
bun run typecheck
git diff --check
jq empty skills-data/agentv-eval-writer/references/eval.schema.json

GitHub CI on head e27b6aa0 is green: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all succeeded.

christso · 2026-07-02T07:41:01Z

Small follow-up for av-kfik.3: pushed eba1442e (chore(targets): align azure api_format removal error).

This just aligns the runtime Azure target error with the validator/docs wording: Azure api_format now consistently says the field has been removed. api_format remains supported for the non-Azure target types that still need it, including OpenAI/Codex/Copilot local-proxy configs.

Local verification:

bun test packages/core/test/evaluation/providers/targets.test.ts packages/core/test/evaluation/validation/targets-validator.test.ts (80 pass)
bun --filter @agentv/core typecheck
bun run lint
git diff --check

GitHub CI on head eba1442e is green: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all succeeded.

christso · 2026-07-02T07:57:46Z

Live dogfood update for av-kfik.3 at head eba1442e.

Ran a 1-case live eval through the local OpenAI-compatible endpoint using base_url=http://127.0.0.1:10531/v1, provider: codex / api_format: responses for the agent target, and provider: openai / api_format: chat for the grader target. Env refs were used in YAML for base URL, API key, and model; command-level values were LOCAL_OPENAI_PROXY_API_KEY=dummy and LOCAL_OPENAI_PROXY_MODEL=gpt-5.4-mini.
Command: bun apps/cli/src/cli.ts eval run /tmp/agentv-av-kfik3-dogfood/dogfood.eval.yaml --targets /tmp/agentv-av-kfik3-dogfood/targets.yaml --target local-codex-agent --workers 1
Result: PASS, 1/1, score 0.90. Run bundle: .agentv/results/2026-07-02T07-47-33-344Z.
Artifact check: grading.json.assertions[] has 2 rows matching the two authored score-range criteria, and grading.json.graders[0].assertions[] has the same 2 rows. index.jsonl.scores[0].type is llm-grader, with scores[0].assertions[] also carrying the 2 rows and details.raw_scores showing boolean-required=8, min-score-threshold=10.
Conclusion: assertion/artifact rows are the generic AgentV grader evidence contract, not G-Eval-specific. Score-range rubrics emit one row per criterion; deterministic and code graders emit whatever assertion rows their checks/scripts return.

Private evidence pushed to agentv-private:evidence/av-kfik3-live-dogfood-2026-07-02 at commit aa7f3a5.
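
The artifact check above can be approximated with a small script. This is a sketch under assumptions: only `assertions` and `graders[0].assertions` are named in the comment, so the row fields (`text`, `passed`) and the `checkAssertionRows` helper are hypothetical, and it covers the single-grader case from this run only.

```typescript
// Hypothetical row shape: the real grading.json rows may carry more fields.
interface AssertionRow {
  text: string;
  passed: boolean;
}

interface GradingFile {
  assertions: AssertionRow[];
  graders: { assertions: AssertionRow[] }[];
}

// Returns a list of problems; an empty list means the artifact looks consistent.
function checkAssertionRows(grading: GradingFile, authoredCriteria: number): string[] {
  const problems: string[] = [];

  if (grading.assertions.length !== authoredCriteria) {
    problems.push(
      `expected ${authoredCriteria} top-level rows, got ${grading.assertions.length}`,
    );
  }

  const first = grading.graders[0];
  if (first === undefined) {
    problems.push("graders[0] is missing");
  } else if (JSON.stringify(first.assertions) !== JSON.stringify(grading.assertions)) {
    // Single-grader case: the grader's rows should mirror the top-level rows.
    problems.push("graders[0].assertions differs from top-level assertions");
  }

  return problems;
}
```

For the run above, the call would be `checkAssertionRows(grading, 2)` after loading grading.json from the run bundle.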

christso · 2026-07-02T08:17:25Z

av-kfik.2 schema/validation slice pushed in 8b261b2.

Changes:

- Zod eval schema accepts snake_cased promptfoo-shaped prompts/targets/default_test/assert/scenarios/derived_metrics/output_path/env/nunjucks_filters/extensions/evaluate_options fields.
- Tests can omit id and use prompts/provider_output; canonical assert is accepted at suite/default_test/test levels.
- Validator keeps top-level providers as an unknown-field warning, not a live alias for targets.
- Regenerated skills-data/agentv-eval-writer/references/eval.schema.json.

Verification:

- bun test packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts
- bun --filter @agentv/core typecheck
- bun --filter @agentv/core lint
- bun run lint
- git diff --check

Boundary: runtime parser/execution for promptfoo assert/g-eval remains in av-kfik.7+ follow-up work; this commit is schema/validation only.
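
A minimal sketch of the providers behavior described in this comment, written without Zod so it stays self-contained. The key list is taken from the fields named in the comment, plus `description` and `tests`, which are assumed; the `ValidationWarning` shape and `warnOnUnknownKeys` name are hypothetical and not the actual validator API.

```typescript
// Assumed set of accepted top-level keys; the real schema is the Zod eval schema.
const KNOWN_TOP_LEVEL_KEYS = new Set<string>([
  "description",
  "prompts",
  "targets",
  "tests",
  "default_test",
  "assert",
  "scenarios",
  "derived_metrics",
  "output_path",
  "env",
  "nunjucks_filters",
  "extensions",
  "evaluate_options",
]);

interface ValidationWarning {
  field: string;
  message: string;
}

function warnOnUnknownKeys(config: Record<string, unknown>): ValidationWarning[] {
  const warnings: ValidationWarning[] = [];
  for (const key of Object.keys(config)) {
    if (KNOWN_TOP_LEVEL_KEYS.has(key)) continue;
    // providers is deliberately NOT treated as an alias for targets: warn only.
    const hint = key === "providers" ? "; use targets for the system under test" : "";
    warnings.push({ field: key, message: `unknown field "${key}"${hint}` });
  }
  return warnings;
}
```

The design choice this mirrors: a promptfoo config with top-level providers still loads with a warning, and rewriting providers to targets is left to the importer rather than the validator.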

christso · 2026-07-02T08:23:29Z

av-kfik.1 ADR polish pushed in 916eccd.

Change: removed the contradictory duplicate standalone Accepted line from both superseded ADR-0013 files, so the status blocks now clearly read Accepted-then-superseded by ADR-0016.

Verification:

- bun run lint
- git diff --check

No runtime impact.

* docs(plans): add promptfoo-compatible extensions plan

  Entire-Checkpoint: dd08c8dc0d47

* docs(plans): amend extensions plan to agreed model (hook-derived isolation, vars.workspace, built-in scheme)

  Adds an Amendments section overriding the original where they conflict, per the wider promptfoo-superset restructure (PR #1594):

  - A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a reset-based workspace pool; remove the isolation config knob.
  - A2: per-case workspace spec lives in dataset vars.workspace.
  - A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://.
  - A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]).
  - A5: ADR 0014 must note the incoming superseding ADR (assert / input removal).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plans): rename skills extension -> agent_rules; clarify grading contract origin

  - A6: the staging extension covers skills + hooks + subagents + rules, so rename to agentv:agent_rules (package agent-rules, context agent_rules_paths). skills is one kind of agent rule, not the extension name.
  - A4: note grading.json originates from agentskills (assertion_results + summary); AgentV adds string verdict (pass/fail/skip) + fractional score as a superset.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plans): use kebab agent-rules for the scheme identifier (agent_rules_paths field stays snake_case)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plans): amend A7 — repo provisioning is a declarative field, not an extension (per ADR-0016/0017)

  Narrows this PR: do NOT move repo materialization into an extension. Repo acquisition stays harness-core (declarative workspace.repos field + resolver, materialized before hooks). Extensions cover only non-provisioning setup (agent-rules, custom hooks).

  Names the superseding ADRs (0016 authoring, 0017 output/resolver) referenced in A5.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(plans): remove trailing whitespace from promptfoo plan

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

christso and others added 3 commits July 2, 2026 09:07

christso and others added 9 commits July 2, 2026 09:37

christso mentioned this pull request Jul 2, 2026

docs(plans): add promptfoo-compatible extensions plan #1592

Merged

christso and others added 14 commits July 2, 2026 10:38

christso and others added 3 commits July 2, 2026 14:33

Remove deprecated eval authoring aliases

1b46c04

Fix eval schema drift and grader artifact wording

5df2530

Remove deprecated exported aliases

90b662b

Remove programmatic expected_output alias

7617841

Remove built-in provider aliases

97a9524

Clarify generic grader assertion rows

cee6a8d

Remove target log_format aliases

d74a7da

chore(evals): remove numeric required thresholds

e27b6aa

chore(targets): align azure api_format removal error

eba1442

feat(evals): accept promptfoo-shaped eval schema fields

8b261b2

docs(adr): clarify superseded authoring decisions

916eccd

fix(cli): remove removed judge target from scaffold

0bff279

christso marked this pull request as ready for review July 2, 2026 08:48

christso merged commit fd2a22b into main Jul 2, 2026
8 checks passed

christso deleted the plan/promptfoo-aligned-eval-restructure branch July 2, 2026 08:48
