docs(plans): add promptfoo-compatible extensions plan #1592

Merged
christso merged 6 commits into main from docs/promptfoo-compatible-extensions-plan on Jul 2, 2026

Conversation

christso (Collaborator) commented Jul 1, 2026

Summary

This PR adds an implementation-ready plan for adopting Promptfoo-compatible extension hooks while removing workspace as an AgentV core primitive. The plan sets the intended shape: core owns generic extension runtime context and run artifacts, while workspace and skills setup move into bundled or local extensions.

The PR 679 parity example is treated as the motivating reusable layout, with suite files focused on providers, prompts, defaults, and tests while workspace materialization and skill staging live in named extension modules.

Validation

Not run; this is a plan-only documentation change.


Compound Engineering
Codex


Tracking: implemented as bead av-kfik.14 (extensions/workspace/agent-rules slice) under epic av-kfik — see PR #1594 for the full restructure plan and the A1–A6 amendments folded into this doc. Depends on av-kfik.2 (schema) + av-kfik.4 (templating).

cloudflare-workers-and-pages Bot commented Jul 1, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 8fe4811
Status: ✅  Deploy successful!
Preview URL: https://816376c0.agentv.pages.dev
Branch Preview URL: https://docs-promptfoo-compatible-ex.agentv.pages.dev

christso added a commit that referenced this pull request Jul 1, 2026
- Hard churn confirmed OK (not yet in production) — clean break, no aliases.
- Simplify: collapse tests[].input into prompts + vars (one way to express
  what's sent to the target); input_files stays as prompt-content.
- Runner: laptop-first, no DB store. Instance expansion + pass@k + infra-only
  retry (margin ideas), but a simple worker pool; resumability via index.jsonl
  (--rerun-failed); workspace = shared or reset-based pool, not per-instance
  containers.
- nunjucks confirmed to meet inline text+var mixing; {{var}} for vars,
  ${ENV} for env.
- Redteam + unimplemented exotic assertions: treated as unrecognized fields
  (future scope), not stubbed. `similar` in (needs embeddings provider).
- test_id: layered identity — content-hash (content), author tag/metadata
  (governance/trend, Dashboard keys on this), description/vars (display).
- LLM judge type name = `llm-rubric`; default grading adopts skeptical
  evidence-by-path judge (opt-out via explicit prompt); documented what changes.
- New §9: PR #1592 is the extensions/workspace implementation slice — reasonable,
  mergeable after 4 amendments (chiefly vars.workspace + built-in agentv: scheme).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation, vars.workspace, built-in scheme)

Adds an Amendments section overriding the original where they conflict, per
the wider promptfoo-superset restructure (PR #1594):
- A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a
  reset-based workspace pool; remove the isolation config knob.
- A2: per-case workspace spec lives in dataset vars.workspace.
- A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://.
- A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]).
- A5: ADR 0014 must note the incoming superseding ADR (assert / input removal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
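
Amendments A1 and A3 tie isolation to the hook named in an extension reference (`beforeAll` = shared, `beforeEach` = per-case). The sketch below shows one way that derivation could look for references of the form `agentv:<name>:<hook>`. The `parseExtensionRef` and `isolationFor` helpers and their types are hypothetical illustrations, not AgentV's actual API, and `file://` references are deliberately not handled.

```typescript
// Hypothetical sketch: derive workspace isolation from the hook named in an
// extension reference such as "agentv:workspace:beforeAll".
type HookName = "beforeAll" | "afterAll" | "beforeEach" | "afterEach";
type Isolation = "shared" | "per-case";

const HOOK_NAMES: readonly HookName[] = ["beforeAll", "afterAll", "beforeEach", "afterEach"];

interface ExtensionRef {
  scheme: string;
  name: string;
  hook: HookName;
}

// Assumes exactly three colon-separated parts; file:// references are out of scope here.
function parseExtensionRef(ref: string): ExtensionRef {
  const parts = ref.split(":");
  if (parts.length !== 3) {
    throw new Error(`expected <scheme>:<name>:<hook>, got "${ref}"`);
  }
  const [scheme, name, hook] = parts;
  if (!HOOK_NAMES.includes(hook as HookName)) {
    throw new Error(`unknown hook "${hook}"`);
  }
  return { scheme, name, hook: hook as HookName };
}

// beforeAll means one shared workspace; beforeEach means one per case.
// Teardown hooks do not select an isolation level.
function isolationFor(ref: ExtensionRef): Isolation | undefined {
  if (ref.hook === "beforeAll") return "shared";
  if (ref.hook === "beforeEach") return "per-case";
  return undefined;
}
```

Under these assumptions, `isolationFor(parseExtensionRef("agentv:workspace:beforeEach"))` evaluates to `"per-case"`, so no separate isolation knob is needed.
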
christso added a commit that referenced this pull request Jul 1, 2026
…#1592 amended

- §5.3: llm-rubric reuses existing EvaluationScore grading.json shape
  (score + verdict + assertions[{text,passed,evidence}]); risk is mapping the
  judge verdict to per-criterion assertions+evidence, keeping evidence in
  grading.json, and score/verdict consistency. No new grading contract.
- §9: record that PR #1592 was amended to the agreed model (A1-A5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso and others added 2 commits July 2, 2026 09:41
…contract origin

- A6: the staging extension covers skills + hooks + subagents + rules, so rename
  to agentv:agent_rules (package agent-rules, context agent_rules_paths). skills
  is one kind of agent rule, not the extension name.
- A4: note grading.json originates from agentskills (assertion_results + summary);
  AgentV adds string verdict (pass/fail/skip) + fractional score as a superset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ules_paths field stays snake_case)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso added a commit that referenced this pull request Jul 2, 2026
…clarative workspace.repos field, not an extension

- Mark ADR-0015/0016/0017 Accepted. Numbering: 0014(extensions,#1592) / 0015(multi-turn)
  / 0016(authoring) / 0017(output+resolver) — no collision.
- Repo provisioning is a declarative workspace.repos field the harness materializes
  BEFORE hooks (ADR-0016 pt10, ADR-0017): reverses the workspace-as-extension direction
  (all 4 benchmark frameworks treat provisioning as harness-core; promptfoo has no
  workspace to align with). Extensions only for non-provisioning setup (agent-rules,
  custom hooks). isolation is a workspace field. Kept: workspace.repos/isolation/docker/
  template; removed workspace.hooks -> extensions.
- Plan §2.l/§3/§0 updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso and others added 2 commits July 2, 2026 11:49
… an extension (per ADR-0016/0017)

Narrows this PR: do NOT move repo materialization into an extension. Repo
acquisition stays harness-core (declarative workspace.repos field + resolver,
materialized before hooks). Extensions cover only non-provisioning setup
(agent-rules, custom hooks). Names the superseding ADRs (0016 authoring, 0017
output/resolver) referenced in A5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso added a commit that referenced this pull request Jul 2, 2026
…1594)

* docs(plan): draft plan to restructure eval authoring as a promptfoo superset

Draft design plan (for review, no code changes) to re-align AgentV's eval
authoring format with promptfoo, keeping snake_case, and borrowing runner/
analytics from Margin-Lab/evals and transcripts/agentic-graders from
vercel-labs/agent-eval.

Governing decisions captured:
- AgentV eval contract is a strict SUPERSET of (snake_cased) promptfoo.
- Prefer promptfoo naming where equivalent; keep AgentV only where semantics
  are better (gate, repeat block, workspace, agentic judge).
- Hard deprecation (major version): removed keys error, codemod migrates.
- `targets` re-canonicalized as first-class SUT (verified: promptfoo `targets`
  is a 2024-05 redteam alias for the canonical `providers`); `provider`
  demoted to backend-kind field; `providers` accepted as input alias.
- `{{ var }}` nunjucks for eval-time vars; `${ENV}` (docker/k8s style) for
  config-time env, replacing `${{ ENV }}`.
- Optional test `id`: split stable derived `test_id` from display label.
- Bare-string `assert` shorthand kept, desugars to batched `llm-rubric`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): workspace as dataset+extension; drop on_run_complete; align lifecycle with promptfoo

- No new top-level `workspace:` block. Repo/fixture spec rides as dataset
  `vars` (file://-loadable), consumed by a built-in, auto-registered,
  overridable `agentv:workspace` extension. Matches vercel (fixture=case)
  and margin (image=case): workspace is part of the dataset.
- Single lifecycle surface = promptfoo `extensions`
  (beforeAll/afterAll/beforeEach/afterEach). Hard-remove `on_run_complete`
  (= afterAll), `preprocessors`, and `workspace.hooks`.
- Isolation = the hook name in the extension reference (verified promptfoo
  mechanism, evaluatorHelpers.ts:633 EXTENSION_HOOK_NAMES):
  `agentv:workspace:beforeAll` = shared, `:beforeEach` = per-case.
- Built-in extension internally borrows margin (git/docker materialization +
  mirror cache) and vercel (per-case fixture copy, path handed to target),
  validates the vars.workspace shape, writes materialized path back to vars.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): grader execution model — javascript in-process, python subprocess, code-grader power tool

Verified against promptfoo source: javascript assertions run in-process
(new Function / dynamic import), only python shells out (PythonShell). AgentV
matches: `javascript` in-process (easier on Bun — imports .ts directly),
`python` subprocess, `code-grader` stays the subprocess power tool for
workspace-cwd / arbitrary-language / isolation cases. Do not desugar
`javascript` into `code-grader` (loses in-process speed). Resolves old §8.3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): lock owner decisions (#1-#8) + reconcile with PR #1592

- Hard churn confirmed OK (not yet in production) — clean break, no aliases.
- Simplify: collapse tests[].input into prompts + vars (one way to express
  what's sent to the target); input_files stays as prompt-content.
- Runner: laptop-first, no DB store. Instance expansion + pass@k + infra-only
  retry (margin ideas), but a simple worker pool; resumability via index.jsonl
  (--rerun-failed); workspace = shared or reset-based pool, not per-instance
  containers.
- nunjucks confirmed to meet inline text+var mixing; {{var}} for vars,
  ${ENV} for env.
- Redteam + unimplemented exotic assertions: treated as unrecognized fields
  (future scope), not stubbed. `similar` in (needs embeddings provider).
- test_id: layered identity — content-hash (content), author tag/metadata
  (governance/trend, Dashboard keys on this), description/vars (display).
- LLM judge type name = `llm-rubric`; default grading adopts skeptical
  evidence-by-path judge (opt-out via explicit prompt); documented what changes.
- New §9: PR #1592 is the extensions/workspace implementation slice — reasonable,
  mergeable after 4 amendments (chiefly vars.workspace + built-in agentv: scheme).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): grading.json contract for llm-rubric (main risk) + note PR #1592 amended

- §5.3: llm-rubric reuses existing EvaluationScore grading.json shape
  (score + verdict + assertions[{text,passed,evidence}]); risk is mapping the
  judge verdict to per-criterion assertions+evidence, keeping evidence in
  grading.json, and score/verdict consistency. No new grading contract.
- §9: record that PR #1592 was amended to the agreed model (A1-A5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): reconcile grading.json with agentskills; rename skills extension -> agent_rules

- §5.3: grading.json originates from agentskills (assertion_results[{text,passed,
  evidence}] + summary counts; no overall verdict). Per-assertion already matches
  agentskills exactly — keep. Overall stays a STRING verdict('pass'|'fail'|'skip')
  + fractional score (AgentV superset; boolean can't express skip/fractional).
  Align array key -> assertion_results and add summary counts. llm-rubric maps
  verdict->per-criterion assertion_results.
- §9: rename agentv:skills -> agentv:agent_rules (stages skills + hooks + agents +
  rules), package agent-rules, context agent_rules_paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
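
The grading.json shape described in this commit (per-assertion `assertion_results`, summary counts, a string verdict, and a fractional score) can be sketched as a type plus a small builder. The summary field names and the verdict rule (skip when there are no assertions, pass only when all pass) are assumptions made for illustration; the commit does not specify them.

```typescript
// Sketch of the grading.json shape: agentskills-style assertion rows plus
// AgentV's string verdict and fractional score.
interface AssertionResult {
  text: string;
  passed: boolean;
  evidence: string;
}

type Verdict = "pass" | "fail" | "skip";

interface Grading {
  score: number; // fraction of assertions that passed
  verdict: Verdict;
  assertion_results: AssertionResult[];
  summary: { passed: number; failed: number; total: number }; // assumed field names
}

// Assumed rule: no assertions means "skip", all passing means "pass", else "fail".
function buildGrading(results: AssertionResult[]): Grading {
  const total = results.length;
  const passed = results.filter((r) => r.passed).length;
  const failed = total - passed;
  const score = total === 0 ? 0 : passed / total;
  const verdict: Verdict = total === 0 ? "skip" : failed === 0 ? "pass" : "fail";
  return { score, verdict, assertion_results: results, summary: { passed, failed, total } };
}
```

Deriving score, verdict, and summary from the same rows keeps them consistent with each other, which is the risk this section calls out for `llm-rubric`.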

* docs(plan): output format best-of-each (margin aggregate + vercel transcript/metrics); cleanup inventory; agent-rules kebab

- §6.0: canonical output = split bundle, queryable-on-filesystem, no DB. Best of
  each: margin-lab rich jq-queryable summary.json (Summary shape) + index.jsonl;
  vercel two-layer transcript + tool_name enum + inlined transcript_summary for
  metrics; agentskills grading.json; promptfoo EvaluateSummaryV3 export only.
- §10: verified dead-code/back-compat cleanup inventory (group A delete-now,
  group B removed-by-restructure, group C dedup). Hard deprecation, no aliases.
- Naming: identifier tokens are kebab (agentv:agent-rules), data fields snake
  (agent_rules_paths). Fixed agent_rules -> agent-rules for the scheme.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): resolve index/manifest naming drift; adopt margin-lab .internal/ bundle layout (§6.0.1)

- index stays JSONL (append/stream/query; enables --rerun-failed), not index.json.
- Fix drift: file index.jsonl is referenced as manifest_path today; rename ->
  index_path. Reserve "manifest"/bundle.json for the frozen-config file only;
  summary.json is the queryable aggregate.
- Adopt margin-lab's internal/ folder, dot-prefixed as .internal/ (AgentV's
  "dot = skip discovery" convention): index.jsonl/progress.json/events.jsonl/
  bundle.json move under .internal/; run root stays clean (summary.json + per-case
  dirs). Cross-run .indexes/.cache at results root are a separate scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): drop maintained consolidated export (split bundle is source of truth); keep index.jsonl name

- No first-class promptfoo EvaluateSummaryV3 / single-file export — the split
  bundle is the single source of truth (YAGNI). A promptfoo-shaped file can be
  generated on demand if ever needed; not shipped/maintained. named_scores/
  derived_metrics still live inside the split rows (Dashboard), not a file.
- Naming: keep index.jsonl (JSONL, not index.json). Reject manifest.jsonl
  (reserved for bundle.json/run manifest). rows.jsonl is the only fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): merge timing.json into metrics.json (fixes ADR-0011/0012 split duplication)

Prior discussion: ADR-0011 & 0012 both defined metrics.json AND timing.json as
separate per-attempt sidecars; both are written today. But timing.json
(TimingArtifact) already carries total_tokens/cost_usd/token_usage/usage_sources
(perf metrics, not just timing), overlapping metrics.json (trace execution
metrics). No reference splits them (agentskills folds tokens+duration into one
file; vercel/margin keep one per-attempt blob). Decision: one metrics.json with
sections (duration/tokens/cost always; execution/trajectory when a trace exists);
drop timing.json + timing_path, keep single metrics_path. Output-contract ADR
supersedes the 0011/0012 split. Added to §6.0.1 and §10 Group C.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
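
A minimal sketch of the merged sidecar: duration, tokens, and cost sections are always present, and the execution section appears only when a trace exists. `total_tokens` and `cost_usd` come from the commit text; `duration_ms`, `tool_calls`, and the nested section layout are hypothetical placeholders, not the real artifact schema.

```typescript
// Sketch: fold the old timing.json fields into a single metrics.json object.
interface TimingArtifact {
  duration_ms: number; // hypothetical field name
  total_tokens: number;
  cost_usd: number;
}

interface TraceMetrics {
  tool_calls: number; // hypothetical trace-derived metric
}

interface Metrics {
  duration: { total_ms: number };
  tokens: { total: number };
  cost: { usd: number };
  execution?: TraceMetrics; // only written when a trace exists
}

function mergeMetrics(timing: TimingArtifact, trace?: TraceMetrics): Metrics {
  const merged: Metrics = {
    duration: { total_ms: timing.duration_ms },
    tokens: { total: timing.total_tokens },
    cost: { usd: timing.cost_usd },
  };
  if (trace) merged.execution = trace;
  return merged;
}
```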

* docs(plan): drop top-level `providers` alias — importer remaps to `targets` instead

AgentV already uses provider + apiId (provider + sub-provider) as the backend
vocabulary, so a top-level `providers`-as-SUT alias would overload the term.
Canonical SUT key is `targets` only; the promptfoo importer rewrites top-level
`providers:` -> `targets:` (mechanical, alongside camel->snake). Superset holds
via the importer, not a live alias; provider/apiId/sub-provider stay
unambiguously "backend." Updated §0/§1.1/§2.a.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
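
The mechanical rewrite mentioned here (top-level `providers:` to `targets:`, alongside camel-to-snake key conversion) could look roughly like the following. This is an illustrative sketch only: the `convertTopLevel` helper is hypothetical, it touches top-level keys only, and nested keys and values are passed through unchanged.

```typescript
// Sketch: convert top-level promptfoo-style keys to AgentV's snake_case keys
// and remap the top-level `providers` key to `targets`.
function camelToSnake(key: string): string {
  return key.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

function convertTopLevel(config: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(config)) {
    const snake = camelToSnake(key);
    // `providers` is the only key renamed; everything else is just snake-cased.
    out[snake === "providers" ? "targets" : snake] = value;
  }
  return out;
}
```

Because only the top-level key is renamed, a nested `provider` field inside a target entry keeps its "backend" meaning.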

* docs(plan): clarify superset is a design property (no shipped importer); multi-turn provenance + keep evaluation layer

- §0/§1.1/§2.a: superset is a design property, not a shipped importer. New evals
  authored in snake_case directly. providers->targets + camel->snake is a one-off
  conversion/codemod, not a runtime alias. Distinct from the hard-deprecation
  codemod (which IS built, migrates AgentV's own files).
- §3/§2.i: multi-turn has real provenance (agentv#1053, researched in
  agentevals/agentevals-research/research/findings/multiturn-conversation-eval
  against inspect-ai, google-adk, ragas). on_turn_failure <- inspect-ai
  state.completed; per-conversation aggregation is a deliberate gap-fill (inspect/
  ragas/promptfoo aggregate only across epochs, not within a conversation). Split
  execution (promptfoo _conversation) from evaluation (keep AgentV's per-turn
  assertions + aggregation + on_turn_failure). Drop only window_size (no pedigree).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr): add ADR-0015 multi-turn (execution vs evaluation); window_size→_conversation migration

- New ADR-0015: split conversation execution (promptfoo _conversation/sessions)
  from evaluation (keep per-turn assertions + cross-turn aggregation +
  on_turn_failure, provenance inspect-ai/google-adk/agentv#1053). Drop
  window_size. Includes the concrete window_size -> _conversation template
  mapping (loop.revindex <= N; system outside the loop) for the codemod.
- Plan §3: window_size rationale = redundant in the _conversation model (author
  windows in the template), points to ADR-0015.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
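
The `loop.revindex <= N` mapping keeps only the last N turns while the system message stays outside the loop. The TypeScript below expresses the same selection outside a template, purely to illustrate the semantics. The `Turn` and `Message` types and the `windowedMessages` name are hypothetical.

```typescript
// Sketch: equivalent of a template loop guarded by `loop.revindex <= N`.
// revindex counts down to 1 at the last item, so the guard keeps the last N turns.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

interface Message {
  role: string;
  content: string;
}

function windowedMessages(system: string, turns: Turn[], windowSize: number): Message[] {
  // For a 0-based index i, the 1-based reverse index is turns.length - i.
  const kept = turns.filter((_, i) => turns.length - i <= windowSize);
  // The system message sits outside the loop, so it is never windowed away.
  return [{ role: "system", content: system }, ...kept];
}
```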

* docs(plan): separate workspace materialization from pooling (§4)

Materialization (resolver + git-mirror cache) is always-on and makes even
fresh-per-case fast — clone speed is decoupled from reuse. Pooling = reuse +
quick-reset between cases, a perf optimization mainly for local evals (amortize
expensive setup); CI prefers fresh-per-case + mirror cache. Three isolation
levels: shared / pooled / fresh-per-case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): §11 quality gate + CargoWise PR-679 live dogfood + two workspace resolver adapters

Layered gate: deterministic CI (merge gate) + live PR-679 CargoWise dogfood
(real provider+grader, blocking) + optional promptfoo-parity diff. Workspace
acquisition = two resolver adapters unified under 'agentv workspace deps':
local git-mirror (~/projects/WiseTechGlobal/CargoWise, 7.3G, dev/offline) and a
snapshot-download adapter (port WTG download-release-deps.ts: per-year .git
tarballs from a release, shared checkout+symlink, LFS skip) for CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): §11 workspace acquisition performance — mirror-alternates / partial+sparse+shallow / snapshot-download

Not 'download vs shallow clone': ordered fastest-first — (1) local mirror via
alternates/reference (zero transfer, dev), (2) direct partial+sparse+shallow
clone (CI default: --filter=blob:none --sparse --depth 1 + workspace.repos.sparse;
less transfer than a per-year .git tarball, no producer), (3) snapshot-download
adapter (CI fallback: many-commit amortization or sha-fetch/LFS blockers).
Resolver selects per environment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): §11.1 canonical workspace resolver — split provenance from acquisition (SWE-bench/margin/Harbor)

Per AgentV research repo-provisioning-schema-design.md: eval declares provenance
only (repo/commit/sparse/ancestor); acquisition is a harness resolver in machine
config, keyed on repo, ordered backends: local-auto-adopt(--reference) ->
mirror-cache(--reference) -> snapshot -> remote -> (future) docker-image
(SWE-bench/margin, same identity key). --reference gives shallow-speed + full
history, retiring depth/filter. Remove tangled per-repo type/resolve/depth/filter/
resolver fields. Supersedes the §11 acquisition-perf framing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
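
The ordered-backend idea above (try local-auto-adopt, then mirror-cache, then snapshot, then remote, keyed on repo) reduces to a first-match loop. The sketch below is a simplified, synchronous illustration; the `Backend` interface and `resolveRepo` function are hypothetical and do not reflect the real resolver's signatures.

```typescript
// Sketch: pick the first acquisition backend that can provide the repo.
interface Backend {
  name: string;
  // Returns a local path when this backend can supply the repo, else undefined.
  acquire(repo: string): string | undefined;
}

function resolveRepo(repo: string, backends: Backend[]): { backend: string; path: string } {
  for (const backend of backends) {
    const path = backend.acquire(repo);
    if (path !== undefined) return { backend: backend.name, path };
  }
  throw new Error(`no backend could acquire ${repo}`);
}
```

The eval file only declares provenance (repo, commit, and so on); which backend wins is decided by the order in the machine config, so the same eval can resolve differently on a laptop and in CI.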

* docs(adr): 0016 authoring-contract + 0017 output/artifact+resolver; supersede 0013; fold Inspect AI

- ADR-0016: promptfoo-superset authoring contract (assert/llm-rubric/targets/prompts+
  vars/nunjucks+${ENV}/hard-deprecation/metric/test_id identity/gate/repeat/workspace
  provenance). Supersedes both 0013 ADRs (marked in their Status).
- ADR-0017: output/artifact contract (best-of-each split bundle, margin queryable
  summary + index.jsonl, vercel transcript/metrics, agentskills grading + verdict/score,
  .internal/, index_path, timing->metrics merge, pass@k Build) + workspace resolver
  (provenance vs acquisition, ordered backends, --reference workhorse, future docker-image
  = native SWE-bench). Includes FAIL_TO_PASS/PASS_TO_PASS note (code-grader, no new primitive).
- §11.1: fold Inspect AI (4th confirmation of provenance-vs-acquisition; image/build/x-local).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr-0017): FAIL_TO_PASS/PASS_TO_PASS is a plain code-grader — drop the SDK recipe suggestion

At run time the two lists collapse to 'run these named tests; pass iff all pass' —
too domain-specific for a primitive and needs no dedicated recipe; a workspace-cwd
code-grader (exit code = verdict) covers it. Lists are data the command consumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: adopt promptfoo-native {{ env.VAR }} for config env (reverse ${ENV}); avoids clobbering runtime shell ${VAR}

Env vars use nunjucks {{ env.VAR }} rendered at config-load time (promptfoo-native,
load.ts:336), not a ${ENV} sigil. One templating engine; phase-separated by render
pass + env namespace. Key correctness win: {{ env.VAR }} does not collide with
runtime shell ${VAR} in CLI-target commands (those pass through to the shell).
Codemod ${{ X }} -> {{ env.X }}. Updated §2.f, ADR-0016 pt7, beads .4/.15.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
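
To make the collision argument concrete, here is a small regex-based sketch, not the nunjucks renderer itself: `renderEnv` substitutes `{{ env.VAR }}` at config-load time and leaves shell-style `${VAR}` untouched, and `migrateEnvSigil` performs the `${{ X }}` to `{{ env.X }}` codemod. Both function names are hypothetical, and rendering a missing variable as an empty string is an assumption.

```typescript
// Sketch: config-time env rendering that does not touch runtime shell ${VAR}.
function renderEnv(text: string, env: Record<string, string | undefined>): string {
  return text.replace(
    /\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (_match, name: string) => env[name] ?? "", // assumption: missing vars render as ""
  );
}

// Sketch: codemod from the old `${{ X }}` sigil to `{{ env.X }}`.
function migrateEnvSigil(text: string): string {
  return text.replace(/\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g, "{{ env.$1 }}");
}
```

With `env = { API_KEY: "abc" }`, the command `run --key {{ env.API_KEY }} --out ${HOME}/out` renders to `run --key abc --out ${HOME}/out`, and `${HOME}` passes through to the shell.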

* docs(adr): finalize 0015/0016/0017 (Accepted); repo provisioning = declarative workspace.repos field, not an extension

- Mark ADR-0015/0016/0017 Accepted. Numbering: 0014(extensions,#1592) / 0015(multi-turn)
  / 0016(authoring) / 0017(output+resolver) — no collision.
- Repo provisioning is a declarative workspace.repos field the harness materializes
  BEFORE hooks (ADR-0016 pt10, ADR-0017): reverses the workspace-as-extension direction
  (all 4 benchmark frameworks treat provisioning as harness-core; promptfoo has no
  workspace to align with). Extensions only for non-provisioning setup (agent-rules,
  custom hooks). isolation is a workspace field. Kept: workspace.repos/isolation/docker/
  template; removed workspace.hooks -> extensions.
- Plan §2.l/§3/§0 updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr): acquisition is PLUGGABLE (custom backend + beforeAll escape hatch); correct over-absolute 'not an extension'

Provenance stays a declarative field and the common case runs before hooks
(unchanged), but acquisition is extensible per the plugins-over-builtins guardrail:
register a custom acquisition backend (recommended) or use a beforeAll escape hatch;
the built-in acquisition may itself be a swappable plugin. Invariants unchanged
(declarative provenance, acquisition-before-hooks, built-ins ship).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: lock final workspace schema + name (durable 'workspace' field); add field-vs-resolver orthogonality framing

- ADR-0017: locked workspace schema (repos: path/repo/commit(SHA, base_commit alias)/
  sparse/ancestor; isolation fresh|pooled|shared; template; docker). Name 'workspace'
  chosen for durability (CI GITHUB_WORKSPACE/margin/git; not sandbox/environment/testbed).
  Never in schema: acquisition (harness config) + hooks (extensions) — that's what keeps
  it durable. Added field(what)-vs-resolver(how) orthogonality + package.json analogy.
- Plan §2.l: replaced stale vars.workspace/extension example with the locked field shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr-0017): cross-check exploitbench — confirms split-bundle/no-DB; adopt provenance field; audit/anti-reward-hacking future scope

exploitbench confirms: split filesystem run-tree = source of truth, SQLite is a
derived rebuildable view (import/export bijection) not required, image pinned by
sha256 digest, config_snapshot=bundle.json. Borrow: (1) provenance field on result
rows (native/mock/replay/imported_*) — adopt; (2) eval-integrity/anti-reward-hacking
(read-only grader container, audit re-grade + red-flag scan + model-identity check)
— future scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr-0017): multi-suite runs — one run_id/timestamp per invocation; categorize by suite AND tags/experiment

Confirms ADR-0009/0012: one <run_id> bundle per CLI invocation across all suite
YAMLs (never per-suite timestamp folders); runtime_source.kind=multi_eval.
Identity = eval_path+test_id (uuid-suffixed dir) so overlapping case IDs across
suites don't collide; suite/name are display/grouping metadata not routing.
Categorize by BOTH axes on each index row: suite (structural) + tags/experiment
(semantic/campaign); experiment = run bucket, suite = intra-run group.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr): run organization — cross-run .indexes/runs.jsonl; rename run-N -> sample-N; experiment is a tag not a bucket

- Cross-run index: rebuildable .agentv/results/.indexes/runs.jsonl (one row per
  run), derived from */summary.json, not source of truth; per-run index.jsonl stays.
  JSONL not index.json.
- Repeat folder run-N -> sample-N (margin/pass@k; de-conflicts with run_id).
  sample_index=repeats, retry_index=infra retries.
- experiment = reserved-by-convention tag (Dashboard default compare key), NOT a
  bucket/field/storage path — continues ADR-0006/0009/0013 demotion. One grouping
  mechanism (tags); experiment/suite are conventional keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr): experiment has ZERO privilege — plain tag, alphabetical, user-set default grouping key

Not even a default compare key. Tag keys sort alphabetically; the default cross-run
grouping/compare key is a user preference (any tag), AgentV blesses none.
--experiment X = sugar for --tag experiment=X. Completes the experiment demotion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr): experiment — no structural privilege, but auto-default its VALUE to eval/suite name (ADR-0009)

Separates the two: no privileged grouping KEY (user preference, alphabetical tags),
but the harness auto-populates the experiment tag VALUE from the eval/suite name
when unset (--experiment > authored tags.experiment > eval/suite name) so every run
is always groupable under a meaningful experiment. Default value, not a privileged key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
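
The precedence stated above (`--experiment` > authored `tags.experiment` > eval/suite name) is a simple fallback chain. A minimal sketch, with a hypothetical `resolveExperiment` helper and option names:

```typescript
// Sketch: default the experiment tag VALUE without privileging the key.
interface ExperimentInputs {
  cliExperiment?: string; // value of --experiment, if given
  tags?: Record<string, string>; // authored tags from the eval file
  suiteName: string; // eval/suite name, always available
}

function resolveExperiment(inputs: ExperimentInputs): string {
  return inputs.cliExperiment ?? inputs.tags?.experiment ?? inputs.suiteName;
}
```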

* docs(adr-0017): full results-tree layout (cross-run .indexes/ vs per-run .internal/); dashboard default view

- Two dot-namespaces: .agentv/results/.indexes/runs.jsonl (cross-run catalog, one
  row/run) vs <run_id>/.internal/index.jsonl (per-run, one row/case). No per-run
  .indexes — .internal already holds the per-run index. Names signal scope.
- Dashboard default view is sensible (group by always-populated experiment value, or
  recent-runs list); grouping key is a user preference, not an absent default — so
  it never looks odd/empty.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr-0017): add cross-run cases.jsonl (case-level filtering); confirm margin-lab consistency

- runs.jsonl (one row/run) can't do case-level cross-run queries; add
  .indexes/cases.jsonl (one row per run x case) rebuilt from */.internal/index.jsonl
  + run metadata. Join key = layered identity (content-hash test_id + governance tag).
  SQLite view is the escape hatch if JSONL scanning outgrows laptop scale (optional
  adapter, exploitbench pattern, no-DB core intact).
- margin-lab consistency: matches on substance (queryable aggregate, internal folder,
  per-unit dirs, no DB, pure Build pass@k, instance_key test_id#sample_index);
  deliberate divergences (hierarchical test-id/sample-N vs flat instances/, timing->metrics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr-0017): lock artifact filenames — keep summary.json (not results.json) and grading.json (not grades.json)

summary.json kept over margin's results.json: accurate (it's an aggregate not the
full results), avoids results/<run_id>/results.json stutter, symmetric at run+case
levels, vercel-aligned; margin match is on concept/shape not filename. Per-sample
triad result.json/grading.json/metrics.json all kept (distinct). grading.json kept
(agentskills-consistent), not grades.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr-0017): correct margin claim — margin uses a RunStore (memory/Postgres, NOT SQLite); no-DB is AgentV's divergence

margin's runner has a persistent RunStore (in-memory/Postgres) for scheduling +
queries — the operational source during a run, not a filesystem-derived index.
AgentV deliberately declines a store (laptop-first; index.jsonl + --rerun-failed).
The rebuildable derived-index/view idea (JSONL .indexes/, optional SQLite escape
hatch) is exploitbench's (import/export bijection), not margin's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(adr-0017): margin store rebuild nuance — fs->store only for resume carry-forward, not a general bijective import

margin can rehydrate a run's completed-work from its run-dir for RESUME
(LoadProgressSnapshot + loadSavedResumeBundle + carryForwardLocalCases) but has
no general import that rebuilds the multi-run query DB from files (memory store
ephemeral, Postgres persists independently). AgentV follows exploitbench's model
(fs = source of truth, .indexes/*.jsonl derived/rebuildable; --rerun-failed reads
index.jsonl, no store to rehydrate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Remove deprecated eval authoring aliases

* Fix eval schema drift and grader artifact wording

* Remove deprecated exported aliases

* Remove programmatic expected_output alias

* Remove built-in provider aliases

* Clarify generic grader assertion rows

* Remove target log_format aliases

* chore(evals): remove numeric required thresholds

* chore(targets): align azure api_format removal error

* feat(evals): accept promptfoo-shaped eval schema fields

* docs(adr): clarify superseded authoring decisions

* fix(cli): remove removed judge target from scaffold

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso marked this pull request as ready for review July 2, 2026 08:49
christso merged commit 64b0471 into main Jul 2, 2026
8 checks passed
christso deleted the docs/promptfoo-compatible-extensions-plan branch July 2, 2026 08:49