docs(plans): add promptfoo-compatible extensions plan#1592
Merged
Conversation
Entire-Checkpoint: dd08c8dc0d47
Deploying agentv with
|
| Latest commit: |
8fe4811
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://816376c0.agentv.pages.dev |
| Branch Preview URL: | https://docs-promptfoo-compatible-ex.agentv.pages.dev |
christso
added a commit
that referenced
this pull request
Jul 1, 2026
- Hard churn confirmed OK (not yet in production) — clean break, no aliases.
- Simplify: collapse tests[].input into prompts + vars (one way to express
what's sent to the target); input_files stays as prompt-content.
- Runner: laptop-first, no DB store. Instance expansion + pass@k + infra-only
retry (margin ideas), but a simple worker pool; resumability via index.jsonl
(--rerun-failed); workspace = shared or reset-based pool, not per-instance
containers.
- nunjucks confirmed to meet inline text+var mixing; {{var}} for vars,
${ENV} for env.
- Redteam + unimplemented exotic assertions: treated as unrecognized fields
(future scope), not stubbed. `similar` in (needs embeddings provider).
- test_id: layered identity — content-hash (content), author tag/metadata
(governance/trend, Dashboard keys on this), description/vars (display).
- LLM judge type name = `llm-rubric`; default grading adopts skeptical
evidence-by-path judge (opt-out via explicit prompt); documented what changes.
- New §9: PR #1592 is the extensions/workspace implementation slice — reasonable,
mergeable after 4 amendments (chiefly vars.workspace + built-in agentv: scheme).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation, vars.workspace, built-in scheme) Adds an Amendments section overriding the original where they conflict, per the wider promptfoo-superset restructure (PR #1594): - A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a reset-based workspace pool; remove the isolation config knob. - A2: per-case workspace spec lives in dataset vars.workspace. - A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://. - A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]). - A5: ADR 0014 must note the incoming superseding ADR (assert / input removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso
added a commit
that referenced
this pull request
Jul 1, 2026
…#1592 amended - §5.3: llm-rubric reuses existing EvaluationScore grading.json shape (score + verdict + assertions[{text,passed,evidence}]); risk is mapping the judge verdict to per-criterion assertions+evidence, keeping evidence in grading.json, and score/verdict consistency. No new grading contract. - §9: record that PR #1592 was amended to the agreed model (A1-A5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…contract origin - A6: the staging extension covers skills + hooks + subagents + rules, so rename to agentv:agent_rules (package agent-rules, context agent_rules_paths). skills is one kind of agent rule, not the extension name. - A4: note grading.json originates from agentskills (assertion_results + summary); AgentV adds string verdict (pass/fail/skip) + fractional score as a superset. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ules_paths field stays snake_case) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso
added a commit
that referenced
this pull request
Jul 2, 2026
…clarative workspace.repos field, not an extension - Mark ADR-0015/0016/0017 Accepted. Numbering: 0014(extensions,#1592) / 0015(multi-turn) / 0016(authoring) / 0017(output+resolver) — no collision. - Repo provisioning is a declarative workspace.repos field the harness materializes BEFORE hooks (ADR-0016 pt10, ADR-0017): reverses the workspace-as-extension direction (all 4 benchmark frameworks treat provisioning as harness-core; promptfoo has no workspace to align with). Extensions only for non-provisioning setup (agent-rules, custom hooks). isolation is a workspace field. Kept: workspace.repos/isolation/docker/ template; removed workspace.hooks -> extensions. - Plan §2.l/§3/§0 updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… an extension (per ADR-0016/0017) Narrows this PR: do NOT move repo materialization into an extension. Repo acquisition stays harness-core (declarative workspace.repos field + resolver, materialized before hooks). Extensions cover only non-provisioning setup (agent-rules, custom hooks). Names the superseding ADRs (0016 authoring, 0017 output/resolver) referenced in A5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
christso
added a commit
that referenced
this pull request
Jul 2, 2026
…1594) * docs(plan): draft plan to restructure eval authoring as a promptfoo superset Draft design plan (for review, no code changes) to re-align AgentV's eval authoring format with promptfoo, keeping snake_case, and borrowing runner/ analytics from Margin-Lab/evals and transcripts/agentic-graders from vercel-labs/agent-eval. Governing decisions captured: - AgentV eval contract is a strict SUPERSET of (snake_cased) promptfoo. - Prefer promptfoo naming where equivalent; keep AgentV only where semantics are better (gate, repeat block, workspace, agentic judge). - Hard deprecation (major version): removed keys error, codemod migrates. - `targets` re-canonicalized as first-class SUT (verified: promptfoo `targets` is a 2024-05 redteam alias for the canonical `providers`); `provider` demoted to backend-kind field; `providers` accepted as input alias. - `{{ var }}` nunjucks for eval-time vars; `${ENV}` (docker/k8s style) for config-time env, replacing `${{ ENV }}`. - Optional test `id`: split stable derived `test_id` from display label. - Bare-string `assert` shorthand kept, desugars to batched `llm-rubric`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): workspace as dataset+extension; drop on_run_complete; align lifecycle with promptfoo - No new top-level `workspace:` block. Repo/fixture spec rides as dataset `vars` (file://-loadable), consumed by a built-in, auto-registered, overridable `agentv:workspace` extension. Matches vercel (fixture=case) and margin (image=case): workspace is part of the dataset. - Single lifecycle surface = promptfoo `extensions` (beforeAll/afterAll/beforeEach/afterEach). Hard-remove `on_run_complete` (= afterAll), `preprocessors`, and `workspace.hooks`. - Isolation = the hook name in the extension reference (verified promptfoo mechanism, evaluatorHelpers.ts:633 EXTENSION_HOOK_NAMES): `agentv:workspace:beforeAll` = shared, `:beforeEach` = per-case. - Built-in extension internally borrows margin (git/docker materialization + mirror cache) and vercel (per-case fixture copy, path handed to target), validates the vars.workspace shape, writes materialized path back to vars. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): grader execution model — javascript in-process, python subprocess, code-grader power tool Verified against promptfoo source: javascript assertions run in-process (new Function / dynamic import), only python shells out (PythonShell). AgentV matches: `javascript` in-process (easier on Bun — imports .ts directly), `python` subprocess, `code-grader` stays the subprocess power tool for workspace-cwd / arbitrary-language / isolation cases. Do not desugar `javascript` into `code-grader` (loses in-process speed). Resolves old §8.3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): lock owner decisions (#1-#8) + reconcile with PR #1592 - Hard churn confirmed OK (not yet in production) — clean break, no aliases. - Simplify: collapse tests[].input into prompts + vars (one way to express what's sent to the target); input_files stays as prompt-content. - Runner: laptop-first, no DB store. Instance expansion + pass@k + infra-only retry (margin ideas), but a simple worker pool; resumability via index.jsonl (--rerun-failed); workspace = shared or reset-based pool, not per-instance containers. - nunjucks confirmed to meet inline text+var mixing; {{var}} for vars, ${ENV} for env. - Redteam + unimplemented exotic assertions: treated as unrecognized fields (future scope), not stubbed. `similar` in (needs embeddings provider). - test_id: layered identity — content-hash (content), author tag/metadata (governance/trend, Dashboard keys on this), description/vars (display). - LLM judge type name = `llm-rubric`; default grading adopts skeptical evidence-by-path judge (opt-out via explicit prompt); documented what changes. - New §9: PR #1592 is the extensions/workspace implementation slice — reasonable, mergeable after 4 amendments (chiefly vars.workspace + built-in agentv: scheme). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): grading.json contract for llm-rubric (main risk) + note PR #1592 amended - §5.3: llm-rubric reuses existing EvaluationScore grading.json shape (score + verdict + assertions[{text,passed,evidence}]); risk is mapping the judge verdict to per-criterion assertions+evidence, keeping evidence in grading.json, and score/verdict consistency. No new grading contract. - §9: record that PR #1592 was amended to the agreed model (A1-A5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): reconcile grading.json with agentskills; rename skills extension -> agent_rules - §5.3: grading.json originates from agentskills (assertion_results[{text,passed, evidence}] + summary counts; no overall verdict). Per-assertion already matches agentskills exactly — keep. Overall stays a STRING verdict('pass'|'fail'|'skip') + fractional score (AgentV superset; boolean can't express skip/fractional). Align array key -> assertion_results and add summary counts. llm-rubric maps verdict->per-criterion assertion_results. - §9: rename agentv:skills -> agentv:agent_rules (stages skills + hooks + agents + rules), package agent-rules, context agent_rules_paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): output format best-of-each (margin aggregate + vercel transcript/metrics); cleanup inventory; agent-rules kebab - §6.0: canonical output = split bundle, queryable-on-filesystem, no DB. Best of each: margin-lab rich jq-queryable summary.json (Summary shape) + index.jsonl; vercel two-layer transcript + tool_name enum + inlined transcript_summary for metrics; agentskills grading.json; promptfoo EvaluateSummaryV3 export only. - §10: verified dead-code/back-compat cleanup inventory (group A delete-now, group B removed-by-restructure, group C dedup). Hard deprecation, no aliases. - Naming: identifier tokens are kebab (agentv:agent-rules), data fields snake (agent_rules_paths). Fixed agent_rules -> agent-rules for the scheme. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): resolve index/manifest naming drift; adopt margin-lab .internal/ bundle layout (§6.0.1) - index stays JSONL (append/stream/query; enables --rerun-failed), not index.json. - Fix drift: file index.jsonl is referenced as manifest_path today; rename -> index_path. Reserve "manifest"/bundle.json for the frozen-config file only; summary.json is the queryable aggregate. - Adopt margin-lab's internal/ folder, dot-prefixed as .internal/ (AgentV's "dot = skip discovery" convention): index.jsonl/progress.json/events.jsonl/ bundle.json move under .internal/; run root stays clean (summary.json + per-case dirs). Cross-run .indexes/.cache at results root are a separate scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): drop maintained consolidated export (split bundle is source of truth); keep index.jsonl name - No first-class promptfoo EvaluateSummaryV3 / single-file export — the split bundle is the single source of truth (YAGNI). A promptfoo-shaped file can be generated on demand if ever needed; not shipped/maintained. named_scores/ derived_metrics still live inside the split rows (Dashboard), not a file. - Naming: keep index.jsonl (JSONL, not index.json). Reject manifest.jsonl (reserved for bundle.json/run manifest). rows.jsonl is the only fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): merge timing.json into metrics.json (fixes ADR-0011/0012 split duplication) Prior discussion: ADR-0011 & 0012 both defined metrics.json AND timing.json as separate per-attempt sidecars; both are written today. But timing.json (TimingArtifact) already carries total_tokens/cost_usd/token_usage/usage_sources (perf metrics, not just timing), overlapping metrics.json (trace execution metrics). No reference splits them (agentskills folds tokens+duration into one file; vercel/margin keep one per-attempt blob). Decision: one metrics.json with sections (duration/tokens/cost always; execution/trajectory when a trace exists); drop timing.json + timing_path, keep single metrics_path. Output-contract ADR supersedes the 0011/0012 split. Added to §6.0.1 and §10 Group C. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): drop top-level `providers` alias — importer remaps to `targets` instead AgentV already uses provider + apiId (provider + sub-provider) as the backend vocabulary, so a top-level `providers`-as-SUT alias would overload the term. Canonical SUT key is `targets` only; the promptfoo importer rewrites top-level `providers:` -> `targets:` (mechanical, alongside camel->snake). Superset holds via the importer, not a live alias; provider/apiId/sub-provider stay unambiguously "backend." Updated §0/§1.1/§2.a. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): clarify superset is a design property (no shipped importer); multi-turn provenance + keep evaluation layer - §0/§1.1/§2.a: superset is a design property, not a shipped importer. New evals authored in snake_case directly. providers->targets + camel->snake is a one-off conversion/codemod, not a runtime alias. Distinct from the hard-deprecation codemod (which IS built, migrates AgentV's own files). - §3/§2.i: multi-turn has real provenance (agentv#1053, researched in agentevals/agentevals-research/research/findings/multiturn-conversation-eval against inspect-ai, google-adk, ragas). on_turn_failure <- inspect-ai state.completed; per-conversation aggregation is a deliberate gap-fill (inspect/ ragas/promptfoo aggregate only across epochs, not within a conversation). Split execution (promptfoo _conversation) from evaluation (keep AgentV's per-turn assertions + aggregation + on_turn_failure). Drop only window_size (no pedigree). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr): add ADR-0015 multi-turn (execution vs evaluation); window_size→_conversation migration - New ADR-0015: split conversation execution (promptfoo _conversation/sessions) from evaluation (keep per-turn assertions + cross-turn aggregation + on_turn_failure, provenance inspect-ai/google-adk/agentv#1053). Drop window_size. Includes the concrete window_size -> _conversation template mapping (loop.revindex <= N; system outside the loop) for the codemod. - Plan §3: window_size rationale = redundant in the _conversation model (author windows in the template), points to ADR-0015. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): separate workspace materialization from pooling (§4) Materialization (resolver + git-mirror cache) is always-on and makes even fresh-per-case fast — clone speed is decoupled from reuse. Pooling = reuse + quick-reset between cases, a perf optimization mainly for local evals (amortize expensive setup); CI prefers fresh-per-case + mirror cache. Three isolation levels: shared / pooled / fresh-per-case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): §11 quality gate + CargoWise PR-679 live dogfood + two workspace resolver adapters Layered gate: deterministic CI (merge gate) + live PR-679 CargoWise dogfood (real provider+grader, blocking) + optional promptfoo-parity diff. Workspace acquisition = two resolver adapters unified under 'agentv workspace deps': local git-mirror (~/projects/WiseTechGlobal/CargoWise, 7.3G, dev/offline) and a snapshot-download adapter (port WTG download-release-deps.ts: per-year .git tarballs from a release, shared checkout+symlink, LFS skip) for CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): §11 workspace acquisition performance — mirror-alternates / partial+sparse+shallow / snapshot-download Not 'download vs shallow clone': ordered fastest-first — (1) local mirror via alternates/reference (zero transfer, dev), (2) direct partial+sparse+shallow clone (CI default: --filter=blob:none --sparse --depth 1 + workspace.repos.sparse; less transfer than a per-year .git tarball, no producer), (3) snapshot-download adapter (CI fallback: many-commit amortization or sha-fetch/LFS blockers). Resolver selects per environment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(plan): §11.1 canonical workspace resolver — split provenance from acquisition (SWE-bench/margin/Harbor) Per AgentV research repo-provisioning-schema-design.md: eval declares provenance only (repo/commit/sparse/ancestor); acquisition is a harness resolver in machine config, keyed on repo, ordered backends: local-auto-adopt(--reference) -> mirror-cache(--reference) -> snapshot -> remote -> (future) docker-image (SWE-bench/margin, same identity key). --reference gives shallow-speed + full history, retiring depth/filter. Remove tangled per-repo type/resolve/depth/filter/ resolver fields. Supersedes the §11 acquisition-perf framing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr): 0016 authoring-contract + 0017 output/artifact+resolver; supersede 0013; fold Inspect AI - ADR-0016: promptfoo-superset authoring contract (assert/llm-rubric/targets/prompts+ vars/nunjucks+${ENV}/hard-deprecation/metric/test_id identity/gate/repeat/workspace provenance). Supersedes both 0013 ADRs (marked in their Status). - ADR-0017: output/artifact contract (best-of-each split bundle, margin queryable summary + index.jsonl, vercel transcript/metrics, agentskills grading + verdict/score, .internal/, index_path, timing->metrics merge, pass@k Build) + workspace resolver (provenance vs acquisition, ordered backends, --reference workhorse, future docker-image = native SWE-bench). Includes FAIL_TO_PASS/PASS_TO_PASS note (code-grader, no new primitive). - §11.1: fold Inspect AI (4th confirmation of provenance-vs-acquisition; image/build/x-local). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): FAIL_TO_PASS/PASS_TO_PASS is a plain code-grader — drop the SDK recipe suggestion At run time the two lists collapse to 'run these named tests; pass iff all pass' — too domain-specific for a primitive and needs no dedicated recipe; a workspace-cwd code-grader (exit code = verdict) covers it. Lists are data the command consumes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: adopt promptfoo-native {{ env.VAR }} for config env (reverse ${ENV}); avoids clobbering runtime shell ${VAR} Env vars use nunjucks {{ env.VAR }} rendered at config-load time (promptfoo-native, load.ts:336), not a ${ENV} sigil. One templating engine; phase-separated by render pass + env namespace. Key correctness win: {{ env.VAR }} does not collide with runtime shell ${VAR} in CLI-target commands (those pass through to the shell). Codemod ${{ X }} -> {{ env.X }}. Updated §2.f, ADR-0016 pt7, beads .4/.15. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr): finalize 0015/0016/0017 (Accepted); repo provisioning = declarative workspace.repos field, not an extension - Mark ADR-0015/0016/0017 Accepted. Numbering: 0014(extensions,#1592) / 0015(multi-turn) / 0016(authoring) / 0017(output+resolver) — no collision. - Repo provisioning is a declarative workspace.repos field the harness materializes BEFORE hooks (ADR-0016 pt10, ADR-0017): reverses the workspace-as-extension direction (all 4 benchmark frameworks treat provisioning as harness-core; promptfoo has no workspace to align with). Extensions only for non-provisioning setup (agent-rules, custom hooks). isolation is a workspace field. Kept: workspace.repos/isolation/docker/ template; removed workspace.hooks -> extensions. - Plan §2.l/§3/§0 updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr): acquisition is PLUGGABLE (custom backend + beforeAll escape hatch); correct over-absolute 'not an extension' Provenance stays a declarative field and the common case runs before hooks (unchanged), but acquisition is extensible per the plugins-over-builtins guardrail: register a custom acquisition backend (recommended) or use a beforeAll escape hatch; the built-in acquisition may itself be a swappable plugin. Invariants unchanged (declarative provenance, acquisition-before-hooks, built-ins ship). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: lock final workspace schema + name (durable 'workspace' field); add field-vs-resolver orthogonality framing - ADR-0017: locked workspace schema (repos: path/repo/commit(SHA, base_commit alias)/ sparse/ancestor; isolation fresh|pooled|shared; template; docker). Name 'workspace' chosen for durability (CI GITHUB_WORKSPACE/margin/git; not sandbox/environment/testbed). Never in schema: acquisition (harness config) + hooks (extensions) — that's what keeps it durable. Added field(what)-vs-resolver(how) orthogonality + package.json analogy. - Plan §2.l: replaced stale vars.workspace/extension example with the locked field shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): cross-check exploitbench — confirms split-bundle/no-DB; adopt provenance field; audit/anti-reward-hacking future scope exploitbench confirms: split filesystem run-tree = source of truth, SQLite is a derived rebuildable view (import/export bijection) not required, image pinned by sha256 digest, config_snapshot=bundle.json. Borrow: (1) provenance field on result rows (native/mock/replay/imported_*) — adopt; (2) eval-integrity/anti-reward-hacking (read-only grader container, audit re-grade + red-flag scan + model-identity check) — future scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): multi-suite runs — one run_id/timestamp per invocation; categorize by suite AND tags/experiment Confirms ADR-0009/0012: one <run_id> bundle per CLI invocation across all suite YAMLs (never per-suite timestamp folders); runtime_source.kind=multi_eval. Identity = eval_path+test_id (uuid-suffixed dir) so overlapping case IDs across suites don't collide; suite/name are display/grouping metadata not routing. Categorize by BOTH axes on each index row: suite (structural) + tags/experiment (semantic/campaign); experiment = run bucket, suite = intra-run group. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr): run organization — cross-run .indexes/runs.jsonl; rename run-N -> sample-N; experiment is a tag not a bucket - Cross-run index: rebuildable .agentv/results/.indexes/runs.jsonl (one row per run), derived from */summary.json, not source of truth; per-run index.jsonl stays. JSONL not index.json. - Repeat folder run-N -> sample-N (margin/pass@k; de-conflicts with run_id). sample_index=repeats, retry_index=infra retries. - experiment = reserved-by-convention tag (Dashboard default compare key), NOT a bucket/field/storage path — continues ADR-0006/0009/0013 demotion. One grouping mechanism (tags); experiment/suite are conventional keys. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr): experiment has ZERO privilege — plain tag, alphabetical, user-set default grouping key Not even a default compare key. Tag keys sort alphabetically; the default cross-run grouping/compare key is a user preference (any tag), AgentV blesses none. --experiment X = sugar for --tag experiment=X. Completes the experiment demotion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr): experiment — no structural privilege, but auto-default its VALUE to eval/suite name (ADR-0009) Separates the two: no privileged grouping KEY (user preference, alphabetical tags), but the harness auto-populates the experiment tag VALUE from the eval/suite name when unset (--experiment > authored tags.experiment > eval/suite name) so every run is always groupable under a meaningful experiment. Default value, not a privileged key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): full results-tree layout (cross-run .indexes/ vs per-run .internal/); dashboard default view - Two dot-namespaces: .agentv/results/.indexes/runs.jsonl (cross-run catalog, one row/run) vs <run_id>/.internal/index.jsonl (per-run, one row/case). No per-run .indexes — .internal already holds the per-run index. Names signal scope. - Dashboard default view is sensible (group by always-populated experiment value, or recent-runs list); grouping key is a user preference, not an absent default — so it never looks odd/empty. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): add cross-run cases.jsonl (case-level filtering); confirm margin-lab consistency - runs.jsonl (one row/run) can't do case-level cross-run queries; add .indexes/cases.jsonl (one row per run x case) rebuilt from */.internal/index.jsonl + run metadata. Join key = layered identity (content-hash test_id + governance tag). SQLite view is the escape hatch if JSONL scanning outgrows laptop scale (optional adapter, exploitbench pattern, no-DB core intact). - margin-lab consistency: matches on substance (queryable aggregate, internal folder, per-unit dirs, no DB, pure Build pass@k, instance_key test_id#sample_index); deliberate divergences (hierarchical test-id/sample-N vs flat instances/, timing->metrics). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): lock artifact filenames — keep summary.json (not results.json) and grading.json (not grades.json) summary.json kept over margin's results.json: accurate (it's an aggregate not the full results), avoids results/<run_id>/results.json stutter, symmetric at run+case levels, vercel-aligned; margin match is on concept/shape not filename. Per-sample triad result.json/grading.json/metrics.json all kept (distinct). grading.json kept (agentskills-consistent), not grades.json. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): correct margin claim — margin uses a RunStore (memory/Postgres, NOT SQLite); no-DB is AgentV's divergence margin's runner has a persistent RunStore (in-memory/Postgres) for scheduling + queries — the operational source during a run, not a filesystem-derived index. AgentV deliberately declines a store (laptop-first; index.jsonl + --rerun-failed). The rebuildable derived-index/view idea (JSONL .indexes/, optional SQLite escape hatch) is exploitbench's (import/export bijection), not margin's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(adr-0017): margin store rebuild nuance — fs->store only for resume carry-forward, not a general bijective import margin can rehydrate a run's completed-work from its run-dir for RESUME (LoadProgressSnapshot + loadSavedResumeBundle + carryForwardLocalCases) but has no general import that rebuilds the multi-run query DB from files (memory store ephemeral, Postgres persists independently). AgentV follows exploitbench's model (fs = source of truth, .indexes/*.jsonl derived/rebuildable; --rerun-failed reads index.jsonl, no store to rehydrate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Remove deprecated eval authoring aliases * Fix eval schema drift and grader artifact wording * Remove deprecated exported aliases * Remove programmatic expected_output alias * Remove built-in provider aliases * Clarify generic grader assertion rows * Remove target log_format aliases * chore(evals): remove numeric required thresholds * chore(targets): align azure api_format removal error * feat(evals): accept promptfoo-shaped eval schema fields * docs(adr): clarify superseded authoring decisions * fix(cli): remove removed judge target from scaffold --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds an implementation-ready plan for adopting Promptfoo-compatible extension hooks while removing
workspaceas an AgentV core primitive. The plan sets the intended shape: core owns generic extension runtime context and run artifacts, while workspace and skills setup move into bundled or local extensions.The PR 679 parity example is treated as the motivating reusable layout, with suite files focused on providers, prompts, defaults, and tests while workspace materialization and skill staging live in named extension modules.
Validation
Not run; this is a plan-only documentation change.
Tracking: implemented as bead
av-kfik.14(extensions/workspace/agent-rules slice) under epicav-kfik— see PR #1594 for the full restructure plan and the A1–A6 amendments folded into this doc. Depends onav-kfik.2(schema) +av-kfik.4(templating).