From 59c8389f6ad6e07ad060e5a48202b763f42d94dd Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Wed, 1 Jul 2026 16:17:49 +0200 Subject: [PATCH 1/6] docs(plans): add promptfoo-compatible extensions plan Entire-Checkpoint: dd08c8dc0d47 --- ...at-promptfoo-compatible-extensions-plan.md | 270 ++++++++++++++++++ 1 file changed, 270 insertions(+) create mode 100644 docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md diff --git a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md new file mode 100644 index 000000000..f3ac65a5e --- /dev/null +++ b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md @@ -0,0 +1,270 @@ +--- +artifact_contract: ce-unified-plan/v1 +artifact_readiness: implementation-ready +product_contract_source: ce-plan-bootstrap +execution: code +title: "feat: Add Promptfoo-compatible extensions" +created_at: 2026-07-01 +type: feat +--- + +# feat: Add Promptfoo-compatible extensions + +## Goal Capsule + +- **Objective:** Adopt a Promptfoo-compatible eval authoring contract for extension hooks, remove `workspace` as an AgentV core primitive, and make the PR 679 parity example reusable through generalized workspace and skills extensions. +- **Authority:** User request, AgentV product boundary, ADR 0013 eval authoring contract, existing workspace and Agent Skills implementation. +- **Execution profile:** Deep, cross-cutting schema/runtime/docs change touching eval parsing, validation, orchestration, examples, and provider-facing skill setup. +- **Stop conditions:** Stop if removing `workspace` from core would break canonical run artifact creation or provider invocation in a way the generalized extension runtime cannot replace. +- **Tail ownership:** AgentV core owns extension loading, provider invocation, normalized run context, and run artifacts; workspace and skills setup live in bundled or local extensions rather than core. + +--- + +## Product Contract + +### Summary + +AgentV should support Promptfoo-style extension hook references such as `file://scripts/workspace.ts:beforeAll` while preserving AgentV's repo-native execution and artifact model. The PR 679 Promptfoo parity example should become a reusable layout where the suite file stays focused on providers, prompts, tests, assertions, and defaults, while workspace materialization and skill staging live in named extension modules with reusable config. + +### Problem Frame + +The current AgentV `workspace` field is powerful but makes authoring diverge from Promptfoo and pushes environment setup into core. The Promptfoo PR 679 parity branch demonstrates a cleaner top-level suite shape, but its extension scripts communicate through bespoke environment variables and one-off file names. AgentV needs the compatibility win without keeping workspace as a privileged core concept. + +### Requirements + +- R1. Eval YAML supports a top-level `extensions` array using Promptfoo-compatible `file://path:function` references. +- R2. Supported hook names are `beforeAll`, `beforeEach`, `afterEach`, and `afterAll`, mapped internally to AgentV's existing snake_case lifecycle stages. +- R3. AgentV core no longer parses or owns top-level `workspace`; workspace setup is expressed through reusable extensions. +- R4. Existing workspace materialization behavior is extracted behind an extension boundary for templates, repositories, hooks, Docker config, static paths, and workspace pooling. +- R5. Skills setup is modeled as a reusable skills extension that stages `SKILL.md` directories into the prepared workspace and exposes normalized skill paths to providers. +- R6. The PR 679 parity example is reorganized so the suite file is Promptfoo-like and the workspace, materialization, skills, providers, fixtures, and rubrics are independently reusable. +- R7. Machine-local paths and secrets stay outside portable suite files; extension configs may reference local overlays or env vars, but portable examples must show placeholders and config-local routing. +- R8. Run bundles continue to expose canonical AgentV result artifacts; Promptfoo compatibility does not create a Promptfoo result format or make Promptfoo the artifact owner. +- R9. Documentation and validation explain the removal of core `workspace` and the replacement extension model without collapsing AgentV `project` and `benchmark` vocabulary. + +### Acceptance Examples + +- AE1. Given an eval file with `extensions: ["file://extensions/workspace.ts:beforeAll"]`, when `agentv validate` runs, then the extension reference is accepted and resolved relative to the eval file. +- AE2. Given a reusable workspace extension config with pinned repos, when an eval run starts, then the extension prepares the runtime directory and returns generic context consumed by core. +- AE3. Given a skills extension config, when a supported coding-agent target runs, then the staged skill directories are available in the workspace and reflected in provider request metadata without bespoke env var glue. +- AE4. Given a legacy eval that still uses `workspace`, when validation runs, then AgentV fails with a migration message that points to `extensions` and the bundled workspace extension. +- AE5. Given the PR 679 parity suite, when a user swaps only the cases or skill config, then the same workspace and provider extension files can be reused without editing the suite file. + +### Scope Boundaries + +- **In scope:** Promptfoo-compatible extension reference syntax, hook lifecycle integration, generalized extension runtime context, workspace/skills extension contracts, validation/docs/examples, and PR 679 reusable layout. +- **In scope:** Removing core `workspace` parsing and routing users to extension-based setup. +- **Out of scope:** Importing Promptfoo results, copying Promptfoo's full config schema, adding hosted Promptfoo integration, or changing AgentV run bundle ownership. +- **Deferred to Follow-Up Work:** A deterministic Promptfoo-to-AgentV converter, package-distributed extension registries, remote extension loading, and a generalized plugin marketplace. + +--- + +## Planning Contract + +### Key Technical Decisions + +- KTD1. Adopt Promptfoo syntax only at the hook reference boundary. AgentV should accept `file://path:function` and the four Promptfoo hook names, but the hook context and returned normalized runtime state remain AgentV-owned. +- KTD2. Remove workspace from core, then reintroduce the behavior as an extension. The existing `packages/core/src/evaluation/workspace/` pipeline should be extracted to a bundled workspace extension or workspace support package; core should only consume normalized runtime context such as `cwd`, cleanup callbacks, environment, artifacts, and metadata. +- KTD3. Use explicit extension outputs instead of env-var side channels. The PR 679 Promptfoo branch uses `PROMPTFOO_PR679_WORKSPACE_DIR` and related env vars; AgentV should instead let hooks return typed runtime data such as prepared working directories, staged skill paths, environment values, and metadata. +- KTD4. Resolve extension paths like Promptfoo, execute them like AgentV. `file://` paths resolve relative to the eval file, support JavaScript/TypeScript first, and may support Python through the same subprocess discipline used for code graders if needed. +- KTD5. Hard-remove `workspace` from the core eval contract. If a short migration window is unavoidable for already-published examples, keep it outside the parser as a conversion helper, not as a runtime compatibility path. +- KTD6. Keep the PR 679 example split by reusable concern. The suite owns prompts/tests/defaults; `extensions/workspace.*` owns repo materialization; `extensions/skills.*` owns skill staging; `providers/` owns targets; `datasets/pr-679/` owns fixtures and cases; `rubrics/` owns grading criteria. + +### High-Level Technical Design + +```mermaid +flowchart TB + A[Eval YAML] --> B[Parser and validator] + B --> C[Extension registry] + C --> D[beforeAll workspace extension] + C --> E[beforeAll skills extension] + D --> F[Normalized runtime context] + E --> G[Normalized staged skill paths] + F --> H[Bundled workspace extension/support package] + G --> I[Provider request context] + H --> I + I --> J[AgentV run bundle] +``` + +The suite file should stay Promptfoo-like at the authoring boundary, but normalized state should enter the AgentV execution path before provider invocation. The `workspace` term remains valid only as a runtime directory or extension capability name; it stops being an AgentV core eval field. + +### Suggested Reusable Example Layout + +```text +framework-parity/promptfoo/pr-679/ + promptfooconfig.yaml + datasets/pr-679/cases.yaml + datasets/pr-679/fixtures/clear-job-consol-transport-vessel-fk-offline.cs + datasets/pr-679/fixtures/clear-job-consol-transport-vessel-fk-online.cs + extensions/workspace.ts + extensions/workspace.config.yaml + extensions/workspace.local.example.yaml + extensions/skills.ts + extensions/skills.config.yaml + providers/pi-cli-reviewer.yaml + providers/pi-cli-grader.yaml + rubrics/fractional-rubric.json + README.md +``` + +The same shape can later be mirrored in AgentV examples with non-sensitive fixtures. The private WTG parity repo can keep the CargoWise-specific config and local mirror mapping. + +### Assumptions + +- The user's "sam syntax" means "same syntax" as Promptfoo extension references. +- `workspace` removal applies to core eval schema and core orchestration ownership, not to runtime paths, run artifacts, or Dashboard wording where "workspace" means an actual prepared directory. +- The WTG PR 679 example should stay in `wtg-ai-prompts-experiment` unless a separate decision approves publishing those artifacts in AgentV. + +### Sources And Research + +- Promptfoo local clone: `src/evaluatorHelpers.ts` defines hook names, `file://path:function` parsing, and hook-specific calling convention. +- Promptfoo local clone: `src/types/index.ts` validates top-level `extensions` as `file://` Python or JavaScript hook references. +- Promptfoo parity branch: `framework-parity/promptfoo/pr-679/data-transformation-pr50857-e2e.suite.yaml` shows the desired suite shape with `providers`, `extensions`, `prompts`, `defaultTest`, and `tests`. +- AgentV current code: `packages/core/src/evaluation/types.ts`, `packages/core/src/evaluation/yaml-parser.ts`, and `packages/core/src/evaluation/workspace/setup.ts` define the current `workspace` authoring and runtime pipeline. +- AgentV current examples: `examples/features/workspace-shared-config/` demonstrates reusable workspace config files, and `examples/features/agent-skills-evals/` demonstrates Agent Skills eval import and skill-trigger grading. +- ADR source: `docs/adr/0013-stabilize-eval-authoring-contract.md` keeps `tests`, `target`, `experiment`, `default_test`, and `gate` as preferred AgentV vocabulary. + +--- + +## Implementation Units + +### U1. Freeze The Contract In Docs And Vocabulary + +- **Goal:** Record the Promptfoo-compatible extension contract and the removal of `workspace` from AgentV core before changing runtime behavior. +- **Requirements:** R1, R2, R3, R8, R9 +- **Dependencies:** None +- **Files:** `docs/adr/0014-promptfoo-compatible-extensions.md`, `CONCEPTS.md`, `apps/web/src/content/docs/docs/evaluation/eval-files.mdx`, `apps/web/src/content/docs/docs/guides/workspace-architecture.mdx` +- **Approach:** Add an ADR that defines top-level `extensions`, supported hooks, path resolution, trusted local execution posture, and the replacement path for existing `workspace` examples. Update Concepts so `Workspace` distinguishes runtime environment from extension-provided setup. +- **Patterns to follow:** ADR 0013 for preferred authoring contract language; `CONCEPTS.md` for project vocabulary. +- **Test scenarios:** Test expectation: none -- documentation-only unit. +- **Verification:** Reviewers can identify the preferred YAML shape, the hard-removal stance for core `workspace`, and the boundary between extension syntax compatibility and AgentV-owned artifacts. + +### U2. Add Extension Reference Parsing And Validation + +- **Goal:** Accept Promptfoo-compatible extension references in AgentV eval YAML. +- **Requirements:** R1, R2, AE1 +- **Dependencies:** U1 +- **Files:** `packages/core/src/evaluation/types.ts`, `packages/core/src/evaluation/yaml-parser.ts`, `packages/core/src/evaluation/validation/eval-file.schema.ts`, `packages/core/src/evaluation/validation/eval-validator.ts`, `packages/core/test/evaluation/validation/eval-file-schema.test.ts`, `packages/core/test/evaluation/validation/eval-validator.test.ts`, `packages/core/test/evaluation/yaml-parser-metadata.test.ts` +- **Approach:** Add an `extensions` field to raw suite parsing and normalized suite metadata. Validate `file://path:function` references, recognize the Promptfoo hook names, reject missing functions for v1 unless a deliberate default-export convention is documented, and resolve relative paths from the eval file directory. +- **Patterns to follow:** Existing `default_test` validation, file-reference validation, and Promptfoo's `getExtensionHookName` behavior as compatibility input. +- **Test scenarios:** + - Parse a suite with `extensions: ["file://extensions/workspace.ts:beforeAll"]` and confirm the normalized suite retains the extension reference. + - Reject an extension not starting with `file://`. + - Reject a known hook typo such as `before_all` with a message that names `beforeAll`. + - Resolve relative extension paths from a nested eval file without using `process.cwd()`. +- **Verification:** `agentv validate` accepts the Promptfoo-style reference and reports actionable errors for malformed references. + +### U3. Implement The General Extension Runtime + +- **Goal:** Execute extension hooks at AgentV lifecycle points and let them provide normalized runtime context to the core runner. +- **Requirements:** R1, R2, R4, R8 +- **Dependencies:** U2 +- **Files:** `packages/core/src/evaluation/extensions/runner.ts`, `packages/core/src/evaluation/extensions/types.ts`, `packages/core/src/evaluation/orchestrator.ts`, `packages/core/test/evaluation/extensions/runner.test.ts`, `packages/core/test/evaluation/orchestrator.test.ts` +- **Approach:** Add a small extension runner that loads local JavaScript/TypeScript hook modules, invokes only hooks matching the current lifecycle, and merges returned AgentV-normalized runtime fields. Core should understand generic fields such as `cwd`, `env`, `metadata`, `artifacts`, `cleanup`, and provider context; it should not understand workspace repos, templates, Docker, or skill-copy rules. +- **Patterns to follow:** Promptfoo hook filtering semantics, AgentV `executeWorkspaceScript` subprocess discipline, and orchestrator hook execution ordering. +- **Test scenarios:** + - Two `beforeAll` extensions run in order and the second sees normalized state returned by the first. + - A `beforeEach` extension does not run for `beforeAll`. + - An extension error fails the run at a clear `setup` stage with the extension path in the message. + - `afterEach` can add result metadata without replacing scores or AgentV artifact paths. +- **Verification:** Provider invocation tests continue to pass with no extensions configured, and runtime context from extensions reaches provider requests without workspace-specific core code. + +### U4. Extract Workspace Setup Into A Bundled Extension + +- **Goal:** Remove workspace setup from AgentV core and ship it as an extension built on the generalized runtime. +- **Requirements:** R3, R4, R7, AE2, AE4 +- **Dependencies:** U3 +- **Files:** `packages/extensions/workspace/src/index.ts`, `packages/extensions/workspace/src/types.ts`, `packages/extensions/workspace/test/workspace-extension.test.ts`, `packages/core/src/evaluation/workspace/setup.ts`, `packages/core/src/evaluation/workspace/repo-config-parser.ts`, `packages/core/test/evaluation/workspace-config-parsing.test.ts`, `packages/core/test/evaluation/workspace/resolve.test.ts`, `packages/core/test/evaluation/orchestrator.test.ts` +- **Approach:** Move the current workspace materialization code behind a bundled extension package or equivalent non-core module. The extension reads workspace config, materializes repos/templates/Docker/hooks, then returns generic runtime context to core. Keep machine-local `workspace_path` and mirror paths in `.agentv/config.local.yaml` or extension-local overlay files. Make top-level `workspace` a validation error with migration guidance. +- **Patterns to follow:** `examples/features/workspace-shared-config/workspace.yaml` for reusable config and `packages/core/src/evaluation/workspace/setup.ts` as the extraction source. +- **Test scenarios:** + - A workspace extension returns a config with pinned repos and the orchestrator materializes them through the existing repo manager. + - A workspace extension returns `isolation: per_case` and each test receives a separate workspace. + - A machine-local mirror path in a local overlay is not serialized into the portable eval suite. + - A legacy top-level `workspace` eval fails validation with a precise migration message. +- **Verification:** Workspace materialization tests move to the extension package and prove template, repo, hook, Docker, and static directory behavior through the extension path, while core orchestrator tests no longer import workspace-specific config types. + +### U5. Add A Skills Extension Contract + +- **Goal:** Replace one-off skill setup scripts and metadata side channels with a reusable skills extension. +- **Requirements:** R5, R7, AE3 +- **Dependencies:** U3, U4 +- **Files:** `packages/extensions/skills/src/index.ts`, `packages/extensions/skills/src/types.ts`, `packages/extensions/skills/test/skills-extension.test.ts`, `packages/core/src/evaluation/loaders/agent-skills-parser.ts`, `packages/core/src/evaluation/providers/claude-cli.ts`, `packages/core/src/evaluation/providers/copilot-sdk.ts`, `packages/core/src/evaluation/providers/codex-cli.ts`, `packages/core/src/evaluation/providers/pi-cli.ts`, `packages/core/test/evaluation/loaders/agent-skills-parser.test.ts`, `packages/core/test/evaluation/providers/copilot-sdk.test.ts` +- **Approach:** Let the skills extension copy or generate skill directories into the prepared runtime directory and return normalized `skill_paths` provider context. Providers that support explicit skill paths should consume those paths from provider request context. Existing `metadata.agent_skills_files` handling can remain as import compatibility but should not be the preferred authoring path. +- **Patterns to follow:** `packages/core/src/evaluation/loaders/agent-skills-parser.ts` for Agent Skills import, `packages/core/src/evaluation/providers/copilot-sdk.ts` for auto-discovered skill directories, and `examples/features/copilot-log-eval/` for skill-trigger evidence. +- **Test scenarios:** + - A skills extension stages a `SKILL.md` directory into a prepared workspace and exposes the staged path to a provider request. + - Multiple skills can be staged without overwriting each other. + - A missing skill source fails setup with a message naming the source command or path. + - `skill-trigger` grading still works from normalized provider traces. +- **Verification:** A no-live-provider fixture can prove staging and request metadata; live dogfood later proves a coding-agent provider can actually load the skill. + +### U6. Restructure The PR 679 Promptfoo Compatibility Example + +- **Goal:** Turn the PR 679 parity example into reusable Promptfoo-compatible layout instead of a one-off script bundle. +- **Requirements:** R6, R7, AE5 +- **Dependencies:** U1, U2, U3, U4, U5 +- **Files:** `docs/examples/promptfoo-compatible-layout.md`, `examples/features/promptfoo-compatible-extensions/README.md`, `examples/features/promptfoo-compatible-extensions/promptfooconfig.yaml`, `examples/features/promptfoo-compatible-extensions/extensions/workspace.ts`, `examples/features/promptfoo-compatible-extensions/extensions/skills.ts`, `examples/features/promptfoo-compatible-extensions/providers/local-agent.yaml`, `examples/features/promptfoo-compatible-extensions/datasets/sample/cases.yaml`, `examples/features/promptfoo-compatible-extensions/rubrics/fractional-rubric.json` +- **Approach:** Add a non-sensitive AgentV example that mirrors the private PR 679 layout and documents how the private `wtg-ai-prompts-experiment` paths map into it. Keep the WTG-specific CargoWise fixtures and mirror paths in the private repo unless publishing is explicitly approved. +- **Patterns to follow:** Promptfoo parity branch paths under `framework-parity/promptfoo/pr-679/`, AgentV examples under `examples/features/`, and docs guidance in `apps/web/src/content/docs/docs/evaluation/examples.mdx`. +- **Test scenarios:** + - The example validates with extension-only workspace and skills setup. + - Replacing only `datasets/sample/cases.yaml` does not require editing provider or extension files. + - The README shows how to map the private PR 679 files into the reusable layout without absolute local paths. +- **Verification:** `agentv validate` succeeds for the example, and docs do not expose private WTG local paths. + +### U7. Update CLI, Schema, And AI-Facing Guidance + +- **Goal:** Make the new contract discoverable through CLI validation, generated schema, docs, and bundled authoring skills. +- **Requirements:** R1, R3, R9 +- **Dependencies:** U2, U4, U5, U6 +- **Files:** `packages/core/scripts/generate-eval-schema.ts`, `apps/cli/src/commands/eval/commands/run.ts`, `apps/cli/src/commands/eval/shared.ts`, `apps/web/src/content/docs/docs/evaluation/eval-files.mdx`, `apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx`, `plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md` +- **Approach:** Regenerate schema output, update CLI help and validation examples, and revise the AI-facing eval builder guidance so agents author `extensions` rather than top-level `workspace` for new evals. +- **Patterns to follow:** `.agents/workflow.md` note that schema guidance changes update `plugins/agentv-dev/skills/agentv-eval-builder/`. +- **Test scenarios:** + - Generated eval schema includes `extensions`. + - CLI validation output identifies extension errors with file and hook context. + - Docs examples use `default_test` rather than Promptfoo's camelCase `defaultTest` unless the example is explicitly showing Promptfoo source compatibility. +- **Verification:** Schema sync tests pass and docs examples stay aligned with ADR 0013 field names. + +### U8. Dogfood The Contract With A Real Provider And Grader + +- **Goal:** Prove extension-authored workspace and skills setup works in a live eval, not only parser tests. +- **Requirements:** R4, R5, R6, R8, AE2, AE3, AE5 +- **Dependencies:** U6, U7 +- **Files:** `examples/features/promptfoo-compatible-extensions/README.md`, private evidence branch contents outside the public repo +- **Approach:** Run the smallest live coding-agent eval with a real provider and real LLM grader using the extension-only example. Capture the run bundle, validated config, evidence README, and any screenshots or artifact trees on an `agentv-private` evidence branch per `.agents/verification.md`. +- **Patterns to follow:** `.agents/verification.md` live dogfood rules for eval execution, providers, graders, and artifact changes. +- **Test scenarios:** + - Live provider loads the staged skill from the skills extension. + - Workspace extension prepares the expected repo/template state. + - LLM grader evaluates the result and the run bundle records canonical AgentV artifacts. + - A failure path, such as missing skill source or missing workspace config, produces a setup error rather than a provider crash. +- **Verification:** Private evidence includes the canonical `.agentv/results//` bundle, command output, artifact tree, and exact blocker if the live provider or grader cannot run. + +--- + +## Verification Contract + +| Gate | Applies To | Done Signal | +|---|---|---| +| `bun test packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts` | U2, U7 | Extension schema and validator behavior is covered. | +| `bun test packages/core/test/evaluation/extensions/runner.test.ts packages/core/test/evaluation/orchestrator.test.ts` | U3, U4 | Hook lifecycle integration works without regressing orchestration. | +| `bun test packages/extensions/workspace/test/workspace-extension.test.ts` | U4 | The bundled workspace extension prepares runtime context without core workspace schema. | +| `bun test packages/core/test/evaluation/loaders/agent-skills-parser.test.ts packages/core/test/evaluation/providers/copilot-sdk.test.ts` | U5 | Skill staging and provider request metadata are covered without live providers. | +| `bun apps/cli/src/cli.ts validate examples/features/promptfoo-compatible-extensions/promptfooconfig.yaml` | U6, U7 | Example validates through the public CLI. | +| Live provider plus live LLM grader dogfood | U8 | Private evidence proves an extension-only eval can run end to end and produce canonical run artifacts. | + +--- + +## Definition of Done + +- The preferred eval authoring docs use top-level `extensions` for workspace and skills setup. +- AgentV accepts Promptfoo-compatible `file://path:function` extension references with the four hook names. +- Core orchestration consumes generic extension runtime context and no longer owns workspace setup. +- The former workspace engine is available through a bundled workspace extension and is fed by extension config rather than eval core schema. +- Skills setup has a reusable extension path and provider-facing normalized skill metadata. +- Legacy top-level `workspace` behavior is hard-removed from the core eval contract, with validation errors and migration docs pointing to extensions. +- The PR 679 parity layout is represented as a reusable, non-sensitive AgentV example or documented mapping, with private WTG-specific files kept out of the public repo. +- Validation, unit tests, schema sync, and live provider/grader dogfood pass or record exact blockers. +- Abandoned experimental code, duplicate setup paths, and one-off PR 679 env-var glue are not left in the final diff. From 6e53e90cea358d43dcc4b9284895643499d9afd5 Mon Sep 17 00:00:00 2001 From: Christopher Date: Thu, 2 Jul 2026 09:36:43 +1000 Subject: [PATCH 2/6] docs(plans): amend extensions plan to agreed model (hook-derived isolation, vars.workspace, built-in scheme) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds an Amendments section overriding the original where they conflict, per the wider promptfoo-superset restructure (PR #1594): - A1: isolation is hook-derived (beforeAll=shared, beforeEach=per-case) via a reset-based workspace pool; remove the isolation config knob. - A2: per-case workspace spec lives in dataset vars.workspace. - A3: ship built-in auto-registered agentv:workspace / agentv:skills alongside file://. - A4: grading contract unchanged — reuse EvaluationScore (score/verdict/assertions[evidence]). - A5: ADR 0014 must note the incoming superseding ADR (assert / input removal). Co-Authored-By: Claude Opus 4.8 (1M context) --- ...-001-feat-promptfoo-compatible-extensions-plan.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md index f3ac65a5e..ada58a92a 100644 --- a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md +++ b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md @@ -10,6 +10,18 @@ type: feat # feat: Add Promptfoo-compatible extensions +## Amendments (agreed — override the sections below where they conflict) + +This plan is the **extensions/workspace implementation slice** of the wider promptfoo-superset restructure (`docs/plans/promptfoo-aligned-eval-restructure.md`, PR #1594). The following owner-agreed decisions override the original text: + +- **A1. Isolation is hook-derived, not a config field.** Remove the `isolation: per_case` config knob (KTD2/U4/AE). Shared-vs-per-case is selected by **which hook** the workspace extension is registered on: `beforeAll` = shared workspace, `beforeEach` = per-case. The mechanism is a **reset-based workspace pool** (workers share a workspace, or draw from a pool that is reset to original — git clean / snapshot — between uses), not container-per-instance. +- **A2. Per-case workspace spec lives in dataset `vars.workspace`** (not only in an `extensions/workspace.config.yaml`). Workspace "is part of the dataset": the `beforeEach` extension reads `vars.workspace` from the test context; shared/global config in a config file is still allowed for run-wide defaults. Amend U2/U4 to consume `vars.workspace`. +- **A3. Ship a built-in, auto-registered `agentv:workspace` / `agentv:skills` scheme** alongside `file://`. The common case needs no copied script: `extensions: [agentv:workspace:beforeAll]`. `file://path:function` remains for custom extensions. (Original text's `file://`-only model becomes the *custom* path, not the only path.) +- **A4. Grading contract unchanged: reuse `EvaluationScore`.** Extensions never own `grading.json`; the `pass`/`score`/`evidence` triple stays in the existing `EvaluationScore` (`score` + `verdict` + `assertions[{text,passed,evidence}]`). +- **A5. ADR sequencing.** The proposed ADR 0014 must note that a broader superseding ADR (reversing parts of ADR-0013: `assert`, grader-type names, removal of `tests[].input`) is incoming, so 0014 does not re-entrench `input`/`assertions`. + +The rest of the plan (Promptfoo `file://path:function` refs, four hook names, hard-remove core `workspace`, typed outputs instead of env-var side channels, JS-in-process/Python-subprocess, canonical run bundle, skills extension) stands as written. + ## Goal Capsule - **Objective:** Adopt a Promptfoo-compatible eval authoring contract for extension hooks, remove `workspace` as an AgentV core primitive, and make the PR 679 parity example reusable through generalized workspace and skills extensions. From 48d322495972be24432c7f994ce3496badae8ebb Mon Sep 17 00:00:00 2001 From: Christopher Date: Thu, 2 Jul 2026 09:41:28 +1000 Subject: [PATCH 3/6] docs(plans): rename skills extension -> agent_rules; clarify grading contract origin - A6: the staging extension covers skills + hooks + subagents + rules, so rename to agentv:agent_rules (package agent-rules, context agent_rules_paths). skills is one kind of agent rule, not the extension name. - A4: note grading.json originates from agentskills (assertion_results + summary); AgentV adds string verdict (pass/fail/skip) + fractional score as a superset. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...-07-01-001-feat-promptfoo-compatible-extensions-plan.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md index ada58a92a..0330b98f5 100644 --- a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md +++ b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md @@ -16,11 +16,12 @@ This plan is the **extensions/workspace implementation slice** of the wider prom - **A1. Isolation is hook-derived, not a config field.** Remove the `isolation: per_case` config knob (KTD2/U4/AE). Shared-vs-per-case is selected by **which hook** the workspace extension is registered on: `beforeAll` = shared workspace, `beforeEach` = per-case. The mechanism is a **reset-based workspace pool** (workers share a workspace, or draw from a pool that is reset to original — git clean / snapshot — between uses), not container-per-instance. - **A2. Per-case workspace spec lives in dataset `vars.workspace`** (not only in an `extensions/workspace.config.yaml`). Workspace "is part of the dataset": the `beforeEach` extension reads `vars.workspace` from the test context; shared/global config in a config file is still allowed for run-wide defaults. Amend U2/U4 to consume `vars.workspace`. -- **A3. Ship a built-in, auto-registered `agentv:workspace` / `agentv:skills` scheme** alongside `file://`. The common case needs no copied script: `extensions: [agentv:workspace:beforeAll]`. `file://path:function` remains for custom extensions. (Original text's `file://`-only model becomes the *custom* path, not the only path.) -- **A4. Grading contract unchanged: reuse `EvaluationScore`.** Extensions never own `grading.json`; the `pass`/`score`/`evidence` triple stays in the existing `EvaluationScore` (`score` + `verdict` + `assertions[{text,passed,evidence}]`). +- **A3. Ship a built-in, auto-registered `agentv:workspace` / `agentv:agent_rules` scheme** alongside `file://`. The common case needs no copied script: `extensions: [agentv:workspace:beforeAll]`. `file://path:function` remains for custom extensions. (Original text's `file://`-only model becomes the *custom* path, not the only path.) +- **A4. Grading contract unchanged: reuse `EvaluationScore`.** Extensions never own `grading.json`. The contract originates from agentskills (`assertion_results[{text,passed,evidence}]` + `summary` counts); AgentV keeps that per-assertion shape and adds a top-level string `verdict` (`pass`/`fail`/`skip`) + fractional `score` as a superset. Not a boolean. - **A5. ADR sequencing.** The proposed ADR 0014 must note that a broader superseding ADR (reversing parts of ADR-0013: `assert`, grader-type names, removal of `tests[].input`) is incoming, so 0014 does not re-entrench `input`/`assertions`. +- **A6. Rename the skills extension → `agent_rules` (U5).** It stages more than skills — **skills, hooks, subagents/agents, and other agent rules** into the workspace. Rename: built-in `agentv:agent_rules`, package `packages/extensions/agent-rules`, provider context `agent_rules_paths` (typed map covering skills/hooks/agents), not `skill_paths`. `skills` becomes one *kind* of agent rule, not the extension name. Update all U5 files/tests accordingly. -The rest of the plan (Promptfoo `file://path:function` refs, four hook names, hard-remove core `workspace`, typed outputs instead of env-var side channels, JS-in-process/Python-subprocess, canonical run bundle, skills extension) stands as written. +The rest of the plan (Promptfoo `file://path:function` refs, four hook names, hard-remove core `workspace`, typed outputs instead of env-var side channels, JS-in-process/Python-subprocess, canonical run bundle, the agent-rules extension) stands as written. ## Goal Capsule From 909eccc506cf2d6759df89ba7f93e4cb664c8b3a Mon Sep 17 00:00:00 2001 From: Christopher Date: Thu, 2 Jul 2026 09:48:33 +1000 Subject: [PATCH 4/6] docs(plans): use kebab agent-rules for the scheme identifier (agent_rules_paths field stays snake_case) Co-Authored-By: Claude Opus 4.8 (1M context) --- ...026-07-01-001-feat-promptfoo-compatible-extensions-plan.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md index 0330b98f5..2f0ccb336 100644 --- a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md +++ b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md @@ -16,10 +16,10 @@ This plan is the **extensions/workspace implementation slice** of the wider prom - **A1. Isolation is hook-derived, not a config field.** Remove the `isolation: per_case` config knob (KTD2/U4/AE). Shared-vs-per-case is selected by **which hook** the workspace extension is registered on: `beforeAll` = shared workspace, `beforeEach` = per-case. The mechanism is a **reset-based workspace pool** (workers share a workspace, or draw from a pool that is reset to original — git clean / snapshot — between uses), not container-per-instance. - **A2. Per-case workspace spec lives in dataset `vars.workspace`** (not only in an `extensions/workspace.config.yaml`). Workspace "is part of the dataset": the `beforeEach` extension reads `vars.workspace` from the test context; shared/global config in a config file is still allowed for run-wide defaults. Amend U2/U4 to consume `vars.workspace`. -- **A3. Ship a built-in, auto-registered `agentv:workspace` / `agentv:agent_rules` scheme** alongside `file://`. The common case needs no copied script: `extensions: [agentv:workspace:beforeAll]`. `file://path:function` remains for custom extensions. (Original text's `file://`-only model becomes the *custom* path, not the only path.) +- **A3. Ship a built-in, auto-registered `agentv:workspace` / `agentv:agent-rules` scheme** alongside `file://`. The common case needs no copied script: `extensions: [agentv:workspace:beforeAll]`. `file://path:function` remains for custom extensions. (Original text's `file://`-only model becomes the *custom* path, not the only path.) - **A4. Grading contract unchanged: reuse `EvaluationScore`.** Extensions never own `grading.json`. The contract originates from agentskills (`assertion_results[{text,passed,evidence}]` + `summary` counts); AgentV keeps that per-assertion shape and adds a top-level string `verdict` (`pass`/`fail`/`skip`) + fractional `score` as a superset. Not a boolean. - **A5. ADR sequencing.** The proposed ADR 0014 must note that a broader superseding ADR (reversing parts of ADR-0013: `assert`, grader-type names, removal of `tests[].input`) is incoming, so 0014 does not re-entrench `input`/`assertions`. -- **A6. Rename the skills extension → `agent_rules` (U5).** It stages more than skills — **skills, hooks, subagents/agents, and other agent rules** into the workspace. Rename: built-in `agentv:agent_rules`, package `packages/extensions/agent-rules`, provider context `agent_rules_paths` (typed map covering skills/hooks/agents), not `skill_paths`. `skills` becomes one *kind* of agent rule, not the extension name. Update all U5 files/tests accordingly. +- **A6. Rename the skills extension → `agent-rules` (U5).** It stages more than skills — **skills, hooks, subagents/agents, and other agent rules** into the workspace. Rename: built-in `agentv:agent-rules`, package `packages/extensions/agent-rules`, provider context `agent_rules_paths` (typed map covering skills/hooks/agents), not `skill_paths`. `skills` becomes one *kind* of agent rule, not the extension name. Update all U5 files/tests accordingly. The rest of the plan (Promptfoo `file://path:function` refs, four hook names, hard-remove core `workspace`, typed outputs instead of env-var side channels, JS-in-process/Python-subprocess, canonical run bundle, the agent-rules extension) stands as written. From a640beaadfaa81857814398696ae9165ac5ca0ac Mon Sep 17 00:00:00 2001 From: Christopher Date: Thu, 2 Jul 2026 11:49:41 +1000 Subject: [PATCH 5/6] =?UTF-8?q?docs(plans):=20amend=20A7=20=E2=80=94=20rep?= =?UTF-8?q?o=20provisioning=20is=20a=20declarative=20field,=20not=20an=20e?= =?UTF-8?q?xtension=20(per=20ADR-0016/0017)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Narrows this PR: do NOT move repo materialization into an extension. Repo acquisition stays harness-core (declarative workspace.repos field + resolver, materialized before hooks). Extensions cover only non-provisioning setup (agent-rules, custom hooks). Names the superseding ADRs (0016 authoring, 0017 output/resolver) referenced in A5. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...026-07-01-001-feat-promptfoo-compatible-extensions-plan.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md index 2f0ccb336..5017ea60f 100644 --- a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md +++ b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md @@ -21,7 +21,9 @@ This plan is the **extensions/workspace implementation slice** of the wider prom - **A5. ADR sequencing.** The proposed ADR 0014 must note that a broader superseding ADR (reversing parts of ADR-0013: `assert`, grader-type names, removal of `tests[].input`) is incoming, so 0014 does not re-entrench `input`/`assertions`. - **A6. Rename the skills extension → `agent-rules` (U5).** It stages more than skills — **skills, hooks, subagents/agents, and other agent rules** into the workspace. Rename: built-in `agentv:agent-rules`, package `packages/extensions/agent-rules`, provider context `agent_rules_paths` (typed map covering skills/hooks/agents), not `skill_paths`. `skills` becomes one *kind* of agent rule, not the extension name. Update all U5 files/tests accordingly. -The rest of the plan (Promptfoo `file://path:function` refs, four hook names, hard-remove core `workspace`, typed outputs instead of env-var side channels, JS-in-process/Python-subprocess, canonical run bundle, the agent-rules extension) stands as written. +- **A7. Repo provisioning is a declarative `workspace.repos` FIELD, not an extension (finalized in ADR-0016 pt10 / ADR-0017).** This narrows this PR's scope: **do NOT move repo materialization into an extension.** Repo acquisition stays harness-core — a declarative `workspace.repos` field (provenance) the harness materializes (resolver backends) BEFORE hooks. Extensions here are only for **non-provisioning** setup: `agentv:agent-rules` (skills/hooks/agents staging) and custom `file://` lifecycle hooks. `isolation` is a `workspace` field, not a hook choice. The broader superseding ADR referenced in A5 is **ADR-0016** (authoring) + **ADR-0017** (output/resolver). + +The rest of the plan (Promptfoo `file://path:function` refs, four hook names, typed outputs instead of env-var side channels, JS-in-process/Python-subprocess, canonical run bundle, the agent-rules extension) stands as written — EXCEPT the removal of core `workspace`: repo provisioning is retained as a declarative field per A7 (only `workspace.hooks` moves to extensions). ## Goal Capsule From 8fe4811b957c997361caf87f3dd39d5f9d1ab933 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Thu, 2 Jul 2026 10:41:26 +0200 Subject: [PATCH 6/6] style(plans): remove trailing whitespace from promptfoo plan --- ...01-feat-promptfoo-compatible-extensions-plan.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md index 5017ea60f..2a62e22a0 100644 --- a/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md +++ b/docs/plans/2026-07-01-001-feat-promptfoo-compatible-extensions-plan.md @@ -162,7 +162,7 @@ The same shape can later be mirrored in AgentV examples with non-sensitive fixtu - **Files:** `packages/core/src/evaluation/types.ts`, `packages/core/src/evaluation/yaml-parser.ts`, `packages/core/src/evaluation/validation/eval-file.schema.ts`, `packages/core/src/evaluation/validation/eval-validator.ts`, `packages/core/test/evaluation/validation/eval-file-schema.test.ts`, `packages/core/test/evaluation/validation/eval-validator.test.ts`, `packages/core/test/evaluation/yaml-parser-metadata.test.ts` - **Approach:** Add an `extensions` field to raw suite parsing and normalized suite metadata. Validate `file://path:function` references, recognize the Promptfoo hook names, reject missing functions for v1 unless a deliberate default-export convention is documented, and resolve relative paths from the eval file directory. - **Patterns to follow:** Existing `default_test` validation, file-reference validation, and Promptfoo's `getExtensionHookName` behavior as compatibility input. -- **Test scenarios:** +- **Test scenarios:** - Parse a suite with `extensions: ["file://extensions/workspace.ts:beforeAll"]` and confirm the normalized suite retains the extension reference. - Reject an extension not starting with `file://`. - Reject a known hook typo such as `before_all` with a message that names `beforeAll`. @@ -177,7 +177,7 @@ The same shape can later be mirrored in AgentV examples with non-sensitive fixtu - **Files:** `packages/core/src/evaluation/extensions/runner.ts`, `packages/core/src/evaluation/extensions/types.ts`, `packages/core/src/evaluation/orchestrator.ts`, `packages/core/test/evaluation/extensions/runner.test.ts`, `packages/core/test/evaluation/orchestrator.test.ts` - **Approach:** Add a small extension runner that loads local JavaScript/TypeScript hook modules, invokes only hooks matching the current lifecycle, and merges returned AgentV-normalized runtime fields. Core should understand generic fields such as `cwd`, `env`, `metadata`, `artifacts`, `cleanup`, and provider context; it should not understand workspace repos, templates, Docker, or skill-copy rules. - **Patterns to follow:** Promptfoo hook filtering semantics, AgentV `executeWorkspaceScript` subprocess discipline, and orchestrator hook execution ordering. -- **Test scenarios:** +- **Test scenarios:** - Two `beforeAll` extensions run in order and the second sees normalized state returned by the first. - A `beforeEach` extension does not run for `beforeAll`. - An extension error fails the run at a clear `setup` stage with the extension path in the message. @@ -192,7 +192,7 @@ The same shape can later be mirrored in AgentV examples with non-sensitive fixtu - **Files:** `packages/extensions/workspace/src/index.ts`, `packages/extensions/workspace/src/types.ts`, `packages/extensions/workspace/test/workspace-extension.test.ts`, `packages/core/src/evaluation/workspace/setup.ts`, `packages/core/src/evaluation/workspace/repo-config-parser.ts`, `packages/core/test/evaluation/workspace-config-parsing.test.ts`, `packages/core/test/evaluation/workspace/resolve.test.ts`, `packages/core/test/evaluation/orchestrator.test.ts` - **Approach:** Move the current workspace materialization code behind a bundled extension package or equivalent non-core module. The extension reads workspace config, materializes repos/templates/Docker/hooks, then returns generic runtime context to core. Keep machine-local `workspace_path` and mirror paths in `.agentv/config.local.yaml` or extension-local overlay files. Make top-level `workspace` a validation error with migration guidance. - **Patterns to follow:** `examples/features/workspace-shared-config/workspace.yaml` for reusable config and `packages/core/src/evaluation/workspace/setup.ts` as the extraction source. -- **Test scenarios:** +- **Test scenarios:** - A workspace extension returns a config with pinned repos and the orchestrator materializes them through the existing repo manager. - A workspace extension returns `isolation: per_case` and each test receives a separate workspace. - A machine-local mirror path in a local overlay is not serialized into the portable eval suite. @@ -207,7 +207,7 @@ The same shape can later be mirrored in AgentV examples with non-sensitive fixtu - **Files:** `packages/extensions/skills/src/index.ts`, `packages/extensions/skills/src/types.ts`, `packages/extensions/skills/test/skills-extension.test.ts`, `packages/core/src/evaluation/loaders/agent-skills-parser.ts`, `packages/core/src/evaluation/providers/claude-cli.ts`, `packages/core/src/evaluation/providers/copilot-sdk.ts`, `packages/core/src/evaluation/providers/codex-cli.ts`, `packages/core/src/evaluation/providers/pi-cli.ts`, `packages/core/test/evaluation/loaders/agent-skills-parser.test.ts`, `packages/core/test/evaluation/providers/copilot-sdk.test.ts` - **Approach:** Let the skills extension copy or generate skill directories into the prepared runtime directory and return normalized `skill_paths` provider context. Providers that support explicit skill paths should consume those paths from provider request context. Existing `metadata.agent_skills_files` handling can remain as import compatibility but should not be the preferred authoring path. - **Patterns to follow:** `packages/core/src/evaluation/loaders/agent-skills-parser.ts` for Agent Skills import, `packages/core/src/evaluation/providers/copilot-sdk.ts` for auto-discovered skill directories, and `examples/features/copilot-log-eval/` for skill-trigger evidence. -- **Test scenarios:** +- **Test scenarios:** - A skills extension stages a `SKILL.md` directory into a prepared workspace and exposes the staged path to a provider request. - Multiple skills can be staged without overwriting each other. - A missing skill source fails setup with a message naming the source command or path. @@ -222,7 +222,7 @@ The same shape can later be mirrored in AgentV examples with non-sensitive fixtu - **Files:** `docs/examples/promptfoo-compatible-layout.md`, `examples/features/promptfoo-compatible-extensions/README.md`, `examples/features/promptfoo-compatible-extensions/promptfooconfig.yaml`, `examples/features/promptfoo-compatible-extensions/extensions/workspace.ts`, `examples/features/promptfoo-compatible-extensions/extensions/skills.ts`, `examples/features/promptfoo-compatible-extensions/providers/local-agent.yaml`, `examples/features/promptfoo-compatible-extensions/datasets/sample/cases.yaml`, `examples/features/promptfoo-compatible-extensions/rubrics/fractional-rubric.json` - **Approach:** Add a non-sensitive AgentV example that mirrors the private PR 679 layout and documents how the private `wtg-ai-prompts-experiment` paths map into it. Keep the WTG-specific CargoWise fixtures and mirror paths in the private repo unless publishing is explicitly approved. - **Patterns to follow:** Promptfoo parity branch paths under `framework-parity/promptfoo/pr-679/`, AgentV examples under `examples/features/`, and docs guidance in `apps/web/src/content/docs/docs/evaluation/examples.mdx`. -- **Test scenarios:** +- **Test scenarios:** - The example validates with extension-only workspace and skills setup. - Replacing only `datasets/sample/cases.yaml` does not require editing provider or extension files. - The README shows how to map the private PR 679 files into the reusable layout without absolute local paths. @@ -236,7 +236,7 @@ The same shape can later be mirrored in AgentV examples with non-sensitive fixtu - **Files:** `packages/core/scripts/generate-eval-schema.ts`, `apps/cli/src/commands/eval/commands/run.ts`, `apps/cli/src/commands/eval/shared.ts`, `apps/web/src/content/docs/docs/evaluation/eval-files.mdx`, `apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx`, `plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md` - **Approach:** Regenerate schema output, update CLI help and validation examples, and revise the AI-facing eval builder guidance so agents author `extensions` rather than top-level `workspace` for new evals. - **Patterns to follow:** `.agents/workflow.md` note that schema guidance changes update `plugins/agentv-dev/skills/agentv-eval-builder/`. -- **Test scenarios:** +- **Test scenarios:** - Generated eval schema includes `extensions`. - CLI validation output identifies extension errors with file and hook context. - Docs examples use `default_test` rather than Promptfoo's camelCase `defaultTest` unless the example is explicitly showing Promptfoo source compatibility. @@ -250,7 +250,7 @@ The same shape can later be mirrored in AgentV examples with non-sensitive fixtu - **Files:** `examples/features/promptfoo-compatible-extensions/README.md`, private evidence branch contents outside the public repo - **Approach:** Run the smallest live coding-agent eval with a real provider and real LLM grader using the extension-only example. Capture the run bundle, validated config, evidence README, and any screenshots or artifact trees on an `agentv-private` evidence branch per `.agents/verification.md`. - **Patterns to follow:** `.agents/verification.md` live dogfood rules for eval execution, providers, graders, and artifact changes. -- **Test scenarios:** +- **Test scenarios:** - Live provider loads the staged skill from the skills extension. - Workspace extension prepares the expected repo/template state. - LLM grader evaluates the result and the run bundle records canonical AgentV artifacts.