Merged
45 commits
c1d515f
docs(plan): draft plan to restructure eval authoring as a promptfoo s…
christso Jul 1, 2026
e5c08b9
docs(plan): workspace as dataset+extension; drop on_run_complete; ali…
christso Jul 1, 2026
351dd2d
docs(plan): grader execution model — javascript in-process, python su…
christso Jul 1, 2026
3fc5fcf
docs(plan): lock owner decisions (#1-#8) + reconcile with PR #1592
christso Jul 1, 2026
e68ac84
docs(plan): grading.json contract for llm-rubric (main risk) + note P…
christso Jul 1, 2026
bbed52c
docs(plan): reconcile grading.json with agentskills; rename skills ex…
christso Jul 1, 2026
0ea43cf
docs(plan): output format best-of-each (margin aggregate + vercel tra…
christso Jul 1, 2026
e1348cf
docs(plan): resolve index/manifest naming drift; adopt margin-lab .in…
christso Jul 1, 2026
cc7c69e
docs(plan): drop maintained consolidated export (split bundle is sour…
christso Jul 1, 2026
5f29b5d
docs(plan): merge timing.json into metrics.json (fixes ADR-0011/0012 …
christso Jul 1, 2026
2f798a8
docs(plan): drop top-level `providers` alias — importer remaps to `ta…
christso Jul 2, 2026
fb08326
docs(plan): clarify superset is a design property (no shipped importe…
christso Jul 2, 2026
0598ef8
docs(adr): add ADR-0015 multi-turn (execution vs evaluation); window_…
christso Jul 2, 2026
52921be
docs(plan): separate workspace materialization from pooling (§4)
christso Jul 2, 2026
2dec247
docs(plan): §11 quality gate + CargoWise PR-679 live dogfood + two wo…
christso Jul 2, 2026
a4af46c
docs(plan): §11 workspace acquisition performance — mirror-alternates…
christso Jul 2, 2026
fe96cb3
docs(plan): §11.1 canonical workspace resolver — split provenance fro…
christso Jul 2, 2026
e96601c
docs(adr): 0016 authoring-contract + 0017 output/artifact+resolver; s…
christso Jul 2, 2026
3e0b61c
docs(adr-0017): FAIL_TO_PASS/PASS_TO_PASS is a plain code-grader — dr…
christso Jul 2, 2026
346b098
docs: adopt promptfoo-native {{ env.VAR }} for config env (reverse ${…
christso Jul 2, 2026
a73eba9
docs(adr): finalize 0015/0016/0017 (Accepted); repo provisioning = de…
christso Jul 2, 2026
ffbff1e
docs(adr): acquisition is PLUGGABLE (custom backend + beforeAll escap…
christso Jul 2, 2026
c5dc6f7
docs: lock final workspace schema + name (durable 'workspace' field);…
christso Jul 2, 2026
d69db58
docs(adr-0017): cross-check exploitbench — confirms split-bundle/no-D…
christso Jul 2, 2026
5679377
docs(adr-0017): multi-suite runs — one run_id/timestamp per invocatio…
christso Jul 2, 2026
ea139ba
docs(adr): run organization — cross-run .indexes/runs.jsonl; rename r…
christso Jul 2, 2026
f06b655
docs(adr): experiment has ZERO privilege — plain tag, alphabetical, u…
christso Jul 2, 2026
60203ee
docs(adr): experiment — no structural privilege, but auto-default its…
christso Jul 2, 2026
7c9dc1e
docs(adr-0017): full results-tree layout (cross-run .indexes/ vs per-…
christso Jul 2, 2026
83ed251
docs(adr-0017): add cross-run cases.jsonl (case-level filtering); con…
christso Jul 2, 2026
383f64d
docs(adr-0017): lock artifact filenames — keep summary.json (not resu…
christso Jul 2, 2026
8e2d340
docs(adr-0017): correct margin claim — margin uses a RunStore (memory…
christso Jul 2, 2026
7a06839
docs(adr-0017): margin store rebuild nuance — fs->store only for resu…
christso Jul 2, 2026
1b46c04
Remove deprecated eval authoring aliases
christso Jul 2, 2026
5df2530
Fix eval schema drift and grader artifact wording
christso Jul 2, 2026
90b662b
Remove deprecated exported aliases
christso Jul 2, 2026
7617841
Remove programmatic expected_output alias
christso Jul 2, 2026
97a9524
Remove built-in provider aliases
christso Jul 2, 2026
cee6a8d
Clarify generic grader assertion rows
christso Jul 2, 2026
d74a7da
Remove target log_format aliases
christso Jul 2, 2026
e27b6aa
chore(evals): remove numeric required thresholds
christso Jul 2, 2026
eba1442
chore(targets): align azure api_format removal error
christso Jul 2, 2026
8b261b2
feat(evals): accept promptfoo-shaped eval schema fields
christso Jul 2, 2026
916eccd
docs(adr): clarify superseded authoring decisions
christso Jul 2, 2026
0bff279
fix(cli): remove removed judge target from scaffold
christso Jul 2, 2026
2 changes: 2 additions & 0 deletions .agents/product-boundary.md
@@ -80,6 +80,8 @@ Use public reference standards before inventing AgentV-specific contracts:
- Hugging Face Datasets for dataset, split, record, and portable corpus conventions.
- OpenInference for trace, span, tool-call, and model-observability semantics.

+Research those references from local cloned repositories first when a clone is available, and use DeepWiki MCP for repository-level orientation or cross-repo questions. Broad web search is a fallback, not the default. If current public documentation matters for the decision, use official docs and record the exact source or commit alongside the conclusion.
+
Treat these as reference inputs, not dependencies. AgentV should adopt the shared lowest common denominator when it fits the repo-native artifact model, and document any intentional divergence in the relevant plan, ADR, or contract docs.

### 5. YAGNI - You Aren't Gonna Need It
1 change: 1 addition & 0 deletions AGENTS.md
@@ -26,6 +26,7 @@ Design guardrails:
- Document composition patterns before inventing a new feature.
- Match industry-standard lowest-common-denominator contracts when possible.
- When designing AgentV contracts, check public reference standards such as Claude Skills, Vercel agent-eval, Hugging Face Datasets, and OpenInference before inventing AgentV-specific shapes. Use their shared lowest common denominator where it fits, and document any intentional divergence.
+- For peer-framework research, prefer local cloned repositories and DeepWiki MCP over broad web search. If a public contract must be checked for currentness, use official docs and record the source or commit behind the conclusion.
- Apply YAGNI aggressively and solve the current request with the smallest surface that works.
- Keep extensions non-breaking unless a same-week unreleased surface should be hard-corrected.
- Design for AI comprehension with self-describing modules, clear extension points, and no dead scaffolding.
2 changes: 1 addition & 1 deletion apps/cli/src/commands/eval/commands/bundle.ts
@@ -30,7 +30,7 @@ function unique(values: readonly string[]): readonly string[] {

 function targetReferenceNames(target: TargetDefinition): readonly string[] {
   const references: string[] = [];
-  for (const key of ['use_target', 'grader_target', 'judge_target'] as const) {
+  for (const key of ['use_target', 'grader_target'] as const) {
     const value = target[key];
     if (typeof value === 'string' && value.trim().length > 0 && !value.includes('${{')) {
       references.push(value.trim());
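The narrowed key list in this hunk can be read as the self-contained sketch below. `TargetDefinition` here is a simplified stand-in (an assumption, not the repository's real type), and the truncated tail of the function is assumed to simply return the collected names; the loop body mirrors the diff.

```typescript
// Hypothetical, simplified stand-in for the repo's TargetDefinition type.
type TargetDefinition = Record<string, unknown>;

// Keys that may name another target. `judge_target` is no longer scanned.
const REFERENCE_KEYS = ['use_target', 'grader_target'] as const;

function targetReferenceNames(target: TargetDefinition): readonly string[] {
  const references: string[] = [];
  for (const key of REFERENCE_KEYS) {
    const value = target[key];
    // Skip non-strings, blank values, and unresolved `${{ VAR }}` placeholders.
    if (typeof value === 'string' && value.trim().length > 0 && !value.includes('${{')) {
      references.push(value.trim());
    }
  }
  // Assumed ending: the hunk is cut off before the function returns.
  return references;
}
```

Under this sketch a target that still carries `judge_target` contributes no reference for that key, which is the behavioral effect of dropping it from the list.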
4 changes: 2 additions & 2 deletions apps/cli/src/commands/eval/task-bundle.ts
@@ -497,7 +497,7 @@ function buildEvalCase(

 function targetReferenceNames(target: TargetDefinition): readonly string[] {
   const references: string[] = [];
-  for (const key of ['use_target', 'grader_target', 'judge_target'] as const) {
+  for (const key of ['use_target', 'grader_target'] as const) {
     const value = target[key];
     if (typeof value === 'string' && value.trim().length > 0 && !value.includes('${{')) {
       references.push(value.trim());
@@ -831,7 +831,7 @@ async function collectWorkspaceReferences(

   for (const hookName of ['before_all', 'before_each', 'after_each', 'after_all'] as const) {
     const hook = hooks[hookName];
-    const command = hook?.command ?? hook?.script;
+    const command = hook?.command;
     if (!command || command.length === 0) {
       continue;
     }
2 changes: 1 addition & 1 deletion apps/cli/src/commands/pipeline/input.ts
@@ -293,7 +293,7 @@ async function writeGraderConfigs(
       weight: r.weight ?? 1.0,
       ...(r.score_ranges ? { score_range: r.score_ranges } : {}),
       ...(r.required !== undefined ? { required: r.required } : {}),
-      ...(r.required_min_score !== undefined ? { required_min_score: r.required_min_score } : {}),
+      ...(r.min_score !== undefined ? { min_score: r.min_score } : {}),
     }));
 
     await writeJson(join(llmGradersDir, `${config.name}.json`), {
2 changes: 1 addition & 1 deletion apps/cli/src/commands/pipeline/run.ts
@@ -476,7 +476,7 @@ async function writeGraderConfigs(
       weight: r.weight ?? 1.0,
       ...(r.score_ranges ? { score_range: r.score_ranges } : {}),
       ...(r.required !== undefined ? { required: r.required } : {}),
-      ...(r.required_min_score !== undefined ? { required_min_score: r.required_min_score } : {}),
+      ...(r.min_score !== undefined ? { min_score: r.min_score } : {}),
     }));
 
     await writeJson(join(llmGradersDir, `${config.name}.json`), {
2 changes: 1 addition & 1 deletion apps/cli/src/commands/runs/rerun.ts
@@ -141,7 +141,7 @@ function resolveWholeEnvReference(value: unknown): string | undefined {

 function referencedTargetNames(definition: Record<string, unknown>): readonly string[] {
   const names: string[] = [];
-  for (const key of ['use_target', 'grader_target', 'judge_target'] as const) {
+  for (const key of ['use_target', 'grader_target'] as const) {
     const resolved = resolveWholeEnvReference(definition[key]);
     if (resolved && !resolved.includes('${{')) {
       names.push(resolved);
7 changes: 4 additions & 3 deletions apps/cli/src/templates/.agentv/targets.yaml
@@ -1,6 +1,7 @@
 # A list of all supported evaluation targets for the project.
 # Each target defines a provider and its specific configuration.
 # Actual values for paths/keys are stored in the local .env file.
+# Agent and CLI targets use grader_target to reference an LLM target for scoring.
 
 targets:
   - name: default
@@ -12,7 +13,7 @@ targets:

   - name: codex
     provider: codex
-    judge_target: azure-llm
+    grader_target: azure-llm
     # Uses the Codex CLI (defaults to `codex` on PATH)
     # executable: ${{ CODEX_CLI_PATH }} # Optional: override executable path
     # args: # Optional additional CLI arguments
@@ -29,7 +30,7 @@ targets:
   # Claude - Anthropic's Claude Agent SDK
   - name: claude
     provider: claude
-    judge_target: azure-llm
+    grader_target: azure-llm
     # Uses the @anthropic-ai/claude-agent-sdk
     # model: claude-sonnet-4-20250514 # Optional: override model
     # cwd: ${{ CLAUDE_WORKSPACE_DIR }} # Optional: working directory (defaults to process.cwd())
@@ -53,7 +54,7 @@ targets:

   - name: local_cli
     provider: cli
-    judge_target: azure-llm
+    grader_target: azure-llm
     # Passes the fully rendered prompt and any attached files to a local Python script
     # NOTE: Do not add quotes around {PROMPT} or {FILES} - they are already shell-escaped
     command: uv run ./mock_cli.py --prompt {PROMPT} {FILES} --output {OUTPUT_FILE}
42 changes: 42 additions & 0 deletions apps/cli/test/commands/eval/artifact-writer.test.ts
@@ -260,6 +260,48 @@ describe('buildGradingArtifact', () => {
     expect(grading.graders?.[1].score).toBe(0.7);
   });
 
+  it('preserves multi-aspect grader assertions at top level and under the grader', () => {
+    const rubricAssertions = [
+      {
+        text: '[accuracy] Answer matches the reference - Score: 8/10 (strong)',
+        passed: true,
+        evidence: 'The answer includes the expected facts.',
+      },
+      {
+        text: '[citations] Answer cites the source - Score: 4/10 (weak)',
+        passed: false,
+        evidence: 'The answer does not cite a source.',
+      },
+    ];
+    const result = makeResult({
+      assertions: rubricAssertions,
+      scores: [
+        makeEvaluatorResult({
+          name: 'rubric-review',
+          type: 'llm-grader',
+          score: 0.6,
+          assertions: rubricAssertions,
+        }),
+      ],
+    });
+
+    const grading = buildGradingArtifact(result);
+
+    expect(grading.assertions).toEqual(rubricAssertions);
+    expect(grading.summary).toEqual({
+      passed: 1,
+      failed: 1,
+      total: 2,
+      pass_rate: 0.5,
+    });
+    expect(grading.graders?.[0]).toMatchObject({
+      name: 'rubric-review',
+      type: 'llm-grader',
+      score: 0.6,
+      assertions: rubricAssertions,
+    });
+  });
+
   it('keeps grading.json focused on grading evidence', () => {
     const result = makeResult({ error: 'Timeout exceeded' });
     const grading = buildGradingArtifact(result);
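The `summary` expectation in the multi-aspect test (`passed`, `failed`, `total`, `pass_rate`) is consistent with a plain count over the top-level assertions. The sketch below shows one way such a summary could be derived; `summarizeAssertions` and both interfaces are hypothetical names, not the actual `buildGradingArtifact` internals, and returning `0` for an empty list is an assumption.

```typescript
// Hypothetical shapes, modelled on the fields used in the test fixture.
interface AssertionResult {
  text: string;
  passed: boolean;
  evidence?: string;
}

interface GradingSummary {
  passed: number;
  failed: number;
  total: number;
  pass_rate: number;
}

// Counts passed and failed assertions and derives the pass rate.
function summarizeAssertions(assertions: readonly AssertionResult[]): GradingSummary {
  const total = assertions.length;
  const passed = assertions.filter((a) => a.passed).length;
  return {
    passed,
    failed: total - passed,
    total,
    // Assumption: an empty assertion list reports a pass rate of 0.
    pass_rate: total === 0 ? 0 : passed / total,
  };
}
```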
5 changes: 3 additions & 2 deletions apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
@@ -345,15 +345,16 @@ Any grader in `assertions` can be marked as `required`. When a required grader f
| Value | Behavior |
|-------|----------|
| `required: true` | Must score >= 0.8 (default threshold) to pass |
-| `required: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |
+| `required: true` + `min_score: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |

```yaml
 assertions:
   - type: contains
     value: "DENIED"
     required: true # must pass (>= 0.8)
   - type: rubrics
-    required: 0.6 # must score at least 0.6
+    required: true
+    min_score: 0.6 # must score at least 0.6
     criteria:
       - id: quality
         outcome: Response is well-structured
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/evaluation/rubrics.mdx
@@ -74,7 +74,7 @@ assertions:
| `score_ranges` | — | Score range definitions (analytic mode) |

:::note
-`required_min_score` (0–10 integer scale) is deprecated. Use `min_score` (0–1 scale) instead. For example, `required_min_score: 8` becomes `min_score: 0.8`.
+Use `min_score` for analytic rubric gating. The only 0–10 values in authored rubrics are `score_ranges` bands and grader outputs.
:::

### Criterion Operators
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/graders/custom-graders.mdx
@@ -77,7 +77,7 @@ final_score = sum(score_i * weight_i) / sum(weight_i)
```

If `weight` is omitted, it defaults to `1.0` (equal weighting).
-If any grader has `required: true` (or `required: <threshold>`) and scores below its required threshold, the overall test score is forced to `0`.
+If any grader has `required: true` and scores below its required threshold, the overall test score is forced to `0`. Use `min_score` for a custom threshold.
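As a rough illustration of the weighted formula and the required gate described here, the sketch below combines both. `GraderScore` and `finalScore` are hypothetical names; the `0.8` default comes from the eval-cases table in this PR, while reading `min_score` only alongside `required: true` is an assumption about how the two fields interact.

```typescript
// Hypothetical per-grader result, using the field names from the docs.
interface GraderScore {
  name: string;
  score: number; // 0..1
  weight?: number; // defaults to 1.0 when omitted
  required?: boolean;
  min_score?: number; // custom threshold (assumed to apply with required: true)
}

// Default threshold for `required: true`, per the eval-cases docs.
const DEFAULT_REQUIRED_THRESHOLD = 0.8;

function finalScore(graders: readonly GraderScore[]): number {
  // Gate: any required grader below its threshold forces the score to 0.
  for (const g of graders) {
    const threshold = g.min_score ?? DEFAULT_REQUIRED_THRESHOLD;
    if (g.required === true && g.score < threshold) {
      return 0;
    }
  }
  // final_score = sum(score_i * weight_i) / sum(weight_i)
  let weighted = 0;
  let totalWeight = 0;
  for (const g of graders) {
    const weight = g.weight ?? 1.0;
    weighted += g.score * weight;
    totalWeight += weight;
  }
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}
```

With this reading, a required grader scoring `0.7` zeroes the test under the default threshold but passes the gate once `min_score: 0.6` is set.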

## Best Practices

13 changes: 6 additions & 7 deletions apps/web/src/content/docs/docs/guides/workspace-architecture.mdx
@@ -119,13 +119,12 @@ For each materialized repo, AgentV resolves acquisition in this order:

| Order | Source | How it is used |
|-------|--------|----------------|
-| 1 | Explicit resolver | `workspace.repos[].resolver` names a configured command resolver. If it returns `handled:false`, AgentV fails clearly instead of guessing. |
-| 2 | Pattern resolver | The first non-`default` `repo_resolvers[]` entry whose `repos` pattern matches the repo URL or identity. If it returns `handled:false`, AgentV continues to the default resolver. |
-| 3 | Default resolver | The resolver named `default`, if configured. It must not declare `repos`; it is the unconditional project default. If it returns `handled:false`, AgentV continues to the built-in git resolver. |
-| 4 | Registered project | A project in `$AGENTV_HOME/projects.yaml` whose `origin` matches the repo identity. AgentV seeds its mirror cache from that local checkout, then clones the cache into the workspace and resets `origin` to the declared repo URL. |
-| 5 | Configured mirror | A path listed under `git_cache.mirrors`. AgentV seeds its mirror cache from that checkout or bare mirror, then clones the cache into the workspace. |
-| 6 | Mirror cache | An AgentV-owned bare cache under `$AGENTV_DATA_DIR/git-cache/<hash>`. Cache population is locked, cloned into a temporary path, verified, and atomically renamed before use. |
-| 7 | Remote clone | The normalized clone URL from the eval's `repo` field. |
+| 1 | Pattern resolver | The first non-`default` `repo_resolvers[]` entry whose `repos` pattern matches the repo URL or identity. If it returns `handled:false`, AgentV continues to the default resolver. |
+| 2 | Default resolver | The resolver named `default`, if configured. It must not declare `repos`; it is the unconditional project default. If it returns `handled:false`, AgentV continues to the built-in git resolver. |
+| 3 | Registered project | A project in `$AGENTV_HOME/projects.yaml` whose `origin` matches the repo identity. AgentV seeds its mirror cache from that local checkout, then clones the cache into the workspace and resets `origin` to the declared repo URL. |
+| 4 | Configured mirror | A path listed under `git_cache.mirrors`. AgentV seeds its mirror cache from that checkout or bare mirror, then clones the cache into the workspace. |
+| 5 | Mirror cache | An AgentV-owned bare cache under `$AGENTV_DATA_DIR/git-cache/<hash>`. Cache population is locked, cloned into a temporary path, verified, and atomically renamed before use. |
+| 6 | Remote clone | The normalized clone URL from the eval's `repo` field. |
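The fallback order in the updated table can be sketched as a first-match chain. Everything named below (`Acquisition`, `AcquisitionStep`, `tryAcquire`, `acquire`) is hypothetical and only illustrates the "continue on `handled:false`" behavior the table describes; it is not AgentV's resolver API, and it flattens the built-in git sources into ordinary steps for simplicity.

```typescript
// Hypothetical result of one acquisition attempt.
interface Acquisition {
  handled: boolean;
  location?: string; // where the repo content was obtained from
}

// Hypothetical step: a resolver or a built-in git source.
interface AcquisitionStep {
  name: string;
  tryAcquire: (repo: string) => Acquisition;
}

// Walks the steps in order and returns the first one that handles the repo.
function acquire(
  repo: string,
  steps: readonly AcquisitionStep[],
): { step: string; location: string } {
  for (const step of steps) {
    const result = step.tryAcquire(repo);
    if (result.handled && result.location !== undefined) {
      return { step: step.name, location: result.location };
    }
    // handled:false means "not mine": fall through to the next step.
  }
  throw new Error(`No acquisition step handled ${repo}`);
}
```

Keeping the order in a plain array makes the removal of the explicit per-repo resolver visible as a one-entry change rather than a branch in control flow.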

Workspace clones are independent from user-owned checkouts, configured mirrors,
and resolver source directories. AgentV does not leave Git alternates pointing
41 changes: 4 additions & 37 deletions apps/web/src/content/docs/docs/targets/coding-agents.mdx
@@ -5,7 +5,7 @@ sidebar:
order: 3
---

-Coding agent targets evaluate AI coding assistants and CLI-based agents. These targets require a `grader_target` (also accepts `judge_target` for backward compatibility) to run LLM-based graders.
+Coding agent targets evaluate AI coding assistants and CLI-based agents. These targets require a `grader_target` to run LLM-based graders.

## Prompt format

@@ -73,39 +73,6 @@ targets:
| `cwd` | No | Working directory |
| `grader_target` | Yes | LLM target for evaluation |

-## cc-mirror
-
-[cc-mirror](https://github.com/numman-ali/cc-mirror) creates isolated Claude Code variants that route through alternative providers (Z.ai, Kimi, MiniMax, OpenRouter, etc.). The `cc-mirror` provider alias resolves to `claude-cli` and auto-discovers the binary path from `~/.cc-mirror/<variant>/variant.json`.
-
-```yaml
-targets:
-  # Explicit variant with known executable
-  - name: claude-zai
-    provider: cc-mirror
-    executable: claude-zai
-    grader_target: azure-base
-
-  # Auto-discover binary from variant.json
-  - name: my-kimi
-    provider: cc-mirror
-    grader_target: azure-base
-```
-
-| Field | Required | Description |
-|-------|----------|-------------|
-| `executable` | No | CLI binary name or path. When set, used directly (skips variant.json lookup). |
-| `variant` | No | Variant name (directory under `~/.cc-mirror/`). Defaults to target `name`. Used to locate `variant.json` when `executable` is not set. |
-| `cwd` | No | Working directory |
-| `grader_target` | Yes | LLM target for evaluation |
-
-Setup a variant first, then reference it by name:
-
-```bash
-npx cc-mirror quick --provider zai --name claude-zai --api-key "$Z_AI_API_KEY"
-```
-
-Since `cc-mirror` resolves to `claude-cli`, all Claude target fields (model, system_prompt, timeout_seconds, etc.) are also supported.
-
## Codex CLI

```yaml
@@ -131,7 +98,7 @@ targets:
 ```yaml
 targets:
   - name: copilot
-    provider: copilot
+    provider: copilot-cli
     model: gpt-5-mini
     grader_target: azure-base
 ```
@@ -140,7 +107,7 @@ targets:
|-------|----------|-------------|
| `model` | No | Model to use (defaults to copilot's default) |
| `cwd` | No | Working directory |
-| `subprovider` | No | OpenAI-compatible provider type for `copilot`, `copilot-cli`, or `copilot-sdk`, such as `openai` or `azure` |
+| `subprovider` | No | OpenAI-compatible provider type for `copilot-cli` or `copilot-sdk`, such as `openai` or `azure` |
| `base_url` | No | Provider base URL or Azure resource URL/name |
| `api_key` | No | Provider API key. Prefer `${{ ENV_VAR }}` references. |
| `bearer_token` | No | Provider bearer token. Prefer `${{ ENV_VAR }}` references. Takes precedence over `api_key` when set. |
@@ -308,7 +275,7 @@ The VS Code provider uses a **subagent file-messaging architecture**. AgentV pro
 ```yaml
 targets:
   - name: copilot
-    provider: copilot
+    provider: copilot-cli
     executable: ${{ COPILOT_EXE }}
     grader_target: azure-base
 ```
3 changes: 1 addition & 2 deletions apps/web/src/content/docs/docs/targets/configuration.mdx
@@ -76,7 +76,7 @@ tests:

## Grader Target

-Agent targets that need LLM-based evaluation specify a `grader_target` (also accepts `judge_target` for backward compatibility) — the LLM used to run LLM grader graders:
+Agent targets that need LLM-based evaluation specify a `grader_target` — the LLM used to run LLM grader graders:

```yaml
targets:
@@ -171,7 +171,6 @@ workspace:
|-------|-------------|
| `repos[].path` | Directory within the workspace to clone into |
| `repos[].repo` | Repository identity: full clone URL or GitHub `org/name` shorthand |
-| `repos[].resolver` | Optional configured `repo_resolvers[].name` override |
| `repos[].commit` | Branch, tag, or SHA to check out (default: `HEAD`) |
| `repos[].base_commit` | Alias for `commit`, useful for SWE-bench-style datasets |
| `repos[].ancestor` | Walk N commits back from the checked-out ref (e.g., `1` for parent) |
2 changes: 1 addition & 1 deletion apps/web/src/content/docs/docs/tools/dashboard.mdx
@@ -314,7 +314,7 @@ results:
auto_push: false
```

-Project-local `.agentv/config.yaml` is for portable eval defaults such as `execution`, `eval_patterns`, and `dashboard`. Do not put `projects` in project-local config; AgentV warns and ignores it there. `results_by_project` is deprecated; use `projects[].results` in `$AGENTV_HOME/config.yaml`.
+Project-local `.agentv/config.yaml` is for portable eval defaults such as `execution`, `eval_patterns`, and `dashboard`. Do not put `projects` in project-local config; AgentV warns and ignores it there. Put per-project results settings in `projects[].results` in `$AGENTV_HOME/config.yaml`.

The project `repo` and the `results` block sync different repositories:

@@ -4,7 +4,7 @@ Date: 2026-07-01

## Status

-Accepted
+Accepted, then **superseded** (eval-authoring portions) by [ADR 0016](0016-promptfoo-superset-eval-authoring-contract.md) as part of the promptfoo-superset restructure (2026-07-02).

Extends [ADR 0009](0009-eval-path-result-identity-and-default-experiment.md) and
builds on [ADR 0012](0012-finalize-run-artifact-layout.md), which established
2 changes: 1 addition & 1 deletion docs/adr/0013-stabilize-eval-authoring-contract.md
@@ -4,7 +4,7 @@ Date: 2026-07-01

## Status

-Accepted
+Accepted, then **superseded** (eval-authoring portions) by [ADR 0016](0016-promptfoo-superset-eval-authoring-contract.md) as part of the promptfoo-superset restructure (2026-07-02).

Supersedes the eval-authoring placement portions of
[ADR 0002](0002-keep-harbor-benchmark-execution-behind-runner-boundary.md),