diff --git a/.agents/verification.md b/.agents/verification.md
index f955798ea..2bc524901 100644
--- a/.agents/verification.md
+++ b/.agents/verification.md
@@ -141,12 +141,12 @@ Use live dogfood before marking PRs ready when they affect eval execution, exper
 - Prefer the smallest realistic eval: one or two cases, bounded timeouts, and `workers: 1` for heavyweight agent providers.
 - For artifact/result contract changes, prefer letting AgentV choose the canonical run directory and capture the printed `Artifact workspace written to:` and `Results written to:` paths for evidence. Do not precompute `--output` unless the test specifically needs a fixed path.
 - For native experiment changes, run through `agentv eval run ... --experiment ` so resolution, setup, scripts, target selection, run knobs, and artifact metadata are exercised together.
-- For repeat-run changes, use top-level `runs >= 2` when validating repeated attempts. Inspect root `index.jsonl`, root `benchmark.json`, and the repeated case folder. The repeated case folder should carry aggregate `summary.json` with flattened snake_case timing fields plus AgentV aggregate `grading.json`; attempt-specific outputs, transcripts, and metrics live under `run-N/`. Each `run-N/` folder should contain `result.json`, `grading.json`, `metrics.json`, `transcript.jsonl`, `transcript-raw.jsonl`, and `outputs/answer.md` when answer output is available. `result.json` should point at `./grading.json`, `./metrics.json`, `./transcript.jsonl`, and `./transcript-raw.jsonl` through the corresponding path fields.
+- For repeat-run changes, use `evaluate_options.repeat.count >= 2` when validating repeated attempts. Inspect root `index.jsonl`, root `summary.json`, and the repeated case folder. Use `repeat` for authored configuration and `attempts[]` for produced executions. The repeated case folder should carry aggregate `summary.json` with flattened snake_case timing fields; attempt-specific outputs, transcripts, and metrics live under `attempt-N/`. Each `attempt-N/` folder should contain `result.json`, `grading.json`, `metrics.json`, `transcript.json`, `transcript-raw.jsonl`, and `outputs/answer.md` when answer output is available. `result.json` should point at `./grading.json`, `./metrics.json`, `./transcript.json`, and `./transcript-raw.jsonl` through the corresponding path fields.
 - For local OpenAI-compatible grading through the OAuth proxy, use `endpoint: http://127.0.0.1:10531/v1`, but still route `api_key` and `model` through environment references such as `${{ LOCAL_OPENAI_PROXY_API_KEY }}` and `${{ LOCAL_OPENAI_PROXY_MODEL }}`. Literal secrets and literal model values are intentionally rejected by target validation unless a resolver explicitly allows them.
 - For `codex`/Codex SDK live dogfood through the same local proxy, configure the agent target with `provider: codex`, `base_url: ${{ LOCAL_OPENAI_PROXY_BASE_URL }}`, `api_key: ${{ LOCAL_OPENAI_PROXY_API_KEY }}`, `model: ${{ LOCAL_OPENAI_PROXY_MODEL }}`, `api_format: responses`, `grader_target: `, `workers: 1`, and a bounded `timeout_seconds`. Configure the grader target as `provider: openai`, `api_format: chat`, and the same local proxy env references. A minimal run should use `bun apps/cli/src/cli.ts eval run --targets --target --workers 1`.
 - If the local proxy returns `401 token_expired`, the blocker is stale Codex OAuth, not AgentV target configuration. Refresh from a trusted local terminal with `codex logout`, `codex login --device-auth`, then restart `openai-oauth` and rerun the same eval command.
 - Preserve review evidence in `agentv-private` on an orphan `evidence/` branch. Include the run bundle, source eval/experiment/targets files, a short README, an artifact tree, contract checks, and screenshots when folder structure or UI behavior is under review.
-- If comparing against an external convention such as Vercel `agent-eval`, verify both semantic provenance and the physical `run-N` artifact layout for repeat runs.
+- If comparing against an external convention such as Vercel `agent-eval`, verify both semantic provenance and the physical `attempt-N` artifact layout for repeat runs.
 - For transcript/result artifact contract changes, try the same provider spread before merging: `pi-cli`, `codex-sdk`, and `copilot-sdk` through the local OpenAI-compatible endpoint when available. If a provider cannot run live, record the exact blocker, the run bundle or command output, and whether coverage moved to fixture/regression tests.
 - If dogfood or review changes the durable verification playbook, update this file or `AGENTS.md` in the same PR. Use `docs/solutions/` for longer reusable lessons rather than relying on PR comments or private evidence as the only source.
diff --git a/apps/cli/src/commands/results/export.ts b/apps/cli/src/commands/results/export.ts
index aa0994624..3bda4e2c7 100644
--- a/apps/cli/src/commands/results/export.ts
+++ b/apps/cli/src/commands/results/export.ts
@@ -8,9 +8,9 @@
  * index.jsonl — per-test manifest with artifact pointers
  * /
  * summary.json — per-case aggregate
- * run-1/result.json — per-run result
- * run-1/grading.json — per-run grading artifact (assertions, graders)
- * run-1/metrics.json — per-run metrics artifact
+ * attempt-1/result.json — per-attempt result
+ * attempt-1/grading.json — per-attempt grading artifact (assertions, graders)
+ * attempt-1/metrics.json — per-attempt metrics artifact
  *
  * This module delegates artifact building to the shared artifact-writer so
  * that summary/grading/timing schemas stay aligned with `agentv eval`.
diff --git a/apps/cli/src/commands/results/manifest.ts b/apps/cli/src/commands/results/manifest.ts index 412757a59..51644457f 100644 --- a/apps/cli/src/commands/results/manifest.ts +++ b/apps/cli/src/commands/results/manifest.ts @@ -34,8 +34,17 @@ export interface ResultManifestRecord { readonly variant?: string; readonly score: number; readonly scores?: readonly Record[]; + readonly attempts?: readonly { + readonly attempt?: number; + readonly attempt_path?: string; + readonly run_path?: string; + readonly score?: number; + readonly verdict?: string; + readonly [key: string]: unknown; + }[]; readonly trials?: readonly { readonly attempt?: number; + readonly attempt_path?: string; readonly run_path?: string; readonly score?: number; readonly verdict?: string; diff --git a/apps/cli/src/commands/results/serve.ts b/apps/cli/src/commands/results/serve.ts index b21e4fe4d..b98f71a0e 100644 --- a/apps/cli/src/commands/results/serve.ts +++ b/apps/cli/src/commands/results/serve.ts @@ -805,8 +805,9 @@ function addTrialRunCatalogEntries( ? normalizeArtifactRelativePath(record.result_dir) : undefined; if (!resultDir) return; - for (const trial of record.trials ?? []) { - const runPath = trial.run_path ? normalizeArtifactRelativePath(trial.run_path) : undefined; + for (const trial of record.attempts ?? record.trials ?? []) { + const rawPath = typeof trial.attempt_path === 'string' ? trial.attempt_path : trial.run_path; + const runPath = rawPath ? normalizeArtifactRelativePath(rawPath) : undefined; if (!runPath) continue; const runDir = path.posix.join(resultDir, runPath); addDirectArtifactCatalogEntry( @@ -1145,13 +1146,15 @@ function buildRepeatTrialReadModels( baseDir: string, record: ResultManifestRecord, ): Array> | undefined { - if (!record.trials || record.trials.length === 0) return undefined; + const attempts = record.attempts ?? record.trials; + if (!attempts || attempts.length === 0) return undefined; const resultDir = record.result_dir ? 
normalizeArtifactRelativePath(record.result_dir) : undefined; - return record.trials.map((trial) => { - const runPath = trial.run_path ? normalizeArtifactRelativePath(trial.run_path) : undefined; + return attempts.map((trial) => { + const rawPath = typeof trial.attempt_path === 'string' ? trial.attempt_path : trial.run_path; + const runPath = rawPath ? normalizeArtifactRelativePath(rawPath) : undefined; const metricsPath = caseTrialArtifactPath(resultDir, runPath, 'metrics.json'); const timingPath = caseTrialArtifactPath(resultDir, runPath, 'timing.json'); const gradingPath = caseTrialArtifactPath(resultDir, runPath, 'grading.json'); @@ -1202,7 +1205,7 @@ function attachRunDetailReadModelFields>( return results.map((result, index) => { const record = records[index]; if (!record) return result; - const trials = buildRepeatTrialReadModels(baseDir, record); + const attempts = buildRepeatTrialReadModels(baseDir, record); return { ...result, ...(record.aggregation && { aggregation: record.aggregation }), @@ -1217,7 +1220,7 @@ function attachRunDetailReadModelFields>( ...(record.transcript_raw_path && { transcript_raw_path: record.transcript_raw_path }), ...(record.output_path && { output_path: record.output_path }), ...(record.answer_path && { answer_path: record.answer_path }), - ...(trials && { trials }), + ...(attempts && { attempts }), }; }); } diff --git a/apps/cli/src/commands/results/validate.ts b/apps/cli/src/commands/results/validate.ts index f9d102e87..44e056202 100644 --- a/apps/cli/src/commands/results/validate.ts +++ b/apps/cli/src/commands/results/validate.ts @@ -40,7 +40,8 @@ interface IndexEntry { readonly grading_path?: string; readonly timing_path?: string; readonly result_dir?: string; - readonly trials?: readonly { readonly run_path?: string }[]; + readonly attempts?: readonly { readonly attempt_path?: string; readonly run_path?: string }[]; + readonly trials?: readonly { readonly attempt_path?: string; readonly run_path?: string }[]; readonly [key: 
string]: unknown; } @@ -242,23 +243,24 @@ function checkArtifactFiles(runDir: string, entries: IndexEntry[]): Diagnostic[] } } - for (const trial of entry.trials ?? []) { - if (!entry.result_dir || !trial.run_path) { + for (const attempt of entry.attempts ?? entry.trials ?? []) { + const attemptPath = attempt.attempt_path ?? attempt.run_path; + if (!entry.result_dir || !attemptPath) { continue; } - const runDirPath = path.join(runDir, entry.result_dir, trial.run_path); + const runDirPath = path.join(runDir, entry.result_dir, attemptPath); const resultPath = path.join(runDirPath, 'result.json'); const gradingPath = path.join(runDirPath, 'grading.json'); if (!existsSync(resultPath)) { diagnostics.push({ severity: 'error', - message: `${testId}: result.json not found at '${path.posix.join(entry.result_dir, trial.run_path, 'result.json')}'`, + message: `${testId}: result.json not found at '${path.posix.join(entry.result_dir, attemptPath, 'result.json')}'`, }); } if (!existsSync(gradingPath)) { diagnostics.push({ severity: 'error', - message: `${testId}: grading.json not found at '${path.posix.join(entry.result_dir, trial.run_path, 'grading.json')}'`, + message: `${testId}: grading.json not found at '${path.posix.join(entry.result_dir, attemptPath, 'grading.json')}'`, }); } } diff --git a/apps/cli/test/commands/eval/aggregate.test.ts b/apps/cli/test/commands/eval/aggregate.test.ts index b6a7763a9..3de6ed4a4 100644 --- a/apps/cli/test/commands/eval/aggregate.test.ts +++ b/apps/cli/test/commands/eval/aggregate.test.ts @@ -289,17 +289,17 @@ describe('writePerTestArtifacts', () => { await writePerTestArtifacts(results, tmpDir); const grading1 = JSON.parse( - readFileSync(rowRunPath(tmpDir, 'test-1', 'run-1', 'grading.json'), 'utf8'), + readFileSync(rowRunPath(tmpDir, 'test-1', 'attempt-1', 'grading.json'), 'utf8'), ); expect(grading1.assertions).toHaveLength(1); const timing1 = JSON.parse( - readFileSync(rowRunPath(tmpDir, 'test-1', 'run-1', 'timing.json'), 'utf8'), + 
readFileSync(rowRunPath(tmpDir, 'test-1', 'attempt-1', 'timing.json'), 'utf8'), ); expect(timing1.total_tokens).toBeGreaterThanOrEqual(0); const grading2 = JSON.parse( - readFileSync(rowRunPath(tmpDir, 'test-2', 'run-1', 'grading.json'), 'utf8'), + readFileSync(rowRunPath(tmpDir, 'test-2', 'attempt-1', 'grading.json'), 'utf8'), ); expect(grading2.assertions).toHaveLength(1); }); @@ -310,7 +310,7 @@ describe('writePerTestArtifacts', () => { await writePerTestArtifacts(results, tmpDir); const answer = readFileSync( - rowRunPath(tmpDir, 'test-1', 'run-1', 'outputs', 'answer.md'), + rowRunPath(tmpDir, 'test-1', 'attempt-1', 'outputs', 'answer.md'), 'utf8', ); expect(answer).toContain('hello'); diff --git a/apps/cli/test/commands/eval/artifact-writer.test.ts b/apps/cli/test/commands/eval/artifact-writer.test.ts index 790c810a3..151c1bd9a 100644 --- a/apps/cli/test/commands/eval/artifact-writer.test.ts +++ b/apps/cli/test/commands/eval/artifact-writer.test.ts @@ -186,7 +186,7 @@ describe('buildGradingArtifact', () => { const grading = buildGradingArtifact(result); - expect(grading.trials).toEqual([ + expect(grading.attempts).toEqual([ { attempt: 0, score: 0.4, @@ -524,7 +524,7 @@ describe('buildRunSummaryArtifact', () => { [makeResult({})], 'test.eval.yaml', 'baseline-v2', - 'run-1', + 'attempt-1', undefined, undefined, undefined, @@ -691,8 +691,8 @@ describe('buildIndexArtifactEntry', () => { }), { outputDir: '/tmp/artifacts', - gradingPath: '/tmp/artifacts/alpha/run-1/grading.json', - timingPath: '/tmp/artifacts/alpha/run-1/timing.json', + gradingPath: '/tmp/artifacts/alpha/attempt-1/grading.json', + timingPath: '/tmp/artifacts/alpha/attempt-1/timing.json', outputPath: '/tmp/artifacts/alpha/outputs/answer.md', answerPath: '/tmp/artifacts/alpha/outputs/answer.md', }, @@ -722,14 +722,14 @@ describe('buildIndexArtifactEntry', () => { ], execution_status: 'quality_failure', error: 'model drift', - grading_path: 'alpha/run-1/grading.json', - timing_path: 
'alpha/run-1/timing.json', + grading_path: 'alpha/attempt-1/grading.json', + timing_path: 'alpha/attempt-1/timing.json', output_path: 'alpha/outputs/answer.md', answer_path: 'alpha/outputs/answer.md', - trials: [ + attempts: [ { attempt: 0, - run_path: 'run-1', + attempt_path: 'attempt-1', score: 0.9, verdict: 'pass', scores: [ @@ -790,12 +790,12 @@ describe('buildIndexArtifactEntry', () => { }), { outputDir: '/tmp/artifacts', - gradingPath: '/tmp/artifacts/alpha/run-1/grading.json', - timingPath: '/tmp/artifacts/alpha/run-1/timing.json', + gradingPath: '/tmp/artifacts/alpha/attempt-1/grading.json', + timingPath: '/tmp/artifacts/alpha/attempt-1/timing.json', }, ); - expect(entry.trials).toEqual([ + expect(entry.attempts).toEqual([ { attempt: 0, score: 0.8, verdict: 'pass' }, { attempt: 1, score: 0.6, verdict: 'fail', error: 'missing token' }, ]); @@ -855,9 +855,9 @@ describe('parseJsonlResults', () => { artifactPointers: { transcript: { ref: 'agentv/artifacts/v1', - key: 'transcripts/pointer-row/run-1/transcript-raw.jsonl', + key: 'transcripts/pointer-row/attempt-1/transcript-raw.jsonl', object_version: 'sha256:test', - path: 'pointer-row/run-1/transcript-raw.jsonl', + path: 'pointer-row/attempt-1/transcript-raw.jsonl', sha256: 'test', size: 1, schema_version: 'agentv.transcript.v1', @@ -875,7 +875,7 @@ describe('parseJsonlResults', () => { test_id: 'file-changes-row', target: 'codex', score: 1, - fileChangesPath: 'file-changes-row/run-1/outputs/file_changes.diff', + fileChangesPath: 'file-changes-row/attempt-1/outputs/file_changes.diff', })}\n`; expect(() => parseJsonlResults(content)).toThrow(/Use "file_changes_path"/); @@ -887,7 +887,7 @@ describe('parseJsonlResults', () => { target: 'codex', score: 1, output: 'done', - raw_provider_log_path: 'raw-log-case/run-1/provider.log', + raw_provider_log_path: 'raw-log-case/attempt-1/provider.log', })}\n`; const results = parseJsonlResults(content); @@ -1043,9 +1043,11 @@ describe('writeArtifactsFromResults', () => { 
expect(rootSummary.manifest_path).toBe(RESULT_INDEX_FILENAME); const alphaEntries = await readdir(path.join(paths.testArtifactDir, alphaRowDir)); - expect(alphaEntries.sort()).toEqual(['run-1', 'summary.json']); + expect(alphaEntries.sort()).toEqual(['attempt-1', 'summary.json']); - const alphaRunEntries = await readdir(path.join(paths.testArtifactDir, alphaRowDir, 'run-1')); + const alphaRunEntries = await readdir( + path.join(paths.testArtifactDir, alphaRowDir, 'attempt-1'), + ); expect(alphaRunEntries.sort()).toEqual([ 'grading.json', 'metrics.json', @@ -1058,7 +1060,7 @@ describe('writeArtifactsFromResults', () => { const alphaGrading: GradingArtifact = JSON.parse( await readFile( - path.join(paths.testArtifactDir, alphaRowDir, 'run-1', 'grading.json'), + path.join(paths.testArtifactDir, alphaRowDir, 'attempt-1', 'grading.json'), 'utf8', ), ); @@ -1066,7 +1068,10 @@ describe('writeArtifactsFromResults', () => { expect(alphaGrading).not.toHaveProperty('execution_metrics'); const alphaTiming: TimingArtifact = JSON.parse( - await readFile(path.join(paths.testArtifactDir, alphaRowDir, 'run-1', 'timing.json'), 'utf8'), + await readFile( + path.join(paths.testArtifactDir, alphaRowDir, 'attempt-1', 'timing.json'), + 'utf8', + ), ); expect(alphaTiming.duration_ms).toBe(5000); @@ -1076,9 +1081,9 @@ describe('writeArtifactsFromResults', () => { expect(summary.timing.duration_ms).toBe(13000); expect(indexLines[0]?.summary_path).toBe(`${alphaRowDir}/summary.json`); - expect(indexLines[0]?.grading_path).toBe(`${alphaRowDir}/run-1/grading.json`); - expect(indexLines[0]?.timing_path).toBe(`${alphaRowDir}/run-1/timing.json`); - expect(indexLines[0]?.metrics_path).toBe(`${alphaRowDir}/run-1/metrics.json`); + expect(indexLines[0]?.grading_path).toBe(`${alphaRowDir}/attempt-1/grading.json`); + expect(indexLines[0]?.timing_path).toBe(`${alphaRowDir}/attempt-1/timing.json`); + expect(indexLines[0]?.metrics_path).toBe(`${alphaRowDir}/attempt-1/metrics.json`); }); it('writes optional 
runtime source metadata to summary and index rows', async () => { @@ -1181,9 +1186,9 @@ describe('writeArtifactsFromResults', () => { const [indexEntry] = await readIndexLines(paths.indexPath); const repeatRowDir = expectRowDir(indexEntry, 'repeat-case'); - expect(indexEntry?.trials).toMatchObject([ - { attempt: 0, run_path: 'run-1', score: 0.25, verdict: 'fail' }, - { attempt: 1, run_path: 'run-2', score: 1, verdict: 'pass' }, + expect(indexEntry?.attempts).toMatchObject([ + { attempt: 0, attempt_path: 'attempt-1', score: 0.25, verdict: 'fail' }, + { attempt: 1, attempt_path: 'attempt-2', score: 1, verdict: 'pass' }, ]); expect(indexEntry?.aggregation).toEqual({ strategy: 'confidence_interval', @@ -1202,14 +1207,14 @@ describe('writeArtifactsFromResults', () => { expect(indexEntry?.metrics_path).toBeUndefined(); const repeatEntries = await readdir(path.join(paths.testArtifactDir, repeatRowDir)); - expect(repeatEntries.sort()).toEqual(['run-1', 'run-2', 'summary.json']); + expect(repeatEntries.sort()).toEqual(['attempt-1', 'attempt-2', 'summary.json']); const caseSummary = JSON.parse( await readFile(path.join(paths.testArtifactDir, repeatRowDir, 'summary.json'), 'utf8'), ) as Record; expect(caseSummary).toMatchObject({ - total_runs: 2, - passed_runs: 1, + total_attempts: 2, + passed_attempts: 1, pass_rate: '50%', mean_duration_ms: 3000, mean_duration_seconds: 3, @@ -1240,7 +1245,7 @@ describe('writeArtifactsFromResults', () => { readFile(path.join(paths.testArtifactDir, repeatRowDir, 'grading.json'), 'utf8'), ).rejects.toThrow(); - for (const runDir of ['run-1', 'run-2']) { + for (const runDir of ['attempt-1', 'attempt-2']) { const runEntries = await readdir(path.join(paths.testArtifactDir, repeatRowDir, runDir)); expect(runEntries.sort()).toEqual([ 'grading.json', @@ -1255,7 +1260,7 @@ describe('writeArtifactsFromResults', () => { const runOneResult = JSON.parse( await readFile( - path.join(paths.testArtifactDir, repeatRowDir, 'run-1', 'result.json'), + 
path.join(paths.testArtifactDir, repeatRowDir, 'attempt-1', 'result.json'), 'utf8', ), ) as Record; @@ -1275,17 +1280,17 @@ describe('writeArtifactsFromResults', () => { }, }); expect(runOneResult).not.toHaveProperty('status'); - expect(indexEntry?.trials?.[0]?.transcript_summary).toEqual(runOneResult.transcript_summary); + expect(indexEntry?.attempts?.[0]?.transcript_summary).toEqual(runOneResult.transcript_summary); const runTwoAnswer = await readFile( - path.join(paths.testArtifactDir, repeatRowDir, 'run-2', 'outputs', 'answer.md'), + path.join(paths.testArtifactDir, repeatRowDir, 'attempt-2', 'outputs', 'answer.md'), 'utf8', ); expect(runTwoAnswer).toBe('second attempt'); const runTwoResult = JSON.parse( await readFile( - path.join(paths.testArtifactDir, repeatRowDir, 'run-2', 'result.json'), + path.join(paths.testArtifactDir, repeatRowDir, 'attempt-2', 'result.json'), 'utf8', ), ) as Record; @@ -1301,7 +1306,7 @@ describe('writeArtifactsFromResults', () => { }, }); expect(runTwoResult).not.toHaveProperty('status'); - expect(indexEntry?.trials?.[1]?.transcript_summary).toEqual(runTwoResult.transcript_summary); + expect(indexEntry?.attempts?.[1]?.transcript_summary).toEqual(runTwoResult.transcript_summary); }); it('handles empty results array', async () => { @@ -1338,13 +1343,13 @@ describe('writeArtifactsFromResults', () => { const testTwo = indexLines.find((line) => line.test_id === 'test-2'); const gradingOne: GradingArtifact = JSON.parse( - await readFile(runArtifactPath(testDir, testOne, 'run-1', 'grading.json'), 'utf8'), + await readFile(runArtifactPath(testDir, testOne, 'attempt-1', 'grading.json'), 'utf8'), ); const gradingTwo: GradingArtifact = JSON.parse( - await readFile(runArtifactPath(testDir, testTwo, 'run-1', 'grading.json'), 'utf8'), + await readFile(runArtifactPath(testDir, testTwo, 'attempt-1', 'grading.json'), 'utf8'), ); const timingOne: TimingArtifact = JSON.parse( - await readFile(runArtifactPath(testDir, testOne, 'run-1', 'timing.json'), 
'utf8'), + await readFile(runArtifactPath(testDir, testOne, 'attempt-1', 'timing.json'), 'utf8'), ); expect(gradingOne.summary.total).toBe(1); @@ -1408,11 +1413,14 @@ describe('writeArtifactsFromResults', () => { const [indexLine] = await readIndexLines(paths.indexPath); const rowDir = expectRowDir(indexLine, 'transcript-case'); - const transcriptPath = runArtifactPath(testDir, indexLine, 'run-1', 'transcript.json'); + const transcriptPath = runArtifactPath(testDir, indexLine, 'attempt-1', 'transcript.json'); const transcript = JSON.parse(await readFile(transcriptPath, 'utf8')); const rawTranscriptLines = ( - await readFile(runArtifactPath(testDir, indexLine, 'run-1', 'transcript-raw.jsonl'), 'utf8') + await readFile( + runArtifactPath(testDir, indexLine, 'attempt-1', 'transcript-raw.jsonl'), + 'utf8', + ) ) .trim() .split('\n') @@ -1495,17 +1503,17 @@ describe('writeArtifactsFromResults', () => { role: 'user', }); await expect( - readFile(path.join(testDir, rowDir, 'run-1', 'transcript.jsonl'), 'utf8'), + readFile(path.join(testDir, rowDir, 'attempt-1', 'transcript.jsonl'), 'utf8'), ).rejects.toThrow(); await expect( - readFile(runArtifactPath(testDir, indexLine, 'run-1', 'trace.json'), 'utf8'), + readFile(runArtifactPath(testDir, indexLine, 'attempt-1', 'trace.json'), 'utf8'), ).rejects.toThrow(); expect(indexLine).not.toHaveProperty('trace_path'); - expect(indexLine?.transcript_path).toBe(`${rowDir}/run-1/transcript.json`); - expect(indexLine?.transcript_raw_path).toBe(`${rowDir}/run-1/transcript-raw.jsonl`); + expect(indexLine?.transcript_path).toBe(`${rowDir}/attempt-1/transcript.json`); + expect(indexLine?.transcript_raw_path).toBe(`${rowDir}/attempt-1/transcript-raw.jsonl`); expect(indexLine?.transcript_summary).toEqual(transcript.transcript_summary); - expect(indexLine?.metrics_path).toBe(`${rowDir}/run-1/metrics.json`); + expect(indexLine?.metrics_path).toBe(`${rowDir}/attempt-1/metrics.json`); 
expect(indexLine.metrics_path.endsWith(CANONICAL_METRICS_ARTIFACT_PATH)).toBe(true); expect(indexLine.artifact_pointers).toBeUndefined(); @@ -1608,26 +1616,26 @@ describe('writeArtifactsFromResults', () => { const [indexLine] = await readIndexLines(paths.indexPath); const rowDir = expectRowDir(indexLine, 'summary-case'); - expect(indexLine?.metrics_path).toBe(`${rowDir}/run-1/metrics.json`); + expect(indexLine?.metrics_path).toBe(`${rowDir}/attempt-1/metrics.json`); expect(indexLine?.file_changes_path).toBe( - `${rowDir}/run-1/${CANONICAL_FILE_CHANGES_ARTIFACT_PATH}`, + `${rowDir}/attempt-1/${CANONICAL_FILE_CHANGES_ARTIFACT_PATH}`, ); await expect( readFile( - runArtifactPath(testDir, indexLine, 'run-1', 'outputs', 'file_changes.diff'), + runArtifactPath(testDir, indexLine, 'attempt-1', 'outputs', 'file_changes.diff'), 'utf8', ), ).resolves.toBe(fileChanges); const runResult = JSON.parse( - await readFile(runArtifactPath(testDir, indexLine, 'run-1', 'result.json'), 'utf8'), + await readFile(runArtifactPath(testDir, indexLine, 'attempt-1', 'result.json'), 'utf8'), ); expect(runResult.file_changes_path).toBe('./outputs/file_changes.diff'); expect(runResult.output_paths.file_changes).toBe('./outputs/file_changes.diff'); const summary = MetricsArtifactWireSchema.parse( JSON.parse( - await readFile(runArtifactPath(testDir, indexLine, 'run-1', 'metrics.json'), 'utf8'), + await readFile(runArtifactPath(testDir, indexLine, 'attempt-1', 'metrics.json'), 'utf8'), ), ); @@ -1646,7 +1654,7 @@ describe('writeArtifactsFromResults', () => { }); expect(summary.source_artifacts).not.toHaveProperty('trace_path'); await expect( - readFile(runArtifactPath(testDir, indexLine, 'run-1', 'trace.json'), 'utf8'), + readFile(runArtifactPath(testDir, indexLine, 'attempt-1', 'trace.json'), 'utf8'), ).rejects.toThrow(); expect(summary.metrics.total_turns).toBe(2); expect(summary.metrics.total_tool_calls).toBe(4); @@ -1720,7 +1728,7 @@ describe('writeArtifactsFromResults', () => { 
expect(summary).not.toHaveProperty('usage_summary'); const timing = JSON.parse( - await readFile(runArtifactPath(testDir, indexLine, 'run-1', 'timing.json'), 'utf8'), + await readFile(runArtifactPath(testDir, indexLine, 'attempt-1', 'timing.json'), 'utf8'), ); expect(timing).toMatchObject({ total_tokens: 140, @@ -1778,21 +1786,21 @@ describe('writeArtifactsFromResults', () => { const estimatedRow = indexLines.find((line) => line.test_id === 'estimated-usage'); const aggregateTiming = JSON.parse( - await readFile(runArtifactPath(testDir, aggregateRow, 'run-1', 'timing.json'), 'utf8'), + await readFile(runArtifactPath(testDir, aggregateRow, 'attempt-1', 'timing.json'), 'utf8'), ); const estimatedTiming = JSON.parse( - await readFile(runArtifactPath(testDir, estimatedRow, 'run-1', 'timing.json'), 'utf8'), + await readFile(runArtifactPath(testDir, estimatedRow, 'attempt-1', 'timing.json'), 'utf8'), ); const runSummary = JSON.parse(await readFile(path.join(testDir, 'summary.json'), 'utf8')); MetricsArtifactWireSchema.parse( JSON.parse( - await readFile(runArtifactPath(testDir, aggregateRow, 'run-1', 'metrics.json'), 'utf8'), + await readFile(runArtifactPath(testDir, aggregateRow, 'attempt-1', 'metrics.json'), 'utf8'), ), ); MetricsArtifactWireSchema.parse( JSON.parse( - await readFile(runArtifactPath(testDir, estimatedRow, 'run-1', 'metrics.json'), 'utf8'), + await readFile(runArtifactPath(testDir, estimatedRow, 'attempt-1', 'metrics.json'), 'utf8'), ), ); @@ -1853,18 +1861,18 @@ describe('writeArtifactsFromResults', () => { const [indexLine] = await readIndexLines(paths.indexPath); const rowDir = expectRowDir(indexLine, 'raw-log-case'); - const copiedRawLogPath = runArtifactPath(testDir, indexLine, 'run-1', 'provider.log'); + const copiedRawLogPath = runArtifactPath(testDir, indexLine, 'attempt-1', 'provider.log'); await expect(readFile(copiedRawLogPath, 'utf8')).rejects.toThrow(); - const transcriptPath = runArtifactPath(testDir, indexLine, 'run-1', 
'transcript-raw.jsonl'); + const transcriptPath = runArtifactPath(testDir, indexLine, 'attempt-1', 'transcript-raw.jsonl'); await expect(readFile(transcriptPath, 'utf8')).resolves.toBe(rawLog); await expect(readFile(rawLogPath, 'utf8')).resolves.toBe(rawLog); await expect( - readFile(path.join(testDir, rowDir, 'run-1', 'transcript.jsonl'), 'utf8'), + readFile(path.join(testDir, rowDir, 'attempt-1', 'transcript.jsonl'), 'utf8'), ).rejects.toThrow(); const transcript = JSON.parse( - await readFile(runArtifactPath(testDir, indexLine, 'run-1', 'transcript.json'), 'utf8'), + await readFile(runArtifactPath(testDir, indexLine, 'attempt-1', 'transcript.json'), 'utf8'), ); expect(transcript.turns[0]).toMatchObject({ v: 1, @@ -1874,8 +1882,8 @@ describe('writeArtifactsFromResults', () => { }); expect(indexLine.raw_provider_log_path).toBeUndefined(); - expect(indexLine.transcript_path).toBe(`${rowDir}/run-1/transcript.json`); - expect(indexLine.transcript_raw_path).toBe(`${rowDir}/run-1/transcript-raw.jsonl`); + expect(indexLine.transcript_path).toBe(`${rowDir}/attempt-1/transcript.json`); + expect(indexLine.transcript_raw_path).toBe(`${rowDir}/attempt-1/transcript-raw.jsonl`); expect(indexLine).not.toHaveProperty('transcript_json_path'); }); @@ -1920,7 +1928,7 @@ describe('writeArtifactsFromResults', () => { expect(JSON.stringify(indexLine)).not.toContain('api_key'); const transcriptJson = await readFile( - runArtifactPath(testDir, indexLine, 'run-1', 'transcript.json'), + runArtifactPath(testDir, indexLine, 'attempt-1', 'transcript.json'), 'utf8', ); expect(transcriptJson).not.toContain('secret'); @@ -1939,12 +1947,12 @@ describe('writeArtifactsFromResults', () => { const paths = await writeArtifactsFromResults(results, testDir); const [indexLine] = await readIndexLines(paths.indexPath); - const transcriptPath = runArtifactPath(testDir, indexLine, 'run-1', 'transcript-raw.jsonl'); + const transcriptPath = runArtifactPath(testDir, indexLine, 'attempt-1', 
'transcript-raw.jsonl'); await expect(readFile(transcriptPath, 'utf8')).rejects.toThrow(); expect(indexLine).not.toHaveProperty('transcript_path'); expect(indexLine.metrics_path).toBe( - `${expectRowDir(indexLine, 'no-transcript-case')}/run-1/metrics.json`, + `${expectRowDir(indexLine, 'no-transcript-case')}/attempt-1/metrics.json`, ); expect(indexLine.artifact_pointers).toBeUndefined(); }); @@ -1973,11 +1981,11 @@ describe('writeArtifactsFromResults', () => { const [indexLine] = await readIndexLines(paths.indexPath); const rowDir = expectRowDir(indexLine, 'shared-id'); - expect(indexLine.grading_path).toBe(`${rowDir}/run-1/grading.json`); + expect(indexLine.grading_path).toBe(`${rowDir}/attempt-1/grading.json`); expect(rowDir).not.toContain('/'); const grading: GradingArtifact = JSON.parse( - await readFile(runArtifactPath(testDir, indexLine, 'run-1', 'grading.json'), 'utf8'), + await readFile(runArtifactPath(testDir, indexLine, 'attempt-1', 'grading.json'), 'utf8'), ); expect(grading.assertions[0].text).toBe('baseline-check'); @@ -1996,11 +2004,11 @@ describe('writeArtifactsFromResults', () => { const rowDirs = indexLines.map((line) => expectRowDir(line, 'shared-id')); expect(new Set(rowDirs).size).toBe(2); expect(indexLines.map((line) => line.grading_path)).toEqual( - rowDirs.map((rowDir) => `${rowDir}/run-1/grading.json`), + rowDirs.map((rowDir) => `${rowDir}/attempt-1/grading.json`), ); const answers = await Promise.all( indexLines.map((line) => - readFile(runArtifactPath(testDir, line, 'run-1', 'outputs', 'answer.md'), 'utf8'), + readFile(runArtifactPath(testDir, line, 'attempt-1', 'outputs', 'answer.md'), 'utf8'), ), ); expect(answers.sort()).toEqual(['alpha answer', 'beta answer']); @@ -2088,7 +2096,7 @@ describe('writeArtifactsFromResults', () => { id: 'alpha', key: 'alpha', dimensions: { - runId: 'run-1', + runId: 'attempt-1', suite: 'variant-suite', evalPath: 'evals/variant.eval.yaml', testId: 'shared-id', @@ -2110,7 +2118,7 @@ 
describe('writeArtifactsFromResults', () => { id: 'beta', key: 'beta', dimensions: { - runId: 'run-1', + runId: 'attempt-1', suite: 'variant-suite', evalPath: 'evals/variant.eval.yaml', testId: 'shared-id', diff --git a/apps/cli/test/commands/grade/grade-prepared.test.ts b/apps/cli/test/commands/grade/grade-prepared.test.ts index 4b8a174fa..df5cd6c91 100644 --- a/apps/cli/test/commands/grade/grade-prepared.test.ts +++ b/apps/cli/test/commands/grade/grade-prepared.test.ts @@ -197,7 +197,7 @@ describe('agentv grade prepared attempts', () => { }); expect(typeof row.metadata.prepared_attempt.baseline_commit).toBe('string'); - expect(row.file_changes_path).toMatch(/\/run-1\/outputs\/file_changes\.diff$/); + expect(row.file_changes_path).toMatch(/\/attempt-1\/outputs\/file_changes\.diff$/); await expect(readFile(path.join(runDir, row.file_changes_path), 'utf8')).resolves.toContain( '+manual edit', ); diff --git a/apps/cli/test/commands/prepare/prepare.test.ts b/apps/cli/test/commands/prepare/prepare.test.ts index 1c7e5e6a9..ed850db91 100644 --- a/apps/cli/test/commands/prepare/prepare.test.ts +++ b/apps/cli/test/commands/prepare/prepare.test.ts @@ -255,7 +255,7 @@ describe('agentv prepare', () => { path.join(tempDir, '.agentv', 'targets.yaml'), ` targets: - - name: codex + - label: codex provider: cli command: bun ./scripts/target.ts `, diff --git a/apps/cli/test/commands/results/export-e2e-providers.test.ts b/apps/cli/test/commands/results/export-e2e-providers.test.ts index 991d29a91..a8e78ba2e 100644 --- a/apps/cli/test/commands/results/export-e2e-providers.test.ts +++ b/apps/cli/test/commands/results/export-e2e-providers.test.ts @@ -238,7 +238,7 @@ function runArtifactDir( outputDir: string, record: { suite?: string; target?: string; test_id?: string }, ): string { - return path.join(outputDir, findIndexEntry(outputDir, record).result_dir, 'run-1'); + return path.join(outputDir, findIndexEntry(outputDir, record).result_dir, 'attempt-1'); } describe('export e2e — 
multi-provider metrics verification', () => { diff --git a/apps/cli/test/commands/results/export.test.ts b/apps/cli/test/commands/results/export.test.ts index 90860e80a..ad835961c 100644 --- a/apps/cli/test/commands/results/export.test.ts +++ b/apps/cli/test/commands/results/export.test.ts @@ -197,7 +197,7 @@ function runArtifactDir( outputDir: string, record: { suite?: string; target?: string; test_id?: string }, ): string { - return path.join(artifactDir(outputDir, record), 'run-1'); + return path.join(artifactDir(outputDir, record), 'attempt-1'); } function readAnswer( @@ -299,7 +299,7 @@ describe('results export', () => { }); expect(first.entries[0].artifact_refs).toMatchObject({ status: 'planned_export', - timing_path: expect.stringMatching(/^test-private--[a-f0-9]{12}\/run-1\/timing\.json$/), + timing_path: expect.stringMatching(/^test-private--[a-f0-9]{12}\/attempt-1\/timing\.json$/), }); expect(first.entries[0].artifact_refs).not.toHaveProperty('input_path'); expect(first.entries[0].artifact_refs).not.toHaveProperty('output_path'); @@ -380,20 +380,20 @@ describe('results export', () => { status: 'planned_export', result_dir: resultDir, summary_path: `${resultDir}/summary.json`, - grading_path: `${resultDir}/run-1/grading.json`, - timing_path: `${resultDir}/run-1/timing.json`, - metrics_path: `${resultDir}/run-1/metrics.json`, - output_path: `${resultDir}/run-1/outputs/answer.md`, - answer_path: `${resultDir}/run-1/outputs/answer.md`, - transcript_path: `${resultDir}/run-1/transcript.json`, - transcript_raw_path: `${resultDir}/run-1/transcript-raw.jsonl`, + grading_path: `${resultDir}/attempt-1/grading.json`, + timing_path: `${resultDir}/attempt-1/timing.json`, + metrics_path: `${resultDir}/attempt-1/metrics.json`, + output_path: `${resultDir}/attempt-1/outputs/answer.md`, + answer_path: `${resultDir}/attempt-1/outputs/answer.md`, + transcript_path: `${resultDir}/attempt-1/transcript.json`, + transcript_raw_path: 
`${resultDir}/attempt-1/transcript-raw.jsonl`, }); expect(bundle.entries[0].artifact_refs).not.toHaveProperty('trace_path'); expect(bundle.entries[0].artifact_refs).not.toHaveProperty('input_path'); expect(bundle.entries[0].trace).not.toHaveProperty('envelope_ref'); expect(bundle.entries[0].trace_envelope.artifacts).toBeDefined(); expect(bundle.entries[0].trace_envelope.artifacts).not.toHaveProperty('trace_path'); - expect(bundle.entries[0].feedback.grading_path).toBe(`${resultDir}/run-1/grading.json`); + expect(bundle.entries[0].feedback.grading_path).toBe(`${resultDir}/attempt-1/grading.json`); expect(bundle.entries[0].raw_content).toBeDefined(); expect(bundle.entries[0].feedback.scores?.[0]).toHaveProperty('evidence'); expect(serialized).toContain('SECRET_PROMPT_TEXT'); @@ -454,13 +454,13 @@ describe('results export', () => { execution_status: 'ok', result_dir: rowDir, summary_path: `${rowDir}/summary.json`, - grading_path: `${rowDir}/run-1/grading.json`, - timing_path: `${rowDir}/run-1/timing.json`, - metrics_path: `${rowDir}/run-1/metrics.json`, - output_path: `${rowDir}/run-1/outputs/answer.md`, - answer_path: `${rowDir}/run-1/outputs/answer.md`, - transcript_path: `${rowDir}/run-1/transcript.json`, - transcript_raw_path: `${rowDir}/run-1/transcript-raw.jsonl`, + grading_path: `${rowDir}/attempt-1/grading.json`, + timing_path: `${rowDir}/attempt-1/timing.json`, + metrics_path: `${rowDir}/attempt-1/metrics.json`, + output_path: `${rowDir}/attempt-1/outputs/answer.md`, + answer_path: `${rowDir}/attempt-1/outputs/answer.md`, + transcript_path: `${rowDir}/attempt-1/transcript.json`, + transcript_raw_path: `${rowDir}/attempt-1/transcript-raw.jsonl`, }); expect(entries[0]).not.toHaveProperty('input_path'); expect(entries[0].projection_identity).toMatchObject({ @@ -832,7 +832,7 @@ describe('results export', () => { const answerPath = path.join( artifactDir(outputDir, RESULT_DIFFERENT_TARGET), - 'run-1', + 'attempt-1', 'outputs', 'answer.md', ); diff --git 
a/apps/cli/test/commands/results/serve.test.ts b/apps/cli/test/commands/results/serve.test.ts index 8121fac19..77228476d 100644 --- a/apps/cli/test/commands/results/serve.test.ts +++ b/apps/cli/test/commands/results/serve.test.ts @@ -2461,14 +2461,14 @@ describe('serve app', () => { thinking_blocks: 1, }; - mkdirSync(path.join(runDir, resultDir, 'run-1'), { recursive: true }); - mkdirSync(path.join(runDir, resultDir, 'run-2'), { recursive: true }); + mkdirSync(path.join(runDir, resultDir, 'attempt-1'), { recursive: true }); + mkdirSync(path.join(runDir, resultDir, 'attempt-2'), { recursive: true }); writeFileSync( - path.join(runDir, resultDir, 'run-1', 'result.json'), + path.join(runDir, resultDir, 'attempt-1', 'result.json'), `${JSON.stringify({ transcript_summary: firstSummary })}\n`, ); writeFileSync( - path.join(runDir, resultDir, 'run-2', 'result.json'), + path.join(runDir, resultDir, 'attempt-2', 'result.json'), `${JSON.stringify({ transcript_summary: secondSummary })}\n`, ); writeFileSync( @@ -2477,15 +2477,15 @@ describe('serve app', () => { ...RESULT_A, test_id: 'repeat-case', result_dir: resultDir, - trials: [ + attempts: [ { attempt: 0, - run_path: 'run-1', + attempt_path: 'attempt-1', score: 0.25, verdict: 'fail', transcript_summary: firstSummary, }, - { attempt: 1, run_path: 'run-2', score: 1, verdict: 'pass' }, + { attempt: 1, attempt_path: 'attempt-2', score: 1, verdict: 'pass' }, ], }), ); @@ -2495,18 +2495,18 @@ describe('serve app', () => { expect(res.status).toBe(200); const data = (await res.json()) as { results: Array<{ - trials?: Array<{ + attempts?: Array<{ transcript_path?: string; transcript_summary?: Record; }>; }>; }; - expect(data.results[0]?.trials?.[0]?.transcript_summary).toEqual(firstSummary); - expect(data.results[0]?.trials?.[1]?.transcript_summary).toEqual(secondSummary); - expect(data.results[0]?.trials?.map((trial) => trial.transcript_path)).toEqual([ - `${resultDir}/run-1/transcript.json`, - 
`${resultDir}/run-2/transcript.json`, + expect(data.results[0]?.attempts?.[0]?.transcript_summary).toEqual(firstSummary); + expect(data.results[0]?.attempts?.[1]?.transcript_summary).toEqual(secondSummary); + expect(data.results[0]?.attempts?.map((trial) => trial.transcript_path)).toEqual([ + `${resultDir}/attempt-1/transcript.json`, + `${resultDir}/attempt-2/transcript.json`, ]); }); @@ -3706,8 +3706,8 @@ describe('serve app', () => { const timestampDir = path.join(runsDir, '2026-03-25T10-00-00-000Z'); const alphaDir = 'case-one--111111111111'; const betaDir = 'case-one--222222222222'; - const alphaAnswer = path.join(timestampDir, alphaDir, 'run-1', 'outputs', 'answer.md'); - const betaAnswer = path.join(timestampDir, betaDir, 'run-1', 'outputs', 'answer.md'); + const alphaAnswer = path.join(timestampDir, alphaDir, 'attempt-1', 'outputs', 'answer.md'); + const betaAnswer = path.join(timestampDir, betaDir, 'attempt-1', 'outputs', 'answer.md'); mkdirSync(path.dirname(alphaAnswer), { recursive: true }); mkdirSync(path.dirname(betaAnswer), { recursive: true }); @@ -3722,7 +3722,7 @@ describe('serve app', () => { test_id: 'case-one', target: 'mock-alpha', result_dir: alphaDir, - answer_path: `${alphaDir}/run-1/outputs/answer.md`, + answer_path: `${alphaDir}/attempt-1/outputs/answer.md`, }, { ...RESULT_A, @@ -3730,14 +3730,14 @@ describe('serve app', () => { test_id: 'case-one', target: 'mock-beta', result_dir: betaDir, - answer_path: `${betaDir}/run-1/outputs/answer.md`, + answer_path: `${betaDir}/attempt-1/outputs/answer.md`, }, ), ); const app = createApp([], tempDir, tempDir, undefined, { studioDir }); const res = await app.request( - `/api/runs/${encodeURIComponent('2026-03-25T10-00-00-000Z')}/evals/case-one/files/${betaDir}/run-1/outputs/answer.md?result_dir=${encodeURIComponent(betaDir)}`, + `/api/runs/${encodeURIComponent('2026-03-25T10-00-00-000Z')}/evals/case-one/files/${betaDir}/attempt-1/outputs/answer.md?result_dir=${encodeURIComponent(betaDir)}`, ); 
expect(res.status).toBe(200); diff --git a/apps/cli/test/commands/results/validate.test.ts b/apps/cli/test/commands/results/validate.test.ts index a33bf01fb..5c8ff3505 100644 --- a/apps/cli/test/commands/results/validate.test.ts +++ b/apps/cli/test/commands/results/validate.test.ts @@ -76,12 +76,12 @@ describe('results validate', () => { scores: [{ name: 'quality', type: 'llm', score: 1, verdict: 'pass' }], execution_status: 'ok', summary_path: 'test-greeting/summary.json', - trace_path: 'test-greeting/run-1/trace.json', + trace_path: 'test-greeting/attempt-1/trace.json', artifact_pointers: { trace: { ref: 'agentv/artifacts/v1', - key: 'traces/test-greeting/run-1/trace.json', - path: 'test-greeting/run-1/trace.json', + key: 'traces/test-greeting/attempt-1/trace.json', + path: 'test-greeting/attempt-1/trace.json', }, }, })}\n`, diff --git a/apps/cli/test/eval.integration.test.ts b/apps/cli/test/eval.integration.test.ts index 3bbbd2c68..419d06d23 100644 --- a/apps/cli/test/eval.integration.test.ts +++ b/apps/cli/test/eval.integration.test.ts @@ -343,7 +343,7 @@ describe('agentv eval CLI', () => { const resultDir = row.result_dir as string; expect(resultDir).not.toContain('/'); await expectFileExists(path.join(outputDir, resultDir, 'summary.json')); - await expectFileExists(path.join(outputDir, resultDir, 'run-1', 'grading.json')); + await expectFileExists(path.join(outputDir, resultDir, 'attempt-1', 'grading.json')); } } finally { await rm(fixture.baseDir, { recursive: true, force: true }); @@ -366,7 +366,7 @@ describe('agentv eval CLI', () => { const [firstRow] = (await readJsonLines(indexPath)) as Array>; await expectFileExists(path.join(outputDir, firstRow.result_dir as string, 'summary.json')); await expectFileExists( - path.join(outputDir, firstRow.result_dir as string, 'run-1', 'grading.json'), + path.join(outputDir, firstRow.result_dir as string, 'attempt-1', 'grading.json'), ); } finally { await rm(fixture.baseDir, { recursive: true, force: true }); @@ 
-409,10 +409,10 @@ describe('agentv eval CLI', () => { expect(canonicalResults).toHaveLength(2); await expectFileExists(path.join(outputDir, 'summary.json')); for (const row of canonicalResults) { - expect(row.transcript_path).toMatch(/run-1\/transcript\.json$/); + expect(row.transcript_path).toMatch(/attempt-1\/transcript\.json$/); await expectFileExists(path.join(outputDir, row.transcript_path as string)); expect(row.transcript_summary).toBeDefined(); - expect(row.transcript_raw_path).toMatch(/run-1\/transcript-raw\.jsonl$/); + expect(row.transcript_raw_path).toMatch(/attempt-1\/transcript-raw\.jsonl$/); await expectFileExists(path.join(outputDir, row.transcript_raw_path as string)); } } finally { @@ -640,10 +640,11 @@ describe('agentv eval CLI', () => { 'timeout_seconds: 12', 'threshold: 0.8', 'budget_usd: 3', - 'repeat:', - ' count: 2', - ' strategy: pass_any', - ' early_exit: true', + 'evaluate_options:', + ' repeat:', + ' count: 2', + ' strategy: pass_any', + ' early_exit: true', 'tests:', ' - include: sample.test.yaml', ' type: suite', diff --git a/apps/dashboard/src/components/EvalDetail.tsx b/apps/dashboard/src/components/EvalDetail.tsx index 5b861ce08..ece5f4b61 100644 --- a/apps/dashboard/src/components/EvalDetail.tsx +++ b/apps/dashboard/src/components/EvalDetail.tsx @@ -64,7 +64,7 @@ function findFirstFile(nodes: FileNode[]): string | null { } function caseTrialPath(trial: EvalCaseTrial, index = 0): string { - return trial.run_path ?? `run-${trial.attempt ?? index + 1}`; + return trial.attempt_path ?? trial.run_path ?? `attempt-${trial.attempt ?? index + 1}`; } function caseTrialTokenTotal(trial: EvalCaseTrial): number | undefined { @@ -618,9 +618,12 @@ function RepeatAggregateChecksTab({
- + - +
@@ -644,7 +647,7 @@ function RepeatAggregateChecksTab({ ) : null}
- Runs + Attempts
{group.trials.map((trial, index) => (
- Run score + Attempt score
@@ -804,7 +807,7 @@ function RepeatAggregateTranscriptTab({ return (

- Run transcripts + Attempt transcripts

{group.trials.map((trial, index) => { const runLabel = caseTrialPath(trial, index); @@ -891,7 +894,7 @@ function TrialTranscriptTab({
No structured transcript - This run does not include a transcript artifact. + This attempt does not include a transcript artifact.
); diff --git a/apps/dashboard/src/components/ResultTable.tsx b/apps/dashboard/src/components/ResultTable.tsx index 25c8c1cf7..5d6d51028 100644 --- a/apps/dashboard/src/components/ResultTable.tsx +++ b/apps/dashboard/src/components/ResultTable.tsx @@ -152,7 +152,7 @@ function caseTrialTokenTotal(trial: EvalCaseTrial): number | undefined { } function caseTrialPath(trial: EvalCaseTrial, index = 0): string { - return trial.run_path ?? `run-${trial.attempt ?? index + 1}`; + return trial.attempt_path ?? trial.run_path ?? `attempt-${trial.attempt ?? index + 1}`; } function compactTokenBreakdown(result: EvalResult): string | undefined { @@ -711,15 +711,15 @@ function RepeatStatusCell({ return ( ); } function RepeatSummaryText({ group }: { group: RepeatRunGroup }) { const parts = [ - `${group.trialCount} runs`, - `${formatPercent(group.passRate)} run success`, + `${group.trialCount} attempts`, + `${formatPercent(group.passRate)} attempt success`, `${formatPercent(group.meanScore)} mean score`, group.assertionPassRate != null ? 
`${formatPercent(group.assertionPassRate)} assertions (${group.passedAssertions}/${group.assertionCount})` diff --git a/apps/dashboard/src/lib/result-table.test.ts b/apps/dashboard/src/lib/result-table.test.ts index 9e50afb1f..bf10edd64 100644 --- a/apps/dashboard/src/lib/result-table.test.ts +++ b/apps/dashboard/src/lib/result-table.test.ts @@ -127,8 +127,8 @@ describe('result-table model', () => { eval_path: 'evals/strict-layout.eval.yaml', target: 'openai', trials: [ - { attempt: 0, run_path: 'run-1', score: 1, verdict: 'pass' }, - { attempt: 1, run_path: 'run-2', score: 0.4, verdict: 'fail' }, + { attempt: 0, attempt_path: 'attempt-1', score: 1, verdict: 'pass' }, + { attempt: 1, attempt_path: 'attempt-2', score: 0.4, verdict: 'fail' }, ], }), ], diff --git a/apps/dashboard/src/lib/result-table.ts b/apps/dashboard/src/lib/result-table.ts index ee84e1bde..7e227c347 100644 --- a/apps/dashboard/src/lib/result-table.ts +++ b/apps/dashboard/src/lib/result-table.ts @@ -196,7 +196,7 @@ function numeric(values: readonly (number | undefined)[]): number[] { } function caseTrials(result: EvalResult): readonly EvalCaseTrial[] { - return result.trials ?? []; + return result.attempts ?? result.trials ?? 
[]; } function caseTrialPassed(trial: EvalCaseTrial, passThreshold: number): boolean { @@ -309,7 +309,9 @@ function buildRow( } function buildRepeatGroup(row: ResultTableRow, passThreshold: number): RepeatRunGroup | undefined { - const trials = caseTrials(row.result).filter((trial) => trial.run_path || trial.verdict); + const trials = caseTrials(row.result).filter( + (trial) => trial.attempt_path || trial.run_path || trial.verdict, + ); if (trials.length <= 1) return undefined; const passedTrials = trials.filter((trial) => caseTrialPassed(trial, passThreshold)).length; diff --git a/apps/dashboard/src/lib/types.ts b/apps/dashboard/src/lib/types.ts index 64a77c52f..2c7a8d8e0 100644 --- a/apps/dashboard/src/lib/types.ts +++ b/apps/dashboard/src/lib/types.ts @@ -93,6 +93,7 @@ export interface AssertionEntry { export interface EvalCaseTrial { attempt?: number; + attempt_path?: string; run_path?: string; score?: number; verdict?: string; @@ -240,6 +241,7 @@ export interface EvalResult { externalTrace?: CamelExternalTraceMetadata; metadata?: Record; source_traceability?: SourceTraceability; + attempts?: EvalCaseTrial[]; trials?: EvalCaseTrial[]; aggregation?: EvalTrialAggregation; result_dir?: string; diff --git a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/docs/evaluation/eval-files.mdx index 7391e2dc9..6290dba5c 100644 --- a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx +++ b/apps/web/src/content/docs/docs/evaluation/eval-files.mdx @@ -5,7 +5,7 @@ sidebar: order: 1 --- -Evaluation files define the test cases, graders, workspace lifecycle, and run controls for an evaluation run. The reserved `tags.experiment` key is the run/result grouping label, top-level `target` identifies the system under test, and fields such as `repeat`, `threshold`, `timeout_seconds`, `evaluate_options.budget_usd`, and `evaluate_options.max_concurrency` control repeated attempts and gates. 
Workspace reuse belongs under `workspace.isolation`; repository provenance belongs under `workspace.repos`; Docker/container binding belongs under `workspace.docker`. Non-provisioning setup commands belong in top-level `extensions`; reset policy stays under `workspace.hooks.after_each.reset`; runner-specific setup belongs in the `target` object or `targets.yaml`. AgentV supports two eval data formats: YAML and JSONL. +Evaluation files define the test cases, graders, workspace lifecycle, and run controls for an evaluation run. The reserved `tags.experiment` key is the run/result grouping label, top-level `target` identifies the system under test, and fields such as `evaluate_options.repeat`, `threshold`, `timeout_seconds`, `evaluate_options.budget_usd`, and `evaluate_options.max_concurrency` control repeated attempts and gates. Workspace reuse belongs under `workspace.isolation`; repository provenance belongs under `workspace.repos`; Docker/container binding belongs under `workspace.docker`. Non-provisioning setup commands belong in top-level `extensions`; reset policy stays under `workspace.hooks.after_each.reset`; runner-specific setup belongs in the `target` object or `targets.yaml`. AgentV supports two eval data formats: YAML and JSONL. YAML is the canonical portable model. TypeScript helpers, generated fixtures, and Python scripts should lower to the same YAML/JSONL shapes rather than inventing a separate eval contract. Eval files describe the task, target binding, and run controls. Use `evaluate_options.max_concurrency` for authored suite concurrency. Operators can still override concurrency with `--workers` or set defaults with `execution.workers` in `agentv.config.*` / `.agentv/config.yaml`; do not author legacy `workers` fields in eval YAML. @@ -24,8 +24,8 @@ experiment format. it with `imports.tests`, `tests: ./cases.yaml`, or string shorthand; parent suite context applies because raw cases do not carry their own suite context. 
- A **wrapper eval** is eval YAML that imports one or more suites with - `imports.suites` and binds run controls with top-level `target`, `repeat`, - `threshold`, `timeout_seconds`, and `evaluate_options`. + `imports.suites` and binds run controls with top-level `target`, `threshold`, + `timeout_seconds`, and `evaluate_options`. Wrapper evals can live anywhere in the repo. A wrapper that imports suites with `imports.suites` must not define parent `workspace`; imported suites own task environment. Machine-local existing workspace paths belong in CLI flags @@ -64,9 +64,10 @@ A wrapper eval stays ordinary eval YAML while choosing a target and run controls # experiments/refunds-codex.eval.yaml name: refunds-codex target: codex-gpt5 -repeat: - count: 2 - strategy: pass_any +evaluate_options: + repeat: + count: 2 + strategy: pass_any imports: suites: @@ -117,7 +118,7 @@ tests: | `category` | Optional slash-delimited analytics taxonomy path. Overrides the category derived from the eval file path. | | `target` | Named system under test from `.agentv/targets.yaml` or `--targets` | | `experiment` | Optional run/result grouping label | -| `repeat` | Optional repeat policy with `count`, `strategy`, and `early_exit` | +| `evaluate_options.repeat` | Optional repeat policy as a positive integer shorthand or object with `count`, `strategy`, `early_exit`, and `cost_limit_usd` | | `timeout_seconds` | Optional per-case timeout | | `evaluate_options` | Optional evaluation runtime options such as `budget_usd` and `max_concurrency` | | `threshold` | Optional suite quality threshold | diff --git a/apps/web/src/content/docs/docs/evaluation/experiments.mdx b/apps/web/src/content/docs/docs/evaluation/experiments.mdx index a02e1a331..ffa5ce1b8 100644 --- a/apps/web/src/content/docs/docs/evaluation/experiments.mdx +++ b/apps/web/src/content/docs/docs/evaluation/experiments.mdx @@ -8,8 +8,8 @@ sidebar: AgentV eval files are the runnable authoring artifact. 
Use top-level `description` for display metadata, `tags.experiment` as the run/result grouping label, `target` for the system under test, and flat top-level run controls such -as `repeat`, `timeout_seconds`, and `threshold`. Use `evaluate_options` for -evaluation runtime options such as `budget_usd` and `max_concurrency`. +as `timeout_seconds` and `threshold`. Use `evaluate_options` for evaluation +runtime options such as `repeat`, `budget_usd`, and `max_concurrency`. Use `agentv eval --workers N` or project config defaults such as `agentv.config.*` / `.agentv/config.yaml` `execution.workers` for operator-side overrides. @@ -23,11 +23,11 @@ target: extends: codex-gpt5 model: gpt-5.1 reasoning_effort: high -repeat: - count: 4 - strategy: pass_any timeout_seconds: 720 evaluate_options: + repeat: + count: 4 + strategy: pass_any budget_usd: 2.00 max_concurrency: 3 @@ -157,9 +157,10 @@ test.run > import run > parent top-level run controls ```yaml target: agent threshold: 0.8 -repeat: - count: 3 - strategy: pass_any +evaluate_options: + repeat: + count: 3 + strategy: pass_any imports: suites: @@ -205,7 +206,7 @@ target-specific runner state. 
| Configure an agent runner or provider variant | `target` object or `targets.yaml` | | Choose the target | top-level `target` | | Override the target's default model | `target.model` | -| Configure repeat policy, budget, concurrency, timeout, threshold | top-level `repeat`, `evaluate_options.budget_usd`, `evaluate_options.max_concurrency`, `timeout_seconds`, `threshold` | +| Configure repeat policy, budget, concurrency, timeout, threshold | `evaluate_options.repeat`, `evaluate_options.budget_usd`, `evaluate_options.max_concurrency`, `timeout_seconds`, `threshold` | | Bind an existing local workspace directory | `--workspace-path` or `.agentv/config.local.yaml` | ```yaml @@ -217,9 +218,10 @@ target: hooks: before_each: command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""] -repeat: - count: 3 - strategy: pass_any +evaluate_options: + repeat: + count: 3 + strategy: pass_any ``` Existing local workspace paths are machine-local bindings: pass @@ -230,18 +232,32 @@ top-level or case-level `workspace`. ## Repeat Runs -Use top-level `runs` when you want AgentV to try each case more than once: +Use `evaluate_options.repeat` when you want AgentV to try each case more than once: + +```yaml +evaluate_options: + repeat: 3 +``` + +Use object form when you need richer AgentV behavior: ```yaml -runs: 3 +evaluate_options: + repeat: + count: 3 + strategy: pass_any + early_exit: true + cost_limit_usd: 1.00 ``` -`repeat.strategy` controls verdict aggregation. `pass_any` treats the case as -successful when any completed attempt passes; `pass_all` requires every -completed attempt to pass. `mean` and `confidence_interval` aggregate scores -where supported today. `repeat.early_exit` is only a scheduling and cost -optimization: `pass_any` may stop at the first pass, and `pass_all` may stop at -the first fail. Leave it unset or `false` when you want complete variance data. +`evaluate_options.repeat.strategy` controls verdict aggregation. 
`pass_any` +treats the case as successful when any completed attempt passes; `pass_all` +requires every completed attempt to pass. `mean` and `confidence_interval` +aggregate scores where supported today. `evaluate_options.repeat.early_exit` is +only a scheduling and cost optimization: `pass_any` may stop at the first pass, +and `pass_all` may stop at the first fail. Leave it unset or `false` when you +want complete variance data. Per-case `tests[].options.repeat` overrides the +global repeat count or object for that case. ## Result Layout diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx index b13595cfd..3b58cd214 100644 --- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx +++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx @@ -132,7 +132,7 @@ my-results/ summary.json / summary.json - run-1/ + attempt-1/ result.json grading.json metrics.json diff --git a/apps/web/src/content/docs/docs/reference/result-artifacts.mdx b/apps/web/src/content/docs/docs/reference/result-artifacts.mdx index 77acc974f..fd4409548 100644 --- a/apps/web/src/content/docs/docs/reference/result-artifacts.mdx +++ b/apps/web/src/content/docs/docs/reference/result-artifacts.mdx @@ -35,7 +35,7 @@ The default local layout is: targets.yaml files/ graders/ - run-1/ + attempt-1/ result.json grading.json metrics.json @@ -45,7 +45,7 @@ The default local layout is: outputs/ answer.md file_changes.diff - run-2/ + attempt-2/ result.json grading.json metrics.json @@ -133,11 +133,11 @@ Example row: "duration_ms": 184200, "result_dir": "refund-eligibility--4f9a7c2d1b6e", "summary_path": "refund-eligibility--4f9a7c2d1b6e/summary.json", - "grading_path": "refund-eligibility--4f9a7c2d1b6e/run-1/grading.json", - "metrics_path": "refund-eligibility--4f9a7c2d1b6e/run-1/metrics.json", - "timing_path": "refund-eligibility--4f9a7c2d1b6e/run-1/timing.json", - "transcript_path": 
"refund-eligibility--4f9a7c2d1b6e/run-1/transcript.json", - "transcript_raw_path": "refund-eligibility--4f9a7c2d1b6e/run-1/transcript-raw.jsonl", + "grading_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/grading.json", + "metrics_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/metrics.json", + "timing_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/timing.json", + "transcript_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/transcript.json", + "transcript_raw_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/transcript-raw.jsonl", "transcript_summary": { "total_turns": 4, "tool_calls": { "file_read": 2, "shell": 1, "unknown": 0 }, @@ -148,9 +148,9 @@ Example row: "errors": [], "thinking_blocks": 1 }, - "output_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/answer.md", - "answer_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/answer.md", - "file_changes_path": "refund-eligibility--4f9a7c2d1b6e/run-1/outputs/file_changes.diff", + "output_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/outputs/answer.md", + "answer_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/outputs/answer.md", + "file_changes_path": "refund-eligibility--4f9a7c2d1b6e/attempt-1/outputs/file_changes.diff", "test_dir": "refund-eligibility--4f9a7c2d1b6e/test" } ``` @@ -167,12 +167,13 @@ each row and as `summary.json.metadata.tags`. Its reserved `experiment` key matches the row `experiment` field, so trend/compare views can group by `tags.experiment`. -The `run-1/`, `run-2/`, and later folders under a result directory are artifact -attempt/execution folders. Do not treat those folder names as the comparison -dimension. Repeated stochastic samples should be represented by explicit -metadata such as `sample_index` and `sample_count`; infrastructure retries -should use retry metadata such as `retry_index`, `retry_count`, and -`retry_reason` when available. +Use `repeat` for authoring configuration and `attempts` for produced +executions. 
The `attempt-1/`, `attempt-2/`, and later folders under a result +directory are artifact folders for those produced executions. Do not treat those +folder names as the comparison dimension. Repeated stochastic samples should be +represented by explicit metadata such as `sample_index` and `sample_count`; +infrastructure retries should use retry metadata such as `retry_index`, +`retry_count`, and `retry_reason` when available. ## Reader Rules diff --git a/apps/web/src/content/docs/docs/tools/results.mdx b/apps/web/src/content/docs/docs/tools/results.mdx index b5fbd43fd..69be64896 100644 --- a/apps/web/src/content/docs/docs/tools/results.mdx +++ b/apps/web/src/content/docs/docs/tools/results.mdx @@ -128,13 +128,13 @@ and metric-style graders; it is not canonical trace storage and does not carry token/cost usage. Every case uses aggregate `summary.json`, then stores execution artifact details -under `run-N/`. Each `run-N/` contains a compact per-attempt manifest +under `attempt-N/`. Each `attempt-N/` contains a compact per-attempt manifest `result.json`, `grading.json`, `metrics.json`, `timing.json`, `transcript.json`, `transcript-raw.jsonl`, `outputs/answer.md`, and `outputs/file_changes.diff` when workspace changes were captured. The `result.json` file carries AgentV `execution_status` and `verdict` fields plus `grading_path`, `metrics_path`, transcript, output, and `file_changes_path` -paths. Treat `run-N/` as an artifact attempt folder, not as a comparison +paths. Treat `attempt-N/` as an artifact attempt folder, not as a comparison dimension; stochastic samples and infrastructure retries should be represented with explicit sample/retry metadata rather than inferred from folder names. 
@@ -182,10 +182,10 @@ Agent Skills eval artifacts map into AgentV like this: | Agent Skills pattern | AgentV field | Artifact location | |----------------------|--------------|-------------------| | Authored `evals/evals.json` cases | AgentV eval cases and test bundle paths | Eval source plus optional `test_dir`, `eval_path`, `targets_path`, `files_path`, and `graders_path` in `index.jsonl` | -| Per-case answer | Generated target output artifact | `run-N/outputs/answer.md` | -| Per-attempt sidecars | Normalized transcript, metrics, and raw provider evidence | `run-N/transcript.json`, `run-N/transcript-raw.jsonl`, `run-N/metrics.json` | -| Per-attempt `timing.json` | Duration, token totals, cost, and usage source labels | `run-N/timing.json` | -| Per-attempt `grading.json` | Assertions, graders, execution metrics, workspace changes | `run-N/grading.json`; summary fields can reference the same trace/result facts | +| Per-case answer | Generated target output artifact | `attempt-N/outputs/answer.md` | +| Per-attempt sidecars | Normalized transcript, metrics, and raw provider evidence | `attempt-N/transcript.json`, `attempt-N/transcript-raw.jsonl`, `attempt-N/metrics.json` | +| Per-attempt `timing.json` | Duration, token totals, cost, and usage source labels | `attempt-N/timing.json` | +| Per-attempt `grading.json` | Assertions, graders, execution metrics, workspace changes | `attempt-N/grading.json`; summary fields can reference the same trace/result facts | | Iteration-level `summary.json` | Pass rate, time, tokens, tool calls, cost aggregates | Run-level `summary.json` | | Transcript/log outlier analysis | Normalized transcript, raw evidence, metrics, and optional external trace link | `transcript.json` for portable review; `transcript-raw.jsonl` for native evidence; `metrics.json` for behavior summaries; `external_trace` for link-out correlation | | Aggregate pass rate/time/tokens/delta | Run summaries and comparison tooling | `summary.json`, result comparisons, 
and projection bundles | diff --git a/examples/features/README.md b/examples/features/README.md index 632f10b4d..843cb9186 100644 --- a/examples/features/README.md +++ b/examples/features/README.md @@ -81,7 +81,7 @@ Focused examples for specific AgentV capabilities. Find your use case below, the | Example | Description | |---------|-------------| | [benchmark-tooling](benchmark-tooling/) | N-way benchmarking with `agentv compare` over completed runs | -| [trials](trials/) | Configure repeated attempts with `repeat` | +| [trials](trials/) | Configure repeated attempts with `evaluate_options.repeat` | | [trial-output-consistency](trial-output-consistency/) | Measure output consistency across trials using pairwise cosine similarity | | [compare](compare/) | Compare a run against a stored baseline | diff --git a/examples/features/benchmark-tooling/scripts/benchmark-report.ts b/examples/features/benchmark-tooling/scripts/benchmark-report.ts index c139080e7..79c15cc31 100644 --- a/examples/features/benchmark-tooling/scripts/benchmark-report.ts +++ b/examples/features/benchmark-tooling/scripts/benchmark-report.ts @@ -30,7 +30,7 @@ interface ResultRecord { target?: string; score: number; scores?: EvaluatorScore[]; - trials?: TrialRecord[]; + attempts?: TrialRecord[]; aggregation?: AggregationInfo; } @@ -198,8 +198,10 @@ function readResultFile(filePath: string, fallbackTarget: string): ResultRecord[ if (Array.isArray(raw.scores)) { record.scores = raw.scores as EvaluatorScore[]; } - if (Array.isArray(raw.trials)) { - record.trials = raw.trials as TrialRecord[]; + if (Array.isArray(raw.attempts)) { + record.attempts = raw.attempts as TrialRecord[]; + } else if (Array.isArray(raw.trials)) { + record.attempts = raw.trials as TrialRecord[]; } if (raw.aggregation && typeof raw.aggregation === 'object') { record.aggregation = raw.aggregation as AggregationInfo; diff --git a/examples/features/trials/README.md b/examples/features/trials/README.md index 95660ee04..3c37f7303 100644 
--- a/examples/features/trials/README.md
+++ b/examples/features/trials/README.md
@@ -1,7 +1,8 @@
 # Repeat Runs
 This example keeps the runnable contract in one eval file. Top-level `target`
-selects the system under test and top-level `repeat` configures repeated attempts.
+selects the system under test and `evaluate_options.repeat` configures repeated
+attempts.
 ## Files
@@ -13,13 +14,14 @@ selects the system under test and top-level `repeat` configures repeated attempt
 bun agentv eval examples/features/trials/evals/dataset.eval.yaml
 ```
-Edit `repeat.count` to change how many attempts AgentV makes for each case:
+Edit `evaluate_options.repeat.count` to change how many attempts AgentV makes
+for each case:
 ```yaml
-repeat:
-  count: 2
-  strategy: pass_any
-  early_exit: false
 evaluate_options:
+  repeat:
+    count: 2
+    strategy: pass_any
+    early_exit: false
   budget_usd: 1.00
 ```
diff --git a/examples/features/trials/evals/dataset.eval.baseline.jsonl b/examples/features/trials/evals/dataset.eval.baseline.jsonl
index 944cc64d9..e928711f7 100644
--- a/examples/features/trials/evals/dataset.eval.baseline.jsonl
+++ b/examples/features/trials/evals/dataset.eval.baseline.jsonl
@@ -1,2 +1,2 @@
-{"timestamp":"2026-02-20T21:40:25.928Z","test_id":"capital-knowledge","suite":"dataset","score":1,"target":"default","trials":[{"attempt":0,"score":1,"verdict":"pass"},{"attempt":1,"score":1,"verdict":"pass"}],"aggregation":{"strategy":"pass_any","passed_attempts":2,"total_attempts":2},"assertions":[{"text":"Correctly identifies Canberra as the capital of Australia","passed":true,"evidence":"The candidate answer provides the correct and complete information, fully matching the reference answer."}]}
-{"timestamp":"2026-02-20T21:40:26.593Z","test_id":"math-basics","suite":"dataset","score":1,"target":"default","trials":[{"attempt":0,"score":1,"verdict":"pass"},{"attempt":1,"score":1,"verdict":"pass"}],"aggregation":{"strategy":"pass_any","passed_attempts":2,"total_attempts":2},"assertions":[{"text":"Explains step-by-step reasoning","passed":true,"evidence":"The candidate answer breaks down the calculation clearly, explains each step, and arrives at the correct answer, matching the reference reasoning."},{"text":"Splits 15 into 10 and 5 for easier calculation","passed":true},{"text":"Calculates partial products (10\u00d77 and 5\u00d77)","passed":true},{"text":"Arrives at correct final answer (105)","passed":true}]}
+{"timestamp":"2026-02-20T21:40:25.928Z","test_id":"capital-knowledge","suite":"dataset","score":1,"target":"default","attempts":[{"attempt":0,"score":1,"verdict":"pass"},{"attempt":1,"score":1,"verdict":"pass"}],"aggregation":{"strategy":"pass_any","passed_attempts":2,"total_attempts":2},"assertions":[{"text":"Correctly identifies Canberra as the capital of Australia","passed":true,"evidence":"The candidate answer provides the correct and complete information, fully matching the reference answer."}]}
+{"timestamp":"2026-02-20T21:40:26.593Z","test_id":"math-basics","suite":"dataset","score":1,"target":"default","attempts":[{"attempt":0,"score":1,"verdict":"pass"},{"attempt":1,"score":1,"verdict":"pass"}],"aggregation":{"strategy":"pass_any","passed_attempts":2,"total_attempts":2},"assertions":[{"text":"Explains step-by-step reasoning","passed":true,"evidence":"The candidate answer breaks down the calculation clearly, explains each step, and arrives at the correct answer, matching the reference reasoning."},{"text":"Splits 15 into 10 and 5 for easier calculation","passed":true},{"text":"Calculates partial products (10\u00d77 and 5\u00d77)","passed":true},{"text":"Arrives at correct final answer (105)","passed":true}]}
diff --git a/examples/features/trials/evals/dataset.eval.yaml b/examples/features/trials/evals/dataset.eval.yaml
index 9f1c38b32..f57ca2224 100644
--- a/examples/features/trials/evals/dataset.eval.yaml
+++ b/examples/features/trials/evals/dataset.eval.yaml
@@ -5,11 +5,11 @@ name: trials
 description: Repeat runs example with 2 attempts configured inline
 target: llm
-repeat:
-  count: 2
-  strategy: pass_any
-  early_exit: false
 evaluate_options:
+  repeat:
+    count: 2
+    strategy: pass_any
+    early_exit: false
   budget_usd: 1.00
 tests:
diff --git a/examples/showcase/multi-model-benchmark/README.md b/examples/showcase/multi-model-benchmark/README.md
index 709de04f8..8ec4f3955 100644
--- a/examples/showcase/multi-model-benchmark/README.md
+++ b/examples/showcase/multi-model-benchmark/README.md
@@ -130,16 +130,16 @@ Weighted average formula: `(3×accuracy + 2×completeness + 1×clarity) / 6`
 ### 3. Repeat Runs
-Each test runs twice through top-level repeat controls. The repeated-attempt
+Each test runs twice through `evaluate_options.repeat`. The repeated-attempt
 aggregation below treats a case as successful when any completed attempt succeeds.
 ```yaml
-repeat:
-  count: 2
-  strategy: pass_any
-  early_exit: false
 evaluate_options:
+  repeat:
+    count: 2
+    strategy: pass_any
+    early_exit: false
   budget_usd: 2.00
 ```
@@ -204,14 +204,14 @@ assertions:
 ### Adjusting run count
-Increase `repeat.count` for more variability data (at proportional cost):
+Increase `evaluate_options.repeat.count` for more variability data (at proportional cost):
 ```yaml
-repeat:
-  count: 5
-  strategy: pass_any
-  early_exit: false
 evaluate_options:
+  repeat:
+    count: 5
+    strategy: pass_any
+    early_exit: false
   budget_usd: 5.00
 ```
diff --git a/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml b/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml
index 7f51ddd98..2ed014927 100644
--- a/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml
+++ b/examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml
@@ -16,11 +16,11 @@ description: Multi-model benchmark — accuracy, completeness, and clarity acros
 tags: [multi-provider]
 target: copilot
-repeat:
-  count: 2
-  strategy: pass_any
-  early_exit: false
 evaluate_options:
+  repeat:
+    count: 2
+    strategy: pass_any
+    early_exit: false
   budget_usd: 2.00
 assertions:
diff --git a/packages/core/src/evaluation/experiment.ts b/packages/core/src/evaluation/experiment.ts
index e954d3e6e..eedb58b94 100644
--- a/packages/core/src/evaluation/experiment.ts
+++ b/packages/core/src/evaluation/experiment.ts
@@ -25,6 +25,8 @@ export type ExperimentRepeatWire = {
   readonly cost_limit_usd?: number;
 };
+export type ExperimentRepeatInput = number | ExperimentRepeatWire;
+
 export type ExperimentRepeat = {
   readonly count: number;
   readonly strategy: TrialStrategy;
@@ -39,7 +41,7 @@ export type ExperimentConfigWire = {
   readonly targets?: readonly ExperimentTargetRefWire[];
   readonly model?: string;
   readonly agent_options?: Record;
-  readonly repeat?: ExperimentRepeatWire;
+  readonly repeat?: ExperimentRepeatInput;
   readonly timeout_seconds?: number;
   readonly threshold?:
number; readonly budget_usd?: number; @@ -220,8 +222,14 @@ function readRepeat(raw: unknown): ExperimentRepeat | undefined { if (raw === undefined) { return undefined; } + if (typeof raw === 'number') { + return { + count: readRequiredPositiveInteger(raw, 'repeat'), + strategy: 'pass_any', + }; + } if (!isRecord(raw)) { - throw new Error('Experiment repeat must be an object.'); + throw new Error('Experiment repeat must be a positive integer or object.'); } for (const key of Object.keys(raw)) { if (!REPEAT_FIELDS.has(key)) { @@ -322,12 +330,12 @@ function readOptionalRepeatStrategy(raw: unknown): TrialStrategy | undefined { function rejectLegacyTopLevelRepeatFields(rawConfig: Record): void { if (rawConfig.runs !== undefined) { throw new Error( - "Experiment top-level 'runs' has been removed. Use repeat.count and repeat.strategy instead.", + "Experiment top-level 'runs' has been removed. Use evaluate_options.repeat.count and evaluate_options.repeat.strategy instead.", ); } if (rawConfig.early_exit !== undefined || rawConfig.earlyExit !== undefined) { throw new Error( - "Experiment top-level 'early_exit' has been removed. Use repeat.early_exit instead.", + "Experiment top-level 'early_exit' has been removed. Use evaluate_options.repeat.early_exit instead.", ); } } diff --git a/packages/core/src/evaluation/loaders/config-loader.ts b/packages/core/src/evaluation/loaders/config-loader.ts index ab45dbe00..c4374d0ed 100644 --- a/packages/core/src/evaluation/loaders/config-loader.ts +++ b/packages/core/src/evaluation/loaders/config-loader.ts @@ -275,7 +275,7 @@ function rejectAuthoredRuntimeContainers(suite: JsonObject): void { } if (suite.policy !== undefined) { throw new Error( - "Top-level 'policy' is not part of eval YAML. Put repeat, timeout_seconds, and threshold at the top level, and budget_usd under evaluate_options.", + "Top-level 'policy' is not part of eval YAML. 
Put repeat under evaluate_options.repeat, timeout_seconds and threshold at the top level, and budget_usd under evaluate_options.", ); } if (suite.execution !== undefined) { diff --git a/packages/core/src/evaluation/orchestrator.ts b/packages/core/src/evaluation/orchestrator.ts index 29518ee3a..f4d002695 100644 --- a/packages/core/src/evaluation/orchestrator.ts +++ b/packages/core/src/evaluation/orchestrator.ts @@ -809,11 +809,11 @@ export async function runEvaluation( } = options; const repoRootPath = pathFromRoot(repoRoot); - // Disable cache when trials > 1 (cache makes trials deterministic = pointless) + // Disable cache when repeat count > 1 (cached responses would make attempts deterministic). let useCache = options.useCache; if (trials && trials.count > 1 && useCache) { console.warn( - 'Warning: Caching is disabled when trials.count > 1 (cached responses would make trials deterministic).', + 'Warning: Caching is disabled when evaluate_options.repeat.count > 1 (cached responses would make attempts deterministic).', ); useCache = false; } @@ -875,9 +875,11 @@ export async function runEvaluation( typeof primaryProvider.invokeBatch === 'function'; let batchingDisabledByRuntimePolicy = false; - // Disable batch mode when trials > 1 (batch processes all cases at once, incompatible with per-case retries) + // Disable batch mode when repeat count > 1 (batching is incompatible with separate attempts). if (trials && trials.count > 1 && providerSupportsBatch) { - console.warn('Warning: Batch mode is disabled when trials.count > 1. Using per-case dispatch.'); + console.warn( + 'Warning: Batch mode is disabled when evaluate_options.repeat.count > 1. 
Using per-case dispatch for attempts.', + ); providerSupportsBatch = false; batchingDisabledByRuntimePolicy = true; } @@ -2449,7 +2451,7 @@ async function runEvalCaseWithTrials( cumulativeCost += trialCost; } else if (trialsConfig.costLimitUsd && !costWarningEmitted) { console.warn( - 'Warning: cost_limit_usd is set but provider does not report cost. All trials will run.', + 'Warning: cost_limit_usd is set but provider does not report cost. All attempts will run.', ); costWarningEmitted = true; } diff --git a/packages/core/src/evaluation/run-artifacts.ts b/packages/core/src/evaluation/run-artifacts.ts index f8a6fc941..bbc1ec727 100644 --- a/packages/core/src/evaluation/run-artifacts.ts +++ b/packages/core/src/evaluation/run-artifacts.ts @@ -347,13 +347,13 @@ export interface GradingArtifact { readonly turns: number; readonly conversation_id: string; }; - readonly trials?: readonly TrialResultArtifact[]; + readonly attempts?: readonly TrialResultArtifact[]; readonly aggregation?: TrialAggregationArtifact; } export type TrialResultArtifact = { readonly attempt: number; - readonly run_path?: string; + readonly attempt_path?: string; readonly score: number; readonly verdict: string; readonly scores?: IndexArtifactEntry['scores']; @@ -488,7 +488,7 @@ export interface IndexArtifactEntry { readonly start_time?: string; readonly end_time?: string; readonly scores?: readonly Record[]; - readonly trials?: readonly TrialResultArtifact[]; + readonly attempts?: readonly TrialResultArtifact[]; readonly aggregation?: TrialAggregationArtifact; readonly execution_status?: string; readonly error?: string; @@ -582,8 +582,8 @@ export interface AgentVRunResultArtifact { } export interface RepeatCaseSummaryArtifact { - readonly total_runs: number; - readonly passed_runs: number; + readonly total_attempts: number; + readonly passed_attempts: number; readonly pass_rate: string; readonly mean_duration_ms: number; readonly mean_duration_seconds: number; @@ -730,8 +730,8 @@ function 
toIndexScores(scores: readonly GraderResult[] | undefined): IndexArtifa return scores?.map(toIndexScore) as IndexArtifactEntry['scores']; } -function trialRunDirName(attempt: number): string { - return `run-${attempt + 1}`; +function attemptDirName(attempt: number): string { + return `attempt-${attempt + 1}`; } function hasPersistedTrialRuns(result: EvaluationResult): boolean { @@ -753,7 +753,7 @@ function toTrialArtifacts( } return trials.map((trial) => ({ attempt: trial.attempt, - run_path: trial.result ? trialRunDirName(trial.attempt) : undefined, + attempt_path: trial.result ? attemptDirName(trial.attempt) : undefined, score: trial.score, verdict: trial.verdict, scores: toIndexScores(trial.scores), @@ -925,8 +925,8 @@ function buildRepeatCaseSummaryArtifact( const meanDurationMs = timing.mean_duration_ms ?? fallbackMeanMs; return { - total_runs: totalRuns, - passed_runs: passedRuns, + total_attempts: totalRuns, + passed_attempts: passedRuns, pass_rate: formatRepeatPassRate(passedRuns, totalRuns), mean_duration_ms: meanDurationMs, mean_duration_seconds: timing.mean_duration_seconds ?? roundSecondsFromMs(meanDurationMs), @@ -1050,7 +1050,7 @@ async function writeTrialRunArtifacts(params: { return; } - const runDirName = trialRunDirName(params.trial.attempt); + const runDirName = attemptDirName(params.trial.attempt); const runDir = path.join(params.parentTestDir, runDirName); const grading = buildGradingArtifact(result, { includeTrials: false }); const timing = buildTimingArtifact([result]); @@ -1219,7 +1219,7 @@ export function buildGradingArtifact( conversation_id: result.conversationId, } : undefined, - trials: includeTrials ? toIndexTrialArtifacts(result) : undefined, + attempts: includeTrials ? toIndexTrialArtifacts(result) : undefined, aggregation: includeTrials ? 
toTrialAggregationArtifact(result.aggregation) : undefined, }; } @@ -1765,7 +1765,7 @@ export function buildIndexArtifactEntry( start_time: result.startTime, end_time: result.endTime, scores: toIndexScores(result.scores), - trials: toIndexTrialArtifacts(result), + attempts: toIndexTrialArtifacts(result), aggregation: toTrialAggregationArtifact(result.aggregation), execution_status: result.executionStatus, error: result.error, @@ -1838,7 +1838,7 @@ export function buildResultIndexArtifact( const hasFileChanges = result.fileChanges !== undefined && result.fileChanges.length > 0; const hasTranscript = resultHasExecutionTraceTranscript(result); const isSingleRun = !hasPersistedTrialRuns(result); - const singleRunDir = path.posix.join(artifactSubdir, trialRunDirName(0)); + const singleRunDir = path.posix.join(artifactSubdir, attemptDirName(0)); return { timestamp: result.timestamp, @@ -1855,7 +1855,7 @@ export function buildResultIndexArtifact( start_time: result.startTime, end_time: result.endTime, scores: toIndexScores(result.scores), - trials: toIndexTrialArtifacts(result), + attempts: toIndexTrialArtifacts(result), aggregation: toTrialAggregationArtifact(result.aggregation), execution_status: result.executionStatus, error: result.error, @@ -2351,7 +2351,7 @@ export async function writePerTestArtifacts( } const isSingleRun = !hasPersistedTrialRuns(result); - const singleRunDir = path.join(testDir, trialRunDirName(0)); + const singleRunDir = path.join(testDir, attemptDirName(0)); const singleAnswerPath = isSingleRun && result.output.length > 0 ? 
path.join(singleRunDir, 'outputs', 'answer.md') @@ -2467,7 +2467,7 @@ export async function writeArtifactsFromResults( const caseSummaryPath = path.join(testDir, RUN_SUMMARY_FILENAME); const identityId = projectionIdentity.id; const isSingleRun = !hasPersistedTrialRuns(result); - const singleRunDir = path.join(testDir, trialRunDirName(0)); + const singleRunDir = path.join(testDir, attemptDirName(0)); const singleAnswerPath = isSingleRun && result.output.length > 0 ? path.join(singleRunDir, 'outputs', 'answer.md') diff --git a/packages/core/src/evaluation/types.ts b/packages/core/src/evaluation/types.ts index 806d86146..0c2502a47 100644 --- a/packages/core/src/evaluation/types.ts +++ b/packages/core/src/evaluation/types.ts @@ -1072,7 +1072,7 @@ export interface DependencyResult { export type TrialStrategy = 'pass_any' | 'pass_all' | 'mean' | 'confidence_interval'; /** - * Configuration for running multiple trials per eval case. + * Configuration for running repeated attempts per eval case. */ export interface TrialsConfig { readonly count: number; @@ -1083,7 +1083,7 @@ export interface TrialsConfig { } /** - * Result of a single trial attempt. + * Result of a single produced attempt. */ export interface TrialResult { readonly attempt: number; @@ -1092,7 +1092,7 @@ export interface TrialResult { readonly scores?: readonly GraderResult[]; readonly error?: string; readonly costUsd?: number; - /** Primary classification for this trial attempt */ + /** Primary classification for this attempt */ readonly executionStatus?: ExecutionStatus; /** Pipeline stage where failure occurred */ readonly failureStage?: FailureStage; @@ -1100,7 +1100,7 @@ export interface TrialResult { readonly failureReasonCode?: string; /** * Full per-attempt result used by artifact writers to materialize AgentV - * run-N folders. This is intentionally omitted from wire trial summaries. + * attempt-N folders. This is intentionally omitted from compact wire summaries. 
*/ readonly result?: EvaluationResult; } @@ -1250,11 +1250,11 @@ export interface EvaluationResult { readonly afterEachOutput?: string; /** Unified diff of workspace file changes */ readonly fileChanges?: string; - /** Individual trial results (only present when trials.count > 1) */ + /** Individual attempt results (only present when evaluate_options.repeat.count > 1) */ readonly trials?: readonly TrialResult[]; - /** Aggregation metadata describing how the final score was computed from trials */ + /** Aggregation metadata describing how the final score was computed from attempts */ readonly aggregation?: TrialAggregation; - /** Whether the trial loop was terminated early due to cost limit */ + /** Whether the attempt loop was terminated early due to cost limit */ readonly costLimited?: boolean; /** Whether the evaluation was skipped due to suite-level budget exhaustion */ readonly budgetExceeded?: boolean; diff --git a/packages/core/src/evaluation/validation/eval-file.schema.ts b/packages/core/src/evaluation/validation/eval-file.schema.ts index 021cece7d..55bdb0ffa 100644 --- a/packages/core/src/evaluation/validation/eval-file.schema.ts +++ b/packages/core/src/evaluation/validation/eval-file.schema.ts @@ -736,7 +736,7 @@ export const EvalFileSchema: z.ZodType = z model: z.never().optional(), // Run/result grouping label and flat run controls experiment: z.string().min(1).optional(), - repeat: ExperimentRepeatSchema.optional(), + repeat: z.never().optional(), runs: z.never().optional(), early_exit: z.never().optional(), timeout_seconds: z.number().gt(0).optional(), diff --git a/packages/core/src/evaluation/validation/eval-validator.ts b/packages/core/src/evaluation/validation/eval-validator.ts index 0edc477ef..ec0b276d0 100644 --- a/packages/core/src/evaluation/validation/eval-validator.ts +++ b/packages/core/src/evaluation/validation/eval-validator.ts @@ -160,7 +160,7 @@ const REMOVED_TOP_LEVEL_FIELDS = new Map([ ['model', "Top-level 'model' is not part of eval 
YAML. Put model inside the target object."], [ 'policy', - "Top-level 'policy' is not part of eval YAML. Put repeat, timeout_seconds, and threshold at the top level, and budget_usd under evaluate_options.", + "Top-level 'policy' is not part of eval YAML. Put repeat under evaluate_options.repeat, timeout_seconds and threshold at the top level, and budget_usd under evaluate_options.", ], [ 'execution', @@ -170,8 +170,12 @@ const REMOVED_TOP_LEVEL_FIELDS = new Map([ 'providers', "Top-level 'providers' is not a runtime alias in AgentV eval YAML. Use 'targets' for systems under test; provider names backend kind inside each target.", ], - ['runs', "Top-level 'runs' has been removed. Use repeat.count instead."], - ['early_exit', "Top-level 'early_exit' has been removed. Use repeat.early_exit instead."], + ['repeat', "Top-level 'repeat' has been removed. Use evaluate_options.repeat instead."], + ['runs', "Top-level 'runs' has been removed. Use evaluate_options.repeat.count instead."], + [ + 'early_exit', + "Top-level 'early_exit' has been removed. Use evaluate_options.repeat.early_exit instead.", + ], ]); /** Deprecated top-level fields with migration hints. 
*/ @@ -391,7 +395,6 @@ export async function validateEvalFile(filePath: string): Promise { await rm(tempDir, { recursive: true, force: true }); }); - it('parses top-level target and repeat controls as the canonical runtime block', async () => { + it('parses evaluate_options.repeat object as the canonical runtime block', async () => { const evalPath = path.join(tempDir, 'runtime.eval.yaml'); await writeFile( evalPath, @@ -29,10 +29,11 @@ describe('eval.yaml flat runtime controls and tests imports', () => { ' model: gpt-5.1', ' reasoning_effort: high', 'threshold: 0.7', - 'repeat:', - ' count: 2', - ' strategy: pass_any', - ' early_exit: true', + 'evaluate_options:', + ' repeat:', + ' count: 2', + ' strategy: pass_any', + ' early_exit: true', 'timeout_seconds: 30', 'budget_usd: 1.5', 'tests:', @@ -65,6 +66,28 @@ describe('eval.yaml flat runtime controls and tests imports', () => { expect(suite.targets).toBeUndefined(); }); + it('parses evaluate_options.repeat number shorthand', async () => { + const evalPath = path.join(tempDir, 'runtime-repeat-shorthand.eval.yaml'); + await writeFile( + evalPath, + [ + 'name: runtime-repeat-shorthand', + 'target: codex', + 'evaluate_options:', + ' repeat: 3', + 'tests:', + ' - id: one', + ' input: hello', + ' criteria: ok', + '', + ].join('\n'), + ); + + const suite = await loadTestSuite(evalPath, tempDir); + + expect(suite.experimentConfig?.repeat).toEqual({ count: 3, strategy: 'pass_any' }); + }); + it('parses default_test.threshold separately from legacy top-level threshold', async () => { const evalPath = path.join(tempDir, 'default-test-threshold.eval.yaml'); await writeFile( @@ -247,12 +270,14 @@ describe('eval.yaml flat runtime controls and tests imports', () => { ); }); - it('rejects removed top-level runs and early_exit controls', async () => { + it('rejects removed top-level repeat controls', async () => { const evalPath = path.join(tempDir, 'removed-repeat-controls.eval.yaml'); await writeFile( evalPath, [ 'target: 
codex', + 'repeat:', + ' count: 2', 'runs: 2', 'early_exit: true', 'tests:', @@ -263,7 +288,7 @@ describe('eval.yaml flat runtime controls and tests imports', () => { ].join('\n'), ); - await expect(loadTestSuite(evalPath, tempDir)).rejects.toThrow(/repeat.count/); + await expect(loadTestSuite(evalPath, tempDir)).rejects.toThrow(/evaluate_options\.repeat/); }); it('rejects top-level execution blocks and non-string experiment values', async () => { @@ -683,9 +708,10 @@ describe('eval.yaml flat runtime controls and tests imports', () => { 'name: parent-suite', 'target: parent-target', 'threshold: 0.8', - 'repeat:', - ' count: 3', - ' strategy: pass_any', + 'evaluate_options:', + ' repeat:', + ' count: 3', + ' strategy: pass_any', 'timeout_seconds: 30', 'budget_usd: 1.5', 'input: parent shared input', @@ -716,6 +742,55 @@ describe('eval.yaml flat runtime controls and tests imports', () => { expect(test.assertions?.[0]).toMatchObject({ value: 'child' }); }); + it('applies tests[].options.repeat over the global repeat object', async () => { + const evalPath = path.join(tempDir, 'test-options-repeat.eval.yaml'); + await writeFile( + evalPath, + [ + 'name: test-options-repeat', + 'target: codex', + 'evaluate_options:', + ' repeat:', + ' count: 4', + ' strategy: pass_all', + 'tests:', + ' - id: global-repeat', + ' input: hello', + ' criteria: ok', + ' - id: case-repeat-count', + ' input: hello', + ' criteria: ok', + ' options:', + ' repeat: 2', + ' - id: case-repeat-object', + ' input: hello', + ' criteria: ok', + ' run:', + ' threshold: 0.9', + ' options:', + ' repeat:', + ' count: 3', + ' strategy: mean', + ' early_exit: false', + '', + ].join('\n'), + ); + + const suite = await loadTestSuite(evalPath, tempDir); + const byId = new Map(suite.tests.map((test) => [test.id, test])); + + expect(suite.experimentConfig?.repeat).toEqual({ count: 4, strategy: 'pass_all' }); + expect(byId.get('global-repeat')?.run).toBeUndefined(); + 
expect(byId.get('case-repeat-count')?.run?.repeat).toEqual({ + count: 2, + strategy: 'pass_any', + }); + expect(byId.get('case-repeat-object')?.run).toMatchObject({ + threshold: 0.9, + repeat: { count: 3, strategy: 'mean', earlyExit: false }, + }); + }); + it('rejects parent workspace when importing eval suites with type: suite', async () => { await writeFile( path.join(tempDir, 'child.eval.yaml'), diff --git a/packages/core/test/evaluation/evaluate-programmatic-api.test.ts b/packages/core/test/evaluation/evaluate-programmatic-api.test.ts index 7485f8c2b..fa347b6b6 100644 --- a/packages/core/test/evaluation/evaluate-programmatic-api.test.ts +++ b/packages/core/test/evaluation/evaluate-programmatic-api.test.ts @@ -162,11 +162,11 @@ describe('evaluate() — programmatic API extensions', () => { expect(summaryArtifact.timing.duration_ms).toBeGreaterThanOrEqual(0); expect(resultDir).toMatch(/^programmatic-artifacts--[a-f0-9]{12}$/); - expect(existsSync(path.join(outputDir, resultDir ?? '', 'run-1', 'grading.json'))).toBe( + expect(existsSync(path.join(outputDir, resultDir ?? '', 'attempt-1', 'grading.json'))).toBe( true, ); expect( - existsSync(path.join(outputDir, resultDir ?? '', 'run-1', 'outputs', 'answer.md')), + existsSync(path.join(outputDir, resultDir ?? 
'', 'attempt-1', 'outputs', 'answer.md')), ).toBe(true); } finally { rmSync(outputDir, { recursive: true, force: true }); diff --git a/packages/core/test/evaluation/experiment.test.ts b/packages/core/test/evaluation/experiment.test.ts index 1d9392e0a..3bdb8a2a2 100644 --- a/packages/core/test/evaluation/experiment.test.ts +++ b/packages/core/test/evaluation/experiment.test.ts @@ -63,6 +63,17 @@ describe('inline experiment config', () => { }); }); + it('normalizes repeat number shorthand', () => { + const config = normalizeExperimentConfig({ + repeat: 3, + }); + + expect(config.repeat).toEqual({ + count: 3, + strategy: 'pass_any', + }); + }); + it('rejects invalid run counts', () => { expect(() => normalizeExperimentConfig({ runs: 3 })).toThrow(/repeat.count/); expect(() => normalizeExperimentConfig({ early_exit: true })).toThrow(/repeat.early_exit/); diff --git a/packages/core/test/evaluation/orchestrator.test.ts b/packages/core/test/evaluation/orchestrator.test.ts index 4acf92dd1..60d1951d4 100644 --- a/packages/core/test/evaluation/orchestrator.test.ts +++ b/packages/core/test/evaluation/orchestrator.test.ts @@ -1,4 +1,4 @@ -import { afterEach, describe, expect, it, mock } from 'bun:test'; +import { afterEach, describe, expect, it, mock, spyOn } from 'bun:test'; import { existsSync, mkdirSync, @@ -769,7 +769,7 @@ console.log('spreadsheet: revenue,total\\nQ1,42');`, .map((line) => JSON.parse(line) as Record); const resultDir = indexRows[0]?.result_dir; expect(resultDir).toMatch(/^case-1--[a-f0-9]{12}$/); - const runDir = path.join(outputDir, resultDir ?? '', 'run-1'); + const runDir = path.join(outputDir, resultDir ?? 
'', 'attempt-1'); const outputsDir = path.join(runDir, 'outputs'); expect(readdirSync(runDir)).not.toContain('provider.log'); expect(readdirSync(runDir)).toContain('transcript-raw.jsonl'); @@ -782,8 +782,8 @@ console.log('spreadsheet: revenue,total\\nQ1,42');`, expect(indexRows[0]?.raw_provider_log_path).toBeUndefined(); expect(indexRows[0]?.trace_path).toBeUndefined(); - expect(indexRows[0]?.transcript_path).toBe(`${resultDir}/run-1/transcript.json`); - expect(indexRows[0]?.transcript_raw_path).toBe(`${resultDir}/run-1/transcript-raw.jsonl`); + expect(indexRows[0]?.transcript_path).toBe(`${resultDir}/attempt-1/transcript.json`); + expect(indexRows[0]?.transcript_raw_path).toBe(`${resultDir}/attempt-1/transcript-raw.jsonl`); expect(existsSync(rawLogPath)).toBe(false); }); @@ -905,6 +905,56 @@ console.log('spreadsheet: revenue,total\\nQ1,42');`, } }); + it('disables provider batching when repeat attempts are configured', async () => { + class BatchCapableProvider implements Provider { + readonly id = 'batch:repeat'; + readonly kind = 'mock' as const; + readonly targetName = 'repeat'; + readonly supportsBatch = true; + invokeCalls = 0; + batchCalls = 0; + + async invoke(): Promise { + this.invokeCalls += 1; + return { + output: [{ role: 'assistant', content: 'OK' }], + }; + } + + async invokeBatch(): Promise { + this.batchCalls += 1; + throw new Error('batch should not be used for repeat attempts'); + } + } + + const provider = new BatchCapableProvider(); + const warnSpy = spyOn(console, 'warn').mockImplementation(() => {}); + try { + const results = await runEvaluation({ + testFilePath: 'in-memory.yaml', + repoRoot: 'in-memory', + target: { + ...baseTarget, + providerBatching: true, + workers: 1, + }, + providerFactory: () => provider, + evaluators: evaluatorRegistry, + evalCases: [baseTestCase], + trials: { count: 2, strategy: 'pass_any' }, + }); + + expect(results).toHaveLength(1); + expect(provider.batchCalls).toBe(0); + expect(provider.invokeCalls).toBe(2); + 
expect(warnSpy).toHaveBeenCalledWith( + 'Warning: Batch mode is disabled when evaluate_options.repeat.count > 1. Using per-case dispatch for attempts.', + ); + } finally { + warnSpy.mockRestore(); + } + }); + it('uses a custom evaluator prompt when provided', async () => { const directory = mkdtempSync(path.join(tmpdir(), 'agentv-custom-grader-')); const promptPath = path.join(directory, 'grader-prompt.md'); @@ -1936,7 +1986,7 @@ describe('runEvaluation with trials', () => { expect(result.costLimited).toBeUndefined(); }); - it('disables cache when trials > 1', async () => { + it('disables cache when repeat attempts are configured', async () => { const provider = new MultiCallProvider(); const evalRegistry = createScoringEvaluator([0.5, 0.9]); const trials: TrialsConfig = { count: 2, strategy: 'pass_any' }; @@ -1948,21 +1998,29 @@ describe('runEvaluation with trials', () => { async set() {}, }; - const results = await runEvaluation({ - testFilePath: 'in-memory.yaml', - repoRoot: 'in-memory', - target: baseTarget, - providerFactory: () => provider, - evaluators: evalRegistry, - evalCases: [baseTestCase], - trials, - cache, - useCache: true, // Should be overridden to false - }); + const warnSpy = spyOn(console, 'warn').mockImplementation(() => {}); + try { + const results = await runEvaluation({ + testFilePath: 'in-memory.yaml', + repoRoot: 'in-memory', + target: baseTarget, + providerFactory: () => provider, + evaluators: evalRegistry, + evalCases: [baseTestCase], + trials, + cache, + useCache: true, // Should be overridden to false + }); - // Provider should have been called for each trial (cache disabled) - expect(provider.callCount).toBe(2); - expect(results[0].trials).toHaveLength(2); + // Provider should have been called for each attempt (cache disabled) + expect(provider.callCount).toBe(2); + expect(results[0].trials).toHaveLength(2); + expect(warnSpy).toHaveBeenCalledWith( + 'Warning: Caching is disabled when evaluate_options.repeat.count > 1 (cached 
responses would make attempts deterministic).', + ); + } finally { + warnSpy.mockRestore(); + } }); }); diff --git a/packages/core/test/evaluation/validation/eval-file-schema.test.ts b/packages/core/test/evaluation/validation/eval-file-schema.test.ts index 5f9a83b11..08e5a67f3 100644 --- a/packages/core/test/evaluation/validation/eval-file-schema.test.ts +++ b/packages/core/test/evaluation/validation/eval-file-schema.test.ts @@ -89,7 +89,7 @@ describe('EvalFileSchema input shorthand', () => { expect(result.success).toBe(true); }); - it('accepts top-level target object and repeat runtime controls with include selection entries', () => { + it('accepts top-level target object and evaluate_options repeat controls with include selection entries', () => { const result = EvalFileSchema.safeParse({ name: 'wrapper', description: 'Wrapper eval', @@ -100,17 +100,23 @@ describe('EvalFileSchema input shorthand', () => { reasoning_effort: 'high', }, threshold: 0.8, - repeat: { - count: 2, - strategy: 'pass_any', - early_exit: true, - }, timeout_seconds: 300, evaluate_options: { budget_usd: 2, max_concurrency: 3, + repeat: { + count: 2, + strategy: 'pass_any', + early_exit: true, + }, }, tests: [ + { + ...baseTest, + options: { + repeat: 3, + }, + }, { include: './evals/**/*.eval.yaml', type: 'suite', diff --git a/packages/core/test/evaluation/validation/eval-validator.test.ts b/packages/core/test/evaluation/validation/eval-validator.test.ts index 97236de02..d540f695a 100644 --- a/packages/core/test/evaluation/validation/eval-validator.test.ts +++ b/packages/core/test/evaluation/validation/eval-validator.test.ts @@ -44,10 +44,10 @@ threshold: 0.8 evaluate_options: budget_usd: 2 max_concurrency: 3 -repeat: - count: 2 - strategy: pass_any - early_exit: true + repeat: + count: 2 + strategy: pass_any + early_exit: true tests: - id: local-case input: "Hello" @@ -184,6 +184,10 @@ tests: - description: fixed output row vars: diff: change + options: + repeat: + count: 3 + strategy: mean 
provider_output: "Looks safe." assert: - type: contains @@ -253,11 +257,13 @@ tests: ).toBe(true); }); - it('rejects removed top-level runs and early_exit with migration guidance', async () => { + it('rejects removed top-level repeat controls with migration guidance', async () => { const filePath = path.join(tempDir, 'removed-repeat-fields.yaml'); await writeFile( filePath, `target: codex +repeat: + count: 2 runs: 2 early_exit: true tests: @@ -269,10 +275,17 @@ tests: const result = await validateEvalFile(filePath); expect(result.valid).toBe(false); - expect(result.errors.some((error) => error.message.includes('Use repeat.count'))).toBe(true); - expect(result.errors.some((error) => error.message.includes('Use repeat.early_exit'))).toBe( - true, - ); + expect( + result.errors.some((error) => error.message.includes('Use evaluate_options.repeat')), + ).toBe(true); + expect( + result.errors.some((error) => error.message.includes('Use evaluate_options.repeat.count')), + ).toBe(true); + expect( + result.errors.some((error) => + error.message.includes('Use evaluate_options.repeat.early_exit'), + ), + ).toBe(true); }); it('rejects unsupported test-level execution.targets', async () => { diff --git a/packages/sdk/src/eval.ts b/packages/sdk/src/eval.ts index c31e7866c..b8323467d 100644 --- a/packages/sdk/src/eval.ts +++ b/packages/sdk/src/eval.ts @@ -261,8 +261,8 @@ function lowerEvalYamlValue(value: unknown): unknown { function lowerEvalDefinition(definition: unknown): Record { const lowered = lowerEvalYamlValue(definition) as Record; - const { budget_usd: budgetUsd, ...loweredWithoutBudget } = lowered; - if (budgetUsd === undefined) { + const { budget_usd: budgetUsd, repeat, ...loweredWithoutRuntimeOptions } = lowered; + if (budgetUsd === undefined && repeat === undefined) { return lowered; } @@ -273,11 +273,14 @@ function lowerEvalDefinition(definition: unknown): Record { ? 
{ ...(lowered.evaluate_options as Record) } : {}; - if (evaluateOptions.budget_usd === undefined) { + if (budgetUsd !== undefined && evaluateOptions.budget_usd === undefined) { evaluateOptions.budget_usd = budgetUsd; } + if (repeat !== undefined && evaluateOptions.repeat === undefined) { + evaluateOptions.repeat = repeat; + } return { - ...loweredWithoutBudget, + ...loweredWithoutRuntimeOptions, evaluate_options: evaluateOptions, }; } @@ -319,7 +322,7 @@ function validateTopLevelRuntimeFields(definition: EvalDefinition): void { for (const field of ['model', 'policy', 'execution', 'runs', 'earlyExit']) { if (Object.prototype.hasOwnProperty.call(rawDefinition, field)) { throw new Error( - `defineEval() does not accept top-level '${field}'. Put target overrides in target and repeat controls under repeat.`, + `defineEval() does not accept top-level '${field}'. Put target overrides in target and repeat controls under repeat, which serializes to evaluate_options.repeat.`, ); } } diff --git a/packages/sdk/test/eval-authoring.test.ts b/packages/sdk/test/eval-authoring.test.ts index e71c4fff9..d483b5166 100644 --- a/packages/sdk/test/eval-authoring.test.ts +++ b/packages/sdk/test/eval-authoring.test.ts @@ -113,14 +113,14 @@ describe('YAML-aligned eval authoring helpers', () => { }, }, }, - repeat: { - count: 3, - strategy: 'pass_any', - early_exit: false, - }, timeout_seconds: 600, threshold: 0.8, evaluate_options: { + repeat: { + count: 3, + strategy: 'pass_any', + early_exit: false, + }, budget_usd: 1.5, }, assertions: [ diff --git a/skills-data/agentv-eval-writer/SKILL.md b/skills-data/agentv-eval-writer/SKILL.md index 7449431d2..e5938542d 100644 --- a/skills-data/agentv-eval-writer/SKILL.md +++ b/skills-data/agentv-eval-writer/SKILL.md @@ -19,7 +19,7 @@ Comprehensive docs: https://agentv.dev Treat YAML as the canonical portable model. 
Prefer authoring `.eval.yaml` / `EVAL.yaml` first, then use TypeScript helpers, Python scripts, or executable graders only when they lower to the same fields or when the evaluation logic must actually run code. Eval files define what is tested and how it runs: prompts, datasets, assertions, -task fixtures, top-level `target`, and top-level run controls. Use `imports.suites` +task fixtures, top-level `target`, and suite run controls. Use `imports.suites` for full child suites that preserve their workspace, shared input, assertions, fixtures, and graders. Use `imports.tests` for raw case rows that should run in the parent file's context. Inline `tests` are also parent-owned raw cases. @@ -30,7 +30,8 @@ entries still load with a migration warning, but new evals should use individual tests only for `threshold`, `repeat`, `timeout_seconds`, and legacy `budget_usd`; keep target selection at top-level `target` or CLI `--target`, put suite budget caps under `evaluate_options.budget_usd`, authored concurrency -under `evaluate_options.max_concurrency`, +under `evaluate_options.max_concurrency`, suite repeat policy under +`evaluate_options.repeat`, and keep setup and workspace mutation under `workspace`. Use `@agentv/sdk` for TypeScript helper imports. Do not use `@agentv/eval` for new evals, examples, scaffolds, or skill guidance; it was a deprecated compatibility package and has been removed from this repository. 
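Reviewer note: the skill text above places the suite repeat policy under `evaluate_options.repeat`, and the `packages/sdk/src/eval.ts` hunk lowers a helper-level `repeat` into that location. The sketch below restates that lowering as a standalone function for illustration only. The name `lowerRuntimeOptions`, the `Dict` alias, and the object check on `evaluate_options` are assumptions, not the SDK's actual exports; the diff elides the real condition.

```typescript
// Hypothetical standalone restatement of the lowering step shown in the diff.
type Dict = Record<string, unknown>;

function lowerRuntimeOptions(lowered: Dict): Dict {
  // Pull the helper-level runtime fields off the top level.
  const { budget_usd: budgetUsd, repeat, ...rest } = lowered;
  if (budgetUsd === undefined && repeat === undefined) {
    // Nothing to move, so return the input unchanged.
    return lowered;
  }

  // Assumption: copy evaluate_options only when it is a plain object.
  const existing = rest.evaluate_options;
  const evaluateOptions: Dict =
    typeof existing === 'object' && existing !== null && !Array.isArray(existing)
      ? { ...(existing as Dict) }
      : {};

  // Values already set under evaluate_options take precedence.
  if (budgetUsd !== undefined && evaluateOptions.budget_usd === undefined) {
    evaluateOptions.budget_usd = budgetUsd;
  }
  if (repeat !== undefined && evaluateOptions.repeat === undefined) {
    evaluateOptions.repeat = repeat;
  }

  return { ...rest, evaluate_options: evaluateOptions };
}
```

Under these assumptions, a definition with top-level `repeat: { count: 3 }` comes out with no top-level `repeat` and with `evaluate_options.repeat.count` equal to 3, while an explicitly authored `evaluate_options.repeat` is left untouched.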
@@ -126,7 +127,7 @@ tests: ## Eval File Structure **Required:** `tests` (array or string raw-case path) or `imports` -**Optional:** `name`, `description`, `experiment`, `version`, `author`, `tags`, `license`, `requires`, `target`, `repeat`, `timeout_seconds`, `evaluate_options`, `threshold`, `suite`, `workspace`, `assertions`, `input` +**Optional:** `name`, `description`, `experiment`, `version`, `author`, `tags`, `license`, `requires`, `target`, `timeout_seconds`, `evaluate_options`, `threshold`, `suite`, `workspace`, `assertions`, `input` **Test fields:** @@ -572,13 +573,14 @@ See `references/rubric-grader.md` for score-range mode and scoring formula. ## Suite-Level Quality Threshold -Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates. Use top-level `repeat` when each case should be attempted more than once. +Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates. Use `evaluate_options.repeat` when each case should be attempted more than once. ```yaml -repeat: - count: 3 - strategy: pass_any - early_exit: false +evaluate_options: + repeat: + count: 3 + strategy: pass_any + early_exit: false threshold: 0.8 ``` @@ -644,6 +646,7 @@ import { defineEval, graders } from '@agentv/sdk'; export default defineEval({ name: 'helper-suite', target: 'default', + // The SDK helper lowers this to evaluate_options.repeat in generated YAML. 
repeat: { count: 3, strategy: 'pass_any', diff --git a/skills-data/agentv-eval-writer/references/eval.schema.json b/skills-data/agentv-eval-writer/references/eval.schema.json index 7f0511f1c..96e052005 100644 --- a/skills-data/agentv-eval-writer/references/eval.schema.json +++ b/skills-data/agentv-eval-writer/references/eval.schema.json @@ -11314,26 +11314,7 @@ "minLength": 1 }, "repeat": { - "type": "object", - "properties": { - "count": { - "type": "integer", - "minimum": 1 - }, - "strategy": { - "type": "string", - "enum": ["pass_any", "pass_all", "mean", "confidence_interval"] - }, - "early_exit": { - "type": "boolean" - }, - "cost_limit_usd": { - "type": "number", - "minimum": 0 - } - }, - "required": ["count"], - "additionalProperties": false + "not": {} }, "runs": { "not": {}