feat(core): load file-backed datasets by christso · Pull Request #1601 · EntityProcess/agentv

christso · 2026-07-02T11:15:35Z

Summary

AgentV eval suites can now load raw tests from file:// CSV, JSON, JSONL, YAML, JavaScript, and Python dataset sources without replacing imports.tests or select composition. Promptfoo-style CSV rows now become AgentV cases: expected columns create assertions, provider output remains first-class expected_output, metadata/config columns map onto case metadata and grader thresholds, and __metric survives as assertion names in result scores.
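For readers unfamiliar with the promptfoo CSV convention, here is a minimal sketch of how one row could be split into case fields, template vars, and assertion strings. It is not the code in case-file-loader.ts: `splitRow` and its output shape are hypothetical, and only the column names that appear in this PR (`id`, `input`, `__expected`, `__expectedN`, `__metric`) are handled.

```typescript
// Hypothetical row splitter; names and shapes are illustrative assumptions.
type CsvRow = Record<string, string>;

type SplitRow = {
  id?: string;
  input?: string;
  vars: Record<string, string>; // ordinary columns such as "topic"
  expected: string[];           // raw __expected / __expectedN strings
  metric?: string;              // value of __metric, if present
};

function splitRow(row: CsvRow): SplitRow {
  const out: SplitRow = { vars: {}, expected: [] };
  for (const [column, value] of Object.entries(row)) {
    if (value === "") continue; // empty cells carry no information
    if (column === "id") out.id = value;
    else if (column === "input") out.input = value;
    else if (/^__expected\d*$/.test(column)) out.expected.push(value);
    else if (column === "__metric") out.metric = value;
    else if (!column.startsWith("__")) out.vars[column] = value;
    // Other reserved "__" columns are ignored in this sketch.
  }
  return out;
}
```

Each string collected in `expected` would still need to be parsed into a grader config before the case can run.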

The implementation keeps dataset script execution explicit and bounded: JavaScript is loaded by file URL, Python runs through uv run python, and both paths must return JSON arrays of case objects.
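As an illustration of the "loaded by file URL" path, the sketch below converts a filesystem path with `pathToFileURL` and imports it dynamically, then checks that the result is an array of objects. The function names and the assumption that the dataset module uses a default export (either an array or a function returning one) are hypothetical; the actual loader may resolve exports differently.

```typescript
import { pathToFileURL } from "node:url";

type DatasetCase = Record<string, unknown>;

// Reject anything that is not an array of plain case objects.
function toCaseArray(value: unknown, source: string): DatasetCase[] {
  if (!Array.isArray(value)) {
    throw new Error(`${source}: dataset must return an array of case objects`);
  }
  value.forEach((item, index) => {
    if (typeof item !== "object" || item === null || Array.isArray(item)) {
      throw new Error(`${source}: entry ${index} is not a case object`);
    }
  });
  return value as DatasetCase[];
}

// Assumption: the dataset module default-exports an array or a function.
async function loadJsDataset(filePath: string): Promise<DatasetCase[]> {
  const moduleUrl = pathToFileURL(filePath).href; // e.g. file:///abs/cases.mjs
  const loaded = await import(moduleUrl);
  const exported: unknown = loaded.default;
  const value = typeof exported === "function" ? await exported() : exported;
  return toCaseArray(value, filePath);
}
```

Keeping the shape check in its own function means the same validation can be applied to the JSON emitted by a Python dataset script.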

Validation

bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts
bun run --cwd packages/core typecheck
bun run --cwd packages/core lint
bun run --cwd packages/core build
bun run lint
bun run typecheck
bun apps/cli/src/cli.ts validate examples/features/external-datasets/evals/dataset.eval.yaml

Live provider dogfood was not run because this slice changes dataset parsing, validation, docs, and examples only; it does not change provider execution, graders, scoring runtime, or run artifact layout.

Deploying agentv with Cloudflare Pages

Latest commit:	`2d66a22`
Status:	✅ Deploy successful!
Preview URL:	https://4bc932b0.agentv.pages.dev
Branch Preview URL:	https://feat-av-kfik-9-datasets.agentv.pages.dev

christso · 2026-07-02T11:21:59Z

CI follow-up pushed in 079e808. The Test job failed because the GitHub runner did not have uv on PATH, so the Python dataset fixture failed before exercising the dataset loader. The fix keeps uv run python as the preferred runner and falls back to python3/python only when the executable is missing.

Local validation after the fix:
- hidden-uv reproduction: env PATH=/usr/bin:/bin $(command -v bun) test packages/core/test/evaluation/loaders/case-file-loader.test.ts -t 'loads tests from explicit Python function dataset files'
- bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts
- bun run --cwd packages/core lint
- bun run --cwd packages/core typecheck
- bun --filter @agentv/core test (2123 pass, 0 fail)
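The fallback rule in this comment (prefer uv run python, try python3 and then python only when the executable is missing) can be sketched roughly as follows. This is not the committed code: `runPythonDataset` and the injectable `SpawnFn` are hypothetical, and the sketch assumes the script path is passed as an argument and the JSON array is printed to stdout.

```typescript
import { spawnSync } from "node:child_process";

type SpawnResult = {
  stdout: string;
  status: number | null;
  error?: Error & { code?: string };
};
type SpawnFn = (command: string, args: string[]) => SpawnResult;

// Default spawner; tests can inject a fake instead.
const defaultSpawn: SpawnFn = (command, args) => {
  const result = spawnSync(command, args, { encoding: "utf8" });
  return {
    stdout: result.stdout ?? "",
    status: result.status,
    error: result.error as (Error & { code?: string }) | undefined,
  };
};

function runPythonDataset(scriptPath: string, spawn: SpawnFn = defaultSpawn): unknown[] {
  const runners = [
    { command: "uv", args: ["run", "python"] }, // preferred runner
    { command: "python3", args: [] as string[] },
    { command: "python", args: [] as string[] },
  ];
  for (const runner of runners) {
    const result = spawn(runner.command, [...runner.args, scriptPath]);
    // ENOENT means the executable is not on PATH: try the next runner.
    if (result.error?.code === "ENOENT") continue;
    // Any other failure is a real error and must not trigger a fallback.
    if (result.error) throw result.error;
    if (result.status !== 0) {
      throw new Error(`${runner.command} exited with status ${result.status}`);
    }
    const parsed: unknown = JSON.parse(result.stdout);
    if (!Array.isArray(parsed)) {
      throw new Error("dataset script must return a JSON array of case objects");
    }
    return parsed;
  }
  throw new Error("no Python runner found (tried uv, python3, python)");
}
```

Falling back only on a missing executable keeps a genuine script failure under uv from being masked by a second attempt with a different interpreter.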

christso · 2026-07-02T13:47:18Z

Independent review: PR #1601

Findings:

P1 - CSV rows that rely on parent suite input are silently dropped.

parseCsvCases() only emits an input field when the CSV has an input column (packages/core/src/evaluation/loaders/case-file-loader.ts:377). Later, loadTests() checks completeness before it prepends suite-level input, and it requires testInputMessages from the row itself (packages/core/src/evaluation/yaml-parser.ts:667, packages/core/src/evaluation/yaml-parser.ts:677). A promptfoo-style dataset with ordinary variable columns plus __expected, used under a parent suite prompt, validates successfully but loads zero tests.

I reproduced this with:

input: Answer about {{ topic }}
tests: file://cases.csv

id,topic,__expected
case,refund,contains:refund

validateEvalFile() returns valid, but loadTests() returns [] and logs Skipping incomplete test: case. That breaks the documented raw-case composition contract that parent suite context applies and makes common promptfoo CSV datasets unusable unless every row also duplicates an input column. The completeness check should account for effective suite input before skipping, or validation should reject this shape.
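One possible shape for the suggested fix, sketched under assumptions: compute the effective input (suite input prepended to any row input) first, and only skip a row when that effective input is empty. `resolveCases`, `renderTemplate`, and the types are hypothetical and do not mirror the real loadTests() signature; the `{{ name }}` substitution is a simplification of whatever templating the suite actually applies.

```typescript
// Hypothetical types; the real loader works with richer case objects.
type RawCase = { id: string; input?: string; vars?: Record<string, string> };
type RunnableCase = { id: string; input: string };

// Replace {{ name }} placeholders; unknown names are left untouched.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (match: string, name: string) => vars[name] ?? match,
  );
}

function resolveCases(suiteInput: string | undefined, rows: RawCase[]): RunnableCase[] {
  const runnable: RunnableCase[] = [];
  for (const row of rows) {
    // Suite-level input comes first, then the row's own input if any.
    const parts = [suiteInput, row.input].filter(
      (part): part is string => typeof part === "string" && part.trim() !== "",
    );
    if (parts.length === 0) {
      // Only now is the case genuinely incomplete.
      console.warn(`Skipping incomplete test: ${row.id}`);
      continue;
    }
    const vars = row.vars ?? {};
    runnable.push({
      id: row.id,
      input: parts.map((part) => renderTemplate(part, vars)).join("\n"),
    });
  }
  return runnable;
}
```

With the reproduction above, the row `case,refund,contains:refund` would resolve to the input "Answer about refund" instead of being skipped.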

P1 - Several promptfoo __expected mini-DSL values generate assertions that AgentV cannot run.

The new CSV parser advertises promptfoo-compatible assertionFromString, but the generated AgentV configs do not match AgentV's grader parser/registry for some promptfoo-supported values. For example, latency(1000) and cost(0.01) become { type, min_score }, while parseGraders() requires threshold for latency and budget for cost (packages/core/src/evaluation/loaders/grader-parser.ts:1087, packages/core/src/evaluation/loaders/grader-parser.ts:1116), so they are skipped. similar:hello is emitted as type: similar, but similar is not a built-in registered grader (packages/core/src/evaluation/registry/builtin-graders.ts:405), so runtime grading fails with unknown grader type unless a user happens to define a custom assertion named similar. file://grader.py is parsed as type: file rather than a Python/code grader because the regex consumes file://... as a generic file: assertion (packages/core/src/evaluation/loaders/case-file-loader.ts:237).

A parser probe using:

id,input,__expected,__expected2,__expected3,__expected4
case,hello,similar:hello,latency(1000),cost(0.01),file://grader.py

loaded assertions only for similar and file; latency and cost were skipped with warnings, while validateEvalFile() still returned valid. This is contract drift from the Bead's assertionFromString requirement and the promptfoo reference parser. The fix should either map these DSL forms to runnable AgentV graders or explicitly narrow the documented/supported DSL and warn/error during validation.
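To make the first option concrete, here is a rough sketch of a mapper that emits the fields parseGraders() is described as requiring (threshold for latency, budget for cost), checks file:// before the generic type:value form, and rejects unsupported types at load time. `assertionFromExpected`, the `code-grader` type name, and the `value`/`path` fields are placeholders for illustration, not AgentV's actual grader schema.

```typescript
// Hypothetical assertion shapes; only threshold/budget come from the review.
type Assertion =
  | { type: "latency"; threshold: number }
  | { type: "cost"; budget: number }
  | { type: "code-grader"; path: string } // placeholder type name
  | { type: "contains"; value: string };

function assertionFromExpected(expected: string): Assertion {
  const text = expected.trim();

  // latency(1000) and cost(0.01): function-call forms with a numeric limit.
  const call = /^(latency|cost)\((\d+(?:\.\d+)?)\)$/.exec(text);
  if (call) {
    const limit = Number(call[2]);
    return call[1] === "latency"
      ? { type: "latency", threshold: limit }
      : { type: "cost", budget: limit };
  }

  // file://grader.py must be matched before the generic "type:value" form,
  // otherwise "file" is consumed as an assertion type.
  if (/^file:\/\/.+\.py$/.test(text)) {
    return { type: "code-grader", path: text.slice("file://".length) };
  }

  const typed = /^([a-z][a-z-]*):(.*)$/.exec(text);
  if (typed && typed[1] === "contains") {
    return { type: "contains", value: typed[2] ?? "" };
  }

  // Unsupported forms such as similar:hello fail here, not at grading time.
  throw new Error(`Unsupported __expected value: ${text}`);
}
```

Running the same function from validateEvalFile() would keep validation and runtime in agreement about which DSL forms are accepted.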

Verification:

bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts passes locally after bun install in this review worktree: 124 pass, 0 fail.
gh pr checks 1601 --repo EntityProcess/agentv shows Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages passing on head 079e808a.

Residual risk: live provider dogfood was not run for this review; the findings above are parser/validation contract issues reproduced through local parser probes.

christso · 2026-07-02T14:28:52Z

Addressed the two review blockers in commit 1067596 (fix(core): align CSV dataset validation with runtime).

Validation run locally:
- bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts -> 129 pass, 0 fail
- bun --filter @agentv/core typecheck
- bun --filter @agentv/core build
- bun run lint
- bun run validate:examples -> 62 valid, 0 invalid
- bun --filter @agentv/core test -> 2128 pass, 0 fail

Notes:
- CSV rows with vars plus parent suite input now load as runnable tests.
- CSV __expected latency/cost/file://.py forms now map to runnable AgentV assertions; unsupported forms such as similar: now fail validation/load clearly instead of drifting to runtime skips/failures.

christso · 2026-07-02T14:39:42Z

Rebased and pushed the review-blocker fix in aacf6278 (fix(core): align CSV dataset validation with runtime).

Addressed:

CSV rows that rely on parent suite input now validate and load as runnable tests instead of being skipped at runtime.
Promptfoo-style __expected mini-DSL forms are aligned with AgentV runtime support: latency(...), cost(...), and file://*.py generate runnable graders; unsupported typed forms such as similar:* are rejected during load/validation with a clear error.

Validation on the rebased branch:

bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts -> 129 pass, 0 fail
bun --filter @agentv/core typecheck -> pass
bun --filter @agentv/core build -> pass
bun run lint -> pass
bun run validate:examples -> 62 valid, 0 invalid
bun --filter @agentv/core test -> 2161 pass, 0 fail

christso · 2026-07-02T14:44:04Z

Updated the pushed fix to 2d66a227 after reproducing the failed CI Test job locally.

Additional CI blocker fix:

apps/cli/test/commands/prepare/prepare.test.ts had a stale target fixture that still used the removed name field; it now follows the current label target schema.

Validation on the updated branch:

bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts -> 129 pass, 0 fail
bun --filter @agentv/core typecheck -> pass
bun --filter @agentv/core build -> pass
bun run lint -> pass
bun run validate:examples -> 62 valid, 0 invalid
bun --filter @agentv/core test -> 2161 pass, 0 fail
bun test apps/cli/test/commands/prepare/prepare.test.ts -t "remaps prepared extension context paths into the output workspace" -> 1 pass, 0 fail
bun --filter agentv test -> 747 pass, 0 fail

christso added 2 commits July 2, 2026 16:29

feat(core): load external dataset files

aaa88b5

fix(core): tolerate missing uv for python datasets

8c75072

christso force-pushed the feat/av-kfik-9-datasets branch from 1067596 to aacf627 July 2, 2026 14:37

fix(core): align CSV dataset validation with runtime

2d66a22

christso force-pushed the feat/av-kfik-9-datasets branch from aacf627 to 2d66a22 July 2, 2026 14:43

christso marked this pull request as ready for review July 2, 2026 14:46

christso merged commit 7a3bdaa into main Jul 2, 2026
8 checks passed

christso deleted the feat/av-kfik-9-datasets branch July 2, 2026 14:46

christso merged 3 commits into main from feat/av-kfik-9-datasets
