feat(core): load file-backed datasets #1601

Merged
christso merged 3 commits into main from feat/av-kfik-9-datasets on Jul 2, 2026

christso commented Jul 2, 2026

Collaborator

Summary

AgentV eval suites can now load raw tests from file:// CSV, JSON, JSONL, YAML, JavaScript, and Python dataset sources without replacing imports.tests or select composition. Promptfoo-style CSV rows now become AgentV cases: expected columns create assertions, provider output remains first-class expected_output, metadata/config columns map onto case metadata and grader thresholds, and __metric survives as assertion names in result scores.
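To make the row-to-case mapping concrete, here is a minimal sketch of how a promptfoo-style CSV row might be split into assertions and variables. The `SketchCase` shape and the `rowToCase` helper are hypothetical and exist only for illustration; they are not AgentV's actual types or loader, and the real loader also handles expected_output and metadata/config columns, which this sketch omits.

```typescript
// Hypothetical shapes for illustration only; not AgentV's real types.
type CsvRow = Record<string, string>;

interface SketchAssertion {
  raw: string;   // the unparsed __expected value, e.g. "contains:refund"
  name?: string; // carried over from __metric when present
}

interface SketchCase {
  id?: string;
  input?: string;
  vars: Record<string, string>;
  assertions: SketchAssertion[];
}

function rowToCase(row: CsvRow): SketchCase {
  const result: SketchCase = { vars: {}, assertions: [] };
  const metric = row["__metric"];

  for (const [column, value] of Object.entries(row)) {
    if (value === "") continue; // empty cells contribute nothing

    if (/^__expected\d*$/.test(column)) {
      // __expected, __expected2, ... each become one assertion.
      result.assertions.push(metric ? { raw: value, name: metric } : { raw: value });
    } else if (column === "id") {
      result.id = value;
    } else if (column === "input") {
      result.input = value;
    } else if (!column.startsWith("__")) {
      // Ordinary columns are treated as template variables.
      result.vars[column] = value;
    }
  }
  return result;
}
```

In this sketch, other double-underscore columns are ignored rather than mapped, which is an assumption made to keep the example short.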

The implementation keeps dataset script execution explicit and bounded: JavaScript is loaded by file URL, Python runs through uv run python, and both paths must return JSON arrays of case objects.
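The "must return JSON arrays of case objects" contract can be sketched as a small validation step applied to whatever a dataset script prints. `parseDatasetOutput` is a hypothetical helper name, and the error messages are invented for this example; the actual loader may structure this differently.

```typescript
// Hypothetical helper: validates that a dataset script's output is a JSON
// array whose items are all plain objects. Not AgentV's real implementation.
function parseDatasetOutput(stdout: string): Record<string, unknown>[] {
  const parsed: unknown = JSON.parse(stdout);

  if (!Array.isArray(parsed)) {
    throw new Error("Dataset script must return a JSON array of case objects");
  }

  return parsed.map((item: unknown, index: number) => {
    // Reject null, primitives, and nested arrays.
    if (item === null || typeof item !== "object" || Array.isArray(item)) {
      throw new Error(`Dataset item ${index} is not a case object`);
    }
    return item as Record<string, unknown>;
  });
}
```

Rejecting a bare object or a scalar up front keeps a malformed script from being mistaken for an empty dataset.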

Validation

  • bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts
  • bun run --cwd packages/core typecheck
  • bun run --cwd packages/core lint
  • bun run --cwd packages/core build
  • bun run lint
  • bun run typecheck
  • bun apps/cli/src/cli.ts validate examples/features/external-datasets/evals/dataset.eval.yaml

Live provider dogfood was not run because this slice changes dataset parsing, validation, docs, and examples only; it does not change provider execution, graders, scoring runtime, or run artifact layout.

Related

Related: av-kfik.9


Compound Engineering
GPT-5

cloudflare-workers-and-pages Bot commented Jul 2, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 2d66a22
Status: ✅  Deploy successful!
Preview URL: https://4bc932b0.agentv.pages.dev
Branch Preview URL: https://feat-av-kfik-9-datasets.agentv.pages.dev

christso commented Jul 2, 2026

Collaborator Author

CI follow-up pushed in 079e808. The Test job failed because the GitHub runner did not have uv on PATH, so the Python dataset fixture failed before exercising the dataset loader. The fix keeps uv run python as the preferred runner and falls back to python3/python only when the executable is missing.

Local validation after the fix:

  • hidden-uv reproduction: env PATH=/usr/bin:/bin $(command -v bun) test packages/core/test/evaluation/loaders/case-file-loader.test.ts -t 'loads tests from explicit Python function dataset files'
  • bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts
  • bun run --cwd packages/core lint
  • bun run --cwd packages/core typecheck
  • bun --filter @agentv/core test (2123 pass, 0 fail)
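For readers following the fallback rule, a minimal sketch of "prefer uv run python, fall back only when the executable is missing" is below. The `Runner` callback, the `RunResult` shape, and `runPythonDataset` are assumptions made for illustration; the process spawning is injected so the sketch stays self-contained, and the real loader's structure may differ.

```typescript
// Hypothetical result type: `missing` distinguishes "executable not found"
// from a script that ran and failed.
type RunResult =
  | { ok: true; stdout: string }
  | { ok: false; missing: boolean; error: string };

// Injected so the sketch does not depend on a particular process API.
type Runner = (command: string, args: string[]) => RunResult;

// Preferred runner first, then plain interpreters as fallbacks.
const PYTHON_CANDIDATES: ReadonlyArray<readonly [string, readonly string[]]> = [
  ["uv", ["run", "python"]],
  ["python3", []],
  ["python", []],
];

function runPythonDataset(script: string, run: Runner): string {
  for (const [command, prefix] of PYTHON_CANDIDATES) {
    const result = run(command, [...prefix, script]);
    if (result.ok) return result.stdout;

    // Only a missing executable triggers the fallback. A real script failure
    // is surfaced immediately instead of being masked by the next candidate.
    if (!result.missing) {
      throw new Error(`${command} failed: ${result.error}`);
    }
  }
  throw new Error("No Python runner found (tried uv, python3, python)");
}
```

Keeping the fallback conditional on a missing executable means a dataset script with a genuine bug still fails under uv rather than being silently retried with a different interpreter.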

christso commented Jul 2, 2026

Collaborator Author

Independent review: PR #1601

Findings:

  1. P1 - CSV rows that rely on parent suite input are silently dropped.

parseCsvCases() only emits an input field when the CSV has an input column (packages/core/src/evaluation/loaders/case-file-loader.ts:377). Later, loadTests() checks completeness before it prepends suite-level input, and it requires testInputMessages from the row itself (packages/core/src/evaluation/yaml-parser.ts:667, packages/core/src/evaluation/yaml-parser.ts:677). A promptfoo-style dataset with ordinary variable columns plus __expected, used under a parent suite prompt, validates successfully but loads zero tests.

I reproduced this with:

input: Answer about {{ topic }}
tests: file://cases.csv

id,topic,__expected
case,refund,contains:refund

validateEvalFile() returns valid, but loadTests() returns [] and logs Skipping incomplete test: case. That breaks the documented raw-case composition contract that parent suite context applies and makes common promptfoo CSV datasets unusable unless every row also duplicates an input column. The completeness check should account for effective suite input before skipping, or validation should reject this shape.
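One way to express "account for effective suite input before skipping" is sketched below. `RawCase`, `effectiveInputs`, and `isRunnable` are hypothetical names, and the `{{ name }}` substitution is a simplified stand-in for whatever templating the loader really uses.

```typescript
// Hypothetical row shape for illustration only.
interface RawCase {
  id: string;
  input?: string;
  vars: Record<string, string>;
}

// Combine suite-level input (first) with row-level input, rendering
// {{ name }} placeholders from the row's variables. Unknown placeholders
// are left untouched.
function effectiveInputs(suiteInput: string | undefined, row: RawCase): string[] {
  const render = (template: string): string =>
    template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) => row.vars[name] ?? match);

  return [suiteInput, row.input]
    .filter((part): part is string => part !== undefined && part !== "")
    .map(render);
}

// Completeness is judged on the combined input, so a row with only
// variable columns is still runnable under a parent suite prompt.
function isRunnable(suiteInput: string | undefined, row: RawCase): boolean {
  return effectiveInputs(suiteInput, row).length > 0;
}
```

With the reproduction above, the row has no input column of its own, yet the suite prompt alone makes it runnable in this sketch; without any suite input the same row would still be reported as incomplete.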

  2. P1 - Several promptfoo __expected mini-DSL values generate assertions that AgentV cannot run.

The new CSV parser advertises promptfoo-compatible assertionFromString, but the generated AgentV configs do not match AgentV's grader parser/registry for some promptfoo-supported values. For example, latency(1000) and cost(0.01) become { type, min_score }, while parseGraders() requires threshold for latency and budget for cost (packages/core/src/evaluation/loaders/grader-parser.ts:1087, packages/core/src/evaluation/loaders/grader-parser.ts:1116), so they are skipped. similar:hello is emitted as type: similar, but similar is not a built-in registered grader (packages/core/src/evaluation/registry/builtin-graders.ts:405), so runtime grading fails with unknown grader type unless a user happens to define a custom assertion named similar. file://grader.py is parsed as type: file rather than a Python/code grader because the regex consumes file://... as a generic file: assertion (packages/core/src/evaluation/loaders/case-file-loader.ts:237).

A parser probe using:

id,input,__expected,__expected2,__expected3,__expected4
case,hello,similar:hello,latency(1000),cost(0.01),file://grader.py

loaded assertions only for similar and file; latency and cost were skipped with warnings, while validateEvalFile() still returned valid. This is contract drift from the Bead's assertionFromString requirement and the promptfoo reference parser. The fix should either map these DSL forms to runnable AgentV graders or explicitly narrow the documented/supported DSL and warn/error during validation.
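As a rough illustration of the first option (map the DSL forms to runnable graders and reject the rest), a parser could check file:// before the generic type:value pattern and emit the field names the grader parser expects. `sketchAssertionFromString` and the `SketchGrader` union are hypothetical; in particular, the `code` grader shape with a `path` field and the `contains` shape with a `value` field are assumptions, and only the `threshold` and `budget` names come from the finding above.

```typescript
// Hypothetical grader configs for illustration only.
type SketchGrader =
  | { type: "latency"; threshold: number }
  | { type: "cost"; budget: number }
  | { type: "code"; path: string }
  | { type: "contains"; value: string };

function sketchAssertionFromString(raw: string): SketchGrader {
  const text = raw.trim();

  // Check file:// first so it is not consumed by the generic type:value rule.
  if (text.startsWith("file://")) {
    const path = text.slice("file://".length);
    if (!path.endsWith(".py")) {
      throw new Error(`Unsupported file assertion: ${text}`);
    }
    return { type: "code", path };
  }

  // latency(1000) and cost(0.01) map to the fields the grader parser requires.
  const call = /^(latency|cost)\(([\d.]+)\)$/.exec(text);
  if (call) {
    const amount = Number(call[2]);
    if (!Number.isFinite(amount)) {
      throw new Error(`Invalid number in assertion: ${text}`);
    }
    return call[1] === "latency"
      ? { type: "latency", threshold: amount }
      : { type: "cost", budget: amount };
  }

  // Typed forms: only types known to be runnable are accepted.
  const typed = /^([a-z-]+):(.*)$/.exec(text);
  if (typed) {
    if (typed[1] === "contains") {
      return { type: "contains", value: typed[2] ?? "" };
    }
    throw new Error(`Unsupported assertion type "${typed[1]}" in "${text}"`);
  }

  // Anything else is rejected in this sketch rather than guessed at.
  throw new Error(`Unrecognized assertion: ${text}`);
}
```

Failing at parse time is what lets validation and load agree: a value such as similar:hello raises here instead of surfacing later as an unknown grader type.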

Verification:

  • bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts passes locally after bun install in this review worktree: 124 pass, 0 fail.
  • gh pr checks 1601 --repo EntityProcess/agentv shows Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages passing on head 079e808a.

Residual risk: live provider dogfood was not run for this review; the findings above are parser/validation contract issues reproduced through local parser probes.

christso commented Jul 2, 2026

Collaborator Author

Addressed the two review blockers in commit 1067596 (fix(core): align CSV dataset validation with runtime).

Validation run locally:

  • bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts -> 129 pass, 0 fail
  • bun --filter @agentv/core typecheck
  • bun --filter @agentv/core build
  • bun run lint
  • bun run validate:examples -> 62 valid, 0 invalid
  • bun --filter @agentv/core test -> 2128 pass, 0 fail

Notes:

  • CSV rows with vars plus parent suite input now load as runnable tests.
  • CSV __expected latency/cost/file://.py forms now map to runnable AgentV assertions; unsupported forms such as similar: now fail validation/load clearly instead of drifting to runtime skips/failures.

christso force-pushed the feat/av-kfik-9-datasets branch from 1067596 to aacf627 on July 2, 2026 14:37

christso commented Jul 2, 2026

Collaborator Author

Rebased and pushed the review-blocker fix in aacf6278 (fix(core): align CSV dataset validation with runtime).

Addressed:

  • CSV rows that rely on parent suite input now validate and load as runnable tests instead of being skipped at runtime.
  • Promptfoo-style __expected mini-DSL forms are aligned with AgentV runtime support: latency(...), cost(...), and file://*.py generate runnable graders; unsupported typed forms such as similar:* are rejected during load/validation with a clear error.

Validation on the rebased branch:

  • bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts -> 129 pass, 0 fail
  • bun --filter @agentv/core typecheck -> pass
  • bun --filter @agentv/core build -> pass
  • bun run lint -> pass
  • bun run validate:examples -> 62 valid, 0 invalid
  • bun --filter @agentv/core test -> 2161 pass, 0 fail

christso force-pushed the feat/av-kfik-9-datasets branch from aacf627 to 2d66a22 on July 2, 2026 14:43

christso commented Jul 2, 2026

Collaborator Author

Updated the pushed fix to 2d66a227 after reproducing the failed CI Test job locally.

Additional CI blocker fix:

  • apps/cli/test/commands/prepare/prepare.test.ts had a stale target fixture that still used the removed name field; updated it to the current label target schema.

Validation on the updated branch:

  • bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts -> 129 pass, 0 fail
  • bun --filter @agentv/core typecheck -> pass
  • bun --filter @agentv/core build -> pass
  • bun run lint -> pass
  • bun run validate:examples -> 62 valid, 0 invalid
  • bun --filter @agentv/core test -> 2161 pass, 0 fail
  • bun test apps/cli/test/commands/prepare/prepare.test.ts -t "remaps prepared extension context paths into the output workspace" -> 1 pass, 0 fail
  • bun --filter agentv test -> 747 pass, 0 fail

christso marked this pull request as ready for review on July 2, 2026 14:46
christso merged commit 7a3bdaa into main on Jul 2, 2026
8 checks passed
christso deleted the feat/av-kfik-9-datasets branch on July 2, 2026 14:46