[codex] Add promptfoo assertion grader surface by christso · Pull Request #1599 · EntityProcess/agentv

christso · 2026-07-02T10:19:50Z

Summary

Implements Bead av-kfik.7 assertion/grader vocabulary work:

- Makes assert canonical across YAML/JSONL parsing, with precedence over assertions and legacy evaluator fields.
- Adds promptfoo-compatible runtime assertion types for javascript, python, webhook, similar, and assert-set, while keeping deterministic types aligned.
- Routes bare-string assertions and structured rubrics to grouped g-eval, preserving rubric_item/rubrics, score_ranges, min_score, operator, weight, required-like gating, and per-criterion artifact rows.
- Keeps llm-rubric as the free-form rubric-text LLM judge.
- Preserves expected_output as first-class case data and verifies that structured script stdin preserves expected_output, input, and config.
- Uses script as the canonical subprocess grader type, preserving the former code-grader structured stdin/config behavior for migration.
- Explicitly rejects known unimplemented promptfoo exotic assertion types instead of accepting them silently.
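
For readers skimming the summary, the precedence and bare-string routing described above can be sketched roughly as follows. This is an illustration only: the type names, the `evaluators` field standing in for "legacy evaluator fields", the `criteria` key, and the first-defined-wins rule are assumptions, not the actual agentv parser.

```typescript
// Hypothetical shapes; the real parser types are not shown in this PR.
type AssertionObject = { type: string; [key: string]: unknown };
type RawAssertion = string | AssertionObject;

interface RawCase {
  assert?: RawAssertion[];
  assertions?: RawAssertion[];
  evaluators?: RawAssertion[]; // assumed stand-in for the legacy evaluator fields
}

// Precedence: assert > assertions > legacy. The first defined field wins
// (assumption: fields are not merged).
function resolveAssertions(raw: RawCase): RawAssertion[] {
  return raw.assert ?? raw.assertions ?? raw.evaluators ?? [];
}

// Bare strings are routed to g-eval; objects keep their declared type.
// The `criteria` key is a placeholder name for the rubric text.
function normalize(entry: RawAssertion): AssertionObject {
  return typeof entry === "string" ? { type: "g-eval", criteria: entry } : entry;
}

const resolved = resolveAssertions({
  assert: ["Answer cites the source"],
  assertions: [{ type: "contains", value: "ignored" }],
}).map(normalize);
```

Here `resolved` holds a single g-eval entry, because `assert` shadows `assertions` under the assumed rule.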

Validation

bun --filter @agentv/core generate:schema
bun run build
bun run lint
bun run typecheck
bun run test
- core: 2131 pass
- sdk: 92 pass
- agentv CLI: 745 pass
- dashboard: 145 pass
Follow-up naming correction validation:
- bun test packages/core/test/evaluation/loaders/grader-parser.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/sdk/test/grader-helpers.test.ts
- bun run lint
- bun run typecheck
- rg "script-grader|code-rubric" -n found no matches.
- generated eval schema no longer advertises code-grader or code_grader.
Live local LLM grader smoke via OpenAI-compatible base URL http://127.0.0.1:10531/v1, no API key, model gpt-5.4-mini: mock-target run passed 2/2 for g-eval and llm-rubric; g-eval emitted per-criterion artifact rows in /tmp/agentv-kfik7-smoke.lQrEgt/results/index.jsonl.

Notes

First live smoke attempt with a CLI target was discarded because that target command used the wrong output-file contract; the subsequent mock-target grader smoke passed.

cloudflare-workers-and-pages · 2026-07-02T10:21:12Z

Deploying agentv with Cloudflare Pages

Latest commit:	`5d72784`
Status:	✅ Deploy successful!
Preview URL:	https://ccaa8ae6.agentv.pages.dev
Branch Preview URL:	https://feat-av-kfik-7-graders.agentv.pages.dev


christso · 2026-07-02T10:36:58Z

Pushed CI fix 42ba4de: source traceability now emits canonical script_grader_command/script_grader_cwd for script graders while keeping task-bundle compatibility with legacy code_grader_cwd references. Local validation: bun test packages/core/test/evaluation/source-traceability.test.ts; bun test apps/cli/test/commands/eval/task-bundle.test.ts; git diff --check; bun --filter @agentv/core test (2131 pass); bun run typecheck.

christso · 2026-07-02T10:46:47Z

Pushed follow-up eb6073a after the second Test failure. The CLI pipeline now treats canonical script graders as subprocess sidecars while preserving the existing code_graders/code_grader_results directory contract; grade results preserve sidecar type and default legacy missing type to script. Local validation: targeted pipeline input/e2e/grade tests passed; git diff --check passed; bun run typecheck passed; clean rerun of bun --filter agentv test passed (745 pass, 0 fail).
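
The type-defaulting rule mentioned above ("default legacy missing type to script") might look something like this. It is a sketch under assumptions: `SidecarResult` and `withSidecarType` are hypothetical names, and the real grade-result shape in the CLI pipeline may differ.

```typescript
// Hypothetical result row; only `type` matters for this illustration.
interface SidecarResult {
  name: string;
  type?: string; // legacy rows may omit this
  score: number;
}

// Preserve an explicit sidecar type; fall back to "script" for legacy rows.
function withSidecarType(result: SidecarResult): SidecarResult & { type: string } {
  return { ...result, type: result.type ?? "script" };
}

const legacyRow = withSidecarType({ name: "format-check", score: 1 });
const explicitRow = withSidecarType({ name: "py-check", type: "python", score: 0.5 });
```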

christso · 2026-07-02T10:48:59Z

CI is green for eb6073a: Build, Test, Typecheck, Lint, Check Links, Validate Evals, Validate Marketplace, and Cloudflare Pages all passed.

christso · 2026-07-02T13:48:01Z

Review findings:

[P1] expected_output and test-level criteria still trigger the implicit LLM grader. yaml-parser.ts still treats criteria or expected_output as a complete evaluation spec (packages/core/src/evaluation/yaml-parser.ts:677) and stores test-level criteria on the case (packages/core/src/evaluation/yaml-parser.ts:854); the schema still exposes criteria (packages/core/src/evaluation/validation/eval-file.schema.ts:526). Then any case without explicit assertions falls through to llm-grader (packages/core/src/evaluation/orchestrator.ts:2686) and immediately evaluates it (packages/core/src/evaluation/orchestrator.ts:2696). This conflicts with the av-kfik.7 decision to remove test-level criteria from the new YAML contract and keep expected_output passive first-class reference data. A case containing only expected_output should not start a live/default LLM grading call, and criteria text should live under explicit assert entries such as g-eval.
[P2] Script-style assertions pass numeric zero when no threshold is set. normalizeScriptResult defaults threshold = 0 (packages/core/src/evaluation/graders/promptfoo-assertions.ts:44), so numeric results pass on score >= threshold (packages/core/src/evaluation/graders/promptfoo-assertions.ts:57) and object results without pass use the same default (packages/core/src/evaluation/graders/promptfoo-assertions.ts:74). That makes type: javascript/python/webhook with a score of 0 produce a passing verdict. Since composite threshold aggregation counts child verdicts (packages/core/src/evaluation/graders/composite.ts:181), a zero-score child can satisfy pass-count logic. Promptfoo-compatible numeric behavior should fail zero when no threshold is provided; please add coverage for score 0 across JS/Python/Webhook and composite threshold usage.
[P2] The AI-facing eval writer guide still teaches the removed/stale contract. skills-data/agentv-eval-writer/SKILL.md:136 says test-level criteria is conditionally required, skills-data/agentv-eval-writer/SKILL.md:271 says criteria without assertions runs implicit llm-grader, and skills-data/agentv-eval-writer/SKILL.md:416 / :717 still present code-grader as the YAML type. This PR changes the grader vocabulary and YAML contract, so the guide will keep generating invalid/deprecated evals unless it is updated to canonical assert, g-eval, llm-rubric, and script semantics.
[P2] Required dogfood evidence is still short of the repo gate for grader changes. The PR validation notes a live local LLM grader smoke against mock-target, while .agents/verification.md requires live provider/target plus real LLM grader dogfood for provider/grader/eval-execution changes and explicitly says mock targets do not count as live dogfood. Please add/link private evidence from a real target/provider run with the real grader, or clarify where that evidence already exists.
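
To make the zero-score finding concrete, here is a minimal sketch of the requested semantics: with no threshold, a numeric result passes only when it is strictly positive. The function and type names are illustrative and are not the code in promptfoo-assertions.ts; how an explicit threshold of 0 should behave is left as an assumption (`score >= threshold`).

```typescript
type ScriptOutput = number | boolean | { pass?: boolean; score: number };

interface Verdict {
  pass: boolean;
  score: number;
}

// No threshold: zero fails. Explicit threshold: compare with >= (assumption).
function passesThreshold(score: number, threshold?: number): boolean {
  return threshold === undefined ? score > 0 : score >= threshold;
}

function normalizeScriptResultSketch(output: ScriptOutput, threshold?: number): Verdict {
  if (typeof output === "boolean") {
    return { pass: output, score: output ? 1 : 0 };
  }
  if (typeof output === "number") {
    return { pass: passesThreshold(output, threshold), score: output };
  }
  // Object results: an explicit `pass` wins; otherwise fall back to the score rule.
  return {
    pass: output.pass ?? passesThreshold(output.score, threshold),
    score: output.score,
  };
}
```

Under this rule a zero-score child reports a failing verdict, so it no longer contributes to a composite's pass count.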

christso · 2026-07-02T14:45:28Z

Addressed the grader blocker in commit 013cf958.

Validation run locally:

bun test packages/core/test/evaluation/loaders/jsonl-parser.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts -> 268 pass
bun --filter @agentv/core test -> 2139 pass
bun run lint -> pass
bun run typecheck -> pass
git diff --check -> pass

Live-provider dogfood:

Local OpenAI-compatible endpoint: http://127.0.0.1:10531/v1
Model: gpt-5.3-codex-spark
Run bundle: .agentv/results/kfik7-dogfood-pr1599
Private evidence pushed to EntityProcess/agentv-private:evidence/av-kfik-7-2-graders commit 721b076, path evidence/pr-1599/kfik7-graders/

Dogfood contract checks:

- passive-expected-output: score 1, no scores entry in index.jsonl; expected-output-only grading was skipped.
- legacy-criteria-explicit-assertion: score 1, scores[0].type is g-eval, scores[0].target is local-openai-grader; legacy criteria desugared into explicit assertion form and used a real LLM grader.
- javascript-zero-score: score 0, verdict fail; numeric zero no longer passes by default.
- python-zero-score: score 0, verdict fail; numeric zero no longer passes by default.

The dogfood run is intentionally mixed 2 pass / 2 quality failures because the zero-score script assertions are expected to fail under the fixed semantics.
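
The first two contract checks above correspond to a grader-selection rule along these lines. This is a hedged sketch: `CaseSketch`, `planGraders`, and the `criteria` key on the g-eval entry are hypothetical, and how legacy criteria combines with an explicit assert list is an assumption (here, explicit assert wins).

```typescript
type AssertionObject = { type: string; [key: string]: unknown };

interface CaseSketch {
  expected_output?: string; // passive reference data
  criteria?: string;        // legacy test-level field
  assert?: AssertionObject[];
}

function planGraders(testCase: CaseSketch): AssertionObject[] {
  // Explicit assertions are used as written.
  if (testCase.assert && testCase.assert.length > 0) return testCase.assert;
  // Legacy criteria desugars into an explicit g-eval assertion.
  if (testCase.criteria !== undefined) {
    return [{ type: "g-eval", criteria: testCase.criteria }];
  }
  // expected_output alone schedules no grader, so no LLM call is made.
  return [];
}

const passive = planGraders({ expected_output: "42" });
const legacy = planGraders({ criteria: "Answer is polite", expected_output: "42" });
```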

christso · 2026-07-02T14:45:58Z

Post-push merge status check: not merging yet.

Current blockers after 013cf958:

GitHub reports mergeable: CONFLICTING / mergeStateStatus: DIRTY for PR #1599 against main.
Cloudflare Pages is still pending for the pushed head.

I left the PR in draft and did not merge. Validation and dogfood evidence are in the previous comment: #1599 (comment)

christso · 2026-07-02T15:03:17Z

Rebased feat/av-kfik-7-graders onto current origin/main and force-pushed with lease.

Head commit: 8d9b894ecc7a7638cc8eddb5e72fd9db2bb93ead

Local validation after rebase:

bun test packages/core/test/evaluation/loaders/jsonl-parser.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/eval-inline-experiment.test.ts -> 306 pass
bun run lint -> pass
bun run typecheck -> pass

Dogfood evidence: preserved existing live-provider evidence because the rebase only resolved conflicts and did not materially change grader behavior. Evidence remains on EntityProcess/agentv-private:evidence/av-kfik-7-2-graders at evidence/pr-1599/kfik7-graders/, evidence commit 721b076, run .agentv/results/kfik7-dogfood-pr1599; earlier PR note: #1599 (comment)

PR #1603 still has base branch feat/av-kfik-7-graders, so this branch must not be deleted if #1599 merges before #1603 is retargeted/rebased.

christso · 2026-07-02T15:14:29Z

Follow-up CI fix pushed.

Commit: 5d727840 (fix(eval): normalize promptfoo script imports)

What changed: promptfoo CSV python: / file://*.py expected DSL imports now emit canonical type: script assertions instead of the deprecated type: code-grader alias. This fixes the stale CI expectation in case-file-loader.test.ts and aligns with the script-grader naming contract.
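
The import mapping described above could be sketched as follows. This is not the loader in case-file-loader.ts: the function name and the `source` field are hypothetical, and only the two DSL forms named in this comment (`python:` prefix and `file://*.py`) are handled.

```typescript
// Hypothetical output shape; the real script assertion fields are not shown here.
interface ScriptAssertionSketch {
  type: "script";
  source: string;
}

// Map a promptfoo CSV expected value to a canonical `type: script` assertion,
// or return undefined when the value is not a Python-script form.
function importExpected(expected: string): ScriptAssertionSketch | undefined {
  const value = expected.trim();
  if (value.startsWith("python:")) {
    return { type: "script", source: value.slice("python:".length).trim() };
  }
  if (value.startsWith("file://") && value.endsWith(".py")) {
    return { type: "script", source: value };
  }
  return undefined;
}

const fromPrefix = importExpected("python: file://check.py");
const fromFile = importExpected("file://grade.py");
```

Both forms yield `type: "script"`; neither emits the deprecated code-grader alias.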

Validation after the fix:

bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/loaders/jsonl-parser.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/eval-inline-experiment.test.ts -> 347 pass
bun run lint -> pass
bun run typecheck -> pass
bun run test -> pass locally

Dogfood evidence unchanged from the prior note: EntityProcess/agentv-private:evidence/av-kfik-7-2-graders, evidence/pr-1599/kfik7-graders/, evidence commit 721b076, run .agentv/results/kfik7-dogfood-pr1599.

Reminder: PR #1603 still bases on feat/av-kfik-7-graders; do not delete this branch on merge.

christso added 5 commits July 2, 2026 17:00

feat(eval): add promptfoo assertion grader surface

aaaceea

fix(eval): make script the canonical subprocess grader

9cf5cad

fix(eval): trace script grader sources

725775c

fix(cli): include script graders in pipeline sidecars

e537e84

fix(eval): align grader default semantics

8d9b894

christso force-pushed the feat/av-kfik-7-graders branch from 013cf95 to 8d9b894 July 2, 2026 15:02

fix(eval): normalize promptfoo script imports

5d72784

christso marked this pull request as ready for review July 2, 2026 15:16

christso merged commit a773fc8 into main Jul 2, 2026
8 checks passed

christso deleted the feat/av-kfik-7-graders branch July 2, 2026 15:17

christso restored the feat/av-kfik-7-graders branch July 2, 2026 15:18

christso mentioned this pull request Jul 2, 2026

Grading contract: assertion_results rows #1603

Draft

christso merged 6 commits into main from feat/av-kfik-7-graders
