[codex] Add promptfoo assertion grader surface#1599
Deploying agentv with Cloudflare Pages
| Latest commit: | 5d72784 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://ccaa8ae6.agentv.pages.dev |
| Branch Preview URL: | https://feat-av-kfik-7-graders.agentv.pages.dev |
Pushed CI fix 42ba4de: source traceability now emits canonical script_grader_command/script_grader_cwd for script graders while keeping task-bundle compatibility with legacy code_grader_cwd references. Local validation: bun test packages/core/test/evaluation/source-traceability.test.ts; bun test apps/cli/test/commands/eval/task-bundle.test.ts; git diff --check; bun --filter @agentv/core test (2131 pass); bun run typecheck.
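A minimal sketch of the compatibility read described in the 42ba4de comment, assuming a consumer prefers the canonical field and falls back to the legacy one. Only the three field names come from the comment; the `GraderTraceRecord` shape, the argv-array type for the command, and the `resolveGraderCwd` helper are hypothetical and not taken from the agentv source.

```typescript
// Hypothetical record shape. Only the field names are from the PR comment.
interface GraderTraceRecord {
  script_grader_command?: string[]; // assumed to be an argv array
  script_grader_cwd?: string;       // canonical field
  code_grader_cwd?: string;         // legacy field kept for older task bundles
}

// Prefer the canonical cwd; fall back to the legacy one when it is absent.
function resolveGraderCwd(record: GraderTraceRecord): string | undefined {
  return record.script_grader_cwd ?? record.code_grader_cwd;
}
```

With this shape, a bundle written before the rename still resolves a working directory, while newly emitted records only need the canonical field.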
Pushed follow-up eb6073a after the second Test failure. The CLI pipeline now treats canonical script graders as subprocess sidecars while preserving the existing code_graders/code_grader_results directory contract; grade results preserve the sidecar type and default a missing legacy type to script. Local validation: targeted pipeline input/e2e/grade tests passed; git diff --check passed; bun run typecheck passed; clean rerun of bun --filter agentv test passed (745 pass, 0 fail).
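The type-defaulting rule in the eb6073a comment can be illustrated roughly as follows. This is a sketch under assumptions: the `SidecarGradeResult` shape and the `normalizeSidecarType` helper are hypothetical, and only the rule itself (keep an existing type, treat a missing legacy type as `script`) comes from the comment.

```typescript
// Hypothetical grade-result shape; the real agentv type may differ.
interface SidecarGradeResult {
  name: string;
  score: number;
  type?: string; // legacy results may omit this field
}

// Keep the recorded sidecar type; default a missing one to "script".
function normalizeSidecarType(
  result: SidecarGradeResult,
): SidecarGradeResult & { type: string } {
  return { ...result, type: result.type ?? "script" };
}
```

Returning a new object rather than mutating the input keeps the stored legacy result untouched, which may matter if the same file is re-read by older tooling.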
CI is green for eb6073a: Build, Test, Typecheck, Lint, Check Links, Validate Evals, Validate Marketplace, and Cloudflare Pages all passed.
Review findings:
Addressed the grader blocker in commit Validation run locally:
Live-provider dogfood:
Dogfood contract checks:
The dogfood run is intentionally mixed (2 passes, 2 quality failures) because the zero-score script assertions are expected to fail under the fixed semantics.
Post-push merge status check: not merging yet. Current blockers after
I left the PR in draft and did not merge. Validation and dogfood evidence are in the previous comment: #1599 (comment)
013cf95 to 8d9b894
Rebased Head commit: Local validation after rebase:
Dogfood evidence: preserved existing live-provider evidence because the rebase only resolved conflicts and did not materially change grader behavior. Evidence remains on PR #1603 still has base branch
Follow-up CI fix pushed. Commit: What changed: promptfoo CSV Validation after the fix:
Dogfood evidence unchanged from the prior note: Reminder: PR #1603 still bases on
Summary
Implements Bead av-kfik.7 assertion/grader vocabulary work:
- assert canonical across YAML/JSONL parsing, with precedence over assertions and legacy evaluator fields.
- javascript, python, webhook, similar, and assert-set, while keeping deterministic types aligned.
- g-eval, preserving rubric_item/rubrics, score_ranges, min_score, operator, weight, required-like gating, and per-criterion artifact rows.
- llm-rubric as the free-form rubric-text LLM judge.
- expected_output as first-class case data and verifies structured script stdin preserves expected_output, input, and config.
- script as the canonical subprocess grader type, preserving the former code-grader structured stdin/config behavior for migration.
Validation
- bun --filter @agentv/core generate:schema
- bun run build
- bun run lint
- bun run typecheck
- bun run test
- bun test packages/core/test/evaluation/loaders/grader-parser.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/sdk/test/grader-helpers.test.ts
- bun run lint
- bun run typecheck
- rg "script-grader|code-rubric" -n found no matches.
- code-grader or code_grader.
- http://127.0.0.1:10531/v1, no API key, model gpt-5.4-mini: mock-target run passed 2/2 for g-eval and llm-rubric; g-eval emitted per-criterion artifact rows in /tmp/agentv-kfik7-smoke.lQrEgt/results/index.jsonl.
Notes
First live smoke attempt with a CLI target was discarded because that target command used the wrong output-file contract; the subsequent mock-target grader smoke passed.
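The precedence rule from the Summary (assert wins over assertions, which in turn wins over legacy evaluator fields) could be sketched as below. This is an illustration under assumptions, not the parser in this PR: the `RawEvalCase` shape, the `evaluators` name for the legacy field, and the `pickAssertions` helper are all hypothetical.

```typescript
// Hypothetical raw case shape after YAML/JSONL parsing.
interface RawEvalCase {
  assert?: unknown[];      // canonical field
  assertions?: unknown[];  // older alias
  evaluators?: unknown[];  // stand-in name for the legacy evaluator fields
}

// Return the first field that is present, in precedence order.
function pickAssertions(raw: RawEvalCase): unknown[] {
  if (raw.assert !== undefined) return raw.assert;
  if (raw.assertions !== undefined) return raw.assertions;
  return raw.evaluators ?? [];
}
```

The check is on presence rather than on length, so an explicit empty `assert: []` would still override the older fields in this sketch; whether the real parser behaves that way is not stated in the PR text.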