
[codex] Add promptfoo assertion grader surface#1599

Merged: christso merged 6 commits into main from feat/av-kfik-7-graders on Jul 2, 2026

@christso christso commented Jul 2, 2026

Collaborator

Summary

Implements Bead av-kfik.7 assertion/grader vocabulary work:

  • Makes assert canonical across YAML/JSONL parsing, with precedence over assertions and legacy evaluator fields.
  • Adds promptfoo-compatible runtime assertion types for javascript, python, webhook, similar, and assert-set, while keeping deterministic types aligned.
  • Routes bare-string assertions and structured rubrics to grouped g-eval, preserving rubric_item/rubrics, score_ranges, min_score, operator, weight, required-like gating, and per-criterion artifact rows.
  • Keeps llm-rubric as the free-form rubric-text LLM judge.
  • Preserves expected_output as first-class case data and verifies structured script stdin preserves expected_output, input, and config.
  • Uses script as the canonical subprocess grader type, preserving the former code-grader structured stdin/config behavior for migration.
  • Explicitly rejects known unimplemented promptfoo exotic assertion types instead of accepting them silently.
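The field precedence and bare-string routing described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the parser in @agentv/core: the `RawCase` shape, the `evaluators` field standing in for the legacy evaluator fields, the `criteria` key, and both helper names are assumptions made for the example.

```typescript
// Hypothetical shapes; the real parser types in @agentv/core may differ.
type Assertion = { type: string; [key: string]: unknown };
type RawAssertion = string | Assertion;

interface RawCase {
  assert?: RawAssertion[];     // canonical field
  assertions?: RawAssertion[]; // older alias, lower precedence
  evaluators?: RawAssertion[]; // stand-in for the legacy evaluator fields
}

// Pick the assertion list from the highest-precedence field that is present.
function resolveAssertions(raw: RawCase): RawAssertion[] {
  if (raw.assert !== undefined) return raw.assert;
  if (raw.assertions !== undefined) return raw.assertions;
  return raw.evaluators ?? [];
}

// Group bare-string assertions into a single g-eval entry (placed first);
// structured entries pass through unchanged. The `criteria` key is illustrative.
function groupBareStrings(items: RawAssertion[]): Assertion[] {
  const criteria = items.filter((i): i is string => typeof i === "string");
  const structured = items.filter((i): i is Assertion => typeof i !== "string");
  return criteria.length > 0
    ? [{ type: "g-eval", criteria }, ...structured]
    : structured;
}
```

In this sketch, a case that sets both `assert` and `assertions` is graded only by the `assert` list; whether the real parser also warns about the shadowed field is not stated in this PR.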

Validation

  • bun --filter @agentv/core generate:schema
  • bun run build
  • bun run lint
  • bun run typecheck
  • bun run test
    • core: 2131 pass
    • sdk: 92 pass
    • agentv CLI: 745 pass
    • dashboard: 145 pass
  • Follow-up naming correction validation:
    • bun test packages/core/test/evaluation/loaders/grader-parser.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/sdk/test/grader-helpers.test.ts
    • bun run lint
    • bun run typecheck
    • rg "script-grader|code-rubric" -n found no matches.
    • generated eval schema no longer advertises code-grader or code_grader.
  • Live local LLM grader smoke via OpenAI-compatible base URL http://127.0.0.1:10531/v1, no API key, model gpt-5.4-mini: mock-target run passed 2/2 for g-eval and llm-rubric; g-eval emitted per-criterion artifact rows in /tmp/agentv-kfik7-smoke.lQrEgt/results/index.jsonl.

Notes

First live smoke attempt with a CLI target was discarded because that target command used the wrong output-file contract; the subsequent mock-target grader smoke passed.

cloudflare-workers-and-pages Bot commented Jul 2, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 5d72784
Status: ✅  Deploy successful!
Preview URL: https://ccaa8ae6.agentv.pages.dev
Branch Preview URL: https://feat-av-kfik-7-graders.agentv.pages.dev

christso commented Jul 2, 2026

Collaborator Author

Pushed CI fix 42ba4de: source traceability now emits canonical script_grader_command/script_grader_cwd for script graders while keeping task-bundle compatibility with legacy code_grader_cwd references. Local validation: bun test packages/core/test/evaluation/source-traceability.test.ts; bun test apps/cli/test/commands/eval/task-bundle.test.ts; git diff --check; bun --filter @agentv/core test (2131 pass); bun run typecheck.

christso commented Jul 2, 2026

Collaborator Author

Pushed follow-up eb6073a after the second Test failure. The CLI pipeline now treats canonical script graders as subprocess sidecars while preserving the existing code_graders/code_grader_results directory contract; grade results preserve sidecar type and default legacy missing type to script. Local validation: targeted pipeline input/e2e/grade tests passed; git diff --check passed; bun run typecheck passed; clean rerun of bun --filter agentv test passed (745 pass, 0 fail).

christso commented Jul 2, 2026

Collaborator Author

CI is green for eb6073a: Build, Test, Typecheck, Lint, Check Links, Validate Evals, Validate Marketplace, and Cloudflare Pages all passed.

christso commented Jul 2, 2026

Collaborator Author

Review findings:

  1. [P1] expected_output and test-level criteria still trigger the implicit LLM grader. yaml-parser.ts still treats criteria or expected_output as a complete evaluation spec (packages/core/src/evaluation/yaml-parser.ts:677) and stores test-level criteria on the case (packages/core/src/evaluation/yaml-parser.ts:854); the schema still exposes criteria (packages/core/src/evaluation/validation/eval-file.schema.ts:526). Any case without explicit assertions then falls through to llm-grader (packages/core/src/evaluation/orchestrator.ts:2686), which evaluates it immediately (packages/core/src/evaluation/orchestrator.ts:2696). This conflicts with the av-kfik.7 decision to remove test-level criteria from the new YAML contract and to keep expected_output as passive, first-class reference data. A case containing only expected_output should not start a live/default LLM grading call, and criteria text should live under explicit assert entries such as g-eval.

  2. [P2] Script-style assertions pass numeric zero when no threshold is set. normalizeScriptResult defaults threshold = 0 (packages/core/src/evaluation/graders/promptfoo-assertions.ts:44), so numeric results pass on score >= threshold (packages/core/src/evaluation/graders/promptfoo-assertions.ts:57) and object results without pass use the same default (packages/core/src/evaluation/graders/promptfoo-assertions.ts:74). That makes type: javascript/python/webhook with a score of 0 produce a passing verdict. Since composite threshold aggregation counts child verdicts (packages/core/src/evaluation/graders/composite.ts:181), a zero-score child can satisfy pass-count logic. Promptfoo-compatible numeric behavior should fail zero when no threshold is provided; please add coverage for score 0 across JS/Python/Webhook and composite threshold usage.

  3. [P2] The AI-facing eval writer guide still teaches the removed/stale contract. skills-data/agentv-eval-writer/SKILL.md:136 says test-level criteria is conditionally required, skills-data/agentv-eval-writer/SKILL.md:271 says criteria without assertions runs implicit llm-grader, and skills-data/agentv-eval-writer/SKILL.md:416 / :717 still present code-grader as the YAML type. This PR changes the grader vocabulary and YAML contract, so the guide will keep generating invalid/deprecated evals unless it is updated to canonical assert, g-eval, llm-rubric, and script semantics.

  4. [P2] Required dogfood evidence is still short of the repo gate for grader changes. The PR validation notes a live local LLM grader smoke against mock-target, while .agents/verification.md requires live provider/target plus real LLM grader dogfood for provider/grader/eval-execution changes and explicitly says mock targets do not count as live dogfood. Please add/link private evidence from a real target/provider run with the real grader, or clarify where that evidence already exists.
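The pass/fail rule requested in finding 2 can be illustrated with a small sketch. It is not the code in promptfoo-assertions.ts: the `ScriptResult` union, `toVerdict`, and `compositePasses` are hypothetical names, and the rule shown (strictly positive score when no threshold is given, `score >= threshold` otherwise) is an assumption about the intended promptfoo-compatible behavior.

```typescript
// What a javascript/python/webhook assertion might return (illustrative union).
type ScriptResult = boolean | number | { pass?: boolean; score: number };

interface Verdict {
  pass: boolean;
  score: number;
}

// Without a threshold, a numeric score must be strictly positive to pass;
// with an explicit threshold, the score must reach it.
function toVerdict(result: ScriptResult, threshold?: number): Verdict {
  const meets = (score: number): boolean =>
    threshold !== undefined ? score >= threshold : score > 0;

  if (typeof result === "boolean") return { pass: result, score: result ? 1 : 0 };
  if (typeof result === "number") return { pass: meets(result), score: result };
  // Object results: an explicit `pass` wins, otherwise fall back to the score rule.
  return { pass: result.pass ?? meets(result.score), score: result.score };
}

// Composite aggregation that counts passing child verdicts.
function compositePasses(children: Verdict[], minPassing: number): boolean {
  return children.filter((child) => child.pass).length >= minPassing;
}
```

Under this rule a zero-score child no longer contributes to a composite pass count, which is the interaction the finding calls out.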

christso commented Jul 2, 2026

Collaborator Author

Addressed the grader blocker in commit 013cf958.

Validation run locally:

  • bun test packages/core/test/evaluation/loaders/jsonl-parser.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts -> 268 pass
  • bun --filter @agentv/core test -> 2139 pass
  • bun run lint -> pass
  • bun run typecheck -> pass
  • git diff --check -> pass

Live-provider dogfood:

  • Local OpenAI-compatible endpoint: http://127.0.0.1:10531/v1
  • Model: gpt-5.3-codex-spark
  • Run bundle: .agentv/results/kfik7-dogfood-pr1599
  • Private evidence pushed to EntityProcess/agentv-private:evidence/av-kfik-7-2-graders commit 721b076, path evidence/pr-1599/kfik7-graders/

Dogfood contract checks:

  • passive-expected-output: score 1, no scores entry in index.jsonl; expected-output-only grading was skipped.
  • legacy-criteria-explicit-assertion: score 1, scores[0].type is g-eval, scores[0].target is local-openai-grader; legacy criteria desugared into explicit assertion form and used a real LLM grader.
  • javascript-zero-score: score 0, verdict fail; numeric zero no longer passes by default.
  • python-zero-score: score 0, verdict fail; numeric zero no longer passes by default.

The dogfood run is intentionally mixed 2 pass / 2 quality failures because the zero-score script assertions are expected to fail under the fixed semantics.
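The first two contract checks above (passive expected_output and legacy criteria desugaring) can be sketched as a planning step that decides which graders run for a case. The `EvalCase` shape, the `criteria` array key on the g-eval entry, and the `plannedAssertions` helper are assumptions for illustration; the orchestrator's actual logic is not shown in this PR conversation.

```typescript
type Assertion = { type: string; [key: string]: unknown };

// Hypothetical case shape for the sketch.
interface EvalCase {
  id: string;
  input: string;
  expected_output?: string; // passive reference data, never a grader trigger
  criteria?: string;        // legacy test-level field
  assert?: Assertion[];     // explicit assertions
}

// Explicit assertions are used as-is; legacy criteria desugar to one explicit
// g-eval assertion; expected_output alone plans no grader at all.
function plannedAssertions(c: EvalCase): Assertion[] {
  if (c.assert !== undefined) return c.assert;
  if (c.criteria !== undefined) return [{ type: "g-eval", criteria: [c.criteria] }];
  return [];
}
```

An empty plan corresponds to the "no scores entry in index.jsonl" outcome reported for passive-expected-output.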

christso commented Jul 2, 2026

Collaborator Author

Post-push merge status check: not merging yet.

Current blockers after 013cf958:

I left the PR in draft and did not merge. Validation and dogfood evidence are in the previous comment: #1599 (comment)

@christso christso force-pushed the feat/av-kfik-7-graders branch from 013cf95 to 8d9b894 July 2, 2026 15:02

christso commented Jul 2, 2026

Collaborator Author

Rebased feat/av-kfik-7-graders onto current origin/main and force-pushed with lease.

Head commit: 8d9b894ecc7a7638cc8eddb5e72fd9db2bb93ead

Local validation after rebase:

  • bun test packages/core/test/evaluation/loaders/jsonl-parser.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/eval-inline-experiment.test.ts -> 306 pass
  • bun run lint -> pass
  • bun run typecheck -> pass

Dogfood evidence: preserved existing live-provider evidence because the rebase only resolved conflicts and did not materially change grader behavior. Evidence remains on EntityProcess/agentv-private:evidence/av-kfik-7-2-graders at evidence/pr-1599/kfik7-graders/, evidence commit 721b076, run .agentv/results/kfik7-dogfood-pr1599; earlier PR note: #1599 (comment)

PR #1603 still has base branch feat/av-kfik-7-graders, so this branch must not be deleted if #1599 merges before #1603 is retargeted/rebased.

christso commented Jul 2, 2026

Collaborator Author

Follow-up CI fix pushed.

Commit: 5d727840 (fix(eval): normalize promptfoo script imports)

What changed: promptfoo CSV python: / file://*.py expected DSL imports now emit canonical type: script assertions instead of the deprecated type: code-grader alias. This fixes the stale CI expectation in case-file-loader.test.ts and aligns with the script-grader naming contract.
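As a rough illustration of the import change described above, the sketch below maps a promptfoo-style CSV expected cell to an assertion object. Only the `python:` and `file://*.py` prefixes and the canonical `type: script` come from this PR; the `ImportedAssertion` shape, the `language`, `inline`, and `file` fields, the `importExpected` name, and the plain-text fallback to `equals` are assumptions for the example.

```typescript
// Hypothetical output shapes; the real loader's fields may differ.
type ImportedAssertion =
  | { type: "script"; language: "python"; inline: string }
  | { type: "script"; language: "python"; file: string }
  | { type: "equals"; value: string };

// Map one CSV expected cell to an assertion, emitting canonical `script`
// (never the deprecated `code-grader` alias) for Python graders.
function importExpected(cell: string): ImportedAssertion {
  const text = cell.trim();
  if (text.startsWith("python:")) {
    const inline = text.slice("python:".length).trim();
    return { type: "script", language: "python", inline };
  }
  if (text.startsWith("file://") && text.endsWith(".py")) {
    return { type: "script", language: "python", file: text.slice("file://".length) };
  }
  // Assumed fallback for plain text cells.
  return { type: "equals", value: text };
}
```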

Validation after the fix:

  • bun test packages/core/test/evaluation/loaders/case-file-loader.test.ts packages/core/test/evaluation/loaders/jsonl-parser.test.ts packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/eval-inline-experiment.test.ts -> 347 pass
  • bun run lint -> pass
  • bun run typecheck -> pass
  • bun run test -> pass locally

Dogfood evidence unchanged from the prior note: EntityProcess/agentv-private:evidence/av-kfik-7-2-graders, evidence/pr-1599/kfik7-graders/, evidence commit 721b076, run .agentv/results/kfik7-dogfood-pr1599.

Reminder: PR #1603 still bases on feat/av-kfik-7-graders; do not delete this branch on merge.

@christso christso marked this pull request as ready for review July 2, 2026 15:16
@christso christso merged commit a773fc8 into main Jul 2, 2026
8 checks passed
@christso christso deleted the feat/av-kfik-7-graders branch July 2, 2026 15:17
@christso christso restored the feat/av-kfik-7-graders branch July 2, 2026 15:18