Grading contract: assertion_results rows #1603
Deploying agentv with
| Latest commit: | 864fc93 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://624c371c.agentv.pages.dev |
| Branch Preview URL: | https://feat-av-kfik-11-grading-cont.agentv.pages.dev |
Manual CI was dispatched because this stacked PR targets `feat/av-kfik-7-graders` and `pull_request` CI only runs for base `main`. Green run: https://github.com/EntityProcess/agentv/actions/runs/28590245280
Review findings for av-kfik.11.1:
Review-fix commit pushed: Addressed the four findings from #1603 (comment):
Validation run:
Live dogfood rerun because the fix changes explicit custom grader prompt behavior:
Manual CI for pushed commit 1f443fe is green: https://github.com/EntityProcess/agentv/actions/runs/28592955489
Force-pushed from 013cf95 to 8d9b894.
Adds promptfoo-compatible assertion graders, canonical script grader handling, and grader contract docs/schema updates.

Validated with focused grader/parser tests, lint, typecheck, full workspace tests, and green GitHub Actions. Live-provider dogfood evidence is recorded on `agentv-private:evidence/av-kfik-7-2-graders`.

Branch `feat/av-kfik-7-graders` intentionally retained because PR #1603 is stacked on it.
Stack maintenance note after #1599 merge:
Force-pushed from 1f443fe to cd81ca5.
Post-rebase manual CI dispatched for recovered head cd81ca5: https://github.com/EntityProcess/agentv/actions/runs/28603514411
Review findings for exact head
I did not run local tests/builds/evals; both findings are visible from the diff and the exact-head manual CI run is already green.
Addressed the review blockers in new head
Validation:
Live dogfood through
CI for this head started automatically: https://github.com/EntityProcess/agentv/actions/runs/28605214475
Private dogfood evidence for review-fix head
Re-review for head The two prior blockers are resolved:
I did not find new blockers in the review-fix commit. Fresh PR CI is green for this head: https://github.com/EntityProcess/agentv/actions/runs/28605214475. Orchestrator may proceed to ready/merge.
Summary
Implements Bead `av-kfik.11` by making `grading.json` expose `assertion_results` with row evidence, row `score`, row `verdict`, and top-level `score`/`verdict` while keeping internal/index grader data on the existing `assertions` path.

The default LLM judge prompts now bias toward skeptical, evidence-by-path grading unless the eval author supplies an explicit prompt.
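As a rough illustration of the contract summarized above, the sketch below models a `grading.json` document in TypeScript. Only `assertion_results`, `evidence`, `score`, and `verdict` come from this PR description. The `text` field, the `"pass"`/`"fail"` verdict vocabulary, the 0 to 1 score range, and the helper `summarizeRows` are assumptions made for illustration and may not match the actual agentv schema.

```typescript
// Sketch only: names other than assertion_results, evidence, score, and
// verdict are assumptions, not the real agentv schema.
type Verdict = "pass" | "fail"; // assumed verdict vocabulary

interface AssertionResultRow {
  text: string; // assumed: the assertion being graded
  evidence: string; // row evidence cited by the grader
  score: number; // row score, assumed to lie in [0, 1]
  verdict: Verdict; // row verdict
}

interface GradingDocument {
  score: number; // top-level score
  verdict: Verdict; // top-level verdict
  assertion_results: AssertionResultRow[];
}

interface RowSummary {
  passed: number;
  failed: number;
  total: number;
}

// Hypothetical helper: count rows by verdict, in the spirit of the
// "summary counts" mentioned under Changes.
function summarizeRows(doc: GradingDocument): RowSummary {
  const total = doc.assertion_results.length;
  const passed = doc.assertion_results.filter((row) => row.verdict === "pass").length;
  return { passed, failed: total - passed, total };
}

const exampleGrading: GradingDocument = {
  score: 0.5,
  verdict: "fail",
  assertion_results: [
    {
      text: "Response names the config file",
      evidence: "output line 3 mentions eval.yaml",
      score: 1,
      verdict: "pass",
    },
    {
      text: "Response cites a test run",
      evidence: "no test output found in the transcript",
      score: 0,
      verdict: "fail",
    },
  ],
};

console.log(summarizeRows(exampleGrading));
```

Keeping evidence on every row means a reader of the artifact can see why each row passed or failed without consulting the grader's internal data.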
Recovery
- `feat/av-kfik-7-graders`.
- `recovery/av-kfik11-pre-rebase`.
- `origin/main`.
- `cd81ca51902591aa6ce8c52e8bd9471a774e3d8e`.
- `main`.

Changes
- `assertion_results`, include summary counts, and carry top-level score/verdict.
- `grading.json` contract, with legacy read fallback for old sidecars.
- `grading.json` artifact contract.
- `assertion_results`, evidence, `score`, `verdict`, and absence of legacy `assertions` in `grading.json`.

Validation
- `bun install`
- `bun test packages/core/test/evaluation/graders.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts` - 64 pass
- `bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts apps/cli/test/commands/results/export-e2e-providers.test.ts apps/cli/test/commands/eval/aggregate.test.ts apps/cli/test/commands/eval/pipeline/bench.test.ts apps/cli/test/commands/results/summary.test.ts apps/cli/test/commands/results/validate.test.ts apps/dashboard/src/components/EvalDetail.test.ts` - 152 pass
- `bun run --cwd packages/core typecheck`
- `bun run --cwd packages/core lint`
- `bun run --cwd packages/core build`
- `bun run --cwd apps/cli typecheck`
- `bun run --cwd apps/cli lint`
- `bun run --cwd apps/cli build`
- `bun run --cwd apps/dashboard build`
- `bun run --cwd apps/web build`

Manual CI for the recovered head passed:
Dogfood Evidence
Live provider + LLM grader dogfood ran against `http://127.0.0.1:10531/v1` with model `gpt-5.3-codex-spark`.

- `LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1 LOCAL_OPENAI_PROXY_API_KEY=local-dummy-key LOCAL_OPENAI_PROXY_MODEL=gpt-5.3-codex-spark bun apps/cli/src/cli.ts eval run /tmp/agentv-kfik11-dogfood/eval.yaml --targets /tmp/agentv-kfik11-dogfood/targets.yaml --target local-openai-candidate --workers 1`
- `.agentv/results/2026-07-02T15-49-55-807Z/`
- `.agentv/results/2026-07-02T15-49-55-807Z/live-grading-contract--f63003d85eca/run-1/grading.json`
- `EntityProcess/agentv-private:evidence/av-kfik11-grading-contract`
- `087f9a2bd6bef5b7e38a0f777a017d4d71ca071c`
- `dogfood/post-rebase-main/`

The captured jq check verifies top-level `score`, top-level `verdict`, top-level `assertion_results`, row evidence/score/verdict, nested grader `assertion_results`, and absence of legacy `assertions` in `grading.json`.
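The jq check itself is not reproduced in this description. As a hedged approximation, the sketch below applies the same list of conditions to a parsed `grading.json` value in TypeScript. The function names, the problem message format, the `graders` and `name` keys in the sample data, and the assumption that `score` is a number and `verdict` a string are all hypothetical. To avoid guessing where nested grader rows live, the walk checks every `assertion_results` key at any depth.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isRecord(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// A row must carry evidence, a numeric score, and a string verdict (assumed types).
function checkRow(row: Json, path: string, problems: string[]): void {
  if (!isRecord(row)) {
    problems.push(`${path}: row is not an object`);
    return;
  }
  if (!("evidence" in row)) problems.push(`${path}: missing evidence`);
  if (typeof row["score"] !== "number") problems.push(`${path}: score is not a number`);
  if (typeof row["verdict"] !== "string") problems.push(`${path}: verdict is not a string`);
}

// Visit every key at any depth: flag legacy `assertions`, validate `assertion_results`.
function walk(node: Json, path: string, problems: string[]): void {
  if (Array.isArray(node)) {
    node.forEach((child, i) => walk(child, `${path}[${i}]`, problems));
    return;
  }
  if (!isRecord(node)) return;
  for (const [key, value] of Object.entries(node)) {
    const childPath = `${path}.${key}`;
    if (key === "assertions") problems.push(`${childPath}: legacy key present`);
    if (key === "assertion_results") {
      if (!Array.isArray(value)) problems.push(`${childPath}: not an array`);
      else value.forEach((row, i) => checkRow(row, `${childPath}[${i}]`, problems));
    }
    walk(value, childPath, problems);
  }
}

// Returns a list of problems; an empty list means every check passed.
function checkGrading(doc: Json): string[] {
  if (!isRecord(doc)) return ["$: document is not an object"];
  const problems: string[] = [];
  if (typeof doc["score"] !== "number") problems.push("$.score: missing or not a number");
  if (typeof doc["verdict"] !== "string") problems.push("$.verdict: missing or not a string");
  if (!Array.isArray(doc["assertion_results"])) {
    problems.push("$.assertion_results: missing or not an array");
  }
  walk(doc, "$", problems);
  return problems;
}

// Sample data; the `graders` and `name` keys are hypothetical.
const row: Json = { evidence: "output line 3", score: 1, verdict: "pass" };
const validDoc: Json = {
  score: 1,
  verdict: "pass",
  assertion_results: [row],
  graders: [{ name: "llm-judge", assertion_results: [row] }],
};
const legacyDoc: Json = { score: 1, verdict: "pass", assertions: [] };

console.log(checkGrading(validDoc));
console.log(checkGrading(legacyDoc));
```

Returning a list of problems rather than throwing on the first failure makes it easier to record a complete result alongside other dogfood evidence.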