Grading contract: assertion_results rows by christso · Pull Request #1603 · EntityProcess/agentv

christso · 2026-07-02T12:06:38Z

Summary

Implements Bead av-kfik.11 by making grading.json expose assertion_results with row evidence, row score, row verdict, and top-level score/verdict while keeping internal/index grader data on the existing assertions path.
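For readers who have not opened the artifact, the shape described above can be sketched roughly as follows. Only `assertion_results`, `evidence`, `score`, `verdict`, and `summary` come from this PR; the `text` field, the count names inside `summary`, the two-value `Verdict` union, and all sample values are assumptions made for illustration and may not match the real writer.

```typescript
// Hypothetical sketch of the grading.json contract, not the actual AgentV types.
type Verdict = "pass" | "fail"; // assumed to be the only two values

interface AssertionResultRow {
  text: string; // assumed label for the assertion being checked
  evidence: string; // row evidence
  score: number; // row score, assumed normalized to 0..1
  verdict: Verdict; // row verdict
}

interface SummaryCounts {
  passed: number; // assumed count names
  failed: number;
  total: number;
}

interface GradingArtifact {
  score: number; // top-level score
  verdict: Verdict; // top-level verdict
  assertion_results: AssertionResultRow[];
  summary: SummaryCounts;
}

// Recompute the summary counts from the rows.
function summarize(rows: AssertionResultRow[]): SummaryCounts {
  const passed = rows.filter((row) => row.verdict === "pass").length;
  return { passed, failed: rows.length - passed, total: rows.length };
}

const rows: AssertionResultRow[] = [
  { text: "Response cites the config path", evidence: "Found at output.files[0].path", score: 1, verdict: "pass" },
  { text: "Response lists all three steps", evidence: "Only two steps appear in output.text", score: 0.5, verdict: "fail" },
];

// Illustrative values only; how the real top-level score is computed is not shown here.
const exampleGrading: GradingArtifact = {
  score: 0.75,
  verdict: "fail",
  assertion_results: rows,
  summary: summarize(rows),
};

console.log(JSON.stringify(exampleGrading, null, 2));
```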

The default LLM judge prompts now bias toward skeptical, evidence-by-path grading unless the eval author supplies an explicit prompt.

Recovery

Recovered the PR from the old stacked base feat/av-kfik-7-graders.
Preserved the original PR head locally as recovery/av-kfik11-pre-rebase.
Rebased only the two grading-contract commits onto current origin/main.
Current head: cd81ca51902591aa6ce8c52e8bd9471a774e3d8e.
Current base: main.

Changes

Updates core artifact writers so per-run and aggregate grading artifacts use assertion_results, include summary counts, and carry top-level score/verdict.
Updates CLI result projection and validation to read the new grading.json contract, with legacy read fallback for old sidecars.
Updates pipeline bench sidecar output to match the grading.json artifact contract.
Adds golden assertions across artifact writer, aggregate, export, summary, e2e provider, dashboard, and pipeline tests for assertion_results, evidence, score, verdict, and absence of legacy assertions in grading.json.
Documents the new artifact shape in the public result artifacts reference.

Validation

bun install
bun test packages/core/test/evaluation/graders.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts - 64 pass
bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts apps/cli/test/commands/results/export-e2e-providers.test.ts apps/cli/test/commands/eval/aggregate.test.ts apps/cli/test/commands/eval/pipeline/bench.test.ts apps/cli/test/commands/results/summary.test.ts apps/cli/test/commands/results/validate.test.ts apps/dashboard/src/components/EvalDetail.test.ts - 152 pass
bun run --cwd packages/core typecheck
bun run --cwd packages/core lint
bun run --cwd packages/core build
bun run --cwd apps/cli typecheck
bun run --cwd apps/cli lint
bun run --cwd apps/cli build
bun run --cwd apps/dashboard build
bun run --cwd apps/web build

Manual CI for the recovered head passed:

https://github.com/EntityProcess/agentv/actions/runs/28603514411

Dogfood Evidence

Live provider + LLM grader dogfood ran against http://127.0.0.1:10531/v1 with model gpt-5.3-codex-spark.

Command: LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1 LOCAL_OPENAI_PROXY_API_KEY=local-dummy-key LOCAL_OPENAI_PROXY_MODEL=gpt-5.3-codex-spark bun apps/cli/src/cli.ts eval run /tmp/agentv-kfik11-dogfood/eval.yaml --targets /tmp/agentv-kfik11-dogfood/targets.yaml --target local-openai-candidate --workers 1
Public run bundle path: .agentv/results/2026-07-02T15-49-55-807Z/
Inspected artifact: .agentv/results/2026-07-02T15-49-55-807Z/live-grading-contract--f63003d85eca/run-1/grading.json
Private evidence branch: EntityProcess/agentv-private:evidence/av-kfik11-grading-contract
Private evidence commit: 087f9a2bd6bef5b7e38a0f777a017d4d71ca071c
Evidence folder: dogfood/post-rebase-main/

The captured jq check verifies top-level score, top-level verdict, top-level assertion_results, row evidence/score/verdict, nested grader assertion_results, and absence of legacy assertions in grading.json.
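The captured jq expression itself is not reproduced in this PR. As a rough equivalent, the same six checks could be written in TypeScript as below. The function names are hypothetical, and locating nested grader rows under a `graders` array is an assumption based on the review discussion, not a confirmed path.

```typescript
// Hypothetical contract check mirroring the properties listed above.
type Json = Record<string, unknown>;

function isRecord(value: unknown): value is Json {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Every row must carry evidence, score, and verdict.
function checkRows(rows: unknown, path: string, problems: string[]): void {
  if (!Array.isArray(rows)) {
    problems.push(`${path}.assertion_results is not an array`);
    return;
  }
  rows.forEach((row, index) => {
    if (!isRecord(row)) {
      problems.push(`${path}.assertion_results[${index}] is not an object`);
      return;
    }
    for (const key of ["evidence", "score", "verdict"]) {
      if (!(key in row)) problems.push(`${path}.assertion_results[${index}] lacks ${key}`);
    }
  });
}

function checkGradingContract(doc: unknown): string[] {
  if (!isRecord(doc)) return ["grading.json is not an object"];
  const problems: string[] = [];
  if (typeof doc["score"] !== "number") problems.push("missing top-level score");
  if (typeof doc["verdict"] !== "string") problems.push("missing top-level verdict");
  if ("assertions" in doc) problems.push("legacy assertions key is present");
  checkRows(doc["assertion_results"], "$", problems);
  // Assumption: nested grader rows live under a top-level graders array.
  const graders = Array.isArray(doc["graders"]) ? doc["graders"] : [];
  graders.forEach((grader, index) => {
    if (isRecord(grader)) checkRows(grader["assertion_results"], `$.graders[${index}]`, problems);
  });
  return problems;
}

const row = { evidence: "seen at output.text", score: 1, verdict: "pass" };
const problems = checkGradingContract({
  score: 1,
  verdict: "pass",
  assertion_results: [row],
  graders: [{ assertion_results: [row] }],
});
console.log(problems.length === 0 ? "contract ok" : problems.join("\n"));
```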

cloudflare-workers-and-pages · 2026-07-02T12:06:54Z

Deploying agentv with Cloudflare Pages

Latest commit:	`864fc93`
Status:	✅ Deploy successful!
Preview URL:	https://624c371c.agentv.pages.dev
Branch Preview URL:	https://feat-av-kfik-11-grading-cont.agentv.pages.dev


christso · 2026-07-02T12:35:03Z

Manual CI was dispatched because this stacked PR targets feat/av-kfik-7-graders and pull_request CI only runs for base main. Green run: https://github.com/EntityProcess/agentv/actions/runs/28590245280

christso · 2026-07-02T12:45:47Z

Review findings for av-kfik.11.1:

P1 - Custom rubric prompts now receive the new skeptical default instructions. buildRubricOutputSchema() and buildScoreRangeOutputSchema() now include the skeptical evidence-by-path language (packages/core/src/evaluation/graders/llm-grader.ts:1216, packages/core/src/evaluation/graders/llm-grader.ts:1281), and both explicit custom-prompt paths still use those schemas (packages/core/src/evaluation/graders/llm-grader.ts:789, packages/core/src/evaluation/graders/llm-grader-prompt.ts:153). That means an author-supplied rubric or score-range prompt still gets AgentV's skeptical grading behavior injected through the system/schema prompt, which conflicts with av-kfik.11's requirement that the skeptical default apply unless the author supplies an explicit prompt. Please keep the skeptical language in the default prompt builders only, or split the schema text so custom prompts get format instructions without behavioral grading guidance.
P1 - Dashboard trial checks lose the new grading evidence rows. parseGradingArtifact still reads only parsed.assertions (apps/dashboard/src/components/EvalDetail.tsx:676), while this PR stops writing legacy grading.json.assertions. For repeated/trial runs, the Checks tab fetches grading.json directly and will render No assertion steps recorded in grading.json even when the sidecar has assertion_results with evidence. Since the Dashboard is the supported zero-infra inspection path for run artifacts, please read assertion_results there with a legacy assertions fallback.
P2 - Manifest hydration drops nested grader rows from the new sidecar shape. buildEvaluators now writes nested child grader rows under scores in grading.json (packages/core/src/evaluation/run-artifacts.ts:718), but hydrateManifestRecord maps grading.graders into EvaluationResult.scores without recursively copying evaluator.scores (apps/cli/src/commands/results/manifest.ts:262). Because grading.graders exists, the fallback to record.scores at line 288 will not run, so consumers of loadManifestResults lose nested grader rows after hydration. Please make the grader mapper recursive and translate child assertion_results/legacy assertions as well.
P2 - results validate rejects old bundles that the readers otherwise support. The validator now errors unless grading.assertion_results exists (apps/cli/src/commands/results/validate.ts:277), but hydrateManifestRecord explicitly falls back from assertion_results to legacy assertions for old sidecars (apps/cli/src/commands/results/manifest.ts:232). If old-bundle read compatibility is intended for this migration, validation should accept either key, or at least downgrade legacy-only sidecars to a compatibility warning instead of an error.
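The Dashboard and validator findings both come down to the same read order: prefer `assertion_results`, fall back to legacy `assertions`. A minimal sketch of that order is below; `readAssertionRows` and `validationLevel` are hypothetical names, not the functions in `EvalDetail.tsx` or `validate.ts`.

```typescript
// Hypothetical reader: new key first, legacy key as a fallback.
type RowSource = "assertion_results" | "assertions" | "none";

interface ReadResult {
  rows: unknown[];
  source: RowSource;
}

function readAssertionRows(parsed: unknown): ReadResult {
  if (typeof parsed !== "object" || parsed === null) return { rows: [], source: "none" };
  const doc = parsed as Record<string, unknown>;
  const current = doc["assertion_results"];
  if (Array.isArray(current)) return { rows: current, source: "assertion_results" };
  const legacy = doc["assertions"];
  if (Array.isArray(legacy)) return { rows: legacy, source: "assertions" };
  return { rows: [], source: "none" };
}

// A validator could downgrade legacy-only sidecars to a warning, as the last finding suggests.
function validationLevel(source: RowSource): "ok" | "warning" | "error" {
  if (source === "assertion_results") return "ok";
  return source === "assertions" ? "warning" : "error";
}

const oldSidecar = { assertions: [{ text: "legacy row", passed: true }] };
const result = readAssertionRows(oldSidecar);
console.log(result.source, validationLevel(result.source));
```

Returning the source alongside the rows lets one reader serve both consumers: the Dashboard renders `rows` either way, while the validator decides severity from `source`.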

christso · 2026-07-02T13:13:54Z

Review-fix commit pushed: 1f443fe4 (fix(eval): address grading contract review findings).

Addressed the four findings from #1603 (comment):

Custom rubric/score-range prompts now use format-only schema instructions; default prompts keep AgentV skeptical evidence-by-path guidance.
Dashboard trial Checks reads grading.json.assertion_results with legacy assertions fallback.
Manifest hydration recursively maps nested grader scores and translates child assertion_results / legacy assertions.
results validate accepts legacy-only grading.json.assertions sidecars with a compatibility warning instead of a hard error.
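The recursive hydration fix can be pictured with a small sketch. The interfaces and `hydrateGrader` below are hypothetical stand-ins; the real mapper in `manifest.ts` works on AgentV's own types, which are not shown in this PR thread.

```typescript
// Hypothetical sidecar grader entry: rows may be under the new or the legacy key.
interface SidecarGrader {
  name: string;
  score: number;
  assertion_results?: unknown[];
  assertions?: unknown[];
  scores?: SidecarGrader[]; // nested child graders
}

// Hypothetical hydrated form used by in-memory consumers.
interface HydratedScore {
  name: string;
  score: number;
  assertions: unknown[];
  scores?: HydratedScore[];
}

// Map one grader and recurse into its children so nested rows survive hydration.
function hydrateGrader(grader: SidecarGrader): HydratedScore {
  const hydrated: HydratedScore = {
    name: grader.name,
    score: grader.score,
    assertions: grader.assertion_results ?? grader.assertions ?? [],
  };
  if (grader.scores && grader.scores.length > 0) {
    hydrated.scores = grader.scores.map(hydrateGrader);
  }
  return hydrated;
}

const composite: SidecarGrader = {
  name: "composite",
  score: 1,
  assertion_results: [],
  scores: [
    { name: "child-new", score: 1, assertion_results: [{ verdict: "pass" }] },
    { name: "child-legacy", score: 1, assertions: [{ passed: true }] },
  ],
};

console.log(JSON.stringify(hydrateGrader(composite), null, 2));
```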

Validation run:

bun test packages/core/test/evaluation/graders.test.ts apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/results/validate.test.ts apps/dashboard/src/components/EvalDetail.test.ts
bun run --cwd packages/core typecheck
bun run --cwd packages/core lint
bun run --cwd packages/core build
bun run --cwd apps/cli lint
bun run --cwd apps/cli typecheck
bun run --cwd apps/cli build
bun run --cwd apps/dashboard test
bun run --cwd apps/dashboard build
bun run lint

Live dogfood was rerun because the fix changes explicit custom grader prompt behavior:

endpoint/model: http://127.0.0.1:10531/v1, gpt-5.3-codex-spark
command: LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1 LOCAL_OPENAI_PROXY_API_KEY=dummy LOCAL_OPENAI_PROXY_MODEL=gpt-5.3-codex-spark bun apps/cli/src/cli.ts eval run /tmp/agentv-kfik112-dogfood/eval.yaml --targets /tmp/agentv-kfik112-dogfood/targets.yaml --target local-openai-candidate --workers 1
passing run bundle: .agentv/results/2026-07-02T13-11-03-936Z/
inspected artifact: .agentv/results/2026-07-02T13-11-03-936Z/custom-rubric-contract--17ad22116714/run-1/grading.json
verified: custom rubric systemPrompt has only format instructions, grading.json has top-level and nested assertion_results, no legacy assertions, score=1, verdict=pass
private evidence: EntityProcess/agentv-private:evidence/av-kfik11-grading-contract at d465b44

christso · 2026-07-02T13:18:49Z

Manual CI for pushed commit 1f443fe is green: https://github.com/EntityProcess/agentv/actions/runs/28592955489

Adds promptfoo-compatible assertion graders, canonical script grader handling, and grader contract docs/schema updates.

Validated with focused grader/parser tests, lint, typecheck, full workspace tests, and green GitHub Actions. Live-provider dogfood evidence is recorded on agentv-private:evidence/av-kfik-7-2-graders.

Branch feat/av-kfik-7-graders intentionally retained because PR #1603 is stacked on it.

christso · 2026-07-02T15:18:27Z

Stack maintenance note after #1599 merge:

[codex] Add promptfoo assertion grader surface #1599 was squash-merged to main as a773fc88d4100bca6a3a5ef1a73ef982bb7a155f.
GitHub briefly removed/auto-retargeted the stacked base; I restored feat/av-kfik-7-graders at 5d727840d5be91380a97287d88d5613912b57be7 and retargeted this PR back to that base to preserve the stacked diff.
Next worker should rebase/retarget this PR safely onto current main when ready, accounting for [codex] Add promptfoo assertion grader surface #1599 being squash-merged rather than merged with its original commit SHAs.

christso · 2026-07-02T15:54:38Z

Post-rebase manual CI dispatched for recovered head cd81ca5: https://github.com/EntityProcess/agentv/actions/runs/28603514411

christso · 2026-07-02T16:02:16Z

Review findings for exact head cd81ca51902591aa6ce8c52e8bd9471a774e3d8e (manual CI run 28603514411 is green):

P1 - grading.json.verdict ignores configured thresholds. buildGradingArtifact() now writes top-level verdict from resultVerdict() (packages/core/src/evaluation/run-artifacts.ts:1236), but resultVerdict() calls scoreToVerdict(clampScore(result.score)) with the default 0.8 threshold (packages/core/src/evaluation/run-artifacts.ts:707). The orchestrator classifies the same score with the effective case/suite/CLI threshold (packages/core/src/evaluation/orchestrator.ts:3081, and execution status uses the effective threshold at packages/core/src/evaluation/orchestrator.ts:2301). Any eval using threshold: 0.5 can produce an ok result with score 0.6 while the new sidecar says verdict: "fail"; a stricter threshold can do the reverse. Please carry the effective threshold or derived verdict into the artifact writer, and add a regression test for a non-default threshold.
P2 - The exported aggregate grading artifact still lacks top-level score/verdict. The PR description says aggregate grading artifacts carry top-level score/verdict, but AggregateGradingArtifact only defines assertion_results and summary (packages/core/src/evaluation/run-artifacts.ts:465), and buildAggregateGradingArtifact() returns only those fields (packages/core/src/evaluation/run-artifacts.ts:1552). Since this helper is re-exported from core/CLI and has golden tests, consumers using the aggregate sidecar/helper will not see the new contract fields. Please either add aggregate score/verdict and tests, or update the stated contract if aggregate sidecars are intentionally summary-only.
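The threshold mismatch in the first finding is easy to reproduce in isolation. In the sketch below, `verdictForScore` is a hypothetical stand-in for the real `scoreToVerdict()`, and the `>=` comparison is an assumption; the two cases use the 0.5 threshold from the finding and an assumed stricter 0.9 threshold.

```typescript
// Default threshold named in the finding.
const DEFAULT_THRESHOLD = 0.8;

// Hypothetical stand-in for scoreToVerdict(); the >= comparison is assumed.
function verdictForScore(score: number, threshold: number = DEFAULT_THRESHOLD): "pass" | "fail" {
  return score >= threshold ? "pass" : "fail";
}

interface Case {
  score: number;
  effectiveThreshold: number;
}

const cases: Case[] = [
  { score: 0.6, effectiveThreshold: 0.5 }, // lenient eval: orchestrator passes, default fails
  { score: 0.85, effectiveThreshold: 0.9 }, // strict eval: orchestrator fails, default passes
];

for (const { score, effectiveThreshold } of cases) {
  const orchestrator = verdictForScore(score, effectiveThreshold);
  const sidecar = verdictForScore(score); // recomputed with the default threshold
  console.log(`score=${score} orchestrator=${orchestrator} sidecar=${sidecar}`);
}
```

Both cases disagree, which is why the finding asks for the effective threshold or the derived verdict to be passed into the artifact writer rather than recomputed.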

I did not run local tests/builds/evals; both findings are visible from the diff and the exact-head manual CI run is already green.

christso · 2026-07-02T16:21:46Z

Addressed the review blockers in new head 864fc93748cfc8b5b9c7b44d09d3e4135a210804.

Fixed grading.json.verdict to derive from the already-resolved executionStatus instead of recomputing with DEFAULT_THRESHOLD, including single-run trial metadata.
Added regression coverage where score: 0.7 + executionStatus: ok must emit verdict: pass, and score: 0.85 + executionStatus: quality_failure must emit verdict: fail.
Added top-level score and verdict to AggregateGradingArtifact / buildAggregateGradingArtifact() with tests and docs. Aggregate score is the mean normalized score for non-execution-error results; aggregate verdict uses the already-derived quality result status.
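A compact sketch of the two rules described above follows. The names `verdictFromStatus` and `aggregateScore` are hypothetical, the `execution_error` status value is an assumed spelling (only `ok` and `quality_failure` appear in this thread), and returning 0 when nothing was graded is an assumption.

```typescript
// ok and quality_failure appear in this PR; execution_error is an assumed spelling.
type ExecutionStatus = "ok" | "quality_failure" | "execution_error";

interface RunResult {
  score: number;
  executionStatus: ExecutionStatus;
}

// Verdict follows the already-resolved status, never a recomputed threshold.
function verdictFromStatus(status: ExecutionStatus): "pass" | "fail" {
  return status === "ok" ? "pass" : "fail";
}

// Mean normalized score over results that did not hit an execution error.
function aggregateScore(results: RunResult[]): number {
  const graded = results.filter((result) => result.executionStatus !== "execution_error");
  if (graded.length === 0) return 0; // assumed fallback when nothing was graded
  const total = graded.reduce((sum, result) => sum + Math.min(1, Math.max(0, result.score)), 0);
  return total / graded.length;
}

const results: RunResult[] = [
  { score: 0.7, executionStatus: "ok" }, // below the default threshold, still a pass
  { score: 0.85, executionStatus: "quality_failure" }, // above the default threshold, still a fail
  { score: 0, executionStatus: "execution_error" }, // excluded from the mean
];

for (const result of results) {
  console.log(result.score, result.executionStatus, verdictFromStatus(result.executionStatus));
}
console.log("aggregate score:", aggregateScore(results));
```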

Validation:

bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/summary.test.ts
bun run --cwd packages/core typecheck
bun run --cwd packages/core lint
bun run --cwd packages/core build
bun run --cwd apps/cli typecheck
bun run --cwd apps/cli lint
bun run --cwd apps/cli build
bun run --cwd apps/web build

Live dogfood through http://127.0.0.1:10531/v1 with gpt-5.3-codex-spark:

Single-attempt run: .agentv/results/2026-07-02T16-18-00-527Z/live-grading-contract--f63003d85eca/run-1/grading.json verified top-level score: 1, verdict: pass, assertion rows with evidence/score/verdict, and nested grader assertion_results.
Repeat run: .agentv/results/2026-07-02T16-18-52-295Z/live-grading-contract-repeat--972a926a5797/run-1/grading.json and run-2/grading.json verified top-level score: 1, verdict: pass; index.jsonl verified both trial verdicts are pass with execution_status: ok and aggregation.strategy: pass_all.

CI for this head started automatically: https://github.com/EntityProcess/agentv/actions/runs/28605214475

christso · 2026-07-02T16:26:09Z

Private dogfood evidence for review-fix head 864fc93748cfc8b5b9c7b44d09d3e4135a210804 is published on EntityProcess/agentv-private:evidence/av-kfik11-grading-contract at commit 885d035302bb95f7d337b778c4bd06319f732282, under dogfood/review-fix-864fc937/.

christso · 2026-07-02T16:27:00Z

Re-review for head 864fc93748cfc8b5b9c7b44d09d3e4135a210804 is clean.

The two prior blockers are resolved:

P1 threshold-sensitive grading.json.verdict: buildGradingArtifact() now derives top-level verdict from the already-resolved executionStatus, and regression coverage exercises below-default pass / above-default fail cases.
P2 aggregate grading artifact contract: AggregateGradingArtifact / buildAggregateGradingArtifact() now include top-level score and verdict, with tests and docs updated.

I did not find new blockers in the review-fix commit. Fresh PR CI is green for this head: https://github.com/EntityProcess/agentv/actions/runs/28605214475. Orchestrator may proceed to ready/merge.

christso force-pushed the feat/av-kfik-7-graders branch from 013cf95 to 8d9b894 July 2, 2026 15:02

christso mentioned this pull request Jul 2, 2026

[codex] Add promptfoo assertion grader surface #1599

Merged

Base automatically changed from feat/av-kfik-7-graders to main July 2, 2026 15:17

christso changed the base branch from main to feat/av-kfik-7-graders July 2, 2026 15:18

christso added 2 commits July 2, 2026 17:39

feat(core): update grading artifact contract

4b9fd8b

fix(eval): address grading contract review findings

cd81ca5

christso force-pushed the feat/av-kfik-11-grading-contract branch from 1f443fe to cd81ca5 July 2, 2026 15:53

christso changed the base branch from feat/av-kfik-7-graders to main July 2, 2026 15:53

christso marked this pull request as ready for review July 2, 2026 15:57

christso marked this pull request as draft July 2, 2026 16:04

fix(core): honor grading artifact verdict status

864fc93

christso marked this pull request as ready for review July 2, 2026 16:29

christso merged commit b091935 into main Jul 2, 2026
8 checks passed

christso deleted the feat/av-kfik-11-grading-contract branch July 2, 2026 16:29

christso mentioned this pull request Jul 2, 2026

feat(eval): add rerun-failed runner pooling #1609

Merged

christso merged 3 commits into main from feat/av-kfik-11-grading-contract
