Grading contract: assertion_results rows#1603

Merged
christso merged 3 commits into main from feat/av-kfik-11-grading-contract
Jul 2, 2026
@christso

@christso christso commented Jul 2, 2026

Collaborator

Summary

Implements Bead av-kfik.11 by making grading.json expose assertion_results with row evidence, row score, row verdict, and top-level score/verdict while keeping internal/index grader data on the existing assertions path.
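
As a rough illustration of the shape described above, the sketch below models a grading.json payload with top-level score/verdict and assertion_results rows. Only assertion_results, evidence, score, and verdict come from this PR; the `text` field, the `GradingArtifactSketch` and `failingRows` names, and the assumption that verdicts are limited to "pass" and "fail" are hypothetical.

```typescript
// Hypothetical sketch of the grading.json contract; not the actual AgentV types.
type VerdictSketch = "pass" | "fail"; // assumption: only these two values

interface AssertionResultRowSketch {
  text: string; // assumed field: what the row asserts
  evidence: string; // row evidence
  score: number; // row score
  verdict: VerdictSketch; // row verdict
}

interface GradingArtifactSketch {
  score: number; // top-level score
  verdict: VerdictSketch; // top-level verdict
  assertion_results: AssertionResultRowSketch[];
}

// Returns the rows a reviewer would look at first: the failing ones.
function failingRows(grading: GradingArtifactSketch): AssertionResultRowSketch[] {
  return grading.assertion_results.filter((row) => row.verdict === "fail");
}

// Illustrative values only; not taken from a real run.
const exampleGrading: GradingArtifactSketch = {
  score: 0.5,
  verdict: "fail",
  assertion_results: [
    { text: "mentions the config file", evidence: "output line 3", score: 1, verdict: "pass" },
    { text: "cites the error code", evidence: "no error code found", score: 0, verdict: "fail" },
  ],
};

console.log(failingRows(exampleGrading).map((row) => row.text));
```

Note that the sketch has no legacy assertions key, matching the stated intent that grading.json no longer carries it.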

The default LLM judge prompts now bias toward skeptical, evidence-by-path grading unless the eval author supplies an explicit prompt.
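
A minimal sketch of that selection rule, assuming the judge prompt is assembled from an optional author prompt plus format instructions. The function name, the input shape, and both prompt strings are placeholders for illustration, not AgentV's actual prompts or builders.

```typescript
// Placeholder texts; these are NOT the real AgentV prompt strings.
const FORMAT_ONLY_INSTRUCTIONS = "Return JSON that matches the required output schema.";
const SKEPTICAL_DEFAULT_GUIDANCE =
  "Grade skeptically and cite evidence by path for every assertion.";

interface JudgePromptInputSketch {
  authorPrompt?: string; // explicit prompt supplied by the eval author, if any
}

// Hypothetical builder: skeptical guidance applies only when no explicit prompt exists.
function buildJudgeSystemPromptSketch(input: JudgePromptInputSketch): string {
  const authorPrompt = input.authorPrompt?.trim();
  if (authorPrompt) {
    // Author prompt wins; only format instructions are appended.
    return `${authorPrompt}\n\n${FORMAT_ONLY_INSTRUCTIONS}`;
  }
  return `${SKEPTICAL_DEFAULT_GUIDANCE}\n\n${FORMAT_ONLY_INSTRUCTIONS}`;
}

console.log(buildJudgeSystemPromptSketch({ authorPrompt: "Score tone only." }));
```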

Recovery

  • Recovered the PR from the old stacked base feat/av-kfik-7-graders.
  • Preserved the original PR head locally as recovery/av-kfik11-pre-rebase.
  • Rebased only the two grading-contract commits onto current origin/main.
  • Current head: cd81ca51902591aa6ce8c52e8bd9471a774e3d8e.
  • Current base: main.

Changes

  • Updates core artifact writers so per-run and aggregate grading artifacts use assertion_results, include summary counts, and carry top-level score/verdict.
  • Updates CLI result projection and validation to read the new grading.json contract, with legacy read fallback for old sidecars.
  • Updates pipeline bench sidecar output to match the grading.json artifact contract.
  • Adds golden assertions across artifact writer, aggregate, export, summary, e2e provider, dashboard, and pipeline tests for assertion_results, evidence, score, verdict, and absence of legacy assertions in grading.json.
  • Documents the new artifact shape in the public result artifacts reference.

Validation

  • bun install
  • bun test packages/core/test/evaluation/graders.test.ts packages/core/test/evaluation/graders/promptfoo-assertions.test.ts - 64 pass
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts apps/cli/test/commands/results/export-e2e-providers.test.ts apps/cli/test/commands/eval/aggregate.test.ts apps/cli/test/commands/eval/pipeline/bench.test.ts apps/cli/test/commands/results/summary.test.ts apps/cli/test/commands/results/validate.test.ts apps/dashboard/src/components/EvalDetail.test.ts - 152 pass
  • bun run --cwd packages/core typecheck
  • bun run --cwd packages/core lint
  • bun run --cwd packages/core build
  • bun run --cwd apps/cli typecheck
  • bun run --cwd apps/cli lint
  • bun run --cwd apps/cli build
  • bun run --cwd apps/dashboard build
  • bun run --cwd apps/web build

Manual CI for the recovered head passed:

Dogfood Evidence

Live provider + LLM grader dogfood ran against http://127.0.0.1:10531/v1 with model gpt-5.3-codex-spark.

  • Command: LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1 LOCAL_OPENAI_PROXY_API_KEY=local-dummy-key LOCAL_OPENAI_PROXY_MODEL=gpt-5.3-codex-spark bun apps/cli/src/cli.ts eval run /tmp/agentv-kfik11-dogfood/eval.yaml --targets /tmp/agentv-kfik11-dogfood/targets.yaml --target local-openai-candidate --workers 1
  • Public run bundle path: .agentv/results/2026-07-02T15-49-55-807Z/
  • Inspected artifact: .agentv/results/2026-07-02T15-49-55-807Z/live-grading-contract--f63003d85eca/run-1/grading.json
  • Private evidence branch: EntityProcess/agentv-private:evidence/av-kfik11-grading-contract
  • Private evidence commit: 087f9a2bd6bef5b7e38a0f777a017d4d71ca071c
  • Evidence folder: dogfood/post-rebase-main/

The captured jq check verifies top-level score, top-level verdict, top-level assertion_results, row evidence/score/verdict, nested grader assertion_results, and absence of legacy assertions in grading.json.
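
The captured jq script itself is not shown here. As a hedged equivalent, the TypeScript sketch below performs the same kind of top-level and row-level checks on a parsed grading.json object. The function name and the problem messages are hypothetical, and the nested grader check is omitted because its exact location in the file is not stated in this description.

```typescript
function isRecordSketch(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical contract check; returns a list of problems (empty means OK).
function checkGradingContract(grading: unknown): string[] {
  if (!isRecordSketch(grading)) return ["grading.json is not an object"];
  const problems: string[] = [];
  if (typeof grading["score"] !== "number") problems.push("missing top-level score");
  if (typeof grading["verdict"] !== "string") problems.push("missing top-level verdict");
  if ("assertions" in grading) problems.push("legacy assertions key present");

  const rows = grading["assertion_results"];
  if (!Array.isArray(rows)) {
    problems.push("missing top-level assertion_results");
    return problems;
  }
  rows.forEach((row, index) => {
    if (!isRecordSketch(row)) {
      problems.push(`row ${index} is not an object`);
      return;
    }
    // Each row is expected to carry evidence, score, and verdict.
    for (const key of ["evidence", "score", "verdict"]) {
      if (!(key in row)) problems.push(`row ${index} missing ${key}`);
    }
  });
  return problems;
}

console.log(checkGradingContract({ score: 1, verdict: "pass", assertion_results: [] }));
```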

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 2, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 864fc93
Status: ✅  Deploy successful!
Preview URL: https://624c371c.agentv.pages.dev
Branch Preview URL: https://feat-av-kfik-11-grading-cont.agentv.pages.dev


@christso

christso commented Jul 2, 2026

Collaborator Author

Manual CI was dispatched because this stacked PR targets feat/av-kfik-7-graders and pull_request CI only runs for base main. Green run: https://github.com/EntityProcess/agentv/actions/runs/28590245280

@christso

christso commented Jul 2, 2026

Collaborator Author

Review findings for av-kfik.11.1:

  1. P1 - Custom rubric prompts now receive the new skeptical default instructions. buildRubricOutputSchema() and buildScoreRangeOutputSchema() now include the skeptical evidence-by-path language (packages/core/src/evaluation/graders/llm-grader.ts:1216, packages/core/src/evaluation/graders/llm-grader.ts:1281), and both explicit custom-prompt paths still use those schemas (packages/core/src/evaluation/graders/llm-grader.ts:789, packages/core/src/evaluation/graders/llm-grader-prompt.ts:153). That means an author-supplied rubric or score-range prompt still gets AgentV's skeptical grading behavior injected through the system/schema prompt, which conflicts with av-kfik.11's requirement that the skeptical default apply unless the author supplies an explicit prompt. Please keep the skeptical language in the default prompt builders only, or split the schema text so custom prompts get format instructions without behavioral grading guidance.

  2. P1 - Dashboard trial checks lose the new grading evidence rows. parseGradingArtifact still reads only parsed.assertions (apps/dashboard/src/components/EvalDetail.tsx:676), while this PR stops writing legacy grading.json.assertions. For repeated/trial runs, the Checks tab fetches grading.json directly and will render No assertion steps recorded in grading.json even when the sidecar has assertion_results with evidence. Since the Dashboard is the supported zero-infra inspection path for run artifacts, please read assertion_results there with a legacy assertions fallback.

  3. P2 - Manifest hydration drops nested grader rows from the new sidecar shape. buildEvaluators now writes nested child grader rows under scores in grading.json (packages/core/src/evaluation/run-artifacts.ts:718), but hydrateManifestRecord maps grading.graders into EvaluationResult.scores without recursively copying evaluator.scores (apps/cli/src/commands/results/manifest.ts:262). Because grading.graders exists, the fallback to record.scores at line 288 will not run, so consumers of loadManifestResults lose nested grader rows after hydration. Please make the grader mapper recursive and translate child assertion_results/legacy assertions as well.

  4. P2 - results validate rejects old bundles that the readers otherwise support. The validator now errors unless grading.assertion_results exists (apps/cli/src/commands/results/validate.ts:277), but hydrateManifestRecord explicitly falls back from assertion_results to legacy assertions for old sidecars (apps/cli/src/commands/results/manifest.ts:232). If old-bundle read compatibility is intended for this migration, validation should accept either key, or at least downgrade legacy-only sidecars to a compatibility warning instead of an error.
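
To make findings 2 and 3 concrete, here is a minimal sketch of a recursive grader mapper that prefers assertion_results, falls back to legacy assertions, and carries nested child rows under scores. The interfaces and the `hydrateGraderSketch` name are assumptions for illustration; the real hydrateManifestRecord and EvaluationResult types are not reproduced here.

```typescript
// Assumed sidecar shape: a grader row that may nest child graders under `scores`.
interface SidecarGraderSketch {
  name: string; // assumed field
  score: number;
  assertion_results?: unknown[]; // new contract key
  assertions?: unknown[]; // legacy key in old sidecars
  scores?: SidecarGraderSketch[]; // nested child grader rows
}

// Assumed in-memory shape after hydration.
interface HydratedScoreSketch {
  name: string;
  score: number;
  assertions: unknown[];
  scores?: HydratedScoreSketch[];
}

function hydrateGraderSketch(grader: SidecarGraderSketch): HydratedScoreSketch {
  const hydrated: HydratedScoreSketch = {
    name: grader.name,
    score: grader.score,
    // Prefer the new key, then the legacy key, then an empty list.
    assertions: grader.assertion_results ?? grader.assertions ?? [],
  };
  if (grader.scores && grader.scores.length > 0) {
    // Recurse so nested rows survive hydration at any depth.
    hydrated.scores = grader.scores.map((child) => hydrateGraderSketch(child));
  }
  return hydrated;
}

const hydratedExample = hydrateGraderSketch({
  name: "composite",
  score: 0.5,
  assertion_results: [{ evidence: "parent row" }],
  scores: [{ name: "legacy-child", score: 1, assertions: [{ evidence: "child row" }] }],
});
console.log(JSON.stringify(hydratedExample, null, 2));
```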

@christso

christso commented Jul 2, 2026

Collaborator Author

Review-fix commit pushed: 1f443fe4 (fix(eval): address grading contract review findings).

Addressed the four findings from #1603 (comment):

  • Custom rubric/score-range prompts now use format-only schema instructions; default prompts keep AgentV skeptical evidence-by-path guidance.
  • Dashboard trial Checks reads grading.json.assertion_results with legacy assertions fallback.
  • Manifest hydration recursively maps nested grader scores and translates child assertion_results / legacy assertions.
  • results validate accepts legacy-only grading.json.assertions sidecars with a compatibility warning instead of a hard error.
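
The validation change in the last bullet can be sketched as follows, assuming the validator reports issues with a severity. The `ValidationIssueSketch` type, the function name, and the message wording are hypothetical and do not mirror the actual validate.ts code.

```typescript
interface ValidationIssueSketch {
  severity: "error" | "warning";
  message: string;
}

// Hypothetical check: new key is clean, legacy-only key warns, neither key errors.
function validateAssertionKeysSketch(grading: Record<string, unknown>): ValidationIssueSketch[] {
  if (Array.isArray(grading["assertion_results"])) {
    return [];
  }
  if (Array.isArray(grading["assertions"])) {
    return [
      {
        severity: "warning",
        message: "grading.json uses legacy assertions; expected assertion_results",
      },
    ];
  }
  return [
    {
      severity: "error",
      message: "grading.json has neither assertion_results nor assertions",
    },
  ];
}

console.log(validateAssertionKeysSketch({ assertions: [] }));
```

Keeping legacy-only sidecars at warning severity lets old bundles pass validation while still flagging them for migration.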

Validation run:

  • bun test packages/core/test/evaluation/graders.test.ts apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/results/validate.test.ts apps/dashboard/src/components/EvalDetail.test.ts
  • bun run --cwd packages/core typecheck
  • bun run --cwd packages/core lint
  • bun run --cwd packages/core build
  • bun run --cwd apps/cli lint
  • bun run --cwd apps/cli typecheck
  • bun run --cwd apps/cli build
  • bun run --cwd apps/dashboard test
  • bun run --cwd apps/dashboard build
  • bun run lint

Live dogfood rerun because the fix changes explicit custom grader prompt behavior:

  • endpoint/model: http://127.0.0.1:10531/v1, gpt-5.3-codex-spark
  • command: LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1 LOCAL_OPENAI_PROXY_API_KEY=dummy LOCAL_OPENAI_PROXY_MODEL=gpt-5.3-codex-spark bun apps/cli/src/cli.ts eval run /tmp/agentv-kfik112-dogfood/eval.yaml --targets /tmp/agentv-kfik112-dogfood/targets.yaml --target local-openai-candidate --workers 1
  • passing run bundle: .agentv/results/2026-07-02T13-11-03-936Z/
  • inspected artifact: .agentv/results/2026-07-02T13-11-03-936Z/custom-rubric-contract--17ad22116714/run-1/grading.json
  • verified: custom rubric systemPrompt has only format instructions, grading.json has top-level and nested assertion_results, no legacy assertions, score=1, verdict=pass
  • private evidence: EntityProcess/agentv-private:evidence/av-kfik11-grading-contract at d465b44

@christso

christso commented Jul 2, 2026

Collaborator Author

Manual CI for pushed commit 1f443fe is green: https://github.com/EntityProcess/agentv/actions/runs/28592955489

@christso christso force-pushed the feat/av-kfik-7-graders branch from 013cf95 to 8d9b894 Compare July 2, 2026 15:02
christso added a commit that referenced this pull request Jul 2, 2026
Adds promptfoo-compatible assertion graders, canonical script grader handling, and grader contract docs/schema updates.

Validated with focused grader/parser tests, lint, typecheck, full workspace tests, and green GitHub Actions. Live-provider dogfood evidence is recorded on agentv-private:evidence/av-kfik-7-2-graders.

Branch feat/av-kfik-7-graders intentionally retained because PR #1603 is stacked on it.
Base automatically changed from feat/av-kfik-7-graders to main July 2, 2026 15:17
@christso christso changed the base branch from main to feat/av-kfik-7-graders July 2, 2026 15:18
@christso

christso commented Jul 2, 2026

Collaborator Author

Stack maintenance note after #1599 merge:

  • #1599 ([codex] Add promptfoo assertion grader surface) was squash-merged to main as a773fc88d4100bca6a3a5ef1a73ef982bb7a155f.
  • GitHub briefly removed/auto-retargeted the stacked base; I restored feat/av-kfik-7-graders at 5d727840d5be91380a97287d88d5613912b57be7 and retargeted this PR back to that base to preserve the stacked diff.
  • Next worker should rebase/retarget this PR safely onto current main when ready, accounting for #1599 being squash-merged rather than merged with its original commit SHAs.

@christso christso force-pushed the feat/av-kfik-11-grading-contract branch from 1f443fe to cd81ca5 Compare July 2, 2026 15:53
@christso christso changed the base branch from feat/av-kfik-7-graders to main July 2, 2026 15:53
@christso

christso commented Jul 2, 2026

Collaborator Author

Post-rebase manual CI dispatched for recovered head cd81ca5: https://github.com/EntityProcess/agentv/actions/runs/28603514411

@christso christso marked this pull request as ready for review July 2, 2026 15:57
@christso

christso commented Jul 2, 2026

Collaborator Author

Review findings for exact head cd81ca51902591aa6ce8c52e8bd9471a774e3d8e (manual CI run 28603514411 is green):

  1. P1 - grading.json.verdict ignores configured thresholds. buildGradingArtifact() now writes top-level verdict from resultVerdict() (packages/core/src/evaluation/run-artifacts.ts:1236), but resultVerdict() calls scoreToVerdict(clampScore(result.score)) with the default 0.8 threshold (packages/core/src/evaluation/run-artifacts.ts:707). The orchestrator classifies the same score with the effective case/suite/CLI threshold (packages/core/src/evaluation/orchestrator.ts:3081, and execution status uses the effective threshold at packages/core/src/evaluation/orchestrator.ts:2301). Any eval using threshold: 0.5 can produce an ok result with score 0.6 while the new sidecar says verdict: "fail"; a stricter threshold can do the reverse. Please carry the effective threshold or derived verdict into the artifact writer, and add a regression test for a non-default threshold.

  2. P2 - The exported aggregate grading artifact still lacks top-level score/verdict. The PR description says aggregate grading artifacts carry top-level score/verdict, but AggregateGradingArtifact only defines assertion_results and summary (packages/core/src/evaluation/run-artifacts.ts:465), and buildAggregateGradingArtifact() returns only those fields (packages/core/src/evaluation/run-artifacts.ts:1552). Since this helper is re-exported from core/CLI and has golden tests, consumers using the aggregate sidecar/helper will not see the new contract fields. Please either add aggregate score/verdict and tests, or update the stated contract if aggregate sidecars are intentionally summary-only.
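
Finding 1 can be reproduced with a small stand-in for the verdict helper. This is a simplified sketch, not the real scoreToVerdict(): the optional threshold parameter and the `>=` comparison are assumptions, while the 0.8 default and the 0.5 / 0.6 case come from the finding above.

```typescript
const DEFAULT_THRESHOLD_SKETCH = 0.8; // default threshold named in the finding

// Simplified stand-in; assumes a score at or above the threshold passes.
function scoreToVerdictSketch(
  score: number,
  threshold: number = DEFAULT_THRESHOLD_SKETCH,
): "pass" | "fail" {
  return score >= threshold ? "pass" : "fail";
}

// Eval configured with threshold: 0.5 and a result score of 0.6.
const sidecarVerdict = scoreToVerdictSketch(0.6); // ignores the configured threshold
const orchestratorVerdict = scoreToVerdictSketch(0.6, 0.5); // uses the effective threshold

console.log({ sidecarVerdict, orchestratorVerdict }); // the two disagree

// A stricter threshold flips the disagreement the other way.
console.log(scoreToVerdictSketch(0.85), scoreToVerdictSketch(0.85, 0.9));
```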

I did not run local tests/builds/evals; both findings are visible from the diff and the exact-head manual CI run is already green.

@christso christso marked this pull request as draft July 2, 2026 16:04
@christso

christso commented Jul 2, 2026

Collaborator Author

Addressed the review blockers in new head 864fc93748cfc8b5b9c7b44d09d3e4135a210804.

  • Fixed grading.json.verdict to derive from the already-resolved executionStatus instead of recomputing with DEFAULT_THRESHOLD, including single-run trial metadata.
  • Added regression coverage where score: 0.7 + executionStatus: ok must emit verdict: pass, and score: 0.85 + executionStatus: quality_failure must emit verdict: fail.
  • Added top-level score and verdict to AggregateGradingArtifact / buildAggregateGradingArtifact() with tests and docs. Aggregate score is the mean normalized score for non-execution-error results; aggregate verdict uses the already-derived quality result status.
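
A minimal sketch of the two fixes, assuming execution status is one of three string values. The "ok" and "quality_failure" values appear in this PR; "execution_error", the type and function names, the clamping to [0, 1], and the choice to return 0 when nothing is gradable are assumptions made for illustration.

```typescript
// "execution_error" is an assumed name for results that never produced a grade.
type ExecutionStatusSketch = "ok" | "quality_failure" | "execution_error";

interface TrialResultSketch {
  score: number;
  executionStatus: ExecutionStatusSketch;
}

// Verdict follows the already-resolved status instead of re-thresholding the score.
function verdictFromStatusSketch(executionStatus: ExecutionStatusSketch): "pass" | "fail" {
  return executionStatus === "ok" ? "pass" : "fail";
}

// Mean of clamped scores over results that are not execution errors.
function aggregateScoreSketch(results: TrialResultSketch[]): number {
  const graded = results.filter((r) => r.executionStatus !== "execution_error");
  if (graded.length === 0) return 0; // assumption: no gradable results scores 0
  const total = graded.reduce((sum, r) => sum + Math.min(1, Math.max(0, r.score)), 0);
  return total / graded.length;
}

// The two regression cases described above.
const belowDefaultPass: TrialResultSketch = { score: 0.7, executionStatus: "ok" };
const aboveDefaultFail: TrialResultSketch = { score: 0.85, executionStatus: "quality_failure" };
console.log(verdictFromStatusSketch(belowDefaultPass.executionStatus)); // pass
console.log(verdictFromStatusSketch(aboveDefaultFail.executionStatus)); // fail
```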

Validation:

  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/summary.test.ts
  • bun run --cwd packages/core typecheck
  • bun run --cwd packages/core lint
  • bun run --cwd packages/core build
  • bun run --cwd apps/cli typecheck
  • bun run --cwd apps/cli lint
  • bun run --cwd apps/cli build
  • bun run --cwd apps/web build

Live dogfood through http://127.0.0.1:10531/v1 with gpt-5.3-codex-spark:

  • Single-attempt run: .agentv/results/2026-07-02T16-18-00-527Z/live-grading-contract--f63003d85eca/run-1/grading.json verified top-level score: 1, verdict: pass, assertion rows with evidence/score/verdict, and nested grader assertion_results.
  • Repeat run: .agentv/results/2026-07-02T16-18-52-295Z/live-grading-contract-repeat--972a926a5797/run-1/grading.json and run-2/grading.json verified top-level score: 1, verdict: pass; index.jsonl verified both trial verdicts are pass with execution_status: ok and aggregation.strategy: pass_all.

CI for this head started automatically: https://github.com/EntityProcess/agentv/actions/runs/28605214475

@christso

christso commented Jul 2, 2026

Collaborator Author

Private dogfood evidence for review-fix head 864fc93748cfc8b5b9c7b44d09d3e4135a210804 is published on EntityProcess/agentv-private:evidence/av-kfik11-grading-contract at commit 885d035302bb95f7d337b778c4bd06319f732282, under dogfood/review-fix-864fc937/.

@christso

christso commented Jul 2, 2026

Collaborator Author

Re-review for head 864fc93748cfc8b5b9c7b44d09d3e4135a210804 is clean.

The two prior blockers are resolved:

  • P1 threshold-sensitive grading.json.verdict: buildGradingArtifact() now derives top-level verdict from the already-resolved executionStatus, and regression coverage exercises below-default pass / above-default fail cases.
  • P2 aggregate grading artifact contract: AggregateGradingArtifact / buildAggregateGradingArtifact() now include top-level score and verdict, with tests and docs updated.

I did not find new blockers in the review-fix commit. Fresh PR CI is green for this head: https://github.com/EntityProcess/agentv/actions/runs/28605214475. Orchestrator may proceed to ready/merge.

@christso christso marked this pull request as ready for review July 2, 2026 16:29
@christso christso merged commit b091935 into main Jul 2, 2026
8 checks passed
@christso christso deleted the feat/av-kfik-11-grading-contract branch July 2, 2026 16:29