feat(evals): add prompt instance expansion#1602

Merged
christso merged 1 commit into main from feat/av-kfik-5-instance-expansion on Jul 2, 2026

christso commented Jul 2, 2026

Collaborator

Summary

Eval YAML can now use top-level prompts as the authored input matrix and combine it with targets, tests, and repeat.count into deterministic execution identity. Result artifacts now expose prompt identity plus sample_index and retry_index, so repeated samples and future worker-pool reruns do not have to infer pass@k inputs from run-N paths.

This keeps the current runner shape intact while making the new contract explicit: legacy input still loads, but only as a compatibility path that emits a warning, and mixing tests[].input with top-level prompts now fails with migration guidance.

Design Notes

  • Prompt expansion happens before the existing per-case loader normalization, so graders, workspace merging, imports, and defaults keep their current behavior.
  • Prompt-expanded cases keep a unique internal case id for scheduling, while emitted result rows carry the authored test_id plus prompt_id / prompt_label for comparison.
  • Authored targets now accept promptfoo-style id and label, mapped through the existing target selector until av-kfik.6 completes the deeper target-provider locator work.
  • repeat.count now flows into trial sample_index; provider retry remains separate as retry_index.
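
To make the identity contract above concrete, here is a minimal sketch of the expansion, not the actual loader code. The type names, `expandInstances`, and the loop order are assumptions made for illustration; the `__prompt_` separator mirrors the `docs__prompt_direct` internal id quoted in the review comment further down this thread.

```typescript
// Hypothetical shapes for illustration; the real loader types in @agentv/core may differ.
interface PromptSpec {
  id: string;
  label?: string;
}

interface TestSpec {
  id: string;
}

interface Instance {
  caseId: string; // internal id, used only for scheduling
  testId: string; // authored test id, emitted as test_id
  promptId: string;
  promptLabel: string | undefined;
  target: string;
  sampleIndex: number; // 0-based, driven by repeat.count
}

// Cross product of tests x prompts x targets x repeat.count.
// The fixed loop order is what makes the resulting identity deterministic.
function expandInstances(
  prompts: PromptSpec[],
  tests: TestSpec[],
  targets: string[],
  repeatCount: number,
): Instance[] {
  const out: Instance[] = [];
  for (const test of tests) {
    for (const prompt of prompts) {
      for (const target of targets) {
        for (let sampleIndex = 0; sampleIndex < repeatCount; sampleIndex++) {
          out.push({
            caseId: `${test.id}__prompt_${prompt.id}`,
            testId: test.id,
            promptId: prompt.id,
            promptLabel: prompt.label,
            target,
            sampleIndex,
          });
        }
      }
    }
  }
  return out;
}

const instances = expandInstances(
  [{ id: "direct", label: "Direct" }, { id: "generated" }],
  [{ id: "docs" }],
  ["openai:gpt-5.4-mini"],
  2,
);
console.log(instances.length); // 4 (2 prompts x 1 test x 1 target x 2 samples)
```

Provider retries are deliberately left out of the sketch: per the last note, retry_index is tracked separately from sample_index.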

Related: av-kfik.5

Validation

  • bun test packages/core/test/evaluation/eval-inline-experiment.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts
  • bun --filter @agentv/core build
  • bun run typecheck
  • bun run lint
  • Live dogfood through local OpenAI-compatible endpoint http://127.0.0.1:10531/v1 with gpt-5.3-codex-spark, two prompts, repeat.count: 2, and a live LLM grader: PASS, 2/2 rows scored 100%, each row carried prompt_id and trials with sample_index 0 and 1 plus retry_index 0. The temporary target used api_key: ${{ LOCAL_OPENAI_PROXY_API_KEY }} with LOCAL_OPENAI_PROXY_API_KEY=dummy-local-key; no real API key or copied .env was required.

Evidence

Private evidence branch: EntityProcess/agentv-private:evidence/av-kfik-5-instance-expansion at 179ece2.


Compound Engineering
Codex

cloudflare-workers-and-pages Bot commented Jul 2, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 5f041fc
Status: ✅  Deploy successful!
Preview URL: https://7365612f.agentv.pages.dev
Branch Preview URL: https://feat-av-kfik-5-instance-expa.agentv.pages.dev

christso commented Jul 2, 2026

Collaborator Author

CI follow-up pushed as 9cefd81. The Test job failed on eval-schema-sync because the generated schema source did not explicitly include prompt-object fields (prompt/file/messages) that the checked-in schema reference had. I added those fields to the Zod schema with lightweight object-array shapes and regenerated eval.schema.json. Local verification: bun test packages/core/test/evaluation/validation/eval-schema-sync.test.ts; bun run --cwd packages/core typecheck; bun run lint; bun --filter @agentv/core test (2118 pass, 0 fail).

christso commented Jul 2, 2026

Collaborator Author

Code Review Findings

P1 - Function prompt sources are still rejected

av-kfik.5 says top-level prompts must support string, chat-array, file reference, and function source. The loader currently does the opposite for function prompts: packages/core/src/evaluation/yaml-parser.ts:405 throws whenever rawPrompt.function or rawPrompt.function_file is present. The generated schema also still accepts command-shaped prompt objects, but the loader never executes or resolves those as prompt sources. This leaves a required authored prompt surface unimplemented and untested.

Suggested fix: either implement the function/function-file prompt source path and add loader coverage, or explicitly revise the Bead/docs/schema contract before merging. Given the Bead is the source of truth for this PR, I would treat this as blocking.
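
For loader coverage, a rough sketch of the four source kinds the Bead lists might look like the following. The field names (`prompt`, `file`, `messages`, `function`, `function_file`) come from this thread; the `RawPrompt` type, the assumption that `function` is a string, the precedence order, and `classifyPromptSource` itself are hypothetical and not the yaml-parser.ts implementation.

```typescript
// Hypothetical types; only the field names are taken from this PR discussion.
type ChatMessage = { role: string; content: string };

type RawPrompt =
  | string
  | ChatMessage[]
  | {
      id?: string;
      label?: string;
      prompt?: string;
      messages?: ChatMessage[];
      file?: string;
      function?: string; // assumed to be a string reference
      function_file?: string;
    };

type PromptSourceKind = "string" | "chat" | "file" | "function";

// Classify an authored prompt entry instead of throwing on function sources.
function classifyPromptSource(raw: RawPrompt): PromptSourceKind {
  if (typeof raw === "string") return "string";
  if (Array.isArray(raw)) return "chat";
  if (raw.function !== undefined || raw.function_file !== undefined) return "function";
  if (raw.file !== undefined) return "file";
  if (raw.messages !== undefined) return "chat";
  if (raw.prompt !== undefined) return "string";
  throw new Error("Unrecognized prompt source");
}

console.log(classifyPromptSource({ id: "generated", function_file: "./prompt.ts" })); // function
```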

P1 - targets[].label is used as the runtime target lookup key

The docs and skill reference describe targets[].id as the provider/backend locator and label as the display/comparison name, but parseTargetRef sets name = label ?? id at packages/core/src/evaluation/loaders/config-loader.ts:408. The CLI then resolves each name directly against targets.yaml at apps/cli/src/commands/eval/targets.ts:318-320. That means a normal promptfoo-style entry such as { id: openai:gpt-5.4-mini, label: mini } tries to resolve target mini, not openai:gpt-5.4-mini, unless the user happened to define a target named mini. The parser test covers this shape but only asserts the parsed targetRefs; it does not exercise an actual matrix run.

Suggested fix: keep runtime lookup tied to the authored locator (id or explicit use_target) and carry label separately for display/comparison identity, or change the public contract and tests to say labels are target names.
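
A minimal sketch of the first option, assuming a hypothetical `TargetRef` shape with separate lookup and display fields. `toTargetRef` and `definedTargets` are illustrative stand-ins for parseTargetRef and the targets.yaml contents, and letting `use_target` take precedence over `id` is my assumption.

```typescript
// Hypothetical shapes; not the real config-loader.ts types.
interface AuthoredTarget {
  id: string;
  label?: string;
  use_target?: string;
}

interface TargetRef {
  lookupName: string; // what gets resolved against targets.yaml
  displayLabel: string; // what appears in comparison output
}

function toTargetRef(authored: AuthoredTarget): TargetRef {
  return {
    // Runtime lookup stays tied to the authored locator, never the label.
    lookupName: authored.use_target ?? authored.id,
    // The label only affects display; fall back to the id when it is absent.
    displayLabel: authored.label ?? authored.id,
  };
}

// Stand-in for the target names defined in targets.yaml.
const definedTargets = new Set(["openai:gpt-5.4-mini"]);

const ref = toTargetRef({ id: "openai:gpt-5.4-mini", label: "mini" });
console.log(definedTargets.has(ref.lookupName)); // true
console.log(ref.displayLabel); // mini
```

With `name = label ?? id`, the same entry would look up `mini` and miss.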

P2 - Resume keys do not match prompt-expanded result rows

Completed result keys are built from emitted row identity: buildEvaluationResultTargetKey uses result.testId plus prompt_id at packages/core/src/evaluation/run-artifacts.ts:122-123. But the source-test skip key still uses the internal expanded test id at packages/core/src/evaluation/run-artifacts.ts:137, and the CLI resume filter compares that key at apps/cli/src/commands/eval/run-eval.ts:2355. For prompt-expanded cases, a completed row is keyed like test_id=docs,prompt_id=direct, while the next run checks test_id=docs__prompt_direct,prompt_id=direct, so --resume will not skip already completed prompt-expanded instances.

Suggested fix: make buildEvalTestTargetKey use the authored test id (test.testId ?? test.id) while keeping prompt_id as the prompt dimension. Add a regression test that writes or seeds a completed prompt-expanded row and verifies resume skips it.
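
Roughly, the fix and the regression check could take this shape. `ExpandedTest` and `buildResumeKey` are hypothetical names, and the target dimension of the real key is omitted for brevity; the key format follows the `test_id=docs,prompt_id=direct` example above.

```typescript
// Hypothetical shapes; the real helpers live in run-artifacts.ts.
interface PromptIdentity {
  id: string;
}

interface ExpandedTest {
  id: string; // internal expanded id, e.g. docs__prompt_direct
  testId?: string; // authored test id
  prompt?: PromptIdentity;
}

// Build the skip key from authored identity so it matches emitted result rows.
function buildResumeKey(test: ExpandedTest): string {
  const authoredTestId = test.testId ?? test.id;
  const parts = [`test_id=${authoredTestId}`];
  if (test.prompt) parts.push(`prompt_id=${test.prompt.id}`);
  return parts.join(",");
}

// Keys seeded from a previous run's completed rows.
const completed = new Set(["test_id=docs,prompt_id=direct"]);

const pending: ExpandedTest[] = [
  { id: "docs__prompt_direct", testId: "docs", prompt: { id: "direct" } },
  { id: "docs__prompt_generated", testId: "docs", prompt: { id: "generated" } },
];

// Only the instance without a completed row should be scheduled again.
const toRun = pending.filter((test) => !completed.has(buildResumeKey(test)));
console.log(toRun.map((test) => test.id)); // [ "docs__prompt_generated" ]
```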

P2 - Synthetic skip/error rows lose authored test and prompt identity

Several direct synthetic result builders still write testId: evalCase.id and omit prompt: evalCase.prompt, for example run-budget skip at packages/core/src/evaluation/orchestrator.ts:1060, suite-budget skip at packages/core/src/evaluation/orchestrator.ts:1109, and fail_on_error halt at packages/core/src/evaluation/orchestrator.ts:1158. For prompt-expanded cases those rows will emit test_id as the internal docs__prompt_direct id and no prompt_id/prompt_label, unlike normal provider/evaluator errors that go through buildErrorResult. That breaks the stable instance identity contract exactly on skipped/error rows that downstream pass@k and resume logic need to reason about.

Suggested fix: centralize these synthetic results through the same identity helper as buildErrorResult, or at least set testId: evalCase.testId ?? evalCase.id and prompt: evalCase.prompt everywhere a synthetic EvaluationResult is constructed. Add coverage for budget/fail_on_error prompt-expanded rows.
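
One possible shape for the shared helper, as a sketch only. `EvalCaseLike`, `SyntheticResult`, `resultIdentity`, and `buildBudgetSkip` are hypothetical names and do not reflect the actual orchestrator.ts or EvaluationResult types.

```typescript
// Hypothetical shapes for illustration.
interface PromptIdentity {
  id: string;
  label?: string;
}

interface EvalCaseLike {
  id: string; // internal expanded id
  testId?: string; // authored test id
  prompt?: PromptIdentity;
}

interface SyntheticResult {
  testId: string;
  prompt: PromptIdentity | undefined;
  status: "skipped" | "error";
  reason: string;
}

// Single place that decides which identity a result row carries.
function resultIdentity(evalCase: EvalCaseLike): Pick<SyntheticResult, "testId" | "prompt"> {
  return {
    testId: evalCase.testId ?? evalCase.id,
    prompt: evalCase.prompt,
  };
}

// Every synthetic builder spreads the shared identity instead of reading evalCase.id.
function buildBudgetSkip(evalCase: EvalCaseLike, reason: string): SyntheticResult {
  return { ...resultIdentity(evalCase), status: "skipped", reason };
}

const row = buildBudgetSkip(
  { id: "docs__prompt_direct", testId: "docs", prompt: { id: "direct", label: "Direct" } },
  "run budget exhausted",
);
console.log(row.testId, row.prompt?.id); // docs direct
```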

Verification

  • bun install
  • bun test packages/core/test/evaluation/eval-inline-experiment.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts initially failed before build because the CLI test could not import @agentv/core.
  • bun --filter @agentv/core build passed.
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts passed: 66 tests.
  • The core parser test file from the combined run passed before the CLI import failure: 37 tests.

Verdict: not ready until the function prompt requirement and target identity/runtime lookup drift are resolved. The resume and synthetic-row identity issues are also important for stable prompt-expanded instance identity.

christso commented Jul 2, 2026

Collaborator Author

Addressed the PR #1602 review blockers in commit e859e3f (fix(evals): preserve prompt matrix identity).

Validation run:

  • bun test packages/core/test/evaluation/eval-inline-experiment.test.ts
  • bun test packages/core/test/evaluation/orchestrator.test.ts
  • bun --filter @agentv/core build
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts
  • bun test apps/cli/test/commands/eval/targets.test.ts
  • bun run typecheck
  • bun run lint
  • bun --filter @agentv/core test (2121 pass)
  • bun --filter agentv test (747 pass)
  • bun test packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts
  • bun --filter @agentv/core typecheck
  • live proxy dogfood: function-file prompt source + target id/label lookup + live OpenAI-compatible target/grader passed 1/1; index row preserved test_id=dogfood, prompt_id=generated, prompt_label=Generated prompt, target=openai:gpt-5.4-mini.

Private evidence: EntityProcess/agentv-private branch evidence/av-kfik-5-instance-expansion, commit c524b22, directory pr1602-review-fixes/.

christso force-pushed the feat/av-kfik-5-instance-expansion branch from e859e3f to 5f041fc on July 2, 2026 15:15

christso commented Jul 2, 2026

Collaborator Author

Rebased the PR on current origin/main and pushed review-fix commit 5f041fc5 (fix(evals): preserve prompt matrix identity), superseding the earlier e859e3fb head.

Addressed the four blocker findings from the review comment:

  • top-level prompt expansion now accepts authored function prompt sources (function / function_file) and preserves prompt identity;
  • authored target { id, label } refs resolve through targets.yaml by id while keeping label as display metadata;
  • prompt-expanded resume/artifact keys use authored test_id plus prompt_id, not the internal expanded id;
  • synthetic budget-skip/error rows preserve authored test and prompt identity.

Validation run locally on the rebased head:

  • bun --filter agentv test -> 750 pass, 0 fail
  • bun --filter @agentv/core test -> 2154 pass, 0 fail
  • bun run lint
  • bun run typecheck
  • Earlier focused checks also passed for parser, config-loader, artifact writer, orchestrator, schema sync, and prepare fixtures.

Live dogfood also passed against the local OpenAI proxy with function prompt expansion and target-id lookup. Private evidence is in EntityProcess/agentv-private, branch evidence/av-kfik-5-instance-expansion, commit c524b22, directory pr1602-review-fixes/.

CI for new head 5f041fc5 is queued/in progress; I’ll only mark ready/merge if Actions go green and no blockers remain.

christso marked this pull request as ready for review July 2, 2026 15:18
christso merged commit 24c9364 into main Jul 2, 2026
8 checks passed
christso deleted the feat/av-kfik-5-instance-expansion branch July 2, 2026 15:19