feat(eval): add rerun-failed runner pooling by christso · Pull Request #1604 · EntityProcess/agentv

christso · 2026-07-02T12:22:39Z

Summary

Runner execution now keeps max_concurrency as the global in-process budget across target matrices, can resume failed work from canonical run bundles with --rerun-failed <run_id>, and resets pooled workspaces before a slot is reused. Failed quality grades remain final outcomes, while infrastructure retries stay scoped to execution errors.
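
As a rough illustration of what a single global budget across target matrices means, the sketch below shares one counter between every target's queue. The class name `ConcurrencyBudget` and the `simulate` helper are hypothetical and are not taken from the agentv code; only the idea that max_concurrency caps total in-process work comes from the summary above.

```typescript
// Hypothetical sketch: one counting budget shared by every target matrix.
class ConcurrencyBudget {
  private inFlight = 0;
  private readonly maxConcurrency: number;

  constructor(maxConcurrency: number) {
    if (!Number.isInteger(maxConcurrency) || maxConcurrency < 1) {
      throw new RangeError("maxConcurrency must be a positive integer");
    }
    this.maxConcurrency = maxConcurrency;
  }

  // Returns false instead of blocking when the budget is exhausted.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxConcurrency) return false;
    this.inFlight += 1;
    return true;
  }

  release(): void {
    if (this.inFlight === 0) throw new Error("release without acquire");
    this.inFlight -= 1;
  }

  get active(): number {
    return this.inFlight;
  }
}

// Drains the queues of several targets against the same budget in rounds and
// records the highest number of jobs that were in flight at once.
function simulate(
  queues: Record<string, string[]>,
  maxConcurrency: number,
): { peak: number; completed: string[] } {
  const budget = new ConcurrencyBudget(maxConcurrency);
  const pending = Object.entries(queues).flatMap(([target, cases]) =>
    cases.map((testId) => `${target}/${testId}`),
  );
  const completed: string[] = [];
  let peak = 0;
  while (pending.length > 0) {
    const running: string[] = [];
    while (pending.length > 0 && budget.tryAcquire()) {
      running.push(pending.shift() as string);
    }
    peak = Math.max(peak, budget.active);
    for (const job of running) {
      completed.push(job);
      budget.release();
    }
  }
  return { peak, completed };
}
```

The point of the design is that adding a second target does not double the allowed parallelism: every job, whatever its target, draws from the same counter.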

The rerun path reads .agentv/results/<run_id> or an explicit run/index path, skips already-passing rows, and appends artifactized replacement rows to the same bundle when no new --output is provided. Pooled workspace slots now reset both materialized repos and the slot root back to their baseline before returning to the pool, so pooling remains a performance optimization rather than shared mutable case state.
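
The reset-before-reuse rule can be pictured with an in-memory model. This is only a sketch under stated assumptions: `WorkspacePool`, `Slot`, and the `Map` of file contents are hypothetical stand-ins, and the real runner operates on materialized repos and slot directories on disk rather than on maps.

```typescript
// Hypothetical in-memory model of a pooled workspace slot.
type Files = Map<string, string>;

interface Slot {
  id: number;
  baseline: Files; // snapshot taken when the slot was created
  files: Files; // mutable state a running case may change
}

class WorkspacePool {
  private free: Slot[] = [];

  constructor(size: number, baseline: Files) {
    for (let id = 0; id < size; id++) {
      this.free.push({ id, baseline: new Map(baseline), files: new Map(baseline) });
    }
  }

  // Returns undefined when every slot is currently in use.
  acquire(): Slot | undefined {
    return this.free.pop();
  }

  // Reset first, then return the slot, so the next case never observes
  // edits or stray files left behind by the previous one.
  release(slot: Slot): void {
    slot.files = new Map(slot.baseline);
    this.free.push(slot);
  }
}
```

Because the reset happens inside `release`, a caller cannot return a dirty slot by accident, which is what keeps pooling a performance optimization rather than shared case state.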

Validation

bun run --cwd packages/core typecheck
bun run --cwd packages/core lint
bun test packages/core/test/evaluation/orchestrator.test.ts
bun run --cwd packages/core build
bun run --cwd apps/cli typecheck
bun run --cwd apps/cli lint
bun test apps/cli/test/eval.integration.test.ts
bun test apps/cli/test/commands/results/serve.test.ts
bun run --cwd apps/cli build
bun run --cwd apps/web build

Evidence

Live dogfood used the local OpenAI-compatible endpoint at http://127.0.0.1:10531/v1 with model gpt-5.3-codex-spark for both the live target and the LLM grader. The first run produced one pass and one intentional quality failure. Running --rerun-failed .agentv/results/kfik10-dogfood/first then printed "Rerun-failed: found 2 existing result(s), skipping 1 completed." and executed only dogfood-rerun. The final canonical index has 3 rows: the original pass, the original failure, and one artifactized rerun row.

Private evidence: agentv-private:evidence/av-kfik10-runner commit 9a89276.

cloudflare-workers-and-pages · 2026-07-02T12:22:44Z

Deploying agentv with Cloudflare Pages

Latest commit:	`7242c84`
Status:	✅ Deploy successful!
Preview URL:	https://b27511e8.agentv.pages.dev
Branch Preview URL:	https://feat-av-kfik-10-runner.agentv.pages.dev


christso · 2026-07-02T12:35:03Z

Manual CI was dispatched because this stacked PR targets feat/av-kfik-5-instance-expansion and pull_request CI only runs for base main. Green run: https://github.com/EntityProcess/agentv/actions/runs/28590245406

christso · 2026-07-02T12:46:39Z

Review findings:

P1 - --rerun-failed can run cases that were not in the prior failed set. The rerun path only records keys for rows that should be skipped (apps/cli/src/commands/eval/run-eval.ts:1885 and apps/cli/src/commands/eval/run-eval.ts:1889), then filters the full current suite by excluding those keys (apps/cli/src/commands/eval/run-eval.ts:2404). For --rerun-failed <run_id>, any new test, new prompt expansion, or newly selected target that is absent from the prior run has no skip key, so it runs even though the contract is to rerun only failed or errored instances from the canonical run bundle. This also means historical appended rows are not reduced to the latest row per identity before deciding pass/fail. Please build an explicit rerun include set from the latest failed/error rows and filter to that set for rerun-failed; keep the skip-only behavior for --resume.
P1 - The legacy fallback skip key can suppress a distinct failed row. For every passing prior row, the code adds both the precise result identity and the coarse test_id::target::variant fallback (apps/cli/src/commands/eval/run-eval.ts:1891 and apps/cli/src/commands/eval/run-eval.ts:1892), and the filter accepts either key (apps/cli/src/commands/eval/run-eval.ts:2408 and apps/cli/src/commands/eval/run-eval.ts:2409). If two eval files or prompt-expanded instances share a test id and target, a pass in one identity can skip a failure in the other. The fallback should only be used for genuinely legacy rows that cannot provide the canonical eval/prompt identity, not alongside a precise key for every row.
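
The first finding comes down to the difference between filtering by exclusion and filtering by inclusion. The sketch below is a simplified illustration, not the code in run-eval.ts: the `PriorRow` shape and both function names are hypothetical, and identities are reduced to a single string.

```typescript
// Hypothetical, simplified shapes: one string stands in for a full identity.
type Status = "pass" | "fail" | "error";

interface PriorRow {
  id: string;
  status: Status;
}

// Skip-only filtering: anything without a skip key runs, including cases
// that never appeared in the prior run.
function selectBySkip(current: string[], prior: PriorRow[]): string[] {
  const skip = new Set(prior.filter((row) => row.status === "pass").map((row) => row.id));
  return current.filter((id) => !skip.has(id));
}

// Include filtering: only cases that failed or errored in the prior run are selected.
function selectByInclude(current: string[], prior: PriorRow[]): string[] {
  const include = new Set(prior.filter((row) => row.status !== "pass").map((row) => row.id));
  return current.filter((id) => include.has(id));
}
```

With a prior run containing a passing dogfood-pass and a failing dogfood-rerun, and a current suite that also contains dogfood-new, the skip-based selection picks up dogfood-new as well, while the include-based selection returns only dogfood-rerun.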

christso · 2026-07-02T13:10:31Z

Fixed the rerun-failed review findings in 7242c847 on feat/av-kfik-10-runner.

What changed:

--rerun-failed <run_id> now builds an explicit include matcher from the latest failed/error row per prior canonical identity, instead of filtering the current suite by exclusion.
--resume remains skip-only and separate from rerun-failed include filtering.
Coarse test_id::target::variant matching is now used only for genuinely legacy rows with no canonical identity; current canonical rows match through precise identity keys, including projection_identity.dimensions.
Planned current-suite identities now match canonical artifact path variants so existing bundles with absolute projection_identity.dimensions.eval_path are handled without falling back to coarse keys.
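
A minimal sketch of the matcher described in the list above, assuming a simplified row shape. The field names `test_id`, `target`, `variant`, and `eval_path` echo terms used in this thread, but the interfaces, the key format, and the function names are hypothetical and do not reproduce the actual implementation.

```typescript
// Hypothetical shapes; eval_path stands in for the canonical identity
// carried by projection_identity.dimensions.
type Status = "pass" | "fail" | "error";

interface Identity {
  test_id: string;
  target: string;
  variant?: string;
  eval_path?: string; // absent only on legacy rows
}

interface ResultRow extends Identity {
  status: Status;
}

function coarseKey(id: Identity): string {
  return `${id.test_id}::${id.target}::${id.variant ?? ""}`;
}

// Precise key when a canonical identity exists; coarse key only for legacy rows.
function identityKey(id: Identity): string {
  return id.eval_path !== undefined
    ? `canonical::${id.eval_path}::${coarseKey(id)}`
    : `legacy::${coarseKey(id)}`;
}

// Rows are assumed to be in append order, so a later row for the same
// identity replaces an earlier one before pass/fail is decided.
function buildRerunInclude(rows: ResultRow[]): Set<string> {
  const latest = new Map<string, ResultRow>();
  for (const row of rows) latest.set(identityKey(row), row);
  const include = new Set<string>();
  for (const [key, row] of latest) {
    if (row.status !== "pass") include.add(key);
  }
  return include;
}

// A planned instance matches by its precise key, or by the coarse key
// when the failed prior row was a legacy row.
function shouldRerun(planned: Identity, include: Set<string>): boolean {
  return include.has(identityKey(planned)) || include.has(`legacy::${coarseKey(planned)}`);
}
```

Since a coarse key enters the include set only from a row that lacks a canonical identity, a pass recorded for one eval file can no longer mask a failure recorded for another file that shares the same test id and target.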

Validation:

bun test apps/cli/test/eval.integration.test.ts (33 pass)
bun test apps/cli/test/commands/eval/artifact-writer.test.ts (66 pass)
bun test packages/core/test/evaluation/orchestrator.test.ts (97 pass)
bun run --cwd apps/cli lint
bun run --cwd apps/cli typecheck
bun run --cwd apps/cli build

Live dogfood rerun:

Local OpenAI-compatible endpoint: http://127.0.0.1:10531/v1
Model: gpt-5.3-codex-spark
First run used live target + live LLM grader for two cases.
The prior index was marked so that only dogfood-rerun was failed; the current suite also added dogfood-new, which is absent from the prior run.
--rerun-failed reported 0/1 and reran only dogfood-rerun; final index rows are dogfood-rerun, dogfood-pass, dogfood-rerun.
Private evidence: agentv-private:evidence/av-kfik10-2-rerun-failed at 17d7bf5.

christso · 2026-07-02T13:14:01Z

Manual CI for pushed commit 7242c84 is green: https://github.com/EntityProcess/agentv/actions/runs/28592693451

christso · 2026-07-02T15:53:25Z

Recovery could not reopen/retarget this PR because GitHub refuses to reopen a closed PR whose base branch was deleted.

Replacement draft PR: #1609

Recovered branch feat/av-kfik-10-runner was rebased onto main and force-pushed with lease at 6af012a5de030bc8d92b1401889205543d695684. Original closed head was 7242c84725f7f4d556f78ba29be2aa04e9a7e2e0.

christso added 3 commits July 2, 2026 13:27

Add prompt instance expansion (dbe07e7)
Sync prompt object schema (9cefd81)
feat(eval): add rerun-failed runner pooling (0b707fd)
fix(eval): constrain rerun-failed identities (7242c84)

christso force-pushed the feat/av-kfik-5-instance-expansion branch from e859e3f to 5f041fc July 2, 2026 15:15

christso deleted the branch feat/av-kfik-5-instance-expansion July 2, 2026 15:19

christso closed this Jul 2, 2026

christso mentioned this pull request Jul 2, 2026

feat(eval): add rerun-failed runner pooling #1609

Merged

christso wants to merge 4 commits into feat/av-kfik-5-instance-expansion from feat/av-kfik-10-runner
