feat(eval): add rerun-failed runner pooling #1604
Deploying agentv with
| Latest commit: | 7242c84 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://b27511e8.agentv.pages.dev |
| Branch Preview URL: | https://feat-av-kfik-10-runner.agentv.pages.dev |
Manual CI was dispatched because this stacked PR targets feat/av-kfik-5-instance-expansion and pull_request CI only runs for base main. Green run: https://github.com/EntityProcess/agentv/actions/runs/28590245406
Review findings:
Fixed the rerun-failed review findings in
What changed:
Validation:
Live dogfood rerun:
Manual CI for pushed commit 7242c84 is green: https://github.com/EntityProcess/agentv/actions/runs/28592693451
e859e3f to 5f041fc
Recovery could not reopen/retarget this PR because GitHub refuses to reopen a closed PR whose base branch was deleted. Replacement draft PR: #1609 Recovered branch
Summary
Runner execution now keeps `max_concurrency` as the global in-process budget across target matrices, can resume failed work from canonical run bundles with `--rerun-failed <run_id>`, and resets pooled workspaces before a slot is reused. Failed quality grades remain final outcomes, while infrastructure retries stay scoped to execution errors.
The rerun path reads `.agentv/results/<run_id>` or an explicit run/index path, skips already-passing rows, and appends artifactized replacement rows to the same bundle when no new `--output` is provided. Pooled workspace slots now reset both materialized repos and the slot root back to their baseline before returning to the pool, so pooling remains a performance optimization rather than shared mutable case state.
Related
Related: av-kfik.10
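The pooling behavior described in the Summary can be pictured with a minimal sketch. It is not the agentv implementation: `Slot`, `SlotPool`, `tryAcquire`, and `release` are hypothetical names, and an in-memory map stands in for the materialized repos and slot root. The sketch only shows the two properties the Summary claims, namely that one budget is shared by all targets and that a slot returns to its baseline before it is reused.

```typescript
// Hypothetical sketch: Slot, SlotPool, tryAcquire and release are
// illustrative names, not identifiers from the agentv codebase.

interface Slot {
  id: number;
  // Stand-in for files a case may create or modify inside the workspace.
  files: Map<string, string>;
}

class SlotPool {
  private readonly free: Slot[] = [];
  private readonly baseline: ReadonlyMap<string, string>;

  // maxConcurrency is the single budget shared by every target in the matrix.
  constructor(maxConcurrency: number, baseline: Map<string, string>) {
    this.baseline = new Map(baseline);
    for (let id = 0; id < maxConcurrency; id++) {
      this.free.push({ id, files: new Map(baseline) });
    }
  }

  // Returns a slot, or undefined when the global budget is exhausted.
  tryAcquire(): Slot | undefined {
    return this.free.pop();
  }

  // Restore the baseline before the slot becomes available again.
  release(slot: Slot): void {
    slot.files = new Map(this.baseline);
    this.free.push(slot);
  }
}

const pool = new SlotPool(2, new Map([["README.md", "baseline"]]));

// Two targets draw from the same pool, so the budget is global.
const slotForTargetA = pool.tryAcquire();
const slotForTargetB = pool.tryAcquire();
const overBudget = pool.tryAcquire(); // undefined: both slots are in use

// A case mutates its workspace...
slotForTargetA?.files.set("scratch.txt", "left over from a previous case");

// ...and the mutation is discarded when the slot goes back to the pool.
if (slotForTargetA) pool.release(slotForTargetA);
const reused = pool.tryAcquire();
```

A real runner would presumably wait for a free slot instead of returning `undefined`, and would reset files on disk. The synchronous form is used here only to keep the example short.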
Validation
`bun run --cwd packages/core typecheck`
`bun run --cwd packages/core lint`
`bun test packages/core/test/evaluation/orchestrator.test.ts`
`bun run --cwd packages/core build`
`bun run --cwd apps/cli typecheck`
`bun run --cwd apps/cli lint`
`bun test apps/cli/test/eval.integration.test.ts`
`bun test apps/cli/test/commands/results/serve.test.ts`
`bun run --cwd apps/cli build`
`bun run --cwd apps/web build`
Evidence
Live dogfood used the local OpenAI-compatible endpoint at `http://127.0.0.1:10531/v1` with model `gpt-5.3-codex-spark` for both the live target and LLM grader. The first run produced one pass and one intentional quality failure; `--rerun-failed .agentv/results/kfik10-dogfood/first` printed `Rerun-failed: found 2 existing result(s), skipping 1 completed.` and executed only `dogfood-rerun`. The final canonical index has 3 rows: original pass, original failure, and one artifactized rerun row.
Private evidence: `agentv-private:evidence/av-kfik10-runner` commit `9a89276`.
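The selection step behind the dogfood run can be sketched as follows. This is an illustration under assumptions, not the actual agentv code: the row shape, the `planRerun` helper, the test id `dogfood-pass`, and the outcome of the rerun row are all hypothetical. Only `dogfood-rerun` and the log wording come from the evidence above.

```typescript
// Hypothetical sketch: the row shape and field names are assumptions,
// not the actual agentv index schema.

type Outcome = "pass" | "fail" | "error";

interface ResultRow {
  testId: string;
  outcome: Outcome;
}

interface RerunPlan {
  existing: number; // rows found in the existing bundle
  skipped: number;  // tests whose latest row already passes
  toRun: string[];  // tests to execute again
}

// Keep the latest row per test, then rerun everything that is not passing.
function planRerun(rows: ResultRow[]): RerunPlan {
  const latest = new Map<string, Outcome>();
  for (const row of rows) {
    latest.set(row.testId, row.outcome); // later rows override earlier ones
  }
  const toRun: string[] = [];
  let skipped = 0;
  latest.forEach((outcome, testId) => {
    if (outcome === "pass") skipped++;
    else toRun.push(testId);
  });
  return { existing: rows.length, skipped, toRun };
}

// "dogfood-pass" is an invented id; "dogfood-rerun" is the id from the evidence.
const firstRun: ResultRow[] = [
  { testId: "dogfood-pass", outcome: "pass" },
  { testId: "dogfood-rerun", outcome: "fail" },
];

const plan = planRerun(firstRun);
console.log(
  `Rerun-failed: found ${plan.existing} existing result(s), skipping ${plan.skipped} completed.`,
);

// Replacement rows are appended, so earlier rows stay in the index.
// The outcome of the appended row is illustrative only.
const finalIndex: ResultRow[] = [
  ...firstRun,
  { testId: "dogfood-rerun", outcome: "fail" },
];
```

Appending instead of overwriting keeps the original failure visible in the index, which matches the three-row result reported above.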