
feat(eval): add rerun-failed runner pooling #1604

Closed
christso wants to merge 4 commits into feat/av-kfik-5-instance-expansion from feat/av-kfik-10-runner


@christso

christso commented Jul 2, 2026

Collaborator

Summary

Runner execution now keeps max_concurrency as the global in-process budget across target matrices, can resume failed work from canonical run bundles with --rerun-failed <run_id>, and resets pooled workspaces before a slot is reused. Failed quality grades remain final outcomes, while infrastructure retries stay scoped to execution errors.
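
To illustrate what a single global in-process budget means here, the following is a minimal sketch of a counting semaphore that every target matrix would share. The class name ConcurrencyBudget and its run method are hypothetical and are not taken from the runner code in this PR; the point is only that all matrices draw from one limit instead of each getting its own.

```typescript
// Hypothetical sketch: one budget instance shared by all target matrices.
// Not the orchestrator's actual implementation.
class ConcurrencyBudget {
  private readonly maxConcurrency: number;
  private active = 0;
  private readonly waiters: Array<() => void> = [];

  constructor(maxConcurrency: number) {
    if (!Number.isInteger(maxConcurrency) || maxConcurrency < 1) {
      throw new Error("maxConcurrency must be a positive integer");
    }
    this.maxConcurrency = maxConcurrency;
  }

  // Runs the task once a slot is free and always gives the slot back.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.active < this.maxConcurrency) {
      this.active += 1;
      return Promise.resolve();
    }
    // No free slot: park until a finishing task hands its slot over.
    return new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // slot passes directly to the next waiter, so active is unchanged
    } else {
      this.active -= 1;
    }
  }
}
```

Under this sketch, two matrices that both call budget.run(...) on the same instance never exceed max_concurrency in total, regardless of how many cases each matrix contains.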

The rerun path reads .agentv/results/<run_id> or an explicit run/index path, skips already-passing rows, and appends artifactized replacement rows to the same bundle when no new --output is provided. Pooled workspace slots now reset both materialized repos and the slot root back to their baseline before returning to the pool, so pooling remains a performance optimization rather than shared mutable case state.
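
The reset-before-reuse rule for pooled slots can be sketched with an in-memory model. Everything below (WorkspaceSlot, WorkspacePool, the files map standing in for repos and the slot root) is a hypothetical simplification; the real runner resets directories on disk, which this sketch does not attempt.

```typescript
// Hypothetical in-memory model of a pooled workspace slot.
interface WorkspaceSlot {
  id: number;
  baseline: ReadonlyMap<string, string>; // path -> content captured at creation
  files: Map<string, string>; // mutable state that a case may change
}

class WorkspacePool {
  private readonly idle: WorkspaceSlot[] = [];

  constructor(size: number, baseline: ReadonlyMap<string, string>) {
    for (let id = 0; id < size; id += 1) {
      this.idle.push({ id, baseline, files: new Map(baseline) });
    }
  }

  // Returns a free slot, or undefined when the pool is exhausted.
  acquire(): WorkspaceSlot | undefined {
    return this.idle.pop();
  }

  // The reset runs before the slot becomes visible to the next case,
  // so no case can observe another case's leftovers.
  release(slot: WorkspaceSlot): void {
    slot.files = new Map(slot.baseline);
    this.idle.push(slot);
  }
}
```

Resetting on release rather than on acquire keeps idle slots clean at all times, which is one way to make pooling purely a performance optimization.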

Related

Related: av-kfik.10

Validation

  • bun run --cwd packages/core typecheck
  • bun run --cwd packages/core lint
  • bun test packages/core/test/evaluation/orchestrator.test.ts
  • bun run --cwd packages/core build
  • bun run --cwd apps/cli typecheck
  • bun run --cwd apps/cli lint
  • bun test apps/cli/test/eval.integration.test.ts
  • bun test apps/cli/test/commands/results/serve.test.ts
  • bun run --cwd apps/cli build
  • bun run --cwd apps/web build

Evidence

Live dogfood used the local OpenAI-compatible endpoint at http://127.0.0.1:10531/v1 with model gpt-5.3-codex-spark for both the live target and the LLM grader. The first run produced one pass and one intentional quality failure. Running --rerun-failed .agentv/results/kfik10-dogfood/first then printed "Rerun-failed: found 2 existing result(s), skipping 1 completed." and executed only dogfood-rerun. The final canonical index has 3 rows: the original pass, the original failure, and one artifactized rerun row.

Private evidence: agentv-private:evidence/av-kfik10-runner commit 9a89276.


Compound Engineering
Codex

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 2, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 7242c84
Status: ✅  Deploy successful!
Preview URL: https://b27511e8.agentv.pages.dev
Branch Preview URL: https://feat-av-kfik-10-runner.agentv.pages.dev


@christso

christso commented Jul 2, 2026

Collaborator Author

Manual CI was dispatched because this stacked PR targets feat/av-kfik-5-instance-expansion and pull_request CI only runs for base main. Green run: https://github.com/EntityProcess/agentv/actions/runs/28590245406

@christso

christso commented Jul 2, 2026

Collaborator Author

Review findings:

  1. P1 - --rerun-failed can run cases that were not in the prior failed set. The rerun path only records keys for rows that should be skipped (apps/cli/src/commands/eval/run-eval.ts:1885 and apps/cli/src/commands/eval/run-eval.ts:1889), then filters the full current suite by excluding those keys (apps/cli/src/commands/eval/run-eval.ts:2404). For --rerun-failed <run_id>, any new test, new prompt expansion, or newly selected target that is absent from the prior run has no skip key, so it runs even though the contract is to rerun only failed or errored instances from the canonical run bundle. This also means historical appended rows are not reduced to the latest row per identity before deciding pass/fail. Please build an explicit rerun include set from the latest failed/error rows and filter to that set for rerun-failed; keep the skip-only behavior for --resume.

  2. P1 - The legacy fallback skip key can suppress a distinct failed row. For every passing prior row, the code adds both the precise result identity and the coarse test_id::target::variant fallback (apps/cli/src/commands/eval/run-eval.ts:1891 and apps/cli/src/commands/eval/run-eval.ts:1892), and the filter accepts either key (apps/cli/src/commands/eval/run-eval.ts:2408 and apps/cli/src/commands/eval/run-eval.ts:2409). If two eval files or prompt-expanded instances share a test id and target, a pass in one identity can skip a failure in the other. The fallback should only be used for genuinely legacy rows that cannot provide the canonical eval/prompt identity, not alongside a precise key for every row.
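
The difference between the include-set behavior requested in finding 1 and skip-only filtering can be sketched as follows. The row shape, the identity strings, and both function names are hypothetical, and the actual index schema is not shown in this thread. The skip-only function assumes that passing rows are the ones skipped.

```typescript
// Hypothetical row shape for illustration only.
type Status = "pass" | "fail" | "error";

interface ResultRow {
  identity: string; // precise canonical identity of the instance
  status: Status;
}

// Rows are assumed to be in append order, so a later row for the same
// identity supersedes an earlier one.
function latestByIdentity(rows: readonly ResultRow[]): Map<string, ResultRow> {
  const latest = new Map<string, ResultRow>();
  for (const row of rows) latest.set(row.identity, row);
  return latest;
}

// Include filtering: run only planned identities whose latest prior row
// failed or errored. Identities absent from the prior run never run.
function rerunFailedSelection(rows: readonly ResultRow[], planned: readonly string[]): string[] {
  const latest = latestByIdentity(rows);
  return planned.filter((id) => {
    const row = latest.get(id);
    return row !== undefined && row.status !== "pass";
  });
}

// Skip-only filtering: drop planned identities whose latest prior row
// passed; everything else runs, including identities new to the suite.
function skipOnlySelection(rows: readonly ResultRow[], planned: readonly string[]): string[] {
  const latest = latestByIdentity(rows);
  return planned.filter((id) => latest.get(id)?.status !== "pass");
}
```

With a prior bundle where b failed and later passed on an appended row, c errored, and d is new to the suite, the include filter selects only c, while the skip-only filter selects c and d.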

@christso

christso commented Jul 2, 2026

Collaborator Author

Fixed the rerun-failed review findings in 7242c847 on feat/av-kfik-10-runner.

What changed:

  • --rerun-failed <run_id> now builds an explicit include matcher from the latest failed/error row per prior canonical identity, instead of filtering the current suite by exclusion.
  • --resume remains skip-only and separate from rerun-failed include filtering.
  • Coarse test_id::target::variant matching is now used only for genuinely legacy rows with no canonical identity; current canonical rows match through precise identity keys, including projection_identity.dimensions.
  • Planned current-suite identities now match canonical artifact path variants so existing bundles with absolute projection_identity.dimensions.eval_path are handled without falling back to coarse keys.
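
A rough sketch of the matching rule described in the list above: each prior failed row contributes exactly one key, precise when a canonical identity exists and coarse only for legacy rows. The RowIdentity shape, the canonicalIdentity field, and buildIncludeMatcher are hypothetical names for illustration and do not mirror run-eval.ts.

```typescript
// Hypothetical identity shape; canonicalIdentity is absent on legacy rows.
interface RowIdentity {
  testId: string;
  target: string;
  variant: string;
  canonicalIdentity?: string;
}

function coarseKey(row: RowIdentity): string {
  return `${row.testId}::${row.target}::${row.variant}`;
}

// Builds a predicate over planned instances from the latest failed/error rows.
function buildIncludeMatcher(
  failedRows: readonly RowIdentity[],
): (planned: RowIdentity) => boolean {
  const precise = new Set<string>();
  const legacy = new Set<string>();
  for (const row of failedRows) {
    if (row.canonicalIdentity !== undefined) {
      precise.add(row.canonicalIdentity); // canonical rows never add a coarse key
    } else {
      legacy.add(coarseKey(row)); // fallback only for genuinely legacy rows
    }
  }
  return (planned) =>
    (planned.canonicalIdentity !== undefined && precise.has(planned.canonicalIdentity)) ||
    legacy.has(coarseKey(planned));
}
```

Because a canonical row never registers a coarse key, two instances that share a test id and target but come from different eval files can no longer match each other.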

Validation:

  • bun test apps/cli/test/eval.integration.test.ts (33 pass)
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts (66 pass)
  • bun test packages/core/test/evaluation/orchestrator.test.ts (97 pass)
  • bun run --cwd apps/cli lint
  • bun run --cwd apps/cli typecheck
  • bun run --cwd apps/cli build

Live dogfood rerun:

  • Local OpenAI-compatible endpoint: http://127.0.0.1:10531/v1
  • Model: gpt-5.3-codex-spark
  • First run used live target + live LLM grader for two cases.
  • Prior index was marked so only dogfood-rerun was failed; current suite also added absent dogfood-new.
  • --rerun-failed reported 0/1 and reran only dogfood-rerun; final index rows are dogfood-rerun, dogfood-pass, dogfood-rerun.
  • Private evidence: agentv-private:evidence/av-kfik10-2-rerun-failed at 17d7bf5.

@christso

christso commented Jul 2, 2026

Collaborator Author

Manual CI for pushed commit 7242c84 is green: https://github.com/EntityProcess/agentv/actions/runs/28592693451

christso force-pushed the feat/av-kfik-5-instance-expansion branch from e859e3f to 5f041fc on July 2, 2026 15:15
christso deleted the branch feat/av-kfik-5-instance-expansion on July 2, 2026 15:19
christso closed this on Jul 2, 2026
@christso

christso commented Jul 2, 2026

Collaborator Author

Recovery could not reopen/retarget this PR because GitHub refuses to reopen a closed PR whose base branch was deleted.

Replacement draft PR: #1609

Recovered branch feat/av-kfik-10-runner was rebased onto main and force-pushed with lease at 6af012a5de030bc8d92b1401889205543d695684. Original closed head was 7242c84725f7f4d556f78ba29be2aa04e9a7e2e0.
