feat(bench): CL-bench family gates (Context Learning + Continual/Codebase Adaptation) by drewstone · Pull Request #180 · tangle-network/agent-runtime

drewstone · 2026-06-06T17:45:31Z

CL-bench family — two benchmarks, one PR

Integrates both CL-bench benchmarks as gates in bench/.

CL-bench (Context Learning) — Tencent/Fudan, arXiv:2602.03587

bench/src/clbench-context-gate.mts — router-only. Rubric-graded LLM judge → continuous fraction (within-task signal) + the official binary. Two paired arms (random@K / diverse@K), verifier-select by fraction, paired-bootstrap lifts, writes a corpus. Role: leaderboard + compute/diversity instrument — its rubric judge is the metric, so the "verifier" selection is partly circular (worker==judge confound when both are the same model; fleet runs split worker/judge to break it).

CL-Bench (Continual Learning) — pgasawa, arXiv:2606.05661

bench/src/clbench-codebase-gate.mts + bench/scripts/clbench_codebase_judge.py. Of its six domains, only Codebase Adaptation is a deployable checker (applies the provided test_patch + runs pytest in the instance's Docker image, exit-code = pass — an independent check, the clean analogue of the HumanEval gate); the other five grade against realized outcomes (oracles). The judge bridge exposes CL-Bench's own evaluate_submission(patch, instance) standalone, self-verified (gold passes, empty fails). The worker is a fault-isolated sandbox rollout via openSandboxRun semantics; cheap models resolve in-box via the openai-compat provider (the fix for "Model not found: openai/deepseek-chat").

Verification

pnpm typecheck · pnpm test (green) · bench: tsc --noEmit clean. Both gates ran end-to-end on the live router/sandbox; fleet runs in progress for powered signals. Deliberately NOT in scope: the CL-Bench thesis integration (our runtime as a system, the stateful-vs-stateless gain metric) — a fast follow.

Integrate Tencent/Fudan CL-bench (arXiv:2602.03587) as a router-only selector gate.

CL-bench grades a model's answer to an in-context-knowledge task against expert rubrics. The official metric is binary (pass ALL rubrics, avg ~63/task), but the per-rubric pass count yields a CONTINUOUS score (fraction satisfied): the within-task graded variance that a verifier-grounded selector needs and that the pass/fail-deterministic benches (aec) lacked.

The gate (modeled on humaneval-gate) runs two paired arms over the same tasks: random@K identical completions vs diverse@K strategy-lensed completions. It grades each completion with the benchmark's own rubric judge (an LLM, run by us, so deployable but noisy; we therefore rank by the variance-reduced fraction rather than the binary, with judge model and temperature pinned), verifier-selects by fraction, and reports paired-bootstrap lifts on BOTH the continuous fraction and the official binary.

Writes a corpus RunRecord/task that `corpus-replay --selector=verifier` + `corpus-report` consume unchanged. Router-only (no sandbox); fetches the public HF jsonl via curl|head so a smoke pulls only the first N records. Fail loud on a malformed/empty task set or a failed judge parse (a real zero, never masked).
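To make the selection and lift reporting described above concrete, here is a minimal sketch of verifier-selection by rubric fraction and a paired-bootstrap lift between two arms. The names `selectByFraction`, `pairedBootstrapLift`, and the seeded `mulberry32`-style generator are hypothetical helpers for illustration, not the gate's actual functions, and the 95% percentile interval is an assumption about how the lift might be summarized.

```typescript
// Small seeded PRNG (mulberry32-style) so resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Verifier-select: index of the shot with the highest rubric fraction
// (ties resolve to the earliest shot).
function selectByFraction(fractions: number[]): number {
  let best = 0;
  for (let i = 1; i < fractions.length; i++) {
    if ((fractions[i] as number) > (fractions[best] as number)) best = i;
  }
  return best;
}

// Paired bootstrap: resample TASKS with replacement and recompute the mean
// of per-task differences (treatment - baseline) on each replicate.
function pairedBootstrapLift(
  baseline: number[],
  treatment: number[],
  reps = 2000,
  seed = 1,
): { mean: number; lo: number; hi: number } {
  if (baseline.length !== treatment.length || baseline.length === 0) {
    throw new Error('arms must be paired and non-empty');
  }
  const n = baseline.length;
  const diffs = baseline.map((b, i) => (treatment[i] as number) - b);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < reps; r++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += diffs[Math.floor(rand() * n)] as number;
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const mean = diffs.reduce((s, d) => s + d, 0) / n;
  return {
    mean,
    lo: means[Math.floor(0.025 * reps)] as number, // 2.5th percentile
    hi: means[Math.floor(0.975 * reps)] as number, // 97.5th percentile
  };
}
```

Pairing matters here because both arms run over the same tasks: resampling task indices once per replicate and differencing within a task removes between-task variance from the lift estimate.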

…ed selector gate

Integrate pgasawa CL-Bench (arXiv:2606.05661) codebase_adaptation, the ONE of its six domains with a DEPLOYABLE checker (the rest grade against realized outcomes the agent never has = oracles). Its scorer applies the instance's provided test_patch and runs pytest in the instance's Docker image, keying off the exit code: an INDEPENDENT deployable check (tests ≠ answer), the clean analogue of the HumanEval gate and unlike the CL-bench Context gate where the rubric judge IS the metric.

Two pieces:
- scripts/clbench_codebase_judge.py: a thin bridge exposing CL-Bench's own `evaluate_submission(patch, instance)` as a standalone (instance_id, patch) -> verdict call (run in CL-Bench's venv). Verified to self-check: gold patch passes, empty fails.
- src/clbench-codebase-gate.mts: the gate. Each instance is SWE-bench format; the worker is a fault-isolated sandbox rollout (opencode clones repo@base_commit, fixes source, writes a diff read off the box FS). Two paired arms (random@K identical vs diverse@K strategy-lensed), verifierGroundedSelect by pytest-pass, paired-bootstrap lifts on blind/random@k/diverse@k/oracle@k, and a corpus RunRecord/task that corpus-replay --selector=verifier consumes.

Infra-errored rollouts/judges are excluded, never scored 0. Needs Docker + the CL-Bench images (`clbench setup codebase_adaptation`) for judging and a reachable sandbox for rollouts. Independent of the CL-bench (Context) gate (separate PR).

tangletools · 2026-06-06T17:51:09Z

❌ Needs Work — `d3e72e96`

Readiness 34/100 · Confidence 65/100 · 9 findings (1 critical, 3 medium, 5 low)

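| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 34 | 76 | 34 |
| Confidence | 65 | 65 | 65 |
| Correctness | 34 | 76 | 34 |
| Security | 34 | 76 | 34 |
| Testing | 34 | 76 | 34 |
| Architecture | 34 | 76 | 34 |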
Architecture	34	76	34

Full multi-shot audit completed 1/1 planned shots over 1 changed files. Global verifier still owns final merge decision.

Blocking

🟣 CRITICAL diversifyMessages ignores lensSystem when no system role — diverse arm degenerates to uniform — bench/src/clbench-context-gate.mts

Line 114: return [{ role: 'system', content: composeStrategies(baseSystem, 1)[0] as string }, ...messages] ignores the lensSystem parameter entirely. composeStrategies(base, 1) always returns the FIRST strategy lens (DIVERSE_STRATEGY_LENSES[0]). For any CL-bench task whose messages array has no system role as the first message, ALL k diverse shots receive the IDENTICAL system message. This completely defeats the diversity mechanism on those tasks — the diverse@k arm produces no more variance than the random arm. Since CL-bench conversations may or may not have a system message (the benchmark's format varies), this silently corrupts a subset of r
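A minimal sketch of the fix this finding points toward, assuming messages are plain `{ role, content }` objects and that `lensSystem` is the already-composed per-shot system prompt passed in from the call site. The `ChatMessage` type and the way the lens is merged with an existing system turn are assumptions for illustration; the real function may compose them differently.

```typescript
// Hypothetical message shape; the real gate's type may differ.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Sketch: apply the per-shot lens in BOTH branches, so shots differ
// whether or not the task ships its own system turn.
function diversifyMessages(messages: ChatMessage[], lensSystem: string): ChatMessage[] {
  const first = messages[0];
  if (first?.role === 'system') {
    // Assumption: prepend the lens to the task's own system prompt.
    return [
      { role: 'system', content: `${lensSystem}\n\n${first.content}` },
      ...messages.slice(1),
    ];
  }
  // No system turn: add one carrying THIS shot's lens, not always lens 0.
  return [{ role: 'system', content: lensSystem }, ...messages];
}

// Two shots over a task with no system turn now receive distinct prompts.
const task: ChatMessage[] = [{ role: 'user', content: 'Answer using the context.' }];
const shotA = diversifyMessages(task, 'Lens A: reason step by step.');
const shotB = diversifyMessages(task, 'Lens B: answer tersely.');
```

A regression test along these lines (distinct lenses in, distinct system turns out, for a task without a system message) would likely have caught the degenerate case.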

Other

🟠 MEDIUM Single API failure crashes entire benchmark run, losing all prior work — bench/src/clbench-context-gate.mts

Lines 262-268: the pool() helper uses Promise.all without per-item error handling. If any single routerChatWithUsage call (worker solve or judge grade) rejects after retries are exhausted — e.g. a persistent 500, a malformed response that fails parseJudge JSON.parse — the entire benchmark crashes. For a typical run (N=20, K=4: 160 worker calls + 160 judge calls = 320 API calls), a failure on call #319 discards all prior results. The file's design says 'fail loud', but per-try failures (vs fatal config errors) should be surfaced per-shot rather than aborting the full gate. Consider catching errors per-shot in the pool callback and returning a
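One way to surface failures per shot, sketched under the assumption that the pool maps an item list through an async function with bounded concurrency. `poolSettled` and the `Settled` type are hypothetical names, not the gate's existing `pool()` helper.

```typescript
// Outcome of one pooled task: either a value or a captured error message.
type Settled<T> = { ok: true; value: T } | { ok: false; error: string };

// Sketch of a bounded-concurrency pool that records per-item failures
// instead of letting one rejection abort the whole run.
async function poolSettled<I, O>(
  items: I[],
  concurrency: number,
  fn: (item: I, index: number) => Promise<O>,
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // no race: JS runs to completion between awaits
      try {
        results[i] = { ok: true, value: await fn(items[i] as I, i) };
      } catch (err) {
        results[i] = { ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }
  }
  const width = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: width }, () => worker()));
  return results;
}
```

With this shape the caller decides what a failed shot means (excluded, retried, or reported), and config errors that should abort the run can still be thrown before the pool starts.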

🟠 MEDIUM diversifyMessages ignores lensSystem in no-system-turn branch — bench/src/clbench-context-gate.mts

Line 114: when messages[0]?.role !== 'system', the else branch calls composeStrategies(baseSystem, 1)[0] instead of using the passed lensSystem. Since composeStrategies(base, 1) always returns a single-element array using DIVERSE_STRATEGY_LENSES[0], ALL k diverse shots for a task without a system turn receive the IDENTICAL first strategy lens — defeating the diversity treatment. The call site (line 258) passes lenses[s] which already carries the per-shot lens, but it's silently dropped. Impact: the diverse@k

🟠 MEDIUM readFileSync imported but never used — bench/src/clbench-context-gate.mts

Line 36: import { existsSync, readFileSync } from 'node:fs'. readFileSync is never called anywhere in the file. Dead import — should be removed.

🟡 LOW OFFSET env var not validated as non-negative integer — bench/src/clbench-context-gate.mts

Line 230: offset = Number(process.env.OFFSET ?? 0) has no validation. N and K are validated (lines 237-238) but OFFSET is not. A negative or non-integer offset produces need = offset + limit which can cause head -n on GNU coreutils to interpret a negative count as 'all but last N lines', silently fetching a different task set than intended. Fix: add if (!Number.isInteger(offset) || offset < 0) throw new Error(...) alongside the N/K validation block.
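The suggested validation could look like the sketch below. `parseNonNegativeInt` is a hypothetical helper; in the gate the raw value would come from `process.env.OFFSET`, but literals are used here so the sketch runs anywhere.

```typescript
// Sketch: parse an env-style string into a non-negative integer, or throw.
function parseNonNegativeInt(name: string, raw: string | undefined, fallback: number): number {
  const value = raw === undefined || raw === '' ? fallback : Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got ${JSON.stringify(raw)}`);
  }
  return value;
}

// Stand-ins for the env-derived values.
const offset = parseNonNegativeInt('OFFSET', '5', 0);
const limit = 20;
const need = offset + limit; // number of lines to request from `head -n`
```

Rejecting the value up front keeps `need` from ever going negative, which is the case where GNU `head -n` changes meaning.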

🟡 LOW Oracle metrics cover diverse arm only, not both arms — bench/src/clbench-context-gate.mts

Lines 291, 296: fr.oracle and bin.oracle compute the maximum over diverse shots only (Math.max(...dFr) and dFr.some(...)). The true oracle ceiling should be the max over ALL attempts (both random and diverse arms), since a random shot could pass while all diverse shots fail (e.g., the lens confuses the model). The label 'oracle@k' implies the ceiling of the k-attempt regime. Consider computing max across both arms: Math.max(...rFr, ...dFr).
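A sketch of the both-arms ceiling suggested here, assuming `rFr` and `dFr` hold the per-shot rubric fractions for the random and diverse arms of one task. `oracleBothArms` is a hypothetical name, and treating a fraction of exactly 1 as a binary pass is an assumption; the gate may carry the binary verdict separately.

```typescript
// Sketch: per-task oracle ceiling over every attempt from both arms.
// rFr / dFr are per-shot rubric fractions in [0, 1] for random@K / diverse@K.
function oracleBothArms(rFr: number[], dFr: number[]): { fraction: number; binary: boolean } {
  const all = [...rFr, ...dFr];
  if (all.length === 0) throw new Error('no scored shots for task');
  return {
    fraction: Math.max(...all),
    // Assumption: a shot passes the binary metric when its fraction is exactly 1.
    binary: all.some((f) => f === 1),
  };
}
```

For example, with random shots `[1, 0.5]` and diverse shots `[0.8, 0.9]`, a diverse-only oracle reports 0.9 and no pass, while the both-arms ceiling reports 1 and a pass.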

🟡 LOW Unused import: readFileSync — bench/src/clbench-context-gate.mts

Line 36: readFileSync is imported from node:fs but never referenced. Only existsSync (line 71) is used. Dead import will trigger lint if the Biome config catches unused imports. Fix: change to import { existsSync } from 'node:fs'.

🟡 LOW parseJudge JSON.parse throws on malformed judge output, contradicting doc comment — bench/src/clbench-context-gate.mts

Line 139: JSON.parse(text.trim()) can throw if the judge LLM returns non-JSON. The doc comment on judgeRubrics (line 149-150) says 'a judge API/parse failure is a real zero — surfaced, never masked', implying a zero-valued verdict is returned. Instead the exception propagates through pool, killing the entire benchmark run. This is actually consistent with the repo's fail-loud philosophy, but the comment is misleading. Either wrap in try/catch returning a zero verdict, or update the comment to match the throw b

🟡 LOW parseJudge allPass can be true while fraction < 1.0 — inconsistency — bench/src/clbench-context-gate.mts

Line 145: const allPass = obj.all_pass === 1 || obj.all_pass === '1' || (status.length === rubricCount && yes === rubricCount && rubricCount > 0). When the judge returns an explicit all_pass: 1 but a status array shorter than rubricCount (or with some 'no' entries), allPass is true while fraction may be < 1.0. These two fields describe the same verdict; they should be consistent. Consider deriving allPass solely from the per-rubric status list when available, only falling back to the explicit field when the status list is absent.
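The suggested derivation can be sketched as follows. The field names `status` and `all_pass` come from the finding; the `Verdict` type, the `deriveVerdict` name, and the assumption that status entries are `'yes'`/`'no'` strings are hypothetical.

```typescript
type Verdict = { fraction: number; allPass: boolean };

// Sketch: derive allPass from the per-rubric list when it is present and
// fall back to the explicit field only when the list is absent.
function deriveVerdict(obj: { status?: unknown; all_pass?: unknown }, rubricCount: number): Verdict {
  if (!Number.isInteger(rubricCount) || rubricCount <= 0) {
    throw new Error('rubricCount must be a positive integer');
  }
  if (Array.isArray(obj.status) && obj.status.length > 0) {
    // Ignore any entries beyond the known rubric count.
    const statuses = obj.status.slice(0, rubricCount);
    const yes = statuses.filter((s) => String(s).toLowerCase() === 'yes').length;
    // A short list leaves yes < rubricCount, so allPass stays false.
    return { fraction: yes / rubricCount, allPass: yes === rubricCount };
  }
  const explicit = obj.all_pass === 1 || obj.all_pass === '1';
  return { fraction: explicit ? 1 : 0, allPass: explicit };
}
```

Under this shape `allPass` is true exactly when `fraction` is 1, so the two fields can no longer disagree.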

_{tangletools · 2026-06-06T17:51:07Z · trace}

tangletools

❌ 1 Blocking Finding — `d3e72e96`

Full multi-shot audit completed 1/1 planned shots over 1 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-06T17:51:07Z · immutable trace}

tangletools · 2026-06-06T17:51:13Z

Premise check withheld merge — `d3e72e96`

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

Cited claim: +3.5pp
PR body excerpt: feat(bench): CL-bench (Context Learning) deployable-selector gate

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 3 numeric claim(s) (+3.5pp, +9.1pp, +3.5pp) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.

_{tangletools premise check · #180}

…ap router models resolve

The in-box opencode agent validates `openai`/`anthropic` models against its registry, so `openai/deepseek-chat` failed with "Model not found" (yielding an empty patch on every rollout). The `openai-compat` provider is the generic passthrough and does NOT validate the model name, so router-served cheap models (deepseek-chat, moonshotai/kimi-k2.6, glm) resolve in-box. Default the worker to openai-compat (override via WORKER_PROVIDER); verified that both deepseek and kimi write a file in a live sandbox rollout.

drewstone added 2 commits June 6, 2026 11:36

tangletools requested changes Jun 6, 2026


drewstone added 3 commits June 6, 2026 11:57

Merge remote-tracking branch 'origin/main' into feat/clbench-benchmarks

1e8854e

Merge branch 'feat/clbench-continual' into feat/clbench-benchmarks

618c565

drewstone changed the title ~~feat(bench): CL-bench (Context Learning) deployable-selector gate~~ feat(bench): CL-bench family gates (Context Learning + Continual/Codebase Adaptation) Jun 6, 2026

drewstone merged commit 6974afc into main Jun 6, 2026
1 check passed

drewstone deleted the feat/clbench-benchmarks branch June 6, 2026 18:05

drewstone merged 5 commits into main from feat/clbench-benchmarks
