feat(bench): CL-bench family gates (Context Learning + Continual/Codebase Adaptation) #180

Merged: drewstone merged 5 commits into main from feat/clbench-benchmarks on Jun 6, 2026

@drewstone drewstone commented Jun 6, 2026

CL-bench family — two benchmarks, one PR

Integrates both CL-bench benchmarks as gates in bench/.

CL-bench (Context Learning) — Tencent/Fudan, arXiv:2602.03587

bench/src/clbench-context-gate.mts — router-only. Rubric-graded LLM judge → continuous fraction (within-task signal) + the official binary. Two paired arms (random@K / diverse@K), verifier-select by fraction, paired-bootstrap lifts, writes a corpus. Role: leaderboard + compute/diversity instrument — its rubric judge is the metric, so the "verifier" selection is partly circular (worker==judge confound when both are the same model; fleet runs split worker/judge to break it).

CL-Bench (Continual Learning) — pgasawa, arXiv:2606.05661

bench/src/clbench-codebase-gate.mts + bench/scripts/clbench_codebase_judge.py. Of its six domains, only Codebase Adaptation is a deployable checker (applies the provided test_patch + runs pytest in the instance's Docker image, exit-code = pass — an independent check, the clean analogue of the HumanEval gate); the other five grade against realized outcomes (oracles). The judge bridge exposes CL-Bench's own evaluate_submission(patch, instance) standalone, self-verified (gold passes, empty fails). The worker is a fault-isolated sandbox rollout via openSandboxRun semantics; cheap models resolve in-box via the openai-compat provider (the fix for "Model not found: openai/deepseek-chat").

Verification

pnpm typecheck · pnpm test (green) · bench: tsc --noEmit clean. Both gates ran end-to-end on the live router/sandbox; fleet runs in progress for powered signals. Deliberately NOT in scope: the CL-Bench thesis integration (our runtime as a system, the stateful-vs-stateless gain metric) — a fast follow.

drewstone added 2 commits June 6, 2026 11:36
Integrate Tencent/Fudan CL-bench (arXiv:2602.03587) as a router-only selector
gate. CL-bench grades a model's answer to an in-context-knowledge task against
expert rubrics; the official metric is binary (pass ALL rubrics, avg ~63/task),
but the per-rubric pass-count yields a CONTINUOUS score (fraction satisfied) —
the within-task graded variance a verifier-grounded selector needs and that the
pass/fail-deterministic benches (aec) lacked.

The gate (modeled on humaneval-gate) runs two paired arms over the same tasks —
random@K identical completions vs diverse@K strategy-lensed completions — grades
each with the benchmark's own rubric judge (an LLM, run by us = deployable but
noisy, so we rank by the variance-reduced fraction not the binary, judge model +
temp pinned), verifier-selects by fraction, and reports paired-bootstrap lifts on
BOTH the continuous fraction and the official binary. Writes a corpus RunRecord/
task that `corpus-replay --selector=verifier` + `corpus-report` consume unchanged.

Router-only (no sandbox); fetches the public HF jsonl via curl|head so a smoke
pulls only the first N records. Fail loud on a malformed/empty task set or a
failed judge parse (a real zero, never masked).
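
The paired-bootstrap lift mentioned above can be sketched roughly as follows. This is a minimal illustration, not the gate's actual code: `pairedBootstrapLift`, its parameters, and the percentile interval are assumptions chosen for the example. The point is that tasks are resampled as pairs, so both arms always see the same resampled task set.

```typescript
// Hypothetical sketch of a paired-bootstrap lift between two arms scored on the same tasks.
interface Lift { mean: number; lo: number; hi: number }

function pairedBootstrapLift(
  baseline: number[],   // per-task score for one arm (e.g. the selected fraction under random@K)
  treatment: number[],  // per-task score for the other arm, same tasks in the same order
  resamples = 2000,
  rng: () => number = Math.random,
): Lift {
  const n = baseline.length;
  if (n === 0 || treatment.length !== n) {
    throw new Error('arms must be non-empty and paired task-for-task');
  }
  if (!Number.isInteger(resamples) || resamples < 1) {
    throw new Error('resamples must be a positive integer');
  }
  // Work on per-task differences so every resample keeps the pairing intact.
  const diffs = baseline.map((b, i) => (treatment[i] as number) - b);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += diffs[Math.floor(rng() * n)] as number;
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  const at = (q: number): number =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))] as number;
  const observed = diffs.reduce((s, d) => s + d, 0) / n;
  // 95% percentile interval over the resampled mean differences.
  return { mean: observed, lo: at(0.025), hi: at(0.975) };
}
```

The same function would apply to either metric: pass per-task fractions for the continuous lift, or 0/1 values for the official binary.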
…ed selector gate

Integrate pgasawa CL-Bench (arXiv:2606.05661) codebase_adaptation — the ONE of its
six domains with a DEPLOYABLE checker (the rest grade against realized outcomes the
agent never has = oracles). Its scorer applies the instance's provided test_patch and
runs pytest in the instance's Docker image, keying off the exit code: an INDEPENDENT
deployable check (tests ≠ answer), the clean analogue of the HumanEval gate and unlike
the CL-bench Context gate where the rubric judge IS the metric.

Two pieces:
- scripts/clbench_codebase_judge.py — a thin bridge exposing CL-Bench's own
  `evaluate_submission(patch, instance)` as a standalone (instance_id, patch) -> verdict
  call (run in CL-Bench's venv). Verified to self-check: gold patch passes, empty fails.
- src/clbench-codebase-gate.mts — the gate. Each instance is SWE-bench format; the worker
  is a fault-isolated sandbox rollout (opencode clones repo@base_commit, fixes source,
  writes a diff read off the box FS). Two paired arms (random@K identical vs diverse@K
  strategy-lensed), verifierGroundedSelect by pytest-pass, paired-bootstrap lifts on
  blind/random@k/diverse@k/oracle@k, and a corpus RunRecord/task that corpus-replay
  --selector=verifier consumes. Infra-errored rollouts/judges are excluded, never scored 0.
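
The rule that infra-errored rollouts and judges are excluded rather than scored 0 can be illustrated with a small sketch. The `Verdict` shape and `passRate` helper below are hypothetical names for the example, not the gate's own types.

```typescript
// Hypothetical verdict shape: an infra error is its own outcome, distinct from a test failure.
type Verdict =
  | { kind: 'pass' }
  | { kind: 'fail' }
  | { kind: 'infra_error'; reason: string };

interface PassRate { rate: number | null; scored: number; excluded: number }

function passRate(verdicts: Verdict[]): PassRate {
  // Only verdicts that actually ran the tests count toward the denominator.
  const scored = verdicts.filter((v) => v.kind !== 'infra_error');
  const passed = scored.filter((v) => v.kind === 'pass').length;
  return {
    // null (not 0) when nothing could be scored, so an outage is not read as a failure.
    rate: scored.length === 0 ? null : passed / scored.length,
    scored: scored.length,
    excluded: verdicts.length - scored.length,
  };
}
```

Reporting `excluded` alongside the rate keeps a flaky sandbox visible instead of silently shrinking the sample.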

Needs Docker + the CL-Bench images (`clbench setup codebase_adaptation`) for judging and a
reachable sandbox for rollouts. Independent of the CL-bench (Context) gate (separate PR).
@tangletools

❌ Needs Work — d3e72e96

Readiness 34/100 · Confidence 65/100 · 9 findings (1 critical, 3 medium, 5 low)

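| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 34 | 76 | 34 |
| Confidence | 65 | 65 | 65 |
| Correctness | 34 | 76 | 34 |
| Security | 34 | 76 | 34 |
| Testing | 34 | 76 | 34 |
| Architecture | 34 | 76 | 34 |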

Full multi-shot audit completed 1/1 planned shots over 1 changed file. Global verifier still owns final merge decision.

Blocking

🟣 CRITICAL diversifyMessages ignores lensSystem when no system role — diverse arm degenerates to uniform — bench/src/clbench-context-gate.mts

Line 114: return [{ role: 'system', content: composeStrategies(baseSystem, 1)[0] as string }, ...messages] ignores the lensSystem parameter entirely. composeStrategies(base, 1) always returns the FIRST strategy lens (DIVERSE_STRATEGY_LENSES[0]). For any CL-bench task whose messages array has no system role as the first message, ALL k diverse shots receive the IDENTICAL system message. This completely defeats the diversity mechanism on those tasks — the diverse@k arm produces no more variance than the random arm. Since CL-bench conversations may or may not have a system message (the benchmark's format varies), this silently corrupts a subset of r
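
A minimal sketch of the fix this finding points at, assuming `diversifyMessages` takes the conversation plus the per-shot `lensSystem` string. The signature, the `ChatMessage` type, and the way the lens is merged into an existing system turn are all assumptions for illustration; only the no-system branch reflects the reported bug.

```typescript
interface ChatMessage { role: 'system' | 'user' | 'assistant'; content: string }

// Hypothetical signature; the real diversifyMessages may differ.
function diversifyMessages(messages: ChatMessage[], lensSystem: string): ChatMessage[] {
  const first = messages[0];
  if (first !== undefined && first.role === 'system') {
    // Assumed merge policy: append the per-shot lens to the task's own system turn.
    return [
      { role: 'system', content: `${first.content}\n\n${lensSystem}` },
      ...messages.slice(1),
    ];
  }
  // No system turn: prepend the per-shot lens that was passed in,
  // instead of recomputing a fixed first lens for every shot.
  return [{ role: 'system', content: lensSystem }, ...messages];
}
```

With this shape, two shots that receive different lenses always produce different system content, whether or not the task ships a system message.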

Other

🟠 MEDIUM Single API failure crashes entire benchmark run, losing all prior work — bench/src/clbench-context-gate.mts

Lines 262-268: the pool() helper uses Promise.all without per-item error handling. If any single routerChatWithUsage call (worker solve or judge grade) rejects after retries are exhausted — e.g. a persistent 500, a malformed response that fails parseJudge JSON.parse — the entire benchmark crashes. For a typical run (N=20, K=4: 160 worker calls + 160 judge calls = 320 API calls), a failure on call #319 discards all prior results. The file's design says 'fail loud', but per-try failures (vs fatal config errors) should be surfaced per-shot rather than aborting the full gate. Consider catching errors per-shot in the pool callback and returning a
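
One way to surface failures per shot rather than aborting the run is a bounded-concurrency pool that records each item's outcome. The sketch below is an assumption about what such a helper could look like; `poolSettled` and `Settled` are hypothetical names, and the real `pool()` signature is not shown in this thread.

```typescript
type Settled<T> = { ok: true; value: T } | { ok: false; error: string };

// Hypothetical replacement for a Promise.all-based pool: one failed item no longer rejects the batch.
async function poolSettled<I, O>(
  items: I[],
  concurrency: number,
  fn: (item: I, index: number) => Promise<O>,
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // The check and increment run synchronously, so workers never claim the same index.
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await fn(items[i] as I, i) };
      } catch (err) {
        results[i] = { ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }
  }
  const width = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: width }, () => worker()));
  return results;
}
```

Fatal configuration errors can still be thrown before the pool starts; only per-call failures are captured here, and the caller decides whether a failed shot is excluded or reported.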

🟠 MEDIUM diversifyMessages ignores lensSystem in no-system-turn branch — bench/src/clbench-context-gate.mts

Line 114: when messages[0]?.role !== 'system', the else branch calls composeStrategies(baseSystem, 1)[0] instead of using the passed lensSystem. Since composeStrategies(base, 1) always returns a single-element array using DIVERSE_STRATEGY_LENSES[0], ALL k diverse shots for a task without a system turn receive the IDENTICAL first strategy lens — defeating the diversity treatment. The call site (line 258) passes lenses[s] which already carries the per-shot lens, but it's silently dropped. Impact: the diverse@k

🟠 MEDIUM readFileSync imported but never used — bench/src/clbench-context-gate.mts

Line 36: import { existsSync, readFileSync } from 'node:fs'. readFileSync is never called anywhere in the file. Dead import — should be removed.

🟡 LOW OFFSET env var not validated as non-negative integer — bench/src/clbench-context-gate.mts

Line 230: offset = Number(process.env.OFFSET ?? 0) has no validation. N and K are validated (lines 237-238) but OFFSET is not. A negative or non-integer offset produces need = offset + limit which can cause head -n on GNU coreutils to interpret a negative count as 'all but last N lines', silently fetching a different task set than intended. Fix: add if (!Number.isInteger(offset) || offset < 0) throw new Error(...) alongside the N/K validation block.
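
The suggested check can be factored into a small helper, sketched here under the assumption that N, K, and OFFSET all arrive as environment strings. `parseNonNegativeInt` is a hypothetical name, not a function from the gate.

```typescript
// Hypothetical helper: parse an env-style string into a non-negative integer or fail loudly.
function parseNonNegativeInt(name: string, raw: string | undefined, fallback: number): number {
  const value = raw === undefined || raw.trim() === '' ? fallback : Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got ${JSON.stringify(raw)}`);
  }
  return value;
}
```

A call site would then read something like `const offset = parseNonNegativeInt('OFFSET', process.env.OFFSET, 0)`, which rejects `-3`, `1.5`, and `abc` before any value reaches `head -n`.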

🟡 LOW Oracle metrics cover diverse arm only, not both arms — bench/src/clbench-context-gate.mts

Lines 291, 296: fr.oracle and bin.oracle compute the maximum over diverse shots only (Math.max(...dFr) and dFr.some(...)). The true oracle ceiling should be the max over ALL attempts (both random and diverse arms), since a random shot could pass while all diverse shots fail (e.g., the lens confuses the model). The label 'oracle@k' implies the ceiling of the k-attempt regime. Consider computing max across both arms: Math.max(...rFr, ...dFr).
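
A sketch of the reviewer's suggestion, computing the ceiling over every attempt in both arms. The `Attempt` shape and `oracleAcrossArms` name are assumptions for illustration; the gate's own per-shot records may be structured differently.

```typescript
// Hypothetical per-shot record: continuous fraction plus the official binary.
interface Attempt { fraction: number; allPass: boolean }

function oracleAcrossArms(
  random: Attempt[],
  diverse: Attempt[],
): { fraction: number; binary: boolean } {
  const all = [...random, ...diverse];
  if (all.length === 0) throw new Error('no scored attempts for this task');
  return {
    // Best continuous score any attempt reached, regardless of arm.
    fraction: Math.max(...all.map((a) => a.fraction)),
    // True if any attempt in either arm passed the official binary.
    binary: all.some((a) => a.allPass),
  };
}
```

Note that pooling both arms makes this a ceiling over 2K attempts per task rather than K, so the label on the reported metric would need to say so.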

🟡 LOW Unused import: readFileSync — bench/src/clbench-context-gate.mts

Line 36: readFileSync is imported from node:fs but never referenced. Only existsSync (line 71) is used. Dead import will trigger lint if the Biome config catches unused imports. Fix: change to import { existsSync } from 'node:fs'.

🟡 LOW parseJudge JSON.parse throws on malformed judge output, contradicting doc comment — bench/src/clbench-context-gate.mts

Line 139: JSON.parse(text.trim()) can throw if the judge LLM returns non-JSON. The doc comment on judgeRubrics (line 149-150) says 'a judge API/parse failure is a real zero — surfaced, never masked', implying a zero-valued verdict is returned. Instead the exception propagates through pool, killing the entire benchmark run. This is actually consistent with the repo's fail-loud philosophy, but the comment is misleading. Either wrap in try/catch returning a zero verdict, or update the comment to match the throw b
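
If the try/catch route is taken, one option is to return a tagged result so a parse failure stays distinguishable from a genuine zero score. This is a sketch under that assumption; `safeParseJudge` and `JudgeParse` are hypothetical names.

```typescript
type JudgeParse =
  | { ok: true; raw: unknown }
  | { ok: false; error: string };

// Hypothetical wrapper: never throws, but also never pretends a parse failure is a score.
function safeParseJudge(text: string): JudgeParse {
  try {
    return { ok: true, raw: JSON.parse(text.trim()) as unknown };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

The caller can then log the failure and decide explicitly whether the shot is scored zero or excluded, which keeps the behavior and the doc comment in agreement either way.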

🟡 LOW parseJudge allPass can be true while fraction < 1.0 — inconsistency — bench/src/clbench-context-gate.mts

Line 145: const allPass = obj.all_pass === 1 || obj.all_pass === '1' || (status.length === rubricCount && yes === rubricCount && rubricCount > 0). When the judge returns an explicit all_pass: 1 but a status array shorter than rubricCount (or with some 'no' entries), allPass is true while fraction may be < 1.0. These two fields describe the same verdict; they should be consistent. Consider deriving allPass solely from the per-rubric status list when available, only falling back to the explicit field when the status list is absent.
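
The suggested precedence (status list first, explicit field only as a fallback) might look like the sketch below. The `'yes'` status value, the field names on `JudgeObject`, and the fallback fraction of 1 or 0 are assumptions inferred from this finding, not confirmed details of the judge's output format.

```typescript
interface JudgeObject { status?: unknown; all_pass?: unknown }
interface JudgeVerdict { fraction: number; allPass: boolean }

// Hypothetical derivation: both fields come from one source so they cannot disagree.
function deriveVerdict(obj: JudgeObject, rubricCount: number): JudgeVerdict {
  if (!Number.isInteger(rubricCount) || rubricCount <= 0) {
    throw new Error('rubricCount must be a positive integer');
  }
  if (Array.isArray(obj.status) && obj.status.length > 0) {
    // Assumes each status entry is a 'yes'/'no' style string.
    const yes = obj.status.filter((s) => String(s).toLowerCase() === 'yes').length;
    const complete = obj.status.length === rubricCount;
    return {
      fraction: Math.min(yes, rubricCount) / rubricCount,
      allPass: complete && yes === rubricCount,
    };
  }
  // No usable status list: fall back to the explicit field for both values.
  const explicit = obj.all_pass === 1 || obj.all_pass === '1';
  return { fraction: explicit ? 1 : 0, allPass: explicit };
}
```

Under this rule an explicit `all_pass: 1` paired with a partial status list yields `allPass: false`, matching the fraction.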


tangletools · 2026-06-06T17:51:07Z · trace

@tangletools tangletools left a comment

❌ 1 Blocking Finding — d3e72e96

Full multi-shot audit completed 1/1 planned shots over 1 changed file. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T17:51:07Z · immutable trace

@tangletools

Premise check withheld merge — d3e72e96

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

  • Cited claim: +3.5pp
  • PR body excerpt: feat(bench): CL-bench (Context Learning) deployable-selector gate

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 3 numeric claim(s) (+3.5pp, +9.1pp, +3.5pp) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.


tangletools premise check · #180

drewstone added 3 commits June 6, 2026 11:57
…ap router models resolve

The in-box opencode agent validates `openai`/`anthropic` models against its registry, so
`openai/deepseek-chat` failed with "Model not found" (an empty patch on every rollout). The
`openai-compat` provider is the generic passthrough — it does NOT validate the model name —
so router-served cheap models (deepseek-chat, moonshotai/kimi-k2.6, glm) resolve in-box.
Default the worker to openai-compat (override via WORKER_PROVIDER); verified both deepseek
and kimi write a file in a live sandbox rollout.
@drewstone drewstone changed the title from "feat(bench): CL-bench (Context Learning) deployable-selector gate" to "feat(bench): CL-bench family gates (Context Learning + Continual/Codebase Adaptation)" on Jun 6, 2026
@drewstone drewstone merged commit 6974afc into main Jun 6, 2026
1 check passed
@drewstone drewstone deleted the feat/clbench-benchmarks branch June 6, 2026 18:05