fix(bench): retry Cloudflare 524/520/522 + 408/425 in routerChatWithUsage#182
Conversation
…outerChatWithUsage
A fleet of concurrent gate runs triggers Cloudflare 524 ("origin timeout") on the
router; 524 (and siblings 520/522, plus 408/425) were not in the retry whitelist, so
a single transient 524 crashed the whole run (exit 1) instead of backing off. Add them
to the retryable set so parallel benchmark fleets are resilient to load-induced blips.
… (fetch need+2, parse need)
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 95 | 89 | 89 |
| Confidence | 65 | 65 | 65 |
| Correctness | 95 | 89 | 89 |
| Security | 95 | 89 | 89 |
| Testing | 95 | 89 | 89 |
| Architecture | 95 | 89 | 89 |
Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟡 LOW No unit tests for loadCtxTasks or routerChatWithUsage retry behavior — bench/src/clbench-context-gate.mts
Neither file has test coverage (no bench/tests/ directory exists). The JSONL truncation fix and the expanded retry list are empirically validated (the PR documents the 524 hit), but there are no regression tests. The retry list is especially prone to bit-rot: a future refactor could add a status code that isn't transient. Pre-existing gap, not introduced by this PR.
🟡 LOW routerChatWithTools has no retry logic while routerChatWithUsage does — bench/src/router-client.ts
routerChatWithTools (line 105-140) throws immediately on any non-ok response with no retry. routerChatWithUsage (line 27-64) retries on transient codes. This is pre-existing —
routerChatWithToolsis currently unused (0 importers found) — but if a caller is added, it will lack the same resilience the sibling function just gained. Not a regression, just a widening gap. No action required for merge.
tangletools · 2026-06-06T18:16:31Z · trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Approved — 2 non-blocking findings — 26196556
Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-06T18:16:31Z · immutable trace
…ansient failure scores 0, never crashes the run)
tangletools
left a comment
There was a problem hiding this comment.
✅ Refreshed approval after new commits — 59a9327e
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T18:31:56Z
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 82 | 92 | 82 |
| Confidence | 65 | 65 | 65 |
| Correctness | 82 | 92 | 82 |
| Security | 82 | 92 | 82 |
| Testing | 82 | 92 | 82 |
| Architecture | 82 | 92 | 82 |
Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Stale docstring contradicts new behavior in judgeRubrics — bench/src/clbench-context-gate.mts
Lines 154-155: docstring reads 'A judge API/parse failure is a real zero (the response could not be validated) — surfaced, never masked.' The new try/catch on lines 163-168 silently returns { fraction: 0, allPass: false, graded: 0 }, which IS masking. The docstring must be updated to match the new fault-isolation semantics, or the function should log the suppressed error.
🟡 LOW Judge failure catch block emits no diagnostic log — bench/src/clbench-context-gate.mts
Lines 163-168: the
catch {}injudgeRubricsreturns{ fraction: 0, allPass: false, graded: 0 }without logging the error. In a multi-hour N×K×2 run, a systematic judge failure (e.g., model returns HTML error pages instead of JSON) would silently grade everything 0 with no signal in the console output. Thegraded: 0marker distinguishes it downstream, but aconsole.warnwith the error message and task ID would make diagnosis faster without affecting control flow. Very low severity because this is a bench harness, not production.
🟡 LOW judgeRubrics catch block has no observability — persistent errors are silent — bench/src/clbench-context-gate.mts
Lines 166-167: catch returns { fraction: 0, allPass: false, graded: 0 } with no console.warn. If the judge router key is misconfigured (causing 401 across ALL units), or a model consistently returns unparseable JSON, the entire N×K×2 run completes with all-zero scores and no diagnostic signal. While graded=0 is distinguishable from a real 0/N rubric pass in principle, the per-shot graded value is NOT surfaced in the main() summary output (only aggregated fraction/allPass are reported, lines 313-333). Add a cons
tangletools · 2026-06-06T18:38:36Z · trace
A parallel fleet of gate runs hits Cloudflare 524 (origin timeout); it was not in the retry whitelist so a transient 524 crashed the run. Add 408/425/520/522/524 to the retryable set. One-line change to the retry condition; CI green.