Skip to content

fix(bench): retry Cloudflare 524/520/522 + 408/425 in routerChatWithUsage#182

Merged
drewstone merged 3 commits into
mainfrom
fix/router-transient-codes
Jun 6, 2026
Merged

fix(bench): retry Cloudflare 524/520/522 + 408/425 in routerChatWithUsage#182
drewstone merged 3 commits into
mainfrom
fix/router-transient-codes

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

A parallel fleet of gate runs hits Cloudflare 524 (origin timeout); it was not in the retry whitelist so a transient 524 crashed the run. Add 408/425/520/522/524 to the retryable set. One-line change to the retry condition; CI green.

drewstone added 2 commits June 6, 2026 12:10
…outerChatWithUsage

A fleet of concurrent gate runs triggers Cloudflare 524 ("origin timeout") on the
router; 524 (and siblings 520/522, plus 408/425) were not in the retry whitelist, so
a single transient 524 crashed the whole run (exit 1) instead of backing off. Add them
to the retryable set so parallel benchmark fleets are resilient to load-induced blips.
@tangletools
Copy link
Copy Markdown
Contributor

✅ No Blockers — 26196556

Readiness 89/100 · Confidence 65/100 · 2 findings (2 low)

deepseek glm aggregate
Readiness 95 89 89
Confidence 65 65 65
Correctness 95 89 89
Security 95 89 89
Testing 95 89 89
Architecture 95 89 89

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟡 LOW No unit tests for loadCtxTasks or routerChatWithUsage retry behavior — bench/src/clbench-context-gate.mts

Neither file has test coverage (no bench/tests/ directory exists). The JSONL truncation fix and the expanded retry list are empirically validated (the PR documents the 524 hit), but there are no regression tests. The retry list is especially prone to bit-rot: a future refactor could add a status code that isn't transient. Pre-existing gap, not introduced by this PR.

🟡 LOW routerChatWithTools has no retry logic while routerChatWithUsage does — bench/src/router-client.ts

routerChatWithTools (line 105-140) throws immediately on any non-ok response with no retry. routerChatWithUsage (line 27-64) retries on transient codes. This is pre-existing — routerChatWithTools is currently unused (0 importers found) — but if a caller is added, it will lack the same resilience the sibling function just gained. Not a regression, just a widening gap. No action required for merge.


tangletools · 2026-06-06T18:16:31Z · trace

tangletools
tangletools previously approved these changes Jun 6, 2026
Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — 2 non-blocking findings — 26196556

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-06T18:16:31Z · immutable trace

…ansient failure scores 0, never crashes the run)
Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Refreshed approval after new commits — 59a9327e

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T18:31:56Z

@tangletools
Copy link
Copy Markdown
Contributor

✅ No Blockers — 59a9327e

Readiness 82/100 · Confidence 65/100 · 3 findings (1 medium, 2 low)

deepseek glm aggregate
Readiness 82 92 82
Confidence 65 65 65
Correctness 82 92 82
Security 82 92 82
Testing 82 92 82
Architecture 82 92 82

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Stale docstring contradicts new behavior in judgeRubrics — bench/src/clbench-context-gate.mts

Lines 154-155: docstring reads 'A judge API/parse failure is a real zero (the response could not be validated) — surfaced, never masked.' The new try/catch on lines 163-168 silently returns { fraction: 0, allPass: false, graded: 0 }, which IS masking. The docstring must be updated to match the new fault-isolation semantics, or the function should log the suppressed error.

🟡 LOW Judge failure catch block emits no diagnostic log — bench/src/clbench-context-gate.mts

Lines 163-168: the catch {} in judgeRubrics returns { fraction: 0, allPass: false, graded: 0 } without logging the error. In a multi-hour N×K×2 run, a systematic judge failure (e.g., model returns HTML error pages instead of JSON) would silently grade everything 0 with no signal in the console output. The graded: 0 marker distinguishes it downstream, but a console.warn with the error message and task ID would make diagnosis faster without affecting control flow. Very low severity because this is a bench harness, not production.

🟡 LOW judgeRubrics catch block has no observability — persistent errors are silent — bench/src/clbench-context-gate.mts

Lines 166-167: catch returns { fraction: 0, allPass: false, graded: 0 } with no console.warn. If the judge router key is misconfigured (causing 401 across ALL units), or a model consistently returns unparseable JSON, the entire N×K×2 run completes with all-zero scores and no diagnostic signal. While graded=0 is distinguishable from a real 0/N rubric pass in principle, the per-shot graded value is NOT surfaced in the main() summary output (only aggregated fraction/allPass are reported, lines 313-333). Add a cons


tangletools · 2026-06-06T18:38:36Z · trace

@drewstone drewstone merged commit 588d740 into main Jun 6, 2026
1 check passed
@drewstone drewstone deleted the fix/router-transient-codes branch June 6, 2026 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants