fix(bench): retry Cloudflare 524/520/522 + 408/425 in routerChatWithUsage by drewstone · Pull Request #182 · tangle-network/agent-runtime

drewstone · 2026-06-06T18:10:29Z

A parallel fleet of gate runs hits Cloudflare 524 (origin timeout); it was not in the retry whitelist so a transient 524 crashed the run. Add 408/425/520/522/524 to the retryable set. One-line change to the retry condition; CI green.

…outerChatWithUsage A fleet of concurrent gate runs triggers Cloudflare 524 ("origin timeout") on the router; 524 (and siblings 520/522, plus 408/425) were not in the retry whitelist, so a single transient 524 crashed the whole run (exit 1) instead of backing off. Add them to the retryable set so parallel benchmark fleets are resilient to load-induced blips.

… (fetch need+2, parse need)

tangletools · 2026-06-06T18:16:34Z

✅ No Blockers — `26196556`

Readiness 89/100 · Confidence 65/100 · 2 findings (2 low)

	deepseek	glm	aggregate
Readiness	95	89	89
Confidence	65	65	65
Correctness	95	89	89
Security	95	89	89
Testing	95	89	89
Architecture	95	89	89

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟡 LOW No unit tests for loadCtxTasks or routerChatWithUsage retry behavior — bench/src/clbench-context-gate.mts

Neither file has test coverage (no bench/tests/ directory exists). The JSONL truncation fix and the expanded retry list are empirically validated (the PR documents the 524 hit), but there are no regression tests. The retry list is especially prone to bit-rot: a future refactor could add a status code that isn't transient. Pre-existing gap, not introduced by this PR.

🟡 LOW routerChatWithTools has no retry logic while routerChatWithUsage does — bench/src/router-client.ts

routerChatWithTools (line 105-140) throws immediately on any non-ok response with no retry. routerChatWithUsage (line 27-64) retries on transient codes. This is pre-existing — routerChatWithTools is currently unused (0 importers found) — but if a caller is added, it will lack the same resilience the sibling function just gained. Not a regression, just a widening gap. No action required for merge.

_{tangletools · 2026-06-06T18:16:31Z · trace}

tangletools

✅ Approved — 2 non-blocking findings — `26196556`

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-06T18:16:31Z · immutable trace}

…ansient failure scores 0, never crashes the run)

tangletools

✅ Refreshed approval after new commits — `59a9327e`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T18:31:56Z}

tangletools · 2026-06-06T18:38:38Z

✅ No Blockers — `59a9327e`

Readiness 82/100 · Confidence 65/100 · 3 findings (1 medium, 2 low)

	deepseek	glm	aggregate
Readiness	82	92	82
Confidence	65	65	65
Correctness	82	92	82
Security	82	92	82
Testing	82	92	82
Architecture	82	92	82

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Stale docstring contradicts new behavior in judgeRubrics — bench/src/clbench-context-gate.mts

Lines 154-155: docstring reads 'A judge API/parse failure is a real zero (the response could not be validated) — surfaced, never masked.' The new try/catch on lines 163-168 silently returns { fraction: 0, allPass: false, graded: 0 }, which IS masking. The docstring must be updated to match the new fault-isolation semantics, or the function should log the suppressed error.

🟡 LOW Judge failure catch block emits no diagnostic log — bench/src/clbench-context-gate.mts

Lines 163-168: the catch {} in judgeRubrics returns { fraction: 0, allPass: false, graded: 0 } without logging the error. In a multi-hour N×K×2 run, a systematic judge failure (e.g., model returns HTML error pages instead of JSON) would silently grade everything 0 with no signal in the console output. The graded: 0 marker distinguishes it downstream, but a console.warn with the error message and task ID would make diagnosis faster without affecting control flow. Very low severity because this is a bench harness, not production.

🟡 LOW judgeRubrics catch block has no observability — persistent errors are silent — bench/src/clbench-context-gate.mts

Lines 166-167: catch returns { fraction: 0, allPass: false, graded: 0 } with no console.warn. If the judge router key is misconfigured (causing 401 across ALL units), or a model consistently returns unparseable JSON, the entire N×K×2 run completes with all-zero scores and no diagnostic signal. While graded=0 is distinguishable from a real 0/N rubric pass in principle, the per-shot graded value is NOT surfaced in the main() summary output (only aggregated fraction/allPass are reported, lines 313-333). Add a cons

_{tangletools · 2026-06-06T18:38:36Z · trace}

drewstone added 2 commits June 6, 2026 12:10

fix(bench): tolerate head tail-truncation on CL-bench huge-line JSONL…

2619655

… (fetch need+2, parse need)

tangletools previously approved these changes Jun 6, 2026

View reviewed changes

fix(bench): fault-isolate the CL-bench Context rubric judge (parse/tr…

59a9327

…ansient failure scores 0, never crashes the run)

drewstone dismissed tangletools’s stale review via 59a9327 June 6, 2026 18:31

tangletools approved these changes Jun 6, 2026

View reviewed changes

drewstone merged commit 588d740 into main Jun 6, 2026
1 check passed

drewstone deleted the fix/router-transient-codes branch June 6, 2026 19:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bench): retry Cloudflare 524/520/522 + 408/425 in routerChatWithUsage#182

fix(bench): retry Cloudflare 524/520/522 + 408/425 in routerChatWithUsage#182
drewstone merged 3 commits into
mainfrom
fix/router-transient-codes

drewstone commented Jun 6, 2026

Uh oh!

tangletools commented Jun 6, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented Jun 6, 2026

Uh oh!

tangletools commented Jun 6, 2026

✅ No Blockers — 26196556

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Approved — 2 non-blocking findings — 26196556

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Refreshed approval after new commits — 59a9327e

Uh oh!

tangletools commented Jun 6, 2026

✅ No Blockers — 59a9327e

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ No Blockers — `26196556`

✅ Approved — 2 non-blocking findings — `26196556`

✅ Refreshed approval after new commits — `59a9327e`

✅ No Blockers — `59a9327e`