fix(webapp): recover from ClickHouse JSON parse failures on out-of-range integers#3759
fix(webapp): recover from ClickHouse JSON parse failures on out-of-range integers#3759matt-aitken wants to merge 2 commits into
Conversation
…nge integers Second class of poisoned-row failure in the runs replication path. PR #3708 handled lone UTF-16 surrogates; this one handles bare JSON integer literals that exceed ClickHouse's Int64/UInt64 range. ClickHouse's `JSON(max_dynamic_paths=...)` column fits each bare integer token into Int64 (signed) or UInt64 (unsigned). Bare integers strictly outside `[-2^63, 2^64 - 1]` are rejected with `INCORRECT_DATA` (no silent fallback to Float64). JS Numbers that are integer-valued but above `Number.MAX_SAFE_INTEGER` still serialise via JSON.stringify as bare integer tokens (no exponent) while `|value| < 1e21`, so any such Number lands on the wire as a token CH cannot accept. Customer-facing symptom: `scan-social-profiles` runs continued to be stranded in `EXECUTING` on the Tasks page even after the surrogate fix landed. CloudWatch showed `Dropped batch — ClickHouse JSON parse error but sanitizer found nothing to fix` firing 8/8 times since the previous deploy. Root cause: upstream JS Number precision loss on a 21-digit Google Plus ID (`117039831458782873093` → `117039831458782870000`) — the precision-lossy value still serialises as a bare integer that exceeds UInt64.MAX, which CH rejects. Reproduced end-to-end against ClickHouse 25.12.11.4 in Docker with the exact `Cannot parse JSON object here` error from prod. `apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts`: - New private `isUnsafeJsonInteger(value)` helper — true iff value is a finite, integer-valued JS Number where `|value| < 1e21` (i.e. JSON.stringify emits integer form, not exponent) AND `value` falls outside `[Int64.MIN, UInt64.MAX]`. - `sanitizeUnknownInPlace` gains a number-branch: when the predicate holds, replace the Number with its string form. CH's dynamic JSON column accepts a `String` subtype on the same path, so the row inserts cleanly on retry. The numeric value was already precision-lossy upstream (JS Number can't represent integers above 2^53 faithfully), so type-flipping to string is information-preserving relative to what arrived. - Float-valued numbers and large floats (>= 1e21, NaN, Infinity) are left alone — JSON.stringify emits them with exponents or as `null`, both of which CH accepts. Recovery stays purely reactive — no extra cost on the hot replication path. The sanitizer only runs after a ClickHouse parse-error rejection, so healthy rows pay nothing. `apps/webapp/test/sanitizeRowsOnParseError.test.ts`: four new unit tests covering positive/negative out-of-range integers, boundary values (MAX_SAFE_INTEGER, 2^63, UInt64.MAX itself), non-integer numbers, and the actual `scan-social-profiles` nested shape with `gp_id: 117039831458782870000`. Plus an extension to `sanitizeRows` that verifies surrogate and integer fixes are counted together across rows. `.server-changes/runs-replication-bigint-recovery.md` — release notes. Refs TRI-9755. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
Hidden review stack artifact:WalkthroughThis pull request extends the runs-replication row sanitizer to handle out-of-range integers that JSON-serialize as bare integer tokens. The change adds an Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ness Address review (CodeRabbit + Devin convergent finding): the literal `18446744073709551615` in JS source rounds to 2^64 in float64 (the spacing around 2^64 is 2048), so `value > 18446744073709551615` was effectively `value > 2**64` and missed the case where a JS Number's float64 value is *exactly* 2^64. `JSON.stringify(2**64)` emits "18446744073709552000" — a bare integer above UInt64.MAX that ClickHouse rejects — so the sanitizer would have let that row through unchanged and the batch would still drop. Switch to BigInt comparison against named `UINT64_MAX` / `INT64_MIN` constants. `BigInt(value)` is safe because we already gate on `Number.isInteger(value)`. The negative side is unaffected (Int64.MIN = -2^63 is exactly representable in float64) but the BigInt form keeps both sides symmetric and self-documenting. Regression test added: `sanitizeUnknownInPlace(2 ** 64)` must produce "18446744073709552000" with fixed=1. A naïve `>` literal comparison would not catch this. 23/23 sanitizer tests green; webapp typecheck clean.
73f80f8 to
b2cc562
Compare
Summary
Second class of poisoned-row failure in the runs replication path. PR #3708 plugged lone UTF-16 surrogates; this one handles bare JSON integer literals outside ClickHouse's
Int64..UInt64range. Recovery stays purely reactive — the existingsanitizeRowswalker just gains an extra branch, so the hot replication path pays nothing on healthy rows.Fixes the still-firing customer-facing symptom from TRI-9755:
scan-social-profilesruns continued to be stranded inEXECUTINGon the Tasks page after #3708 deployed. CloudWatch showedDropped batch — ClickHouse JSON parse error but sanitizer found nothing to fixfiring 8/8 times since the previous deploy (zero successful sanitizations). Root cause: upstream JS Number precision loss on a 21-digit Google Plus ID (117039831458782873093→117039831458782870000) — the precision-lossy value still serialises as a bare integer that exceedsUInt64.MAX, which ClickHouse rejects withINCORRECT_DATA.How the bug ships
The customer task emits an output containing a Poshmark profile's
spec_format:{"key":"gp_id","proper_key":"Gp Id","value":117039831458782870000,"type":"int"}That value is
1.17e20— comfortably aboveUInt64.MAX(1.84e19) but comfortably below1e21.Number.prototype.toStringonly switches to exponential form at|value| >= 1e21, soJSON.stringifyemits the bare token117039831458782870000and the ClickHouseJSON(max_dynamic_paths)column fails with:Same error verbatim as prod. The same number quoted (
"117039831458782870000") inserts fine — ClickHouse's dynamic JSON column accepts aStringsubtype on the same path.What changed
apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts:isUnsafeJsonInteger(value)helper — true iffvalueis a finite integer-valued JS Number where|value| < 1e21(soJSON.stringifyemits integer form, not exponent) andvaluefalls outside[Int64.MIN, UInt64.MAX].sanitizeUnknownInPlacegains a number-branch: when the predicate holds, replace the Number withString(value). The downstream JSON column dynamic-types the path as String for that row — fine, since the value was already precision-lossy upstream (no JS Number above 2^53 is numerically meaningful anyway).JSON.stringifyemits them with exponents or asnull, both of which ClickHouse accepts.apps/webapp/test/sanitizeRowsOnParseError.test.ts: four new unit tests + an extension tosanitizeRowscovering surrogate + integer fixes counted together across rows. The unit suite now covers:UInt64.MAX(117039831458782870000— the actual prod value)Int64.MIN42,Number.MAX_SAFE_INTEGER,2^63)1e25, NaN, Infinity)scan-social-profilesnested shape — finds the offendinggp_iddeep insideoutput.data.profiles[].spec_format[].platform_variables[].value.server-changes/runs-replication-bigint-recovery.md— release notes entry.Why reactive, not pre-flight
#prepareJsonruns millions of times per day on the replication hot path. Walking every JSON tree to look for oversized integers would add bounded-but-real CPU on every healthy row.sanitizeRowsonly fires after a ClickHouse parse-error rejection, which is a few times a day platform-wide. Extending it costs effectively zero on healthy traffic and gains us recovery on the rare poisoned row.Verification
clickhouse/clickhouse-server:25.12.11.4(closest available to the prod25.12.1.1579build). Pre-sanitize JSON fails with the exact prod error; post-sanitize JSON inserts cleanly and the row is readable withgp_idstored as a String subtype.pnpm --filter webapp exec vitest run test/sanitizeRowsOnParseError.test.ts— 22/22 passing (18 existing + 4 new).pnpm run typecheck --filter webapp— clean.Test plan
pnpm run typecheck --filter webappSanitizing batch after ClickHouse JSON parse errorwarns fire instead ofDropped batch …errors whenscan-social-profilesoutputs trip CH againpermanentlyDroppedBatchescounter stops climbing in/stp/trigger-app-prod/ecs/replication/service-container/process-logsWhat this does NOT do
EXECUTINGrows in production. Same as fix(webapp): recover from ClickHouse JSON parse failures in runs replication #3708 — that needs a reconciliation/backfill sweep (separate ticket — TRI-9755 fix [Docs] Add initial files and configuration #3).🤖 Generated with Claude Code