contract(unicharset): direction + mirror leaf - byte-parity 112/112, varied-field surface complete #633
…arity 112/112

Add get_direction + get_mirror + dump_direction + dump_mirror to UniCharSet, backed by directions: Vec<i32> + mirrors: Vec<i32>. These are the two columns after other_case; the bbox+stats group is a single whitespace token, so the columns land at fixed offsets across all 5 of tesseract's istringstream tiers (unicharset.cpp:833-868). The per-line token walk, continued one/two positions past other_case, reads them; no bespoke tier detector. A tier without the columns leaves the walk exhausted -> defaults.

- direction: stored as-is (ICU UCharDirection); load default U_LEFT_TO_RIGHT (0) for an absent column; get_direction returns U_OTHER_NEUTRAL (10) out of range (unicharset.h:712) -- two distinct "defaults" for two distinct conditions.
- mirror: clamped at load like other_case (>= size -> self); get_mirror returns INVALID_UNICHAR_ID (-1) out of range (unicharset.h:721).

Byte-identical 112/112 each vs tesseract's own get_direction/get_mirror on real eng.lstm-unicharset (self-validating oracle; direction 6 distinct values incl. 55x LTR + 33x OTHER_NEUTRAL, mirror 10 bracket/paren pairs).

Sixth leaf of PROBE-OGAR-ADAPTER-UNICHARSET; first to read past the bbox CSV. Remaining sub-leaf: the float stats inside the CSV.

- +3 unicharset tests (26 total); my files clippy -D warnings + fmt clean
- examples/unicharset_dump.rs gains direction|mirror modes (reproduce the diffs)
- board: EPIPHANIES E-CPP-PARITY-6; LATEST_STATE branch-work + D-UNICHARSET-DIR-MIRROR; TECH_DEBT TD-CONTRACT-NOT-FMT-GATED (contract crate not fmt-checked in CI; the rustfmt drift in hhtl/nan_projection/soa_graph is from merged PRs, not this leaf)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1
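The token walk and the two kinds of default described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the contract crate's code: the `DirMirror` type, its `load` function, and the simplified line layout (`unichar props bbox_stats_csv script other_case [direction [mirror ...]]`) are assumptions made for the example. Only the constant values and the default/clamp rules come from the commit message.

```rust
// Hypothetical sketch: type, function names, and the simplified column layout
// are assumptions for illustration, not the contract crate's actual code.
const U_LEFT_TO_RIGHT: i32 = 0; // load default when the direction column is absent
const U_OTHER_NEUTRAL: i32 = 10; // getter result for an out-of-range id
const INVALID_UNICHAR_ID: i32 = -1; // getter result for an out-of-range id

struct DirMirror {
    directions: Vec<i32>,
    mirrors: Vec<i32>,
}

impl DirMirror {
    /// One line per unichar id. Assumed layout per line:
    /// unichar props bbox_stats_csv script other_case [direction [mirror ...]]
    fn load(lines: &[&str]) -> Self {
        let size = lines.len() as i32;
        let mut directions = Vec::with_capacity(lines.len());
        let mut mirrors = Vec::with_capacity(lines.len());
        for (idx, line) in lines.iter().enumerate() {
            let id = idx as i32;
            // The bbox+stats CSV has no spaces, so it is a single token and
            // skipping five tokens lands just past other_case.
            let mut rest = line.split_whitespace().skip(5);
            // Exhausted walk (or unparsable token) -> load default.
            let direction = rest
                .next()
                .and_then(|t| t.parse::<i32>().ok())
                .unwrap_or(U_LEFT_TO_RIGHT);
            // Absent mirror column -> the character mirrors to itself.
            let mirror = rest
                .next()
                .and_then(|t| t.parse::<i32>().ok())
                .unwrap_or(id);
            // Clamp at load: a mirror id at or beyond the set size maps to self.
            let mirror = if mirror >= size { id } else { mirror };
            directions.push(direction);
            mirrors.push(mirror);
        }
        DirMirror { directions, mirrors }
    }

    /// Out-of-range ids report U_OTHER_NEUTRAL, not the load default.
    fn get_direction(&self, id: i32) -> i32 {
        if id < 0 || id as usize >= self.directions.len() {
            return U_OTHER_NEUTRAL;
        }
        self.directions[id as usize]
    }

    /// Out-of-range ids report INVALID_UNICHAR_ID.
    fn get_mirror(&self, id: i32) -> i32 {
        if id < 0 || id as usize >= self.mirrors.len() {
            return INVALID_UNICHAR_ID;
        }
        self.mirrors[id as usize]
    }
}
```

In this sketch a line that stops at `other_case` yields direction 0 and a self-mirror, while a lookup with an id outside the set yields 10 and -1, which keeps the "absent column" and "bad id" conditions separate.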
The sixth UNICHARSET leaf: `direction+mirror`, the first columns read PAST the bbox+stats CSV, proving the CSV-skip and completing the varied-field surface of the character set.

What ships (one commit, `cc76557`-rebased)
- `UniCharSet::{get_direction, get_mirror}` + `dump_direction`/`dump_mirror`, backed by `directions`/`mirrors: Vec<i32>` parsed by continuing the per-line token walk (the bbox+stats group is ONE whitespace token, so columns land at fixed offsets across all 5 of tesseract's istringstream fallback tiers; no bespoke tier detector).
- `direction` load default `U_LEFT_TO_RIGHT` (0) vs out-of-range `U_OTHER_NEUTRAL` (10): two distinct defaults for two distinct conditions (unicharset.h:712); `mirror` clamped like `other_case`, out-of-range → `INVALID_UNICHAR_ID` (unicharset.h:721).
- `examples/unicharset_dump.rs` gains `direction|mirror` modes (reproduces the parity diffs).

Proof
Byte-identical 112/112 each vs tesseract's own `get_direction`/`get_mirror` on real `eng.lstm-unicharset`, via the self-validating oracle (bijection half re-proves the 5.5.0-header/5.3.4-lib layout before the new field is trusted). Direction is genuinely varied on eng (55× LTR, 33× OTHER_NEUTRAL, plus 2/3/4/6 codes); mirror has 10 real bracket/paren pairs, so the parse is exercised, not just defaults.

With this, every UNICHARSET field that varies on real eng data is transcoded and parity-proven (E-CPP-PARITY-1..6). bbox/stats/normed are deferred with reason (uniform on LSTM data = weak falsifier; gated on a legacy unicharset).
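A parity check of this kind could look roughly like the sketch below. It assumes each dump is plain text with one value per unichar id per line. That format, and the `byte_identical`, `parity`, and `first_mismatch` helpers, are hypothetical; the actual `unicharset_dump` output and oracle harness may differ.

```rust
// Hypothetical helpers: the one-value-per-line dump format is an assumption.

/// The actual gate: the two dumps must match byte for byte.
fn byte_identical(ours: &str, oracle: &str) -> bool {
    ours.as_bytes() == oracle.as_bytes()
}

/// Diagnostic count in the "112/112" style: (matching_lines, total_lines).
/// The total is the longer dump's length, so a truncated dump cannot pass.
fn parity(ours: &str, oracle: &str) -> (usize, usize) {
    let a: Vec<&str> = ours.lines().collect();
    let b: Vec<&str> = oracle.lines().collect();
    let total = a.len().max(b.len());
    let matching = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    (matching, total)
}

/// Index (unichar id) of the first differing line, if any.
fn first_mismatch(ours: &str, oracle: &str) -> Option<usize> {
    let mut a = ours.lines();
    let mut b = oracle.lines();
    let mut idx = 0;
    loop {
        match (a.next(), b.next()) {
            (None, None) => return None,
            (x, y) if x == y => idx += 1,
            _ => return Some(idx),
        }
    }
}
```

Under these assumptions, a passing run on a 112-entry set would report `(112, 112)` from `parity` with `byte_identical` returning `true`; `first_mismatch` is only useful when the gate fails.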
Robustness
This commit has survived six rebases across ~240 commits of main churn (V3 substrate, CanonHigh flip, #624–#632) with parity re-verified green each time. Current base: post-#632, 795 contract lib tests, clippy `-D warnings` + fmt clean on touched files.

Board hygiene in-commit: EPIPHANIES `E-CPP-PARITY-6`, LATEST_STATE branch-work + `D-UNICHARSET-DIR-MIRROR`, TECH_DEBT `TD-CONTRACT-NOT-FMT-GATED`.

Companion: tesseract-rs PR (consumer wiring + awareness artifacts). Merge this one FIRST; its CI builds against lance-graph `main`.

🤖 Generated with Claude Code
https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1