From cc765579cc4976924bfd39663520e8cfc362bd42 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 17:47:17 +0000 Subject: [PATCH] contract(unicharset): transcode UNICHARSET direction + mirror, byte-parity 112/112 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add get_direction + get_mirror + dump_direction + dump_mirror to UniCharSet, backed by directions: Vec + mirrors: Vec. These are the two columns after other_case; the bbox+stats group is a single whitespace token, so the columns land at fixed offsets across all 5 of tesseract's istringstream tiers (unicharset.cpp:833-868) — the per-line token walk continued one/two positions past other_case reads them, no bespoke tier detector. A tier without the columns leaves the walk exhausted -> defaults. - direction: stored as-is (ICU UCharDirection); load default U_LEFT_TO_RIGHT (0) for an absent column; get_direction returns U_OTHER_NEUTRAL (10) out of range (unicharset.h:712) -- two distinct "defaults" for two distinct conditions. - mirror: clamped at load like other_case (>= size -> self); get_mirror returns INVALID_UNICHAR_ID (-1) out of range (unicharset.h:721). Byte-identical 112/112 each vs tesseract's own get_direction/get_mirror on real eng.lstm-unicharset (self-validating oracle; direction 6 distinct values incl. 55x LTR + 33x OTHER_NEUTRAL, mirror 10 bracket/paren pairs). Sixth leaf of PROBE-OGAR-ADAPTER-UNICHARSET; first to read past the bbox CSV. Remaining sub-leaf: the float stats inside the CSV. 
- +3 unicharset tests (26 total); my files clippy -D warnings + fmt clean - examples/unicharset_dump.rs gains direction|mirror modes (reproduce the diffs) - board: EPIPHANIES E-CPP-PARITY-6; LATEST_STATE branch-work + D-UNICHARSET-DIR-MIRROR; TECH_DEBT TD-CONTRACT-NOT-FMT-GATED (contract crate not fmt-checked in CI; the rustfmt drift in hhtl/nan_projection/soa_graph is from merged PRs, not this leaf) Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_016b33swuXE23hKtqxsHu9p1 --- .claude/board/EPIPHANIES.md | 13 ++ .claude/board/LATEST_STATE.md | 4 + .claude/board/TECH_DEBT.md | 16 ++ .../examples/unicharset_dump.rs | 14 +- crates/lance-graph-contract/src/unicharset.rs | 140 ++++++++++++++++++ 5 files changed, 182 insertions(+), 5 deletions(-) diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 4babdd57..9dd3e6b4 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -1805,6 +1805,19 @@ Why this is the right move, not just a bug patch: 3. **Flexibility + the one cost.** A node mixes in up to 16 family adjacencies (huge flexibility, any-to-any within 256). The named limitation is **mixin dependency**: a referenced family must exist or the slot is a dangling adapter (skipped). That is the honest trade — and it is cheap, because a missing family is a render no-op, not a corruption. The general rule for graph edges on this substrate: **resolve to the stable grouping (family), not the volatile leaf (member)** — unless a richer flavor (8×16-bit, 32×4 residue, member→member second-hop) is measured to be needed. Cross-ref: `E-ANCHOR-IS-A-HEAD-FIELD-NOT-A-VALUE-TYPE` (the static dual), `E-GUID-IS-THE-GRAPH`, the operator's deferred helix-basin-anchor (CLAM ⇄ Louvain turbovec edge residue) as the eventual richer flavor; `aiwar.rs` (the POC: 221 aiwar entities → 60 category family hubs). 
+## 2026-06-20 — E-CPP-PARITY-6 — the UNICHARSET `direction` + `mirror` columns are byte-identical to libtesseract; the sixth leaf, and the first to read PAST the bbox CSV into the multi-column tail + +**Status:** FINDING (in-env, real trained data). `lance_graph_contract::unicharset::UniCharSet::{get_direction, get_mirror}` dump the `eng.lstm-unicharset` per-id bidi direction codes and mirror ids **byte-identical to tesseract's own `get_direction` / `get_mirror`, 112/112 each** (same self-validating oracle, `direction` + `mirror` modes). Sixth + seventh proven accessor surfaces. + +**Why this was the "multi-tier parser" leaf — and why it turned out simple.** `direction`/`mirror` sit two/three columns past the script, after the bbox+stats CSV. Tesseract places them via a 5-tier `istringstream` fallback (`unicharset.cpp:833-868`). But the bbox+stats group is always a SINGLE whitespace token (comma-separated, no spaces), so on a whitespace split the columns land at fixed offsets regardless of tier: `script`, `other_case`, `direction`, `mirror` are simply the 1st/2nd/3rd/4th tokens after the optional CSV. Continuing the existing per-line token walk one and two positions past `other_case` reads them; a tier without the columns leaves the walk exhausted → defaults. No bespoke tier detector needed — the token walk IS the tier collapse. (The float stats inside the CSV still need decimal parsing; that's the remaining sub-leaf.) + +**Two transcode subtleties the oracle pinned (read-the-truth-first, again).** (1) `direction`'s load default is `U_LEFT_TO_RIGHT` (0) for an absent column, but `get_direction`'s OUT-OF-RANGE return is `U_OTHER_NEUTRAL` (10) — two different "defaults" for two different conditions (`unicharset.h:712-714`). (2) `mirror` is clamped at load exactly like `other_case` (`>= size` → self) and returns `INVALID_UNICHAR_ID` (-1) out of range. 
The oracle confirmed direction is genuinely varied on eng (55× LTR=0, 33× OTHER_NEUTRAL=10, plus 2/3/4/6 for digit-class chars) and mirror has 10 real pairs (bracket/paren/brace mirrors, e.g. `(`↔`)`), so this exercises the parse, not just the defaults. + +**Pattern holds (E-CPP-KEYSTONE-1).** +2 accessors + 2 dumps + one `diff` each, no new architecture, no Core gap. +3 contract tests (26 unicharset total). Consumed by `tesseract-core::CharSet::{get_direction,get_mirror}`. Reproducible via the committed `examples/unicharset_dump.rs {direction,mirror}`. + +**Tooling note (TECH_DEBT filed):** the contract crate is NOT fmt-gated in CI (`style.yml` checks only `lance-graph` + `deepnsm`), so merged symbiont/SoA PRs left rustfmt-1.9.0 drift in `hhtl.rs`/`nan_projection.rs`/`soa_graph.rs`. My leaf files are fmt-clean; I did not reformat others' merged files. See TECH_DEBT. + +Cross-ref: `E-CPP-PARITY-1..5` (the prior leaves), `E-CPP-KEYSTONE-1`, `.claude/knowledge/core-first-transcode-doctrine.md`. Branch `claude/happy-hamilton-0azlw4`, lance-graph + tesseract-rs. --- diff --git a/.claude/board/LATEST_STATE.md b/.claude/board/LATEST_STATE.md index 48dbb6be..9c160ff8 100644 --- a/.claude/board/LATEST_STATE.md +++ b/.claude/board/LATEST_STATE.md @@ -36,6 +36,8 @@ Membrane consumers can now pull BOTH halves of a render `classid` BBB-safely fro --- +> **2026-06-20 — branch work (`claude/happy-hamilton-0azlw4`)** — **UNICHARSET `direction` + `mirror` transcoded + byte-parity proven (E-CPP-PARITY-6), the sixth leaf — first to read PAST the bbox CSV.** `UniCharSet` now parses the two columns after `other_case` into `directions: Vec` + `mirrors: Vec` by continuing the per-line token walk (the bbox+stats group is one whitespace token, so columns land at fixed offsets regardless of the 5-tier fallback — no bespoke tier detector). 
`get_direction` (`unicharset.h:712`, load default `U_LEFT_TO_RIGHT` 0, out-of-range → `U_OTHER_NEUTRAL` 10) + `get_mirror` (`unicharset.h:721`, clamped like other_case, out-of-range → -1) + `dump_direction`/`dump_mirror`. **Byte-identical 112/112 each** on real `eng.lstm-unicharset` (self-validating oracle; direction varied: 55× LTR / 33× OTHER_NEUTRAL / 2·3·4·6 for digit chars; mirror has 10 bracket/paren pairs). Additive, zero-dep; +3 contract tests (26 unicharset total), my files clippy + fmt clean; reproducible via `examples/unicharset_dump.rs {direction,mirror}`. Consumed by `tesseract-core::CharSet::{get_direction,get_mirror}`. No Core gap. Remaining UNICHARSET sub-leaf: the float stats (bbox ints + width/bearing/advance) inside the CSV. EPIPHANIES `E-CPP-PARITY-6`; TECH_DEBT (contract crate not fmt-gated in CI). +> > **2026-06-20 — IN PR (`claude/jirak-math-theorems-harvest-rfii13`)** — **kanban×Rubicon SoA value tenant + per-tenant counters (capstone S1 green).** NEW `ValueTenant::Kanban = 9` at value-slab `[112,120)` (8 B: `phase|exec|reserved|cycle`), added to `ValueSchema::{Cognitive,Full}` — reserve-don't-reclaim, **layout-preserving** (Full 112→120 B, stride 512 untouched, no version bump). `KanbanTenant` Copy view + `NodeRow::{kanban,set_kanban}` (owner-gated write / surreal read-only / Rubicon); `KanbanColumn`/`ExecTarget` `from_u8`. **Subsumes the envelope-pointer G1** — the node carries its own phase+cycle, pinning SoA↔kanban in the LE blob (a `FixedSizeBinary(512)` store reads kanban zero-copy at any version). NEW `tenant_counter` module + feature `tenant-counters` (default OFF, zero-cost no-op; one relaxed atomic/tenant-write when on) — the capstone NaN-census instrument; `set_kanban` is the first wired cascade point. Decisions kept (I-VSA-IDENTITIES + AGI-glove): thinking-style is ClassView+`Meta`, NOT a 128-bit tenant; plan-shape ClassView-derived; MUL flow-trigger is a function, not a tenant. 
Contract lib **714**/715(tenant-counters)/720(guid-v2-tail), clippy `-D warnings` + fmt clean all three. Refs: AGENT_LOG (cont.¹⁷), EPIPHANIES `E-KANBAN-IS-A-VALUE-TENANT-SUBSUMES-G1`, plan `capstone-cognitive-loop-wiring-nan-census-v1` (S1 green). > > **2026-06-20 — IN PR (`claude/jirak-math-theorems-harvest-rfii13`)** — **Zero-copy SoA read contract: `node_rows_from_le_bytes` (the surrealdb "second brain" primitive).** The inverse of `NodeRowPacket::as_le_bytes` (WRITE) — `canonical_node::node_rows_from_le_bytes(&[u8]) -> Option<&[NodeRow]>`, a CHECKED zero-copy cast (`len % 512 == 0` AND `ptr % 64 == 0`, else `None` → caller copies, no UB; empty→Some(empty)). This IS the LE contract a backing store satisfies so its bytes ARE the SoA the cognitive shader reads in place. **Brutal verdict:** lance-graph side now zero-copy-ready end-to-end; surrealdb's kv-lance does NOT qualify as scaffolded (`val: DataType::Binary` variable-length → needs `FixedSizeBinary(512)`), and value zero-copy holds only if stored UNcompressed (key/address always zero-copy). 712 contract lib green, clippy `-D warnings` both configs + fmt clean. Refs: AGENT_LOG 2026-06-20 (cont.¹⁴), EPIPHANIES `E-SURREALDB-SECOND-BRAIN-IS-ZERO-COPY-IFF-FIXEDSIZEBINARY`. @@ -160,6 +162,8 @@ Membrane consumers can now pull BOTH halves of a render `classid` BBB-safely fro > **2026-06-18 — ADDED (D-DO-ARM-1, the OGAR DO arm)**: `lance_graph_contract::action::{ActionState, StateGuard, ActionDef, ClassActions, actions_for, effective_actions, ActionInvocation}` — the Perdurant DO arm completing the OGAR IR (the action-axis sibling of `codegen_manifest`'s `MethodSig`/THINK). Both the 4-agent `sale_order` AR→DO probe (runtime-archaeologist) AND the merged cross-repo PR survey (ruff/OGAR/lance-graph/openproject/tesseract) agreed this was the ONE missing wire: the THINK arm (`classid → ClassView`, `has_function → MethodSig`) is converged + merged; the DO-arm `ActionInvocation`/`ActionDef` type was ABSENT. 
**`ActionDef`** (static, `const`-constructible, all `&'static`/`Copy`): `predicate` (= harvested `has_function` method), `object_class` (classid), `exec` (`ExecTarget` incl `SurrealQl`), `guard` (`StateGuard` = KausalSpec field==value), `required_role` (RBAC), `overrides` (OGAR `classid→ClassView` inheritance). **`ClassActions`+`actions_for`** (zero-fallback) mirror `ClassMethods`/`methods_for`. **`effective_actions(parent, child)`** = OGAR inheritance on the action axis (child overrides parent by predicate). **`ActionInvocation`** (dynamic, `Copy`): lifecycle `ActionState{Pending→Committed|Failed|Cancelled}` (sticky terminals), S2.5 `cycle` stamp, idempotency/trace keys, HLC `emitted_at_millis`. **`ActionInvocation::commit(def, actor, impact, now)`** is the gated egress — RBAC FIRST (`auth::ActorContext` must hold `required_role` or be admin → else `Failed`), THEN MUL impact (`mul::GateDecision`: `Flow→Committed`+stamped, `Hold→`Pending/escalate, `Block→Cancelled`). This IS "commit to the external consumer (odoo/openproject/woa/tesseract) after the cycle decides sound." Dispatched via `UnifiedStep`/`ExecTarget`, NOT a per-crate endpoint. Additive, zero-dep. +5 tests green. Consumer reference: `docs/OGAR_CONSUMER_API.md`. Branch `claude/soa-write-deinterlace-inc2`. +> **2026-06-20 — ADDED (D-UNICHARSET-DIR-MIRROR, the bidi-direction + mirror leaf)**: `lance_graph_contract::unicharset::UniCharSet` gained `get_direction(id) -> i32` + `get_mirror(id) -> i32` + `dump_direction()` + `dump_mirror()`, backed by `directions: Vec` + `mirrors: Vec`. The two columns after `other_case`, read by continuing the per-line token walk (the bbox+stats CSV is one whitespace token → fixed offsets across all 5 column tiers; no bespoke tier detector). `direction` = ICU `UCharDirection` code, load default `U_LEFT_TO_RIGHT` 0, out-of-range → `U_OTHER_NEUTRAL` 10 (`unicharset.h:712`). `mirror` clamped like other_case, out-of-range → -1 (`unicharset.h:721`). 
**Byte-identical 112/112 each** vs tesseract's own `get_direction`/`get_mirror` on real `eng.lstm-unicharset` (self-validating oracle; direction 6 distinct values, mirror 10 bracket pairs). Additive, zero-dep. +3 tests (26 unicharset total). Consumed by `tesseract-core::CharSet::{get_direction,get_mirror}`. EPIPHANIES `E-CPP-PARITY-6`; sixth leaf of `PROBE-OGAR-ADAPTER-UNICHARSET`; first to read past the bbox CSV. Remaining sub-leaf: the float stats inside the CSV. Branch `claude/happy-hamilton-0azlw4`. + > **2026-06-20 — ADDED (D-UNICHARSET-OTHERCASE, the case-pair leaf)**: `lance_graph_contract::unicharset::UniCharSet` gained `get_other_case(id) -> i32` + `dump_other_case()`, backed by `other_cases: Vec`. The case-paired unichar id (`'C'`→`'c'`), parsed as the token after the script and clamped at load (`unicharset.cpp:901`: a value `>= size`, and the absent default = size, fold to the id itself). Out-of-range id → `INVALID_UNICHAR_ID` -1 (`unicharset.h:703`). **Byte-identical 112/112** vs tesseract's own `get_other_case` on real `eng.lstm-unicharset` (self-validating oracle `other_case` mode; 60 self / 52 pairs). Additive, zero-dep. +4 tests (23 unicharset total). Consumed by `tesseract-core::CharSet::get_other_case`. EPIPHANIES `E-CPP-PARITY-5`; fifth leaf of `PROBE-OGAR-ADAPTER-UNICHARSET`; the last field reachable by token-offset (direction/mirror/bbox need the multi-tier parser). Branch `claude/happy-hamilton-0azlw4`. > **2026-06-20 — ADDED (D-UNICHARSET-SCRIPT, the script-table leaf)**: `lance_graph_contract::unicharset::UniCharSet` gained `get_script(id) -> i32` / `get_script_table_size()` / `script_from_script_id(sid) -> Option<&str>` / `script_of(id) -> Option<&str>` / `dump_script()`, backed by new `script_ids: Vec` + an interned `scripts: Vec`. 
The first leaf to transcode an **interning side-table** (`add_script`, `unicharset.cpp:1063`): `null_script` "NULL" seeded at sid 0 (the `unichar_insert` set_script, `unicharset.cpp:680` → `null_sid_ == 0`), real scripts intern from 1 in id order. Script name = token after the optional bbox/stats CSV (mixed-tier safe). Out-of-range → `null_sid_` 0 (`unicharset.h:681`). **Byte-identical 112/112** vs tesseract's own `get_script` on real `eng.lstm-unicharset` (self-validating oracle `script` mode; table `["NULL","Common","Latin"]`). Additive, zero-dep, behaviour-preserving on the bijection. +4 tests (19 unicharset total). Consumed by `tesseract-core::CharSet::{get_script,script_of}`. EPIPHANIES `E-CPP-PARITY-4`; fourth leaf of `PROBE-OGAR-ADAPTER-UNICHARSET`. Branch `claude/happy-hamilton-0azlw4`. diff --git a/.claude/board/TECH_DEBT.md b/.claude/board/TECH_DEBT.md index 4e1a42e5..3b14c6d5 100644 --- a/.claude/board/TECH_DEBT.md +++ b/.claude/board/TECH_DEBT.md @@ -52,6 +52,22 @@ enum, not a local rename. Canonical = the contract. Surfaced while grounding Deferred: cross-crate dep addition, out of scope for the convergence-probe increment. Same class as the resolved `CausalEdge64` shadow. +### TD-CONTRACT-NOT-FMT-GATED — `lance-graph-contract` is not fmt-checked in CI (2026-06-20) + +**Open.** `.github/workflows/style.yml` runs `cargo fmt --check` only on +`crates/lance-graph/` and `crates/deepnsm/` — NOT on `crates/lance-graph-contract/`. +Consequence: merged PRs (symbiont / SoA work) have left rustfmt-1.9.0 drift in +contract files — observed in `hhtl.rs:682`, `nan_projection.rs:125`, and +`soa_graph.rs:{178,248,326,413}` (all whitespace/wrapping, behaviour-preserving). +A local `cargo fmt -p lance-graph-contract -- --check` is therefore red on `main` +even when a given PR's own files are clean. This has been re-discovered 3× +(class_view.rs, nan_projection.rs, now hhtl/soa_graph) — recording it so the next +session doesn't rediscover it a 4th time. 
**Pay by** either adding a contract-crate `cargo fmt +--check` step to `style.yml` (and a one-shot `cargo fmt -p lance-graph-contract` +normalization commit), OR a deliberate decision to leave the contract crate +ungated. Until then: leaf PRs keep their OWN files fmt-clean and do not reformat +others' merged files (avoids muddied diffs + conflicts with in-flight PRs). + ### TD-ONTOLOGY-LINT — `lance-graph-ontology` pre-existing clippy (12) on toolchain 1.95 (2026-06-18) `cargo clippy -p lance-graph-ontology -- -D warnings` exits 101 with 12 errors on the pinned 1.95 toolchain — all **pre-existing on `main`** (e.g. `odoo_blueprint/op_emitter.rs:182` is byte-identical on `origin/main`), in `hydrators/owl.rs` (2), `odoo_blueprint/op_emitter.rs` (1), `ttl_parse.rs` (3), + others. Mostly mechanical (`iter_cloned_collect` → `.to_vec()`, etc.). The crate is not in the CI clippy sweep ("CI tests 4 of ~30 crates"), so the debt accumulated un-gated. Surfaced while wiring `class_id_for_guid` (E-OGAR-ONTOLOGY-WIRED-1; `registry.rs` itself is clippy-clean + fmt-clean). Fix is a focused lint pass, out of scope for the wiring increment. Same class as `TD-CAUSAL-EDGE-LINT`. diff --git a/crates/lance-graph-contract/examples/unicharset_dump.rs b/crates/lance-graph-contract/examples/unicharset_dump.rs index 04bd4d67..4f1a04c4 100644 --- a/crates/lance-graph-contract/examples/unicharset_dump.rs +++ b/crates/lance-graph-contract/examples/unicharset_dump.rs @@ -1,7 +1,7 @@ -//! Dump a `.unicharset`'s id→unichar table (default), its per-id property bits -//! (`properties` mode), its per-id script ids (`script` mode), or its per-id -//! case-pair ids (`other_case` mode) — the Rust side of the byte-parity probe -//! `PROBE-OGAR-ADAPTER-UNICHARSET`. +//! Dump a `.unicharset`'s id→unichar table (default) or a per-id column: +//! `properties` (category bits), `script` (script ids), `other_case` (case-pair +//! ids), `direction` (bidi codes), `mirror` (mirror ids) — the Rust side of the +//! 
byte-parity probe `PROBE-OGAR-ADAPTER-UNICHARSET`. //! //! ```sh //! # on a box with libtesseract + libleptonica installed: @@ -33,7 +33,9 @@ use lance_graph_contract::unicharset::UniCharSet; fn main() -> ExitCode { let Some(path) = std::env::args().nth(1) else { - eprintln!("usage: unicharset_dump [properties|script|other_case]"); + eprintln!( + "usage: unicharset_dump [properties|script|other_case|direction|mirror]" + ); return ExitCode::FAILURE; }; let mode = std::env::args().nth(2).unwrap_or_default(); @@ -43,6 +45,8 @@ fn main() -> ExitCode { "properties" => print!("{}", unicharset.dump_properties()), "script" => print!("{}", unicharset.dump_script()), "other_case" => print!("{}", unicharset.dump_other_case()), + "direction" => print!("{}", unicharset.dump_direction()), + "mirror" => print!("{}", unicharset.dump_mirror()), _ => print!("{}", unicharset.dump()), } ExitCode::SUCCESS diff --git a/crates/lance-graph-contract/src/unicharset.rs b/crates/lance-graph-contract/src/unicharset.rs index 28932392..e7fb37b5 100644 --- a/crates/lance-graph-contract/src/unicharset.rs +++ b/crates/lance-graph-contract/src/unicharset.rs @@ -64,6 +64,21 @@ //! folds to the id itself. [`UniCharSet::get_other_case`] mirrors the C++ //! accessor (`unicharset.h:703`): out-of-range id → `INVALID_UNICHAR_ID` (-1). //! [`UniCharSet::dump_other_case`] is the byte-parity surface. +//! +//! # Direction / mirror leaf +//! +//! The two columns after `other_case` (present on the CSV-bearing tiers): +//! `direction` (an ICU `UCharDirection` bidi code, `unicharset.h:175`) and +//! `mirror` (the mirror unichar id, e.g. `'('` ↔ `')'`). They are read by +//! continuing the same per-line token walk one and two positions past `other_case`; a tier +//! without them leaves the walk exhausted and both take their defaults. +//! [`UniCharSet::get_direction`] (`unicharset.h:712`) stores the parsed code +//! as-is (load default `U_LEFT_TO_RIGHT` 0) and returns `U_OTHER_NEUTRAL` (10) +//! 
for an out-of-range id — distinct from the load default. +//! [`UniCharSet::get_mirror`] (`unicharset.h:721`) is clamped exactly like +//! `other_case` and returns `INVALID_UNICHAR_ID` (-1) out of range. +//! [`UniCharSet::dump_direction`] / [`UniCharSet::dump_mirror`] are the +//! byte-parity surfaces. use std::collections::HashMap; use std::path::Path; @@ -91,6 +106,14 @@ pub struct UniCharSet { /// Clamped at load: a parsed value `>= size` (incl. the default) becomes the /// id itself (tesseract `unicharset.cpp:901`). other_cases: Vec<i32>, + /// id → bidi direction code (ICU `UCharDirection`), parallel to `reverse`. + /// Load default `U_LEFT_TO_RIGHT` (0) when the column is absent (tesseract + /// `unicharset.cpp:812,900`); stored as-is (no clamp). + directions: Vec<i32>, + /// id → the mirror unichar id, parallel to `reverse`. Clamped at load like + /// `other_case`: a parsed value `>= size` (incl. the default) becomes the id + /// itself (tesseract `unicharset.cpp:902`). + mirrors: Vec<i32>, } /// `isalpha` property bit (tesseract `unicharset.cpp:41`). @@ -116,6 +139,12 @@ const NULL_SID: i32 = 0; /// The C++ `INVALID_UNICHAR_ID` sentinel — what id-returning accessors yield for /// an out-of-range id (tesseract `unichar.h`; e.g. `get_other_case`). const INVALID_UNICHAR_ID: i32 = -1; +/// `Direction::U_LEFT_TO_RIGHT` (0) — the load default for `direction` when the +/// column is absent (tesseract `unicharset.cpp:812`, `unicharset.h:176`). +const U_LEFT_TO_RIGHT: i32 = 0; +/// `Direction::U_OTHER_NEUTRAL` (10) — what `get_direction` returns for an +/// out-of-range id (tesseract `unicharset.h:714`). Distinct from the load default. +const U_OTHER_NEUTRAL: i32 = 10; /// Intern `name` into `scripts` (insertion-order dedup), returning its index — /// the transcription of `UNICHARSET::add_script` (tesseract `unicharset.cpp:1063`). 
@@ -152,6 +181,8 @@ impl UniCharSet { let mut script_ids = Vec::with_capacity(count); let mut scripts: Vec = Vec::new(); let mut other_cases = Vec::with_capacity(count); + let mut directions = Vec::with_capacity(count); + let mut mirrors = Vec::with_capacity(count); let count_i32 = i32::try_from(count).unwrap_or(i32::MAX); for line in lines.take(count) { // The unichar is the first whitespace-delimited token; the id is the @@ -208,6 +239,21 @@ impl UniCharSet { .and_then(|t| t.parse::<i32>().ok()) .unwrap_or(count_i32); other_cases.push(if oc < count_i32 { oc } else { id_i32 }); + // direction + mirror follow other_case in the column tiers that carry + // them (the CSV-bearing lines); on a tier without them the iterator is + // exhausted and both take their defaults. direction is stored as-is + // (`unicharset.cpp:900`, default U_LEFT_TO_RIGHT). mirror is clamped + // like other_case (`unicharset.cpp:902`, absent default = size → id). + let dir = tokens + .next() + .and_then(|t| t.parse::<i32>().ok()) + .unwrap_or(U_LEFT_TO_RIGHT); + directions.push(dir); + let mir = tokens + .next() + .and_then(|t| t.parse::<i32>().ok()) + .unwrap_or(count_i32); + mirrors.push(if mir < count_i32 { mir } else { id_i32 }); } if reverse.len() != count { @@ -223,6 +269,8 @@ impl UniCharSet { script_ids, scripts, other_cases, + directions, + mirrors, }) } @@ -360,6 +408,30 @@ impl UniCharSet { .unwrap_or(INVALID_UNICHAR_ID) } + /// The bidi direction code (ICU `UCharDirection`) of `id`. Mirrors + /// `UNICHARSET::get_direction` (tesseract `unicharset.h:712`): an out-of-range + /// id returns `U_OTHER_NEUTRAL` (10) — note this differs from the load + /// default `U_LEFT_TO_RIGHT` (0) used for an absent column. + #[must_use] + pub fn get_direction(&self, id: u32) -> i32 { + self.directions + .get(id as usize) + .copied() + .unwrap_or(U_OTHER_NEUTRAL) + } + + /// The mirror unichar id of `id` (e.g. `'('` → the id of `')'`), or the id + /// itself when there is no mirror. 
Mirrors `UNICHARSET::get_mirror` (tesseract + /// `unicharset.h:721`): an out-of-range id (the `INVALID_UNICHAR_ID` sentinel) + /// returns `INVALID_UNICHAR_ID` (-1). + #[must_use] + pub fn get_mirror(&self, id: u32) -> i32 { + self.mirrors + .get(id as usize) + .copied() + .unwrap_or(INVALID_UNICHAR_ID) + } + /// Render the id→properties table as /// `"\t \n"` lines /// (each flag `0`/`1`) — the exact shape the C++ property oracle prints, so @@ -414,6 +486,36 @@ impl UniCharSet { out } + /// Render the id→direction table as `"<id>\t<direction>\n"` lines — the exact + /// shape the C++ `get_direction` oracle prints, so the byte-parity diff is + /// `diff oracle_direction.tsv rust_direction.tsv`. + #[must_use] + pub fn dump_direction(&self) -> String { + let mut out = String::new(); + for id in 0..self.reverse.len() as u32 { + out.push_str(&id.to_string()); + out.push('\t'); + out.push_str(&self.get_direction(id).to_string()); + out.push('\n'); + } + out + } + + /// Render the id→mirror table as `"<id>\t<mirror>\n"` lines — the exact shape + /// the C++ `get_mirror` oracle prints, so the byte-parity diff is + /// `diff oracle_mirror.tsv rust_mirror.tsv`. + #[must_use] + pub fn dump_mirror(&self) -> String { + let mut out = String::new(); + for id in 0..self.reverse.len() as u32 { + out.push_str(&id.to_string()); + out.push('\t'); + out.push_str(&self.get_mirror(id).to_string()); + out.push('\n'); + } + out + } + /// Render the id→unichar table as `"<id>\t<unichar>\n"` lines — the exact /// shape the C++ oracle harness prints, so a byte-parity diff is /// `diff oracle_dump.tsv rust_dump.tsv`. 
@@ -673,6 +775,44 @@ c 3 0,255,0,255,0,0,0,0,0,0 Latin 0 0 1 c assert_eq!(u.dump_other_case(), "0\t1\n1\t0\n2\t2\n"); } + /// id 0 `(` mirrors id 1 `)` (direction 10 = U_OTHER_NEUTRAL); id 2 `7` is a + /// European number (direction 2) with an out-of-range mirror (99 ≥ size 4 → + /// self); id 3 is a tier-5 line with no direction/mirror columns → defaults + /// (direction 0 = U_LEFT_TO_RIGHT, mirror = self). 
+ const DIR_MIRROR_SAMPLE: &str = "\ +4 +( 10 0,255,0,255,0,0,0,0,0,0 Common 0 10 1 ( +) 10 0,255,0,255,0,0,0,0,0,0 Common 0 10 0 ) +7 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 99 7 +NULL 0 Common 0 +"; + + #[test] + fn direction_decodes_with_default_and_neutral_out_of_range() { + let u = UniCharSet::load_from_str(DIR_MIRROR_SAMPLE).expect("valid"); + assert_eq!(u.get_direction(0), 10); // U_OTHER_NEUTRAL, parsed + assert_eq!(u.get_direction(2), 2); // U_EUROPEAN_NUMBER, parsed + assert_eq!(u.get_direction(3), 0); // tier-5 absent -> load default LTR + assert_eq!(u.get_direction(99), 10); // out-of-range -> U_OTHER_NEUTRAL (not 0) + } + + #[test] + fn mirror_decodes_and_clamps() { + let u = UniCharSet::load_from_str(DIR_MIRROR_SAMPLE).expect("valid"); + assert_eq!(u.get_mirror(0), 1); // ( -> ) + assert_eq!(u.get_mirror(1), 0); // ) -> ( + assert_eq!(u.get_mirror(2), 2); // 99 >= size -> clamped to self + assert_eq!(u.get_mirror(3), 3); // tier-5 absent -> default self + assert_eq!(u.get_mirror(99), -1); // out-of-range -> INVALID_UNICHAR_ID + } + + #[test] + fn dump_direction_and_mirror_match_oracle_shape() { + let u = UniCharSet::load_from_str(DIR_MIRROR_SAMPLE).expect("valid"); + assert_eq!(u.dump_direction(), "0\t10\n1\t10\n2\t2\n3\t0\n"); + assert_eq!(u.dump_mirror(), "0\t1\n1\t0\n2\t2\n3\t3\n"); + } + #[test] fn errors_are_typed() { assert_eq!(UniCharSet::load_from_str(""), Err(UniCharSetError::Empty));