From a3857a1971a012da9404c65678f4d61625cde11d Mon Sep 17 00:00:00 2001 From: d-laub Date: Tue, 30 Jun 2026 22:56:22 -0700 Subject: [PATCH 1/4] docs: sync docs with rust/rayon migration + add docs-audit gate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Recent Phase 5 work (numba read-path deleted → Rust-only, awkward → _core migration, new rayon threading env vars) left user-facing docs stale. - faq: rewrite Ragged answer (seqpro _core.Ragged Rust backend, not Awkward/numba); document GVL_NUM_THREADS / GVL_FORCE_PARALLEL / RAYON_NUM_THREADS override - README: drop tbb/pyomp-for-numba note; parallelism is built-in Rust/rayon - SKILL: _core.Ragged is a Rust backend (rag layer is numba-free) - CLAUDE.md: require a docs audit before feature/breaking-change PRs Co-Authored-By: Claude Opus 4.8 --- CLAUDE.md | 12 +++++ README.md | 2 +- docs/source/faq.md | 10 +++- ...026-06-30-docs-consistency-audit-design.md | 50 +++++++++++++++++++ skills/genvarloader/SKILL.md | 2 +- 5 files changed, 73 insertions(+), 3 deletions(-) create mode 100644 docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md diff --git a/CLAUDE.md b/CLAUDE.md index 42ca5a1b..53807b6f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -238,6 +238,18 @@ In scope: When a change ships, update the relevant section of the skill and re-check the "Common gotchas" and "Where to look next" pointer table. The skill is published to https://www.skills.sh/ as `mcvickerlab/GenVarLoader` (installable via `npx skills add mcvickerlab/GenVarLoader`); keep it accurate against `main`. 
+## Docs audit before feature/breaking-change PRs + +Before opening any PR that adds a user-facing feature or makes a breaking change, audit and update the user-facing docs so they stay consistent with the code: + +- `README.md` (features, installation, requirements) +- `docs/source/*.md` — especially `faq.md`, `write.md`, `dataset.md`, `format.md`, `index.md` +- `skills/genvarloader/SKILL.md` (see "Maintaining the `genvarloader` skill" above) + +Check for: now-false claims (deleted backends, removed deps, changed defaults, renamed/removed symbols), new user-facing config or environment variables that need documenting, and changed installation/preprocessing (bcftools/plink2) requirements. + +The auto-generated `docs/source/changelog.md` (built from commit messages via `changelog.md.j2`) does **not** count as documentation — never treat a changelog entry as a substitute for prose docs. This gate complements the skill-maintenance rule above: public-API changes must update the skill, and any user-facing change must also keep the prose docs true. + ## Rust migration roadmap Any task that mentions "rust" (adding or porting Rust code, touching `src/`, or migrating numba/Python hot paths) **must** read `docs/roadmaps/rust-migration.md` before starting and update it as part of the work — tick completed tasks, record measurement results under the relevant checkpoint, and set the phase status marker (⬜/🚧/✅) + PR link. The roadmap is the source of truth for migration sequencing and the byte-identical parity contract. diff --git a/README.md b/README.md index 1843c067..7c4513e7 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ Documentation is available [here](https://genvarloader.readthedocs.io/). See our pip install genvarloader ``` -A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/). 
`tbb` and/or `pyomp` are optional dependencies but highly recommended as they can improve throughput for parallelized numba code. +A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/). GenVarLoader parallelizes its data-loading hot paths in Rust (rayon) out of the box, with no extra dependencies required; you can tune the worker count with the `GVL_NUM_THREADS` environment variable (see the [FAQ](https://genvarloader.readthedocs.io/en/latest/faq.html)). ## Contributing diff --git a/docs/source/faq.md b/docs/source/faq.md index 05409bf9..bb3b8a32 100644 --- a/docs/source/faq.md +++ b/docs/source/faq.md @@ -24,7 +24,7 @@ ragged = gvl.Ragged.from_offsets(data, shape, offsets) # ] ``` -Ragged arrays are subclasses of [Awkward Arrays](https://github.com/scikit-hep/awkward), so anything you can do with Awkward Arrays you can do with Ragged arrays. Within GVL, we use numba JIT'd functions to compute on the ragged objects' buffers directly since it's relatively straightforward (i.e. iterating over the rows of `data` via the `offsets` array). +Ragged arrays are backed by [`seqpro`](https://github.com/ML4GLand/SeqPro)'s `Ragged` type (a Rust-backed `_core.Ragged`). GVL computes on the `data` and `offsets` buffers directly in Rust, which is relatively straightforward (i.e. iterating over the rows of `data` via the `offsets` array). (Earlier releases subclassed [Awkward Arrays](https://github.com/scikit-hep/awkward); GVL no longer depends on `awkward`.) .. note:: @@ -63,6 +63,14 @@ bcftools view -Hp $vcf | wc -l plink2 --pgen-info $prefix ``` +## How do I control how many threads GVL uses? + +GVL's read path (haplotype reconstruction and track re-alignment) is parallelized in Rust with [rayon](https://github.com/rayon-rs/rayon). 
By default it uses one worker per available CPU, detected from the Linux cgroup cpuset (`sched_getaffinity`) so it respects container limits, and falling back to `os.cpu_count()` elsewhere. Three environment variables tune this: + +- **`GVL_NUM_THREADS`** — set the worker count explicitly (e.g. `GVL_NUM_THREADS=4`). Overrides cgroup detection. Resolved once, on first use, so set it before your first GVL call. +- **`GVL_FORCE_PARALLEL`** — set to a truthy value (`1`, `true`, `yes`, `on`) to force the multithreaded paths even on small inputs. By default GVL runs small inputs serially because thread overhead would dominate; this bypasses that size gate. Mainly useful for benchmarking. +- **`RAYON_NUM_THREADS`** — GVL **overwrites** this with its own resolved count so an inherited value (e.g. baked into a base image) can't defeat the cgroup-aware cap. To size the pool yourself, use `GVL_NUM_THREADS` instead. + ## How can I get personalized protein/spliced RNA sequences? This is not yet supported but on GVL's roadmap for the near future. Keep an eye out in future releases! diff --git a/docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md b/docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md new file mode 100644 index 00000000..0597e0fb --- /dev/null +++ b/docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md @@ -0,0 +1,50 @@ +# Docs consistency pass + CLAUDE.md docs-audit gate + +**Date:** 2026-06-30 +**Branch:** `docs/consistency-audit` + +## Problem + +Recent gvl work (Phase 5: numba read-path backend deleted → Rust-only; awkward → +`_core.Ragged` migration; new rayon threading knobs) left user-facing docs stale, +and there is no process gate ensuring docs stay consistent with future +feature/breaking-change PRs. + +Key facts established during the audit: +- gvl's **own** code is numba-free (`pixi.toml` comment; `tests/parity/test_import_no_numba.py`). 
+ Numba survives only as a conda pin because **seqpro** transitively imports it. The + residual `_numba`-suffixed names in gvl route only to Rust or numpy. +- The read path is parallelized in Rust with rayon, tuned via env vars in + `python/genvarloader/_threads.py`: `GVL_NUM_THREADS`, `GVL_FORCE_PARALLEL`, and a + `RAYON_NUM_THREADS` override (issue #263). None were documented user-side. + +## Scope (focused fix — not a full line-by-line sweep) + +### Part A — docs fixes +1. `docs/source/faq.md` — rewrite the "Ragged objects" answer's stale + "subclass of Awkward Arrays / numba JIT'd functions" paragraph to reflect the + `seqpro.rag.Ragged` (`_core.Ragged`, Rust) backend; note awkward is no longer a dep. +2. `docs/source/faq.md` — new entry "How do I control how many threads GVL uses?" + documenting the three env vars, sourced from `_threads.py`. +3. `README.md` — replace the `tbb`/`pyomp`-for-numba install note with a note that + parallelism is built-in (Rust/rayon), tunable via `GVL_NUM_THREADS`. +4. `skills/genvarloader/SKILL.md` — `_core.Ragged` "Rust+numba backend" → "Rust backend" + (seqpro-core's rag layer is numba-free). +5. Targeted leftover sweep of README + `docs/source/*.md` + SKILL.md for other + `numba`/`awkward`/`GVL_BACKEND`/`tbb`/`pyomp` references — none remaining (the + surviving `awkward` mentions in SKILL.md describe "zero-awkward" as a feature). + +The auto-generated `docs/source/changelog.md` is left untouched (built from commit +messages via `changelog.md.j2`). + +### Part B — CLAUDE.md gate +Add a "Docs audit before feature/breaking-change PRs" section that requires auditing +README + `docs/source/*.md` + SKILL.md before such PRs, lists what to check +(now-false claims, new config/env vars, changed preprocessing), and states the +auto-generated changelog does not count as documentation. Complements the existing +skill-maintenance rule. + +## Verification +- Markdown edits are prose-only in existing files with no new MyST directives. 
+- Full `pixi run -e docs doc` build not run in-worktree (docs env not provisioned there); + low build-break risk given no directive changes. diff --git a/skills/genvarloader/SKILL.md b/skills/genvarloader/SKILL.md index b04835a8..ea0da1d5 100644 --- a/skills/genvarloader/SKILL.md +++ b/skills/genvarloader/SKILL.md @@ -341,7 +341,7 @@ Footprint is computed exactly via `Dataset._output_bytes_per_instance(...)` (use - `gvl.Reference.from_path(fasta, contigs=None)` — wrap a FASTA (path to a `.fa`/`.fa.bgz`, or a `.gvlfa` cache dir). Builds/reuses a sibling `.gvlfa` cache directory (self-describing, fingerprint-validated; legacy `.fa.gvl` caches auto-migrate). The cache is built atomically (temp + `os.replace`) under a best-effort lock, so concurrent builders sharing one reference are safe; the cache **auto-rebuilds** from its source when stale or missing. - `gvl.read_bedlike(path)` / `gvl.with_length(bed, L)` — BED helpers (re-exported from `seqpro`). -- `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals` — ragged return containers. All are backed by `seqpro.rag.Ragged` (`_core.Ragged` Rust+numba backend); **not** `awkward`. `RaggedVariants` is a **subclass** of `seqpro.rag.Ragged` (`class RaggedVariants(seqpro.rag.Ragged)`), so `isinstance(rv, Ragged) is True`. Structural methods — indexing, `reshape`, `squeeze`, `to_packed` — are inherited from the base and **preserve the `RaggedVariants` type** (positional/structural operations return `RaggedVariants`). A **string key** (`rv["start"]`) returns a bare `Ragged` field, not a `RaggedVariants`. `reshape` takes the new shape either as unpacked ints — e.g. `rv.reshape(1, 2, None)` — or as a single tuple `rv.reshape((1, 2, None))`; the base `Ragged` signature accepts both. `squeeze(axis=None)` is a real axis-squeeze (base semantics) — it squeezes any size-1 axis, **not** a fixed "drop axis 0". An int index collapses the leading axis (numpy-consistent); slice/array indexing preserves it. 
Named properties (`.alt`, `.ref`, `.start`, `.ilen`, `.end`) are the primary access point; extra fields (e.g. `AF`, custom FORMAT fields) are also accessible via `rv["field"]` or `rv.field` (via `__getattr__`). `RaggedVariants` itself does not define `__eq__` (wrapper-level `==` is Python object-identity, not element-wise); to compare contents, compare individual fields — e.g. `rv["alt"] == other_alt` or `rv.start == other_start` — which use `seqpro.rag.Ragged`'s ufunc-based (element-wise) comparison. Domain methods retained on `RaggedVariants`: `.rc_()`, `.pad()`, `.to_nested_tensor_batch()`; derived read-only properties: `.ilen`, `.end`; fields: `.alt`, `.ref`, `.start`, `.dosage`. +- `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals` — ragged return containers. All are backed by `seqpro.rag.Ragged` (`_core.Ragged` Rust backend); **not** `awkward`. `RaggedVariants` is a **subclass** of `seqpro.rag.Ragged` (`class RaggedVariants(seqpro.rag.Ragged)`), so `isinstance(rv, Ragged) is True`. Structural methods — indexing, `reshape`, `squeeze`, `to_packed` — are inherited from the base and **preserve the `RaggedVariants` type** (positional/structural operations return `RaggedVariants`). A **string key** (`rv["start"]`) returns a bare `Ragged` field, not a `RaggedVariants`. `reshape` takes the new shape either as unpacked ints — e.g. `rv.reshape(1, 2, None)` — or as a single tuple `rv.reshape((1, 2, None))`; the base `Ragged` signature accepts both. `squeeze(axis=None)` is a real axis-squeeze (base semantics) — it squeezes any size-1 axis, **not** a fixed "drop axis 0". An int index collapses the leading axis (numpy-consistent); slice/array indexing preserves it. Named properties (`.alt`, `.ref`, `.start`, `.ilen`, `.end`) are the primary access point; extra fields (e.g. `AF`, custom FORMAT fields) are also accessible via `rv["field"]` or `rv.field` (via `__getattr__`). 
`RaggedVariants` itself does not define `__eq__` (wrapper-level `==` is Python object-identity, not element-wise); to compare contents, compare individual fields — e.g. `rv["alt"] == other_alt` or `rv.start == other_start` — which use `seqpro.rag.Ragged`'s ufunc-based (element-wise) comparison. Domain methods retained on `RaggedVariants`: `.rc_()`, `.pad()`, `.to_nested_tensor_batch()`; derived read-only properties: `.ilen`, `.end`; fields: `.alt`, `.ref`, `.start`, `.dosage`. - `gvl.FlatRagged` — flat analog of `Ragged`: `.data` (flat numpy array), `.offsets` (int64), `.shape`. Methods: `.to_ragged()`, `.to_fixed(length)`, `.to_padded(pad_value)`, `.reshape(shape)`, `.squeeze(axis)`. Source: `python/genvarloader/_flat.py`. - `gvl.FlatIntervals` — flat-buffer interval container returned by `with_tracks(kind="intervals")` + `with_output_format("flat")`. Fields `.starts`/`.ends`/`.values` are `FlatRagged`; `.to_ragged()` → `RaggedIntervals`; `.reshape(...)`, `.squeeze(...)`, `.shape`. Source: `python/genvarloader/_ragged.py`. - `gvl.FlatAnnotatedHaps` — flat analog of `RaggedAnnotatedHaps`: fields `.haps`, `.var_idxs`, `.ref_coords` (each a `FlatRagged`). Methods: `.to_ragged()`, `.to_fixed(length)`, `.to_padded()`, `.reshape(shape)`, `.squeeze(axis)`. Source: `python/genvarloader/_flat.py`. From eaf44ceff68edc366d3c70d722549350e455ef3d Mon Sep 17 00:00:00 2001 From: d-laub Date: Tue, 30 Jun 2026 23:00:49 -0700 Subject: [PATCH 2/4] docs(api): document 18 missing public symbols in api.md api.md drifted from __init__.__all__. 
Adds autodoc entries for every undocumented public symbol (verified: 39/39 __all__ names now present): - Insertion fill: InsertionFill + Constant/FlankSample/Interpolate/ Repeat5p/Repeat5pNormalized (new section) - Flat containers: FlatRagged/FlatAnnotatedHaps/FlatIntervals/ FlatVariants/FlatAlleles/FlatVariantWindows (new subsection) - Variant windows: VarWindowOpt, DummyVariant - Dataset maintenance: migrate, migrate_svar_link (new section) - Writing: update (sibling of write) - PyTorch interop: to_nested_tensor Co-Authored-By: Claude Opus 4.8 --- docs/source/api.md | 86 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/docs/source/api.md b/docs/source/api.md index 382f0f88..ae3fb0e9 100644 --- a/docs/source/api.md +++ b/docs/source/api.md @@ -7,6 +7,8 @@ .. autofunction:: write +.. autofunction:: update + .. autofunction:: get_splice_bed .. autofunction:: read_bedlike @@ -22,6 +24,44 @@ :exclude-members: __new__ ``` +## Insertion fill + +Strategies controlling how re-aligned track values are filled across inserted bases (indels). Pass an instance to [`gvl.Dataset.with_insertion_fill()`](#genvarloader.Dataset.with_insertion_fill). `InsertionFill` is the abstract base; instantiate one of the concrete strategies. + +```{eval-rst} +.. currentmodule:: genvarloader + +.. autoclass:: InsertionFill + :members: + +.. autoclass:: Constant + :members: + +.. autoclass:: FlankSample + :members: + +.. autoclass:: Interpolate + :members: + +.. autoclass:: Repeat5p + :members: + +.. autoclass:: Repeat5pNormalized + :members: +``` + +## Dataset maintenance + +Utilities for upgrading on-disk datasets written by older GVL versions. + +```{eval-rst} +.. currentmodule:: genvarloader + +.. autofunction:: migrate + +.. autofunction:: migrate_svar_link +``` + ## Reading ### Personalized data @@ -35,6 +75,12 @@ .. autofunction:: get_dummy_dataset +.. autoclass:: DummyVariant + :members: + +.. autoclass:: VarWindowOpt + :members: + .. 
autoclass:: RaggedDataset :exclude-members: __new__, __init__ @@ -102,4 +148,44 @@ Classes that GVL Datasets may return. .. autoclass:: RaggedIntervals :members: :exclude-members: __init__ +``` + +### Flat containers + +Returned in place of the ragged containers when a Dataset uses [`with_output_format("flat")`](#genvarloader.Dataset.with_output_format). Each carries flat `data`/`offsets` buffers and a `to_ragged()` escape hatch back to the ragged form. + +```{eval-rst} +.. currentmodule:: genvarloader + +.. autoclass:: FlatRagged + :members: + :exclude-members: __init__ + +.. autoclass:: FlatAnnotatedHaps + :members: + :exclude-members: __init__ + +.. autoclass:: FlatIntervals + :members: + :exclude-members: __init__ + +.. autoclass:: FlatVariants + :members: + :exclude-members: __init__ + +.. autoclass:: FlatAlleles + :members: + :exclude-members: __init__ + +.. autoclass:: FlatVariantWindows + :members: + :exclude-members: __init__ +``` + +### PyTorch interop + +```{eval-rst} +.. currentmodule:: genvarloader + +.. autofunction:: to_nested_tensor ``` \ No newline at end of file From 68b3dff778cd9a1f9bf96b6b26ec657298d00ac6 Mon Sep 17 00:00:00 2001 From: d-laub Date: Tue, 30 Jun 2026 23:01:06 -0700 Subject: [PATCH 3/4] =?UTF-8?q?docs(claude):=20add=20api.md=20=E2=86=94=20?= =?UTF-8?q?=5F=5Fall=5F=5F=20sync=20check=20to=20docs-audit=20gate?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The api.md drift (18 missing symbols) motivates an explicit gate check: list api.md alongside the audited docs and add a one-liner that flags any __all__ export missing from the API reference. 
Co-Authored-By: Claude Opus 4.8 --- CLAUDE.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/CLAUDE.md b/CLAUDE.md index 53807b6f..879d0796 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -243,11 +243,17 @@ When a change ships, update the relevant section of the skill and re-check the " Before opening any PR that adds a user-facing feature or makes a breaking change, audit and update the user-facing docs so they stay consistent with the code: - `README.md` (features, installation, requirements) -- `docs/source/*.md` — especially `faq.md`, `write.md`, `dataset.md`, `format.md`, `index.md` +- `docs/source/*.md` — especially `api.md`, `faq.md`, `write.md`, `dataset.md`, `format.md`, `index.md` - `skills/genvarloader/SKILL.md` (see "Maintaining the `genvarloader` skill" above) Check for: now-false claims (deleted backends, removed deps, changed defaults, renamed/removed symbols), new user-facing config or environment variables that need documenting, and changed installation/preprocessing (bcftools/plink2) requirements. +**`api.md` must stay in sync with `__all__`.** Every symbol exported in `python/genvarloader/__init__.py`'s `__all__` needs an autodoc entry in `docs/source/api.md`; adding a public symbol without one silently drops it from the rendered API reference. Quick check (whole-word match, so `migrate` is not masked by `migrate_svar_link`): + +```bash +python -c "import re,genvarloader as g; api=open('docs/source/api.md').read(); print('MISSING:', [n for n in g.__all__ if not re.search(rf'\b{n}\b', api)] or 'none')" +``` + The auto-generated `docs/source/changelog.md` (built from commit messages via `changelog.md.j2`) does **not** count as documentation — never treat a changelog entry as a substitute for prose docs. This gate complements the skill-maintenance rule above: public-API changes must update the skill, and any user-facing change must also keep the prose docs true. 
## Rust migration roadmap From e8ea5dd0a4d372e84dc27444ba9b351c1e759bd5 Mon Sep 17 00:00:00 2001 From: d-laub Date: Tue, 30 Jun 2026 23:15:16 -0700 Subject: [PATCH 4/4] chore(docs): remove obsolete install-e task The install-e task editable-installed sibling seqpro/genoray from hardcoded absolute cluster paths. It was never required: seqpro/genoray are transitive deps via pyproject.toml, and the locked seqpro 0.20.0 already provides every symbol gvl imports (e.g. seqpro.rag.reverse_complement). The docs build only failed on machines whose docs env had drifted stale from the lock (an ancient seqpro 0.11.0); `pixi install -e docs` reconciles it, no editable install needed. Verified: docs build cleanly on darwin (osx-arm64) with the plain PyPI wheels after reconciling the env. Co-Authored-By: Claude Opus 4.8 --- pixi.toml | 1 - 1 file changed, 1 deletion(-) diff --git a/pixi.toml b/pixi.toml index 3e54e402..e05e09ed 100644 --- a/pixi.toml +++ b/pixi.toml @@ -163,7 +163,6 @@ cargo-test = { cmd = "cargo test --release" } memray-write = { cmd = "memray run -fo tests/benchmarks/profiling/write.memray.bin tests/benchmarks/profiling/profile_write.py --op write" } [feature.docs.tasks] -install-e = "uv pip install -e /cellar/users/dlaub/projects/ML4GLand/SeqPro -e /cellar/users/dlaub/projects/genoray -e ." i-kernel = "ipython kernel install --user --name 'gvl-docs' --display-name 'GVL Docs'" i-kernel-gpu = "ipython kernel install --user --name 'gvl-docs-gpu' --display-name 'GVL Docs GPU'" doc = "cd docs && make clean && make html"
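
Reviewer note on PATCH 1/4: the thread-count rules that the new FAQ entry describes (explicit `GVL_NUM_THREADS` wins, otherwise an affinity-aware CPU count, resolved once, with `RAYON_NUM_THREADS` overwritten by the result) can be sketched roughly as below. This is an illustrative sketch only, not the contents of `python/genvarloader/_threads.py`: the function names `resolve_num_threads` and `force_parallel`, the case-insensitive truthy matching, and the behavior on a non-integer `GVL_NUM_THREADS` (here, a plain `ValueError`) are all assumptions made for the example.

```python
import os
from functools import lru_cache

# Truthy spellings listed in the FAQ entry; case-insensitive matching is an assumption.
_TRUTHY = {"1", "true", "yes", "on"}


def _available_cpus() -> int:
    """CPUs this process may run on, honoring affinity limits where the OS exposes them."""
    if hasattr(os, "sched_getaffinity"):  # present on Linux, absent on macOS/Windows
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


@lru_cache(maxsize=None)
def resolve_num_threads() -> int:
    """Hypothetical resolver: computed on first call, cached for every later call."""
    raw = os.environ.get("GVL_NUM_THREADS")
    # An explicit setting overrides detection; int() raises ValueError on bad input.
    n = max(1, int(raw)) if raw else _available_cpus()
    # Overwrite any inherited RAYON_NUM_THREADS with the resolved count.
    os.environ["RAYON_NUM_THREADS"] = str(n)
    return n


def force_parallel() -> bool:
    """Hypothetical helper: True when GVL_FORCE_PARALLEL holds a truthy value."""
    return os.environ.get("GVL_FORCE_PARALLEL", "").strip().lower() in _TRUTHY
```

The cache is what makes "set it before your first GVL call" matter in this sketch: changing `GVL_NUM_THREADS` after the first `resolve_num_threads()` call has no effect until the process restarts.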