Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -238,6 +238,24 @@ In scope:

When a change ships, update the relevant section of the skill and re-check the "Common gotchas" and "Where to look next" pointer table. The skill is published to https://www.skills.sh/ as `mcvickerlab/GenVarLoader` (installable via `npx skills add mcvickerlab/GenVarLoader`); keep it accurate against `main`.

## Docs audit before feature/breaking-change PRs

Before opening any PR that adds a user-facing feature or makes a breaking change, audit and update the user-facing docs so they stay consistent with the code:

- `README.md` (features, installation, requirements)
- `docs/source/*.md` — especially `api.md`, `faq.md`, `write.md`, `dataset.md`, `format.md`, `index.md`
- `skills/genvarloader/SKILL.md` (see "Maintaining the `genvarloader` skill" above)

Check for: now-false claims (deleted backends, removed deps, changed defaults, renamed/removed symbols), new user-facing config or environment variables that need documenting, and changed installation/preprocessing (bcftools/plink2) requirements.

**`api.md` must stay in sync with `__all__`.** Every symbol exported in `python/genvarloader/__init__.py`'s `__all__` needs an autodoc entry in `docs/source/api.md`; adding a public symbol without one silently drops it from the rendered API reference. Quick check:

```bash
python -c "import re,genvarloader as g; api=open('docs/source/api.md').read(); print('MISSING:', [n for n in g.__all__ if n not in api] or 'none')"
```

The auto-generated `docs/source/changelog.md` (built from commit messages via `changelog.md.j2`) does **not** count as documentation — never treat a changelog entry as a substitute for prose docs. This gate complements the skill-maintenance rule above: public-API changes must update the skill, and any user-facing change must also keep the prose docs true.

## Rust migration roadmap

Any task that mentions "rust" (adding or porting Rust code, touching `src/`, or migrating numba/Python hot paths) **must** read `docs/roadmaps/rust-migration.md` before starting and update it as part of the work — tick completed tasks, record measurement results under the relevant checkpoint, and set the phase status marker (⬜/🚧/✅) + PR link. The roadmap is the source of truth for migration sequencing and the byte-identical parity contract.
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ Documentation is available [here](https://genvarloader.readthedocs.io/). See our
pip install genvarloader
```

A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/). `tbb` and/or `pyomp` are optional dependencies but highly recommended as they can improve throughput for parallelized numba code.
A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/). GenVarLoader parallelizes its data-loading hot paths in Rust (rayon) out of the box, with no extra dependencies required; you can tune the worker count with the `GVL_NUM_THREADS` environment variable (see the [FAQ](https://genvarloader.readthedocs.io/en/latest/faq.html)).

## Contributing

Expand Down
86 changes: 86 additions & 0 deletions docs/source/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@

.. autofunction:: write

.. autofunction:: update

.. autofunction:: get_splice_bed

.. autofunction:: read_bedlike
Expand All @@ -22,6 +24,44 @@
:exclude-members: __new__
```

## Insertion fill

Strategies controlling how re-aligned track values are filled across inserted bases (indels). Pass an instance to [`gvl.Dataset.with_insertion_fill()`](#genvarloader.Dataset.with_insertion_fill). `InsertionFill` is the abstract base; instantiate one of the concrete strategies.

```{eval-rst}
.. currentmodule:: genvarloader

.. autoclass:: InsertionFill
:members:

.. autoclass:: Constant
:members:

.. autoclass:: FlankSample
:members:

.. autoclass:: Interpolate
:members:

.. autoclass:: Repeat5p
:members:

.. autoclass:: Repeat5pNormalized
:members:
```

## Dataset maintenance

Utilities for upgrading on-disk datasets written by older GVL versions.

```{eval-rst}
.. currentmodule:: genvarloader

.. autofunction:: migrate

.. autofunction:: migrate_svar_link
```

## Reading

### Personalized data
Expand All @@ -35,6 +75,12 @@

.. autofunction:: get_dummy_dataset

.. autoclass:: DummyVariant
:members:

.. autoclass:: VarWindowOpt
:members:

.. autoclass:: RaggedDataset
:exclude-members: __new__, __init__

Expand Down Expand Up @@ -102,4 +148,44 @@ Classes that GVL Datasets may return.
.. autoclass:: RaggedIntervals
:members:
:exclude-members: __init__
```

### Flat containers

Returned in place of the ragged containers when a Dataset uses [`with_output_format("flat")`](#genvarloader.Dataset.with_output_format). Each carries flat `data`/`offsets` buffers and a `to_ragged()` escape hatch back to the ragged form.

```{eval-rst}
.. currentmodule:: genvarloader

.. autoclass:: FlatRagged
:members:
:exclude-members: __init__

.. autoclass:: FlatAnnotatedHaps
:members:
:exclude-members: __init__

.. autoclass:: FlatIntervals
:members:
:exclude-members: __init__

.. autoclass:: FlatVariants
:members:
:exclude-members: __init__

.. autoclass:: FlatAlleles
:members:
:exclude-members: __init__

.. autoclass:: FlatVariantWindows
:members:
:exclude-members: __init__
```

### PyTorch interop

```{eval-rst}
.. currentmodule:: genvarloader

.. autofunction:: to_nested_tensor
```
10 changes: 9 additions & 1 deletion docs/source/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ ragged = gvl.Ragged.from_offsets(data, shape, offsets)
# ]
```

Ragged arrays are subclasses of [Awkward Arrays](https://github.com/scikit-hep/awkward), so anything you can do with Awkward Arrays you can do with Ragged arrays. Within GVL, we use numba JIT'd functions to compute on the ragged objects' buffers directly since it's relatively straightforward (i.e. iterating over the rows of `data` via the `offsets` array).
Ragged arrays are backed by [`seqpro`](https://github.com/ML4GLand/SeqPro)'s `Ragged` type (a Rust-backed `_core.Ragged`). GVL computes on the `data` and `offsets` buffers directly in Rust, which is relatively straightforward (i.e. iterating over the rows of `data` via the `offsets` array). (Earlier releases subclassed [Awkward Arrays](https://github.com/scikit-hep/awkward); GVL no longer depends on `awkward`.)

.. note::

Expand Down Expand Up @@ -63,6 +63,14 @@ bcftools view -Hp $vcf | wc -l
plink2 --pgen-info $prefix
```

## How do I control how many threads GVL uses?

GVL's read path (haplotype reconstruction and track re-alignment) is parallelized in Rust with [rayon](https://github.com/rayon-rs/rayon). By default it uses one worker per available CPU, detected from the Linux cgroup cpuset (`sched_getaffinity`) so it respects container limits, and falling back to `os.cpu_count()` elsewhere. Three environment variables tune this:

- **`GVL_NUM_THREADS`** — set the worker count explicitly (e.g. `GVL_NUM_THREADS=4`). Overrides cgroup detection. Resolved once, on first use, so set it before your first GVL call.
- **`GVL_FORCE_PARALLEL`** — set to a truthy value (`1`, `true`, `yes`, `on`) to force the multithreaded paths even on small inputs. By default GVL runs small inputs serially because thread overhead would dominate; this bypasses that size gate. Mainly useful for benchmarking.
- **`RAYON_NUM_THREADS`** — GVL **overwrites** this with its own resolved count so an inherited value (e.g. baked into a base image) can't defeat the cgroup-aware cap. To size the pool yourself, use `GVL_NUM_THREADS` instead.

## How can I get personalized protein/spliced RNA sequences?

This is not yet supported but on GVL's roadmap for the near future. Keep an eye out in future releases!
Expand Down
50 changes: 50 additions & 0 deletions docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
# Docs consistency pass + CLAUDE.md docs-audit gate

**Date:** 2026-06-30
**Branch:** `docs/consistency-audit`

## Problem

Recent gvl work (Phase 5: numba read-path backend deleted → Rust-only; awkward →
`_core.Ragged` migration; new rayon threading knobs) left user-facing docs stale,
and there is no process gate ensuring docs stay consistent with future
feature/breaking-change PRs.

Key facts established during the audit:
- gvl's **own** code is numba-free (`pixi.toml` comment; `tests/parity/test_import_no_numba.py`).
Numba survives only as a conda pin because **seqpro** transitively imports it. The
residual `_numba`-suffixed names in gvl route only to Rust or numpy.
- The read path is parallelized in Rust with rayon, tuned via env vars in
`python/genvarloader/_threads.py`: `GVL_NUM_THREADS`, `GVL_FORCE_PARALLEL`, and a
`RAYON_NUM_THREADS` override (issue #263). None were documented user-side.

## Scope (focused fix — not a full line-by-line sweep)

### Part A — docs fixes
1. `docs/source/faq.md` — rewrite the "Ragged objects" answer's stale
"subclass of Awkward Arrays / numba JIT'd functions" paragraph to reflect the
`seqpro.rag.Ragged` (`_core.Ragged`, Rust) backend; note awkward is no longer a dep.
2. `docs/source/faq.md` — new entry "How do I control how many threads GVL uses?"
documenting the three env vars, sourced from `_threads.py`.
3. `README.md` — replace the `tbb`/`pyomp`-for-numba install note with a note that
parallelism is built-in (Rust/rayon), tunable via `GVL_NUM_THREADS`.
4. `skills/genvarloader/SKILL.md` — `_core.Ragged` "Rust+numba backend" → "Rust backend"
(seqpro-core's rag layer is numba-free).
5. Targeted leftover sweep of README + `docs/source/*.md` + SKILL.md for other
`numba`/`awkward`/`GVL_BACKEND`/`tbb`/`pyomp` references — none remaining (the
surviving `awkward` mentions in SKILL.md describe "zero-awkward" as a feature).

The auto-generated `docs/source/changelog.md` is left untouched (built from commit
messages via `changelog.md.j2`).

### Part B — CLAUDE.md gate
Add a "Docs audit before feature/breaking-change PRs" section that requires auditing
README + `docs/source/*.md` + SKILL.md before such PRs, lists what to check
(now-false claims, new config/env vars, changed preprocessing), and states the
auto-generated changelog does not count as documentation. Complements the existing
skill-maintenance rule.

## Verification
- Markdown edits are prose-only in existing files with no new MyST directives.
- Full `pixi run -e docs doc` build not run in-worktree (docs env not provisioned there);
low build-break risk given no directive changes.
1 change: 0 additions & 1 deletion pixi.toml
Original file line number Diff line number Diff line change
Expand Up @@ -163,7 +163,6 @@ cargo-test = { cmd = "cargo test --release" }
memray-write = { cmd = "memray run -fo tests/benchmarks/profiling/write.memray.bin tests/benchmarks/profiling/profile_write.py --op write" }

[feature.docs.tasks]
install-e = "uv pip install -e /cellar/users/dlaub/projects/ML4GLand/SeqPro -e /cellar/users/dlaub/projects/genoray -e ."
i-kernel = "ipython kernel install --user --name 'gvl-docs' --display-name 'GVL Docs'"
i-kernel-gpu = "ipython kernel install --user --name 'gvl-docs-gpu' --display-name 'GVL Docs GPU'"
doc = "cd docs && make clean && make html"
Expand Down
2 changes: 1 addition & 1 deletion skills/genvarloader/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -341,7 +341,7 @@ Footprint is computed exactly via `Dataset._output_bytes_per_instance(...)` (use

- `gvl.Reference.from_path(fasta, contigs=None)` — wrap a FASTA (path to a `.fa`/`.fa.bgz`, or a `.gvlfa` cache dir). Builds/reuses a sibling `.gvlfa` cache directory (self-describing, fingerprint-validated; legacy `.fa.gvl` caches auto-migrate). The cache is built atomically (temp + `os.replace`) under a best-effort lock, so concurrent builders sharing one reference are safe; the cache **auto-rebuilds** from its source when stale or missing.
- `gvl.read_bedlike(path)` / `gvl.with_length(bed, L)` — BED helpers (re-exported from `seqpro`).
- `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals` — ragged return containers. All are backed by `seqpro.rag.Ragged` (`_core.Ragged` Rust+numba backend); **not** `awkward`. `RaggedVariants` is a **subclass** of `seqpro.rag.Ragged` (`class RaggedVariants(seqpro.rag.Ragged)`), so `isinstance(rv, Ragged) is True`. Structural methods — indexing, `reshape`, `squeeze`, `to_packed` — are inherited from the base and **preserve the `RaggedVariants` type** (positional/structural operations return `RaggedVariants`). A **string key** (`rv["start"]`) returns a bare `Ragged` field, not a `RaggedVariants`. `reshape` takes the new shape either as unpacked ints — e.g. `rv.reshape(1, 2, None)` — or as a single tuple `rv.reshape((1, 2, None))`; the base `Ragged` signature accepts both. `squeeze(axis=None)` is a real axis-squeeze (base semantics) — it squeezes any size-1 axis, **not** a fixed "drop axis 0". An int index collapses the leading axis (numpy-consistent); slice/array indexing preserves it. Named properties (`.alt`, `.ref`, `.start`, `.ilen`, `.end`) are the primary access point; extra fields (e.g. `AF`, custom FORMAT fields) are also accessible via `rv["field"]` or `rv.field` (via `__getattr__`). `RaggedVariants` itself does not define `__eq__` (wrapper-level `==` is Python object-identity, not element-wise); to compare contents, compare individual fields — e.g. `rv["alt"] == other_alt` or `rv.start == other_start` — which use `seqpro.rag.Ragged`'s ufunc-based (element-wise) comparison. Domain methods retained on `RaggedVariants`: `.rc_()`, `.pad()`, `.to_nested_tensor_batch()`; derived read-only properties: `.ilen`, `.end`; fields: `.alt`, `.ref`, `.start`, `.dosage`.
- `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals` — ragged return containers. All are backed by `seqpro.rag.Ragged` (`_core.Ragged` Rust backend); **not** `awkward`. `RaggedVariants` is a **subclass** of `seqpro.rag.Ragged` (`class RaggedVariants(seqpro.rag.Ragged)`), so `isinstance(rv, Ragged) is True`. Structural methods — indexing, `reshape`, `squeeze`, `to_packed` — are inherited from the base and **preserve the `RaggedVariants` type** (positional/structural operations return `RaggedVariants`). A **string key** (`rv["start"]`) returns a bare `Ragged` field, not a `RaggedVariants`. `reshape` takes the new shape either as unpacked ints — e.g. `rv.reshape(1, 2, None)` — or as a single tuple `rv.reshape((1, 2, None))`; the base `Ragged` signature accepts both. `squeeze(axis=None)` is a real axis-squeeze (base semantics) — it squeezes any size-1 axis, **not** a fixed "drop axis 0". An int index collapses the leading axis (numpy-consistent); slice/array indexing preserves it. Named properties (`.alt`, `.ref`, `.start`, `.ilen`, `.end`) are the primary access point; extra fields (e.g. `AF`, custom FORMAT fields) are also accessible via `rv["field"]` or `rv.field` (via `__getattr__`). `RaggedVariants` itself does not define `__eq__` (wrapper-level `==` is Python object-identity, not element-wise); to compare contents, compare individual fields — e.g. `rv["alt"] == other_alt` or `rv.start == other_start` — which use `seqpro.rag.Ragged`'s ufunc-based (element-wise) comparison. Domain methods retained on `RaggedVariants`: `.rc_()`, `.pad()`, `.to_nested_tensor_batch()`; derived read-only properties: `.ilen`, `.end`; fields: `.alt`, `.ref`, `.start`, `.dosage`.
- `gvl.FlatRagged` — flat analog of `Ragged`: `.data` (flat numpy array), `.offsets` (int64), `.shape`. Methods: `.to_ragged()`, `.to_fixed(length)`, `.to_padded(pad_value)`, `.reshape(shape)`, `.squeeze(axis)`. Source: `python/genvarloader/_flat.py`.
- `gvl.FlatIntervals` — flat-buffer interval container returned by `with_tracks(kind="intervals")` + `with_output_format("flat")`. Fields `.starts`/`.ends`/`.values` are `FlatRagged`; `.to_ragged()` → `RaggedIntervals`; `.reshape(...)`, `.squeeze(...)`, `.shape`. Source: `python/genvarloader/_ragged.py`.
- `gvl.FlatAnnotatedHaps` — flat analog of `RaggedAnnotatedHaps`: fields `.haps`, `.var_idxs`, `.ref_coords` (each a `FlatRagged`). Methods: `.to_ragged()`, `.to_fixed(length)`, `.to_padded()`, `.reshape(shape)`, `.squeeze(axis)`. Source: `python/genvarloader/_flat.py`.
Expand Down
Loading