feat: support merging zonemap index segments#7128
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
@claude review once |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
|
||
| Within a single commit, built segments must have disjoint fragment coverage. | ||
|
|
||
| `merge_existing_index_segments(...)` currently supports vector, inverted, bitmap, |
There was a problem hiding this comment.
🟡 The new paragraph in docs/src/guide/distributed_indexing.md says merge_existing_index_segments supports "vector, inverted, bitmap, and zone map segments", but BTree is also supported (it was added in #6889 / c7ae55a). The list should include btree to avoid misleading users; suggest "vector, inverted, bitmap, btree, and zone map segments".
Extended reasoning...
What the bug is
The PR adds a new prose paragraph at docs/src/guide/distributed_indexing.md:109-113 describing which index families merge_existing_index_segments(...) can merge. It enumerates the supported types as "vector, inverted, bitmap, and zone map segments" and tells users that "other scalar index families … cannot be merged into a larger physical segment until they add a merge implementation." BTree is missing from this list even though it has a working merge_segments implementation.
Why the existing code already supports BTree
The supported-type guard in rust/lance/src/index.rs:1136-1138 explicitly checks all_btree = source_segments.iter().all(segment_has_btree_details) and includes it in the validity test alongside vector / inverted / bitmap / zonemap. The dispatch at rust/lance/src/index.rs:1158-1159 routes the all-BTree case to crate::index::scalar::btree::merge_segments (the trailing else arm). The BTree merge primitive itself lives at rust/lance/src/index/scalar/btree.rs:90 (pub(crate) async fn merge_segments). All three pieces were landed by PR #6889 (commit c7ae55a 'feat: add segmented BTree index merge_segments support'), which is already on main and predates this PR.
Why this matters
The docs now actively contradict the code. A user reading the new paragraph will reasonably conclude that BTree segments cannot be merged via merge_existing_index_segments and that they must keep paying the cost of many small per-fragment BTree segments, or wait for someone to "add a merge implementation" that already exists. The same paragraph also tells them which families fall under the "cannot be merged" caveat, so the omission is doubly misleading — it both removes BTree from the supported list and implies it belongs in the unsupported list.
Step-by-step proof
- Stage two BTree segments with disjoint fragment coverage (the existing tests in
rust/lance/src/index/scalar_logical.rsalready do exactly this — seetest_open_named_scalar_index_uses_all_btree_segmentsandtest_btree_segment_search_is_exact_across_fragments, which callexecute_uncommittedper fragment and then commit). - Call
dataset.merge_existing_index_segments(staged).await. merge_existing_index_segments(index.rs:1127ff) computesall_btree = true, passes the guard at line 1138 (!all_btreeis false → no error), and at line 1158-1163 enters theelsebranch, callingcrate::index::scalar::btree::merge_segments(self, source_segments).await?.- That returns a merged
IndexMetadata, which is the contract the new paragraph implies is unavailable for BTree.
So the API does merge BTree today; only the prose says otherwise.
Addressing the duplicate-refutation note
A refuter argued bug_001 should be dropped in favor of earlier IDs (bug_002 / bug_003) per the duplicates rule. The synthesis step has already taken bug_001 as the canonical instance and merged the duplicates into it, which is the standard resolution — the report itself is still correct and the underlying defect is real. The other two verifier confirmations independently checked the code paths and reached the same conclusion (one even with file:line citations matching mine), so I am submitting under bug_001.
Fix
In docs/src/guide/distributed_indexing.md:109, change "vector, inverted, bitmap, and zone map segments" to "vector, inverted, bitmap, btree, and zone map segments" (or equivalent ordering). No code change is required.
Severity: nit — documentation-only inconsistency, no runtime impact, but it is introduced by the same paragraph this PR adds, so it is in-scope to correct here.
BubbleCal
left a comment
There was a problem hiding this comment.
One concern about the merge-then-commit path.
| uuid: new_uuid, | ||
| fields: vec![field_id], | ||
| dataset_version: dataset.manifest.version, | ||
| fragment_bitmap: Some(fragment_bitmap), |
There was a problem hiding this comment.
Could we add a merge-then-commit regression test here? Since the merged segment keeps only live fragment coverage, committing it under the same index name can compare against the old segment's original coverage and report retired fragments as orphaned.
ZoneMap already supports fragment-scoped segment builds and multi-segment query, but its segments could not be consolidated through the canonical distributed indexing merge path. This adds a ZoneMap merge primitive that rewrites source segments into one self-contained physical segment while preserving fragment coverage and rows-per-zone metadata, and updates the distributed indexing guide to include zone map merge support.