perf!: avoid listing index files after writes by Xuanwo · Pull Request #7129 · lance-format/lance

Xuanwo · 2026-06-05T11:00:41Z

Avoids listing index directories after new index writes by propagating IndexFile metadata from writer and builder APIs into manifest metadata. This covers scalar index writers, V3 vector builders, vector append/remap paths, and distributed vector merge while preserving legacy and extension fallbacks where the API cannot return files yet.

This reduces extra object-store list/stat IO on fresh index creation and maintenance paths, with IO-count regression coverage for scalar and V3 vector index creation.

codecov · 2026-06-05T11:38:37Z

Codecov Report

❌ Patch coverage is 70.80292% with 120 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/index/vector/ivf.rs	33.00%	64 Missing and 5 partials ⚠️
rust/lance-index/src/scalar/inverted/builder.rs	74.07%	3 Missing and 18 partials ⚠️
rust/lance-index/src/scalar/bitmap.rs	56.25%	6 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/btree.rs	64.70%	0 Missing and 6 partials ⚠️
rust/lance-index/src/scalar/fmindex.rs	72.72%	2 Missing and 1 partial ⚠️
...lance-index/src/vector/distributed/index_merger.rs	62.50%	2 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/inverted/index.rs	50.00%	2 Missing ⚠️
rust/lance-index/src/scalar/lance_format.rs	71.42%	2 Missing ⚠️
rust/lance/src/index/vector/builder.rs	89.47%	0 Missing and 2 partials ⚠️
rust/lance-index/src/scalar/inverted.rs	50.00%	0 Missing and 1 partial ⚠️
... and 4 more

📢 Thoughts on this report? Let us know!

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

_{Tip: disable this comment in your organization's Code Review settings.}

Xuanwo · 2026-06-05T17:33:12Z

@claude review once

claude · 2026-06-05T18:02:28Z

    ivf_mut.write(&mut writer).await?;
-    writer.finish().await?;
+    let index_size = writer.finish().await? as u64;

    // Write the aux file
    aux_ivf.write(&mut aux_writer).await?;
-    aux_writer.finish().await?;
+    let aux_size = aux_writer.finish().await? as u64;

-    Ok(existing_indices.len() - start_pos)
+    Ok((
+        existing_indices.len() - start_pos,
+        vec![
+            IndexFile {
+                path: INDEX_FILE_NAME.to_string(),
+                size_bytes: index_size,
+            },
+            IndexFile {
+                path: INDEX_AUXILIARY_FILE_NAME.to_string(),
+                size_bytes: aux_size,
+            },
+        ],
+    ))


🔴 In optimize_ivf_hnsw_indices (rust/lance/src/index/vector/ivf.rs:949-968), writer.finish().await? and aux_writer.finish().await? are dispatched against the concrete PreviousFileWriter constructed at lines 861-865/889-893, whose inherent finish() (rust/lance-file/src/previous/writer/mod.rs:253) returns the number of rows, not bytes. Those values are then stored in IndexFile.size_bytes and fed through LanceIndexStore::with_file_sizes as CachedFileSize hints, so the IVF_HNSW_SQ optimize path persists garbage sizes (off by orders of magnitude) into the manifest and into the HEAD-call-avoidance cache introduced by this PR. Fix by mirroring the sibling functions in this file (e.g. optimize_ivf_pq_indices at line 790, remap_index_file at line 1908, write_ivf_pq_file at line 1973) and capturing writer.tell().await? as u64 before Writer::shutdown.

Extended reasoning...

Bug

In optimize_ivf_hnsw_indices (rust/lance/src/index/vector/ivf.rs:949-968 in this PR), the IVF and auxiliary file sizes are captured as:

ivf_mut.write(&mut writer).await?; let index_size = writer.finish().await? as u64; // Write the aux file aux_ivf.write(&mut aux_writer).await?; let aux_size = aux_writer.finish().await? as u64;

and then placed into the returned IndexFile entries as size_bytes.

Why this is wrong

writer and aux_writer are not Box<dyn Writer> here. At lines 861-865 and 889-893 they are wrapped with PreviousFileWriter::with_object_writer(...), so their concrete type is PreviousFileWriter<ManifestDescribing>. PreviousFileWriter carries two finish methods:

The inherent method at rust/lance-file/src/previous/writer/mod.rs:253-263, which returns Result<usize> containing self.metadata.batch_offsets.last() — i.e. the number of rows written, not bytes:

pub async fn finish(&mut self) -> Result<usize> { self.write_footer().await?; Writer::shutdown(self.object_writer.as_mut()).await?; let num_rows = self.metadata.batch_offsets.last().cloned().unwrap_or_default(); Ok(num_rows as usize) }

The trait IndexWriter::finish impl (in lance_format.rs), which returns Result<IndexFile>.

Rust method resolution prefers the inherent method, so writer.finish().await? invokes the row-count variant. The as u64 cast at lines 950/954 also confirms this — if dispatch had hit the trait method (returning IndexFile), the cast would not compile.

The net effect: for the IVF_HNSW v1 optimize path, index_size is the count of HNSW nodes (the .idx writer) and aux_size is the count of stored vectors (the .aux writer). These row counts then become the persisted IndexFile.size_bytes.

Why this PR makes it dangerous

This PR’s explicit goal is to skip HEAD calls when opening index files by propagating writer-provided sizes into LanceIndexStore::with_file_sizes, which constructs CachedFileSize::new(size) in open_index_file. A wrong-by-orders-of-magnitude cached size feeds straight into the v2 FileReader’s end-of-file offset arithmetic (footer location, metadata length, etc.). When the reader trusts a "size_bytes" that is actually a few hundred / few thousand instead of the real megabyte+ file size, footer offset computation reads from the wrong place — surfacing as corrupt-file errors or, worse, silently wrong reads.

Even on the read paths that still issue their own HEAD (i.e. the path that bypasses with_file_sizes), the garbage values still flow into IndexMetadata::total_size_bytes() and describe_indices(), breaking the very accounting this PR adds.

How to fix

Mirror the pattern already used everywhere else in this file:

optimize_ivf_pq_indices (line 790): let size_bytes = writer.tell().await? as u64; immediately before Writer::shutdown(writer.as_mut()).await?;

remap_index_file (line 1908): same pattern.

write_ivf_pq_file (line 1973): same pattern.

For optimize_ivf_hnsw_indices, after ivf_mut.write(&mut writer).await? (and analogously for aux_writer), do something like:

ivf_mut.write(&mut writer).await?; let index_size = writer.tell().await? as u64; PreviousFileWriter::finish(&mut writer).await?; aux_ivf.write(&mut aux_writer).await?; let aux_size = aux_writer.tell().await? as u64; PreviousFileWriter::finish(&mut aux_writer).await?;

or equivalently call the trait method IndexWriter::finish (which already does tell+shutdown internally for this writer type — see rust/lance-index/src/scalar/lance_format.rs impl IndexWriter for PreviousFileWriter) and use the returned IndexFile.size_bytes.

Step-by-step proof

Build an IVF_HNSW_SQ index containing, say, ~10,000 vectors of dimension 128, so the underlying .idx and .aux files end up at hundreds of KB or more.

Trigger the segment-commit optimize path that calls optimize_vector_indices → optimize_ivf_hnsw_indices.

After optimize, inspect the committed IndexMetadata.files for the new UUID. The reported size_bytes for index.idx equals the number of HNSW nodes (≈10,000) and for auxiliary.idx equals the number of stored vectors (≈10,000) — not the actual byte sizes (which are orders of magnitude larger).

Because the PR now feeds these as CachedFileSize hints via LanceIndexStore::with_file_sizes, a subsequent read that uses this path will compute footer offsets as if the file were ~10,000 bytes long, fetching bytes from outside the real footer region.

Compare with optimize_ivf_pq_indices (line 790) on the same dataset: there size_bytes reflects the true file size because that code path captures writer.tell().await? before shutdown.

All four independent verifiers reached the same conclusion (Rust dispatch goes to the inherent PreviousFileWriter::finish, the value returned is num_rows, and these values then flow into manifest size_bytes and the new HEAD-avoidance cache). The fix is a small one-liner per writer, matching the pattern already established elsewhere in the same file.

perf: avoid listing index files after writes

00b284e

github-actions Bot added A-index Vector index, linalg, tokenizer performance labels Jun 5, 2026

perf: require index writers to return file metadata

be33d89

Xuanwo marked this pull request as ready for review June 5, 2026 17:31

claude Bot reviewed Jun 5, 2026

View reviewed changes

Xuanwo changed the title ~~perf: avoid listing index files after writes~~ perf!: avoid listing index files after writes Jun 5, 2026

github-actions Bot added the breaking-change label Jun 5, 2026

claude Bot reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf!: avoid listing index files after writes#7129

perf!: avoid listing index files after writes#7129
Xuanwo wants to merge 2 commits into
mainfrom
xuanwo/reduce-index-file-io

Xuanwo commented Jun 5, 2026

Uh oh!

codecov Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

Xuanwo commented Jun 5, 2026

Uh oh!

claude Bot Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Xuanwo commented Jun 5, 2026

Uh oh!

codecov Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

Xuanwo commented Jun 5, 2026

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

Bug

Why this is wrong

Why this PR makes it dangerous

How to fix

Step-by-step proof

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented Jun 5, 2026 •

edited

Loading