feat(scalar-index): push scan limit into index search#7065
Open
gstamatakis95 wants to merge 2 commits into
Closes #6949
Problem
A scalar index query that matches many rows can be slow, for example a large range query against a B-tree index that hits most or all of the data. When the scan has a LIMIT, the index still builds the full set of matching rows, and the limit is applied only later in the plan.
This change passes the limit down into the index search. A B-tree can then stop early once it has found enough matches.
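The early-stop idea can be illustrated with a minimal sketch. Everything here (`RangeIndex`, `PagedIndex`, `Page`, `SearchResult`) is a hypothetical, simplified stand-in and not the lance-index API: pages are plain vectors, row ids are positions in page order, and null handling and page pruning are left out. It only shows a default `search_limited` that falls back to `search`, and an override that stops reading pages once enough matches are found.

```rust
// Hypothetical, simplified model for illustration; not the lance-index API.

/// One page of values, standing in for a B-tree leaf page.
pub struct Page {
    pub values: Vec<i64>,
}

/// Matching row ids plus the number of pages that were read.
#[derive(Debug, PartialEq)]
pub struct SearchResult {
    pub row_ids: Vec<u64>,
    pub pages_read: usize,
}

pub trait RangeIndex {
    /// Full search: every row whose value lies in `lo..=hi`.
    fn search(&self, lo: i64, hi: i64) -> SearchResult;

    /// Limited search. The default ignores the limit and calls `search`,
    /// so index types that do not override it keep working unchanged.
    fn search_limited(&self, lo: i64, hi: i64, _limit: Option<usize>) -> SearchResult {
        self.search(lo, hi)
    }
}

pub struct PagedIndex {
    pub pages: Vec<Page>,
}

impl PagedIndex {
    /// Scans pages in order; with a limit, stops before reading the next
    /// page once at least `limit` matches have been gathered.
    fn scan(&self, lo: i64, hi: i64, limit: Option<usize>) -> SearchResult {
        let mut row_ids = Vec::new();
        let mut pages_read = 0;
        let mut first_row_id = 0u64; // row id of the first value in the current page
        for page in &self.pages {
            if let Some(n) = limit {
                if row_ids.len() >= n {
                    break; // enough matches: remaining pages are never read
                }
            }
            pages_read += 1;
            for (i, v) in page.values.iter().enumerate() {
                if (lo..=hi).contains(v) {
                    row_ids.push(first_row_id + i as u64);
                }
            }
            first_row_id += page.values.len() as u64;
        }
        SearchResult { row_ids, pages_read }
    }
}

impl RangeIndex for PagedIndex {
    fn search(&self, lo: i64, hi: i64) -> SearchResult {
        self.scan(lo, hi, None)
    }

    fn search_limited(&self, lo: i64, hi: i64, limit: Option<usize>) -> SearchResult {
        self.scan(lo, hi, limit)
    }
}
```

In this sketch the limited search may return more than `limit` rows, because it finishes the page it is on. That is fine as long as a later step trims the result to the exact limit.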
What changed
Index layer (lance-index)
- `ScalarIndex::search_limited(query, metrics, limit)`. It has a default that ignores the limit and calls `search`. Because of the default, existing index types and call sites do not need any changes.
- `BTreeIndex` implements the new method. It searches matching pages in order and stops as soon as it has gathered at least `limit` real matches. The pages it has not reached yet are never read. Null page handling is skipped when a limit is set.
- `ScalarIndexExpr::evaluate_limited` passes the limit only to a single index lookup. For AND, OR, and NOT it does not pass the limit, because those need the full result of each side.
Exec and planner (lance)
- `MaterializeIndexExec` (legacy path) and `ScalarIndexExec` (default path) now accept a limit through `with_limit(...)`.
- `Scanner::index_search_limit()` computes the value to push, which is `limit + offset`. It is wired into both `scalar_indexed_scan` and `new_filtered_read`.
- `LogicalScalarIndex` passes the limit to each of its segments.
Correctness
Pushing the limit is only a speed optimization. A `GlobalLimitExec` still applies the exact limit and offset at the top of the plan, so the index only needs to return at least `limit + offset` rows.
The pushdown is turned off in any case where returning the first N matches is not the same as returning any N matches:
`limit` live rows.
Tests
- `test_search_limited_short_circuits` (B-tree unit test). Confirms the search returns at least `limit` rows but reads fewer than all pages.
- `test_limit_pushed_into_scalar_index` (scanner, end to end). Checks the exact row count and that every returned row matches the filter. A second phase adds deletions and confirms results stay correct.
- `test_limit` tests, and `scalar_logical` tests. All pass.
- `cargo clippy --tests -- -D warnings` and `cargo fmt` are clean.
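For reviewers, the pushdown rule described under "What changed" can be sketched in a few lines. This is a hypothetical model, not the code in this PR: `Expr`, `leaf_limits`, and the free function `index_search_limit` are illustrative names, and the saturating add is an assumption. It shows that the pushed value is `limit + offset`, and that only a single index lookup receives it, while the children of AND, OR, and NOT are evaluated without a limit.

```rust
// Hypothetical model for illustration; names and types are not from the PR.

/// Stand-in for a scalar index expression tree.
pub enum Expr {
    /// A single index lookup, described by a label.
    Query(String),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Value to push into the index: `limit + offset`, or `None` without a limit.
/// Saturating add is an assumption made here to avoid overflow.
pub fn index_search_limit(limit: Option<usize>, offset: Option<usize>) -> Option<usize> {
    limit.map(|l| l.saturating_add(offset.unwrap_or(0)))
}

/// Returns the limit each leaf lookup would receive, in left-to-right order.
pub fn leaf_limits(expr: &Expr, limit: Option<usize>) -> Vec<Option<usize>> {
    match expr {
        // A single lookup can stop early, so it gets the limit.
        Expr::Query(_) => vec![limit],
        // AND and OR need the full result of each side, so no limit is passed.
        Expr::And(l, r) | Expr::Or(l, r) => {
            let mut out = leaf_limits(l, None);
            out.extend(leaf_limits(r, None));
            out
        }
        // NOT also needs the full result of its input.
        Expr::Not(inner) => leaf_limits(inner, None),
    }
}
```

Dropping the limit below AND, OR, and NOT is the conservative choice: a truncated side could remove rows that the combined result needs.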