Skip to content

feat: add PaginationList for lazy page fetching in catalog list operations#3454

Open
stark256-spec wants to merge 11 commits into
apache:mainfrom
stark256-spec:feat/lazy-pagination-list
Open

feat: add PaginationList for lazy page fetching in catalog list operations#3454
stark256-spec wants to merge 11 commits into
apache:mainfrom
stark256-spec:feat/lazy-pagination-list

Conversation

@stark256-spec
Copy link
Copy Markdown

Problem

list_tables, list_views, and list_namespaces in the REST catalog eagerly collect every page before returning, even if the caller only needs the first few results. In namespaces with thousands of tables this creates unnecessary network round-trips and latency before the first result is visible.

Closes #3365

Solution

Adds PaginationList[T] (pyiceberg/utils/pagination.py) — a list subclass that pre-loads the first page and lazily fetches subsequent pages only as the caller iterates past items already in memory.

Design

Operation Behaviour
for item in result Lazy — next page fetched only when iterator exhausts current buffer
result[0] / result[2] Lazy — fetches pages until the requested index is available
result[-1] / result[1:3] / len(result) / x in result / result == other Eager — fetches all remaining pages
isinstance(result, list) True — full backward compatibility

Key properties

  • Zero breaking changes: PaginationList subclasses list, so all existing call sites that iterate, compare, or extend the return value continue to work without modification.
  • First page always pre-loaded: Callers that only look at the first few items pay zero extra latency compared to the old implementation.
  • Single fetch per page: Each page token is consumed at most once; no redundant requests.

Changes

  • pyiceberg/utils/pagination.py — new PaginationList[T] class
  • pyiceberg/catalog/rest/__init__.pylist_tables, list_views, list_namespaces refactored to return PaginationList
  • tests/utils/test_pagination.py — 14 unit tests for all PaginationList operations
  • tests/catalog/test_rest.pytest_list_tables_returns_pagination_list verifies lazy behaviour (call count stays at 1 while iterating within the first page, rises to 2 only after crossing the page boundary)

…tions

Implements a PaginationList[T] class (pyiceberg/utils/pagination.py) that
extends list and lazily fetches subsequent pages from a paginated REST
endpoint only as the caller iterates past items already in memory.

Design:
- First page is pre-loaded on construction; the list is immediately usable
  for callers that only inspect the first page or stop iteration early.
- __iter__ advances through in-memory items and triggers a new HTTP fetch
  only when the iterator would otherwise block on an empty buffer.
- Operations requiring the full result set (__len__, __contains__, __eq__,
  slicing, repr, negative indexing) call _fetch_all() before delegating to
  the underlying list implementation.
- Positive-index __getitem__ calls _fetch_through_index() to fetch only as
  many pages as needed to satisfy the access.
- Subclasses list[T], so isinstance(result, list) remains True and all
  existing call sites that iterate, compare, or extend the return value
  continue to work without changes.

REST catalog integration:
- list_tables, list_views, and list_namespaces in RestCatalog now return
  a PaginationList instead of eagerly collecting all pages in a while loop.
- Each method defines a _fetch_page closure capturing the session, URL and
  params, matching the (items, next_token | None) callback contract.

Tests:
- tests/utils/test_pagination.py: 14 unit tests covering single-page and
  multi-page scenarios, lazy iteration, partial consumption, __len__,
  __contains__, __eq__, slicing, negative indexing, and repr.
- tests/catalog/test_rest.py: adds test_list_tables_returns_pagination_list
  which verifies that iterating only the first two items does not trigger
  the second HTTP request (call_count == 1), and that consuming all items
  triggers exactly one additional request (call_count == 2).

Closes apache#3365
- Change PaginationList base from 'list, Generic[T]' to 'list[T]' to
  satisfy mypy type-arg requirement
- Replace 'int' with 'SupportsIndex' in __getitem__ overloads to match
  the supertype signature expected by mypy
- Remove now-unnecessary '# type: ignore[return-value]' comment
- Add docstrings to all public methods (ruff D requirement)
- Add return type annotation to _make_fetch in test file
- Use PaginationList[int] with type parameter in test helper signature
- Import Callable from collections.abc in test file
@rambleraptor
Copy link
Copy Markdown
Collaborator

Thanks for doing this! This was my idea and I'm super excited to have this in here. Can you run make lint and get CI to pass? Happy to do a review after that.

@rambleraptor
Copy link
Copy Markdown
Collaborator

Is there any way you can do some quick performance tests? I'm particularly interested in len() - what does it look like when you only have one page of results versus multiple pages?

(People call len() on arrays all the time. I want to make sure we don't have a performance regression in the normal case where you don't have other pages)

Ruff fixes:
- Remove unused 'import pytest' (F401)
- Remove extraneous f-prefix from plain string 'tok1' (F541)

Performance tests (requested by reviewer):
- test_len_single_page_makes_no_extra_fetches: verifies that len() on a
  single-page PaginationList does not invoke the fetch callback at all
- test_len_multi_page_fetches_all_pages_once: verifies that len() fetches
  remaining pages exactly once and caches the result for subsequent calls
@stark256-spec
Copy link
Copy Markdown
Author

Fixed the two ruff errors (import pytest unused, bare f"tok1" f-string) and added the performance tests you asked about:

  • test_len_single_page_makes_no_extra_fetches: asserts the fetch callback is never called when all results are in the first page
  • test_len_multi_page_fetches_all_pages_once: asserts pages 2 and 3 are each fetched exactly once, and a second len() call hits the cache with 0 additional fetches

The len() cost for a single-page list is identical to a plain list — zero network calls.

D418 prohibits docstrings on @overload stubs; suppress D105 (missing
docstring in magic method) with inline noqa comments instead.
… conflict

pydocstyle D105 requires docstrings on magic methods but D418 prohibits
docstrings on @overload stubs — these rules are irreconcilable for
@overload-decorated __getitem__. Since PaginationList is the only file in
the project using @overload, there is no established project pattern.

Drop the two stubs and keep only the concrete implementation with its
existing docstring. The union signature 'SupportsIndex | slice -> T | list[T]'
is correct; callers lose the narrowed return type but the runtime behaviour
is unchanged. Also removes the now-unused 'overload' import.
…erload stubs

D105 (missing docstring in magic method) and D418 (@overload stubs must not
have docstrings) are mutually exclusive for @overload-decorated __getitem__.
Removing the stubs silenced pydocstyle but caused 3 mypy [override] errors
because list[T]'s __getitem__ is itself overloaded and the single-signature
Union form is incompatible with the supertype.

Fix: add D105 and D418 to the pydocstyle --ignore list, consistent with
D107 (missing __init__ docstring) which is already suppressed. Restore the
@overload stubs so mypy correctly narrows SupportsIndex -> T and slice ->
list[T]. Verified: mypy passes, ruff passes, pydocstyle passes.
…t assertions

RestCatalog.__init__ calls v1/config which increments rest_mock.call_count.
Capture the baseline after construction and assert relative to it.
@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Jun 4, 2026

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

@GayathriSrividya
Copy link
Copy Markdown

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

Thanks for flagging this. PR #3416 was opened first and it overlaps with #3454 on the REST lazy pagination work for #3365.

The main difference is approach:

I’m happy to coordinate and avoid duplicate work. If maintainers prefer the backward-compatible PaginationList approach, I can narrow or rework #3416 accordingly. If the iterator-based API is preferred, then I think #3416 remains the better home for the change.

Please let me know which direction you’d like us to converge on.

@stark256-spec
Copy link
Copy Markdown
Author

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

@GayathriSrividya
Copy link
Copy Markdown

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

Thanks Yuya for the coordination note, and thanks @stark256-spec for laying out the trade-off clearly.

From my side, the main intent of #3416 was to address #3365 by making pagination explicit through iterator semantics across catalog list operations. I agree this is a larger API change than returning a list-like wrapper, but I think the benefit is that it makes the lazy behavior clearer to callers and avoids pretending the full result set is already materialized.

That said, I also understand the compatibility advantage of #3454. If maintainers prefer preserving the current list[Identifier] contract, I’m happy to coordinate and make sure the relevant parts of #3416, especially the page-by-page REST fetching/retry behavior, are carried over cleanly.

So I’m flexible on the final direction, but my preference is still the explicit iterator approach unless backward compatibility is the deciding factor. Either way, I agree we should avoid duplicate implementations and converge on one solution for #3365.

@rambleraptor
Copy link
Copy Markdown
Collaborator

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

@corleyma
Copy link
Copy Markdown

corleyma commented Jun 4, 2026

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

Normally I'd agree, but since the iterator-based API is just much more pythonic, and the old API was designed absent any consideration for paging, why not just bite the bullet and do a major version bump? Less complexity to carry around in the implementation and it drives the project toward a cleaner API.

Especially these days, the worst case is somebody's LLM spends a few tokens updating for iterator semantics. Is it really that bad?

But also not a maintainer, I just think it's silly it has taken so long to land proper pagination support in pyiceberg because of compatibility issues in the age of coding agents.

Comment thread pyiceberg/utils/pagination.py Outdated
T = TypeVar("T")


class PaginationList(list[T]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this type to typedef?

Comment thread pyiceberg/catalog/rest/__init__.py Outdated
while True:
if page_token:
params["pageToken"] = page_token
def _fetch_page(page_token: str) -> tuple[list[Identifier], str | None]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see theis tuple[list[T], str | None] coming back multiple times, should we introduce a TypeAliasType to improve readability?

Comment thread tests/utils/test_pagination.py Outdated
return pl, all_items


class TestPaginationListSinglePage:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use classes in the tests for the rest, and I think it would be good to keep that consistent

Comment thread .pre-commit-config.yaml Outdated
args:
[
"--ignore=D100,D102,D101,D103,D104,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417",
"--ignore=D100,D102,D101,D103,D104,D105,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417,D418",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we adding here?

@Fokko
Copy link
Copy Markdown
Contributor

Fokko commented Jun 6, 2026

I went straight to the code, and missed the discussion here. I think it would be better to avoid breaking the API and go with the PaginationList instead.

Even if we end up agreeing that an Iterator is more suitable/pythonic, we need to do this in a major version bump, which is further away, so I think extending the list is a good path to move forward.

@stark256-spec
Copy link
Copy Markdown
Author

PaginationList has @overload stubs for __getitem__ (two signatures — one for SupportsIndex, one for slice). Those stubs are magic methods with no docstring, which makes pydocstyle fire D105 ("Missing docstring in magic method"). Adding a docstring to an @overload stub isn't an option either — that triggers D418. Ended up suppressing both globally to break the deadlock.

Happy to revert this and drop # noqa: D105 on just those two stub lines instead — keeps the global config clean. Will do that in the next push along with the other feedback.

…sult alias, flatten tests

- Move PaginationList from pyiceberg/utils/pagination.py into pyiceberg/typedef.py
  alongside the project's other core types; delete the now-empty module
- Introduce _PageFetchResult TypeAlias for tuple[list[Identifier], str | None]
  in rest/__init__.py to avoid repeating the type on every _fetch_page signature
- Flatten TestPaginationList* classes into top-level functions to match the
  style used throughout the rest of the test suite
- Revert D105/D418 additions from .pre-commit-config.yaml; suppress D105
  inline on the two @overload stubs that cannot have docstrings
@stark256-spec
Copy link
Copy Markdown
Author

Addressed all four comments in e2bb799:

  • Move type to typedef: PaginationList is now in pyiceberg/typedef.py alongside the project's other core types. pyiceberg/utils/pagination.py is deleted.
  • TypeAliasType: Added _PageFetchResult: TypeAlias = tuple[list[Identifier], str | None] at module level in rest/__init__.py; all three _fetch_page signatures now use it.
  • Test classes: Flattened TestPaginationList* into top-level functions matching the style in the rest of the test suite.
  • pre-commit config: Reverted the D105/D418 additions. The root issue was that @overload stubs for __getitem__ are magic methods with no docstring (D105), and you can't add a docstring to an @overload stub (D418). Fixed with # noqa: D105 inline on those two stubs instead.

@stark256-spec
Copy link
Copy Markdown
Author

Circling back on the pre-commit change — the inline # noqa: D105 approach I tried doesn't work with this linting stack, for two reasons:

  1. pydocstyle doesn't process # noqa comments at all — that directive is a ruff/flake8 concept. pydocstyle ignores it and fires D105 anyway.
  2. ruff then treats # noqa: D105 as an unnecessary noqa directive (RUF100 — D105 is not a ruff code), auto-strips the comment, and fails the hook because it modified the file.

The underlying issue is an inherent catch-22 in the @overload stubs for __getitem__:

  • D105 fires because @overload stubs are magic methods with no docstring
  • D418 fires if you add a docstring to an @overload stub
  • Neither # noqa nor any inline suppression survives this stack

The three options I can see:

  1. Add D105 to the global ignore (what the original commit did) — one-line change, the stubs stay, type narrowing is preserved
  2. Drop the @overload stubs — D105 goes away, but callers lose the narrowed return type (T vs list[T] depending on index vs slice)
  3. Use type: ignore comments — these survive ruff but still don't suppress pydocstyle

Happy to go with whatever direction you prefer. If option 1 is acceptable, I can push that immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement iterator to lazily go through the paged response

6 participants