feat: add PaginationList for lazy page fetching in catalog list operations by stark256-spec · Pull Request #3454 · apache/iceberg-python

stark256-spec · 2026-06-03T18:43:57Z

Problem

list_tables, list_views, and list_namespaces in the REST catalog eagerly collect every page before returning, even if the caller only needs the first few results. In namespaces with thousands of tables this creates unnecessary network round-trips and latency before the first result is visible.

Closes #3365

Solution

Adds PaginationList[T] (pyiceberg/utils/pagination.py) — a list subclass that pre-loads the first page and lazily fetches subsequent pages only as the caller iterates past items already in memory.

Design

Operation	Behaviour
`for item in result`	Lazy — next page fetched only when iterator exhausts current buffer
`result[0]` / `result[2]`	Lazy — fetches pages until the requested index is available
`result[-1]` / `result[1:3]` / `len(result)` / `x in result` / `result == other`	Eager — fetches all remaining pages
`isinstance(result, list)`	`True` — full backward compatibility

Key properties

Zero breaking changes: PaginationList subclasses list, so all existing call sites that iterate, compare, or extend the return value continue to work without modification.
First page always pre-loaded: Callers that only look at the first few items pay zero extra latency compared to the old implementation.
Single fetch per page: Each page token is consumed at most once; no redundant requests.

Changes

pyiceberg/utils/pagination.py — new PaginationList[T] class
pyiceberg/catalog/rest/__init__.py — list_tables, list_views, list_namespaces refactored to return PaginationList
tests/utils/test_pagination.py — 14 unit tests for all PaginationList operations
tests/catalog/test_rest.py — test_list_tables_returns_pagination_list verifies lazy behaviour (call count stays at 1 while iterating within the first page, rises to 2 only after crossing the page boundary)

…tions Implements a PaginationList[T] class (pyiceberg/utils/pagination.py) that extends list and lazily fetches subsequent pages from a paginated REST endpoint only as the caller iterates past items already in memory. Design: - First page is pre-loaded on construction; the list is immediately usable for callers that only inspect the first page or stop iteration early. - __iter__ advances through in-memory items and triggers a new HTTP fetch only when the iterator would otherwise block on an empty buffer. - Operations requiring the full result set (__len__, __contains__, __eq__, slicing, repr, negative indexing) call _fetch_all() before delegating to the underlying list implementation. - Positive-index __getitem__ calls _fetch_through_index() to fetch only as many pages as needed to satisfy the access. - Subclasses list[T], so isinstance(result, list) remains True and all existing call sites that iterate, compare, or extend the return value continue to work without changes. REST catalog integration: - list_tables, list_views, and list_namespaces in RestCatalog now return a PaginationList instead of eagerly collecting all pages in a while loop. - Each method defines a _fetch_page closure capturing the session, URL and params, matching the (items, next_token | None) callback contract. Tests: - tests/utils/test_pagination.py: 14 unit tests covering single-page and multi-page scenarios, lazy iteration, partial consumption, __len__, __contains__, __eq__, slicing, negative indexing, and repr. - tests/catalog/test_rest.py: adds test_list_tables_returns_pagination_list which verifies that iterating only the first two items does not trigger the second HTTP request (call_count == 1), and that consuming all items triggers exactly one additional request (call_count == 2). Closes apache#3365

- Change PaginationList base from 'list, Generic[T]' to 'list[T]' to satisfy mypy type-arg requirement - Replace 'int' with 'SupportsIndex' in __getitem__ overloads to match the supertype signature expected by mypy - Remove now-unnecessary '# type: ignore[return-value]' comment - Add docstrings to all public methods (ruff D requirement) - Add return type annotation to _make_fetch in test file - Use PaginationList[int] with type parameter in test helper signature - Import Callable from collections.abc in test file

rambleraptor · 2026-06-03T20:21:37Z

Thanks for doing this! This was my idea and I'm super excited to have this in here. Can you run make lint and get CI to pass? Happy to do a review after that.

rambleraptor · 2026-06-03T20:23:31Z

Is there any way you can do some quick performance tests? I'm particularly interested in len() - what does it look like when you only have one page of results versus multiple pages?

(People call len() on arrays all the time. I want to make sure we don't have a performance regression in the normal case where you don't have other pages)

Ruff fixes: - Remove unused 'import pytest' (F401) - Remove extraneous f-prefix from plain string 'tok1' (F541) Performance tests (requested by reviewer): - test_len_single_page_makes_no_extra_fetches: verifies that len() on a single-page PaginationList does not invoke the fetch callback at all - test_len_multi_page_fetches_all_pages_once: verifies that len() fetches remaining pages exactly once and caches the result for subsequent calls

stark256-spec · 2026-06-03T21:52:51Z

Fixed the two ruff errors (import pytest unused, bare f"tok1" f-string) and added the performance tests you asked about:

test_len_single_page_makes_no_extra_fetches: asserts the fetch callback is never called when all results are in the first page
test_len_multi_page_fetches_all_pages_once: asserts pages 2 and 3 are each fetched exactly once, and a second len() call hits the cache with 0 additional fetches

The len() cost for a single-page list is identical to a plain list — zero network calls.

…ocstyle D105

@overload

D418 prohibits docstrings on @overload stubs; suppress D105 (missing docstring in magic method) with inline noqa comments instead.

@overload

… conflict pydocstyle D105 requires docstrings on magic methods but D418 prohibits docstrings on @overload stubs — these rules are irreconcilable for @overload-decorated __getitem__. Since PaginationList is the only file in the project using @overload, there is no established project pattern. Drop the two stubs and keep only the concrete implementation with its existing docstring. The union signature 'SupportsIndex | slice -> T | list[T]' is correct; callers lose the narrowed return type but the runtime behaviour is unchanged. Also removes the now-unused 'overload' import.

…erload

…erload stubs D105 (missing docstring in magic method) and D418 (@overload stubs must not have docstrings) are mutually exclusive for @overload-decorated __getitem__. Removing the stubs silenced pydocstyle but caused 3 mypy [override] errors because list[T]'s __getitem__ is itself overloaded and the single-signature Union form is incompatible with the supertype. Fix: add D105 and D418 to the pydocstyle --ignore list, consistent with D107 (missing __init__ docstring) which is already suppressed. Restore the @overload stubs so mypy correctly narrows SupportsIndex -> T and slice -> list[T]. Verified: mypy passes, ruff passes, pydocstyle passes.

…on (isort)

…t assertions RestCatalog.__init__ calls v1/config which increments rest_mock.call_count. Capture the baseline after construction and assert relative to it.

ebyhr · 2026-06-04T03:05:41Z

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

GayathriSrividya · 2026-06-04T04:39:38Z

@stark256-spec, @GayathriSrividya is also working on the pagination in #3416. Could you please coordinate the work so that we don't end up duplicating efforts?

Thanks for flagging this. PR #3416 was opened first and it overlaps with #3454 on the REST lazy pagination work for #3365.

The main difference is approach:

Lazy pagination iterator for catalog list methods #3416 moves the catalog list APIs toward iterator semantics across catalogs.
feat: add PaginationList for lazy page fetching in catalog list operations #3454 keeps list compatibility and adds a lazy PaginationList for REST.

I’m happy to coordinate and avoid duplicate work. If maintainers prefer the backward-compatible PaginationList approach, I can narrow or rework #3416 accordingly. If the iterator-based API is preferred, then I think #3416 remains the better home for the change.

Please let me know which direction you’d like us to converge on.

stark256-spec · 2026-06-04T05:17:04Z

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

GayathriSrividya · 2026-06-04T07:24:23Z

Thanks for the coordination note. Happy to defer to the maintainers on which approach lands.

To summarise the trade-off for the review:

#3416 (iterator semantics): Changes return types to Iterator[Identifier] across all catalog implementations. Clean interface but is a breaking change — callers using len(), slicing, in, or converting to list implicitly will need updates.

#3454 (PaginationList): Keeps list[Identifier] return type (backward compatible). Subclasses list so all existing callers work unchanged. Lazy fetching happens transparently during iteration.

If maintainers prefer the iterator direction from #3416, I'm happy to close this PR. If PaginationList's backward-compatibility is valued, I can rebase onto whatever #3416 lands on and focus only on the REST pagination piece.

Whatever the decision, closing is fine — the important thing is that #3365 gets resolved.

Thanks Yuya for the coordination note, and thanks @stark256-spec for laying out the trade-off clearly.

From my side, the main intent of #3416 was to address #3365 by making pagination explicit through iterator semantics across catalog list operations. I agree this is a larger API change than returning a list-like wrapper, but I think the benefit is that it makes the lazy behavior clearer to callers and avoids pretending the full result set is already materialized.

That said, I also understand the compatibility advantage of #3454. If maintainers prefer preserving the current list[Identifier] contract, I’m happy to coordinate and make sure the relevant parts of #3416, especially the page-by-page REST fetching/retry behavior, are carried over cleanly.

So I’m flexible on the final direction, but my preference is still the explicit iterator approach unless backward compatibility is the deciding factor. Either way, I agree we should avoid duplicate implementations and converge on one solution for #3365.

rambleraptor · 2026-06-04T20:14:10Z

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

corleyma · 2026-06-04T22:20:22Z

I'm not one of the maintainers, but I have a huge preference towards the backwards compatibility approach. Moving to an iterator-based approach means we have a breaking change where users lose many array methods unexpectedly.

Normally I'd agree, but since the iterator-based API is just much more pythonic, and the old API was designed absent any consideration for paging, why not just bite the bullet and do a major version bump? Less complexity to carry around in the implementation and it drives the project toward a cleaner API.

Especially these days, the worst case is somebody's LLM spends a few tokens updating for iterator semantics. Is it really that bad?

But also not a maintainer, I just think it's silly it has taken so long to land proper pagination support in pyiceberg because of compatibility issues in the age of coding agents.

Fokko · 2026-06-05T05:14:02Z

+T = TypeVar("T")
+
+
+class PaginationList(list[T]):


Can we move this type to typedef?

Fokko · 2026-06-05T05:22:15Z

-        while True:
-            if page_token:
-                params["pageToken"] = page_token
+        def _fetch_page(page_token: str) -> tuple[list[Identifier], str | None]:


I see theis tuple[list[T], str | None] coming back multiple times, should we introduce a TypeAliasType to improve readability?

Fokko · 2026-06-05T05:27:03Z

+    return pl, all_items
+
+
+class TestPaginationListSinglePage:


We don't use classes in the tests for the rest, and I think it would be good to keep that consistent

Fokko · 2026-06-05T05:27:17Z

        args:
          [
-            "--ignore=D100,D102,D101,D103,D104,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417",
+            "--ignore=D100,D102,D101,D103,D104,D105,D107,D203,D212,D213,D404,D405,D406,D407,D411,D413,D415,D417,D418",


What are we adding here?

Fokko · 2026-06-06T07:09:00Z

I went straight to the code, and missed the discussion here. I think it would be better to avoid breaking the API and go with the PaginationList instead.

Even if we end up agreeing that an Iterator is more suitable/pythonic, we need to do this in a major version bump, which is further away, so I think extending the list is a good path to move forward.

stark256-spec · 2026-06-06T13:25:51Z

PaginationList has @overload stubs for __getitem__ (two signatures — one for SupportsIndex, one for slice). Those stubs are magic methods with no docstring, which makes pydocstyle fire D105 ("Missing docstring in magic method"). Adding a docstring to an @overload stub isn't an option either — that triggers D418. Ended up suppressing both globally to break the deadlock.

Happy to revert this and drop # noqa: D105 on just those two stub lines instead — keeps the global config clean. Will do that in the next push along with the other feedback.

@overload

…sult alias, flatten tests - Move PaginationList from pyiceberg/utils/pagination.py into pyiceberg/typedef.py alongside the project's other core types; delete the now-empty module - Introduce _PageFetchResult TypeAlias for tuple[list[Identifier], str | None] in rest/__init__.py to avoid repeating the type on every _fetch_page signature - Flatten TestPaginationList* classes into top-level functions to match the style used throughout the rest of the test suite - Revert D105/D418 additions from .pre-commit-config.yaml; suppress D105 inline on the two @overload stubs that cannot have docstrings

stark256-spec · 2026-06-06T14:29:05Z

Addressed all four comments in e2bb799:

Move type to typedef: PaginationList is now in pyiceberg/typedef.py alongside the project's other core types. pyiceberg/utils/pagination.py is deleted.
TypeAliasType: Added _PageFetchResult: TypeAlias = tuple[list[Identifier], str | None] at module level in rest/__init__.py; all three _fetch_page signatures now use it.
Test classes: Flattened TestPaginationList* into top-level functions matching the style in the rest of the test suite.
pre-commit config: Reverted the D105/D418 additions. The root issue was that @overload stubs for __getitem__ are magic methods with no docstring (D105), and you can't add a docstring to an @overload stub (D418). Fixed with # noqa: D105 inline on those two stubs instead.

stark256-spec · 2026-06-06T15:18:55Z

Circling back on the pre-commit change — the inline # noqa: D105 approach I tried doesn't work with this linting stack, for two reasons:

pydocstyle doesn't process # noqa comments at all — that directive is a ruff/flake8 concept. pydocstyle ignores it and fires D105 anyway.
ruff then treats # noqa: D105 as an unnecessary noqa directive (RUF100 — D105 is not a ruff code), auto-strips the comment, and fails the hook because it modified the file.

The underlying issue is an inherent catch-22 in the @overload stubs for __getitem__:

D105 fires because @overload stubs are magic methods with no docstring
D418 fires if you add a docstring to an @overload stub
Neither # noqa nor any inline suppression survives this stack

The three options I can see:

Add D105 to the global ignore (what the original commit did) — one-line change, the stubs stay, type narrowing is preserved
Drop the @overload stubs — D105 goes away, but callers lose the narrowed return type (T vs list[T] depending on index vs slice)
Use type: ignore comments — these survive ruff but still don't suppress pydocstyle

Happy to go with whatever direction you prefer. If option 1 is acceptable, I can push that immediately.

stark256-spec added 2 commits June 3, 2026 13:43

stark256-spec added 6 commits June 3, 2026 17:50

fix(pagination): add docstrings to __getitem__ overload stubs for pyd…

9fbcdd3

…ocstyle D105

fix(pagination): use noqa:D105 on overload stubs instead of docstrings

10c2ba0

D418 prohibits docstrings on @overload stubs; suppress D105 (missing docstring in magic method) with inline noqa comments instead.

fix(tests): move PaginationList import to correct alphabetical positi…

d71317c

…on (isort)

fix(tests): account for RestCatalog config endpoint call in call_coun…

1780ae6

…t assertions RestCatalog.__init__ calls v1/config which increments rest_mock.call_count. Capture the baseline after construction and assert relative to it.

Fokko reviewed Jun 5, 2026

View reviewed changes

Fokko mentioned this pull request Jun 6, 2026

Implement iterator to lazily go through the paged response #3365

Open

fix: update PaginationList import in test_rest.py

392a8f5

Conversation

stark256-spec commented Jun 3, 2026

Problem

Solution

Design

Key properties

Changes

Uh oh!

rambleraptor commented Jun 3, 2026

Uh oh!

rambleraptor commented Jun 3, 2026

Uh oh!

stark256-spec commented Jun 3, 2026

Uh oh!

ebyhr commented Jun 4, 2026

Uh oh!

GayathriSrividya commented Jun 4, 2026

Uh oh!

stark256-spec commented Jun 4, 2026

Uh oh!

GayathriSrividya commented Jun 4, 2026

Uh oh!

rambleraptor commented Jun 4, 2026

Uh oh!

corleyma commented Jun 4, 2026

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Fokko Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Fokko commented Jun 6, 2026

Uh oh!

stark256-spec commented Jun 6, 2026

Uh oh!

stark256-spec commented Jun 6, 2026

Uh oh!

stark256-spec commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants