API, Core, Parquet: Add filter hint support to InternalData.ReadBuilder#16583

Open
kamcheungting-db wants to merge 4 commits into apache:main from kamcheungting-db:partition-filter-pushdown

@kamcheungting-db

InternalData.ReadBuilder had no way to pass a filter expression to the underlying format reader. For Parquet, this meant row-group skipping (already implemented via ParquetMetricsRowGroupFilter) was never reachable from internal metadata reads such as partition statistics scans.

This PR adds InternalData.read(format, file, filterHint) as the primary entry point for filtered reads.

The hint is a best-effort I/O optimization: Parquet uses it for row-group skipping; Avro ignores it. Because a hinted read may still return non-matching rows, callers remain responsible for correctness and must apply a residual filter.
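
To illustrate the contract described above, here is a minimal, self-contained toy model of the two-phase pattern. It does not use Iceberg or Parquet classes: `RowGroup`, `readWithHint`, and `applyResidual` are hypothetical names invented for this sketch, and the predicate is hard-coded as `id > bound`. It only shows why the residual filter is required whether or not the format honors the hint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Toy model only: none of these types are Iceberg or Parquet classes.
final class TwoPhaseFilterSketch {

  /** A row group with min/max statistics for a single long column "id". */
  static final class RowGroup {
    final long minId;
    final long maxId;
    final List<Long> ids;

    RowGroup(long minId, long maxId, List<Long> ids) {
      this.minId = minId;
      this.maxId = maxId;
      this.ids = ids;
    }
  }

  /** Phase 1: skip row groups whose stats prove no row satisfies id > bound. */
  static List<Long> readWithHint(List<RowGroup> groups, long bound, boolean formatUsesHint) {
    List<Long> rows = new ArrayList<>();
    for (RowGroup group : groups) {
      if (formatUsesHint && group.maxId <= bound) {
        continue; // whole group eliminated from statistics alone
      }
      rows.addAll(group.ids); // surviving groups are returned in full
    }
    return rows;
  }

  /** Phase 2: the residual filter the caller applies for per-row correctness. */
  static List<Long> applyResidual(List<Long> rows, LongPredicate residual) {
    List<Long> result = new ArrayList<>();
    for (long id : rows) {
      if (residual.test(id)) {
        result.add(id);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<RowGroup> groups = List.of(
        new RowGroup(1, 3, List.of(1L, 2L, 3L)),
        new RowGroup(4, 9, List.of(4L, 6L, 9L)));
    long bound = 5;

    // Hint honored: the first group is skipped, but 4 still leaks through.
    List<Long> hinted = readWithHint(groups, bound, true);
    System.out.println(hinted); // [4, 6, 9]

    // The residual filter gives the exact answer in both cases.
    System.out.println(applyResidual(hinted, id -> id > bound)); // [6, 9]
    System.out.println(
        applyResidual(readWithHint(groups, bound, false), id -> id > bound)); // [6, 9]
  }
}
```

In this model the hinted read and the unhinted read differ only in how many rows are scanned; the final result is identical once the residual filter runs.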

Adds InternalData.read(format, file, filterHint) as the primary entry
point for reads that benefit from I/O optimization. The filter hint is
passed to the format reader as a best-effort optimization — Parquet uses
it for row-group skipping via column statistics; Avro ignores it. Callers
must always apply a residual filter for correctness.

Changes:
- InternalData.ReadBuilder: add withFilterHint(Expression) default no-op
- InternalData.read(format, file, filterHint): new factory overload that
  applies the hint only when it is not alwaysTrue(); the existing
  read(format, file) overload delegates to this for a single implementation
- Parquet.ReadBuilder: override withFilterHint() to wire into ReadConf
  row-group skipping; keep filter() intact for existing callers such as
  ParquetFormatModel; withFilterHint() ANDs with any existing filter to
  avoid silently losing predicates
- PartitionStatisticsScan: add caseSensitive(boolean) default method
- BasePartitionStatisticsScan: implement filter(), project(), and
  caseSensitive() with two-phase filtering — read(format, file, filter)
  for row-group skipping + CloseableIterable.filter(Evaluator) for exact
  per-row correctness; readSchema() unions projection and filter-referenced
  columns so the evaluator always has the data it needs
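
The builder-level changes listed above can be sketched with a toy model. This is not the PR's code: `ReadBuilder`, `IgnoringBuilder`, `SkippingBuilder`, and the `ALWAYS_TRUE` sentinel are hypothetical stand-ins that use `java.util.function.Predicate` in place of Iceberg's `Expression`. The sketch only mirrors the described shape: a default no-op `withFilterHint`, an override that ANDs the hint with any existing filter, and a factory that skips the call when the hint is the always-true sentinel.

```java
import java.util.function.Predicate;

// Hypothetical stand-ins that only mimic the shape described in the list above.
final class FilterHintBuilderSketch {

  /** Sentinel playing the role of alwaysTrue(); compared by identity. */
  static final Predicate<Integer> ALWAYS_TRUE = value -> true;

  interface ReadBuilder {
    /** Default no-op, so formats that cannot use a hint need no changes. */
    default ReadBuilder withFilterHint(Predicate<Integer> hint) {
      return this;
    }
  }

  /** A format that ignores hints (the role Avro plays in the PR). */
  static final class IgnoringBuilder implements ReadBuilder {}

  /** A format that uses hints (the role Parquet plays in the PR). */
  static final class SkippingBuilder implements ReadBuilder {
    Predicate<Integer> currentFilter; // null until a filter or hint is set

    SkippingBuilder filter(Predicate<Integer> newFilter) {
      this.currentFilter = newFilter;
      return this;
    }

    @Override
    public ReadBuilder withFilterHint(Predicate<Integer> hint) {
      // AND with any existing filter so no predicate is silently dropped.
      this.currentFilter = (currentFilter == null) ? hint : currentFilter.and(hint);
      return this;
    }
  }

  /** Factory with a hint: only forwards hints that can actually prune something. */
  static ReadBuilder read(ReadBuilder builder, Predicate<Integer> hint) {
    return hint == ALWAYS_TRUE ? builder : builder.withFilterHint(hint);
  }

  /** Factory without a hint delegates to the hinted one. */
  static ReadBuilder read(ReadBuilder builder) {
    return read(builder, ALWAYS_TRUE);
  }

  public static void main(String[] args) {
    SkippingBuilder parquetLike = new SkippingBuilder().filter(v -> v > 0);
    read(parquetLike, v -> v < 10);
    System.out.println(parquetLike.currentFilter.test(5));  // true
    System.out.println(parquetLike.currentFilter.test(-1)); // false
    System.out.println(parquetLike.currentFilter.test(20)); // false

    IgnoringBuilder avroLike = new IgnoringBuilder();
    System.out.println(read(avroLike, v -> v < 10) == avroLike); // true
  }
}
```

Keeping the no-op as an interface default means a format opts in by overriding one method, and callers never need to know which formats honor the hint.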

Tests:
- TestInternalData: testWithFilterHintIsNoOpForAvro verifies Avro returns
  all rows regardless of hint;
  testWithFilterHintEnablesRowGroupSkippingForParquet writes two
  non-overlapping id ranges into separate row groups (forced via 1-byte
  row group size + per-record flush check) and asserts only high-range
  rows are returned without a residual filter, directly proving skipping
  happened; testWithFilterHintAndResidualFilterReturnsMatchingRowsOnly
  verifies the two-phase pattern on both formats
- PartitionStatisticsScanTestBase: testFilterNullThrows,
  testFilterAlwaysTrueReturnsAll, testFilterAlwaysFalseReturnsEmpty,
  testFilterByDataFileCount, testFilterColumnNotInProjectionIsStillApplied

Co-authored-by: Isaac
…down

# Conflicts:
#	api/src/main/java/org/apache/iceberg/PartitionStatisticsScan.java
#	core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java
#	core/src/test/java/org/apache/iceberg/PartitionStatisticsScanTestBase.java
@github-actions github-actions Bot removed the API label Jun 10, 2026
…ection scan

- TestInternalData.testWithFilterHintSkipsAllRowGroupsWhenNoneMatchForParquet:
  with one row group per record and no residual filter, a hint matching no row
  group (id > 100000) returns zero rows, proving the hint drives row-group
  elimination (all rows would return if the wiring regressed).
- PartitionStatisticsScanTestBase.testCaseInsensitiveFilterWithProjectionOnColumnOutsideProjection:
  a case-insensitive filter on an uppercase stats column not in the projection
  exercises the residual Evaluator together with the projection/filter
  read-schema union.
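
The projection/filter read-schema union exercised by that test can be pictured with a small toy model. It is a sketch under assumptions, not the PR's readSchema(): columns are plain strings, `resolve` and `readColumns` are hypothetical helpers, and the column names in `main` are illustrative. The point is only that filter-referenced columns are resolved (optionally ignoring case) and added to the projected ones, so the residual evaluator can see them.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: columns are plain strings rather than Iceberg schema fields.
final class ReadSchemaUnionSketch {

  /** Resolves a referenced name against the table columns, optionally ignoring case. */
  static String resolve(List<String> tableColumns, String name, boolean caseSensitive) {
    for (String column : tableColumns) {
      boolean matches = caseSensitive ? column.equals(name) : column.equalsIgnoreCase(name);
      if (matches) {
        return column;
      }
    }
    // Toy behavior: fail loudly when the reference cannot be bound.
    throw new IllegalArgumentException("Cannot find column: " + name);
  }

  /** Union of projected columns and the columns the filter refers to. */
  static Set<String> readColumns(
      List<String> tableColumns,
      List<String> projection,
      List<String> filterRefs,
      boolean caseSensitive) {
    Set<String> result = new LinkedHashSet<>(projection); // keeps projection order
    for (String ref : filterRefs) {
      result.add(resolve(tableColumns, ref, caseSensitive));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> table = List.of("partition", "spec_id", "data_file_count");
    List<String> projection = List.of("partition");
    List<String> filterRefs = List.of("DATA_FILE_COUNT"); // uppercase reference

    // Case-insensitive: the filter column is added even though it is not projected.
    System.out.println(readColumns(table, projection, filterRefs, false));
    // [partition, data_file_count]
  }
}
```

With `caseSensitive` set to `true`, the same uppercase reference fails to resolve in this model, which is why the flag has to reach both the read-schema construction and the residual evaluator.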

Co-authored-by: Isaac
Fixes spotlessJavaCheck violations flagged by build-checks CI: collapse the
multi-line ternary in Parquet.ReadBuilder.withFilterHint and rewrap the
InternalData.read javadoc.

Co-authored-by: Isaac