table: native scan and Arrow stream integration#2
Draft
abnobdoss wants to merge 10 commits into
Draft
Conversation
Fan scan tasks out over a thread pool of native ArrowReaders so decode uses multiple cores instead of one, streaming batches as they complete with at most one decoded batch per shard in flight. A default batch size amortizes the per-batch GIL handoff that otherwise dominates the fan-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A limited scan previously always fell back to PyArrow. Push the limit through the native reader instead: truncate the streamed result at the limit (slicing the crossing batch and closing the shards so they stop decoding early) and cap the batch size to the limit so a small limit does not decode a full batch per shard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The native reader emits arrow-rs's own types (string rather than large_string, and run-end-encoded identity-partition columns), so its output diverged from the PyArrow scan path. Cast every batch to schema_to_pyarrow(projected_schema), decoding run-end-encoded columns first since there is no direct cast kernel for them, so the native path is a faithful drop-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PYICEBERG_RUST_SCAN_PLANNING plans the scan in pyiceberg-core (Table.plan_files) instead of PyIceberg's Python manifest planning, then streams the planned tasks through the same sharded, casted reader as the read path. Falls back to PyArrow on any scan pyiceberg-core cannot handle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The native path crashed on a normal REST+S3 catalog instead of degrading: non-str FileIO props (the REST auth manager) reached the Rust binding, S3 path-style/region were never translated to the opendal keys, and the fallback guard missed decode-time errors and schemeless local paths. Filter props to strings, mirror PyArrow's path-style default and pass region, prime the first batch so a decode mismatch falls back too, and skip native for bare paths. Scoped to static-credential FileIO -- a refreshing auth manager is still not carried to the native path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
plan_files() now resolves a ScanPlanner -- local manifest planning or REST server-side, exactly as before -- or uses one injected on the scan. That injection point is the seam a Rust or engine-specific planner plugs into without touching any call site. The bundled RustScanPlanner is an opt-in, deliberately-unimplemented stub: the native plan output exposes only booleans/counts, not the residual, partition data, or deletes needed to rebuild a faithful FileScanTask, so the fused native read path stays the supported native route. Env flags and default resolution are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
60a6ee3 to
deaa353
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status: active integration branch
Scope:
Current value proven on the rebased branch:
Verification:
Known limits:
No external issue references.