Skip to content

[lake/paimon] TieringSourceReader adjust to scan arrow record batch and write arrow record batch to lake#3430

Merged
luoyuxia merged 3 commits into
apache:mainfrom
luoyuxia:fix/tiering-scan-write-arrow-record-batch-2966
Jun 5, 2026
Merged

[lake/paimon] TieringSourceReader adjust to scan arrow record batch and write arrow record batch to lake#3430
luoyuxia merged 3 commits into
apache:mainfrom
luoyuxia:fix/tiering-scan-write-arrow-record-batch-2966

Conversation

@luoyuxia
Copy link
Copy Markdown
Contributor

@luoyuxia luoyuxia commented Jun 4, 2026

Purpose

Linked issue: close #2966

Adjust TieringSourceReader to scan arrow record batch and write arrow record batch to lake, as a follow-up of #2965.

Brief change log

  • Adjust TieringSplitReader to scan arrow record batch instead of row-based records
  • Add AppendOnlyArrowBatchHelper and Arrow2PaimonVectorConverter to support writing arrow batches directly to Paimon
  • Introduce ArrowBatchData to carry arrow batch data through the tiering pipeline
  • Fix split completion check and arrow type conversion for tiering

Tests

  • PaimonTieringITCase
  • TieringSplitReaderTest

API and Format

No.

Documentation

No.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Flink tiering pipeline to consume log data as Arrow record batches (for ARROW log tables) and write those batches directly into Paimon, reducing row-by-row conversion overhead and improving tiering efficiency for append-only tables.

Changes:

  • Add an Arrow record-batch tiering path in TieringSplitReader, including split completion logic based on the last scanned offset.
  • Extend the Paimon lake writer and append-only writer to accept RecordBatch writes and persist Arrow batches via a dedicated helper.
  • Enhance Arrow batch utilities (ArrowBatchData, ArrowRecordBatch) and adjust tiering tests to align with batched writes.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
fluss-lake/fluss-lake-paimon/src/test/java/org/apache/fluss/lake/paimon/tiering/PaimonTieringITCase.java Adjusts alter-table tiering test to write log-table rows in one larger write call.
fluss-lake/fluss-lake-paimon/src/main/java/org/apache/paimon/arrow/converter/Arrow2PaimonVectorConverter.java Adds Arrow-to-Paimon column vector conversion utilities (copied/adapted).
fluss-lake/fluss-lake-paimon/src/main/java/org/apache/fluss/lake/paimon/tiering/PaimonLakeWriter.java Implements SupportsRecordBatchWrite and routes Arrow batches to append-only writer.
fluss-lake/fluss-lake-paimon/src/main/java/org/apache/fluss/lake/paimon/tiering/append/AppendOnlyWriter.java Adds lazy Arrow-batch writing support and ensures helper resources are closed.
fluss-lake/fluss-lake-paimon/src/main/java/org/apache/fluss/lake/paimon/tiering/append/AppendOnlyArrowBatchHelper.java New helper that enriches batches with system columns and writes via Paimon ArrowBundleRecords.
fluss-lake/fluss-lake-paimon/pom.xml Adds paimon-arrow and Arrow dependencies as provided for runtime batch writing.
fluss-flink/fluss-flink-common/src/test/java/org/apache/fluss/flink/tiering/source/TieringSplitReaderTest.java Updates tests to pass a user classloader into the reader.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/TieringSplitReader.java Adds Arrow batch polling/handling path and refactors shared log-processing workflow.
fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/tiering/source/TieringSourceReader.java Wires user code classloader into TieringSplitReader construction.
fluss-common/src/main/java/org/apache/fluss/record/ArrowBatchData.java Adds batch size estimation and a truncate/ownership-transfer helper.
fluss-common/src/main/java/org/apache/fluss/lake/batch/ArrowRecordBatch.java Wraps ArrowBatchData and makes it closeable for safe resource management.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@luoyuxia luoyuxia force-pushed the fix/tiering-scan-write-arrow-record-batch-2966 branch from 94c3077 to a00c17c Compare June 4, 2026 08:42
@beryllw
Copy link
Copy Markdown
Contributor

beryllw commented Jun 4, 2026

LGTM!

@luoyuxia luoyuxia force-pushed the fix/tiering-scan-write-arrow-record-batch-2966 branch from 4a6fe84 to acb5196 Compare June 4, 2026 10:57
@luoyuxia luoyuxia force-pushed the fix/tiering-scan-write-arrow-record-batch-2966 branch from acb5196 to c1f14f4 Compare June 5, 2026 02:49
@luoyuxia luoyuxia force-pushed the fix/tiering-scan-write-arrow-record-batch-2966 branch from 5392f95 to b6476bf Compare June 5, 2026 05:57
@luoyuxia luoyuxia force-pushed the fix/tiering-scan-write-arrow-record-batch-2966 branch from bab1f3d to e297634 Compare June 5, 2026 12:03
@luoyuxia luoyuxia merged commit 0a699b3 into apache:main Jun 5, 2026
7 checks passed
throws IOException;
}

private static boolean checkUnshadedArrowAvailable(ClassLoader classLoader) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we check for ArrowBundleRecords from paimon-arrow?
it's provided, so it might fail, if I understand this relationship correctly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TieringSourceReader adjust to scan arrow record batch and write arrow record batch to lake

4 participants