Skip to content

Core: Introduce builder for TrackedFile#16769

Open
gaborkaszab wants to merge 1 commit into
apache:mainfrom
gaborkaszab:main_trackedfile_builder
Open

Core: Introduce builder for TrackedFile#16769
gaborkaszab wants to merge 1 commit into
apache:mainfrom
gaborkaszab:main_trackedfile_builder

Conversation

@gaborkaszab

Copy link
Copy Markdown
Contributor

No description provided.

@github-actions github-actions Bot added the core label Jun 11, 2026
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileStruct.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
this.manifestInfo = manifestInfo;
this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null;
this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null;
this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up, not a blocker: We are doing back and forth conversion btw array and list. I know this follows existing pattern like BaseFile. I am wondering if it makes sense to just switch to arrays for these to splitOffsets and equalityIds.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can open a separate PR for this where we can continue the discussion. WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, need a bit broader discussion on that topic.

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
return entry;
}

private static void verifyFieldsAreFromSource(TrackedFile entry, TrackedFile source) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two assertions missing from this helper:

  • tracking().firstRowId() — the from() / deleted() / replaced() factories all preserve firstRowId from source (via TrackingBuilder.from() / terminal()), but no test pins this down. Worth adding given that manifest-rewrite correctness depends on it.
  • tracking().snapshotId() — EXISTING/DELETED/REPLACED entries derived from a source should retain source.snapshotId(), not the new commit's snapshot id. Currently only updateDVWhenBuildingDataFileFromSource checks this directly.

Both fit naturally into the tracking-field block at lines 767-774.

@gaborkaszab gaborkaszab Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • snapshot ID: The description says Snapshot ID where the file was added, deleted, or replaced. So except for EXISTING we should use the new one and not the one from source. I added some asserts to verify this on the tests, but we can't in the verifyFieldsAreFromSource.

  • firstRowId: Currently, there is no way we can set this field in tracking. We have inheritFrom() for seq nums, but as agreed we don't set firstRowId from the builder, however we don't have a setter function either. I think this a functionality gap. May I open a PR to add a new setter for this field too? Probably inheritFrom() is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for snapshotId verification, I agree that we can just validate it is the same value as from source. it will be good to add a Javadoc to briefly explain that why snapshotId is not verified in the verifyFieldsAreFromSource method.

I think this a functionality gap. May I open a PR to add a new setter for this field too? Probably inheritFrom() is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).

this makes sense to me. let's discuss it in the dedicated PR you just created.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to the method explaining why snapshot ID is not verified.

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from f9c86c3 to 2ad4d31 Compare June 17, 2026 10:04

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look @stevenzwu !
I addressed all the comments. I have a follow-up question for you firstRowId, though.

Since MODIFIED is merged, let me rebase and eliminate the relevant TODOs

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
this.manifestInfo = manifestInfo;
this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null;
this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null;
this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can open a separate PR for this where we can continue the discussion. WDYT?

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java
return entry;
}

private static void verifyFieldsAreFromSource(TrackedFile entry, TrackedFile source) {

@gaborkaszab gaborkaszab Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • snapshot ID: The description says Snapshot ID where the file was added, deleted, or replaced. So except for EXISTING we should use the new one and not the one from source. I added some asserts to verify this on the tests, but we can't in the verifyFieldsAreFromSource.

  • firstRowId: Currently, there is no way we can set this field in tracking. We have inheritFrom() for seq nums, but as agreed we don't set firstRowId from the builder, however we don't have a setter function either. I think this a functionality gap. May I open a PR to add a new setter for this field too? Probably inheritFrom() is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 2ad4d31 to 2319f92 Compare June 17, 2026 12:05
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

Rebased with main and resolved the following:

  • TrackedFileAdapters to use the builder instead of the constructor
  • TestTrackedFileBuilder to adjust to MODIFIED state where necessary

assertThat(deleteFile.format()).isEqualTo(FileFormat.AVRO);
assertThat(deleteFile.recordCount()).isEqualTo(50L);
assertThat(deleteFile.fileSizeInBytes()).isEqualTo(512L);
assertThat(deleteFile.sortOrderId()).isEqualTo(5);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EQ-deletes don't allow sort order and splitOffset through the builder. Removed the checks before no longer makes sense.

@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 2319f92 to bf22506 Compare June 17, 2026 12:10
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch 2 times, most recently from 6060c35 to 082a487 Compare June 17, 2026 12:37
@gaborkaszab gaborkaszab requested a review from stevenzwu June 17, 2026 13:19
this.manifestInfo = manifestInfo;
this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null;
this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null;
this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, need a bit broader discussion on that topic.

}

@Override
public boolean equals(Object other) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructLike classes in Iceberg often don't implement the equals method. but in this case, I am ok with it for usage in the TrackedFileBuilder

return entry;
}

private static void verifyFieldsAreFromSource(TrackedFile entry, TrackedFile source) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for snapshotId verification, I agree that we can just validate it is the same value as from source. it will be good to add a Javadoc to briefly explain that why snapshotId is not verified in the verifyFieldsAreFromSource method.

I think this a functionality gap. May I open a PR to add a new setter for this field too? Probably inheritFrom() is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).

this makes sense to me. let's discuss it in the dedicated PR you just created.

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
@stevenzwu stevenzwu moved this to In review in V4: metadata tree Jun 17, 2026
.build();

// Passed for unpartitioned test files, where there is no partition tuple.
private static final PartitionData NO_PARTITION = null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous coverage also include non partitioned cases. this changed it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact we didn't use this non-partitioned case for anything. It was needed to construct a dummy TrackedFileStruct object through the constructor with required fields. But in fact the partition part was also a dummy input, it was never exercised in the test.
Minimal file with no tracking, used by the rejection and null-tracking tests.

Now, I removed that constructor for required fields from TrackedFileStruct, and instead of using that I construct the bare minimum dummy object required for failure tests, and it apparently doesn't need a dummy partition parameter.
Maybe this answers the question you asked here.

/** Minimal file with no tracking, used by the rejection and null-tracking tests. */
private static TrackedFileStruct trackedFile(FileContent contentType) {
return new TrackedFileStruct(
null, contentType, "s3://bucket/file", FileFormat.PARQUET, NO_PARTITION, 1L, 1L);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change this one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my answer above on NO_PARTITION. I removed this constructor because I don't think it is appropriate to use one that accepts the required fields only. I see 2 ways: 1) is to use the builder, 2) is in tests if we want to do something that breaks internal invariant of the class then we can use the field setters.
Here, the purpose is to construct a dummy object for failure tests where only the content type matters. Everything else on the object seems noise.

Renamed the method to dummyTrackedFile to articulate that this creates an internal state that doesn't meet invariants.


TrackedFileBuilder sortOrderId(int newSortOrderId) {
Preconditions.checkArgument(
contentType == FileContent.DATA,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contentType == FileContent.DATA restriction on sortOrderId (here) and splitOffsets (line 263-271) is a new invariant introduced by this PR, not an inherited one. It diverges from the spec and from existing v3 writer behavior:

Spec (format/spec.md, data_file fields) marks both fields _optional_ | _optional_ | _optional_ across v1/v2/v3 with no per-content-type restriction:

| _optional_ | _optional_ | _optional_ | 132  split_offsets    | list<long> | Split offsets for the data file ...
| _optional_ | _optional_ | _optional_ | 140  sort_order_id    | int        | ID representing sort order for this file

v3 writer behaviorEqualityDeleteWriter actively sets BOTH on equality delete files (EqualityDeleteWriter.java:80-92):

FileMetadata.deleteFileBuilder(spec)
    .ofEqualityDeletes(equalityFieldIds)
    ...
    .withSplitOffsets(appender.splitOffsets())
    .withSortOrder(sortOrder)
    .build();

And FileMetadata.Builder.build() (FileMetadata.java:274-286) explicitly defaults sortOrderId to SortOrder.unsorted().orderId() for EQUALITY_DELETES if not set:

switch (content) {
  case POSITION_DELETES:
    Preconditions.checkArgument(sortOrderId == null, "Position delete file should not have sort order");
    break;
  case EQUALITY_DELETES:
    if (sortOrderId == null) {
      sortOrderId = SortOrder.unsorted().orderId();
    }
    break;
  ...
}

So in v3 / spec terms, POSITION_DELETES is the only content type where sort_order_id is prohibited; equality deletes always carry one. The v4 schema (TrackedFile.SORT_ORDER_ID, TrackedFile.SPLIT_OFFSETS) doesn't gate by content type either, and TrackedEqualityDeleteFile (TrackedFileAdapters.java:156-167) blindly delegates both fields — meaning a v4 EQ-delete entry that does carry them will surface them on the legacy DeleteFile.

This restriction also drove the assertion deletion in TestTrackedFileAdapters.testEqualityDeleteFileAdapterDelegation (your reply: #16769 (comment)). That removal is locally consistent with the builder's contract but rests on the unstated premise that the contract itself is correct — which the spec doesn't support.

Two options:

  1. Relax the restriction to match the spec — allow sortOrderId on DATA | EQUALITY_DELETES (gate against POSITION_DELETES if anything), and drop the content-type check on splitOffsets entirely. Then the deleted adapter assertions can come back.
  2. If there's a deliberate v4 design decision to drop these fields from EQ-deletes, please call it out in the PR description (currently empty) and link to wherever it's discussed — it's a content-type-semantics change, not a builder ergonomics one, and shouldn't be silent.

Same applies to the analogous check on splitOffsets at line 263-271.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for walking me through the thought process!

I followed the spec PR here that says for sort_order_id:
'sort_order_id' must be null when 'content_type' is not 0.
and for split_offsets it says:
Split offsets for the data file that I figured means only valid for content_type=0

I saw that these are allowed fields in V3, hence I called this out with the comment in the tests where I remove the checks on these for eq-deletes.

Let me involve @amogh-jahagirdar into the discussion. I wonder how much intentional it was to introduce this restriction in V4 compared to V3. If it was, then I can call this semantics change in the PR description.

file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {4, 5}));
file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(200L));
file.set(EQUALITY_IDS_ORDINAL, ImmutableList.of(1, 2, 3));
(TrackedFileStruct)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we change the variable type to the TrackedFile interface (instead of casting here)?

populateDerivedAndInheritedFields also takes the interface as an arg.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. done

TrackingStruct tracking = new TrackingStruct();
tracking.set(STATUS_ORDINAL, EntryStatus.ADDED.id());
tracking.set(SNAPSHOT_ID_ORDINAL, 42L);
private static void populateDerivedAndInheritedFields(TrackedFile file) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method name is confusing tom me. we are not inheriting values here. sth like populateTrackfingFields might be more intuitive.

I am also wondering if we should expose a setter of Tracking object in the builder. I know normal production code probably doesn't need it. we can mark the package private method with @VisibleForTesting, which is a common pattern we do in other classes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I initially tried to implement this using the inheritFrom and setFirstRowId functions, but I saw that some tests expect FILE_SEQ_NUM and DATA_SEQ_NUM to be different, that is impossible to construct using these functions, so I reverted back using the field setters.
Renamed as you suggested.

I'm not sure about a setter for Tracking. We can already use the set method coming from SupportsIndexProjection as a cheat for tests, that we already do for other fields. I wouldn't expose more functionality for this even if @VisibleForTesting. WDYT?

@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 082a487 to 072a4e9 Compare June 18, 2026 12:29

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @stevenzwu for looking into this. I address your comments. For the new restrictions of eq-deletes I invited @amogh-jahagirdar to share some more context.

writer_format_version landed in the meantime. Let me rebase and get rid of the relevant TODOs


TrackedFileBuilder sortOrderId(int newSortOrderId) {
Preconditions.checkArgument(
contentType == FileContent.DATA,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for walking me through the thought process!

I followed the spec PR here that says for sort_order_id:
'sort_order_id' must be null when 'content_type' is not 0.
and for split_offsets it says:
Split offsets for the data file that I figured means only valid for content_type=0

I saw that these are allowed fields in V3, hence I called this out with the comment in the tests where I remove the checks on these for eq-deletes.

Let me involve @amogh-jahagirdar into the discussion. I wonder how much intentional it was to introduce this restriction in V4 compared to V3. If it was, then I can call this semantics change in the PR description.

.build();

// Passed for unpartitioned test files, where there is no partition tuple.
private static final PartitionData NO_PARTITION = null;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact we didn't use this non-partitioned case for anything. It was needed to construct a dummy TrackedFileStruct object through the constructor with required fields. But in fact the partition part was also a dummy input, it was never exercised in the test.
Minimal file with no tracking, used by the rejection and null-tracking tests.

Now, I removed that constructor for required fields from TrackedFileStruct, and instead of using that I construct the bare minimum dummy object required for failure tests, and it apparently doesn't need a dummy partition parameter.
Maybe this answers the question you asked here.

file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {4, 5}));
file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(200L));
file.set(EQUALITY_IDS_ORDINAL, ImmutableList.of(1, 2, 3));
(TrackedFileStruct)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. done

/** Minimal file with no tracking, used by the rejection and null-tracking tests. */
private static TrackedFileStruct trackedFile(FileContent contentType) {
return new TrackedFileStruct(
null, contentType, "s3://bucket/file", FileFormat.PARQUET, NO_PARTITION, 1L, 1L);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my answer above on NO_PARTITION. I removed this constructor because I don't think it is appropriate to use one that accepts the required fields only. I see 2 ways: 1) is to use the builder, 2) is in tests if we want to do something that breaks internal invariant of the class then we can use the field setters.
Here, the purpose is to construct a dummy object for failure tests where only the content type matters. Everything else on the object seems noise.

Renamed the method to dummyTrackedFile to articulate that this creates an internal state that doesn't meet invariants.

TrackingStruct tracking = new TrackingStruct();
tracking.set(STATUS_ORDINAL, EntryStatus.ADDED.id());
tracking.set(SNAPSHOT_ID_ORDINAL, 42L);
private static void populateDerivedAndInheritedFields(TrackedFile file) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I initially tried to implement this using the inheritFrom and setFirstRowId functions, but I saw that some tests expect FILE_SEQ_NUM and DATA_SEQ_NUM to be different, that is impossible to construct using these functions, so I reverted back using the field setters.
Renamed as you suggested.

I'm not sure about a setter for Tracking. We can already use the set method coming from SupportsIndexProjection as a cheat for tests, that we already do for other fields. I wouldn't expose more functionality for this even if @VisibleForTesting. WDYT?

@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch 2 times, most recently from 2089622 to 0360884 Compare June 18, 2026 13:17
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 0360884 to 707b01c Compare June 18, 2026 14:13
@gaborkaszab gaborkaszab requested a review from stevenzwu June 18, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

3 participants