Core: Introduce builder for TrackedFile#16769
Conversation
| this.manifestInfo = manifestInfo; | ||
| this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null; | ||
| this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null; | ||
| this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null; |
There was a problem hiding this comment.
Follow-up, not a blocker: We are doing back and forth conversion btw array and list. I know this follows existing pattern like BaseFile. I am wondering if it makes sense to just switch to arrays for these to splitOffsets and equalityIds.
There was a problem hiding this comment.
I can open a separate PR for this where we can continue the discussion. WDYT?
There was a problem hiding this comment.
yes, need a bit broader discussion on that topic.
| return entry; | ||
| } | ||
|
|
||
| private static void verifyFieldsAreFromSource(TrackedFile entry, TrackedFile source) { |
There was a problem hiding this comment.
Two assertions missing from this helper:
tracking().firstRowId()— thefrom()/deleted()/replaced()factories all preservefirstRowIdfrom source (viaTrackingBuilder.from()/terminal()), but no test pins this down. Worth adding given that manifest-rewrite correctness depends on it.tracking().snapshotId()— EXISTING/DELETED/REPLACED entries derived from a source should retainsource.snapshotId(), not the new commit's snapshot id. Currently onlyupdateDVWhenBuildingDataFileFromSourcechecks this directly.
Both fit naturally into the tracking-field block at lines 767-774.
There was a problem hiding this comment.
-
snapshot ID: The description saysSnapshot ID where the file was added, deleted, or replaced. So except forEXISTINGwe should use the new one and not the one from source. I added some asserts to verify this on the tests, but we can't in theverifyFieldsAreFromSource. -
firstRowId: Currently, there is no way we can set this field in tracking. We haveinheritFrom()for seq nums, but as agreed we don't setfirstRowIdfrom the builder, however we don't have a setter function either. I think this a functionality gap. May I open a PR to add a new setter for this field too? ProbablyinheritFrom()is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).
There was a problem hiding this comment.
for snapshotId verification, I agree that we can just validate it is the same value as from source. it will be good to add a Javadoc to briefly explain that why snapshotId is not verified in the verifyFieldsAreFromSource method.
I think this a functionality gap. May I open a PR to add a new setter for this field too? Probably inheritFrom() is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).
this makes sense to me. let's discuss it in the dedicated PR you just created.
There was a problem hiding this comment.
Added a comment to the method explaining why snapshot ID is not verified.
f9c86c3 to
2ad4d31
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for taking a look @stevenzwu !
I addressed all the comments. I have a follow-up question for you firstRowId, though.
Since MODIFIED is merged, let me rebase and eliminate the relevant TODOs
| this.manifestInfo = manifestInfo; | ||
| this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null; | ||
| this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null; | ||
| this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null; |
There was a problem hiding this comment.
I can open a separate PR for this where we can continue the discussion. WDYT?
| return entry; | ||
| } | ||
|
|
||
| private static void verifyFieldsAreFromSource(TrackedFile entry, TrackedFile source) { |
There was a problem hiding this comment.
-
snapshot ID: The description saysSnapshot ID where the file was added, deleted, or replaced. So except forEXISTINGwe should use the new one and not the one from source. I added some asserts to verify this on the tests, but we can't in theverifyFieldsAreFromSource. -
firstRowId: Currently, there is no way we can set this field in tracking. We haveinheritFrom()for seq nums, but as agreed we don't setfirstRowIdfrom the builder, however we don't have a setter function either. I think this a functionality gap. May I open a PR to add a new setter for this field too? ProbablyinheritFrom()is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).
2ad4d31 to
2319f92
Compare
|
Rebased with main and resolved the following:
|
| assertThat(deleteFile.format()).isEqualTo(FileFormat.AVRO); | ||
| assertThat(deleteFile.recordCount()).isEqualTo(50L); | ||
| assertThat(deleteFile.fileSizeInBytes()).isEqualTo(512L); | ||
| assertThat(deleteFile.sortOrderId()).isEqualTo(5); |
There was a problem hiding this comment.
EQ-deletes don't allow sort order and splitOffset through the builder. Removed the checks before no longer makes sense.
2319f92 to
bf22506
Compare
6060c35 to
082a487
Compare
| this.manifestInfo = manifestInfo; | ||
| this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null; | ||
| this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null; | ||
| this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null; |
There was a problem hiding this comment.
yes, need a bit broader discussion on that topic.
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object other) { |
There was a problem hiding this comment.
StructLike classes in Iceberg often don't implement the equals method. but in this case, I am ok with it for usage in the TrackedFileBuilder
| return entry; | ||
| } | ||
|
|
||
| private static void verifyFieldsAreFromSource(TrackedFile entry, TrackedFile source) { |
There was a problem hiding this comment.
for snapshotId verification, I agree that we can just validate it is the same value as from source. it will be good to add a Javadoc to briefly explain that why snapshotId is not verified in the verifyFieldsAreFromSource method.
I think this a functionality gap. May I open a PR to add a new setter for this field too? Probably inheritFrom() is not a suitable place as firstRowId is not directly inherited from the parent manifest (but calculated based on the other entries in the manifest with unassigned firstRowId).
this makes sense to me. let's discuss it in the dedicated PR you just created.
| .build(); | ||
|
|
||
| // Passed for unpartitioned test files, where there is no partition tuple. | ||
| private static final PartitionData NO_PARTITION = null; |
There was a problem hiding this comment.
previous coverage also include non partitioned cases. this changed it.
There was a problem hiding this comment.
In fact we didn't use this non-partitioned case for anything. It was needed to construct a dummy TrackedFileStruct object through the constructor with required fields. But in fact the partition part was also a dummy input, it was never exercised in the test.
Minimal file with no tracking, used by the rejection and null-tracking tests.
Now, I removed that constructor for required fields from TrackedFileStruct, and instead of using that I construct the bare minimum dummy object required for failure tests, and it apparently doesn't need a dummy partition parameter.
Maybe this answers the question you asked here.
| /** Minimal file with no tracking, used by the rejection and null-tracking tests. */ | ||
| private static TrackedFileStruct trackedFile(FileContent contentType) { | ||
| return new TrackedFileStruct( | ||
| null, contentType, "s3://bucket/file", FileFormat.PARQUET, NO_PARTITION, 1L, 1L); |
There was a problem hiding this comment.
See my answer above on NO_PARTITION. I removed this constructor because I don't think it is appropriate to use one that accepts the required fields only. I see 2 ways: 1) is to use the builder, 2) is in tests if we want to do something that breaks internal invariant of the class then we can use the field setters.
Here, the purpose is to construct a dummy object for failure tests where only the content type matters. Everything else on the object seems noise.
Renamed the method to dummyTrackedFile to articulate that this creates an internal state that doesn't meet invariants.
|
|
||
| TrackedFileBuilder sortOrderId(int newSortOrderId) { | ||
| Preconditions.checkArgument( | ||
| contentType == FileContent.DATA, |
There was a problem hiding this comment.
The contentType == FileContent.DATA restriction on sortOrderId (here) and splitOffsets (line 263-271) is a new invariant introduced by this PR, not an inherited one. It diverges from the spec and from existing v3 writer behavior:
Spec (format/spec.md, data_file fields) marks both fields _optional_ | _optional_ | _optional_ across v1/v2/v3 with no per-content-type restriction:
| _optional_ | _optional_ | _optional_ | 132 split_offsets | list<long> | Split offsets for the data file ...
| _optional_ | _optional_ | _optional_ | 140 sort_order_id | int | ID representing sort order for this file
v3 writer behavior — EqualityDeleteWriter actively sets BOTH on equality delete files (EqualityDeleteWriter.java:80-92):
FileMetadata.deleteFileBuilder(spec)
.ofEqualityDeletes(equalityFieldIds)
...
.withSplitOffsets(appender.splitOffsets())
.withSortOrder(sortOrder)
.build();And FileMetadata.Builder.build() (FileMetadata.java:274-286) explicitly defaults sortOrderId to SortOrder.unsorted().orderId() for EQUALITY_DELETES if not set:
switch (content) {
case POSITION_DELETES:
Preconditions.checkArgument(sortOrderId == null, "Position delete file should not have sort order");
break;
case EQUALITY_DELETES:
if (sortOrderId == null) {
sortOrderId = SortOrder.unsorted().orderId();
}
break;
...
}So in v3 / spec terms, POSITION_DELETES is the only content type where sort_order_id is prohibited; equality deletes always carry one. The v4 schema (TrackedFile.SORT_ORDER_ID, TrackedFile.SPLIT_OFFSETS) doesn't gate by content type either, and TrackedEqualityDeleteFile (TrackedFileAdapters.java:156-167) blindly delegates both fields — meaning a v4 EQ-delete entry that does carry them will surface them on the legacy DeleteFile.
This restriction also drove the assertion deletion in TestTrackedFileAdapters.testEqualityDeleteFileAdapterDelegation (your reply: #16769 (comment)). That removal is locally consistent with the builder's contract but rests on the unstated premise that the contract itself is correct — which the spec doesn't support.
Two options:
- Relax the restriction to match the spec — allow
sortOrderIdonDATA | EQUALITY_DELETES(gate against POSITION_DELETES if anything), and drop the content-type check onsplitOffsetsentirely. Then the deleted adapter assertions can come back. - If there's a deliberate v4 design decision to drop these fields from EQ-deletes, please call it out in the PR description (currently empty) and link to wherever it's discussed — it's a content-type-semantics change, not a builder ergonomics one, and shouldn't be silent.
Same applies to the analogous check on splitOffsets at line 263-271.
There was a problem hiding this comment.
Thanks for walking me through the thought process!
I followed the spec PR here that says for sort_order_id:
'sort_order_id' must be null when 'content_type' is not 0.
and for split_offsets it says:
Split offsets for the data file that I figured means only valid for content_type=0
I saw that these are allowed fields in V3, hence I called this out with the comment in the tests where I remove the checks on these for eq-deletes.
Let me involve @amogh-jahagirdar into the discussion. I wonder how much intentional it was to introduce this restriction in V4 compared to V3. If it was, then I can call this semantics change in the PR description.
| file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {4, 5})); | ||
| file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(200L)); | ||
| file.set(EQUALITY_IDS_ORDINAL, ImmutableList.of(1, 2, 3)); | ||
| (TrackedFileStruct) |
There was a problem hiding this comment.
why don't we change the variable type to the TrackedFile interface (instead of casting here)?
populateDerivedAndInheritedFields also takes the interface as an arg.
| TrackingStruct tracking = new TrackingStruct(); | ||
| tracking.set(STATUS_ORDINAL, EntryStatus.ADDED.id()); | ||
| tracking.set(SNAPSHOT_ID_ORDINAL, 42L); | ||
| private static void populateDerivedAndInheritedFields(TrackedFile file) { |
There was a problem hiding this comment.
this method name is confusing tom me. we are not inheriting values here. sth like populateTrackfingFields might be more intuitive.
I am also wondering if we should expose a setter of Tracking object in the builder. I know normal production code probably doesn't need it. we can mark the package private method with @VisibleForTesting, which is a common pattern we do in other classes.
There was a problem hiding this comment.
You're right. I initially tried to implement this using the inheritFrom and setFirstRowId functions, but I saw that some tests expect FILE_SEQ_NUM and DATA_SEQ_NUM to be different, that is impossible to construct using these functions, so I reverted back using the field setters.
Renamed as you suggested.
I'm not sure about a setter for Tracking. We can already use the set method coming from SupportsIndexProjection as a cheat for tests, that we already do for other fields. I wouldn't expose more functionality for this even if @VisibleForTesting. WDYT?
082a487 to
072a4e9
Compare
There was a problem hiding this comment.
Thanks again @stevenzwu for looking into this. I address your comments. For the new restrictions of eq-deletes I invited @amogh-jahagirdar to share some more context.
writer_format_version landed in the meantime. Let me rebase and get rid of the relevant TODOs
|
|
||
| TrackedFileBuilder sortOrderId(int newSortOrderId) { | ||
| Preconditions.checkArgument( | ||
| contentType == FileContent.DATA, |
There was a problem hiding this comment.
Thanks for walking me through the thought process!
I followed the spec PR here that says for sort_order_id:
'sort_order_id' must be null when 'content_type' is not 0.
and for split_offsets it says:
Split offsets for the data file that I figured means only valid for content_type=0
I saw that these are allowed fields in V3, hence I called this out with the comment in the tests where I remove the checks on these for eq-deletes.
Let me involve @amogh-jahagirdar into the discussion. I wonder how much intentional it was to introduce this restriction in V4 compared to V3. If it was, then I can call this semantics change in the PR description.
| .build(); | ||
|
|
||
| // Passed for unpartitioned test files, where there is no partition tuple. | ||
| private static final PartitionData NO_PARTITION = null; |
There was a problem hiding this comment.
In fact we didn't use this non-partitioned case for anything. It was needed to construct a dummy TrackedFileStruct object through the constructor with required fields. But in fact the partition part was also a dummy input, it was never exercised in the test.
Minimal file with no tracking, used by the rejection and null-tracking tests.
Now, I removed that constructor for required fields from TrackedFileStruct, and instead of using that I construct the bare minimum dummy object required for failure tests, and it apparently doesn't need a dummy partition parameter.
Maybe this answers the question you asked here.
| file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {4, 5})); | ||
| file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(200L)); | ||
| file.set(EQUALITY_IDS_ORDINAL, ImmutableList.of(1, 2, 3)); | ||
| (TrackedFileStruct) |
| /** Minimal file with no tracking, used by the rejection and null-tracking tests. */ | ||
| private static TrackedFileStruct trackedFile(FileContent contentType) { | ||
| return new TrackedFileStruct( | ||
| null, contentType, "s3://bucket/file", FileFormat.PARQUET, NO_PARTITION, 1L, 1L); |
There was a problem hiding this comment.
See my answer above on NO_PARTITION. I removed this constructor because I don't think it is appropriate to use one that accepts the required fields only. I see 2 ways: 1) is to use the builder, 2) is in tests if we want to do something that breaks internal invariant of the class then we can use the field setters.
Here, the purpose is to construct a dummy object for failure tests where only the content type matters. Everything else on the object seems noise.
Renamed the method to dummyTrackedFile to articulate that this creates an internal state that doesn't meet invariants.
| TrackingStruct tracking = new TrackingStruct(); | ||
| tracking.set(STATUS_ORDINAL, EntryStatus.ADDED.id()); | ||
| tracking.set(SNAPSHOT_ID_ORDINAL, 42L); | ||
| private static void populateDerivedAndInheritedFields(TrackedFile file) { |
There was a problem hiding this comment.
You're right. I initially tried to implement this using the inheritFrom and setFirstRowId functions, but I saw that some tests expect FILE_SEQ_NUM and DATA_SEQ_NUM to be different, that is impossible to construct using these functions, so I reverted back using the field setters.
Renamed as you suggested.
I'm not sure about a setter for Tracking. We can already use the set method coming from SupportsIndexProjection as a cheat for tests, that we already do for other fields. I wouldn't expose more functionality for this even if @VisibleForTesting. WDYT?
2089622 to
0360884
Compare
0360884 to
707b01c
Compare
No description provided.