fix: include NULL in delete predicate for evolved partition fields in dynamic_partition_overwrite#3460
Open
GayathriSrividya wants to merge 8 commits into
Conversation
Replace the collect-then-return approach with proper generator functions that yield results page by page. Extract per-page fetch logic into dedicated helper methods (_fetch_tables_page, _fetch_views_page, _fetch_namespaces_page) decorated with @Retry so authentication retries work correctly per page. Co-authored-by: Yuya Ebihara <ebyhry@gmail.com>
dynamic_partition_overwrite builds a delete predicate from the current partition spec. After spec evolution, data files written under older specs may carry NULL for fields that were added later. The predicate category=A AND region=us never matches those files (region=NULL ≠ us), so they are silently left behind after the overwrite. Fix: detect which fields in the current spec are absent from at least one historical spec (i.e., were added via evolution), and extend the predicate for those fields to also accept NULL: category=A AND (region=us OR region IS NULL) This ensures pre-evolution data files are included in the delete pass while files from other partitions (region=eu, etc.) are correctly kept. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…evolution
After partition spec evolution, data files written under older specs carry
NULL for fields that didn't exist as partition fields at write time.
A single delete predicate (e.g. category=A AND region=us) never matches
those files because the strict-metrics evaluator sees region=NULL ≠ us.
The previous attempt (OR IS NULL in the predicate) was too broad: it also
deleted spec-1 null-partition files that should be preserved.
Correct fix: build a per-spec delete predicate map where each historical
spec gets a predicate that:
- Uses exact-match for fields the spec already had (region=us for spec-1)
- Also accepts NULL for fields the spec was missing (region=us OR NULL
for spec-0, since spec-0 had no region partition field)
These are stored on _SnapshotProducer._per_spec_predicates. _compute_deletes
looks up the right predicate for each manifest's spec_id instead of using
the single global predicate for metrics evaluation. The global predicate
(OR of all per-spec predicates) is still used for the manifest evaluator so
no manifests are skipped.
The rewrite logic from Transaction.delete() is inlined into
dynamic_partition_overwrite so that partial-match files are still rewritten
correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Or() requires exactly two arguments; when a table has only one spec, Or(*[single_pred]) raises TypeError. Guard with len check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
|
@GayathriSrividya Thanks for opening this PR, but I think you've pulled in some unrelated commits here |
Author
Thanks, @Fokko. Could you please point out which commits look unrelated? I’ll clean up the branch and keep this PR focused only on the #3148 fix. Or I will close this and open fresh PR |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3148
Root cause
dynamic_partition_overwritebuilds its delete predicate using the current partition spec only:After a partition spec evolution (e.g. adding a
regionfield), data files written under the older spec carryNULLfor the new field — becauseregionsimply wasn't part of the schema at write time.The predicate produced for the new partition
{category=A, region=us}is:The
_StrictMetricsEvaluatorcorrectly sees that spec-0 files haveregion = NULLfor every row, soregion = 'us'can never beROWS_MUST_MATCH. Those files are silently kept, leaving stale data behind.Fix
Detect which fields in the current spec were absent from at least one historical spec ("evolved" fields). For those fields, extend the per-field clause to also accept
NULL:This causes the metrics evaluator to flag pre-evolution files (all-null
region) asROWS_MUST_MATCHand delete them, while correctly preserving files in other non-overlapping partitions (e.g.region = 'eu').Verification
Added two unit tests for
_build_partition_predicate(with / withoutevolved_source_ids) and manually confirmed the repro from the issue now produces[999]instead of[1, 2, 999].