feat(storage): full object checksum: parse finalize_time and server crc32c in async object stream #17261
Open
chandra-siri wants to merge 6 commits into
Contributor
Code Review
This pull request introduces tracking for object finalization and full object CRC32C checksums in the asynchronous read stream. The feedback focuses on simplifying the production code by removing logic added solely to accommodate unit test mocks (such as checking for a .seconds attribute). Instead, it is recommended to mock finalize_time as a standard datetime.datetime or None in the unit tests, which allows the production code to rely on standard isinstance checks.
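A minimal sketch of the recommendation above, using only the standard library. The helper `parse_is_finalized` is hypothetical and stands in for whatever check the production code ends up using. The point is that the test sets `finalize_time` explicitly to a `datetime.datetime` or `None`, so a plain `isinstance` check is enough and no `.seconds` probing is needed.

```python
from datetime import datetime, timezone
from unittest import mock


def parse_is_finalized(metadata) -> bool:
    """Hypothetical helper: an object counts as finalized when
    finalize_time is a real datetime."""
    return isinstance(getattr(metadata, "finalize_time", None), datetime)


# Test doubles with finalize_time set explicitly, as the review suggests.
finalized = mock.Mock(finalize_time=datetime(2026, 1, 1, tzinfo=timezone.utc))
unfinalized = mock.Mock(finalize_time=None)

# A bare Mock auto-creates finalize_time as another Mock, which is not a
# datetime, so the isinstance check treats it as unfinalized.
bare = mock.Mock()

results = (
    parse_is_finalized(finalized),
    parse_is_finalized(unfinalized),
    parse_is_finalized(bare),
)
```

With this shape, the unit tests decide the outcome purely through the value they assign to `finalize_time`.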
…est-asyncio issues
kalragauri
approved these changes
May 27, 2026
if (
    hasattr(response.metadata, "finalize_time")
    and response.metadata.finalize_time
    and response.metadata.finalize_time.second > 0
Is it possible that this is 0? What happens if the object is finalized exactly on a minute boundary?
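The concern can be reproduced with the standard library alone. In the sketch below both helper names are hypothetical. The first mirrors only the condition quoted in the diff, and the second is one possible alternative that checks for the presence of a timestamp, not its seconds component.

```python
from datetime import datetime, timezone
from typing import Optional


def finalized_by_seconds(finalize_time: Optional[datetime]) -> bool:
    # Mirrors the condition under review: requires a non-zero seconds field.
    return bool(finalize_time and finalize_time.second > 0)


def finalized_by_presence(finalize_time: Optional[datetime]) -> bool:
    # Alternative: any real datetime means the object was finalized.
    return isinstance(finalize_time, datetime)


# An object finalized exactly on a minute boundary has second == 0.
on_boundary = datetime(2026, 5, 27, 12, 30, 0, tzinfo=timezone.utc)
mid_minute = datetime(2026, 5, 27, 12, 30, 17, tzinfo=timezone.utc)

by_seconds = [finalized_by_seconds(t) for t in (on_boundary, mid_minute, None)]
by_presence = [finalized_by_presence(t) for t in (on_boundary, mid_minute, None)]
```

The seconds-based check reports the boundary case as unfinalized, while the presence-based check does not.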
self._is_stream_open: bool = False
self.persisted_size: Optional[int] = None
self.is_finalized: bool = False
self.full_obj_server_crc32c: Optional[int] = None
Should these be private similar to _is_stream_open?
1. Overview of the Solution
This solution implements end-to-end full-object checksum validation in AsyncMultiRangeDownloader for the asynchronous Google Cloud Storage Python client library. As asynchronous multiplexed downloads of non-contiguous ranges are performed concurrently over a single bidirectional gRPC connection, this feature automatically and incrementally calculates a rolling checksum as bytes arrive and validates it against the server's authoritative object checksum once the download completes.

The technical approach consists of three coordinated layers:

- _AsyncReadObjectStream (Stream Ingestion): Safely extracts the authoritative server checksum (full_obj_server_crc32c) and finalization status (is_finalized) from the object metadata received in the first data payload response of the stream.
- _ReadResumptionStrategy & _DownloadState (Verification Logic): Computes an isolated, persistent rolling checksum in the individual _DownloadState object to ensure calculations do not bleed across concurrent multiplexed ranges. Crucially, the rolling hash updates only after buffer writes succeed to prevent state corruption during retry re-connects, raising a DataCorruption exception on completion if a mismatch occurs.
- AsyncMultiRangeDownloader (Orchestration & Cleanup): Detects candidate full-object ranges (e.g., (0, 0) or (0, persisted_size)), propagates checksum settings to the resumption strategy, and guarantees robust cleanup (closing the stream immediately and unregistering IDs) if data corruption or write errors occur.

2. What This PR Specifically Does
This PR implements Step 1 of the solution, Stream Metadata Ingestion:

- _AsyncReadObjectStream: safely parses GCS object metadata from the first data payload of the response.
- _AsyncReadObjectStream.open(): the is_finalized, full_obj_server_crc32c, and object_metadata attributes.
- tests/unit/conftest.py: resolves compatibility issues with pytest-asyncio under Python 3.11+.
- test_async_read_object_stream.py: verifies that finalization status and server-authoritative checksums are correctly extracted, or skipped for unfinalized objects.
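The ingestion step, and the way it would feed the verification layer from the overview, can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `parse_metadata`, `DownloadState`, and the local `DataCorruption` class are hypothetical stand-ins, the metadata is a `SimpleNamespace` and not a real gRPC response, the `checksums.crc32c` field path is assumed, and CRC32C is computed with a small pure-Python table so the example does not depend on any third-party package.

```python
import io
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Optional, Tuple


class DataCorruption(Exception):
    """Local stand-in for the library's DataCorruption exception."""


def _make_table():
    # Reflected CRC32C (Castagnoli) polynomial.
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_extend(crc: int, data: bytes) -> int:
    """Extend a running CRC32C value with more bytes (start from 0)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def parse_metadata(metadata) -> Tuple[bool, Optional[int]]:
    """Return (is_finalized, full_obj_server_crc32c) from first-response metadata."""
    finalize_time = getattr(metadata, "finalize_time", None)
    if not isinstance(finalize_time, datetime):
        return False, None  # unfinalized: skip the server checksum
    checksums = getattr(metadata, "checksums", None)  # assumed field path
    return True, getattr(checksums, "crc32c", None)


class DownloadState:
    """Per-range state, so checksums never mix across concurrent ranges."""

    def __init__(self, buffer, expected_crc32c: Optional[int]):
        self.buffer = buffer
        self.expected_crc32c = expected_crc32c
        self.rolling_crc32c = 0

    def on_chunk(self, data: bytes) -> None:
        self.buffer.write(data)  # may raise; the hash is untouched if it does
        self.rolling_crc32c = crc32c_extend(self.rolling_crc32c, data)

    def finish(self) -> None:
        expected = self.expected_crc32c
        if expected is not None and self.rolling_crc32c != expected:
            raise DataCorruption(
                f"crc32c mismatch: got {self.rolling_crc32c:#010x}, "
                f"expected {expected:#010x}"
            )


payload = b"123456789"
metadata = SimpleNamespace(
    finalize_time=datetime(2026, 5, 27, tzinfo=timezone.utc),
    checksums=SimpleNamespace(crc32c=0xE3069283),  # CRC32C of the payload
)
is_finalized, server_crc = parse_metadata(metadata)

state = DownloadState(io.BytesIO(), server_crc)
for chunk in (payload[:4], payload[4:]):
    state.on_chunk(chunk)
state.finish()  # no exception: rolling and server checksums agree
```

Updating the rolling value only after `buffer.write` returns means a failed write leaves the checksum consistent with what was actually stored, which is the property the overview relies on for retries.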