fix: GCS composite object has no MD5 hash #271

Open
yuechao-qin wants to merge 1 commit into master from ycq/fix-gcs-composite-object-md5-hash-crash

@yuechao-qin yuechao-qin commented Jun 4, 2026


Summary

The orchestrator crashes with AttributeError: 'NoneType' object has no attribute 'encode' when collecting output artifacts from GCS that are composite objects. Composite objects (created by parallel uploads or compose operations) have no MD5 hash — blob.md5_hash is None.

The crash happens in the upstream SDK (cloud-pipelines==0.26.3.12) at GoogleCloudStorageProvider._get_info_from_uri, during output collection after a container completes. This is before any data reaches our code, so there's no way to fix it at the save/consume layer.
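
For reviewers who want to see the failure mode in isolation, here is a minimal sketch. It assumes the upstream code converts the base64 MD5 that GCS reports into a hex digest by calling .encode on it; md5_hex_from_blob is a hypothetical stand-in, not the SDK's actual code, and SimpleNamespace stands in for a real blob object.

```python
import base64
from types import SimpleNamespace


def md5_hex_from_blob(blob):
    # Hypothetical stand-in for the upstream conversion (assumption):
    # GCS reports MD5 as a base64 string, which is converted to hex here.
    return base64.b64decode(blob.md5_hash.encode("ascii")).hex()


# A regular object carries a base64-encoded 16-byte MD5 digest.
normal = SimpleNamespace(md5_hash=base64.b64encode(bytes(16)).decode("ascii"))
# A composite object reports no MD5 at all.
composite = SimpleNamespace(md5_hash=None)

print(md5_hex_from_blob(normal))

try:
    md5_hex_from_blob(composite)
except AttributeError as error:
    # Same error class and message as in the crash described above.
    print(f"AttributeError: {error}")
```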

Fix

  • New file: patched_google_cloud_storage.py — subclasses the SDK's GoogleCloudStorageProvider and overrides _get_info_from_uri to handle None MD5 gracefully
  • When MD5 is missing, substitutes "no_md5_<UTC timestamp>" as the hash value — this forces a cache miss for any downstream task consuming that artifact (Solution A: safe, no false cache hits)
  • Updated launchers (google_kubernetes_launchers.py, kubernetes_launchers.py) to use PatchedGoogleCloudStorageProvider instead of the upstream provider
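
The fallback described in the list above could look roughly like the sketch below. This is not the code in patched_google_cloud_storage.py: the helper name md5_hex_or_fallback, the logger, and the timestamp format are assumptions made for illustration, and SimpleNamespace stands in for a GCS blob.

```python
import base64
import datetime
import logging
from types import SimpleNamespace

_LOGGER = logging.getLogger(__name__)


def md5_hex_or_fallback(blob):
    """Return the blob's MD5 as hex, or a timestamp-based placeholder."""
    if blob.md5_hash is None:
        # Composite object: no MD5 available. The placeholder changes on
        # each call, so downstream cache keys derived from it do not match
        # earlier ones. The timestamp format here is an assumption.
        now = datetime.datetime.now(datetime.timezone.utc)
        placeholder = "no_md5_" + now.strftime("%Y%m%dT%H%M%S%fZ")
        _LOGGER.warning(
            "Blob %s has no MD5 hash; using placeholder %s", blob.name, placeholder
        )
        return placeholder
    # Regular object: GCS reports MD5 as base64, convert it to hex.
    return base64.b64decode(blob.md5_hash).hex()


regular = SimpleNamespace(
    name="outputs/data.bin",
    md5_hash=base64.b64encode(bytes(range(16))).decode("ascii"),
)
composite = SimpleNamespace(name="outputs/big.bin", md5_hash=None)

print(md5_hex_or_fallback(regular))
print(md5_hex_or_fallback(composite))
```

A placeholder keeps the return type a plain string, so callers that expect a hash value do not need to change.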

Failing runs

Trade-offs

  • Composite object outputs will always cause a cache miss for downstream tasks — this is intentional and safe
  • Long-term fix: compute our own MD5 by downloading the file (cost scales with file size)
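
As a rough illustration of the long-term option, the sketch below computes an MD5 by streaming content in chunks, so memory use stays bounded while time and egress still scale with file size. The helper name md5_hex_of_stream and the chunk size are assumptions; io.BytesIO stands in for a file-like object opened on the GCS object.

```python
import hashlib
import io


def md5_hex_of_stream(stream, chunk_size=8 * 1024 * 1024):
    """Compute an MD5 hex digest by reading a binary stream in chunks."""
    digest = hashlib.md5()
    # iter() with a sentinel stops once read() returns empty bytes.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a downloaded composite object (assumption for the demo).
payload = b"example artifact content" * 1000
print(md5_hex_of_stream(io.BytesIO(payload), chunk_size=4096))
```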

Tests

7 unit tests added covering:

  • Normal blobs return correct MD5 hex digest
  • Composite blobs return unique timestamp fallback
  • Timestamps are unique across calls
  • Warning is logged for composite objects
  • Provider handles single files, composite files, and mixed directories
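
For readers without the diff at hand, tests of this kind might be shaped like the sketch below. It is not the PR's test file: the function under test is a simplified, hypothetical stand-in defined inline, and the blobs are unittest.mock objects rather than real GCS blobs.

```python
import base64
import datetime
import logging
from unittest import mock

logger = logging.getLogger("patched_gcs_sketch")


def md5_hex_or_fallback(blob):
    # Simplified stand-in for the patched behavior (assumption).
    if blob.md5_hash is None:
        now = datetime.datetime.now(datetime.timezone.utc)
        logger.warning("Blob has no MD5 hash; using a timestamp placeholder")
        return "no_md5_" + now.isoformat()
    return base64.b64decode(blob.md5_hash).hex()


def test_normal_blob_returns_hex_digest():
    raw_digest = b"\x01" * 16
    blob = mock.Mock(md5_hash=base64.b64encode(raw_digest).decode("ascii"))
    assert md5_hex_or_fallback(blob) == raw_digest.hex()


def test_composite_blob_returns_fallback_and_warns():
    blob = mock.Mock(md5_hash=None)
    with mock.patch.object(logger, "warning") as warning:
        result = md5_hex_or_fallback(blob)
    assert result.startswith("no_md5_")
    warning.assert_called_once()
```

These functions follow pytest's naming convention, so pytest would collect them; they can also be called directly.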


This stack of pull requests is managed by Graphite.

@yuechao-qin yuechao-qin marked this pull request as ready for review June 4, 2026 21:27
@yuechao-qin yuechao-qin requested a review from Ark-kun as a code owner June 4, 2026 21:27
import sys
from unittest import mock

import pytest