feat(snapshots): Retry rate-limited image uploads with backoff#3326
feat(snapshots): Retry rate-limited image uploads with backoff#3326NicoHinderling wants to merge 5 commits into
Conversation
|
85fa116 to
afc5712
Compare
8e23bfa to
117fecd
Compare
Snapshot image uploads now retry 429/503 and transient failures instead of aborting the whole command on the first error. Each attempt rebuilds a batch of only the operations that have not yet succeeded, reusing the CLI's existing backoff convention (get_default_backoff + Config::max_retries).
117fecd to
2b9dda1
Compare
An unattributable OperationResult::Error that is not retryable (e.g. a malformed batch response) now breaks the retry loop immediately instead of exhausting the full backoff, matching the behavior for attributed fatal errors.
…oads Report the real objectstore error text for every still-pending image instead of attaching it to only the first, and prefer a non-retryable (fatal) batch error over a retryable one when choosing the message.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 931697b. Configure here.
When the retry loop stopped on a non-retryable unattributed batch error, pending uploads still reported a stale per-key retryable error (such as a 429 from an earlier attempt) instead of the fatal error that actually ended the run. Skip the per-key last_error when the loop stops on a fatal unattributed error so all pending uploads surface the real cause. Co-Authored-By: Claude <noreply@anthropic.com>
| .with_max_times(Config::current().max_retries() as usize) | ||
| .build(); | ||
|
|
||
| let failures = upload_with_retry(missing_uploads, delays, |pending| { |
There was a problem hiding this comment.
Is this something we should be handling in application code? It seems like it might be the responsibility of objectstore_client to handle retries for us.
There was a problem hiding this comment.
it's not currently an available option but you're not wrong. that's probably a better place for this logic
There was a problem hiding this comment.
ok i msged objectstore team - ill close this PR for now and work with them on a change for their repo instead
| stopped_on_fatal_unattributed = true; | ||
| break; | ||
| } | ||
| let Some(delay) = delays.next() else { |
There was a problem hiding this comment.
It doesn't look like this honors the Retry-After header which we should likely try to use first, and some kind of exponential backoff as a fallback.

Snapshot image uploads now retry rate-limited and transient failures instead of failing the whole command on the first error. Large batches were hitting Objectstore's per-operation rate limiter (HTTP 429
batch error: rate limited) and aborting partway through, even though most operations succeeded.Each attempt re-sends a batch of only the operations that have not yet succeeded, so retries stay small and target just the failing keys. The backoff schedule and retry count reuse the CLI's existing HTTP conventions (
get_default_backoff()andConfig::max_retries(), overridable viaSENTRY_HTTP_MAX_RETRIES).Retryable vs fatal classification
Per-operation results are drained from the batch stream and split into succeeded / retryable / fatal. Retryable covers 429 and 503 (whether surfaced as a per-op
OperationFailure, a whole-batchBatch, or an individualReqwest) plus transient connect/timeout errors. Fatal errors are reported immediately and never retried; unattributable batch errors preserve their real message in the final report.Reporting unchanged
On exhaustion the command reports the failing images and exits non-zero in the same format as before.