diego-release: startup resilience fixes for BOSH upgrades (tnz-96144, tnz-96145) #1140
Merged
kart2bc
approved these changes
Jun 1, 2026
The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from NewIngressClient, making the loggregator dial non-blocking. The auctioneer now starts successfully even when metron is unavailable and retries the connection lazily in the background.
The test "when the metron agent isn't up → exits with non-zero status code" was asserting the old blocking behaviour (exit after 1s dial timeout). With the non-blocking dial the auctioneer never exits, causing the test to time out.
Update the test to assert the new correct behaviour: the auctioneer starts and keeps running when metron is temporarily unavailable.
Co-authored-by: Cursor <cursoragent@cursor.com>
The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from NewIngressClient, making the loggregator dial non-blocking. The route-emitter now starts successfully even when the loggregator agent is unavailable and retries the connection lazily in the background.
The test "when emitter cannot connect to the loggregator agent → exit with non-zero status code" was asserting the old blocking behaviour (exit after 1s dial timeout). With the non-blocking dial the emitter never exits, causing the test to time out after 15s.
Update the test to assert the new correct behaviour: the route-emitter starts and keeps running when the loggregator agent is temporarily unavailable.
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
This PR bundles two independent startup resilience improvements that reduce unnecessary Monit restart cycles during BOSH rolling upgrades of Diego cells.
Fix 1 — Garden health check retry on startup (tnz-96144)
During BOSH starting_jobs, Monit starts Garden and rep simultaneously. Garden requires ~60–90 seconds to warm up before it can successfully create containers. Previously, the executor's Garden health check would fail fatally on the very first transient error, causing rep to exit and enter a ~53s Monit restart cycle.
This fix (landed in cloudfoundry/executor#130) makes the initial health check resilient to transient errors by retrying in a bounded loop until GardenHealthcheckTimeout (default 10m) expires.
Key properties preserved:
The cell is correctly marked as unhealthy during the retry phase so BBS does not schedule LRPs there.
An UnrecoverableError (e.g. bad TLS certs) still causes an immediate fatal exit — only transient connection errors trigger the retry.
A fatal error is triggered if the full timeout is reached, ensuring permanently broken Garden instances are still detected and the CellUnhealthy metric is emitted.
The executor fix was picked up via the upstream bot submodule bump in 1aa487f (executor → 9e97ac1b), so no additional submodule change is needed in this PR.
Fix 2 — Non-blocking Loggregator dial on startup (tnz-96145)
The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from NewIngressClient. Previously, if metron_agent was not yet listening at startup (common during stemcell rolls), any component using NewIngressClient would fail to start and enter a Monit restart cycle (~53s). With the non-blocking dial, the gRPC connection is established lazily in the background: the component starts immediately and retries the connection automatically.
The non-blocking dial affects four components whose test suites asserted the old "exits when metron is down" behaviour. This PR updates those tests to assert the new correct behaviour: the component starts and keeps running when metron is temporarily unavailable.
The rep submodule bump also picks up cloudfoundry/rep#84 (tnz-96147 — silk-daemon init retry), which landed upstream before this PR.
Backward Compatibility
Breaking Change? No
Both fixes modify internal startup resilience only. No external APIs, metric definitions, or failure conditions are changed. Both are fully backwards compatible and take effect immediately on all deployments.