diego-release: startup resilience fixes for BOSH upgrades (tnz-96144, tnz-96145)#1140

Merged: kart2bc merged 2 commits into cloudfoundry:develop from navinms711:develop on Jun 1, 2026
@navinms711 navinms711 commented May 29, 2026

Contributor

Summary

This PR bundles two independent startup resilience improvements that reduce unnecessary Monit restart cycles during BOSH rolling upgrades of Diego cells.

Fix 1 — Garden health check retry on startup (tnz-96144)

During BOSH starting_jobs, Monit starts Garden and rep simultaneously. Garden requires ~60–90 seconds to warm up before it can successfully create containers. Previously, the executor's Garden health check would fail fatally on the very first transient error, causing rep to exit and enter a ~53s Monit restart cycle.

This fix (landed in cloudfoundry/executor#130) makes the initial health check resilient to transient errors by retrying in a bounded loop until GardenHealthcheckTimeout (default 10m) expires.

Key properties preserved:

- The cell is correctly marked as unhealthy during the retry phase so BBS does not schedule LRPs there.
- An UnrecoverableError (e.g. bad TLS certs) still causes an immediate fatal exit; only transient connection errors trigger the retry.
- A fatal error is triggered if the full timeout is reached, ensuring permanently broken Garden instances are still detected and the CellUnhealthy metric is emitted.

Picked up via the upstream bot submodule bump in 1aa487f (executor → 9e97ac1b). No additional submodule change needed in this PR.

Fix 2 — Non-blocking Loggregator dial on startup (tnz-96145)

The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from NewIngressClient. Previously, if metron_agent was not yet listening at startup (common during stemcell rolls), any component using NewIngressClient would fail to start and enter a Monit restart cycle (~53s). With the non-blocking dial, the gRPC connection is established lazily in the background: the component starts immediately and retries the connection automatically.

This affects four components whose test suites asserted the old "exits when metron is down" behaviour. This PR updates those tests to assert the new correct behaviour: the component starts and keeps running when metron is temporarily unavailable.

The rep submodule bump also picks up cloudfoundry/rep#84 (tnz-96147 — silk-daemon init retry), which landed upstream before this PR.

Backward Compatibility

Breaking Change? No

Both fixes modify internal startup resilience only. No external APIs, metric definitions, or failure conditions are changed. Both are fully backwards compatible and take effect immediately on all deployments.

@navinms711 navinms711 requested a review from a team as a code owner May 29, 2026 20:33
@navinms711 navinms711 changed the title diego-release: bump executor to gardenhealth retry fix (tnz-96144) diego-release: startup resilience fixes for BOSH upgrades (tnz-96144, tnz-96145) Jun 1, 2026
@github-project-automation github-project-automation Bot moved this from Inbox to Pending Merge | Prioritized in Application Runtime Platform Working Group Jun 1, 2026
navinms711 and others added 2 commits June 1, 2026 20:23
The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from
NewIngressClient, making the loggregator dial non-blocking. The auctioneer
now starts successfully even when metron is unavailable and retries the
connection lazily in the background.

The test "when the metron agent isn't up → exits with non-zero status code"
was asserting the old blocking behaviour (exit after 1s dial timeout). With
the non-blocking dial the auctioneer never exits, causing the test to time
out.

Update the test to assert the new correct behaviour: the auctioneer starts
and keeps running when metron is temporarily unavailable.

Co-authored-by: Cursor <cursoragent@cursor.com>

The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from
NewIngressClient, making the loggregator dial non-blocking. The
route-emitter now starts successfully even when the loggregator agent is
unavailable and retries the connection lazily in the background.

The test "when emitter cannot connect to the loggregator agent → exit with
non-zero status code" was asserting the old blocking behaviour (exit after
1s dial timeout). With the non-blocking dial the emitter never exits,
causing the test to time out after 15s.

Update the test to assert the new correct behaviour: the route-emitter
starts and keeps running when the loggregator agent is temporarily
unavailable.

Co-authored-by: Cursor <cursoragent@cursor.com>
@kart2bc kart2bc merged commit 0d00a4b into cloudfoundry:develop Jun 1, 2026
9 checks passed
@github-project-automation github-project-automation Bot moved this from Pending Merge | Prioritized to Done in Application Runtime Platform Working Group Jun 1, 2026