diego-release: startup resilience fixes for BOSH upgrades (tnz-96144, tnz-96145) by navinms711 · Pull Request #1140 · cloudfoundry/diego-release

navinms711 · 2026-05-29T20:33:08Z

Summary

This PR bundles two independent startup resilience improvements that reduce unnecessary Monit restart cycles during BOSH rolling upgrades of Diego cells.

Fix 1 — Garden health check retry on startup (tnz-96144)

During BOSH starting_jobs, Monit starts Garden and rep simultaneously. Garden requires ~60–90 seconds to warm up before it can successfully create containers. Previously, the executor's Garden health check would fail fatally on the very first transient error, causing rep to exit and enter a ~53s Monit restart cycle.

This fix (landed in cloudfoundry/executor#130) makes the initial health check resilient to transient errors by retrying in a bounded loop until GardenHealthcheckTimeout (default 10m) expires.

Key properties preserved:

The cell is correctly marked as unhealthy during the retry phase so BBS does not schedule LRPs there.
An UnrecoverableError (e.g. bad TLS certs) still causes an immediate fatal exit; only transient connection errors trigger the retry.
A fatal error is triggered if the full timeout is reached, ensuring permanently broken Garden instances are still detected and the CellUnhealthy metric is emitted.

Picked up via the upstream bot submodule bump in 1aa487f (executor → 9e97ac1b). No additional submodule change needed in this PR.

Fix 2 — Non-blocking Loggregator dial on startup (tnz-96145)

The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from NewIngressClient. Previously, if metron_agent was not yet listening at startup (common during stemcell rolls), any component using NewIngressClient would fail to start and enter a Monit restart cycle (~53s). With the non-blocking dial, the gRPC connection is established lazily in the background: the component starts immediately and retries the connection automatically.

This affects four components whose test suites asserted the old "exits when metron is down" behaviour. This PR updates those tests to assert the new correct behaviour: the component starts and keeps running when metron is temporarily unavailable.

The rep submodule bump also picks up cloudfoundry/rep#84 (tnz-96147 — silk-daemon init retry), which landed upstream before this PR.

Backward Compatibility

Breaking Change? No

Both fixes modify internal startup resilience only. No external APIs, metric definitions, or failure conditions are changed. Both are fully backwards compatible and take effect immediately on all deployments.

The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from NewIngressClient, making the loggregator dial non-blocking. The auctioneer now starts successfully even when metron is unavailable and retries the connection lazily in the background.

The test "when the metron agent isn't up → exits with non-zero status code" was asserting the old blocking behaviour (exit after 1s dial timeout). With the non-blocking dial the auctioneer never exits, causing the test to time out.

Update the test to assert the new correct behaviour: the auctioneer starts and keeps running when metron is temporarily unavailable.

Co-authored-by: Cursor <cursoragent@cursor.com>

The diego-logging-client change (tnz-96145) removed grpc.WithBlock() from NewIngressClient, making the loggregator dial non-blocking. The route-emitter now starts successfully even when the loggregator agent is unavailable and retries the connection lazily in the background.

The test "when emitter cannot connect to the loggregator agent → exit with non-zero status code" was asserting the old blocking behaviour (exit after 1s dial timeout). With the non-blocking dial the emitter never exits, causing the test to time out after 15s.

Update the test to assert the new correct behaviour: the route-emitter starts and keeps running when the loggregator agent is temporarily unavailable.

Co-authored-by: Cursor <cursoragent@cursor.com>

navinms711 requested a review from a team as a code owner May 29, 2026 20:33

cf-foundation-community-automation Bot added this to Application Runtime Platform Working Group May 29, 2026

cf-foundation-community-automation Bot moved this to Inbox in Application Runtime Platform Working Group May 29, 2026

navinms711 force-pushed the develop branch from 8180b2f to a941a34 Compare June 1, 2026 17:23

navinms711 changed the title ~~diego-release: bump executor to gardenhealth retry fix (tnz-96144)~~ diego-release: startup resilience fixes for BOSH upgrades (tnz-96144, tnz-96145) Jun 1, 2026

navinms711 force-pushed the develop branch from a941a34 to e0cc4ab Compare June 1, 2026 20:00

kart2bc approved these changes Jun 1, 2026

github-project-automation Bot moved this from Inbox to Pending Merge | Prioritized in Application Runtime Platform Working Group Jun 1, 2026

navinms711 and others added 2 commits June 1, 2026 20:23

navinms711 force-pushed the develop branch from e0cc4ab to e8b654e Compare June 1, 2026 20:23

kart2bc merged commit 0d00a4b into cloudfoundry:develop Jun 1, 2026
9 checks passed

github-project-automation Bot moved this from Pending Merge | Prioritized to Done in Application Runtime Platform Working Group Jun 1, 2026

navinms711 mentioned this pull request Jun 2, 2026

diego-ssh: update loggregator-down test for non-blocking dial (tnz-96… #1142

Merged

kart2bc merged 2 commits into cloudfoundry:develop from navinms711:develop
