feat(telemetry): alert on gateway controller reconcile-error ratio #218
Open
scotwells wants to merge 1 commit into
Adds two new PrometheusRule alerts and a matching runbook for the case where the gateway controller is stuck retrying rejected API server writes.

Prod context: since v0.23.4 shipped PR #217 (cert-listener withholding), controller_runtime_reconcile_total{controller="gateway",result="error"} has been ~1963 vs result="success" ~395, an 83% failure rate with no alert firing. The regression: withholding all listeners on a gateway produces a downstream Gateway with zero listeners, which the Gateway-API CRD rejects as "Required value". The controller hot-loops silently.

Alerts added to config/telemetry/alerts/gateways.yaml:
- GatewayControllerReconcileErrorRatioHigh (warning, >20% for 15m)
- GatewayControllerReconcileErrorRatioCritical (critical, >50% for 10m)

Expression uses sum without(result) to aggregate across result label dimensions so the ratio is computed correctly and the alert carries only the controller label (not result="error").

Also adds promtool unit tests (test/prometheus-rules/gateways/) and a runbook (docs/runbooks/gateway-controller-health.md) covering meaning, impact, diagnosis, and remediation for both tiers.

Closes #212 (alerting gap identified in that issue).
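For readers unfamiliar with the `sum without(result)` pattern, here is a minimal sketch of how the two tiers might look as a PrometheusRule. It is not the file from this PR: the resource name, group name, 5m rate window, and runbook_url value are assumptions made for illustration. Only the alert names, thresholds, `for:` durations, severities, and metric come from the description above.

```yaml
# Illustrative sketch only. metadata.name, the group name, the [5m] rate
# window, and the runbook_url value are assumptions, not the PR's contents.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gateway-controller-alerts          # hypothetical name
spec:
  groups:
    - name: gateway-controller.reconcile   # hypothetical group name
      rules:
        - alert: GatewayControllerReconcileErrorRatioHigh
          # Both sides drop the result label, so the division matches on the
          # remaining labels instead of pairing error series only with themselves.
          expr: |
            sum without (result) (
              rate(controller_runtime_reconcile_total{controller="gateway", result="error"}[5m])
            )
            /
            sum without (result) (
              rate(controller_runtime_reconcile_total{controller="gateway"}[5m])
            )
            > 0.20
          for: 15m
          labels:
            severity: warning
          annotations:
            runbook_url: docs/runbooks/gateway-controller-health.md   # placeholder path
        - alert: GatewayControllerReconcileErrorRatioCritical
          expr: |
            sum without (result) (
              rate(controller_runtime_reconcile_total{controller="gateway", result="error"}[5m])
            )
            /
            sum without (result) (
              rate(controller_runtime_reconcile_total{controller="gateway"}[5m])
            )
            > 0.50
          for: 10m
          labels:
            severity: critical
          annotations:
            runbook_url: docs/runbooks/gateway-controller-health.md   # placeholder path
```

Without the aggregation, the numerator would keep result="error" and match only the error series in the denominator, so the ratio would always evaluate to 1. With the production counts quoted above, 1963 / (1963 + 395) is roughly 0.83, well past both thresholds.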
Problem
When the gateway controller can't write a gateway's downstream config, customer changes silently stop reaching the edge: stale, often broken, configuration keeps serving and operators get no signal that anything is wrong. This happened in production right after v0.23.4: the controller ran at an ~83% reconcile failure rate for hours (≈1963 errors vs ≈395 successes) with nothing firing. Cert-listener withholding (#217) can leave a downstream Gateway with zero listeners, which the Gateway-API CRD rejects (spec.listeners: Required value), so the controller retries in a hot loop and the bad config stays programmed at the edge.

The control plane had no alert on reconcile health at all. The failure only surfaced indirectly, days later, through an edge-level Envoy listener-rejection alert, far from the actual cause.
Issue: #212 · PR: #217
What this adds
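Two alerts on the gateway controller's reconcile error ratio, so a controller that can't apply configuration pages operators directly instead of failing silently:
- GatewayControllerReconcileErrorRatioHigh (warning, error ratio >20% for 15m)
- GatewayControllerReconcileErrorRatioCritical (critical, error ratio >50% for 10m)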
Each alert links to a new runbook (docs/runbooks/gateway-controller-health.md) covering impact, diagnosis, and remediation. promtool unit tests cover firing at each tier, not firing below threshold, and not firing before the for: window elapses.

These would have fired shortly after v0.23.4 shipped, instead of surfacing days later from an edge symptom.
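The three test behaviours described above (firing, staying quiet below threshold, and staying quiet inside the for: window) could be exercised with a promtool test file along these lines. This is a sketch under assumptions, not the tests in test/prometheus-rules/gateways/: the rule file name, the synthetic series values, and the expected labels are illustrative. The timings assume the rule uses a short rate window such as 5m.

```yaml
# Illustrative promtool unit test. The file name, series values, and expected
# labels are assumptions. promtool reads plain rule groups (a top-level
# `groups:` key), not the PrometheusRule CRD wrapper.
rule_files:
  - gateways.rules.yaml   # hypothetical: spec.groups extracted from the PrometheusRule

evaluation_interval: 1m

tests:
  # Case 1: 60% of reconciles fail, which is above both thresholds.
  - interval: 1m
    input_series:
      - series: 'controller_runtime_reconcile_total{controller="gateway",result="error"}'
        values: '0+60x30'
      - series: 'controller_runtime_reconcile_total{controller="gateway",result="success"}'
        values: '0+40x30'
    alert_rule_test:
      # Still inside the 10m for: window, so the alert is pending, not firing.
      - eval_time: 5m
        alertname: GatewayControllerReconcileErrorRatioCritical
        exp_alerts: []
      # Past the for: window, so the critical tier is expected to fire.
      - eval_time: 20m
        alertname: GatewayControllerReconcileErrorRatioCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              controller: gateway
            # Add exp_annotations here mirroring the rule's real annotations;
            # promtool compares annotations as well as labels.

  # Case 2: 10% of reconciles fail, which is below the warning threshold.
  - interval: 1m
    input_series:
      - series: 'controller_runtime_reconcile_total{controller="gateway",result="error"}'
        values: '0+10x30'
      - series: 'controller_runtime_reconcile_total{controller="gateway",result="success"}'
        values: '0+90x30'
    alert_rule_test:
      - eval_time: 30m
        alertname: GatewayControllerReconcileErrorRatioHigh
        exp_alerts: []
      - eval_time: 30m
        alertname: GatewayControllerReconcileErrorRatioCritical
        exp_alerts: []
```

A file like this would be run with `promtool test rules <file>`. The expected labels have to match the firing alert's label set exactly, so they would need adjusting to whatever labels the real expression preserves.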
Alert expressions