NO-ISSUE: clusters/app.ci/openshift-user-workload-monitoring/mixins: Consolidated NonKubeContainerWaiting alert #80863
@wking: This pull request explicitly references no jira issue.
No actionable comments were generated in the recent review.
Walkthrough: Two container-waiting alert rules are consolidated.
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 15 passed
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: wking
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet`:
- Around line 19-27: The alert 'NonKubeContainerWaiting' references the reason
label in its description annotation with {{ $labels.reason }}, but the reason
field is not included in the sum by aggregation clause within the expr field.
Add reason to the sum by clause alongside namespace, pod, container, and cluster
so that the reason label is preserved through the aggregation and the annotation
template can render the actual waiting reason when the alert fires.
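A minimal sketch of what the suggested fix could look like in the generated rule, assuming an expression modeled on `KubeContainerWaiting`. Only the alert name, the metric, the `kube-state-metrics` job, the excluded namespaces, the one-hour duration, the `sum by` labels, and the `{{ $labels.reason }}` reference come from this conversation. The exact namespace regular expression, the `severity` value, and the annotation wording are assumptions for illustration, not the contents of the PR.

```yaml
# Hypothetical rule fragment; not copied from ci-alerts_prometheusrule.yaml.
- alert: NonKubeContainerWaiting
  # "reason" is added to the aggregation so the label survives "sum by"
  # and is available to the annotation template below.
  expr: |
    sum by (namespace, pod, container, cluster, reason) (
      kube_pod_container_status_waiting_reason{job="kube-state-metrics", namespace!~"(openshift-.*|kube-.*|default)"}
    ) > 0
  for: 1h
  labels:
    severity: warning  # assumed value
  annotations:
    # Assumed wording; the point is that {{ $labels.reason }} now renders.
    description: 'Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been waiting ({{ $labels.reason }}) for longer than 1 hour.'
```

Without `reason` in the `by` clause, the aggregation drops that label and the template renders an empty string in its place. Since the YAML is generated, the change would presumably be made in the libsonnet source and the YAML regenerated.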
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ca414da0-6c75-4c24-b73c-9415c2225f69
📒 Files selected for processing (3)
- clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
- clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
- clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml
💤 Files with no reviewable changes (1)
- clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
7254779 to 20e176d
…ed NonKubeContainerWaiting alert

Replacing the earlier ImagePullBackOff from 09ce276 (Alert on ImagePullBackOff, 2023-08-31, openshift#42896) and releaseControllerContainerWaiting from 338a520 (Adding a releaseControllerContainerWaiting alert, 2025-09-24, openshift#69641) with one alert based on KubeContainerWaiting [1] to cover the namespaces that that core alert excludes. This consolidation makes it easier to notice and fix stuck containers, without waiting for people to notice the lack of functionality and reach out to poke workload maintainers about fixes. Recent examples include a Tide Pod stuck in CreateContainerError [2]:

  $ oc whoami -c
  default/api-ci-l2s4-p1-openshiftapps-com:6443/wking
  $ oc -n ci get pods | grep 'NAME\|tide'
  NAME                    READY   STATUS                 RESTARTS        AGE
  tide-74dd668fdf-fxj7m   1/2     CreateContainerError   2 (5h17m ago)   13d

due to PID exhaustion:

  pod=tide-74dd668fdf-fxj7m pids=8095 limit=8096 proc=8068
  sh: can't fork: Resource temporarily unavailable
  command terminated with exit code 2

and some openstack-beta-* Pods stuck with mount errors:

  $ oc -n ocp get --sort-by '{.metadata.creationTimestamp}' events | grep 'configmap references non-existent config key' | tail
  2m54s   Warning   FailedMount   pod/openstack-beta-4-21-559b848bfd-c6rvm   MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.21-openstack-beta.repo
  3m41s   Warning   FailedMount   pod/openstack-beta-4-15-84bfb68b77-xh45g   MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.15-openstack-beta.repo
  ...

after missing some cleanup in the wake of 960b9b6 (Remove openstack-beta RPM mirror repos and services, 2026-06-12, openshift#80480).

By removing the previous two narrowly-scoped alerts, and adding a broader alert to cover this whole problem class, we make it easier for workload admins to notice the issues that previously missed both of the two narrowly-scoped alerts. It becomes a bit harder to route to the right workload admins, but hopefully generalists watching the alert in the cluster as a whole have a clear enough idea of who is running what on the cluster to be able to find the correct admins for the alerting Namespace and Pod.

Generated by manually editing the libsonnet files, and then regenerating the YAML file with:

  $ go install github.com/brancz/gojsontoyaml@latest
  $ (cd clusters/app.ci/openshift-user-workload-monitoring/mixins && make ci-alerts_prometheusrule.yaml)

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/9f41f998ef553ab4b6bbbca239d174543a126ede/assets/control-plane/prometheus-rule.yaml#L148-L156
[2]: https://redhat.atlassian.net/browse/DPTP-4987
@wking: all tests passed!
/pj-rehearse ack
Replacing the earlier `ImagePullBackOff` from 09ce276 (#42896) and `releaseControllerContainerWaiting` from 338a520 (#69641) with one alert based on `KubeContainerWaiting` to cover the namespaces that that core alert excludes. This consolidation makes it easier to notice and fix stuck containers, without waiting for people to notice the lack of functionality and reach out to poke workload maintainers about fixes. Recent examples include a Tide Pod stuck in `CreateContainerError` due to PID exhaustion, and some `openstack-beta-*` Pods stuck with mount errors after missing some cleanup in the wake of 960b9b6 (#80480).
By removing the previous two narrowly-scoped alerts, and adding a broader alert to cover this whole problem class, we make it easier for workload admins to notice the issues that previously missed both of the two narrowly-scoped alerts. It becomes a bit harder to route to the right workload admins, but hopefully generalists watching the alert in the cluster as a whole have a clear enough idea of who is running what on the cluster to be able to find the correct admins for the alerting Namespace and Pod.
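Generated by manually editing the libsonnet files, and then regenerating the YAML file by installing `gojsontoyaml` with `go install github.com/brancz/gojsontoyaml@latest` and running `make ci-alerts_prometheusrule.yaml` in `clusters/app.ci/openshift-user-workload-monitoring/mixins`.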
Summary by CodeRabbit
This PR updates the OpenShift CI user workload monitoring (Prometheus alerting) to improve detection of containers stuck in Kubernetes waiting states.
What changed (practically):
- Removed the `ImagePullBackOff` alert from the monitoring rules (previously only triggered for the `ImagePullBackOff` waiting reason).
- Removed the `releaseControllerContainerWaiting` alert rule group (previously targeted `release-controller.*` pods and excluded `CrashLoopBackOff`).
- Added a new alert, `NonKubeContainerWaiting`, that fires for containers reported as waiting (via `kube_pod_container_status_waiting_reason`) for longer than 1 hour, limited to the `kube-state-metrics` job and excluding `openshift-*`, `kube-*`, and `default` namespaces.
Why it matters:
The earlier alerts were too specific and could miss real stuck-container scenarios. The consolidated alert is broader and designed so workload administrators can find and address issues more easily using the alert’s namespace/pod/container context—without relying on separate, reason- or component-specific notifications.
Trade-off:
Alert routing specificity is reduced compared to the removed targeted alerts; the expectation is that cluster-level responders can determine the responsible teams from the alert labels (notably namespace and pod name).
Files updated:
- clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
- clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
- clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml (regenerated to reflect the new/removed alert rules)