NO-ISSUE: clusters/app.ci/openshift-user-workload-monitoring/mixins: Consolidated NonKubeContainerWaiting alert by wking · Pull Request #80863 · openshift/release

wking · 2026-06-22T18:30:50Z

This replaces the earlier ImagePullBackOff alert from 09ce276 (#42896) and the releaseControllerContainerWaiting alert from 338a520 (#69641) with one alert based on KubeContainerWaiting, covering the namespaces that the core alert excludes. This consolidation makes it easier to notice and fix stuck containers, without waiting for people to notice the lack of functionality and reach out to poke workload maintainers about fixes. Recent examples include a Tide Pod stuck in CreateContainerError:

$ oc whoami -c
default/api-ci-l2s4-p1-openshiftapps-com:6443/wking
$ oc -n ci get pods | grep 'NAME\|tide'
NAME                                                                        READY   STATUS                        RESTARTS         AGE
tide-74dd668fdf-fxj7m                                                       1/2     CreateContainerError          2 (5h17m ago)    13d

due to PID exhaustion:

pod=tide-74dd668fdf-fxj7m pids=8095 limit=8096 proc=8068
sh: can't fork: Resource temporarily unavailable
command terminated with exit code 2

and some openstack-beta-* Pods stuck with mount errors:

$ oc -n ocp get --sort-by '{.metadata.creationTimestamp}' events | grep 'configmap references non-existent config key' | tail
2m54s       Warning   FailedMount                       pod/openstack-beta-4-21-559b848bfd-c6rvm                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.21-openstack-beta.repo
3m41s       Warning   FailedMount                       pod/openstack-beta-4-15-84bfb68b77-xh45g                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.15-openstack-beta.repo
...

after missing some cleanup in the wake of 960b9b6 (#80480).

By removing the two narrowly-scoped alerts and adding a broader alert that covers this whole problem class, we make it easier for workload admins to notice issues that neither of the narrowly-scoped alerts caught. Routing to the right workload admins becomes a bit harder, but hopefully generalists watching the alert for the cluster as a whole have a clear enough idea of who is running what to find the correct admins for the alerting Namespace and Pod.

Generated by manually editing the libsonnet files, and then regenerating the YAML file with:

$ go install github.com/brancz/gojsontoyaml@latest
$ (cd clusters/app.ci/openshift-user-workload-monitoring/mixins && make ci-alerts_prometheusrule.yaml)

Summary by CodeRabbit

This PR updates the OpenShift CI user workload monitoring (Prometheus alerting) to improve detection of containers stuck in Kubernetes waiting states.

What changed (practically):

Removed the narrowly scoped ImagePullBackOff alert from the monitoring rules (previously only triggered for the ImagePullBackOff waiting reason).
Removed the releaseControllerContainerWaiting alert rule group (previously targeted release-controller.* pods and excluded CrashLoopBackOff).
Added a new unified warning alert, NonKubeContainerWaiting, that fires for containers reported as waiting (via kube_pod_container_status_waiting_reason) for longer than 1 hour, limited to the kube-state-metrics job and excluding openshift-*, kube-*, and default namespaces.
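As a rough illustration of the scoping described in the last item, the rule entry might look something like the fragment below. This is a minimal sketch, not the merged rule: the metric name, the `kube-state-metrics` job, the excluded namespaces, the 1 hour duration, and the warning severity come from the summary above, while the aggregation labels are taken from the review comment further down and the annotations are left out entirely.

```yaml
# Hypothetical sketch of a single entry in a PrometheusRule `rules` list.
# The layout and the aggregation label set are assumptions; annotations are omitted.
- alert: NonKubeContainerWaiting
  expr: |
    sum by (namespace, pod, container, cluster) (
      kube_pod_container_status_waiting_reason{
        job="kube-state-metrics",
        namespace!~"(openshift-.*|kube-.*|default)"
      }
    ) > 0
  # Only fire once the container has been waiting for a full hour.
  for: 1h
  labels:
    severity: warning
```

The negative regular-expression matcher (`!~`) is what separates this alert from the core KubeContainerWaiting alert: it selects exactly the namespaces that the core alert leaves out.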

Why it matters:
The earlier alerts were too specific and could miss real stuck-container scenarios. The consolidated alert is broader and designed so workload administrators can find and address issues more easily using the alert’s namespace/pod/container context, without relying on separate, reason- or component-specific notifications.

Trade-off:
Alert routing specificity is reduced compared to the removed targeted alerts; the expectation is that cluster-level responders can determine the responsible teams from the alert labels (notably namespace and pod name).

Files updated:

clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml (regenerated to reflect the new/removed alert rules)

openshift-ci-robot · 2026-06-22T18:30:54Z

@wking: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-22T18:31:25Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 750d06fd-6672-4619-a2bb-049c2388b208

📥 Commits

Reviewing files that changed from the base of the PR and between 7254779 and 20e176d.

📒 Files selected for processing (3)

clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml

💤 Files with no reviewable changes (1)

clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet

Walkthrough

Two container-waiting alert rules are consolidated: ImagePullBackOff (critical, 10m) is removed from prow alerts and replaced with NonKubeContainerWaiting (warning, 1h, excluding system namespaces). The dedicated releaseControllerContainerWaiting alert group is deleted from release-controller alerts. Both changes are reflected in the generated YAML.

Changes

Container-Waiting Alert Consolidation

| Layer / File(s) | Summary |
| --- | --- |
| NonKubeContainerWaiting alert replacement `clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet`, `clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml` | Removes the critical `ImagePullBackOff` alert (10m, `sum_over_time`) and adds `NonKubeContainerWaiting` (warning, 1h, excludes `openshift-`/`kube-`/`default` namespaces, uses `summary`+`description` annotations with the waiting `reason` label). Generated YAML updated accordingly. |
| releaseControllerContainerWaiting alert group removal `clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet`, `clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml` | Deletes the `release-controller-container-waiting` alert group (warning, `team: crt`, 1h, excluding `CrashLoopBackOff`) from both the Jsonnet source and generated YAML; coverage is absorbed by the new broader alert. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 15

✅ Passed checks (15 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: consolidating container waiting alerts by replacing two specific alerts with a unified NonKubeContainerWaiting alert. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The PR modifies only Prometheus alert configuration files (Libsonnet and YAML) and contains no Ginkgo test files or test patterns, making this check not applicable. |
| Test Structure And Quality | ✅ Passed | Custom check for Ginkgo test quality is not applicable. PR modifies Prometheus alert rules (Libsonnet/YAML), not test code. |
| Microshift Test Compatibility | ✅ Passed | PR only modifies Prometheus alert configuration files (libsonnet/YAML), not Ginkgo e2e tests. The MicroShift compatibility check applies only to new e2e tests, which are not present in this PR. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies Prometheus alert configuration files (Libsonnet/YAML), not Ginkgo e2e tests. Check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only Prometheus alerting rules (PrometheusRule CRD), not deployment manifests, operators, or controllers. No scheduling constraints are introduced. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies only Prometheus alert configuration files (Libsonnet and YAML), containing no Go code, test code, or executable binaries. OTE stdout contract check does not apply to configuration files. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR modifies Prometheus alert rules (Libsonnet and YAML configs), not Ginkgo e2e tests. The IPv6/disconnected network check applies only to new e2e tests, which are not present in this PR. |
| No-Weak-Crypto | ✅ Passed | No weak cryptography, custom crypto implementations, or insecure secret comparisons found. PR contains only Prometheus alert configuration changes with no cryptographic operations. |
| Container-Privileges | ✅ Passed | The PR modifies only Prometheus alert rule configurations (libsonnet and generated YAML PrometheusRule resources). These are monitoring/alerting configurations, not container deployment or pod spec... |
| No-Sensitive-Data-In-Logs | ✅ Passed | PR adds NonKubeContainerWaiting alert annotations exposing only non-sensitive Kubernetes identifiers (pod, namespace, container names) and standard state reasons; no passwords, tokens, PII, or sens... |

openshift-ci · 2026-06-22T18:31:39Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign hector-vido for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

clusters/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet`:
- Around line 19-27: The alert 'NonKubeContainerWaiting' references the reason
label in its description annotation with {{ $labels.reason }}, but the reason
field is not included in the sum by aggregation clause within the expr field.
Add reason to the sum by clause alongside namespace, pod, container, and cluster
so that the reason label is preserved through the aggregation and the annotation
template can render the actual waiting reason when the alert fires.
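To make the finding concrete: in PromQL, `sum by (...)` keeps only the listed labels on its output series, so a label that is aggregated away is not available to annotation templates. The fragment below is a hedged sketch of the suggested fix, reconstructed from the comment above and the PR summary; the annotation wording is an assumption, and the actual libsonnet source may be laid out differently.

```yaml
# Illustrative fragment of one rule entry; only `expr` and `annotations` are shown.
# Before the fix, the aggregation was described as:
#   sum by (namespace, pod, container, cluster) (...)
# which drops `reason`, so {{ $labels.reason }} would render as an empty string.
expr: |
  sum by (namespace, pod, container, cluster, reason) (
    kube_pod_container_status_waiting_reason{
      job="kube-state-metrics",
      namespace!~"(openshift-.*|kube-.*|default)"
    }
  ) > 0
annotations:
  # Hypothetical wording; the point is that `reason` is now a label on the result.
  description: >-
    Container {{ $labels.container }} of pod {{ $labels.pod }} in namespace
    {{ $labels.namespace }} has been waiting ({{ $labels.reason }}) for
    longer than 1 hour.
```

One consequence of keeping `reason` in the aggregation is that a container whose waiting reason changes shows up as a new series, and the `for: 1h` timer restarts for that series.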

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ca414da0-6c75-4c24-b73c-9415c2225f69

📥 Commits

Reviewing files that changed from the base of the PR and between 8e61531 and 7254779.

📒 Files selected for processing (3)

clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml

💤 Files with no reviewable changes (1)

clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet

…ed NonKubeContainerWaiting alert

Replacing the earlier ImagePullBackOff from 09ce276 (Alert on ImagePullBackOff, 2023-08-31, openshift#42896) and releaseControllerContainerWaiting from 338a520 (Adding a releaseControllerContainerWaiting alert, 2025-09-24, openshift#69641) with one alert based on KubeContainerWaiting [1] to cover the namespaces that that core alert excludes. This consolidation makes it easier to notice and fix stuck containers, without waiting for people to notice the lack of functionality and reach out to poke workload maintainers about fixes. Recent examples include a Tide Pod stuck in CreateContainerError [2]:

$ oc whoami -c
default/api-ci-l2s4-p1-openshiftapps-com:6443/wking
$ oc -n ci get pods | grep 'NAME\|tide'
NAME                                                                        READY   STATUS                        RESTARTS         AGE
tide-74dd668fdf-fxj7m                                                       1/2     CreateContainerError          2 (5h17m ago)    13d

due to PID exhaustion:

pod=tide-74dd668fdf-fxj7m pids=8095 limit=8096 proc=8068
sh: can't fork: Resource temporarily unavailable
command terminated with exit code 2

and some openstack-beta-* Pods stuck with mount errors:

$ oc -n ocp get --sort-by '{.metadata.creationTimestamp}' events | grep 'configmap references non-existent config key' | tail
2m54s       Warning   FailedMount                       pod/openstack-beta-4-21-559b848bfd-c6rvm                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.21-openstack-beta.repo
3m41s       Warning   FailedMount                       pod/openstack-beta-4-15-84bfb68b77-xh45g                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.15-openstack-beta.repo
...

after missing some cleanup in the wake of 960b9b6 (Remove openstack-beta RPM mirror repos and services, 2026-06-12, openshift#80480).

By removing the previous two narrowly-scoped alerts, and adding a broader alert to cover this whole problem class, we make it easier for workload admins to notice the issues that previously missed both of the two narrowly-scoped alerts. It becomes a bit harder to route to the right workload admins, but hopefully generalists watching the alert in the cluster as a whole have a clear enough idea of who is running what on the cluster to be able to find the correct admins for the alerting Namespace and Pod.

Generated by manually editing the libsonnet files, and then regenerating the YAML file with:

$ go install github.com/brancz/gojsontoyaml@latest
$ (cd clusters/app.ci/openshift-user-workload-monitoring/mixins && make ci-alerts_prometheusrule.yaml)

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/9f41f998ef553ab4b6bbbca239d174543a126ede/assets/control-plane/prometheus-rule.yaml#L148-L156
[2]: https://redhat.atlassian.net/browse/DPTP-4987

openshift-merge-bot · 2026-06-22T21:03:39Z

[REHEARSALNOTIFIER]
@wking: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

openshift-ci · 2026-06-22T21:09:48Z

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

wking · 2026-06-22T23:33:01Z

Not rehearsable:

/pj-rehearse ack

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 22, 2026

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 22, 2026

openshift-ci Bot requested review from bear-redhat and danilo-gemoli June 22, 2026 18:31

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet

wking force-pushed the consolidate-container-waiting branch from 7254779 to 20e176d Compare June 22, 2026 21:03

wking wants to merge 1 commit into openshift:main from wking:consolidate-container-waiting
