NO-ISSUE: clusters/app.ci/openshift-user-workload-monitoring/mixins: Consolidated NonKubeContainerWaiting alert #80863

Open
wking wants to merge 1 commit into openshift:main from wking:consolidate-container-waiting

Conversation

@wking wking commented Jun 22, 2026

Member

This replaces the earlier ImagePullBackOff alert from 09ce276 (#42896) and the releaseControllerContainerWaiting alert from 338a520 (#69641) with a single alert, based on KubeContainerWaiting, that covers the namespaces the core alert excludes. The consolidation makes it easier to notice and fix stuck containers, without waiting for people to notice the missing functionality and reach out to workload maintainers about fixes. Recent examples include a Tide Pod stuck in CreateContainerError:

$ oc whoami -c
default/api-ci-l2s4-p1-openshiftapps-com:6443/wking
$ oc -n ci get pods | grep 'NAME\|tide'
NAME                                                                        READY   STATUS                        RESTARTS         AGE
tide-74dd668fdf-fxj7m                                                       1/2     CreateContainerError          2 (5h17m ago)    13d

due to PID exhaustion:

pod=tide-74dd668fdf-fxj7m pids=8095 limit=8096 proc=8068
sh: can't fork: Resource temporarily unavailable
command terminated with exit code 2

and some openstack-beta-* Pods stuck with mount errors:

$ oc -n ocp get --sort-by '{.metadata.creationTimestamp}' events | grep 'configmap references non-existent config key' | tail
2m54s       Warning   FailedMount                       pod/openstack-beta-4-21-559b848bfd-c6rvm                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.21-openstack-beta.repo
3m41s       Warning   FailedMount                       pod/openstack-beta-4-15-84bfb68b77-xh45g                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.15-openstack-beta.repo
...

after missing some cleanup in the wake of 960b9b6 (#80480).

By removing the two narrowly-scoped alerts and adding a broader alert that covers this whole problem class, we make it easier for workload admins to notice issues that neither of the narrow alerts caught. Routing to the right workload admins becomes a bit harder, but hopefully generalists watching alerts for the cluster as a whole have a clear enough idea of who is running what to find the correct admins for the alerting Namespace and Pod.

Generated by manually editing the libsonnet files, and then regenerating the YAML file with:

$ go install github.com/brancz/gojsontoyaml@latest
$ (cd clusters/app.ci/openshift-user-workload-monitoring/mixins && make ci-alerts_prometheusrule.yaml)

Summary by CodeRabbit

This PR updates the OpenShift CI user workload monitoring (Prometheus alerting) to improve detection of containers stuck in Kubernetes waiting states.

What changed (practically):

  • Removed the narrowly scoped ImagePullBackOff alert from the monitoring rules (previously only triggered for the ImagePullBackOff waiting reason).
  • Removed the releaseControllerContainerWaiting alert rule group (previously targeted release-controller.* pods and excluded CrashLoopBackOff).
  • Added a new unified warning alert, NonKubeContainerWaiting, that fires for containers reported as waiting (via kube_pod_container_status_waiting_reason) for longer than 1 hour, limited to the kube-state-metrics job and excluding openshift-*, kube-*, and default namespaces.
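
To make the namespace scoping concrete, here is a minimal Python sketch of which namespaces a matcher like this would cover. The pattern `(openshift-.*|kube-.*|default)` is an assumption inferred from the exclusions listed above, not copied from the PR, and the helper name is hypothetical. Prometheus anchors label regexes at both ends, which `re.fullmatch` imitates. The `ci` and `ocp` namespaces come from the PR description; the others are illustrative.

```python
import re

# Assumed pattern for the namespaces that the core KubeContainerWaiting alert
# already handles; the real matcher lives in prow_alerts.libsonnet.
CORE_NAMESPACES = re.compile(r"(openshift-.*|kube-.*|default)")


def covered_by_non_kube_alert(namespace: str) -> bool:
    """Return True if the namespace falls outside the core alert's scope.

    Prometheus label regexes are fully anchored, so fullmatch is used
    rather than search.
    """
    return CORE_NAMESPACES.fullmatch(namespace) is None


if __name__ == "__main__":
    for ns in ["ci", "ocp", "openshift-monitoring", "kube-system", "default"]:
        print(ns, covered_by_non_kube_alert(ns))
```

Under this assumed pattern, a namespace such as `default-apps` would still be covered by the new alert, because anchoring means `default` only matches the namespace named exactly `default`.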

Why it matters:
The earlier alerts were too specific and could miss real stuck-container scenarios. The consolidated alert is broader and designed so workload administrators can find and address issues more easily using the alert’s namespace/pod/container context—without relying on separate, reason- or component-specific notifications.

Trade-off:
Alert routing specificity is reduced compared to the removed targeted alerts; the expectation is that cluster-level responders can determine the responsible teams from the alert labels (notably namespace and pod name).

Files updated:

  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml (regenerated to reflect the new/removed alert rules)

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 22, 2026
@openshift-ci-robot

Contributor

@wking: This pull request explicitly references no jira issue.


@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 22, 2026

coderabbitai Bot commented Jun 22, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 750d06fd-6672-4619-a2bb-049c2388b208

📥 Commits

Reviewing files that changed from the base of the PR and between 7254779 and 20e176d.

📒 Files selected for processing (3)
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml
💤 Files with no reviewable changes (1)
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet

Walkthrough

Two container-waiting alert rules are consolidated: ImagePullBackOff (critical, 10m) is removed from prow alerts and replaced with NonKubeContainerWaiting (warning, 1h, excluding system namespaces). The dedicated releaseControllerContainerWaiting alert group is deleted from release-controller alerts. Both changes are reflected in the generated YAML.

Changes

Container-Waiting Alert Consolidation

Layer: NonKubeContainerWaiting alert replacement
File(s): clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet, clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml
Summary: Removes the critical ImagePullBackOff alert (10m, sum_over_time) and adds NonKubeContainerWaiting (warning, 1h, excludes openshift-*/kube-*/default namespaces, uses summary+description annotations with the waiting reason label). Generated YAML updated accordingly.

Layer: releaseControllerContainerWaiting alert group removal
File(s): clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet, clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml
Summary: Deletes the release-controller-container-waiting alert group (warning, team: crt, 1h, excluding CrashLoopBackOff) from both the Jsonnet source and generated YAML; coverage is absorbed by the new broader alert.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately summarizes the main change: consolidating container waiting alerts by replacing two specific alerts with a unified NonKubeContainerWaiting alert.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | The PR modifies only Prometheus alert configuration files (Libsonnet and YAML) and contains no Ginkgo test files or test patterns, making this check not applicable.
Test Structure And Quality | ✅ Passed | Custom check for Ginkgo test quality is not applicable. PR modifies Prometheus alert rules (Libsonnet/YAML), not test code.
Microshift Test Compatibility | ✅ Passed | PR only modifies Prometheus alert configuration files (libsonnet/YAML), not Ginkgo e2e tests. The MicroShift compatibility check applies only to new e2e tests, which are not present in this PR.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies Prometheus alert configuration files (Libsonnet/YAML), not Ginkgo e2e tests. Check is not applicable.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only Prometheus alerting rules (PrometheusRule CRD), not deployment manifests, operators, or controllers. No scheduling constraints are introduced.
Ote Binary Stdout Contract | ✅ Passed | PR modifies only Prometheus alert configuration files (Libsonnet and YAML), containing no Go code, test code, or executable binaries. OTE stdout contract check does not apply to configuration files.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR modifies Prometheus alert rules (Libsonnet and YAML configs), not Ginkgo e2e tests. The IPv6/disconnected network check applies only to new e2e tests, which are not present in this PR.
No-Weak-Crypto | ✅ Passed | No weak cryptography, custom crypto implementations, or insecure secret comparisons found. PR contains only Prometheus alert configuration changes with no cryptographic operations.
Container-Privileges | ✅ Passed | The PR modifies only Prometheus alert rule configurations (libsonnet and generated YAML PrometheusRule resources). These are monitoring/alerting configurations, not container deployment or pod spec...
No-Sensitive-Data-In-Logs | ✅ Passed | PR adds NonKubeContainerWaiting alert annotations exposing only non-sensitive Kubernetes identifiers (pod, namespace, container names) and standard state reasons; no passwords, tokens, PII, or sens...


openshift-ci Bot commented Jun 22, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign hector-vido for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet`:
- Around line 19-27: The alert 'NonKubeContainerWaiting' references the reason
label in its description annotation with {{ $labels.reason }}, but the reason
field is not included in the sum by aggregation clause within the expr field.
Add reason to the sum by clause alongside namespace, pod, container, and cluster
so that the reason label is preserved through the aggregation and the annotation
template can render the actual waiting reason when the alert fires.
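
The label-dropping behaviour behind this finding can be illustrated with a small Python sketch that mimics PromQL's `sum by (...)` aggregation. This is a simplified model, not Prometheus code: the `sum_by` helper and the `tide` container name are hypothetical, the `cluster` label is omitted for brevity, and the other label values are borrowed from the Tide example in the PR description.

```python
from collections import defaultdict

# Hypothetical sample of kube_pod_container_status_waiting_reason.
# The container name "tide" is an assumption made for illustration.
SAMPLES = [
    (
        {
            "namespace": "ci",
            "pod": "tide-74dd668fdf-fxj7m",
            "container": "tide",
            "reason": "CreateContainerError",
        },
        1.0,
    ),
]


def sum_by(samples, keep):
    """Mimic PromQL `sum by (<keep>)`: group on kept labels, drop the rest."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Only labels named in `keep` survive the aggregation.
        key = tuple((name, labels[name]) for name in keep if name in labels)
        totals[key] += value
    return [(dict(key), value) for key, value in totals.items()]


without_reason = sum_by(SAMPLES, ["namespace", "pod", "container"])
with_reason = sum_by(SAMPLES, ["namespace", "pod", "container", "reason"])

if __name__ == "__main__":
    print(without_reason)
    print(with_reason)
```

In this model, the first result has no `reason` label left for an annotation template to read, while the second keeps it, which is the effect the suggested change to the `sum by` clause is after.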

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ca414da0-6c75-4c24-b73c-9415c2225f69

📥 Commits

Reviewing files that changed from the base of the PR and between 8e61531 and 7254779.

📒 Files selected for processing (3)
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/prow_alerts.libsonnet
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/prometheus_out/ci-alerts_prometheusrule.yaml
💤 Files with no reviewable changes (1)
  • clusters/app.ci/openshift-user-workload-monitoring/mixins/_prometheus/release_controller_alerts.libsonnet

@wking wking force-pushed the consolidate-container-waiting branch from 7254779 to 20e176d June 22, 2026 21:03
…ed NonKubeContainerWaiting alert

Replacing the earlier ImagePullBackOff from 09ce276 (Alert on
ImagePullBackOff, 2023-08-31, openshift#42896) and
releaseControllerContainerWaiting from 338a520 (Adding a
releaseControllerContainerWaiting alert, 2025-09-24, openshift#69641) with one
alert based on KubeContainerWaiting [1] to cover the namespaces that
that core alert excludes.  This consolidation makes it easier to
notice and fix stuck containers, without waiting for people to notice
the lack of functionality and reach out to poke workload maintainers
about fixes.  Recent examples include a Tide Pod stuck in
CreateContainerError [2]:

  $ oc whoami -c
  default/api-ci-l2s4-p1-openshiftapps-com:6443/wking
  $ oc -n ci get pods | grep 'NAME\|tide'
  NAME                                                                        READY   STATUS                        RESTARTS         AGE
  tide-74dd668fdf-fxj7m                                                       1/2     CreateContainerError          2 (5h17m ago)    13d

due to PID exhaustion:

  pod=tide-74dd668fdf-fxj7m pids=8095 limit=8096 proc=8068
  sh: can't fork: Resource temporarily unavailable
  command terminated with exit code 2

and some openstack-beta-* Pods stuck with mount errors:

  $ oc -n ocp get --sort-by '{.metadata.creationTimestamp}' events | grep 'configmap references non-existent config key' | tail
  2m54s       Warning   FailedMount                       pod/openstack-beta-4-21-559b848bfd-c6rvm                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.21-openstack-beta.repo
  3m41s       Warning   FailedMount                       pod/openstack-beta-4-15-84bfb68b77-xh45g                     MountVolume.SetUp failed for volume "repos" : configmap references non-existent config key: ocp-4.15-openstack-beta.repo
  ...

after missing some cleanup in the wake of 960b9b6 (Remove
openstack-beta RPM mirror repos and services, 2026-06-12, openshift#80480).

By removing the previous two narrowly-scoped alerts, and adding a
broader alert to cover this whole problem class, we make it easier for
workload admins to notice the issues that previously missed both of
the two narrowly-scoped alerts.  It becomes a bit harder to route to
the right workload admins, but hopefully generalists watching the
alert in the cluster as a whole have a clear enough idea of who is
running what on the cluster to be able to find the correct admins for
the alerting Namespace and Pod.

Generated by manually editing the libsonnet files, and then
regenerating the YAML file with:

  $ go install github.com/brancz/gojsontoyaml@latest
  $ (cd clusters/app.ci/openshift-user-workload-monitoring/mixins && make ci-alerts_prometheusrule.yaml)

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/9f41f998ef553ab4b6bbbca239d174543a126ede/assets/control-plane/prometheus-rule.yaml#L148-L156
[2]: https://redhat.atlassian.net/browse/DPTP-4987
@openshift-merge-bot

Contributor

[REHEARSALNOTIFIER]
@wking: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.


openshift-ci Bot commented Jun 22, 2026

Contributor

@wking: all tests passed!


wking commented Jun 22, 2026

Member Author

Not rehearsable:

/pj-rehearse ack
