NO-JIRA: Resolve PDB readiness test flake by YamunadeviShanmugam · Pull Request #1418 · openshift/cluster-version-operator

YamunadeviShanmugam · 2026-07-01T04:56:02Z

Issue
The PDB readiness test was intermittently failing with count mismatch errors (e.g., expecting 28 PDBs but finding 29). This flake was caused by a timing race condition: the test was fetching the "ground truth" list of PDBs from the Kubernetes API before executing the custom readiness.RunAll() check.

Because background controllers in a live cluster constantly create or delete pods and PDBs, this execution order left a wide time window where the cluster state could change between the two API queries, leading to mismatched counts.

❯ : [Jira:"Cluster Version Operator"] cluster-version-operator readiness checks should report PDB count matching actual PodDisruptionBudgets expand_less    1s                                                      
  {  fail [github.com/openshift/cluster-version-operator/test/cvo/readiness.go:207]: PDB count should match actual PDBs in cluster                                                                                  
  Expected                                                                                                                                                                                                          
      <int>: 29                                                                                                                                                                                                     
  to equal                                                                                                                                                                                                          
      <int>: 28} why this test is failing

Fix
The execution order was reversed to run the heavier readiness.RunAll() check first, followed immediately by querying the typed Kubernetes client for the ground truth snapshot. This change drastically shrinks the time window between the two data collection points to milliseconds, minimizing the chances of background cluster modifications causing a test flake.

Summary by CodeRabbit

Bug Fixes
- Improved the reliability of readiness checks by reducing timing-related drift during cluster validation.
- Made the validation flow more consistent so results better match the cluster’s current state.

resolve PDB readiness test flake

openshift-ci · 2026-07-01T04:56:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YamunadeviShanmugam
Once this PR has been reviewed and has the lgtm label, please assign davidhurta for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

YamunadeviShanmugam · 2026-07-01T04:58:00Z

/retitle NO-JIRA: Resolve PDB readiness test flake

openshift-ci-robot · 2026-07-01T04:58:06Z

@YamunadeviShanmugam: This pull request explicitly references no jira issue.

Details

In response to this:

Issue
The PDB readiness test was intermittently failing with count mismatch errors (e.g., expecting 28 PDBs but finding 29). This flake was caused by a timing race condition: the test was fetching the "ground truth" list of PDBs from the Kubernetes API before executing the custom readiness.RunAll() check.

Because background controllers in a live cluster constantly create or delete pods and PDBs, this execution order left a wide time window where the cluster state could change between the two API queries, leading to mismatched counts.
❯ : [Jira:"Cluster Version Operator"] cluster-version-operator readiness checks should report PDB count matching actual PodDisruptionBudgets expand_less    1s                                                      
 {  fail [github.com/openshift/cluster-version-operator/test/cvo/readiness.go:207]: PDB count should match actual PDBs in cluster                                                                                  
 Expected                                                                                                                                                                                                          
     <int>: 29                                                                                                                                                                                                     
 to equal                                                                                                                                                                                                          
     <int>: 28} why this test is failing    
Fix
The execution order was reversed to run the heavier readiness.RunAll() check first, followed immediately by querying the typed Kubernetes client for the ground truth snapshot. This change drastically shrinks the time window between the two data collection points to milliseconds, minimizing the chances of background cluster modifications causing a test flake.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-07-01T04:58:42Z

Walkthrough

Modifies the pdb_drain readiness test in test/cvo/readiness.go to run readiness.RunAll before listing PDBs, reducing timing drift between check execution and ground-truth verification. Removes a duplicate pre-listing assertion block; assertions now compare against expectations computed after the check runs.

Changes

PDB Drain Test Reordering

Layer / File(s)	Summary
Reorder check execution and ground-truth verification `test/cvo/readiness.go`	Runs `readiness.RunAll` and retrieves the `pdb_drain` result first, then lists PDBs afterward to compute `expectedTotal`/`expectedBlocking`; removes the earlier duplicate assertion block that ran before the PDB listing.

Estimated code review effort: 1 (Trivial) | ~5 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Microshift Test Compatibility	⚠️ Warning	New Ginkgo e2e tests call ClusterVersions, ClusterOperators, Networks, openshift-etcd, and PDB APIs without any MicroShift skip/tag.	Add [Skipped:MicroShift] or an apigroup tag, or guard with exutil.IsMicroShiftCluster(); otherwise verify on MicroShift CI.

✅ Passed checks (14 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: fixing a flaky PDB readiness test.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All Ginkgo titles in test/cvo/readiness.go are static strings; the PR only changes PDB check ordering and assertions, not test names.
Test Structure And Quality	✅ Passed	The pdb_drain It block remains single-purpose, uses the shared BeforeEach timeout/cleanup, has no cluster resource creation, and the PR only reorders existing reads.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	Only the PDB test’s query order changed; no new multi-node or HA assumptions were added, and the existing node-count check is SNO-safe.
Topology-Aware Scheduling Compatibility	✅ Passed	Only the PDB readiness test order changed; no manifests, controllers, replicas, affinity, or node selectors were introduced.
Ote Binary Stdout Contract	✅ Passed	The only change reorders reads inside an It block; no main/init/TestMain/BeforeSuite/RunSpecs stdout writes were added.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No IPv4 literals, IP parsing, host URL construction, or external connectivity were added; pdb_drain only reorders internal API reads.
No-Weak-Crypto	✅ Passed	Touched test code only reorders PDB checks; no MD5/SHA1/DES/RC4/3DES/ECB, custom crypto, or secret comparisons found.
Container-Privileges	✅ Passed	Only test/cvo/readiness.go changed; it reorders a PDB readiness check and contains no privileged container/K8s manifest settings.
No-Sensitive-Data-In-Logs	✅ Passed	The PR only reorders test API calls in test/cvo/readiness.go and adds comments; it introduces no logging or sensitive fields.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/cvo/readiness.go (1)
187-214: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This reorder may widen the pdb_drain window. readiness.RunAll() waits for every check to finish before returning, so the PDB List() here happens after the slowest check, not right after pdb_drain captures its snapshot. If the goal is to compare against the same cluster state, list PDBs before RunAll() or reuse the snapshot from pdb_drain.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/cvo/readiness.go` around lines 187 - 214, The PDB verification in
readiness should use the same cluster snapshot as the pdb_drain check, because
calling readiness.RunAll and then PolicyV1().PodDisruptionBudgets().List() can
let cluster state drift before the comparison. Move the PDB list capture to
before RunAll, or refactor pdb_drain to expose/reuse its snapshot so the
assertions on total_pdbs and blocking_pdbs in test/cvo/readiness.go compare
against the exact state that pdb_drain observed.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/cvo/readiness.go`:
- Around line 187-214: The PDB verification in readiness should use the same
cluster snapshot as the pdb_drain check, because calling readiness.RunAll and
then PolicyV1().PodDisruptionBudgets().List() can let cluster state drift before
the comparison. Move the PDB list capture to before RunAll, or refactor
pdb_drain to expose/reuse its snapshot so the assertions on total_pdbs and
blocking_pdbs in test/cvo/readiness.go compare against the exact state that
pdb_drain observed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 876b56c4-af2f-4d9f-ae28-f9b05a339aa2

📥 Commits

Reviewing files that changed from the base of the PR and between 9bd6fbf and e8bcd03.

📒 Files selected for processing (1)

test/cvo/readiness.go

openshift-ci · 2026-07-01T08:25:18Z

@YamunadeviShanmugam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-hypershift-conformance	`e8bcd03`	link	true	`/test e2e-hypershift-conformance`
ci/prow/e2e-aws-ovn-techpreview	`e8bcd03`	link	true	`/test e2e-aws-ovn-techpreview`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

fix(e2e): pdb readiness

e8bcd03

resolve PDB readiness test flake

openshift-ci Bot changed the title ~~fix(e2e): pdb readiness~~ NO-JIRA: Resolve PDB readiness test flake Jul 1, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 1, 2026

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

NO-JIRA: Resolve PDB readiness test flake#1418

NO-JIRA: Resolve PDB readiness test flake#1418
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix-e2e-pdb-count

YamunadeviShanmugam commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

YamunadeviShanmugam commented Jul 1, 2026

Uh oh!

openshift-ci-robot commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

YamunadeviShanmugam commented Jul 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

YamunadeviShanmugam commented Jul 1, 2026

Uh oh!

openshift-ci-robot commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

YamunadeviShanmugam commented Jul 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading