Skip to content

NO-JIRA: Resolve PDB readiness test flake#1418

Open
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix-e2e-pdb-count
Open

NO-JIRA: Resolve PDB readiness test flake#1418
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix-e2e-pdb-count

Conversation

@YamunadeviShanmugam

@YamunadeviShanmugam YamunadeviShanmugam commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Issue
The PDB readiness test was intermittently failing with count mismatch errors (e.g., expecting 28 PDBs but finding 29). This flake was caused by a timing race condition: the test was fetching the "ground truth" list of PDBs from the Kubernetes API before executing the custom readiness.RunAll() check.

Because background controllers in a live cluster constantly create or delete pods and PDBs, this execution order left a wide time window where the cluster state could change between the two API queries, leading to mismatched counts.

❯ : [Jira:"Cluster Version Operator"] cluster-version-operator readiness checks should report PDB count matching actual PodDisruptionBudgets expand_less    1s                                                      
  {  fail [github.com/openshift/cluster-version-operator/test/cvo/readiness.go:207]: PDB count should match actual PDBs in cluster                                                                                  
  Expected                                                                                                                                                                                                          
      <int>: 29                                                                                                                                                                                                     
  to equal                                                                                                                                                                                                          
      <int>: 28} why this test is failing    

Fix
The execution order was reversed to run the heavier readiness.RunAll() check first, followed immediately by querying the typed Kubernetes client for the ground truth snapshot. This change drastically shrinks the time window between the two data collection points to milliseconds, minimizing the chances of background cluster modifications causing a test flake.

Summary by CodeRabbit

  • Bug Fixes
    • Improved the reliability of readiness checks by reducing timing-related drift during cluster validation.
    • Made the validation flow more consistent so results better match the cluster’s current state.

    resolve PDB readiness test flake
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YamunadeviShanmugam
Once this PR has been reviewed and has the lgtm label, please assign davidhurta for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retitle NO-JIRA: Resolve PDB readiness test flake

@openshift-ci openshift-ci Bot changed the title fix(e2e): pdb readiness NO-JIRA: Resolve PDB readiness test flake Jul 1, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam: This pull request explicitly references no jira issue.

Details

In response to this:

Issue
The PDB readiness test was intermittently failing with count mismatch errors (e.g., expecting 28 PDBs but finding 29). This flake was caused by a timing race condition: the test was fetching the "ground truth" list of PDBs from the Kubernetes API before executing the custom readiness.RunAll() check.

Because background controllers in a live cluster constantly create or delete pods and PDBs, this execution order left a wide time window where the cluster state could change between the two API queries, leading to mismatched counts.

❯ : [Jira:"Cluster Version Operator"] cluster-version-operator readiness checks should report PDB count matching actual PodDisruptionBudgets expand_less    1s                                                      
 {  fail [github.com/openshift/cluster-version-operator/test/cvo/readiness.go:207]: PDB count should match actual PDBs in cluster                                                                                  
 Expected                                                                                                                                                                                                          
     <int>: 29                                                                                                                                                                                                     
 to equal                                                                                                                                                                                                          
     <int>: 28} why this test is failing    

Fix
The execution order was reversed to run the heavier readiness.RunAll() check first, followed immediately by querying the typed Kubernetes client for the ground truth snapshot. This change drastically shrinks the time window between the two data collection points to milliseconds, minimizing the chances of background cluster modifications causing a test flake.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Walkthrough

Modifies the pdb_drain readiness test in test/cvo/readiness.go to run readiness.RunAll before listing PDBs, reducing timing drift between check execution and ground-truth verification. Removes a duplicate pre-listing assertion block; assertions now compare against expectations computed after the check runs.

Changes

PDB Drain Test Reordering

Layer / File(s) Summary
Reorder check execution and ground-truth verification
test/cvo/readiness.go
Runs readiness.RunAll and retrieves the pdb_drain result first, then lists PDBs afterward to compute expectedTotal/expectedBlocking; removes the earlier duplicate assertion block that ran before the PDB listing.

Estimated code review effort: 1 (Trivial) | ~5 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Microshift Test Compatibility ⚠️ Warning New Ginkgo e2e tests call ClusterVersions, ClusterOperators, Networks, openshift-etcd, and PDB APIs without any MicroShift skip/tag. Add [Skipped:MicroShift] or an apigroup tag, or guard with exutil.IsMicroShiftCluster(); otherwise verify on MicroShift CI.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing a flaky PDB readiness test.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo titles in test/cvo/readiness.go are static strings; the PR only changes PDB check ordering and assertions, not test names.
Test Structure And Quality ✅ Passed The pdb_drain It block remains single-purpose, uses the shared BeforeEach timeout/cleanup, has no cluster resource creation, and the PR only reorders existing reads.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Only the PDB test’s query order changed; no new multi-node or HA assumptions were added, and the existing node-count check is SNO-safe.
Topology-Aware Scheduling Compatibility ✅ Passed Only the PDB readiness test order changed; no manifests, controllers, replicas, affinity, or node selectors were introduced.
Ote Binary Stdout Contract ✅ Passed The only change reorders reads inside an It block; no main/init/TestMain/BeforeSuite/RunSpecs stdout writes were added.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No IPv4 literals, IP parsing, host URL construction, or external connectivity were added; pdb_drain only reorders internal API reads.
No-Weak-Crypto ✅ Passed Touched test code only reorders PDB checks; no MD5/SHA1/DES/RC4/3DES/ECB, custom crypto, or secret comparisons found.
Container-Privileges ✅ Passed Only test/cvo/readiness.go changed; it reorders a PDB readiness check and contains no privileged container/K8s manifest settings.
No-Sensitive-Data-In-Logs ✅ Passed The PR only reorders test API calls in test/cvo/readiness.go and adds comments; it introduces no logging or sensitive fields.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/cvo/readiness.go (1)

187-214: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This reorder may widen the pdb_drain window. readiness.RunAll() waits for every check to finish before returning, so the PDB List() here happens after the slowest check, not right after pdb_drain captures its snapshot. If the goal is to compare against the same cluster state, list PDBs before RunAll() or reuse the snapshot from pdb_drain.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/cvo/readiness.go` around lines 187 - 214, The PDB verification in
readiness should use the same cluster snapshot as the pdb_drain check, because
calling readiness.RunAll and then PolicyV1().PodDisruptionBudgets().List() can
let cluster state drift before the comparison. Move the PDB list capture to
before RunAll, or refactor pdb_drain to expose/reuse its snapshot so the
assertions on total_pdbs and blocking_pdbs in test/cvo/readiness.go compare
against the exact state that pdb_drain observed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/cvo/readiness.go`:
- Around line 187-214: The PDB verification in readiness should use the same
cluster snapshot as the pdb_drain check, because calling readiness.RunAll and
then PolicyV1().PodDisruptionBudgets().List() can let cluster state drift before
the comparison. Move the PDB list capture to before RunAll, or refactor
pdb_drain to expose/reuse its snapshot so the assertions on total_pdbs and
blocking_pdbs in test/cvo/readiness.go compare against the exact state that
pdb_drain observed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 876b56c4-af2f-4d9f-ae28-f9b05a339aa2

📥 Commits

Reviewing files that changed from the base of the PR and between 9bd6fbf and e8bcd03.

📒 Files selected for processing (1)
  • test/cvo/readiness.go

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-hypershift-conformance e8bcd03 link true /test e2e-hypershift-conformance
ci/prow/e2e-aws-ovn-techpreview e8bcd03 link true /test e2e-aws-ovn-techpreview

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants