NO-JIRA: Resolve PDB readiness test flake#1418
Conversation
resolve PDB readiness test flake
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: YamunadeviShanmugam The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/retitle NO-JIRA: Resolve PDB readiness test flake |
|
@YamunadeviShanmugam: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughModifies the pdb_drain readiness test in test/cvo/readiness.go to run readiness.RunAll before listing PDBs, reducing timing drift between check execution and ground-truth verification. Removes a duplicate pre-listing assertion block; assertions now compare against expectations computed after the check runs. ChangesPDB Drain Test Reordering
Estimated code review effort: 1 (Trivial) | ~5 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/cvo/readiness.go (1)
187-214: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winThis reorder may widen the
pdb_drainwindow.readiness.RunAll()waits for every check to finish before returning, so the PDBList()here happens after the slowest check, not right afterpdb_draincaptures its snapshot. If the goal is to compare against the same cluster state, list PDBs beforeRunAll()or reuse the snapshot frompdb_drain.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/cvo/readiness.go` around lines 187 - 214, The PDB verification in readiness should use the same cluster snapshot as the pdb_drain check, because calling readiness.RunAll and then PolicyV1().PodDisruptionBudgets().List() can let cluster state drift before the comparison. Move the PDB list capture to before RunAll, or refactor pdb_drain to expose/reuse its snapshot so the assertions on total_pdbs and blocking_pdbs in test/cvo/readiness.go compare against the exact state that pdb_drain observed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@test/cvo/readiness.go`:
- Around line 187-214: The PDB verification in readiness should use the same
cluster snapshot as the pdb_drain check, because calling readiness.RunAll and
then PolicyV1().PodDisruptionBudgets().List() can let cluster state drift before
the comparison. Move the PDB list capture to before RunAll, or refactor
pdb_drain to expose/reuse its snapshot so the assertions on total_pdbs and
blocking_pdbs in test/cvo/readiness.go compare against the exact state that
pdb_drain observed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 876b56c4-af2f-4d9f-ae28-f9b05a339aa2
📒 Files selected for processing (1)
test/cvo/readiness.go
|
@YamunadeviShanmugam: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Issue
The PDB readiness test was intermittently failing with count mismatch errors (e.g., expecting 28 PDBs but finding 29). This flake was caused by a timing race condition: the test was fetching the "ground truth" list of PDBs from the Kubernetes API before executing the custom readiness.RunAll() check.
Because background controllers in a live cluster constantly create or delete pods and PDBs, this execution order left a wide time window where the cluster state could change between the two API queries, leading to mismatched counts.
Fix
The execution order was reversed to run the heavier readiness.RunAll() check first, followed immediately by querying the typed Kubernetes client for the ground truth snapshot. This change drastically shrinks the time window between the two data collection points to milliseconds, minimizing the chances of background cluster modifications causing a test flake.
Summary by CodeRabbit