NO-JIRA: Fix Flaky e2e tests - Update the Cleanup order#31347
NO-JIRA: Fix Flaky e2e tests - Update the Cleanup order#31347YamunadeviShanmugam wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
Walkthrough
ChangesOAuth token provisioning
Estimated code review effort: 2 (Simple) | ~10 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/lgtm |
|
Scheduling required tests: |
|
/retest |
1 similar comment
|
/retest |
|
The failure is unrelated to the PR - regression caused by openshift/cluster-version-operator#1395 |
|
/retest |
| } | ||
| c.AddResourceToDelete(oauthv1.GroupVersion.WithResource("oauthaccesstokens"), token) | ||
| // The token is intentionally not added to resourcesToDelete. | ||
| // TeardownProject runs as an AfterEach, which fires before the framework's |
There was a problem hiding this comment.
By emptying the slice in c.resourcesToDelete = nil, we fix the multiple subsequent deletions. Why do we need this?. If we need this, why do we emptying the slices in c.resourcesToDelete = nil. I think this PR should focus on fixing the one issue at a time.
There was a problem hiding this comment.
In my opinion, we should keep c.AddResourceToDelete(oauthv1.GroupVersion.WithResource("oauthaccesstokens"), token) as is.
There was a problem hiding this comment.
@ardaguclu : Yes you are correct, emptying the slice is not going to fix the actual issue, it is a performance improvement of not repeating the deletion in subsequent tests(no failures as it will return 404 and skips), adding this in the PR is deviating the context, will remove it
There was a problem hiding this comment.
Regarding the tokens, as we are adding it to the deletion - it is getting deleted in TearDownProject which causes the failure in defercleanup where the token is required. This order is wrong.
Another approach is to delete the token in beforeeach, but the current design is simpler.
There was a problem hiding this comment.
According to the documented justification, tests should always fail. Because when DeferCleanup is invoked, token should have already been deleted, no?
There was a problem hiding this comment.
Yes Arda, it is completely misleading - fixed it with actual issue
74d2be9 to
d06bacf
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/util/client.go`:
- Around line 1210-1216: The token cleanup currently skips deleting bearer
tokens for users that already existed, leaving valid tokens behind until TTL
expiry. Update GetClientConfigForUser in client.go so there is an additional
cleanup path for the token itself, using a detached timeout context built with
context.WithoutCancel and context.WithTimeout, while keeping the existing
user-deletion behavior intact. Make sure the cleanup runs late enough for
DeferCleanup to still authenticate, and reference the existing token/user
teardown logic around GetClientConfigForUser and resourcesToDelete when wiring
it in.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3361ae0e-fb69-454a-bd88-5263770edb99
📒 Files selected for processing (1)
test/extended/util/client.go
d06bacf to
14985db
Compare
| UserUID: string(user.UID), | ||
| Scopes: []string{"user:full"}, | ||
| RedirectURI: "https://localhost:8443/oauth/token/implicit", | ||
| ExpiresIn: 86400, // 24 h TTL; auto-expires without explicit deletion |
There was a problem hiding this comment.
A job takes max 4 hours in average. Would it make sense setting this to 6h?
There was a problem hiding this comment.
Addressed.
Improve the cleanup order for fixing flakiness
14985db to
95670ed
Compare
|
/lgtm |
|
Scheduling required tests: |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ardaguclu, gangwgr, sosiouxme, YamunadeviShanmugam The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
|
Waiting on csi pipeline - as it is consistently failing for consecutive 5th time across multiple PRs |
|
/test e2e-gcp-ovn e2e-aws-csi |
|
/retitle NO-JIRA: Fix Flaky e2e tests - Update the Cleanup order |
|
@YamunadeviShanmugam: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@YamunadeviShanmugam: YamunadeviShanmugam unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers openshift-sustaining-engineers. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/verified by CI runs for required tests |
|
@YamunadeviShanmugam: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@YamunadeviShanmugam: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Fix premature OAuth token deletion in CLI
OAuthAccessToken deleted before framework DeferCleanup runs
GetClientConfigForUser created an OAuthAccessToken and registered it in resourcesToDelete. This meant it was deleted during TeardownProject, which is registered as a g.AfterEach. The Kubernetes framework's namespace cleanup, however, runs via ginkgo.DeferCleanup (registered inside f.BeforeEach), and the documented execution order is:
It block
→ all AfterEach nodes (including TeardownProject) ← token deleted here
→ all DeferCleanups (LIFO) ← framework cleanup here
→ f.AfterEach (namespace deletion)
This meant the bearer token embedded in the returned *rest.Config became invalid before any DeferCleanup callbacks had a chance to run. Any deferred cleanup that tried to make authenticated API calls as that user would receive 401 Unauthorized.
Fix:
Remove the token from resourcesToDelete and instead set ExpiresIn: 6 hours. The associated User object is still explicitly deleted in TeardownProject, which is sufficient to revoke effective access. The ExpiresIn TTL ensures the token is eventually cleaned up by the OpenShift OAuth token controller without needing explicit deletion.
Latent OIDC test bug fixed as a consequence
test/extended/authentication/oidc.go calls GetClientConfigForUser in a g.BeforeAll and stores the result in oauthUserConfig. This config is then used across two separate ordered It blocks:
"should not accept tokens provided by the OAuth server" (expects Unauthorized)
"should accept tokens provided by the OpenShift OAuth server" (expects 200 OK)
With the old code, TeardownProject deleted the OAuthAccessToken after the first It block. The first test passed for the wrong reason (deleted token → Unauthorized, which matched the expectation, but for the wrong cause). The second test silently used an invalid bearer token. As the current fix keeps the token alive across all ordered It blocks as intended, so both tests validate the correct behavior.
Summary by CodeRabbit