Skip to content

NO-JIRA: Fix Flaky e2e tests - Update the Cleanup order#31347

Open
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:improve_test_steps_client_go
Open

NO-JIRA: Fix Flaky e2e tests - Update the Cleanup order#31347
YamunadeviShanmugam wants to merge 1 commit into
openshift:mainfrom
YamunadeviShanmugam:improve_test_steps_client_go

Conversation

@YamunadeviShanmugam

@YamunadeviShanmugam YamunadeviShanmugam commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Fix premature OAuth token deletion in CLI

OAuthAccessToken deleted before framework DeferCleanup runs

GetClientConfigForUser created an OAuthAccessToken and registered it in resourcesToDelete. This meant it was deleted during TeardownProject, which is registered as a g.AfterEach. The Kubernetes framework's namespace cleanup, however, runs via ginkgo.DeferCleanup (registered inside f.BeforeEach), and the documented execution order is:

It block
→ all AfterEach nodes (including TeardownProject) ← token deleted here
→ all DeferCleanups (LIFO) ← framework cleanup here
→ f.AfterEach (namespace deletion)
This meant the bearer token embedded in the returned *rest.Config became invalid before any DeferCleanup callbacks had a chance to run. Any deferred cleanup that tried to make authenticated API calls as that user would receive 401 Unauthorized.

Fix:

Remove the token from resourcesToDelete and instead set ExpiresIn: 6 hours. The associated User object is still explicitly deleted in TeardownProject, which is sufficient to revoke effective access. The ExpiresIn TTL ensures the token is eventually cleaned up by the OpenShift OAuth token controller without needing explicit deletion.

Latent OIDC test bug fixed as a consequence

test/extended/authentication/oidc.go calls GetClientConfigForUser in a g.BeforeAll and stores the result in oauthUserConfig. This config is then used across two separate ordered It blocks:

"should not accept tokens provided by the OAuth server" (expects Unauthorized)
"should accept tokens provided by the OpenShift OAuth server" (expects 200 OK)
With the old code, TeardownProject deleted the OAuthAccessToken after the first It block. The first test passed for the wrong reason (deleted token → Unauthorized, which matched the expectation, but for the wrong cause). The second test silently used an invalid bearer token. As the current fix keeps the token alive across all ordered It blocks as intended, so both tests validate the correct behavior.

Summary by CodeRabbit

  • Bug Fixes
    • Adjusted OAuth access token provisioning to use a 6-hour automatic expiration, reducing the risk of stale authentication artifacts.
    • Improved test cleanup by avoiding explicit token deletion during teardown, preventing cached clients from losing credentials (which could otherwise lead to authentication fallback and follow-on failures).

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Walkthrough

GetClientConfigForUser now creates an OAuth access token with a 6-hour expiry and no longer registers that token for explicit teardown cleanup.

Changes

OAuth token provisioning

Layer / File(s) Summary
Token TTL and cleanup
test/extended/util/client.go
GetClientConfigForUser sets ExpiresIn on the created OAuthAccessToken, removes the explicit resourcesToDelete registration for that token, and documents the cleanup failure mode.

Estimated code review effort: 2 (Simple) | ~10 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning FAIL: TeardownProject still never resets resourcesToDelete, so the shared CLI accumulates cleanup targets across tests; some assertions in OIDC also lack failure messages. Reset c.resourcesToDelete to nil at the end of TeardownProject, and ensure remaining Ginkgo assertions carry helpful failure messages.
✅ Passed checks (14 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR only changes cleanup/token TTL in util/client.go; no Ginkgo titles were added or modified, and related OIDC titles remain static literals.
Microshift Test Compatibility ✅ Passed No new or edited Ginkgo e2e tests were added; the PR only changes a test helper and an extension-binary registry.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Only shared cleanup/token logic changed in util/client.go; no new Ginkgo e2e tests or multi-node/SNO assumptions were added.
Topology-Aware Scheduling Compatibility ✅ Passed Only test/extended/util/client.go changed; it adds OAuth token cleanup tweaks and no scheduling, manifest, or controller changes.
Ote Binary Stdout Contract ✅ Passed Changes only adjust OAuth token cleanup/TTL in GetClientConfigForUser; no process-level stdout writes were added.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Only test helper cleanup/token TTL logic changed; no new Ginkgo tests, IPv4-only assumptions, or external connectivity were introduced.
No-Weak-Crypto ✅ Passed Touched code only uses crypto/sha256 for token hashing; no MD5/SHA1/DES/RC4/ECB or insecure token comparisons found.
Container-Privileges ✅ Passed PR only changes test/extended/util/client.go; no manifests or privilege fields (privileged, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation) were added.
No-Sensitive-Data-In-Logs ✅ Passed No new sensitive-data logging was introduced; existing client command logs redact bearer tokens, and the PR only changes token cleanup/comments.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: adjusting cleanup behavior to fix flaky e2e tests.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from deads2k and sjenning June 29, 2026 08:24
@gangwgr

gangwgr commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2026
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 29, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retest

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

The failure is unrelated to the PR - regression caused by openshift/cluster-version-operator#1395

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retest

Comment thread test/extended/util/client.go Outdated
}
c.AddResourceToDelete(oauthv1.GroupVersion.WithResource("oauthaccesstokens"), token)
// The token is intentionally not added to resourcesToDelete.
// TeardownProject runs as an AfterEach, which fires before the framework's

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By emptying the slice in c.resourcesToDelete = nil, we fix the multiple subsequent deletions. Why do we need this?. If we need this, why do we emptying the slices in c.resourcesToDelete = nil. I think this PR should focus on fixing the one issue at a time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we should keep c.AddResourceToDelete(oauthv1.GroupVersion.WithResource("oauthaccesstokens"), token) as is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ardaguclu : Yes you are correct, emptying the slice is not going to fix the actual issue, it is a performance improvement of not repeating the deletion in subsequent tests(no failures as it will return 404 and skips), adding this in the PR is deviating the context, will remove it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the tokens, as we are adding it to the deletion - it is getting deleted in TearDownProject which causes the failure in defercleanup where the token is required. This order is wrong.

Another approach is to delete the token in beforeeach, but the current design is simpler.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the documented justification, tests should always fail. Because when DeferCleanup is invoked, token should have already been deleted, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes Arda, it is completely misleading - fixed it with actual issue

@YamunadeviShanmugam YamunadeviShanmugam force-pushed the improve_test_steps_client_go branch from 74d2be9 to d06bacf Compare July 1, 2026 11:49
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/util/client.go`:
- Around line 1210-1216: The token cleanup currently skips deleting bearer
tokens for users that already existed, leaving valid tokens behind until TTL
expiry. Update GetClientConfigForUser in client.go so there is an additional
cleanup path for the token itself, using a detached timeout context built with
context.WithoutCancel and context.WithTimeout, while keeping the existing
user-deletion behavior intact. Make sure the cleanup runs late enough for
DeferCleanup to still authenticate, and reference the existing token/user
teardown logic around GetClientConfigForUser and resourcesToDelete when wiring
it in.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3361ae0e-fb69-454a-bd88-5263770edb99

📥 Commits

Reviewing files that changed from the base of the PR and between 74d2be9 and d06bacf.

📒 Files selected for processing (1)
  • test/extended/util/client.go

Comment thread test/extended/util/client.go Outdated
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the improve_test_steps_client_go branch from d06bacf to 14985db Compare July 1, 2026 12:12
Comment thread test/extended/util/client.go Outdated
UserUID: string(user.UID),
Scopes: []string{"user:full"},
RedirectURI: "https://localhost:8443/oauth/token/implicit",
ExpiresIn: 86400, // 24 h TTL; auto-expires without explicit deletion

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A job takes max 4 hours in average. Would it make sense setting this to 6h?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

Comment thread test/extended/util/client.go
  Improve the cleanup order for fixing flakiness
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the improve_test_steps_client_go branch from 14985db to 95670ed Compare July 1, 2026 12:20
@ardaguclu

Copy link
Copy Markdown
Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 1, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@sosiouxme

Copy link
Copy Markdown
Member

/approve

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ardaguclu, gangwgr, sosiouxme, YamunadeviShanmugam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2026
@sosiouxme

Copy link
Copy Markdown
Member

/retest-required

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

Waiting on csi pipeline - as it is consistently failing for consecutive 5th time across multiple PRs

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test e2e-gcp-ovn e2e-aws-csi

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retitle NO-JIRA: Fix Flaky e2e tests - Update the Cleanup order

@openshift-ci openshift-ci Bot changed the title Fix Flaky e2e tests - Update the Cleanup order NO-JIRA: Fix Flaky e2e tests - Update the Cleanup order Jul 3, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@YamunadeviShanmugam: This pull request explicitly references no jira issue.

Details

In response to this:

Fix premature OAuth token deletion in CLI

OAuthAccessToken deleted before framework DeferCleanup runs

GetClientConfigForUser created an OAuthAccessToken and registered it in resourcesToDelete. This meant it was deleted during TeardownProject, which is registered as a g.AfterEach. The Kubernetes framework's namespace cleanup, however, runs via ginkgo.DeferCleanup (registered inside f.BeforeEach), and the documented execution order is:

It block
→ all AfterEach nodes (including TeardownProject) ← token deleted here
→ all DeferCleanups (LIFO) ← framework cleanup here
→ f.AfterEach (namespace deletion)
This meant the bearer token embedded in the returned *rest.Config became invalid before any DeferCleanup callbacks had a chance to run. Any deferred cleanup that tried to make authenticated API calls as that user would receive 401 Unauthorized.

Fix:

Remove the token from resourcesToDelete and instead set ExpiresIn: 6 hours. The associated User object is still explicitly deleted in TeardownProject, which is sufficient to revoke effective access. The ExpiresIn TTL ensures the token is eventually cleaned up by the OpenShift OAuth token controller without needing explicit deletion.

Latent OIDC test bug fixed as a consequence

test/extended/authentication/oidc.go calls GetClientConfigForUser in a g.BeforeAll and stores the result in oauthUserConfig. This config is then used across two separate ordered It blocks:

"should not accept tokens provided by the OAuth server" (expects Unauthorized)
"should accept tokens provided by the OpenShift OAuth server" (expects 200 OK)
With the old code, TeardownProject deleted the OAuthAccessToken after the first It block. The first test passed for the wrong reason (deleted token → Unauthorized, which matched the expectation, but for the wrong cause). The second test silently used an invalid bearer token. As the current fix keeps the token alive across all ordered It blocks as intended, so both tests validate the correct behavior.

Summary by CodeRabbit

  • Bug Fixes
  • Adjusted OAuth access token provisioning to use a 6-hour automatic expiration, reducing the risk of stale authentication artifacts.
  • Improved test cleanup by avoiding explicit token deletion during teardown, preventing cached clients from losing credentials (which could otherwise lead to authentication fallback and follow-on failures).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 3, 2026
@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam: YamunadeviShanmugam unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers openshift-sustaining-engineers.

Details

In response to this:

/override ci/prow/e2e-aws-csi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/verified by CI runs for required tests

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jul 3, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@YamunadeviShanmugam: This PR has been marked as verified by CI runs for required tests.

Details

In response to this:

/verified by CI runs for required tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD f2c72ee and 2 for PR HEAD 95670ed in total

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-csi 95670ed link true /test e2e-aws-csi

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants