
Add retry logic for ROSA provision shard lookup and pin rosa-e2e region #80875

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from dustman9000:fix/provision-shard-retry
Jun 22, 2026

Conversation

@dustman9000

@dustman9000 dustman9000 commented Jun 22, 2026

Member

Summary

  • Adds retry with backoff (5 attempts, 30s delay) to the OSDFM provision shard API query in the rosa-cluster-provision step
  • On final failure, dumps the raw API response for debugging instead of a generic 'No available provision shard!' message
  • Pins REGION=us-west-2 for the rosa-e2e HCP smoke job to match the rosa-e2e sector location (root cause was the cluster profile lease landing in us-east-2)
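
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `MAX_SHARD_RETRIES` appears in the step's failure message, while `SHARD_RETRY_DELAY` and the `retry_lookup` helper are hypothetical names, and the lookup command passed in stands in for the real OSDFM query.

```shell
# Sketch only: retry a lookup command until it prints a non-empty result.
# MAX_SHARD_RETRIES is named in the step's log output; SHARD_RETRY_DELAY
# and retry_lookup are hypothetical names used for illustration.
MAX_SHARD_RETRIES="${MAX_SHARD_RETRIES:-5}"
SHARD_RETRY_DELAY="${SHARD_RETRY_DELAY:-30}"

# retry_lookup CMD [ARGS...]
# Prints the first non-empty result of CMD and returns 0,
# or returns 1 once all attempts are exhausted.
retry_lookup() {
  local attempt=1 result
  while [ "${attempt}" -le "${MAX_SHARD_RETRIES}" ]; do
    # A failing query is treated the same as an empty result.
    result="$("$@" || true)"
    if [ -n "${result}" ]; then
      printf '%s\n' "${result}"
      return 0
    fi
    echo "Attempt ${attempt}/${MAX_SHARD_RETRIES}: no provision shard found" >&2
    # Sleep only between attempts, not after the last one.
    if [ "${attempt}" -lt "${MAX_SHARD_RETRIES}" ]; then
      sleep "${SHARD_RETRY_DELAY}"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

With the defaults above, a lookup that never succeeds waits four times (about two minutes in total) before the caller reaches its failure branch and dumps the raw API response.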

Test plan

  • pj-rehearse validates the step registry change
  • Retest rosa-e2e HCP smoke job after merge

Summary by CodeRabbit

This PR improves the reliability and debuggability of OpenShift CI’s ROSA Hosted Control Plane (HCP) cluster provisioning by updating the rosa-cluster-provision step logic in the CI operator step registry.

What changed:

  • Updated ci-operator/step-registry/rosa/cluster/provision/rosa-cluster-provision-commands.sh so provision shard discovery against the OSDFM API now retries up to 5 times with a 30-second delay between attempts.
  • Each attempt searches for a matching service cluster shard for the configured sector and region, preferring ready status and falling back to maintenance if needed.
  • The existing downstream logic (e.g., dedicated topology shard selection and setting provision_shard_id) remains unchanged once a suitable shard is found.
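
The "prefer ready, fall back to maintenance" selection can be illustrated with a small sketch. It assumes a simplified input of whitespace-separated `shard_id status` lines; the real OSDFM response is JSON and the actual step parses it differently, so `pick_shard` and its input format are hypothetical.

```shell
# pick_shard reads "shard_id status" lines on stdin (an assumed, simplified
# format) and prints one shard id, preferring status "ready" and falling
# back to "maintenance". Returns 1 when neither status is present.
pick_shard() {
  local id status ready="" maintenance=""
  while read -r id status; do
    case "${status}" in
      # Keep the first shard seen for each status.
      ready)       ready="${ready:-${id}}" ;;
      maintenance) maintenance="${maintenance:-${id}}" ;;
    esac
  done
  if [ -n "${ready}" ]; then
    printf '%s\n' "${ready}"
  elif [ -n "${maintenance}" ]; then
    printf '%s\n' "${maintenance}"
  else
    return 1
  fi
}
```

An empty or unusable result from this selection is what would trigger another attempt in the retry loop.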

Failure handling improvements:

  • If no matching provision shard is found after all retry attempts are exhausted, the step now prints detailed debug output, including the raw OSDFM API response, instead of failing with a generic “No available provision shard!” style message.

Test/config impact:

  • Updated ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main.yaml by pinning REGION: us-west-2 for the e2e-rosa-hcp-smoke job to avoid region/sector mismatches (e.g., the rosa-e2e sector only existing in us-west-2).

Practical effect on CI:

  • Addresses transient OSDFM API behavior where empty/insufficient results can cause immediate provisioning failure—especially for CI runs using dedicated sectors such as CLUSTER_SECTOR=rosa-e2e.

Validation notes:

  • The PR includes guidance to validate the step registry change via pj-rehearse and recommends retesting the ROSA-e2e HCP smoke job after merge to confirm the transient failure is resolved.

The OSDFM API query for provision shards can transiently return empty
results, causing cluster provisioning to fail immediately. Add retry
with backoff (5 attempts, 30s delay) and dump the raw API response on
final failure for debugging.
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ebecaf1a-ebd8-41e0-ab7b-7d7f6277fc74

📥 Commits

Reviewing files that changed from the base of the PR and between e8b3038 and 93a09c1.

📒 Files selected for processing (1)
  • ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main.yaml

Walkthrough

In the ROSA cluster provisioning script's Hypershift path, the single-pass provision shard lookup is replaced with a retry loop of up to 5 attempts with a 30-second delay between each. On exhaustion, additional debug output including a direct OSDFM API query is printed before exiting with failure. The ROSA E2E smoke test is configured with a specific region (us-west-2) to exercise the updated provisioning logic.

Changes

ROSA Provision Shard and E2E Test Setup

Layer / File(s) Summary
Provision shard retry loop and debug fallback
ci-operator/step-registry/rosa/cluster/provision/rosa-cluster-provision-commands.sh
Replaces single-pass ready-then-maintenance shard query with a loop of up to 5 attempts (30s between retries); on exhaustion, prints a direct OSDFM API query for the sector/region and exits with failure.
ROSA E2E test region configuration
ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main.yaml
Adds REGION: us-west-2 environment variable to the e2e-rosa-hcp-smoke test workflow.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the two main changes: adding retry logic for ROSA provision shard lookup and pinning the rosa-e2e region.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies bash script and YAML config files, not Ginkgo tests. Check is not applicable as no Ginkgo test definitions (It/Describe/Context/When) are present.
Test Structure And Quality ✅ Passed Check is not applicable: PR modifies bash provisioning script and YAML configuration, not Ginkgo test code. Check requires Go test review.
Microshift Test Compatibility ✅ Passed PR modifies CI infrastructure (shell script and YAML config), not Ginkgo e2e tests. The custom check applies only to new Ginkgo test code, which is absent from this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR modifies provisioning scripts and CI config files only; it adds no new Ginkgo test definitions (It, Describe, Context, When, etc.), so SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only CI infrastructure (provision script and test config), not deployment manifests, operator code, or controllers. No Kubernetes scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed PR modifies only bash shell script and YAML config (CI infrastructure), not OTE binaries or process-level Go code; check does not apply.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. Changes are limited to a shell provisioning script and a CI configuration YAML file, neither of which contain Ginkgo test definitions (It(), Describe()...
No-Weak-Crypto ✅ Passed No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected in the PR changes.
Container-Privileges ✅ Passed PR contains no container/Kubernetes manifests with privileged settings (privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation, or runAsRoot). Changes are limited to a...
No-Sensitive-Data-In-Logs ✅ Passed The PR logs infrastructure identifiers and system metadata (provision_shard_ids, cluster statuses, regions, sector names) for debugging provision shard lookup failures. None expose passwords, token...


@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 22, 2026
@openshift-ci openshift-ci Bot requested review from joshbranham and jtaleric June 22, 2026 22:10

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/rosa/cluster/provision/rosa-cluster-provision-commands.sh`:
- Line 512: The ocm get command on the line containing service_clusters lookup
only filters by CLUSTER_SECTOR in the search parameter but omits the region
filter, causing unrelated shards to appear in debug output. Modify the search
parameter to include both the sector and region filters in the query so that the
lookup is properly scoped to the relevant region and sector combination.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ffc72fba-7961-4610-9c2a-aec169fce148

📥 Commits

Reviewing files that changed from the base of the PR and between f032d5f and e8b3038.

📒 Files selected for processing (1)
  • ci-operator/step-registry/rosa/cluster/provision/rosa-cluster-provision-commands.sh

echo "No available provision shard after ${MAX_SHARD_RETRIES} attempts!"
echo "Sector: ${CLUSTER_SECTOR}, Region: ${CLOUD_PROVIDER_REGION}"
echo "Debug: querying OSDFM API directly..."
ocm get /api/osd_fleet_mgmt/v1/service_clusters --parameter search="sector is '${CLUSTER_SECTOR}'" || true

Contributor


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Include region in the debug fallback query.

Line 512 drops the region filter, so the debug output can include unrelated shards and obscure the root cause for the failing lookup.

Suggested patch
-      ocm get /api/osd_fleet_mgmt/v1/service_clusters --parameter search="sector is '${CLUSTER_SECTOR}'" || true
+      ocm get /api/osd_fleet_mgmt/v1/service_clusters --parameter search="sector is '${CLUSTER_SECTOR}' and region is '${CLOUD_PROVIDER_REGION}'" || true
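
One way to keep the lookup and the debug fallback scoped identically is to build the search expression in a single place. The helper below is a sketch under that assumption; `build_shard_search` is a hypothetical name and is not part of the script under review.

```shell
# build_shard_search SECTOR REGION
# Prints an OSDFM search expression scoped to both sector and region,
# in the same form as the suggested patch above.
build_shard_search() {
  local sector="$1" region="$2"
  printf "sector is '%s' and region is '%s'\n" "${sector}" "${region}"
}

# Possible use in the debug fallback (the ocm invocation is taken from the patch):
#   ocm get /api/osd_fleet_mgmt/v1/service_clusters \
#     --parameter search="$(build_shard_search "${CLUSTER_SECTOR}" "${CLOUD_PROVIDER_REGION}")" || true
```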

@dustman9000

Member Author

/pj-rehearse pull-ci-openshift-online-rosa-e2e-main-e2e-rosa-hcp-smoke

@openshift-merge-bot

Contributor

@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The rosa-e2e sector only exists in us-west-2 but the cluster profile
lease can land in other regions, causing provision shard lookup failures
when the region does not match the sector.
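
The mismatch described in this commit message could also be caught before any shard lookup runs. The guard below is purely illustrative and is not what the PR does (the PR pins `REGION` in the job config instead); `check_sector_region` is a hypothetical helper, and its mapping encodes only the one fact stated here, that the rosa-e2e sector exists only in us-west-2.

```shell
# check_sector_region SECTOR REGION
# Returns non-zero when SECTOR is known to exist in a single region
# and REGION is a different one. Unknown sectors pass unchecked.
check_sector_region() {
  local sector="$1" region="$2" expected=""
  case "${sector}" in
    rosa-e2e) expected="us-west-2" ;;
  esac
  if [ -n "${expected}" ] && [ "${region}" != "${expected}" ]; then
    echo "Sector ${sector} exists only in ${expected}, but region is ${region}" >&2
    return 1
  fi
  return 0
}
```

A check like this fails immediately with a specific message, rather than spending the full retry budget on a lookup that cannot succeed.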
@dustman9000 dustman9000 changed the title Add retry logic for ROSA provision shard lookup Add retry logic for ROSA provision shard lookup and pin rosa-e2e region Jun 22, 2026
@openshift-merge-bot

Contributor

[REHEARSALNOTIFIER]
@dustman9000: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-cluster-density-v2-249nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-control-plane-120nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-control-plane-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-node-density-heavy-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-node-density-cni-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-conc-builds-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-control-plane-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-build-farm-114nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-cluster-density-v2-249nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-control-plane-120nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-control-plane-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-node-density-heavy-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-node-density-cni-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-conc-builds-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-control-plane-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-120nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-candidate-x86-loaded-upgrade-from-4.21-loaded-upgrade-249nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-cluster-density-v2-249nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-control-plane-120nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-control-plane-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-node-density-heavy-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed

A total of 301 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

@dustman9000: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/openshift-online/rosa-e2e/main/e2e-rosa-hcp-smoke e8b3038 link unknown /pj-rehearse pull-ci-openshift-online-rosa-e2e-main-e2e-rosa-hcp-smoke

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dustman9000

Member Author

/test rehearse-80875-pull-ci-openshift-online-rosa-e2e-main-e2e-rosa-hcp-smoke

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

@dustman9000: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test boskos-config
/test boskos-config-generation
/test check-gh-automation
/test check-gh-automation-tide
/test check-trigger-trusted-apps
/test ci-operator-config
/test ci-operator-config-metadata
/test ci-operator-registry
/test ci-secret-bootstrap-config-validation
/test ci-testgrid-allow-list
/test clusterimageset-validate
/test config
/test core-valid
/test generated-config
/test generated-dashboards
/test hyperfleet-risk-scorer-test
/test image-mirroring-config-validation
/test jira-lifecycle-config
/test labels
/test openshift-image-mirror-mappings
/test ordered-prow-config
/test owners
/test pr-reminder-config
/test prow-config
/test prow-config-filenames
/test prow-config-semantics
/test pylint
/test release-config
/test release-controller-config
/test rover-groups-config-validation
/test secret-generator-config-valid
/test services-valid
/test stackrox-stackrox-stackrox-stackrox-check
/test step-registry-metadata
/test step-registry-shellcheck
/test sync-rover-groups
/test verified-config
/test yamllint

The following commands are available to trigger optional jobs:

/test check-cluster-profiles-config

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-release-check-gh-automation
pull-ci-openshift-release-main-ci-operator-config
pull-ci-openshift-release-main-ci-operator-config-metadata
pull-ci-openshift-release-main-ci-operator-registry
pull-ci-openshift-release-main-core-valid
pull-ci-openshift-release-main-generated-config
pull-ci-openshift-release-main-ordered-prow-config
pull-ci-openshift-release-main-owners
pull-ci-openshift-release-main-prow-config-filenames
pull-ci-openshift-release-main-release-controller-config
pull-ci-openshift-release-main-step-registry-metadata
pull-ci-openshift-release-main-step-registry-shellcheck
pull-ci-openshift-release-openshift-image-mirror-mappings
pull-ci-openshift-release-yamllint

@bmeng

bmeng commented Jun 22, 2026

Contributor

/lgtm
/pj-rehearse ack

@openshift-merge-bot

Contributor

@bmeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 22, 2026
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2026
@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bmeng, dustman9000

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit b88ca26 into openshift:main Jun 22, 2026
16 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. rehearsals-ack Signifies that rehearsal jobs have been acknowledged


2 participants