Add retry logic for ROSA provision shard lookup and pin rosa-e2e region #80875
The OSDFM API query for provision shards can transiently return empty results, causing cluster provisioning to fail immediately. Add retry with backoff (5 attempts, 30s delay) and dump the raw API response on final failure for debugging.
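The retry described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `retry_shard_lookup` and the lookup command passed to it are hypothetical names, and the attempt count and delay are plain arguments (the PR uses 5 attempts and a 30-second delay).

```shell
# Sketch only: retry_shard_lookup is a hypothetical helper, not a function
# from rosa-cluster-provision-commands.sh.
#
# Usage: retry_shard_lookup MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it prints a non-empty result, then prints that result.
# Returns 1 once MAX_ATTEMPTS attempts have all produced empty output.
retry_shard_lookup() {
  local max_attempts="$1" delay="$2" attempt=1 result
  shift 2
  while [ "${attempt}" -le "${max_attempts}" ]; do
    # A failing lookup is treated the same as an empty answer.
    result="$("$@" 2>/dev/null || true)"
    if [ -n "${result}" ]; then
      printf '%s\n' "${result}"
      return 0
    fi
    echo "Attempt ${attempt}/${max_attempts}: no provision shard found" >&2
    # Sleep only between attempts, not after the last one.
    if [ "${attempt}" -lt "${max_attempts}" ]; then
      sleep "${delay}"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A caller would invoke it as `retry_shard_lookup 5 30 some_lookup_command` and, on a non-zero return, print the debug output before exiting.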
Walkthrough
In the ROSA cluster provisioning script's Hypershift path, the single-pass provision shard lookup is replaced with a retry loop of up to 5 attempts with a 30-second delay between each. On exhaustion, additional debug output, including a direct OSDFM API query, is printed before exiting with failure. The ROSA E2E smoke test is configured with a specific region (us-west-2) to exercise the updated provisioning logic.
Changes
ROSA Provision Shard and E2E Test Setup
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 15 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/rosa/cluster/provision/rosa-cluster-provision-commands.sh`:
- Line 512: The ocm get command on the line containing service_clusters lookup
only filters by CLUSTER_SECTOR in the search parameter but omits the region
filter, causing unrelated shards to appear in debug output. Modify the search
parameter to include both the sector and region filters in the query so that the
lookup is properly scoped to the relevant region and sector combination.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ffc72fba-7961-4610-9c2a-aec169fce148
📒 Files selected for processing (1)
ci-operator/step-registry/rosa/cluster/provision/rosa-cluster-provision-commands.sh
echo "No available provision shard after ${MAX_SHARD_RETRIES} attempts!"
echo "Sector: ${CLUSTER_SECTOR}, Region: ${CLOUD_PROVIDER_REGION}"
echo "Debug: querying OSDFM API directly..."
ocm get /api/osd_fleet_mgmt/v1/service_clusters --parameter search="sector is '${CLUSTER_SECTOR}'" || true
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Include region in the debug fallback query.
Line 512 drops the region filter, so the debug output can include unrelated shards and obscure the root cause for the failing lookup.
Suggested patch
- ocm get /api/osd_fleet_mgmt/v1/service_clusters --parameter search="sector is '${CLUSTER_SECTOR}'" || true
+ ocm get /api/osd_fleet_mgmt/v1/service_clusters --parameter search="sector is '${CLUSTER_SECTOR}' and region is '${CLOUD_PROVIDER_REGION}'" || true
/pj-rehearse pull-ci-openshift-online-rosa-e2e-main-e2e-rosa-hcp-smoke
@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
The rosa-e2e sector only exists in us-west-2 but the cluster profile lease can land in other regions, causing provision shard lookup failures when the region does not match the sector.
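The mismatch described above can be illustrated with a small fail-fast check. This is a hypothetical sketch, not part of the PR (which pins the region in the job configuration instead): `require_region_for_sector` is an invented helper, and the only mapping it knows is the `rosa-e2e` to `us-west-2` pairing stated in this conversation.

```shell
# Hypothetical guard, for illustration only.
# Usage: require_region_for_sector SECTOR REGION
# Returns 1 when the sector is known to exist in a single region and the
# given region differs; sectors without a known constraint always pass.
require_region_for_sector() {
  local sector="$1" region="$2" expected=""
  case "${sector}" in
    # Mapping taken from the PR discussion; no other sectors are assumed.
    rosa-e2e) expected="us-west-2" ;;
  esac
  if [ -n "${expected}" ] && [ "${region}" != "${expected}" ]; then
    echo "Sector ${sector} only exists in ${expected}, but the region is ${region}" >&2
    return 1
  fi
  return 0
}
```

With such a check, a lease that lands in another region would be reported as a sector/region mismatch instead of surfacing later as an empty provision shard lookup.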
[REHEARSALNOTIFIER]
A total of 301 jobs have been affected by this change.
@dustman9000: The following test failed, say
/test rehearse-80875-pull-ci-openshift-online-rosa-e2e-main-e2e-rosa-hcp-smoke
@dustman9000: The specified target(s) for The following commands are available to trigger optional jobs: Use DetailsIn response to this:
/lgtm
@bmeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bmeng, dustman9000
Summary by CodeRabbit
This PR improves the reliability and debuggability of OpenShift CI's ROSA Hosted Control Plane (HCP) cluster provisioning by updating the `rosa-cluster-provision` step logic in the CI operator step registry.
What changed:
- `ci-operator/step-registry/rosa/cluster/provision/rosa-cluster-provision-commands.sh` is updated so provision shard discovery against the OSDFM API now retries up to 5 times with a 30-second delay between attempts.
- Each attempt looks for a shard in `ready` status first, falling back to `maintenance` if needed.
- Handling of the selected shard (`provision_shard_id`) remains unchanged once a suitable shard is found.
Failure handling improvements:
- When all attempts are exhausted, the script prints additional debug output, including a direct OSDFM API query, before exiting with failure.
Test/config impact:
- `ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main.yaml` is updated by pinning `REGION: us-west-2` for the `e2e-rosa-hcp-smoke` job to avoid region/sector mismatches (e.g., the `rosa-e2e` sector only existing in `us-west-2`).
Practical effect on CI:
- Jobs running with `CLUSTER_SECTOR=rosa-e2e` should no longer fail the provision shard lookup because the cluster profile lease landed outside `us-west-2`.
Validation notes:
- The change was exercised with `pj-rehearse`; retesting the ROSA-e2e HCP smoke job after merge is recommended to confirm the transient failure is resolved.
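
The `ready`-then-`maintenance` preference mentioned in the summary can be sketched as below. This is an assumption-laden illustration: `pick_shard` is a hypothetical helper, and the whitespace-separated `ID STATUS` input is a simplification chosen for the example (the real OSDFM API response is not shown in this conversation).

```shell
# Hypothetical helper, for illustration only.
# Reads "ID STATUS" lines on stdin and prints one shard ID, preferring the
# first shard in status "ready" and falling back to the first one in
# status "maintenance". Exits non-zero when neither is present.
pick_shard() {
  awk '
    $2 == "ready" && ready == ""             { ready = $1 }
    $2 == "maintenance" && maintenance == "" { maintenance = $1 }
    END {
      if (ready != "") {
        print ready
      } else if (maintenance != "") {
        print maintenance
      } else {
        exit 1
      }
    }
  '
}
```

For example, `printf 'a maintenance\nb ready\n' | pick_shard` prints `b`, because a `ready` shard wins regardless of its position in the input.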