Added new controller to monitor infra to validate fd usage#1508
Added new controller to monitor infra to validate fd usage#1508vr4manta wants to merge 1 commit into
Conversation
WalkthroughThis PR introduces a new Kubernetes controller that validates VSphere Infrastructure failure domain configurations. It verifies that machines reference valid failure domains, reports validation status through ClusterOperator conditions, and prevents operator degradation when validation passes. The validator is integrated into the manager startup and the operator's reconciliation path. ChangesInfrastructure Failure Domain Validation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/vsphere/main.go (1)
167-170:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep the cluster-wide Machine watch out of the shared manager cache.
DefaultNamespacessays this binary should reconcile machine-api objects only in--namespace, but Line 177 overridesmachinev1.Machineto cache all namespaces. Because the Machine actuator/controller on Line 223 shares this manager, that widens Machine reconciliation cluster-wide, not just the new infrastructure validation controller. Please keep validation’s cluster-wide Machine watch on a dedicated cache/client instead of changing the manager-wide Machine cache.Also applies to: 177-181
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/vsphere/main.go` around lines 167 - 170, The manager-wide cache is being widened by overriding machinev1.Machine in opts.Cache.DefaultNamespaces; instead revert that change and keep opts.Cache.DefaultNamespaces as-is, and do not register machinev1.Machine in the shared manager cache. Create a dedicated cluster-scoped cache/client for Machine objects (using controller-runtime's cache.New or similar) and pass that dedicated client/cache into the infrastructure validation controller setup (the controller registered where the Machine actuator/controller is created around the code referenced at Line 223), while leaving the main mgr and mgr.GetClient scoped to the configured DefaultNamespaces. Ensure the validation controller uses the dedicated Machine client for cluster-wide watches and the rest of controllers continue to use the shared manager client/cache.
🧹 Nitpick comments (1)
pkg/controller/infrastructurevalidation/vsphere_validator.go (1)
80-123: ⚡ Quick winConsider checking context cancellation in the machine validation loop.
The loop iterates over all machines without checking
ctx.Err(). For clusters with many machines, this could delay cancellation.Suggested improvement
// Check each machine's failure domain reference via region/zone labels for i := range machineList.Items { + if ctx.Err() != nil { + return nil, ctx.Err() + } machine := &machineList.Items[i]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/infrastructurevalidation/vsphere_validator.go` around lines 80 - 123, Inside the machine validation loop over machineList.Items, check the provided context for cancellation (ctx.Err() or ctx.Done()) at the top of each iteration and stop processing if cancelled: return early from the enclosing validation function (propagating the context error) or break the loop so work halts promptly; locate the loop that references machineList.Items, machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and OrphanedReference and add the context check before any heavy work or appending to result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@install/0000_30_machine-api-operator_09_rbac.yaml`:
- Around line 291-300: Split the single RBAC rule into two rules under
apiGroups: config.openshift.io: one rule that lists resources: clusteroperators
with verbs get, list, watch (no update) to grant read access, and a second rule
that lists resource: clusteroperators/status with verbs update (and optionally
get/list/watch if the controller reads status) so that the update verb only
applies to the status subresource; locate the existing block referencing
"clusteroperators" and "clusteroperators/status" and replace it with these two
narrower rules.
In `@pkg/operator/sync.go`:
- Around line 93-103: The ClusterOperator lookup error from
optr.getClusterOperator() is currently ignored which lets validation be skipped;
modify the block in pkg/operator/sync.go so that if optr.getClusterOperator()
returns an error you append that error to the errors slice (or return early)
instead of silently continuing; keep the existing infraValidCondition check
(v1helpers.FindStatusCondition(...) and its use of
"InfrastructureFailureDomainsValid") but ensure any err from
optr.getClusterOperator() is propagated into errors so the failure-domain
validation gate is not bypassed.
---
Outside diff comments:
In `@cmd/vsphere/main.go`:
- Around line 167-170: The manager-wide cache is being widened by overriding
machinev1.Machine in opts.Cache.DefaultNamespaces; instead revert that change
and keep opts.Cache.DefaultNamespaces as-is, and do not register
machinev1.Machine in the shared manager cache. Create a dedicated cluster-scoped
cache/client for Machine objects (using controller-runtime's cache.New or
similar) and pass that dedicated client/cache into the infrastructure validation
controller setup (the controller registered where the Machine
actuator/controller is created around the code referenced at Line 223), while
leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.
---
Nitpick comments:
In `@pkg/controller/infrastructurevalidation/vsphere_validator.go`:
- Around line 80-123: Inside the machine validation loop over machineList.Items,
check the provided context for cancellation (ctx.Err() or ctx.Done()) at the top
of each iteration and stop processing if cancelled: return early from the
enclosing validation function (propagating the context error) or break the loop
so work halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d154da29-1729-4bc0-8c59-f08eac10bc7a
📒 Files selected for processing (7)
cmd/vsphere/main.goinstall/0000_30_machine-api-operator_09_rbac.yamlpkg/controller/infrastructurevalidation/infrastructurevalidation_controller.gopkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.gopkg/controller/infrastructurevalidation/vsphere_validator.gopkg/controller/infrastructurevalidation/vsphere_validator_test.gopkg/operator/sync.go
| - apiGroups: | ||
| - config.openshift.io | ||
| resources: | ||
| - clusteroperators | ||
| - clusteroperators/status | ||
| verbs: | ||
| - get | ||
| - list | ||
| - watch | ||
| - update |
There was a problem hiding this comment.
Split this rule so update only applies to clusteroperators/status.
The controller reads clusteroperators and writes through Status().Update, but this single rule also grants update on the parent clusteroperators resource because verbs apply to every resource in the list. That is broader cluster-level write access than the new controller needs.
🔐 Suggested RBAC split
- - apiGroups:
- - config.openshift.io
- resources:
- - clusteroperators
- - clusteroperators/status
- verbs:
- - get
- - list
- - watch
- - update
+ - apiGroups:
+ - config.openshift.io
+ resources:
+ - clusteroperators
+ verbs:
+ - get
+ - list
+ - watch
+ - apiGroups:
+ - config.openshift.io
+ resources:
+ - clusteroperators/status
+ verbs:
+ - updateAs per coding guidelines, "RBAC: least privilege; no cluster-admin for workloads."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - apiGroups: | |
| - config.openshift.io | |
| resources: | |
| - clusteroperators | |
| - clusteroperators/status | |
| verbs: | |
| - get | |
| - list | |
| - watch | |
| - update | |
| - apiGroups: | |
| - config.openshift.io | |
| resources: | |
| - clusteroperators | |
| verbs: | |
| - get | |
| - list | |
| - watch | |
| - apiGroups: | |
| - config.openshift.io | |
| resources: | |
| - clusteroperators/status | |
| verbs: | |
| - update |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@install/0000_30_machine-api-operator_09_rbac.yaml` around lines 291 - 300,
Split the single RBAC rule into two rules under apiGroups: config.openshift.io:
one rule that lists resources: clusteroperators with verbs get, list, watch (no
update) to grant read access, and a second rule that lists resource:
clusteroperators/status with verbs update (and optionally get/list/watch if the
controller reads status) so that the update verb only applies to the status
subresource; locate the existing block referencing "clusteroperators" and
"clusteroperators/status" and replace it with these two narrower rules.
Source: Coding guidelines
| // Check infrastructure validation for VSphere | ||
| if config.PlatformType == configv1.VSpherePlatformType { | ||
| co, err := optr.getClusterOperator() | ||
| if err == nil { | ||
| // Check if InfrastructureFailureDomainsValid condition is False | ||
| infraValidCondition := v1helpers.FindStatusCondition(co.Status.Conditions, "InfrastructureFailureDomainsValid") | ||
| if infraValidCondition != nil && infraValidCondition.Status == configv1.ConditionFalse { | ||
| errors = append(errors, fmt.Errorf("infrastructure validation failed: %s", infraValidCondition.Message)) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Don’t fail open when the ClusterOperator lookup errors.
If optr.getClusterOperator() fails here, the new validation gate is silently skipped and syncAll can continue into the later Available path. Please append that lookup failure to errors (or return early) so an unreadable ClusterOperator cannot disable the failure-domain validation gate.
🩹 Minimal fix
if config.PlatformType == configv1.VSpherePlatformType {
co, err := optr.getClusterOperator()
- if err == nil {
+ if err != nil {
+ errors = append(errors, fmt.Errorf("failed to read ClusterOperator for infrastructure validation: %w", err))
+ } else {
// Check if InfrastructureFailureDomainsValid condition is False
infraValidCondition := v1helpers.FindStatusCondition(co.Status.Conditions, "InfrastructureFailureDomainsValid")
if infraValidCondition != nil && infraValidCondition.Status == configv1.ConditionFalse {
errors = append(errors, fmt.Errorf("infrastructure validation failed: %s", infraValidCondition.Message))
}
}
}As per coding guidelines, "Never ignore error returns."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/operator/sync.go` around lines 93 - 103, The ClusterOperator lookup error
from optr.getClusterOperator() is currently ignored which lets validation be
skipped; modify the block in pkg/operator/sync.go so that if
optr.getClusterOperator() returns an error you append that error to the errors
slice (or return early) instead of silently continuing; keep the existing
infraValidCondition check (v1helpers.FindStatusCondition(...) and its use of
"InfrastructureFailureDomainsValid") but ensure any err from
optr.getClusterOperator() is propagated into errors so the failure-domain
validation gate is not bypassed.
Source: Coding guidelines
Changes
Summary by CodeRabbit
New Features
Tests
Chores