Skip to content

Added new controller to monitor infra to validate fd usage#1508

Draft
vr4manta wants to merge 1 commit into
openshift:mainfrom
vr4manta:infra-validation
Draft

Added new controller to monitor infra to validate fd usage#1508
vr4manta wants to merge 1 commit into
openshift:mainfrom
vr4manta:infra-validation

Conversation

@vr4manta

@vr4manta vr4manta commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Changes

  • Created new controller to watch for infrastructure changes and degrade operator if any vSphere machines are referencing Failure Domains that no longer exist.

Summary by CodeRabbit

  • New Features

    • Added infrastructure validation for VSphere platforms that detects when cluster machines reference invalid or missing failure domains.
    • Validation results are now reported via ClusterOperator status conditions, with detailed error messages identifying orphaned failure domain references.
    • Operator automatically degrades when VSphere failure domain validation fails.
  • Tests

    • Comprehensive test coverage for validation logic under various scenarios.
  • Chores

    • Extended RBAC permissions to allow access to ClusterOperator resources.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Walkthrough

This PR introduces a new Kubernetes controller that validates VSphere Infrastructure failure domain configurations. It verifies that machines reference valid failure domains, reports validation status through ClusterOperator conditions, and prevents operator degradation when validation passes. The validator is integrated into the manager startup and the operator's reconciliation path.

Changes

Infrastructure Failure Domain Validation

Layer / File(s) Summary
VSphere Failure Domain Validator
pkg/controller/infrastructurevalidation/vsphere_validator.go, pkg/controller/infrastructurevalidation/vsphere_validator_test.go
VSphereFailureDomainValidator constructs a set of configured failure domains from Infrastructure spec (keyed by region+zone), lists all Machines via client, reads region/zone labels from each Machine, and collects orphaned references when the labeled combination does not match any configured domain. Returns FailureDomainValidationResult with validity flag, orphaned reference list, and domain names. Tests cover valid references, orphaned references, missing domains, unlabeled machines, partial labels, no machines, and multiple machines per domain.
Infrastructure Validation Reconciler and Condition Management
pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go, pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go
ReconcileInfrastructureValidation controller watches Infrastructure and Machine resources. On reconcile, it fetches Infrastructure via uncached APIReader, skips non-VSphere platforms (clearing prior conditions), runs validation for VSphere, and updates ClusterOperator status with a custom InfrastructureFailureDomainsValid condition. Helper methods update conditions (setting validation result and toggling OperatorDegraded based on validity), clear conditions (removing only controller-specific state), and format orphaned-reference messages grouped by failure domain. Tests validate successful reconciliation, orphaned reference detection, degraded condition clearing, non-VSphere skip behavior, missing resources, and helper logic.
Manager Setup with Scheme and Cache Configuration
cmd/vsphere/main.go (imports, scheme init, cache config, controller registration)
Pre-creates a standalone runtime.Scheme with required API types (core, configv1, machinev1, ipamv1beta1), passes it to manager options, and extends cache watching via opts.Cache.ByObject to ensure Infrastructure and Machine resources are watched cluster-wide regardless of namespace scoping. Registers the infrastructure validation controller during setup.
RBAC Permissions and Sync Integration
install/0000_30_machine-api-operator_09_rbac.yaml, pkg/operator/sync.go
Grants ClusterRole permissions for get, list, watch, and update on config.openshift.io clusteroperators and clusteroperators/status resources. Integrates validation into operator reconciliation: syncAll now checks the InfrastructureFailureDomainsValid ClusterOperator condition for VSphere platforms and appends validation failures to the error list for downstream degraded handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.03% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Tests violate single responsibility; TestReconcileVSphereOrphanedReferences tests both validation and degraded condition behavior across 11+ assertions. Split multi-behavior tests into separate tests for validation condition and degraded condition behavior to ensure each test focuses on one specific behavior.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Added new controller to monitor infra to validate fd usage' accurately describes the main change: addition of a new infrastructure validation controller for VSphere failure domain usage.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR adds standard Go tests with stable, descriptive function names. No Ginkgo patterns are used, and all test names contain only static, deterministic strings with no dynamic values.
Microshift Test Compatibility ✅ Passed The PR adds only standard Go unit tests (using testing.T), not Ginkgo e2e tests. The check applies only to Ginkgo-style tests (It, Describe, Context, etc.), which are not present in this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds only standard Go unit tests (pkg/controller/*_test.go) with no Ginkgo e2e tests. No Ginkgo imports, Describe/It/Context patterns, or test/e2e/ changes detected.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds a new controller for infrastructure validation with no new deployment manifests or scheduling constraints that would impact non-HA topologies.
Ote Binary Stdout Contract ✅ Passed klog correctly configured to stderr via logtostderr=true before logging; no unintended stdout writes in main.go, controller files, or initialization code detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Tests are unit tests (not Ginkgo e2e tests) with no IPv4 hardcoded addresses, IPv4 assumptions, or external connectivity requirements.
No-Weak-Crypto ✅ Passed No weak cryptography patterns found. The PR contains no usage of MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto implementations, or non-constant-time secret comparisons.
Container-Privileges ✅ Passed No privileged container settings (privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation) found in the PR code or manifests.
No-Sensitive-Data-In-Logs ✅ Passed All logging statements use only non-sensitive infrastructure metadata (machine names, regions, zones, condition statuses). No passwords, API keys, tokens, PII, or credentials are logged.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign damdo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/vsphere/main.go (1)

167-170: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the cluster-wide Machine watch out of the shared manager cache.

DefaultNamespaces says this binary should reconcile machine-api objects only in --namespace, but Line 177 overrides machinev1.Machine to cache all namespaces. Because the Machine actuator/controller on Line 223 shares this manager, that widens Machine reconciliation cluster-wide, not just the new infrastructure validation controller. Please keep validation’s cluster-wide Machine watch on a dedicated cache/client instead of changing the manager-wide Machine cache.

Also applies to: 177-181

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/vsphere/main.go` around lines 167 - 170, The manager-wide cache is being
widened by overriding machinev1.Machine in opts.Cache.DefaultNamespaces; instead
revert that change and keep opts.Cache.DefaultNamespaces as-is, and do not
register machinev1.Machine in the shared manager cache. Create a dedicated
cluster-scoped cache/client for Machine objects (using controller-runtime's
cache.New or similar) and pass that dedicated client/cache into the
infrastructure validation controller setup (the controller registered where the
Machine actuator/controller is created around the code referenced at Line 223),
while leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.
🧹 Nitpick comments (1)
pkg/controller/infrastructurevalidation/vsphere_validator.go (1)

80-123: ⚡ Quick win

Consider checking context cancellation in the machine validation loop.

The loop iterates over all machines without checking ctx.Err(). For clusters with many machines, this could delay cancellation.

Suggested improvement
 // Check each machine's failure domain reference via region/zone labels
 for i := range machineList.Items {
+	if ctx.Err() != nil {
+		return nil, ctx.Err()
+	}
 	machine := &machineList.Items[i]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/infrastructurevalidation/vsphere_validator.go` around lines 80
- 123, Inside the machine validation loop over machineList.Items, check the
provided context for cancellation (ctx.Err() or ctx.Done()) at the top of each
iteration and stop processing if cancelled: return early from the enclosing
validation function (propagating the context error) or break the loop so work
halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@install/0000_30_machine-api-operator_09_rbac.yaml`:
- Around line 291-300: Split the single RBAC rule into two rules under
apiGroups: config.openshift.io: one rule that lists resources: clusteroperators
with verbs get, list, watch (no update) to grant read access, and a second rule
that lists resource: clusteroperators/status with verbs update (and optionally
get/list/watch if the controller reads status) so that the update verb only
applies to the status subresource; locate the existing block referencing
"clusteroperators" and "clusteroperators/status" and replace it with these two
narrower rules.

In `@pkg/operator/sync.go`:
- Around line 93-103: The ClusterOperator lookup error from
optr.getClusterOperator() is currently ignored which lets validation be skipped;
modify the block in pkg/operator/sync.go so that if optr.getClusterOperator()
returns an error you append that error to the errors slice (or return early)
instead of silently continuing; keep the existing infraValidCondition check
(v1helpers.FindStatusCondition(...) and its use of
"InfrastructureFailureDomainsValid") but ensure any err from
optr.getClusterOperator() is propagated into errors so the failure-domain
validation gate is not bypassed.

---

Outside diff comments:
In `@cmd/vsphere/main.go`:
- Around line 167-170: The manager-wide cache is being widened by overriding
machinev1.Machine in opts.Cache.DefaultNamespaces; instead revert that change
and keep opts.Cache.DefaultNamespaces as-is, and do not register
machinev1.Machine in the shared manager cache. Create a dedicated cluster-scoped
cache/client for Machine objects (using controller-runtime's cache.New or
similar) and pass that dedicated client/cache into the infrastructure validation
controller setup (the controller registered where the Machine
actuator/controller is created around the code referenced at Line 223), while
leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.

---

Nitpick comments:
In `@pkg/controller/infrastructurevalidation/vsphere_validator.go`:
- Around line 80-123: Inside the machine validation loop over machineList.Items,
check the provided context for cancellation (ctx.Err() or ctx.Done()) at the top
of each iteration and stop processing if cancelled: return early from the
enclosing validation function (propagating the context error) or break the loop
so work halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d154da29-1729-4bc0-8c59-f08eac10bc7a

📥 Commits

Reviewing files that changed from the base of the PR and between d7772c6 and 50627e1.

📒 Files selected for processing (7)
  • cmd/vsphere/main.go
  • install/0000_30_machine-api-operator_09_rbac.yaml
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go
  • pkg/controller/infrastructurevalidation/vsphere_validator.go
  • pkg/controller/infrastructurevalidation/vsphere_validator_test.go
  • pkg/operator/sync.go

Comment on lines +291 to +300
- apiGroups:
- config.openshift.io
resources:
- clusteroperators
- clusteroperators/status
verbs:
- get
- list
- watch
- update

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Split this rule so update only applies to clusteroperators/status.

The controller reads clusteroperators and writes through Status().Update, but this single rule also grants update on the parent clusteroperators resource because verbs apply to every resource in the list. That is broader cluster-level write access than the new controller needs.

🔐 Suggested RBAC split
-  - apiGroups:
-      - config.openshift.io
-    resources:
-      - clusteroperators
-      - clusteroperators/status
-    verbs:
-      - get
-      - list
-      - watch
-      - update
+  - apiGroups:
+      - config.openshift.io
+    resources:
+      - clusteroperators
+    verbs:
+      - get
+      - list
+      - watch
+  - apiGroups:
+      - config.openshift.io
+    resources:
+      - clusteroperators/status
+    verbs:
+      - update

As per coding guidelines, "RBAC: least privilege; no cluster-admin for workloads."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- apiGroups:
- config.openshift.io
resources:
- clusteroperators
- clusteroperators/status
verbs:
- get
- list
- watch
- update
- apiGroups:
- config.openshift.io
resources:
- clusteroperators
verbs:
- get
- list
- watch
- apiGroups:
- config.openshift.io
resources:
- clusteroperators/status
verbs:
- update
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@install/0000_30_machine-api-operator_09_rbac.yaml` around lines 291 - 300,
Split the single RBAC rule into two rules under apiGroups: config.openshift.io:
one rule that lists resources: clusteroperators with verbs get, list, watch (no
update) to grant read access, and a second rule that lists resource:
clusteroperators/status with verbs update (and optionally get/list/watch if the
controller reads status) so that the update verb only applies to the status
subresource; locate the existing block referencing "clusteroperators" and
"clusteroperators/status" and replace it with these two narrower rules.

Source: Coding guidelines

Comment thread pkg/operator/sync.go
Comment on lines +93 to +103
// Check infrastructure validation for VSphere
if config.PlatformType == configv1.VSpherePlatformType {
co, err := optr.getClusterOperator()
if err == nil {
// Check if InfrastructureFailureDomainsValid condition is False
infraValidCondition := v1helpers.FindStatusCondition(co.Status.Conditions, "InfrastructureFailureDomainsValid")
if infraValidCondition != nil && infraValidCondition.Status == configv1.ConditionFalse {
errors = append(errors, fmt.Errorf("infrastructure validation failed: %s", infraValidCondition.Message))
}
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t fail open when the ClusterOperator lookup errors.

If optr.getClusterOperator() fails here, the new validation gate is silently skipped and syncAll can continue into the later Available path. Please append that lookup failure to errors (or return early) so an unreadable ClusterOperator cannot disable the failure-domain validation gate.

🩹 Minimal fix
 	if config.PlatformType == configv1.VSpherePlatformType {
 		co, err := optr.getClusterOperator()
-		if err == nil {
+		if err != nil {
+			errors = append(errors, fmt.Errorf("failed to read ClusterOperator for infrastructure validation: %w", err))
+		} else {
 			// Check if InfrastructureFailureDomainsValid condition is False
 			infraValidCondition := v1helpers.FindStatusCondition(co.Status.Conditions, "InfrastructureFailureDomainsValid")
 			if infraValidCondition != nil && infraValidCondition.Status == configv1.ConditionFalse {
 				errors = append(errors, fmt.Errorf("infrastructure validation failed: %s", infraValidCondition.Message))
 			}
 		}
 	}

As per coding guidelines, "Never ignore error returns."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/operator/sync.go` around lines 93 - 103, The ClusterOperator lookup error
from optr.getClusterOperator() is currently ignored which lets validation be
skipped; modify the block in pkg/operator/sync.go so that if
optr.getClusterOperator() returns an error you append that error to the errors
slice (or return early) instead of silently continuing; keep the existing
infraValidCondition check (v1helpers.FindStatusCondition(...) and its use of
"InfrastructureFailureDomainsValid") but ensure any err from
optr.getClusterOperator() is propagated into errors so the failure-domain
validation gate is not bypassed.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant