Added new controller to monitor infra to validate fd usage by vr4manta · Pull Request #1508 · openshift/machine-api-operator

vr4manta · 2026-06-09T15:43:54Z

Changes

Created new controller to watch for infrastructure changes and degrade operator if any vSphere machines are referencing Failure Domains that no longer exist.

Summary by CodeRabbit

New Features
- Added infrastructure validation for VSphere platforms that detects when cluster machines reference invalid or missing failure domains.
- Validation results are now reported via ClusterOperator status conditions, with detailed error messages identifying orphaned failure domain references.
- Operator automatically degrades when VSphere failure domain validation fails.
Tests
- Comprehensive test coverage for validation logic under various scenarios.
Chores
- Extended RBAC permissions to allow access to ClusterOperator resources.

coderabbitai · 2026-06-09T15:44:08Z

Walkthrough

This PR introduces a new Kubernetes controller that validates VSphere Infrastructure failure domain configurations. It verifies that machines reference valid failure domains, reports validation status through ClusterOperator conditions, and prevents operator degradation when validation passes. The validator is integrated into the manager startup and the operator's reconciliation path.

Changes

Infrastructure Failure Domain Validation

Layer / File(s)	Summary
VSphere Failure Domain Validator `pkg/controller/infrastructurevalidation/vsphere_validator.go`, `pkg/controller/infrastructurevalidation/vsphere_validator_test.go`	`VSphereFailureDomainValidator` constructs a set of configured failure domains from Infrastructure spec (keyed by region+zone), lists all Machines via client, reads region/zone labels from each Machine, and collects orphaned references when the labeled combination does not match any configured domain. Returns `FailureDomainValidationResult` with validity flag, orphaned reference list, and domain names. Tests cover valid references, orphaned references, missing domains, unlabeled machines, partial labels, no machines, and multiple machines per domain.
Infrastructure Validation Reconciler and Condition Management `pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go`, `pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go`	`ReconcileInfrastructureValidation` controller watches `Infrastructure` and `Machine` resources. On reconcile, it fetches Infrastructure via uncached APIReader, skips non-VSphere platforms (clearing prior conditions), runs validation for VSphere, and updates ClusterOperator status with a custom `InfrastructureFailureDomainsValid` condition. Helper methods update conditions (setting validation result and toggling OperatorDegraded based on validity), clear conditions (removing only controller-specific state), and format orphaned-reference messages grouped by failure domain. Tests validate successful reconciliation, orphaned reference detection, degraded condition clearing, non-VSphere skip behavior, missing resources, and helper logic.
Manager Setup with Scheme and Cache Configuration `cmd/vsphere/main.go` (imports, scheme init, cache config, controller registration)	Pre-creates a standalone `runtime.Scheme` with required API types (core, configv1, machinev1, ipamv1beta1), passes it to manager options, and extends cache watching via `opts.Cache.ByObject` to ensure Infrastructure and Machine resources are watched cluster-wide regardless of namespace scoping. Registers the infrastructure validation controller during setup.
RBAC Permissions and Sync Integration `install/0000_30_machine-api-operator_09_rbac.yaml`, `pkg/operator/sync.go`	Grants ClusterRole permissions for `get`, `list`, `watch`, and `update` on `config.openshift.io` clusteroperators and clusteroperators/status resources. Integrates validation into operator reconciliation: `syncAll` now checks the `InfrastructureFailureDomainsValid` ClusterOperator condition for VSphere platforms and appends validation failures to the error list for downstream degraded handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 31.03% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	⚠️ Warning	Tests violate single responsibility; TestReconcileVSphereOrphanedReferences tests both validation and degraded condition behavior across 11+ assertions.	Split multi-behavior tests into separate tests for validation condition and degraded condition behavior to ensure each test focuses on one specific behavior.

✅ Passed checks (13 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Added new controller to monitor infra to validate fd usage' accurately describes the main change: addition of a new infrastructure validation controller for VSphere failure domain usage.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	The PR adds standard Go tests with stable, descriptive function names. No Ginkgo patterns are used, and all test names contain only static, deterministic strings with no dynamic values.
Microshift Test Compatibility	✅ Passed	The PR adds only standard Go unit tests (using testing.T), not Ginkgo e2e tests. The check applies only to Ginkgo-style tests (It, Describe, Context, etc.), which are not present in this PR.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR adds only standard Go unit tests (pkg/controller/*_test.go) with no Ginkgo e2e tests. No Ginkgo imports, Describe/It/Context patterns, or test/e2e/ changes detected.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds a new controller for infrastructure validation with no new deployment manifests or scheduling constraints that would impact non-HA topologies.
Ote Binary Stdout Contract	✅ Passed	klog correctly configured to stderr via logtostderr=true before logging; no unintended stdout writes in main.go, controller files, or initialization code detected.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	Tests are unit tests (not Ginkgo e2e tests) with no IPv4 hardcoded addresses, IPv4 assumptions, or external connectivity requirements.
No-Weak-Crypto	✅ Passed	No weak cryptography patterns found. The PR contains no usage of MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto implementations, or non-constant-time secret comparisons.
Container-Privileges	✅ Passed	No privileged container settings (privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation) found in the PR code or manifests.
No-Sensitive-Data-In-Logs	✅ Passed	All logging statements use only non-sensitive infrastructure metadata (machine names, regions, zones, condition statuses). No passwords, API keys, tokens, PII, or credentials are logged.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-09T15:44:09Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-06-09T15:44:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign damdo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cmd/vsphere/main.go (1)
167-170: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the cluster-wide Machine watch out of the shared manager cache.

DefaultNamespaces says this binary should reconcile machine-api objects only in --namespace, but Line 177 overrides machinev1.Machine to cache all namespaces. Because the Machine actuator/controller on Line 223 shares this manager, that widens Machine reconciliation cluster-wide, not just the new infrastructure validation controller. Please keep validation’s cluster-wide Machine watch on a dedicated cache/client instead of changing the manager-wide Machine cache.

Also applies to: 177-181
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/vsphere/main.go` around lines 167 - 170, The manager-wide cache is being
widened by overriding machinev1.Machine in opts.Cache.DefaultNamespaces; instead
revert that change and keep opts.Cache.DefaultNamespaces as-is, and do not
register machinev1.Machine in the shared manager cache. Create a dedicated
cluster-scoped cache/client for Machine objects (using controller-runtime's
cache.New or similar) and pass that dedicated client/cache into the
infrastructure validation controller setup (the controller registered where the
Machine actuator/controller is created around the code referenced at Line 223),
while leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.

🧹 Nitpick comments (1)

pkg/controller/infrastructurevalidation/vsphere_validator.go (1)

80-123: ⚡ Quick win

Consider checking context cancellation in the machine validation loop.

The loop iterates over all machines without checking ctx.Err(). For clusters with many machines, this could delay cancellation.

Suggested improvement

 // Check each machine's failure domain reference via region/zone labels
 for i := range machineList.Items {
+	if ctx.Err() != nil {
+		return nil, ctx.Err()
+	}
 	machine := &machineList.Items[i]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/infrastructurevalidation/vsphere_validator.go` around lines 80
- 123, Inside the machine validation loop over machineList.Items, check the
provided context for cancellation (ctx.Err() or ctx.Done()) at the top of each
iteration and stop processing if cancelled: return early from the enclosing
validation function (propagating the context error) or break the loop so work
halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@install/0000_30_machine-api-operator_09_rbac.yaml`:
- Around line 291-300: Split the single RBAC rule into two rules under
apiGroups: config.openshift.io: one rule that lists resources: clusteroperators
with verbs get, list, watch (no update) to grant read access, and a second rule
that lists resource: clusteroperators/status with verbs update (and optionally
get/list/watch if the controller reads status) so that the update verb only
applies to the status subresource; locate the existing block referencing
"clusteroperators" and "clusteroperators/status" and replace it with these two
narrower rules.

In `@pkg/operator/sync.go`:
- Around line 93-103: The ClusterOperator lookup error from
optr.getClusterOperator() is currently ignored which lets validation be skipped;
modify the block in pkg/operator/sync.go so that if optr.getClusterOperator()
returns an error you append that error to the errors slice (or return early)
instead of silently continuing; keep the existing infraValidCondition check
(v1helpers.FindStatusCondition(...) and its use of
"InfrastructureFailureDomainsValid") but ensure any err from
optr.getClusterOperator() is propagated into errors so the failure-domain
validation gate is not bypassed.

---

Outside diff comments:
In `@cmd/vsphere/main.go`:
- Around line 167-170: The manager-wide cache is being widened by overriding
machinev1.Machine in opts.Cache.DefaultNamespaces; instead revert that change
and keep opts.Cache.DefaultNamespaces as-is, and do not register
machinev1.Machine in the shared manager cache. Create a dedicated cluster-scoped
cache/client for Machine objects (using controller-runtime's cache.New or
similar) and pass that dedicated client/cache into the infrastructure validation
controller setup (the controller registered where the Machine
actuator/controller is created around the code referenced at Line 223), while
leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.

---

Nitpick comments:
In `@pkg/controller/infrastructurevalidation/vsphere_validator.go`:
- Around line 80-123: Inside the machine validation loop over machineList.Items,
check the provided context for cancellation (ctx.Err() or ctx.Done()) at the top
of each iteration and stop processing if cancelled: return early from the
enclosing validation function (propagating the context error) or break the loop
so work halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d154da29-1729-4bc0-8c59-f08eac10bc7a

📥 Commits

Reviewing files that changed from the base of the PR and between d7772c6 and 50627e1.

📒 Files selected for processing (7)

cmd/vsphere/main.go
install/0000_30_machine-api-operator_09_rbac.yaml
pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go
pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go
pkg/controller/infrastructurevalidation/vsphere_validator.go
pkg/controller/infrastructurevalidation/vsphere_validator_test.go
pkg/operator/sync.go

coderabbitai · 2026-06-09T15:51:29Z

+  - apiGroups:
+      - config.openshift.io
+    resources:
+      - clusteroperators
+      - clusteroperators/status
+    verbs:
+      - get
+      - list
+      - watch
+      - update


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Split this rule so update only applies to clusteroperators/status.

The controller reads clusteroperators and writes through Status().Update, but this single rule also grants update on the parent clusteroperators resource because verbs apply to every resource in the list. That is broader cluster-level write access than the new controller needs.

🔐 Suggested RBAC split

- - apiGroups: - - config.openshift.io - resources: - - clusteroperators - - clusteroperators/status - verbs: - - get - - list - - watch - - update + - apiGroups: + - config.openshift.io + resources: + - clusteroperators + verbs: + - get + - list + - watch + - apiGroups: + - config.openshift.io + resources: + - clusteroperators/status + verbs: + - update

As per coding guidelines, "RBAC: least privilege; no cluster-admin for workloads."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- apiGroups:

- config.openshift.io

resources:

- clusteroperators

- clusteroperators/status

verbs:

- get

- list

- watch

- update

- apiGroups:

- config.openshift.io

resources:

- clusteroperators

verbs:

- get

- list

- watch

- apiGroups:

- config.openshift.io

resources:

- clusteroperators/status

verbs:

- update

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@install/0000_30_machine-api-operator_09_rbac.yaml` around lines 291 - 300, Split the single RBAC rule into two rules under apiGroups: config.openshift.io: one rule that lists resources: clusteroperators with verbs get, list, watch (no update) to grant read access, and a second rule that lists resource: clusteroperators/status with verbs update (and optionally get/list/watch if the controller reads status) so that the update verb only applies to the status subresource; locate the existing block referencing "clusteroperators" and "clusteroperators/status" and replace it with these two narrower rules.

Source: Coding guidelines

coderabbitai · 2026-06-09T15:51:29Z

+	// Check infrastructure validation for VSphere
+	if config.PlatformType == configv1.VSpherePlatformType {
+		co, err := optr.getClusterOperator()
+		if err == nil {
+			// Check if InfrastructureFailureDomainsValid condition is False
+			infraValidCondition := v1helpers.FindStatusCondition(co.Status.Conditions, "InfrastructureFailureDomainsValid")
+			if infraValidCondition != nil && infraValidCondition.Status == configv1.ConditionFalse {
+				errors = append(errors, fmt.Errorf("infrastructure validation failed: %s", infraValidCondition.Message))
+			}
+		}
+	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t fail open when the ClusterOperator lookup errors.

If optr.getClusterOperator() fails here, the new validation gate is silently skipped and syncAll can continue into the later Available path. Please append that lookup failure to errors (or return early) so an unreadable ClusterOperator cannot disable the failure-domain validation gate.

🩹 Minimal fix

if config.PlatformType == configv1.VSpherePlatformType { co, err := optr.getClusterOperator() - if err == nil { + if err != nil { + errors = append(errors, fmt.Errorf("failed to read ClusterOperator for infrastructure validation: %w", err)) + } else { // Check if InfrastructureFailureDomainsValid condition is False infraValidCondition := v1helpers.FindStatusCondition(co.Status.Conditions, "InfrastructureFailureDomainsValid") if infraValidCondition != nil && infraValidCondition.Status == configv1.ConditionFalse { errors = append(errors, fmt.Errorf("infrastructure validation failed: %s", infraValidCondition.Message)) } } }

As per coding guidelines, "Never ignore error returns."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/sync.go` around lines 93 - 103, The ClusterOperator lookup error from optr.getClusterOperator() is currently ignored which lets validation be skipped; modify the block in pkg/operator/sync.go so that if optr.getClusterOperator() returns an error you append that error to the errors slice (or return early) instead of silently continuing; keep the existing infraValidCondition check (v1helpers.FindStatusCondition(...) and its use of "InfrastructureFailureDomainsValid") but ensure any err from optr.getClusterOperator() is propagated into errors so the failure-domain validation gate is not bypassed.

Source: Coding guidelines

Added new controller to monitor infra to validate fd usage

50627e1

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added new controller to monitor infra to validate fd usage#1508

Added new controller to monitor infra to validate fd usage#1508
vr4manta wants to merge 1 commit into
openshift:mainfrom
vr4manta:infra-validation

vr4manta commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

❌ Failed checks (2 warnings)

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vr4manta commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vr4manta commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading