Skip to content

fix(operator): guard watch error before wrapping with watch.Filter#3

Open
Arsolitt (Arsolitt) wants to merge 2 commits into
masterfrom
fix/watchfunc-nil-watcher-panic
Open

fix(operator): guard watch error before wrapping with watch.Filter#3
Arsolitt (Arsolitt) wants to merge 2 commits into
masterfrom
fix/watchfunc-nil-watcher-panic

Conversation

@Arsolitt
Copy link
Copy Markdown
Collaborator

@Arsolitt Arsolitt (Arsolitt) commented May 30, 2026

Changes proposed on the PR:

  • Return early on a watch error in NewRedisFailoverRetriever's WatchFunc, before wrapping the watcher with watch.Filter, so the reflector retries the watch instead of crashing the operator.
  • Add a regression test covering the error path (error propagated, no panic, no wrapped watcher) and the happy path (watcher is wrapped).

Why

When the RedisFailover watch cannot be established, the typed client returns a nil watch.Interface together with the error. WatchFunc wrapped that result with watch.Filter unconditionally. watch.Filter immediately starts a goroutine whose loop dereferences the source watcher's ResultChan(), so a nil watcher panics the entire operator process:

panic: runtime error: invalid memory address or nil pointer dereference
k8s.io/apimachinery/pkg/watch.(*filteredWatch).loop()  filter.go:64
created by k8s.io/apimachinery/pkg/watch.Filter         filter.go:41

On clusters where the initial watch is flaky — for example a freshly bootstrapped control plane under load — this drives the operator into CrashLoopBackOff and no RedisFailover is ever reconciled.

The fix mirrors the existing ListFunc, which already returns early when the list call errors. The unguarded wrapper was introduced when namespace filtering added watch.Filter; earlier releases returned the watcher directly from WatchFunc.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved error handling in Redis Failover watch operations to prevent crashes and ensure errors are properly reported.

Review Change Stack

NewRedisFailoverRetriever's WatchFunc wrapped the result of
WatchRedisFailovers with watch.Filter unconditionally. On a watch error
the typed client returns a nil watch.Interface, and watch.Filter spawns a
goroutine whose loop dereferences the source watcher's ResultChan,
crashing the operator with SIGSEGV instead of letting the reflector retry
the watch. This reliably takes the operator into CrashLoopBackOff on
clusters where the initial watch is flaky (e.g. a freshly bootstrapped
control plane), so no RedisFailover is ever reconciled.

Return (nil, err) before wrapping, mirroring the existing ListFunc guard,
and add a regression test that reproduces the panic and covers the happy
path.

Signed-off-by: Arsolitt <arsolitt@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Warning

Review limit reached

@Arsolitt, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 41 minutes and 57 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ca4d00d-bbba-4768-8c51-1276a7316adb

📥 Commits

Reviewing files that changed from the base of the PR and between c4f64aa and 93fdbf3.

📒 Files selected for processing (2)
  • operator/redisfailover/factory.go
  • operator/redisfailover/factory_test.go
📝 Walkthrough

Walkthrough

The PR fixes a potential nil-pointer panic in the watch setup logic by explicitly checking for errors from cli.WatchRedisFailovers before passing the watcher to the filtering wrapper. The change adds two unit tests that verify both the error propagation and successful filtering paths.

Changes

Watch error handling in RedisFailoverRetriever

Layer / File(s) Summary
Error handling in WatchFunc
operator/redisfailover/factory.go
WatchFunc now explicitly checks for errors from cli.WatchRedisFailovers and returns early without attempting to wrap a nil watcher, preventing panics. The filter is applied only when a valid watcher is returned.
Unit tests for Watch error paths
operator/redisfailover/factory_test.go
Two tests verify the error propagation path (error returned without wrapping, no panic, nil watcher) and the success path (watcher wrapped and non-nil returned) with mocked Kubernetes service calls.

🎯 2 (Simple) | ⏱️ ~10 minutes

🐰 A watcher that watched for nil with care,
Now checks for errors ere wrapping with flair,
Two paths of test ensure no panic calls,
This fix hops through safety's sturdy halls!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding a guard for watch errors before applying watch.Filter wrapping, which directly addresses the panic issue detailed in the PR objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/watchfunc-nil-watcher-panic

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a potential panic in the NewRedisFailoverRetriever's WatchFunc by checking for errors before wrapping the watcher with watch.Filter, and adds unit tests to verify this behavior. The review feedback identifies a critical issue where watch.Error events are silently filtered out because they do not match the expected RedisFailover type, which could prevent the reflector from restarting failed watches. A code suggestion is provided to explicitly allow watch.Error events to pass through the filter.

Comment on lines +100 to +106
return watch.Filter(watcher, func(event watch.Event) (watch.Event, bool) {
rf, ok := event.Object.(*redisfailoverv1.RedisFailover)
if !ok {
return event, false
}
return event, isNamespaceSupported(*rf)
})
return watcher, err
}), nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When filtering watch events, it is crucial to propagate watch.Error events. If a watch.Error event occurs (for example, when a watch resource version is too old and the watch is closed by the API server), the event object will not be of type *redisfailoverv1.RedisFailover. Under the current implementation, this event will be silently filtered out because ok is false, returning event, false.

If the reflector/informer does not receive the watch.Error event, it won't know that the watch has failed and needs to be restarted immediately, potentially causing the operator to hang or delay recovery until a connection timeout occurs.

We should explicitly allow watch.Error events to pass through the filter.

			return watch.Filter(watcher, func(event watch.Event) (watch.Event, bool) {
				if event.Type == watch.Error {
					return event, true
				}
				rf, ok := event.Object.(*redisfailoverv1.RedisFailover)
				if !ok {
					return event, false
				}
				return event, isNamespaceSupported(*rf)
			}), nil

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 93fdbf30. The filter now passes watch.Error events through (returning event, true) before the *RedisFailover type assertion, so they're no longer dropped. This lets the reflector observe the error via apierrors.FromObject and restart the watch promptly (e.g. on an expired resource version) instead of waiting for the connection to close. I also added a regression test that pushes a watch.Error through a fake watcher and asserts it reaches the filtered channel.

@Arsolitt Arsolitt (Arsolitt) marked this pull request as ready for review May 30, 2026 07:18
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@operator/redisfailover/factory_test.go`:
- Around line 71-74: The test calling retriever.Watch(context.Background(),
metav1.ListOptions{}) stores the returned watcher in w but never stops it,
causing goroutine leakage; update the test to stop the watcher in the success
path by calling w.Stop() (or defer w.Stop() right after ensuring err==nil and
w!=nil) so the watcher is closed before asserting expectations (references:
retriever.Watch and variable w).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f9f2faa-976f-4a35-be40-416dda0c9f6c

📥 Commits

Reviewing files that changed from the base of the PR and between 162776b and c4f64aa.

📒 Files selected for processing (2)
  • operator/redisfailover/factory.go
  • operator/redisfailover/factory_test.go

Comment on lines +71 to +74
w, err := retriever.Watch(context.Background(), metav1.ListOptions{})
assert.NoError(err)
assert.NotNil(w)
ms.AssertExpectations(t)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stop the returned watcher in the success-path test to avoid goroutine leakage.

retriever.Watch(...) returns the filtered watcher; stopping it explicitly makes test lifecycle deterministic and avoids lingering goroutines in larger suites.

Suggested patch
 w, err := retriever.Watch(context.Background(), metav1.ListOptions{})
 assert.NoError(err)
 assert.NotNil(w)
+if w != nil {
+	defer w.Stop()
+}
 ms.AssertExpectations(t)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
w, err := retriever.Watch(context.Background(), metav1.ListOptions{})
assert.NoError(err)
assert.NotNil(w)
ms.AssertExpectations(t)
w, err := retriever.Watch(context.Background(), metav1.ListOptions{})
assert.NoError(err)
assert.NotNil(w)
if w != nil {
defer w.Stop()
}
ms.AssertExpectations(t)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/redisfailover/factory_test.go` around lines 71 - 74, The test
calling retriever.Watch(context.Background(), metav1.ListOptions{}) stores the
returned watcher in w but never stops it, causing goroutine leakage; update the
test to stop the watcher in the success path by calling w.Stop() (or defer
w.Stop() right after ensuring err==nil and w!=nil) so the watcher is closed
before asserting expectations (references: retriever.Watch and variable w).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but I think this is redundant here. filteredWatch.Stop() just delegates to the source watcher's Stop(), and the filter goroutine ranges over the source's ResultChan(), so it exits as soon as the source closes. The test already has defer fake.Stop() (line 61), which closes that channel and terminates the goroutine — adding defer w.Stop() would only call the same fake.Stop() again. There's no goroutine left to leak, so I'll keep the test as is.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Arsolitt, that's a fair and accurate analysis — watch.Filter starts a goroutine that ranges over w.ResultChan() and exits naturally once the source channel is closed, so defer fake.Stop() on line 61 is already the correct cleanup point. Calling defer w.Stop() would just re-invoke the same underlying fake.Stop(). Acknowledged, happy to keep the test as-is.

watch.Error events carry a *metav1.Status object rather than a *RedisFailover, so the namespace filter's type assertion silently dropped them. The reflector depends on these events to detect a failed watch (such as an expired resource version) and restart it promptly instead of waiting for a connection timeout. Pass watch.Error events through unchanged and add a regression test covering the propagation.

Signed-off-by: Arsolitt <arsolitt@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant