fix(operator): guard watch error before wrapping with watch.Filter#3
fix(operator): guard watch error before wrapping with watch.Filter#3Arsolitt (Arsolitt) wants to merge 2 commits into
Conversation
NewRedisFailoverRetriever's WatchFunc wrapped the result of WatchRedisFailovers with watch.Filter unconditionally. On a watch error the typed client returns a nil watch.Interface, and watch.Filter spawns a goroutine whose loop dereferences the source watcher's ResultChan, crashing the operator with SIGSEGV instead of letting the reflector retry the watch. This reliably takes the operator into CrashLoopBackOff on clusters where the initial watch is flaky (e.g. a freshly bootstrapped control plane), so no RedisFailover is ever reconciled. Return (nil, err) before wrapping, mirroring the existing ListFunc guard, and add a regression test that reproduces the panic and covers the happy path. Signed-off-by: Arsolitt <arsolitt@gmail.com>
|
Warning Review limit reached
More reviews will be available in 41 minutes and 57 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR fixes a potential nil-pointer panic in the watch setup logic by explicitly checking for errors from ChangesWatch error handling in RedisFailoverRetriever
🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request fixes a potential panic in the NewRedisFailoverRetriever's WatchFunc by checking for errors before wrapping the watcher with watch.Filter, and adds unit tests to verify this behavior. The review feedback identifies a critical issue where watch.Error events are silently filtered out because they do not match the expected RedisFailover type, which could prevent the reflector from restarting failed watches. A code suggestion is provided to explicitly allow watch.Error events to pass through the filter.
| return watch.Filter(watcher, func(event watch.Event) (watch.Event, bool) { | ||
| rf, ok := event.Object.(*redisfailoverv1.RedisFailover) | ||
| if !ok { | ||
| return event, false | ||
| } | ||
| return event, isNamespaceSupported(*rf) | ||
| }) | ||
| return watcher, err | ||
| }), nil |
There was a problem hiding this comment.
When filtering watch events, it is crucial to propagate watch.Error events. If a watch.Error event occurs (for example, when a watch resource version is too old and the watch is closed by the API server), the event object will not be of type *redisfailoverv1.RedisFailover. Under the current implementation, this event will be silently filtered out because ok is false, returning event, false.
If the reflector/informer does not receive the watch.Error event, it won't know that the watch has failed and needs to be restarted immediately, potentially causing the operator to hang or delay recovery until a connection timeout occurs.
We should explicitly allow watch.Error events to pass through the filter.
return watch.Filter(watcher, func(event watch.Event) (watch.Event, bool) {
if event.Type == watch.Error {
return event, true
}
rf, ok := event.Object.(*redisfailoverv1.RedisFailover)
if !ok {
return event, false
}
return event, isNamespaceSupported(*rf)
}), nilThere was a problem hiding this comment.
Good catch — fixed in 93fdbf30. The filter now passes watch.Error events through (returning event, true) before the *RedisFailover type assertion, so they're no longer dropped. This lets the reflector observe the error via apierrors.FromObject and restart the watch promptly (e.g. on an expired resource version) instead of waiting for the connection to close. I also added a regression test that pushes a watch.Error through a fake watcher and asserts it reaches the filtered channel.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@operator/redisfailover/factory_test.go`:
- Around line 71-74: The test calling retriever.Watch(context.Background(),
metav1.ListOptions{}) stores the returned watcher in w but never stops it,
causing goroutine leakage; update the test to stop the watcher in the success
path by calling w.Stop() (or defer w.Stop() right after ensuring err==nil and
w!=nil) so the watcher is closed before asserting expectations (references:
retriever.Watch and variable w).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9f9f2faa-976f-4a35-be40-416dda0c9f6c
📒 Files selected for processing (2)
operator/redisfailover/factory.gooperator/redisfailover/factory_test.go
| w, err := retriever.Watch(context.Background(), metav1.ListOptions{}) | ||
| assert.NoError(err) | ||
| assert.NotNil(w) | ||
| ms.AssertExpectations(t) |
There was a problem hiding this comment.
Stop the returned watcher in the success-path test to avoid goroutine leakage.
retriever.Watch(...) returns the filtered watcher; stopping it explicitly makes test lifecycle deterministic and avoids lingering goroutines in larger suites.
Suggested patch
w, err := retriever.Watch(context.Background(), metav1.ListOptions{})
assert.NoError(err)
assert.NotNil(w)
+if w != nil {
+ defer w.Stop()
+}
ms.AssertExpectations(t)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| w, err := retriever.Watch(context.Background(), metav1.ListOptions{}) | |
| assert.NoError(err) | |
| assert.NotNil(w) | |
| ms.AssertExpectations(t) | |
| w, err := retriever.Watch(context.Background(), metav1.ListOptions{}) | |
| assert.NoError(err) | |
| assert.NotNil(w) | |
| if w != nil { | |
| defer w.Stop() | |
| } | |
| ms.AssertExpectations(t) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@operator/redisfailover/factory_test.go` around lines 71 - 74, The test
calling retriever.Watch(context.Background(), metav1.ListOptions{}) stores the
returned watcher in w but never stops it, causing goroutine leakage; update the
test to stop the watcher in the success path by calling w.Stop() (or defer
w.Stop() right after ensuring err==nil and w!=nil) so the watcher is closed
before asserting expectations (references: retriever.Watch and variable w).
There was a problem hiding this comment.
Thanks, but I think this is redundant here. filteredWatch.Stop() just delegates to the source watcher's Stop(), and the filter goroutine ranges over the source's ResultChan(), so it exits as soon as the source closes. The test already has defer fake.Stop() (line 61), which closes that channel and terminates the goroutine — adding defer w.Stop() would only call the same fake.Stop() again. There's no goroutine left to leak, so I'll keep the test as is.
There was a problem hiding this comment.
@Arsolitt, that's a fair and accurate analysis — watch.Filter starts a goroutine that ranges over w.ResultChan() and exits naturally once the source channel is closed, so defer fake.Stop() on line 61 is already the correct cleanup point. Calling defer w.Stop() would just re-invoke the same underlying fake.Stop(). Acknowledged, happy to keep the test as-is.
watch.Error events carry a *metav1.Status object rather than a *RedisFailover, so the namespace filter's type assertion silently dropped them. The reflector depends on these events to detect a failed watch (such as an expired resource version) and restart it promptly instead of waiting for a connection timeout. Pass watch.Error events through unchanged and add a regression test covering the propagation. Signed-off-by: Arsolitt <arsolitt@gmail.com>
Changes proposed on the PR:
NewRedisFailoverRetriever'sWatchFunc, before wrapping the watcher withwatch.Filter, so the reflector retries the watch instead of crashing the operator.Why
When the RedisFailover watch cannot be established, the typed client returns a nil
watch.Interfacetogether with the error.WatchFuncwrapped that result withwatch.Filterunconditionally.watch.Filterimmediately starts a goroutine whose loop dereferences the source watcher'sResultChan(), so a nil watcher panics the entire operator process:On clusters where the initial watch is flaky — for example a freshly bootstrapped control plane under load — this drives the operator into CrashLoopBackOff and no RedisFailover is ever reconciled.
The fix mirrors the existing
ListFunc, which already returns early when the list call errors. The unguarded wrapper was introduced when namespace filtering addedwatch.Filter; earlier releases returned the watcher directly fromWatchFunc.Summary by CodeRabbit
Release Notes