Fix peer store management on channel closure#895
Conversation
|
👋 Thanks for assigning @tnull as a reviewer! |
Camillarhi
left a comment
There was a problem hiding this comment.
Thanks for working on this. I left some comments. Also, can you squash the commits?
| user_channel_id: UserChannelId(user_channel_id), | ||
| counterparty_node_id, | ||
| reason: Some(reason), | ||
| reason: Some(reason.clone()), |
There was a problem hiding this comment.
reason.clone() is unnecessary as reason was not moved by the match
There was a problem hiding this comment.
Good catch, removed.
Thanks!
| if let Some(counterparty_node_id) = counterparty_node_id { | ||
| let all_channels = | ||
| self.channel_manager.list_channels_with_counterparty(&counterparty_node_id); | ||
|
|
||
| let has_other_channels = | ||
| all_channels.iter().any(|c| c.channel_id != channel_id); | ||
|
|
||
| let counterparty_initiated = matches!( | ||
| reason, | ||
| ClosureReason::CounterpartyForceClosed { .. } | ||
| | ClosureReason::CounterpartyInitiatedCooperativeClosure | ||
| ); |
There was a problem hiding this comment.
We should flip the order here and check the closure reason first, so the read on the channel manager only happens when it's actually relevant
There was a problem hiding this comment.
That's true, no point querying the channel manager if the closure reason rules out action.
It has been reordered.
| // Check if this was the last open channel, if so, forget the peer. | ||
| if open_channels.len() == 1 { | ||
| //For force-closes we keep the peer | ||
| // in the store so the background reconnection task can fire and | ||
| // complete the channel_reestablish recovery flow. This is especially | ||
| // important for LND peers | ||
| // The peer will be removed by the ChannelClosed event handler once | ||
| // the counterparty reconnects and confirms closure. |
There was a problem hiding this comment.
This comment should flow as one paragraph rather than fragmented lines, and the last sentence is misleading, the ChannelClosed event handler only removes the peer on counterparty-initiated closures, so after a local force-close the peer isn't actually cleaned up there. It's retained indefinitely until the user calls disconnect.
Something like:
// If this was the last open channel and we're closing cooperatively, forget // the peer, since we have no further reason to reconnect. For force-closes we // intentionally keep the peer in the store so the background reconnection // task keeps firing and can drive the channel_reestablish recovery flow. // This is especially important against LND peers, which don't always handle // force-closure error messages correctly. Note that this means a force-closed // peer is retained until the user explicitly calls Node::disconnect.
There was a problem hiding this comment.
You're right, the last sentence was inaccurate.
The handler only removes on counterparty-initiated closures, not local force-closes.
Replaced with your suggested comment.
| } | ||
|
|
||
| #[tokio::test(flavor = "multi_thread", worker_threads = 1)] | ||
| async fn test_peer_retained_on_local_force_close() { |
There was a problem hiding this comment.
This test can be folded into do_channel_full_cycle as it already runs both force_close == true and false paths, so the assertion can be done there
There was a problem hiding this comment.
Thanks for pointing this out. I have folded the assertions into do_channel_full_cycle alongside the existing force-close paths rather than keeping them as separate tests.
|
🔔 1st Reminder Hey @tnull! This PR has been waiting for your review. |
c1bd96c to
7609c26
Compare
Thanks for the thorough review. All four points have been addressed . Will squash the commits once the changes are in. |
| // If the counterparty initiated closure of their last remaining channel | ||
| // with us, remove them from the peer store so we stop trying to reconnect. | ||
| // | ||
| // If we initiated the closure, keep them in the peer store so the | ||
| // background reconnection task fires and we can complete the | ||
| // channel_reestablish recovery flow. This matters especially for LND | ||
| // peers, which need us to reconnect to recover from force-closures. | ||
| // | ||
| // We exclude `channel_id` from the remaining-channel check because LDK | ||
| // fires ChannelClosed before removing the channel from its internal list, | ||
| // so list_channels_with_counterparty still includes the closing channel. |
There was a problem hiding this comment.
This comment partially duplicates the rationale in the lib.rs change. I suggest trimming it to just what this block does, and let the lib.rs comment carry the LND recovery rationale
There was a problem hiding this comment.
Thank you!
I have trimmed the comment here to focus on the behavior of this block.
tnull
left a comment
There was a problem hiding this comment.
Thanks for looking into this. I think technically we could consider cleaning up the stored peer once all funds of the last channel were swept successfully, though I'm not sure if it's worth the complexity adding explicit tracking for it (but let me know if you see an easy/minimally invasive approach to do it).
| .any(|c| c.channel_id != channel_id); | ||
|
|
||
| if !has_other_channels { | ||
| if let Err(e) = self.peer_store.remove_peer(&counterparty_node_id) { |
There was a problem hiding this comment.
We usually replay the event in case of persistence failures. While this path is not critical, we should probably do the same here to keep it consistent.
There was a problem hiding this comment.
I agree, I've updated this to return ReplayEvent() on failure to keep it consistent with other persistence paths.
| // task keeps firing and can drive the channel_reestablish recovery flow. | ||
| // This is especially important against LND peers, which don't always handle force-closure error messages correctly. | ||
|
|
||
| //Note that this means a force-closed peer is retained until the user explicitly calls Node::disconnect. |
There was a problem hiding this comment.
nit:
| //Note that this means a force-closed peer is retained until the user explicitly calls Node::disconnect. | |
| // Note that this means a force-closed peer is retained until the user explicitly calls Node::disconnect. |
That makes sense, and would be the ideal cleanup point. I don't see a minimally invasive way to track it right now without additional tracking of channel resolution state across events, so I'd lean toward leaving it as a follow-up issue rather than adding complexity here. Happy to open one if that's useful. |
|
🔔 1st Reminder Hey @tnull! This PR has been waiting for your review. |
|
🔔 2nd Reminder Hey @tnull! This PR has been waiting for your review. |
|
🔔 3rd Reminder Hey @tnull! This PR has been waiting for your review. |
|
🔔 4th Reminder Hey @tnull! This PR has been waiting for your review. |
|
🔔 5th Reminder Hey @tnull! This PR has been waiting for your review. |
|
CI seems to be failing. |
Yes, the CI failure was a flaky async_payment test on 1.85.0 — connection refused on the mock Esplora server before it was ready, unrelated to this change. Triggered a fresh run and all checks are now passing. |
|
🔔 6th Reminder Hey @tnull! This PR has been waiting for your review. |
|
🔔 7th Reminder Hey @tnull! This PR has been waiting for your review. |
|
🔔 8th Reminder Hey @tnull! This PR has been waiting for your review. |
|
🔔 9th Reminder Hey @tnull! This PR has been waiting for your review. |
tnull
left a comment
There was a problem hiding this comment.
Excuse the delay here - LGTM!
Please squash and clean up the commit history while you're at it. All commits should feature a description that (briefly!) outlines the motivation and approach for the change. Feel free to check out https://cbea.ms/git-commit/ for guidance.
| // We exclude `channel_id` from the check because LDK emits | ||
| // ChannelClosed before removing the channel from its internal list. | ||
|
|
||
| if let Some(counterparty_node_id) = counterparty_node_id { |
There was a problem hiding this comment.
nit: The counterparty_node_id field is always set post LDK v0.0.117 and LDK Node v0.2 was already based on LDK v0.0.118. We should be good to just expecting it, reducing indetation.
| // ChannelClosed before removing the channel from its internal list. | ||
|
|
||
| if let Some(counterparty_node_id) = counterparty_node_id { | ||
| let counterparty_initiated = matches!( |
There was a problem hiding this comment.
Codex:
- [P2] Include on-chain force-closes in peer cleanup — /home/tnull/workspace/ldk-node/src/event.rs:1595-1599
When a peer unilaterally broadcasts its commitment while we are disconnected (or their error message is never processed), LDK closes our channel with ClosureReason::CommitmentTxConfirmed, not CounterpartyForceClosed. Because this match only treats error-message force closes as counterparty
initiated, the last-channel peer remains in peer_store and the reconnection loop continues indefinitely after the channel is gone; handle the confirmed commitment/closing-tx case or otherwise clean it up as well.
• Yes, the retention is intentional, but only for locally initiated force-closes.
The review is pointing at a different case: the peer force-closes while we are disconnected, and LDK later reports the terminal close as ClosureReason::CommitmentTxConfirmed rather than CounterpartyForceClosed. At that point the channel is already closed on-chain, so keeping the peer
persisted no longer helps channel_reestablish; it just keeps the reconnect loop alive for a peer with no remaining channel.So I’d treat the review as valid. The fix should preserve:
- HolderForceClosed { .. }: keep persisted peer, per the comment in src/lib.rs:1847
- CounterpartyForceClosed { .. }: remove if it was their last channel
- CommitmentTxConfirmed: also remove if it was their last channel, because this is the disconnected/on-chain remote force-close path
Caveat: CommitmentTxConfirmed is not a perfect “counterparty initiated” signal; LDK documents that it can also come from our own monitor copy. But for peer-store cleanup after a terminal ChannelClosed, it still seems closer to “no reconnect needed” than to the local HolderForceClosed
recovery case.
Summary
This PR resolves both issues described in #745 related to peer store management on channel closure.
Fixed behaviors
channel_reestablishrecoveryProblems
1. Local force-close removes peer too early
When we force-closed a channel in
close_channel_internal, we immediately removed the peer from the store regardless of closure type:This broke recovery. The background reconnection task only reconnects to peers in the store. Removing the peer immediately guaranteed we would never reconnect, so
channel_reestablishcould never complete. This is especially problematic for LND peers, which may not handle force-closure error messages correctly and rely on us reconnecting to recover.2. Counterparty force-close never cleans up peer
When a counterparty force-closed their last channel with us, the
ChannelClosedevent handler did nothing with the peer store. The peer remained persisted indefinitely and the node kept attempting to reconnect on every background tick. wasted resources and misleading peer state.Solution
1. Retain peer on local force-close (
src/lib.rs)In
close_channel_internal, peers are now only removed for cooperativecloses:
For force-closes, the peer is retained in the store so the background reconnection task can fire,
channel_reestablishcan complete, and peer cleanup is deferred to theChannelClosedevent handler.2. Remove peer on counterparty-initiated closure (
src/event.rs)In the
LdkEvent::ChannelClosedhandler, we now check:If both conditions are true, the peer is removed from the store.
Implementation Details
LDK timing behavior
LDK emits
ChannelClosedbefore removing the channel from its internal state. A naive call tolist_channels_with_counterparty()inside the handler would still return the closing channel, making the "last channel" check always wrong. We fix this by excluding the closing channel explicitly:Ordering to avoid race conditions
Peer removal is performed before calling
add_event. This ensures consumers ofnext_event_async()always observe a consistent peer store when they react toChannelClosed— avoiding a race where the event is visible before the removal completes.Tests
Added two integration tests in
tests/integration_tests_rust.rs:test_peer_removed_on_counterparty_force_closeVerifies that after the counterparty force-closes the last channel, the peer is no longer persisted in the local peer store.
test_peer_retained_on_local_force_closeVerifies that after a locally initiated force-close, the peer remains persisted in the local peer store to allow background reconnection and
channel_reestablishrecovery.Both tests assert on
is_persistedrather than peer presence to correctly distinguish stored peers from transient TCP connections that may still appear inlist_peers()during teardown.Results
test test_peer_removed_on_counterparty_force_close ... ok
test test_peer_retained_on_local_force_close ... ok
test result: ok. 41 passed; 0 failed; 0 ignored; 0 measured
cargo build → no errors
cargo clippy → no new warnings
cargo fmt → applied