[server] Add per-task timeout for RebalanceManager to prevent indefinite blocking#3429
[server] Add per-task timeout for RebalanceManager to prevent indefinite blocking#3429swuferhong wants to merge 3 commits into
Conversation
48782f7 to
4db4e7b
Compare
| TableBucket tableBucket = inProgressRebalanceTasksQueue.peek(); | ||
| if (tableBucket != null && inProgressRebalanceTasks.containsKey(tableBucket)) { | ||
| inflightTaskBucket = tableBucket; | ||
| inflightTaskStartMs = clock.milliseconds(); |
There was a problem hiding this comment.
Nit: inflightTaskBucket and inflightTaskStartMs rely on a fragile implicit write-ordering invariant
Currently this is safe because every clear path writes inflightTaskStartMs = -1 before clearing inflightTaskBucket, and processNewRebalanceTask() writes inflightTaskBucket before inflightTaskStartMs = now(). Since checkTimeout() reads inflightTaskStartMs first and returns immediately if it is -1, no dangerous interleaving can occur today.
However, this correctness depends on the write order being maintained across four separate call sites, with no compile-time enforcement. If a future change swaps the write order in any of these paths (e.g., writes inflightTaskStartMs before inflightTaskBucket in processNewRebalanceTask), the timeout thread could read a stale bucket or a stale startMs and fire a spurious timeout for the wrong task.
Consider combining both fields into a single immutable holder behind an AtomicReference to make the invariant structurally unbreakable rather than relying on write ordering:
private final AtomicReference<InflightTask> inflightTask = new AtomicReference<>();
There was a problem hiding this comment.
Good catch on the ordering fragility. Fixed by applying the volatile publication idiom — inflightTaskBucket now acts as the "gate" variable: written last when setting, cleared first when resetting. The timeout checker reads it first, so JMM happens-before guarantees it always sees a consistent (bucket, startMs) pair. Added comments documenting the contract.
LiebingYu
left a comment
There was a problem hiding this comment.
@swuferhong Thanks for the PR. I have some comments.
Purpose
Linked issue: close #3096
Brief change log
Tests
API and Format
Documentation