Skip to content

[server] Add per-task timeout for RebalanceManager to prevent indefinite blocking#3429

Open
swuferhong wants to merge 3 commits into
apache:mainfrom
swuferhong:reblance-block
Open

[server] Add per-task timeout for RebalanceManager to prevent indefinite blocking#3429
swuferhong wants to merge 3 commits into
apache:mainfrom
swuferhong:reblance-block

Conversation

@swuferhong
Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #3096

  • Fix a liveness issue where RebalanceManager could block indefinitely if a rebalance bucket's ISR never converges (e.g., target replica is on a dead TabletServer).
  • Introduce a per-task timeout (2 minutes) with periodic checks (every 30 seconds). When an in-flight bucket exceeds the timeout, it is treated as COMPLETED, unblocking subsequent buckets and the overall rebalance.
  • Add RebalanceTaskTimeoutEvent to the coordinator event system for safe single-threaded state mutation.

Brief change log

Tests

API and Format

Documentation

Copy link
Copy Markdown
Contributor

@loserwang1024 loserwang1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment

TableBucket tableBucket = inProgressRebalanceTasksQueue.peek();
if (tableBucket != null && inProgressRebalanceTasks.containsKey(tableBucket)) {
inflightTaskBucket = tableBucket;
inflightTaskStartMs = clock.milliseconds();
Copy link
Copy Markdown
Contributor

@LiebingYu LiebingYu Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: inflightTaskBucket and inflightTaskStartMs rely on a fragile implicit write-ordering invariant

Currently this is safe because every clear path writes inflightTaskStartMs = -1 before clearing inflightTaskBucket, and processNewRebalanceTask() writes inflightTaskBucket before inflightTaskStartMs = now(). Since checkTimeout() reads inflightTaskStartMs first and returns immediately if it is -1, no dangerous interleaving can occur today.

However, this correctness depends on the write order being maintained across four separate call sites, with no compile-time enforcement. If a future change swaps the write order in any of these paths (e.g., writes inflightTaskStartMs before inflightTaskBucket in processNewRebalanceTask), the timeout thread could read a stale bucket or a stale startMs and fire a spurious timeout for the wrong task.

Consider combining both fields into a single immutable holder behind an AtomicReference to make the invariant structurally unbreakable rather than relying on write ordering:

private final AtomicReference<InflightTask> inflightTask = new AtomicReference<>();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the ordering fragility. Fixed by applying the volatile publication idiom — inflightTaskBucket now acts as the "gate" variable: written last when setting, cleared first when resetting. The timeout checker reads it first, so JMM happens-before guarantees it always sees a consistent (bucket, startMs) pair. Added comments documenting the contract.

Copy link
Copy Markdown
Contributor

@LiebingYu LiebingYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swuferhong Thanks for the PR. I have some comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[server] Add timeout mechanism for rebalance tasks to prevent indefinite blocking on missing completion events

3 participants