Skip to content

fix(compactor): await HeartBeat goroutine to fix flaky TestBlocksCleaner on arm64#7566

Closed
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-blocks-cleaner-heartbeat-race
Closed

fix(compactor): await HeartBeat goroutine to fix flaky TestBlocksCleaner on arm64#7566
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-blocks-cleaner-heartbeat-race

Conversation

@sandy2008
Copy link
Copy Markdown
Contributor

What this PR does

Wraps the VisitMarkerManager.HeartBeat goroutine in BlocksCleaner.cleanUpActiveUsers and BlocksCleaner.cleanDeletedUsers with a sync.WaitGroup so the per-user cleanup callback waits for the heartbeat goroutine to fully exit before returning, instead of just signalling it via errChan <- nil and racing on.

This is a test-flake fix only; the production behaviour change is that per-user cleanup now blocks until the heartbeat's terminal MarkWithStatus(Completed) + DeleteVisitMarker writes finish, which is what the surrounding code already assumed.

Why

TestBlocksCleaner flakes intermittently on test-no-race (arm64) CI with:

--- FAIL: TestBlocksCleaner/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s (3.92s)
    objstore.go:26: 
        Error Trace: pkg/util/testutil/objstore.go:26
        Error:      Received unexpected error:
                    unlinkat /tmp/bucket3527476943: directory not empty

Observed in run 26555111751.

Root cause: cleanUpActiveUsers / cleanDeletedUsers launched visitMarkerManager.HeartBeat in a bare go statement and only signalled it via errChan <- nil in a defer, never awaiting it. On receipt of the nil, HeartBeat still ran a terminal MarkWithStatus(Completed) using the parent ctx followed by DeleteVisitMarker(context.Background()) against the bucket. In the test, the bucket is a filesystem bucket backed by a t.TempDir() whose t.Cleanup runs os.RemoveAll; those late writes raced with the directory teardown, producing the arm64-flaky unlinkat: directory not empty failure.

The arm64 scheduler exposes this race more often than amd64 does, but the bug is platform-independent — the goroutine leak / late-write window exists on every architecture.

How

pkg/compactor/blocks_cleaner.go (two call sites: cleanUpActiveUsers and cleanDeletedUsers). Replace:

go visitMarkerManager.HeartBeat(ctx, errChan, c.cleanerVisitMarkerFileUpdateInterval, true)
defer func() {
    errChan <- nil
}()

with:

var hbWG sync.WaitGroup
hbWG.Add(1)
go func() {
    defer hbWG.Done()
    visitMarkerManager.HeartBeat(ctx, errChan, c.cleanerVisitMarkerFileUpdateInterval, true)
}()
defer func() {
    errChan <- nil
    hbWG.Wait()
}()

The same change is applied symmetrically in both call sites. sync is already imported in this file so no import diff is required. A CHANGELOG entry is added under master / unreleased referencing this PR.

Why not other approaches

  • Skip the test on arm64. Hides a real goroutine-leak / late-write window in production code; the bucket writes still happen post-callback regardless of platform.
  • Refactor HeartBeat to await internally. HeartBeat is shared with other call sites (partition_compaction_planner.go, shuffle_sharding_planner.go) that intentionally do not wait; changing its lifetime contract risks breakage well beyond Flaky test: TestBlocksCleaner (arm64) — unlinkat: directory not empty #7564. The leak is owned by the caller; fix it at the caller.
  • Use a done channel instead of sync.WaitGroup. Functionally equivalent (this is what PR Fix flaky TestBlocksCleaner by awaiting HeartBeat goroutine completion #7386 proposed). WaitGroup reads naturally as "wait for this goroutine to exit"; either pattern works.
  • Use errgroup. Overkill for a single goroutine with no error to propagate (the existing errChan already carries the only signal that matters). Would expand the diff without changing the correctness story.

Which issue(s) this PR fixes

Fixes #7564

Note: PR #7386 by another contributor proposed a similar doneChan fix that was marked CHANGES_REQUESTED by @friedrichg for missing the CHANGELOG entry. This PR is functionally equivalent (sync.WaitGroup vs doneChan) and includes the missing CHANGELOG.

Checklist

  • CHANGELOG.md updated under master / unreleased
  • Documentation updated — N/A
  • Tests: existing TestBlocksCleaner is the regression test; runs 30/30 PASS in this worktree on amd64 (the arm64 race window is narrower so amd64 reproduction is unreliable, but the goroutine-leak window is closed regardless of platform)

Test plan

  • gofmt -l pkg/compactor/blocks_cleaner.go — clean
  • goimports -local github.com/cortexproject/cortex -l pkg/compactor/blocks_cleaner.go — clean
  • go vet -tags "netgo slicelabels" ./pkg/compactor/... — clean
  • go build -tags "netgo slicelabels" ./pkg/compactor/... — clean
  • go test -tags "netgo slicelabels" -timeout 600s -count=30 -run "^TestBlocksCleaner$/concurrency=2,_markers_migration_enabled=false,_tenant_deletion_delay=0s" ./pkg/compactor/... — 30/30 PASS
  • go test -race -tags "netgo slicelabels" -timeout 600s -run "^TestBlocksCleaner$" ./pkg/compactor/... — PASS

🤖 Generated with Claude Code

BlocksCleaner.cleanUpActiveUsers and cleanDeletedUsers each launched
a VisitMarkerManager.HeartBeat goroutine and only signalled it via
errChan <- nil before returning, never awaiting the goroutine. When
HeartBeat received the nil it still ran a terminal MarkWithStatus(
Completed) using the parent ctx followed by DeleteVisitMarker(
context.Background()) against the bucket. In tests using a filesystem
bucket, those writes raced with t.Cleanup's RemoveAll on the temp dir,
producing the arm64-flaky failure:

  unlinkat .../visit-mark.json: directory not empty

Wrap the heartbeat goroutine in a sync.WaitGroup so the per-user
callback waits for HeartBeat to fully exit (including its terminal
writes) before returning. This drains all bucket I/O before test
teardown removes the directory, eliminating the race.

Fixes cortexproject#7564

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@SungJin1212
Copy link
Copy Markdown
Member

I think this PR duplicates #7386.

@sandy2008
Copy link
Copy Markdown
Contributor Author

#7386

Yep, I mentioned it in PR, then let me close this PR for now!
Duplication of #7386

@sandy2008 sandy2008 closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestBlocksCleaner (arm64) — unlinkat: directory not empty

2 participants