Listener silently exits broker-reconnect loop after AAD credential-refresh (ghost-busy)

## Describe the bug

On self-hosted macOS runners (v2.334.0), after a long-running session encounters a broker disconnect followed by a credential refresh, the `Runner.Listener` process can silently exit its broker-reconnect loop. The OS process stays alive (parked on a `pthread_cond_wait`) but holds **no ESTABLISHED TCP socket to the broker**, writes no further diag log entries, and accepts no new jobs.

Critically: the broker side of the agent state still shows the runner as *busy* from its last in-flight job, so the queue stalls behind the phantom. Only `launchctl bootout` + `bootstrap` of the LaunchAgent (or equivalent hard process restart) clears the state. The runner does not self-recover.

We observed this simultaneously on 4 of 4 macOS-arm64 runners on the same host within a 32-minute window, freezing our queue for ~4 hours.

## Runner

- Version: v2.334.0
- OS: macOS 26.3.1 (Apple Silicon, M3 Ultra)
- Service supervision: launchd LaunchAgent (`actions.runner.<owner-repo>.<name>`)
- Repo-scoped (not org-scoped) registration
- 4 non-ephemeral runners (`ephemeral=false`) on one host, plus ~10 other runners for a different repo on the same host

## Expected behavior

On any unrecoverable broker session error, the listener should either:
1. Surface a fatal error and exit (so the supervisor restarts it), or
2. Keep retrying the reconnect indefinitely with visible diag log entries.
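To make the two acceptable outcomes concrete, here is a minimal Python sketch of the shape we would expect. It is not the runner's actual code (the listener is a .NET process); `reconnect_loop`, `connect`, and the backoff parameters are hypothetical names chosen for illustration only.

```python
import logging
import sys
import time

log = logging.getLogger("listener-sketch")  # hypothetical logger name


def reconnect_loop(connect, max_attempts=None, base_delay=1.0,
                   max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, logging every failed attempt.

    max_attempts=None retries forever (option 2 above). A finite value
    returns False once exhausted so the caller can exit (option 1).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            connect()
        except Exception as exc:
            if max_attempts is not None and attempt >= max_attempts:
                log.error("giving up after %d attempts: %s", attempt, exc)
                return False
            # Exponential backoff; the exponent is capped so the delay
            # calculation stays bounded on very long retry runs.
            delay = min(max_delay, base_delay * 2 ** min(attempt - 1, 16))
            log.warning("reconnect attempt %d failed: %s; retrying in %.0fs",
                        attempt, exc, delay)
            sleep(delay)
        else:
            log.info("broker session re-established on attempt %d", attempt)
            return True


def main(connect):
    # Option 1: surface the failure with a non-zero exit so the
    # supervisor (launchd in our setup) restarts the process.
    if not reconnect_loop(connect, max_attempts=5):
        sys.exit(1)
```

Either branch leaves evidence: a log line per attempt, or a non-zero exit. What we observed instead was neither.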

## Actual behavior

The listener handled dozens of routine `SocketException (89): Operation canceled` events successfully across an ~11-hour run (these are the normal long-poll cancel/retry pattern). Then at a credential-refresh boundary it logged this sequence and went completely silent:

```
[2026-05-22 08:36:41Z INFO RSAFileKeyManager] Loading RSA key parameters from file <path>/.credentials_rsaparams
[2026-05-22 08:36:42Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
```

No further log entries appeared for more than 30 minutes, and a TCP socket scan against the listener PID returned no connection to the broker. The process was alive but parked: per `sample <pid> 3`, the main thread sat in `ObjectNative::WaitTimeout` -> `SyncBlock::Wait` -> `_pthread_cond_wait`.

All four listeners on the host hit the same final-log signature within a 32-minute window (08:11-08:43 UTC). This suggests that an upstream broker-session event (TLS keepalive teardown, broker rolling restart, or scheduled session expiry policy) triggered the wedge, and that the runner client then failed to re-establish its session.
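For anyone trying to confirm the same signature, the silence can be measured from the diag log itself. The following is a rough sketch that assumes every relevant line starts with the `[YYYY-MM-DD HH:MM:SSZ ` prefix shown in the excerpt above; the function names are ours, not part of the runner.

```python
import re
from datetime import datetime, timezone

# Leading timestamp of a diag log line, e.g.
# "[2026-05-22 08:36:42Z INFO GitHubActionsService] ..."
_TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z ")


def last_log_time(lines):
    """Return the UTC timestamp of the last line that carries one, or None."""
    latest = None
    for line in lines:
        match = _TS.match(line)
        if match:
            latest = datetime.strptime(
                match.group(1), "%Y-%m-%d %H:%M:%S"
            ).replace(tzinfo=timezone.utc)
    return latest


def silent_for_seconds(lines, now):
    """Seconds between the last timestamped line and `now` (timezone-aware)."""
    last = last_log_time(lines)
    if last is None:
        return None
    return (now - last).total_seconds()
```

Feeding it the two lines above with `now` set to 09:06:42 UTC on the same day yields 1800 seconds of silence.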

## Reproduction

Not deterministically reproducible from a fresh runner, but reliably observed in the steady-state of long-lived self-hosted runners. The trigger appears to require:

- Listener lifetime > a few hours (so credential refresh happens)
- A broker disconnect concurrent with or immediately following an AAD credential refresh
- macOS Apple Silicon (we have not observed it on Linux x64)

## Workaround

We run an out-of-band watchdog that checks two things: whether the listener PID holds an `ESTABLISHED` TCP socket, and whether `_diag/Runner_*.log` has been written recently. If both checks fail, it runs `launchctl bootout` followed by `bootstrap` on the LaunchAgent. Reference implementation: https://github.com/EVCA-Org/evca-web-app/pull/16130
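The core of the watchdog can be sketched roughly as follows. This is an illustrative Python outline, not the code from the linked PR. The 30-minute staleness threshold, the `gui/<uid>` launchd domain, and all function names are assumptions on our side; adjust them to how your LaunchAgent is loaded.

```python
import glob
import os
import subprocess
import time


def has_established_tcp(pid):
    """True if lsof lists at least one ESTABLISHED TCP socket for pid."""
    # -a ANDs the selectors; -nP skips DNS and port-name lookups.
    result = subprocess.run(
        ["lsof", "-nP", "-a", "-p", str(pid), "-iTCP", "-sTCP:ESTABLISHED"],
        capture_output=True, text=True, check=False,
    )
    return "ESTABLISHED" in result.stdout


def newest_log_age(diag_dir, now=None):
    """Seconds since the newest Runner_*.log in diag_dir changed, or None."""
    paths = glob.glob(os.path.join(diag_dir, "Runner_*.log"))
    if not paths:
        return None
    now = time.time() if now is None else now
    return now - max(os.path.getmtime(p) for p in paths)


def is_wedged(socket_ok, log_age, max_log_age=1800):
    """Wedged only when BOTH probes fail: no socket and a stale or missing log."""
    log_stale = log_age is None or log_age > max_log_age
    return (not socket_ok) and log_stale


def restart_agent(label, plist_path):
    """Hard-restart the LaunchAgent (assumes it lives in the gui/<uid> domain)."""
    domain = f"gui/{os.getuid()}"
    # bootout may fail if the service is already gone; that is fine.
    subprocess.run(["launchctl", "bootout", f"{domain}/{label}"], check=False)
    subprocess.run(["launchctl", "bootstrap", domain, plist_path], check=True)
```

Requiring both probes to fail keeps the watchdog from restarting a listener that is merely between long-poll connections, since that listener still writes to its diag log.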

## Asks

1. Detect the silent-exit path and either exit non-zero (so a supervisor restarts) or keep retrying loudly.
2. After regaining the broker session, re-sync agent state so any in-flight job that was acknowledging at the moment of disconnect is correctly transitioned out of "busy" on the broker side.
3. If there is already an internal channel for the "Correlation ID: Unknown" AAD response, surface it in the diag log instead of swallowing it.

Full diag logs for all four affected listeners are available; happy to attach them if helpful for diagnosis.

