Describe the bug
On self-hosted macOS runners (v2.334.0), after a long-running session encounters a broker disconnect followed by a credential refresh, the Runner.Listener process can silently exit its broker-reconnect loop. The OS process stays alive (parked on a pthread_cond_wait) but holds no ESTABLISHED TCP socket to the broker, writes no further diag log entries, and accepts no new jobs.
Critically: the broker side of the agent state still shows the runner as busy from its last in-flight job, so the queue stalls behind the phantom. Only launchctl bootout + bootstrap of the LaunchAgent (or equivalent hard process restart) clears the state. The runner does not self-recover.
We observed this simultaneously on 4 of 4 macOS-arm64 runners on the same host within a 32-minute window, freezing our queue for ~4 hours.
Runner
- Version: v2.334.0
- OS: macOS 26.3.1 (Apple Silicon, M3 Ultra)
- Service supervision: launchd LaunchAgent (actions.runner.<owner-repo>.<name>)
- Repo-scoped (not org-scoped) registration
- 4 ephemeral=false runners on one host, ~10 other runners for a different repo on the same host
Expected behavior
On any unrecoverable broker session error, the listener should either:
- Surface a fatal error and exit (so the supervisor restarts it), or
- Keep retrying the reconnect indefinitely with visible diag log entries.
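To make the two acceptable outcomes concrete, here is a minimal Python sketch of the contract we are asking for. It is not the runner's code (the listener is a .NET process); `reconnect`, `connect`, and the backoff parameters are hypothetical names chosen for illustration.

```python
import logging
import time

log = logging.getLogger("listener-sketch")


def reconnect(connect, max_attempts=None, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, logging every failed attempt.

    max_attempts=None retries forever with visible log entries.
    A finite max_attempts raises SystemExit(1) once exhausted, so a
    supervisor such as launchd can restart the process.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except OSError as exc:  # ConnectionError and socket errors derive from OSError
            log.warning("broker reconnect attempt %d failed: %s", attempt, exc)
            if max_attempts is not None and attempt >= max_attempts:
                log.error("giving up after %d attempts; exiting non-zero", attempt)
                raise SystemExit(1)
            # Exponential backoff, capped at max_delay seconds.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

Either branch leaves a trace: every failure is logged, and the only way out of the loop is a successful connection or a non-zero exit.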
Actual behavior
The listener handled dozens of routine SocketException (89): Operation canceled events successfully across an ~11-hour run (these are the normal long-poll cancel/retry pattern). Then at a credential-refresh boundary it logged this sequence and went completely silent:
[2026-05-22 08:36:41Z INFO RSAFileKeyManager] Loading RSA key parameters from file <path>/.credentials_rsaparams
[2026-05-22 08:36:42Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
No further log entries appeared for >30 min, and a TCP socket scan against the listener PID returned empty, so no connection to the broker remained. The process was alive but parked: sample <pid> 3 showed the main thread in ObjectNative::WaitTimeout -> SyncBlock::Wait -> _pthread_cond_wait.
All four listeners on the host hit the same final-log signature within a 32-minute window (08:11-08:43 UTC). This suggests an upstream broker-session event (TLS keepalive teardown, broker rolling restart, or scheduled session expiry policy) triggered the wedge, and the runner client then failed to re-establish its broker session.
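Checking for this signature across several listeners can be done mechanically. The following is a rough Python sketch, assuming every diag entry starts with the same [timestamp level component] prefix as the two lines quoted above; `last_entry` and `silence_seconds` are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

# Matches the diag prefix, e.g. "[2026-05-22 08:36:42Z INFO GitHubActionsService]".
_PREFIX = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z (\w+) (\w+)\]")


def last_entry(lines):
    """Return (UTC timestamp, component) of the last prefixed line, or None."""
    for line in reversed(list(lines)):
        m = _PREFIX.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            return ts.replace(tzinfo=timezone.utc), m.group(3)
    return None


def silence_seconds(lines, now):
    """Seconds between the last diag entry and `now`, or None if no entry parsed."""
    entry = last_entry(lines)
    if entry is None:
        return None
    return (now - entry[0]).total_seconds()
```

A listener whose last component is GitHubActionsService and whose silence exceeds the threshold you choose matches the signature described here.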
Reproduction
Not deterministically reproducible from a fresh runner, but reliably observed in the steady-state of long-lived self-hosted runners. The trigger appears to require:
- Listener lifetime > a few hours (so credential refresh happens)
- A broker disconnect concurrent with or immediately following an AAD credential refresh
- macOS Apple Silicon (we have not observed it on Linux x64)
Workaround
We run an out-of-band watchdog that checks for an ESTABLISHED TCP socket on the listener PID and for recent writes to _diag/Runner_*.log, and runs launchctl bootout/bootstrap if both checks fail. Reference implementation: https://github.com/EVCA-Org/evca-web-app/pull/16130
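For readers who cannot open the linked PR, the watchdog logic looks roughly like the Python sketch below. This is an illustration, not the referenced implementation: the function names, the 30-minute staleness threshold, and the `label` / `plist_path` arguments are assumptions, and the launchd domain is assumed to be the per-user gui/<uid> domain that a LaunchAgent normally runs in.

```python
import glob
import os
import subprocess
import time


def has_established_tcp(lsof_output: str) -> bool:
    """True if lsof output lists at least one ESTABLISHED TCP socket."""
    return any("(ESTABLISHED)" in line for line in lsof_output.splitlines())


def newest_log_age(diag_dir: str, now: float) -> float:
    """Seconds since the newest Runner_*.log was modified; inf if none exist."""
    logs = glob.glob(os.path.join(diag_dir, "Runner_*.log"))
    if not logs:
        return float("inf")
    return now - max(os.path.getmtime(p) for p in logs)


def is_wedged(established: bool, log_age_s: float, max_log_age_s: float) -> bool:
    """Wedged only when BOTH probes fail: no socket and a stale diag log."""
    return (not established) and log_age_s > max_log_age_s


def check_and_restart(pid, diag_dir, label, plist_path, max_log_age_s=1800.0):
    """Probe the listener and restart its LaunchAgent if it looks wedged."""
    # lsof exits non-zero when nothing matches, so check=True is not used here.
    out = subprocess.run(
        ["lsof", "-a", "-p", str(pid), "-iTCP", "-sTCP:ESTABLISHED", "-n", "-P"],
        capture_output=True, text=True,
    ).stdout
    age = newest_log_age(diag_dir, time.time())
    if not is_wedged(has_established_tcp(out), age, max_log_age_s):
        return False
    domain = f"gui/{os.getuid()}"  # assumed per-user LaunchAgent domain
    subprocess.run(["launchctl", "bootout", f"{domain}/{label}"], check=False)
    subprocess.run(["launchctl", "bootstrap", domain, plist_path], check=True)
    return True
```

Requiring both probes to fail avoids restarting a listener that is idle but healthy: an idle listener may write few log lines yet still hold its long-poll socket.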
Asks
- Detect the silent-exit path and either exit non-zero (so a supervisor restarts) or keep retrying loudly.
- After regaining the broker session, re-sync agent state so that any in-flight job that was being acknowledged at the moment of disconnect is correctly transitioned out of "busy" on the broker side.
- If there is already an internal channel for the "Correlation ID: Unknown" AAD response, surface it in the diag log instead of swallowing it.
Full diag logs for all four affected listeners are available; happy to attach them if helpful for diagnosis.