Free orphaned proxy port on stop and rm#5394
Conversation
When a workload's status file is missing, thv stop and thv rm left the proxy process running and holding the workload's port. The proxy-stop path terminates the proxy by the PID recorded in the status file, so with the file gone nothing was killed. On stop the surviving supervisor then restarted the container, so the workload would not stay stopped; on rm the orphaned proxy kept the port, so it could not be reused without killing the process by hand. Fixes stacklok#5393 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Greg Katz <gkatz@indeed.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5394 +/- ##
==========================================
+ Coverage 68.83% 68.84% +0.01%
==========================================
Files 628 628
Lines 63900 63911 +11
==========================================
+ Hits 43985 44001 +16
- Misses 16658 16665 +7
+ Partials 3257 3245 -12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
jhrozek
left a comment
There was a problem hiding this comment.
A couple of non-blocking questions on the orphan-port cleanup. The fix itself looks correct for the container stop/rm paths. Nothing here blocks merge.
| if baseName == "" { | ||
| return false | ||
| } | ||
| return d.stopProcess(ctx, baseName) |
There was a problem hiding this comment.
Now that stopProxyIfNeeded reports whether a proxy was actually stopped, the container stop/delete paths use it to trigger the port-based fallback. stopRemoteWorkload (around line 472) still calls this and discards the result, so a remote workload with a missing status file wouldn't get the orphan-port cleanup. Remote workloads run a local proxy that holds a port too, so the same gap seems to apply there. Is the remote path intentionally out of scope for this PR, or worth adding the same if !... { portFreerOrDefault()(ctx, runConfig) } fallback?
| return d.stopProcess(ctx, baseName) | ||
| } | ||
|
|
||
| // freePortHolderIfNeeded kills the process holding the proxy port if it is in use. |
There was a problem hiding this comment.
This doc comment only describes the restart re-bind case, but after this change freePortHolderIfNeeded is also the fallback used during stop/delete, where there is no child re-binding. Could you broaden it to cover both uses so it doesn't read as restart-only?
Summary
When a workload's status file is missing,
thv stopandthv rmreport success but leave the workload's proxy process running and holding its port. The proxy-stop path kills the proxy by the PID recorded in the status file, so with the file gone nothing is killed:thv stop, the surviving supervisor restarts the container, so the workload returns torunningon its own.thv rm, the container is removed but the orphaned proxy keeps holding the port, so it cannot be reused without killing the process by hand.This makes stop and rm fall back to the existing port-based cleanup when the PID-based stop finds no proxy to kill, so the proxy is terminated and the port freed even when the status file is missing. The fallback reuses
freePortHolderIfNeeded(already used on the restart path), which only kills a process verified to be this workload's own proxy.stopProcess/stopProxyIfNeededreport whether a tracked proxy was actually killed.runConfiginto the container stop/delete paths so the fallback knows the proxy port.Fixes #5393
Type of change
Test plan
task test)task test-e2e)task lint-fix)Manual testing on macOS + OrbStack with a real container workload (
fetch):thv stopleft the supervisor process alive (verified by PID) and it recreated the container — a new container ID and newStartedAt— returning the workload torunning;thv rmleft the orphaned proxy holding the port.thv stopandthv rmterminate the proxy (PID gone) and free the port.The added unit tests fail without the fix and pass with it.
Does this introduce a user-facing change?
Yes.
thv stopandthv rmnow reliably stop the workload's proxy and free its port even when the workload's status file is missing, instead of leaving an orphaned proxy that holds the port (and, forstop, restarts the container).Special notes for reviewers
freePortHolderIfNeeded→process.IsToolHiveProxyForWorkloadconfirms the process on the port is this workload'sthv start <name>proxy before killing it, so it cannot touch an unrelated process or another workload's proxy.runner.LoadStateitself fails (the run config is gone, not just the status file), the proxy port cannot be recovered. In the reported scenario only the status file is missing, soLoadStatesucceeds and the port is recoverable.Generated with Claude Code