hcs: remove containers and reclaim snapshots so cancelled builds don't wedge the worker #209
Open
mtelvers wants to merge 1 commit into
…'t wedge

The HCS sandbox runs builds with `ctr run --rm` and relies on the run client to delete the container on exit. When a build is cancelled or times out the task is SIGKILLed and the client never completes the `--rm`, so the container is orphaned and keeps its rootfs snapshot pinned. obuilder's snapshot delete then fails, the layer leaks, and over time the leaked layers fill the store below the prune threshold, leaving the worker spinning in a perpetual (useless) prune loop.

Manage the container lifecycle explicitly instead:

* hcs_sandbox.run: in the finalize (every exit path) issue `ctr task delete --force` + `ctr container rm` so a cancelled or timed-out build's container is actually removed; the build error handler then frees its snapshot.
* hcs_store.create: recover crash/reboot leftovers at start-up, in order: remove leftover obuilder-run-* containers (unpinning their snapshots), then for each in-progress result-tmp build remove the specific snapshot named in its layerinfo and drop the entry.
Problem
The Windows (HCS) workers in our `windows-x86_64` pool repeatedly fell off the cluster: the `ocluster-worker` service stayed `RUNNING` but the scheduler showed it disconnected and no builds progressed. The worker log showed it spinning in a prune loop, removing nothing.

containerd had hundreds of `obuilder-*` snapshots while OBuilder's database had zero entries. The two had drifted completely apart, and the only snapshots prune tried to remove were the base layers, which can't be removed while children exist.

Root cause
`Hcs_sandbox.run` runs each build step with `ctr run --rm` and, on cancellation, sends `ctr task kill -s SIGKILL`. But `--rm` only deletes the container when the run client observes a clean task exit. When the task is killed, which is exactly what every cancellation or timeout does, the container is not removed. It lingers and keeps its rootfs snapshot pinned.

From there it cascades:

* … `obuilder-run-*` container pinning its snapshot.
* … `val delete : t -> id -> unit Lwt.t` can't report failure), so the DB row is dropped while the snapshot survives.

Cancellations are the common case (hundreds per worker over a week), so this builds up quickly. We confirmed `--rm`'s behaviour directly: start a container with `ctr run --rm`, `ctr task kill -s SIGKILL` it, and the container remains in `ctr container ls`. Only an explicit `ctr task delete` + `ctr container rm` removes it.

Fix
Stop relying on `--rm` and manage the container lifecycle explicitly:

* `Hcs_sandbox.run`: in the `finalize` (so it runs on success, error and cancellation) issue `ctr task delete --force` + `ctr container rm`. The container is then actually gone, and OBuilder's existing build-error handler removes its snapshot.
* `Hcs_store.create`: recover crash/reboot leftovers at start-up (when `finalize` never ran): remove any leftover `obuilder-run-*` containers (which unpins their snapshots), then, for each interrupted build still in `result-tmp`, remove the specific snapshot named in its layer info. Containers first, snapshots second, in one pass.

No change to the `delete` signature or to `complete_deletes`: once the leak is closed the database stays in sync and normal leaf-first LRU pruning reclaims layers as before.

Validation
Deployed to three production HCS workers and run for ~8 days under real OCaml-CI load:

* … `Pruned 0 items` spin.
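
To make the root cause and the `finalize` change under Fix concrete, here is a toy model in plain OCaml (stdlib only, no Lwt, no containerd). It is not OBuilder code: the `state` record, `run_build`, `simulate`, the `explicit_cleanup` flag and the snapshot name `snapshot-0` are all hypothetical names chosen for illustration, and `Fun.protect ~finally` merely stands in for the sandbox's finalize step. The only behaviour it assumes is what this description states: a container pins its snapshot, a pinned snapshot cannot be removed, and a `delete` that returns `unit` drops the database row regardless.

```ocaml
module S = Set.Make (String)

(* Toy state: running containers (each pinning one snapshot), the snapshots
   held by the snapshotter, and the rows in the builder's database. *)
type state = {
  mutable containers : (string * string) list; (* container id, pinned snapshot *)
  mutable snapshots : S.t;
  mutable db : S.t;
}

exception Cancelled

let remove_container st id =
  st.containers <- List.filter (fun (c, _) -> c <> id) st.containers

(* Like a delete returning unit: it cannot report failure, so the database
   row is dropped even when the snapshot is still pinned and survives. *)
let delete_snapshot st snap =
  let pinned = List.exists (fun (_, s) -> s = snap) st.containers in
  if not pinned then st.snapshots <- S.remove snap st.snapshots;
  st.db <- S.remove snap st.db

let run_build st ~explicit_cleanup ~container ~snapshot body =
  st.snapshots <- S.add snapshot st.snapshots;
  st.db <- S.add snapshot st.db;
  st.containers <- (container, snapshot) :: st.containers;
  (* With explicit cleanup the container is removed on every exit path. *)
  let finally () = if explicit_cleanup then remove_container st container in
  match Fun.protect ~finally body with
  | () -> remove_container st container (* clean exit: --rm also removes it *)
  | exception Cancelled -> delete_snapshot st snapshot (* build-error handler *)

(* Run one cancelled build and return the snapshots that still exist but
   that the database no longer tracks. *)
let simulate ~explicit_cleanup =
  let st = { containers = []; snapshots = S.empty; db = S.empty } in
  run_build st ~explicit_cleanup ~container:"obuilder-run-0"
    ~snapshot:"snapshot-0" (fun () -> raise Cancelled);
  S.elements (S.diff st.snapshots st.db)

let () =
  Printf.printf "relying on --rm: %d leaked, explicit cleanup: %d leaked\n"
    (List.length (simulate ~explicit_cleanup:false))
    (List.length (simulate ~explicit_cleanup:true))
```

In this model a cancelled build leaks one snapshot when cleanup is left to `--rm` and none when the container is removed in the finalizer, which is the drift between containerd and the database described under Problem.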