hcs: remove containers and reclaim snapshots so cancelled builds don't wedge the worker #209

Open
mtelvers wants to merge 1 commit into ocurrent:master from mtelvers:hcs-container-cleanup

@mtelvers

Member

Problem

The Windows (HCS) workers in our windows-x86_64 pool repeatedly fell off the cluster: the ocluster-worker service stayed RUNNING but the scheduler showed it disconnected and no builds progressed. The worker log showed it spinning in a prune loop, removing nothing:

OBuilder partition: 28% free, 0 items
Pruning 100 items
Exec "ctr" "snapshot" "rm" "obuilder-base-...-committed"
Pruned 0 items

containerd had hundreds of obuilder-* snapshots while OBuilder's database had zero entries — the two had drifted completely apart, and the only snapshots prune tried to remove were the base layers, which can't be removed while children exist.

Root cause

Hcs_sandbox.run runs each build step with ctr run --rm and, on cancellation, sends ctr task kill -s SIGKILL. But --rm only deletes the container when the run client observes a clean task exit. When the task is killed — which is exactly what every cancellation or timeout does — the container is not removed. It lingers and keeps its rootfs snapshot pinned.

From there it cascades:

  1. A cancelled/timed-out build leaves an orphaned obuilder-run-* container pinning its snapshot.
  2. OBuilder tries to delete that snapshot and fails because the container still holds it. The delete is nevertheless recorded as a success, since val delete : t -> id -> unit Lwt.t has no way to report failure, so the DB row is dropped while the snapshot survives.
  3. The orphan then pins its parent, whose delete also fails, and the DB empties one cascade at a time.
  4. With an empty DB the LRU pruner has nothing to evict, so completed layers accumulate until free space drops below the threshold and the worker wedges in a no-progress prune loop.
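
Step 2 of the cascade can be illustrated with a toy model. This is a minimal sketch and not OBuilder code: the three hash tables, `delete` and `prune_one` are hypothetical stand-ins, and Lwt is left out so the example runs with the standard library alone. It only shows how a `unit`-returning delete lets the database and the snapshot store drift apart.

```ocaml
(* Toy model, not OBuilder code: a store whose delete returns unit,
   so a failed snapshot removal cannot be reported to the caller. *)

(* Snapshots that exist in the simulated containerd. *)
let snapshots : (string, unit) Hashtbl.t = Hashtbl.create 8

(* Snapshots pinned by a leftover container. *)
let pinned : (string, unit) Hashtbl.t = Hashtbl.create 8

(* Rows in the simulated OBuilder database. *)
let db : (string, unit) Hashtbl.t = Hashtbl.create 8

(* Same shape as [val delete : t -> id -> unit Lwt.t], minus Lwt:
   the result type has no room for an error. *)
let delete (id : string) : unit =
  if not (Hashtbl.mem pinned id) then Hashtbl.remove snapshots id
  (* otherwise the removal failed, but all we can return is () *)

(* Hypothetical caller: trusts [delete] and drops the DB row regardless. *)
let prune_one id =
  delete id;
  Hashtbl.remove db id

let () =
  List.iter
    (fun id ->
      Hashtbl.replace snapshots id ();
      Hashtbl.replace db id ())
    [ "layer-a"; "layer-b" ];
  (* An orphaned container pins layer-b. *)
  Hashtbl.replace pinned "layer-b" ();
  prune_one "layer-a";
  prune_one "layer-b";
  Printf.printf "db rows: %d, snapshots: %d\n"
    (Hashtbl.length db) (Hashtbl.length snapshots)
```

In this model the database ends up empty while the pinned snapshot is still present, which is the same drift the worker log showed on a larger scale.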

Cancellations are the common case (hundreds per worker over a week), so this builds up quickly. We confirmed --rm's behaviour directly: start a container with ctr run --rm, ctr task kill -s SIGKILL it, and the container remains in ctr container ls — only an explicit ctr task delete + ctr container rm removes it.

Fix

Stop relying on --rm and manage the container lifecycle explicitly:

  • Hcs_sandbox.run: in the finalize (so it runs on success, error and cancellation), issue ctr task delete --force followed by ctr container rm. The container is then actually gone, and OBuilder's existing build-error handler removes its snapshot.
  • Hcs_store.create — recover crash/reboot leftovers at start-up (when finalize never ran): remove any leftover obuilder-run-* containers (which unpins their snapshots), then, for each interrupted build still in result-tmp, remove the specific snapshot named in its layer info. Containers first, snapshots second, in one pass.
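
The "clean up on every exit path" idea from the Hcs_sandbox.run item can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the real implementation is presumably Lwt-based, whereas this sketch uses the standard library's `Fun.protect` as an analogue so that it runs without dependencies. `exec`, `cleanup`, `with_container`, the `Cancelled` exception and the container name are all hypothetical, and commands are recorded rather than executed.

```ocaml
(* Sketch only: Fun.protect stands in for an Lwt-style finalizer. *)

exception Cancelled

(* Commands that would be executed, recorded here instead of run. *)
let issued : string list list ref = ref []
let exec argv = issued := argv :: !issued

(* Explicit cleanup for one container id, as described in the PR text. *)
let cleanup id =
  exec [ "ctr"; "task"; "delete"; "--force"; id ];
  exec [ "ctr"; "container"; "rm"; id ]

(* [with_container id step] runs [step] and always cleans up afterwards,
   whether [step] returns normally or raises. *)
let with_container id step =
  Fun.protect ~finally:(fun () -> cleanup id) step

let () =
  (* Hypothetical container name; cancellation is modelled as an exception. *)
  (try with_container "obuilder-run-0" (fun () -> raise Cancelled)
   with Cancelled -> ());
  List.iter
    (fun argv -> print_endline (String.concat " " argv))
    (List.rev !issued)
```

Because the cleanup lives in the finalizer rather than in the success path, the cancelled step above still issues both removal commands before the exception propagates.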

No change to the delete signature or to complete_deletes — once the leak is closed the database stays in sync and normal leaf-first LRU pruning reclaims layers as before.
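
The start-up recovery order described in the Hcs_store.create item (containers first, snapshots second) can be sketched as a pure planning function. Everything here is hypothetical: the `action` type, `plan`, and the sample ids are illustrative names, and the sketch assumes the list of container ids and the snapshot names from the interrupted builds' layer info have already been read.

```ocaml
(* Sketch of the start-up recovery order; names and data are hypothetical. *)

type action =
  | Remove_container of string
  | Remove_snapshot of string

let prefix = "obuilder-run-"

(* True if [id] looks like a per-step run container. *)
let is_run_container id =
  let n = String.length prefix in
  String.length id >= n && String.sub id 0 n = prefix

(* [plan ~containers ~interrupted] lists container removals first and
   snapshot removals second, so no snapshot is still pinned when it is
   deleted. [interrupted] holds the snapshot names taken from the layer
   info of builds left in result-tmp. *)
let plan ~containers ~interrupted =
  let leftovers = List.filter is_run_container containers in
  List.map (fun c -> Remove_container c) leftovers
  @ List.map (fun s -> Remove_snapshot s) interrupted

let () =
  let actions =
    plan
      ~containers:[ "obuilder-run-1"; "unrelated"; "obuilder-run-2" ]
      ~interrupted:[ "snap-x" ]
  in
  List.iter
    (function
      | Remove_container c -> Printf.printf "container rm %s\n" c
      | Remove_snapshot s -> Printf.printf "snapshot rm %s\n" s)
    actions
```

Keeping the ordering in one function makes the invariant easy to check: every container removal precedes every snapshot removal, and containers without the obuilder-run- prefix are left alone.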

Validation

Deployed to three production HCS workers and run for ~8 days under real OCaml-CI load:

  • 0 orphaned containers across 56+ cancellations per worker (previously dozens accumulated).
  • Workers organically filled to the prune threshold and pruned successfully 18–20 times each (real LRU eviction), instead of the old Pruned 0 items spin.
  • No disconnects, no wedges.

…'t wedge

The HCS sandbox runs builds with `ctr run --rm` and relies on the run
client to delete the container on exit. When a build is cancelled or
times out the task is SIGKILLed and the client never completes the
`--rm`, so the container is orphaned and keeps its rootfs snapshot
pinned. obuilder's snapshot delete then fails, the layer leaks, and over
time the leaked layers fill the store until free space drops below the
prune threshold, leaving the worker spinning in a perpetual (useless)
prune loop.

Manage the container lifecycle explicitly instead:

* hcs_sandbox.run: in the finalize (every exit path) issue
  `ctr task delete --force` + `ctr container rm` so a cancelled or
  timed-out build's container is actually removed; the build error
  handler then frees its snapshot.

* hcs_store.create: recover crash/reboot leftovers at start-up, in order:
  remove leftover obuilder-run-* containers (unpinning their snapshots),
  then for each in-progress result-tmp build remove the specific snapshot
  named in its layerinfo and drop the entry.