feat: cross-node migration (nodeName affinity + migration state machine) by tonicmuroq · Pull Request #11 · cocoonstack/cocoon-operator

tonicmuroq · 2026-06-26T02:32:01Z

Operator side of cross-node migrate(vmname, node): the control plane patches CocoonSet.spec.nodeName, the operator does the rest.

What

Placement (buildAgentPod): the main agent (slot 0) gets a required hostname nodeAffinity from spec.nodeName instead of a hard NodeName bind — it lands on the target only if it fits and the node is schedulable, else stays Pending (respects capacity/cordon, no OOM). Sub-agents keep their hard-bind to the main's node.
Migration state machine (reconcileMigration): a pure observation function over durable state (spec.nodeName, the pod, the epoch :hibernate snapshot) — set internal hibernate annotation → wait for snapshot → delete old pod → recreate on target with restore-from-hibernate → wait for the restored VMID → drop the snapshot. Idempotent and crash-recoverable; runs before applyUnsuspend so its hibernate annotation isn't cleared mid-flight. Ordering gates: old pod deleted only after the snapshot lands; snapshot dropped only after the new VM has a fresh VMID. Surfaces CocoonSetPhaseMigrating. Scoped to the main agent (one VM per CocoonSet).
Imports the regenerated CocoonSet CRD.

Dependency

Depends on cocoonstack/cocoon-common#3 (spec.nodeName + Migrating phase). go.mod pins the branch commit via pseudo-version; bump to the cocoon-common release tag after #3 merges.

Tests

migrate_test.go (7 transitions incl. both ordering gates), pods_test.go (3 affinity cases); full suite + make lint clean on linux + darwin.

Not in scope

Control-plane migrate API + IP backfill + involuntary-eviction reconcile (simular-pro-vm-service); end-to-end + crash-injection tests (need a cluster).

CMGS · 2026-06-29T08:51:41Z

Rebased onto main (code-style cleanup + dep bumps to common v0.2.2 / epoch v0.2.4 — the two embedded deps commits collapsed into main's single bump) and pushed a fix commit (8ba5d6a) from a strict review:

CRD was missing the Migrating phase enum. The operator's config/crd/bases is the deployed artifact (via config/default kustomize; common's CRD only reaches the cluster through make import-crds), and it was stale — the PR hand-added nodeName but the phase enum was [Pending…Suspended, Failed] with no Migrating. So apiserver rejects every phase=Migrating status PATCH → reconcileMigration error-loops and migration never progresses (the unit tests use a fake client that skips enum validation, so they stayed green and hid it). Ran make import-crds to regenerate from v0.2.2 — re-adds Migrating (+ the macos OS enum that had also drifted), keeps nodeName.
case-1 delete/recreate loop. The snapshot branch's main.Spec.NodeName != desired deletes the just-recreated restore pod while it's still unscheduled (NodeName == "" — the main==nil branch builds it with affinity, not a hard bind), looping forever. Gated on NodeName != "" (mirrors the !snap branch's existing :48 fallback) + added TestMigrationDoesNotDeleteRecreatedRestorePod (the existing cases all preset a NodeName, so the bug slipped through).
nits: migrating→markMigrating, error-wrap, requeueMigratePoll const.

One thing left for you — needs vk-cocoon context I can't verify locally:

reconcileMigration trusts a pre-existing :hibernate tag (HasManifest) to go straight to the destructive delete+restore. But the operator's own applyUnsuspend only does PatchHibernateState(false) — it does not drop the :hibernate tag. So the sequence suspend (creates tag) → unsuspend (tag survives) → set spec.nodeName → migration sees snap=true → skips taking a fresh snapshot → deletes the live pod → restores the stale snapshot = data loss since the last suspend. It's only safe if vk-cocoon drops the :hibernate tag when it wakes a VM via annotation removal. Could you confirm that — or have migration unconditionally reset the hibernate annotation and wait for a fresh snapshot before the destructive step?

Otherwise #11's migration flow is now correct (CRD + loop fixed, rebase verified intact).

…state machine) Rewritten on current main atop the merged restore-from-hibernate producer (#14): the control plane patches CocoonSet.spec.nodeName and the operator hibernates the main agent, waits for the :hibernate snapshot in the OCI registry, deletes the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate, and drops the snapshot once the restored VM runs with a fresh VMID. Decisions are pure functions of durable state (spec.nodeName, status.phase, the pod, the snapshot), so every step is idempotent and crash-recoverable. Hardening over the original branch: - a registry probe error now owns the reconcile (handled=true) — falling through would let applyUnsuspend unwind the migration or fresh-boot over the only copy of the state - a :hibernate tag on a pod this controller never quiesced is treated as a leftover (suspend/unsuspend never deletes the tag) and dropped instead of restored — a raw presence check would delete a live pod and roll back state - re-targeting nodeName back to the current node mid-migration wakes the pod in place instead of deadlocking (unless a CocoonHibernation CR owns it) - clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot behind a fresh boot - steady-state pinned sets skip the registry probe (Migrating is persisted before the first side effect, so in-flight migrations are never mistaken) Scoped to the main agent (slot 0); sub-agents follow via their hard bind.

CMGS · 2026-07-02T05:42:03Z

Rewrote the branch on current main (5485764), single commit, design and flow preserved:

epoch → OCI registry: probes via the shared hasHibernateSnapshot (same helper as feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14's producer); drops via Registry.DeleteManifest. No more r.Epoch (removed in the OCI cutover).
Reuses feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14's infra: meta.MarkRestoreFromHibernate typed setter on the recreate; suspend/CR interplay leans on the merged producer semantics.
Hardening found in review (all with regression tests, 18 tests total):
1. registry probe error now owns the reconcile — before, it fell through and applyUnsuspend could unwind the migration or fresh-boot over the only snapshot;
2. a :hibernate tag on a never-quiesced pod is treated as a leftover (suspend→unsuspend never deletes the tag) and dropped, instead of deleting a live pod and restoring stale state;
3. re-targeting nodeName back mid-migration wakes the pod in place instead of deadlocking in Migrating (CR-owned hibernation excluded);
4. clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot;
5. steady-state pinned sets skip the per-reconcile registry probe (Migrating phase is persisted before the first side effect).

Follow-up (not in this PR): a migration timeout + Warning events like the hibernation controller's, and sub-agent migration (currently scoped out, one-VM-per-set model).

CMGS · 2026-07-02T07:10:49Z

Live E2E on the GKE cluster (operator oci-e71b762, CRD updated):

create e2e-mig on node-1 -> write marker in guest -> patch spec.nodeName=node-2
[3s]  phase=Scaling   node=node-1  (generation bump)
[6s]  phase=Migrating node=node-1  hibernate=true          <- startMigration
[12s] phase=Migrating node=node-2  restore-from-hibernate=true  <- teardown + recreate
[15s] phase=Migrating node=node-2  ready=true              <- restored
marker read back on node-2: MATCH  ==> cross-node migration with state preserved, 15s e2e

Two pre-existing gaps surfaced by the final drop-snapshot step (not this PR's logic):

IAM: the operator runs as cocoon-ar-writer@ which lacks artifactregistry.repositories.deleteArtifacts — every DeleteManifest 403s, so the migration correctly refuses to settle (phase stays Migrating, snapshot kept). Same known gap that already breaks hibernation-wake tag cleanup + CocoonSet teardown GC. Needs roles/artifactregistry.repoAdmin on the repo.
Observability: main.go sets crlog.SetLogger(logr.Discard()), so controller-runtime swallows every reconcile error — the 403s were invisible in the logs. Worth a small follow-up PR wiring controller-runtime's logger to core/log.

Once the IAM grant lands the full loop (drop + settle to Running) can be re-validated; everything up to that point is verified.

CMGS force-pushed the feat/cross-node-migration branch from 2edf006 to 3b78805 Compare June 29, 2026 08:25

CMGS force-pushed the feat/cross-node-migration branch from 8ba5d6a to e71b762 Compare July 2, 2026 05:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: cross-node migration (nodeName affinity + migration state machine)#11

feat: cross-node migration (nodeName affinity + migration state machine)#11
tonicmuroq wants to merge 1 commit into
mainfrom
feat/cross-node-migration

tonicmuroq commented Jun 26, 2026

Uh oh!

CMGS commented Jun 29, 2026

Uh oh!

CMGS commented Jul 2, 2026

Uh oh!

CMGS commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

tonicmuroq commented Jun 26, 2026

What

Dependency

Tests

Not in scope

Uh oh!

CMGS commented Jun 29, 2026

Uh oh!

CMGS commented Jul 2, 2026

Uh oh!

CMGS commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants