Skip to content

feat: cross-node migration (nodeName affinity + migration state machine)#11

Open
tonicmuroq wants to merge 1 commit into
mainfrom
feat/cross-node-migration
Open

feat: cross-node migration (nodeName affinity + migration state machine)#11
tonicmuroq wants to merge 1 commit into
mainfrom
feat/cross-node-migration

Conversation

@tonicmuroq

Copy link
Copy Markdown
Contributor

Operator side of cross-node migrate(vmname, node): the control plane patches CocoonSet.spec.nodeName, the operator does the rest.

What

  • Placement (buildAgentPod): the main agent (slot 0) gets a required hostname nodeAffinity from spec.nodeName instead of a hard NodeName bind — it lands on the target only if it fits and the node is schedulable, else stays Pending (respects capacity/cordon, no OOM). Sub-agents keep their hard-bind to the main's node.
  • Migration state machine (reconcileMigration): a pure observation function over durable state (spec.nodeName, the pod, the epoch :hibernate snapshot) — set internal hibernate annotation → wait for snapshot → delete old pod → recreate on target with restore-from-hibernate → wait for the restored VMID → drop the snapshot. Idempotent and crash-recoverable; runs before applyUnsuspend so its hibernate annotation isn't cleared mid-flight. Ordering gates: old pod deleted only after the snapshot lands; snapshot dropped only after the new VM has a fresh VMID. Surfaces CocoonSetPhaseMigrating. Scoped to the main agent (one VM per CocoonSet).
  • Imports the regenerated CocoonSet CRD.

Dependency

Depends on cocoonstack/cocoon-common#3 (spec.nodeName + Migrating phase). go.mod pins the branch commit via pseudo-version; bump to the cocoon-common release tag after #3 merges.

Tests

migrate_test.go (7 transitions incl. both ordering gates), pods_test.go (3 affinity cases); full suite + make lint clean on linux + darwin.

Not in scope

Control-plane migrate API + IP backfill + involuntary-eviction reconcile (simular-pro-vm-service); end-to-end + crash-injection tests (need a cluster).

@CMGS CMGS force-pushed the feat/cross-node-migration branch from 2edf006 to 3b78805 Compare June 29, 2026 08:25
@CMGS

CMGS commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Rebased onto main (code-style cleanup + dep bumps to common v0.2.2 / epoch v0.2.4 — the two embedded deps commits collapsed into main's single bump) and pushed a fix commit (8ba5d6a) from a strict review:

  • CRD was missing the Migrating phase enum. The operator's config/crd/bases is the deployed artifact (via config/default kustomize; common's CRD only reaches the cluster through make import-crds), and it was stale — the PR hand-added nodeName but the phase enum was [Pending…Suspended, Failed] with no Migrating. So apiserver rejects every phase=Migrating status PATCH → reconcileMigration error-loops and migration never progresses (the unit tests use a fake client that skips enum validation, so they stayed green and hid it). Ran make import-crds to regenerate from v0.2.2 — re-adds Migrating (+ the macos OS enum that had also drifted), keeps nodeName.
  • case-1 delete/recreate loop. The snapshot branch's main.Spec.NodeName != desired deletes the just-recreated restore pod while it's still unscheduled (NodeName == "" — the main==nil branch builds it with affinity, not a hard bind), looping forever. Gated on NodeName != "" (mirrors the !snap branch's existing :48 fallback) + added TestMigrationDoesNotDeleteRecreatedRestorePod (the existing cases all preset a NodeName, so the bug slipped through).
  • nits: migratingmarkMigrating, error-wrap, requeueMigratePoll const.

One thing left for you — needs vk-cocoon context I can't verify locally:

reconcileMigration trusts a pre-existing :hibernate tag (HasManifest) to go straight to the destructive delete+restore. But the operator's own applyUnsuspend only does PatchHibernateState(false) — it does not drop the :hibernate tag. So the sequence suspend (creates tag) → unsuspend (tag survives) → set spec.nodeName → migration sees snap=trueskips taking a fresh snapshot → deletes the live pod → restores the stale snapshot = data loss since the last suspend. It's only safe if vk-cocoon drops the :hibernate tag when it wakes a VM via annotation removal. Could you confirm that — or have migration unconditionally reset the hibernate annotation and wait for a fresh snapshot before the destructive step?

Otherwise #11's migration flow is now correct (CRD + loop fixed, rebase verified intact).

…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14):
the control plane patches CocoonSet.spec.nodeName and the operator hibernates
the main agent, waits for the :hibernate snapshot in the OCI registry, deletes
the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate,
and drops the snapshot once the restored VM runs with a fresh VMID. Decisions
are pure functions of durable state (spec.nodeName, status.phase, the pod, the
snapshot), so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error now owns the reconcile (handled=true) — falling
  through would let applyUnsuspend unwind the migration or fresh-boot over the
  only copy of the state
- a :hibernate tag on a pod this controller never quiesced is treated as a
  leftover (suspend/unsuspend never deletes the tag) and dropped instead of
  restored — a raw presence check would delete a live pod and roll back state
- re-targeting nodeName back to the current node mid-migration wakes the pod
  in place instead of deadlocking (unless a CocoonHibernation CR owns it)
- clearing nodeName in the deleted-pod window finishes the restore instead of
  stranding the snapshot behind a fresh boot
- steady-state pinned sets skip the registry probe (Migrating is persisted
  before the first side effect, so in-flight migrations are never mistaken)

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
@CMGS

CMGS commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Rewrote the branch on current main (5485764), single commit, design and flow preserved:

  • epoch → OCI registry: probes via the shared hasHibernateSnapshot (same helper as feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14's producer); drops via Registry.DeleteManifest. No more r.Epoch (removed in the OCI cutover).
  • Reuses feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14's infra: meta.MarkRestoreFromHibernate typed setter on the recreate; suspend/CR interplay leans on the merged producer semantics.
  • Hardening found in review (all with regression tests, 18 tests total):
    1. registry probe error now owns the reconcile — before, it fell through and applyUnsuspend could unwind the migration or fresh-boot over the only snapshot;
    2. a :hibernate tag on a never-quiesced pod is treated as a leftover (suspend→unsuspend never deletes the tag) and dropped, instead of deleting a live pod and restoring stale state;
    3. re-targeting nodeName back mid-migration wakes the pod in place instead of deadlocking in Migrating (CR-owned hibernation excluded);
    4. clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot;
    5. steady-state pinned sets skip the per-reconcile registry probe (Migrating phase is persisted before the first side effect).

Follow-up (not in this PR): a migration timeout + Warning events like the hibernation controller's, and sub-agent migration (currently scoped out, one-VM-per-set model).

@CMGS CMGS force-pushed the feat/cross-node-migration branch from 8ba5d6a to e71b762 Compare July 2, 2026 05:42
@CMGS

CMGS commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Live E2E on the GKE cluster (operator oci-e71b762, CRD updated):

create e2e-mig on node-1 -> write marker in guest -> patch spec.nodeName=node-2
[3s]  phase=Scaling   node=node-1  (generation bump)
[6s]  phase=Migrating node=node-1  hibernate=true          <- startMigration
[12s] phase=Migrating node=node-2  restore-from-hibernate=true  <- teardown + recreate
[15s] phase=Migrating node=node-2  ready=true              <- restored
marker read back on node-2: MATCH  ==> cross-node migration with state preserved, 15s e2e

Two pre-existing gaps surfaced by the final drop-snapshot step (not this PR's logic):

  1. IAM: the operator runs as cocoon-ar-writer@ which lacks artifactregistry.repositories.deleteArtifacts — every DeleteManifest 403s, so the migration correctly refuses to settle (phase stays Migrating, snapshot kept). Same known gap that already breaks hibernation-wake tag cleanup + CocoonSet teardown GC. Needs roles/artifactregistry.repoAdmin on the repo.
  2. Observability: main.go sets crlog.SetLogger(logr.Discard()), so controller-runtime swallows every reconcile error — the 403s were invisible in the logs. Worth a small follow-up PR wiring controller-runtime's logger to core/log.

Once the IAM grant lands the full loop (drop + settle to Running) can be re-validated; everything up to that point is verified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants