From 08e2f184b16f770e70068dbd71ba6ecdc72ddcdc Mon Sep 17 00:00:00 2001
From: bdchatham
Date: Sat, 13 Jun 2026 12:10:40 -0700
Subject: [PATCH] burn-down(docs): relocate 13 design artifacts to bdchatham-designs (Design 05 Phase-C, PLT-497)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Design artifacts move to the DRI's designs repo (Design 05). Removes 13 relocated docs from docs/; repoints controller-app-config's 3 inbound doc-links (README.md, CLAUDE.md, config/manager/manager.yaml comment) to the new location.

Relocated → bdchatham-designs (paired add: PR #21), 12 as-is + 1 translated (design-networking-monitoring).

NOT removed:
- Sonar-gated (Design 05 §5) — real consumers cite these by path/anchor:
    docs/design/seinode-task-lld.md (Go + generated CRD YAML)
    docs/design/test-harness-lld.md (Go + CI workflow)
    docs/seinode-task.md (section headings are cited anchors for the PLT-489 gov-ops skill — renaming breaks it)
- docs/design-e2e-kind-harness.md (Amir-authored → amir-deris-designs, separate)

docs/ is therefore NOT fully evacuated this round (3 gated + 1 cross-DRI remain).

Inbound audit (canonical origin/main 73701f1): controller-app-config had 3 doc-links (repointed here); the other 12 relocated docs had 0 inbound refs.

Prepared in an isolated worktree off origin/main — active branches untouched.
Paired with bdchatham-designs#21 (merge that first). Linear: PLT-497.
Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 2 +- README.md | 2 +- config/manager/manager.yaml | 5 +- .../stale-target-height-bootstrap-failure.md | 88 - docs/controller-app-config.md | 80 - docs/design-networking-monitoring.md | 901 ---------- docs/design-progressive-rollout.md | 791 -------- docs/design-rpc-migration-istio.md | 119 -- docs/design-seinode-import-volume-lld.md | 479 ----- docs/design-seinode-import-volume.md | 144 -- ...esign-seinode-validator-signing-key-lld.md | 427 ----- docs/design-snapshot-publish-config.md | 234 --- docs/design/composable-genesis.md | 442 ----- docs/design/validation-run-lld.md | 1588 ----------------- docs/known-issues-node-alarms.md | 33 - docs/production-deployment-analysis.md | 166 -- 16 files changed, 5 insertions(+), 5496 deletions(-) delete mode 100644 docs/bugs/stale-target-height-bootstrap-failure.md delete mode 100644 docs/controller-app-config.md delete mode 100644 docs/design-networking-monitoring.md delete mode 100644 docs/design-progressive-rollout.md delete mode 100644 docs/design-rpc-migration-istio.md delete mode 100644 docs/design-seinode-import-volume-lld.md delete mode 100644 docs/design-seinode-import-volume.md delete mode 100644 docs/design-seinode-validator-signing-key-lld.md delete mode 100644 docs/design-snapshot-publish-config.md delete mode 100644 docs/design/composable-genesis.md delete mode 100644 docs/design/validation-run-lld.md delete mode 100644 docs/known-issues-node-alarms.md delete mode 100644 docs/production-deployment-analysis.md diff --git a/CLAUDE.md b/CLAUDE.md index c062f639..6727997f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -120,6 +120,6 @@ make docker-push IMG= # Push container image - **Condition ownership** — The planner owns all condition management on the owning resource. It sets conditions when creating plans (e.g., `NodeUpdateInProgress=True`) and when observing terminal plans (e.g., `NodeUpdateInProgress=False`). 
The executor does not set conditions — it only mutates plan/task state and phase transitions. - **Single-patch model** — All status mutations (plan state, conditions, phase, currentImage) accumulate in-memory during a reconcile and are flushed in a single `Status().Patch()` at the end. Tasks mutate owned resources (StatefulSets, Services, PVCs); the executor mutates plan state in-memory; the reconciler flushes once. - **Resource generators** live in `internal/noderesource/` — pure functions that produce StatefulSets, Services, and PVCs from a SeiNode spec. Used by both the controller and plan tasks. -- **Platform config** is resolved by `platform.Load` (`internal/platform/load.go`). Infra fields (scheduling, storage, resources, snapshot/genesis/result-export buckets, images) come from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — a required field unset in the file fails `Config.Validate` at startup. Networking/gateway fields (`SEI_GATEWAY_*`, `SEI_P2P_ENDPOINT_DOMAIN`, `SEI_NLB_TARGET_TYPE`) stay env-sourced pending their removal from the controller in PLT-451. The file is read once at startup for infra fields (an infra change needs a restart); the `stateSync` section is re-read per reconcile (it hot-reloads). See `internal/platform/platform.go` for the field list and `docs/controller-app-config.md` for the file schema. +- **Platform config** is resolved by `platform.Load` (`internal/platform/load.go`). Infra fields (scheduling, storage, resources, snapshot/genesis/result-export buckets, images) come from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — a required field unset in the file fails `Config.Validate` at startup. Networking/gateway fields (`SEI_GATEWAY_*`, `SEI_P2P_ENDPOINT_DOMAIN`, `SEI_NLB_TARGET_TYPE`) stay env-sourced pending their removal from the controller in PLT-451. 
The file is read once at startup for infra fields (an infra change needs a restart); the `stateSync` section is re-read per reconcile (it hot-reloads). See `internal/platform/platform.go` for the field list and the [controller-app-config schema](https://github.com/sei-protocol/bdchatham-designs/blob/main/designs/controller-app-config/controller-app-config.md) (in bdchatham-designs — relocated per Design 05 / PLT-497) for the file schema. - **Genesis resolution** is handled by the sidecar autonomously: embedded sei-config for well-known chains, S3 fallback at `{SEI_GENESIS_BUCKET}/{chainID}/genesis.json` for custom chains. - Config keys in seid's `config.toml` use **hyphens** (e.g., `persistent-peers`, `trust-height`), not underscores. diff --git a/README.md b/README.md index 6fada0bd..ff579a06 100644 --- a/README.md +++ b/README.md @@ -85,7 +85,7 @@ lifecycle, signing topology, idempotency, and operational gotchas. ## Platform Configuration -Infrastructure-level settings (node pools, storage, resources, snapshot/genesis/result-export buckets, sidecar images) are read from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — the controller fails fast at startup if a required field is unset. See [`docs/controller-app-config.md`](docs/controller-app-config.md) for the schema. +Infrastructure-level settings (node pools, storage, resources, snapshot/genesis/result-export buckets, sidecar images) are read from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — the controller fails fast at startup if a required field is unset. See the [controller-app-config schema](https://github.com/sei-protocol/bdchatham-designs/blob/main/designs/controller-app-config/controller-app-config.md) (in bdchatham-designs — relocated per Design 05 / PLT-497). 
Gateway config (`SEI_GATEWAY_NAME`, `SEI_GATEWAY_NAMESPACE`, `SEI_GATEWAY_DOMAIN`) and the config-file path (`SEI_CONTROLLER_CONFIG`) remain environment variables. diff --git a/config/manager/manager.yaml b/config/manager/manager.yaml index 472f7ef4..a7e727e2 100644 --- a/config/manager/manager.yaml +++ b/config/manager/manager.yaml @@ -37,8 +37,9 @@ spec: env: # Infra config (node pools, storage, resources, snapshot/genesis/ # result-export buckets, images) is sourced from the mounted - # app-config ConfigMap (SEI_CONTROLLER_CONFIG below); see - # docs/controller-app-config.md. Gateway config stays env-sourced + # app-config ConfigMap (SEI_CONTROLLER_CONFIG below); see the + # controller-app-config schema in bdchatham-designs (relocated per + # Design 05 / PLT-497). Gateway config stays env-sourced # pending its removal from the controller in PLT-451. - name: SEI_GATEWAY_NAME value: sei-gateway diff --git a/docs/bugs/stale-target-height-bootstrap-failure.md b/docs/bugs/stale-target-height-bootstrap-failure.md deleted file mode 100644 index e6022267..00000000 --- a/docs/bugs/stale-target-height-bootstrap-failure.md +++ /dev/null @@ -1,88 +0,0 @@ -# Bug: Stale `targetHeight` Causes Bootstrap Job Immediate Halt - -**Date:** 2026-03-29 -**Severity:** Medium — blocks shadow replayer (and any bootstrap node) from initializing -**Status:** Mitigated (manual bump); root cause requires design fix - ---- - -## Summary - -When a `SeiNode` with a bootstrap Job has a `snapshot.s3.targetHeight` that is -lower than the height of the S3 snapshot it restores, the bootstrap `seid` -process immediately halts on its first block and exits with code 130 (SIGINT). -Because the Job is created with `backoffLimit: 0`, this single failure marks the -entire init plan as `Failed`. - -## Reproduction - -1. Deploy a `SeiNode` with `replayer.snapshot.s3.targetHeight: 198740000` -2. 
The `snapshot-restore` task downloads the latest snapshot from S3 (currently - at height 200045000 — ~1.3M blocks ahead of the target) -3. The bootstrap Job starts seid with `--halt-height 198740000` -4. seid begins at height 200045000 (already past the halt height), processes - one block, and triggers the Cosmos SDK halt-height check -5. seid sends itself SIGINT → exit code 130 -6. Job fails immediately (`backoffLimit: 0`), `await-bootstrap-complete` marks - the plan as `Failed` - -## Root Cause - -`targetHeight` serves two purposes in `bootstrap_resources.go`: - -1. **Snapshot selection** — passed to `snapshot-restore` to find the right S3 - object. But the seictl restore task downloads the *closest available* - snapshot, which may be far newer if snapshots have been regenerated since the - manifest was written. - -2. **Halt height** — used verbatim as `--halt-height` for the bootstrap seid - process. This assumes the restored snapshot is at or below `targetHeight`, - which becomes false as new snapshots are uploaded to S3. - -The coupling between these two concerns means `targetHeight` silently becomes -stale as the chain advances and new snapshots replace old ones. - -Relevant code path: - -``` -bootstrap_resources.go:142 haltHeight := snap.S3.TargetHeight -bootstrap_resources.go:143 seidCmd, seidArgs := bootstrapWaitCommand(bootstrapSidecarPort(node), haltHeight) -bootstrap_resources.go:189 exec seid start --home %s --halt-height %d -``` - -## Impact - -- Any `SeiNode` with a bootstrap flow (shadow replayer, full node with - `bootstrapImage`, validator with snapshot) will fail to initialize if - `targetHeight` falls behind the latest available snapshot in S3. -- The failure is silent from the user's perspective — the node simply goes to - `Failed` phase. Diagnosing it requires checking bootstrap Job exit codes and - seid logs for the halt message. - -## Mitigation - -Bump `targetHeight` in the manifest to a value ahead of the latest S3 snapshot. 
-For the shadow replayer, this was changed from `198740000` → `200100000`. - -This is a manual fix that will need to be repeated as the chain advances. - -## Potential Fixes - -1. **Decouple snapshot selection from halt height.** Add a separate - `haltHeight` field (or compute it dynamically) so `targetHeight` only - controls which snapshot to download. The halt height could be derived from - the actual restored snapshot height + a configurable offset. - -2. **Treat exit code 130 as success in the bootstrap Job.** The halt-height - exit is an expected shutdown. The `await-bootstrap-complete` task (or the - Job spec itself) could treat exit code 130 as a successful completion rather - than a failure. - -3. **Resolve `targetHeight` dynamically.** Read `latest.txt` from S3 at plan - time and set `--halt-height` to `latest + offset`, similar to what the - original sei-infra scripts did. - -4. **Skip halt-height if restored height exceeds it.** The bootstrap wait - script could query seid's current height after state sync and skip the - `--halt-height` flag if already past it. This avoids the immediate-halt - scenario entirely. diff --git a/docs/controller-app-config.md b/docs/controller-app-config.md deleted file mode 100644 index 2addcd58..00000000 --- a/docs/controller-app-config.md +++ /dev/null @@ -1,80 +0,0 @@ -# Controller app-config file - -The controller reads a single read-only application-config file, pointed at by -`SEI_CONTROLLER_CONFIG` and mounted as a directory (a GitOps-written ConfigMap, -typically `sei-controller-config`). It is decoded into `platform.FileConfig` -(`internal/platform/platform.go`). - -Two read paths, by design: - -- **Infra sections** (`scheduling`, `storage`, `resources`, `snapshot`, - `resultExport`, `genesis`, `images`) are resolved **once at startup** by - `platform.Load`. 
Editing them in the live ConfigMap propagates to the mount - but has **no effect until the controller pod restarts** - (`kubectl rollout restart`) — only `stateSync` hot-reloads. -- **`stateSync`** is re-read **per reconcile** so syncer changes hot-reload - without a restart (the directory mount swaps atomically). - -## Source of truth - -The file is **authoritative** for infra config: a required field unset in the -file fails `Config.Validate` at startup (the controller does not boot). There is -no env-var fallback for these fields. - -Networking/gateway config (`SEI_GATEWAY_*`, `SEI_P2P_ENDPOINT_DOMAIN`, -`SEI_NLB_TARGET_TYPE`) is **not** in the file — it stays env-sourced pending its -removal from the controller in the GitOps networking move (PLT-451). - -## Schema - -```yaml -# State-sync canonical syncers, keyed by chain-id. Bare host:port (no scheme). -# Read per-reconcile; >=2 entries per chain or the node fails closed. -stateSync: - syncers: - pacific-1: - - rpc-1.example.net:26657 - - rpc-2.example.net:26657 - -# --- infra (authoritative; read once at startup) --- - -scheduling: - nodepoolName: sei-node - nodepoolArchive: sei-archive - tolerationKey: sei.io/workload - serviceAccount: seid-node - -storage: # no sizePerf — perf is a storage-class tier only - classPerf: gp3-10k-750 - classDefault: gp3 - classArchive: gp3-archive - sizeDefault: 2000Gi - sizeArchive: 40Ti - -resources: - cpuArchive: "48" - memArchive: 448Gi - cpuDefault: "4" - memDefault: 32Gi - -snapshot: - bucket: sei-snapshots - region: us-east-2 - -resultExport: - bucket: sei-shadow-results - region: us-east-2 - prefix: shadow-results/ - -genesis: - bucket: sei-k8s-genesis - region: us-east-2 - -images: - sidecar: ghcr.io/sei-protocol/seictl@sha256:... - kubeRBACProxy: quay.io/brancz/kube-rbac-proxy:v0.19.1 - cosmosExporter: ghcr.io/sei-protocol/sei-cosmos-exporter@sha256:... -``` - -A present-but-unparseable file is a hard startup error. 
A required infra field -unset in the file fails `Config.Validate` at startup, naming the file key. diff --git a/docs/design-networking-monitoring.md b/docs/design-networking-monitoring.md deleted file mode 100644 index 73d98ec7..00000000 --- a/docs/design-networking-monitoring.md +++ /dev/null @@ -1,901 +0,0 @@ -# Design: SeiNodeDeployment — Fleet Orchestration with Networking & Monitoring - -**Branch:** `feature/networking-monitoring` -**Status:** Final (v3) — review findings incorporated -**Goal:** Introduce a `SeiNodeDeployment` CRD that orchestrates N SeiNodes behind shared networking (Service, Istio Gateway routing, network isolation) and monitoring (ServiceMonitor), closing the gap between the current controller and sei-infra's EC2 production deployment. - ---- - -## 1. Problem Statement - -The SeiNode controller manages single-node lifecycle (bootstrap → init → running → snapshots). Production deployments require: - -- **Multiple nodes** behind a shared load balancer (sei-infra runs 3 instances per role) -- **External Service** (ClusterIP/LoadBalancer) for RPC, REST, and EVM traffic -- **Ingress routing** via Kubernetes Gateway API (HTTPRoute) with TLS for public endpoints -- **Network isolation** so that only the ingress gateway and authorized peers can reach node APIs -- **DNS** (Route53 via external-dns) for stable hostnames -- **Monitoring** (Prometheus ServiceMonitor) for observability - -These are fleet-level concerns: a load balancer sits in front of N nodes, not one. Putting networking on each SeiNode would create N independent load balancers instead of one shared entry point. - ---- - -## 2. Architecture Overview - -``` -┌──────────────────────────────────────────────────────────────────┐ -│ SeiNodeDeployment │ -│ "pacific-1-archive-rpc" │ -│ │ -│ Owns: │ -│ ├── SeiNode "pacific-1-archive-rpc-0" ─┐ │ -│ │ └── (SeiNode controller manages │ │ -│ │ StatefulSet, PVC, headless Svc, │ shared label: │ -│ │ sidecar tasks) │ sei.io/group: ... 
│ -│ ├── SeiNode "pacific-1-archive-rpc-1" ─┤ │ -│ ├── SeiNode "pacific-1-archive-rpc-2" ─┘ │ -│ │ │ -│ ├── Service "...-external" (selects all 3 pods by group label) │ -│ ├── HTTPRoute "..." (routes to shared Service) │ -│ ├── AuthorizationPolicy "..." (applied to all 3 pods) │ -│ └── ServiceMonitor "..." (scrapes all 3 pods) │ -└──────────────────────────────────────────────────────────────────┘ -``` - -**SeiNode** is unchanged — it manages single-node lifecycle. -**SeiNodeDeployment** is the new orchestration layer that composes SeiNodes with shared infrastructure. - -This follows the same pattern as the existing `SeiNodePool → SeiNode` relationship, where SeiNodePool creates child SeiNode CRs and aggregates their status. - ---- - -## 3. Design Principles - -1. **SeiNode stays single-responsibility** — One small prerequisite change (`spec.podLabels`) is needed on SeiNode to support label propagation to pods. No networking logic is added. SeiNodeDeployment owns the fleet and exposure layer. - -2. **Same patterns as SeiNodePool** — The SeiNodeDeployment controller follows the same `ensureSeiNode` / `updateStatus` / owner-reference patterns already established by SeiNodePool. No new controller patterns to learn. - -3. **Passthrough over abstraction** — Service annotations, HTTPRoute annotations, and Istio config use Kubernetes-native values. No DSL wrappers. - -4. **Safe by default** — `DeletionPolicy` governs both networking resources and child SeiNodes. Network isolation is an additive feature, not a breaking change. **Important:** The default `DeletionPolicy` is `Delete`, which cascades deletion of _all_ child SeiNodes and networking resources when the group is removed. Production deployments should use `deletionPolicy: Retain` to prevent accidental data loss. - -5. **Two-way doors only** — Every field is optional. WAF is just an annotation on the Service or HTTPRoute. 
Update strategy for rolling out changes across replicas is a future concern that the current design does not block. - -6. **SeiNodePool vs SeiNodeDeployment** — SeiNodePool is for genesis network bootstrapping (prep jobs, shared genesis PVC, then SeiNodes). SeiNodeDeployment is for production fleet management (N nodes from a template + shared networking/monitoring). They target different use cases and should not manage the same SeiNodes. - ---- - -## 4. Prerequisite: SeiNode `podLabels` Field - -The shared external Service selects pods by `sei.io/group: {groupName}`. For this label to reach the pod template, the SeiNode controller must propagate it. Today, `resourceLabelsForNode()` only sets `sei.io/node: {name}` on the pod template — SeiNode metadata labels are ignored. - -**Change:** Add an optional `podLabels` field to `SeiNodeSpec`. The SeiNode controller merges these into the StatefulSet pod template labels alongside the existing `sei.io/node` label. - -```go -type SeiNodeSpec struct { - // ... existing fields ... - - // PodLabels are additional labels merged into the StatefulSet pod template. - // The controller always sets sei.io/node; these are additive. - // +optional - PodLabels map[string]string `json:"podLabels,omitempty"` -} -``` - -In `resources.go`: - -```go -func resourceLabelsForNode(node *seiv1alpha1.SeiNode) map[string]string { - labels := make(map[string]string, len(node.Spec.PodLabels)+1) - maps.Copy(labels, node.Spec.PodLabels) // user/group labels first - labels[nodeLabel] = node.Name // system label wins - return labels -} -``` - -The SeiNodeDeployment controller sets `podLabels: {"sei.io/group": groupName}` on each child SeiNode. This is a small, backward-compatible change — existing SeiNodes without `podLabels` behave identically. - -This is scoped as a standalone prerequisite PR before Phase 1. - ---- - -## 5. 
API Types - -### 5.1 SeiNodeDeployment (`api/v1alpha1/seinodedeployment_types.go`) - -```go -// SeiNodeDeploymentSpec defines the desired state of a SeiNodeDeployment. -type SeiNodeDeploymentSpec struct { - // Replicas is the number of SeiNode instances to create. - // +kubebuilder:validation:Minimum=1 - // +kubebuilder:validation:Maximum=16 - // +kubebuilder:default=1 - Replicas int32 `json:"replicas"` - - // Template defines the SeiNode spec stamped out for each replica. - // Each SeiNode is named "{group-name}-{ordinal}". - Template SeiNodeTemplate `json:"template"` - - // Networking controls how the group is exposed to traffic. - // Networking resources are shared across all replicas. - // +optional - Networking *NetworkingConfig `json:"networking,omitempty"` - - // Monitoring configures observability resources shared across - // all replicas. - // +optional - Monitoring *MonitoringConfig `json:"monitoring,omitempty"` -} - -// SeiNodeTemplate wraps a SeiNodeSpec for use in the group template. -type SeiNodeTemplate struct { - // Metadata allows setting labels and annotations on child SeiNodes. - // The controller always adds sei.io/group and sei.io/group-ordinal - // labels; user-specified labels are merged. - // +optional - Metadata *SeiNodeTemplateMeta `json:"metadata,omitempty"` - - // Spec is the SeiNodeSpec applied to each replica. - Spec SeiNodeSpec `json:"spec"` -} - -// SeiNodeTemplateMeta defines metadata for templated SeiNodes. -type SeiNodeTemplateMeta struct { - // Labels are merged onto each child SeiNode's metadata. - // +optional - Labels map[string]string `json:"labels,omitempty"` - - // Annotations are merged onto each child SeiNode's metadata. - // +optional - Annotations map[string]string `json:"annotations,omitempty"` -} -``` - -### 4.2 SeiNodeDeployment Status - -```go -// SeiNodeDeploymentPhase represents the high-level lifecycle state. 
-// +kubebuilder:validation:Enum=Pending;Initializing;Ready;Degraded;Failed;Terminating -type SeiNodeDeploymentPhase string - -const ( - GroupPhasePending SeiNodeDeploymentPhase = "Pending" - GroupPhaseInitializing SeiNodeDeploymentPhase = "Initializing" - GroupPhaseReady SeiNodeDeploymentPhase = "Ready" - GroupPhaseDegraded SeiNodeDeploymentPhase = "Degraded" - GroupPhaseFailed SeiNodeDeploymentPhase = "Failed" - GroupPhaseTerminating SeiNodeDeploymentPhase = "Terminating" -) - -// SeiNodeDeploymentStatus defines the observed state of a SeiNodeDeployment. -type SeiNodeDeploymentStatus struct { - // ObservedGeneration is the most recent generation observed by the controller. - // Clients can check this against metadata.generation to know if the - // status reflects the current spec. - // +optional - ObservedGeneration int64 `json:"observedGeneration,omitempty"` - - // Phase is the high-level lifecycle state. - Phase SeiNodeDeploymentPhase `json:"phase,omitempty"` - - // Replicas is the desired number of SeiNodes. - Replicas int32 `json:"replicas,omitempty"` - - // ReadyReplicas is the number of SeiNodes in Running phase. - ReadyReplicas int32 `json:"readyReplicas,omitempty"` - - // Nodes reports the status of each child SeiNode. - // +optional - Nodes []GroupNodeStatus `json:"nodes,omitempty"` - - // NetworkingStatus reports the observed state of networking resources. - // +optional - NetworkingStatus *NetworkingStatus `json:"networkingStatus,omitempty"` - - // +listType=map - // +listMapKey=type - // +optional - Conditions []metav1.Condition `json:"conditions,omitempty"` -} - -// GroupNodeStatus is a summary of a child SeiNode's state. -type GroupNodeStatus struct { - // Name is the SeiNode resource name. - Name string `json:"name"` - - // Phase is the SeiNode's current phase. - Phase SeiNodePhase `json:"phase,omitempty"` -} - -// NetworkingStatus reports the observed state of networking resources. 
-type NetworkingStatus struct { - // ExternalServiceName is the name of the managed external Service. - // +optional - ExternalServiceName string `json:"externalServiceName,omitempty"` - - // LoadBalancerIngress contains the hostname/IP assigned by the cloud - // provider once the LoadBalancer is provisioned. - // +optional - LoadBalancerIngress []corev1.LoadBalancerIngress `json:"loadBalancerIngress,omitempty"` -} -``` - -### 4.3 SeiNodeDeployment CRD markers - -```go -// +kubebuilder:object:root=true -// +kubebuilder:subresource:status -// +kubebuilder:resource:shortName=snd -// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas` -// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.status.replicas` -// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase` -// +kubebuilder:printcolumn:name="Host",type=string,JSONPath=`.spec.networking.gateway.hostnames[0]`,priority=1 -// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp` - -type SeiNodeDeployment struct { - metav1.TypeMeta `json:",inline"` - metav1.ObjectMeta `json:"metadata,omitempty"` - Spec SeiNodeDeploymentSpec `json:"spec,omitempty"` - Status SeiNodeDeploymentStatus `json:"status,omitempty"` -} -``` - -### 4.4 Networking Types (`api/v1alpha1/networking_types.go`) - -```go -// PortName is a well-known sei-node port identifier from sei-config. -// +kubebuilder:validation:Enum=rpc;rest;evm-http;evm-ws;grpc;p2p;prometheus -type PortName string - -// DeletionPolicy controls what happens to managed networking resources -// when their spec is removed. -// +kubebuilder:validation:Enum=Delete;Retain -type DeletionPolicy string - -const ( - DeletionPolicyDelete DeletionPolicy = "Delete" - DeletionPolicyRetain DeletionPolicy = "Retain" -) - -// NetworkingConfig controls how the group is exposed to traffic. 
-// -// Routing uses the Kubernetes Gateway API exclusively; the platform must -// install the Gateway API CRDs (v1+) and a Gateway implementation such -// as Istio before HTTPRoute resources will take effect. -// +kubebuilder:validation:XValidation:rule="!has(self.gateway) || has(self.service)",message="gateway requires service to be configured" -type NetworkingConfig struct { - // Service creates a non-headless Service shared across all replicas. - // Each SeiNode still gets its own headless Service for pod DNS. - // +optional - Service *ExternalServiceConfig `json:"service,omitempty"` - - // Gateway creates a gateway.networking.k8s.io/v1 HTTPRoute - // targeting a shared Gateway (e.g. Istio ingress gateway). - // +optional - Gateway *GatewayRouteConfig `json:"gateway,omitempty"` - - // Isolation configures network-level access control for node pods. - // +optional - Isolation *NetworkIsolationConfig `json:"isolation,omitempty"` -} - -// ExternalServiceConfig defines the shared non-headless Service. -type ExternalServiceConfig struct { - // Type is the Kubernetes Service type. Defaults to ClusterIP. - // +optional - // +kubebuilder:default=ClusterIP - // +kubebuilder:validation:Enum=ClusterIP;LoadBalancer;NodePort - Type corev1.ServiceType `json:"type,omitempty"` - - // Ports selects which node ports to expose. When empty, all - // standard sei-config ports are exposed. - // +optional - Ports []PortName `json:"ports,omitempty"` - - // Annotations are merged onto the Service metadata. - // +optional - Annotations map[string]string `json:"annotations,omitempty"` -} - -// GatewayRouteConfig creates a gateway.networking.k8s.io/v1 HTTPRoute -// that references a shared Gateway resource. -type GatewayRouteConfig struct { - // ParentRef identifies the shared Gateway. - ParentRef GatewayParentRef `json:"parentRef"` - - // Hostnames are the DNS hostnames for the HTTPRoute. 
- // +kubebuilder:validation:MinItems=1 - Hostnames []string `json:"hostnames"` - - // Annotations are merged onto the HTTPRoute metadata. - // +optional - Annotations map[string]string `json:"annotations,omitempty"` -} - -// GatewayParentRef identifies a Gateway resource. -type GatewayParentRef struct { - // +kubebuilder:validation:MinLength=1 - Name string `json:"name"` - - // +kubebuilder:validation:MinLength=1 - Namespace string `json:"namespace"` -} - -// NetworkIsolationConfig defines network-level access control. -type NetworkIsolationConfig struct { - // AuthorizationPolicy creates an Istio AuthorizationPolicy - // restricting which identities can reach node pods. - // +optional - AuthorizationPolicy *AuthorizationPolicyConfig `json:"authorizationPolicy,omitempty"` -} - -// AuthorizationPolicyConfig defines allowed traffic sources. -type AuthorizationPolicyConfig struct { - // AllowedSources defines who can reach this group's pods. - // The controller generates an ALLOW policy; traffic from - // sources not listed here is denied. - // +kubebuilder:validation:MinItems=1 - AllowedSources []TrafficSource `json:"allowedSources"` -} - -// TrafficSource identifies a set of callers by Istio identity. -// +kubebuilder:validation:XValidation:rule="has(self.principals) || has(self.namespaces)",message="at least one of principals or namespaces must be set" -type TrafficSource struct { - // Principals are SPIFFE identities (e.g. - // "cluster.local/ns/istio-system/sa/istio-ingressgateway"). - // +optional - Principals []string `json:"principals,omitempty"` - - // Namespaces allows all pods in these namespaces. - // +optional - Namespaces []string `json:"namespaces,omitempty"` -} -``` - -### 4.5 Monitoring Types (`api/v1alpha1/monitoring_types.go`) - -```go -// MonitoringConfig controls observability resources. -type MonitoringConfig struct { - // ServiceMonitor creates a monitoring.coreos.com/v1 ServiceMonitor. - // Presence (non-nil) enables it; set to nil to disable. 
- // +optional - ServiceMonitor *ServiceMonitorConfig `json:"serviceMonitor,omitempty"` -} - -// ServiceMonitorConfig defines the ServiceMonitor. -type ServiceMonitorConfig struct { - // Interval is the Prometheus scrape interval. - // +optional - // +kubebuilder:default="30s" - // +kubebuilder:validation:Pattern="^[0-9]+(ms|s|m|h)$" - Interval string `json:"interval,omitempty"` - - // Labels are added to the ServiceMonitor metadata. - // +optional - Labels map[string]string `json:"labels,omitempty"` -} -``` - -### 4.6 Status Conditions - -```go -const ( - ConditionNodesReady = "NodesReady" - ConditionExternalServiceReady = "ExternalServiceReady" - ConditionNetworkingReady = "NetworkingReady" // HTTPRoute - ConditionIsolationReady = "IsolationReady" // AuthorizationPolicy - ConditionServiceMonitorReady = "ServiceMonitorReady" -) -``` - ---- - -## 6. Labels and Naming - -### Labels injected by SeiNodeDeployment controller - -| Label | Value | Set on | Purpose | -|-------|-------|--------|---------| -| `sei.io/group` | `{groupName}` | SeiNode metadata + `podLabels` | Shared Service selector, AuthorizationPolicy selector | -| `sei.io/group-ordinal` | `"0"`, `"1"`, ... | SeiNode metadata | Identify individual replicas | -| `sei.io/node` | `{nodeName}` | Pod template (by SeiNode controller) | Existing per-node label | - -The SeiNodeDeployment controller sets `sei.io/group` on both the child SeiNode's metadata labels AND `spec.podLabels`. The `podLabels` mechanism (Section 4) ensures the label propagates to the StatefulSet pod template. The shared external Service selects on `sei.io/group: {groupName}`, so all replica pods are endpoints of the same Service. - -### Label merge order (system labels win) - -When building child SeiNode labels, user-specified template labels are applied first, then system labels overwrite. 
This prevents a user from accidentally breaking the group selector: - -```go -// User labels first -maps.Copy(labels, group.Spec.Template.Metadata.Labels) -// System labels overwrite -labels["sei.io/group"] = group.Name -labels["sei.io/group-ordinal"] = strconv.Itoa(ordinal) -``` - -### Resource naming - -| Resource | Name | Why | -|----------|------|-----| -| SeiNode | `{group}-{ordinal}` | Matches SeiNodePool pattern | -| External Service | `{group}-external` | Distinguishes from per-node headless Services | -| HTTPRoute | `{group}` | One route per group | -| AuthorizationPolicy | `{group}` | Applied to all group pods | -| ServiceMonitor | `{group}` | Scrapes all group pods | - ---- - -## 7. Controller Reconciliation - -### 7.1 File Organization - -``` -internal/controller/ -├── node/ # SeiNode controller (UNCHANGED) -│ ├── controller.go -│ ├── resources.go -│ ├── plan_execution.go -│ ├── ... -│ -├── nodedeployment/ # NEW: SeiNodeDeployment controller -│ ├── controller.go # Reconcile loop, phase transitions -│ ├── nodes.go # ensureSeiNode, scaleDown -│ ├── networking.go # External Service, HTTPRoute, -│ │ # AuthorizationPolicy generation + reconcile -│ ├── monitoring.go # ServiceMonitor generation + reconcile -│ ├── status.go # Status aggregation -│ ├── labels.go # Label helpers, naming -│ ├── networking_test.go -│ ├── monitoring_test.go -│ ├── nodes_test.go -│ └── status_test.go -│ -└── nodepool/ # SeiNodePool controller (UNCHANGED) - ├── controller.go - └── ... -``` - -### 7.2 Reconcile Flow - -```go -func (r *SeiNodeDeploymentReconciler) Reconcile(ctx, req) (Result, error) { - group := &SeiNodeDeployment{} - r.Get(ctx, req, group) - - // Deletion handling (respects DeletionPolicy for networking AND child SeiNodes) - if !group.DeletionTimestamp.IsZero() { - return r.handleDeletion(ctx, group) - } - r.ensureFinalizer(ctx, group) - - // 1. Ensure N SeiNodes exist from template - r.reconcileSeiNodes(ctx, group) - - // 2. 
Networking (independent of SeiNode readiness) - r.reconcileNetworking(ctx, group) - - // 3. Monitoring - r.reconcileMonitoring(ctx, group) - - // 4. Status aggregation (sets observedGeneration) - r.updateStatus(ctx, group) - - // Periodic requeue to catch drift on unstructured resources - // (HTTPRoute, AuthorizationPolicy, ServiceMonitor) that lack Owns() watches - return ctrl.Result{RequeueAfter: statusPollInterval}, nil -} -``` - -### 7.3 SeiNode Management (`nodes.go`) - -Follows the SeiNodePool pattern: - -```go -func (r *SeiNodeDeploymentReconciler) reconcileSeiNodes(ctx, group) error { - for i := range group.Spec.Replicas { - r.ensureSeiNode(ctx, group, i) - } - return r.scaleDown(ctx, group) -} - -func (r *SeiNodeDeploymentReconciler) ensureSeiNode(ctx, group, ordinal) error { - desired := generateSeiNode(group, ordinal) - // Set owner reference, create-or-update - // On update: sync Image, Entrypoint, Sidecar (same as SeiNodePool) -} - -func generateSeiNode(group, ordinal) *SeiNode { - // User labels first, then system labels overwrite - labels := make(map[string]string) - if group.Spec.Template.Metadata != nil { - maps.Copy(labels, group.Spec.Template.Metadata.Labels) - } - labels["sei.io/group"] = group.Name - labels["sei.io/group-ordinal"] = strconv.Itoa(ordinal) - - spec := group.Spec.Template.Spec.DeepCopy() - // Inject podLabels so the SeiNode controller propagates sei.io/group to pods - if spec.PodLabels == nil { - spec.PodLabels = make(map[string]string) - } - spec.PodLabels["sei.io/group"] = group.Name - - return &SeiNode{ - ObjectMeta: ObjectMeta{ - Name: fmt.Sprintf("%s-%d", group.Name, ordinal), - Namespace: group.Namespace, - Labels: labels, - }, - Spec: *spec, - } -} -``` - -**Scale-down guard:** The `scaleDown` function refuses to delete SeiNodes if the computed desired count is 0 (defensive against uninitialized fields or controller bugs). 
The `Minimum=1` schema validation on `replicas` prevents 0 at admission, but the guard catches code-level errors: - -```go -func (r *SeiNodeDeploymentReconciler) scaleDown(ctx, group) error { - if group.Spec.Replicas <= 0 { - log.Error("refusing scale-down: desired replicas is zero or negative") - return nil - } - // Delete SeiNodes with ordinal >= group.Spec.Replicas -} -``` - -### 7.4 Networking Reconciliation (`networking.go`) - -Each networking resource is managed independently: - -```go -func (r *SeiNodeDeploymentReconciler) reconcileNetworking(ctx, group) error { - r.reconcileExternalService(ctx, group) - r.reconcileRoute(ctx, group) // HTTPRoute - r.reconcileIsolation(ctx, group) // AuthorizationPolicy -} -``` - -**External Service:** -- Uses server-side apply with `fieldOwner: seinodedeployment-controller` -- Selector: `sei.io/group: {groupName}` (matches all replica pods via `podLabels`) -- Does NOT set `PublishNotReadyAddresses` (natural readiness gating) -- If spec is nil and `deletionPolicy: Delete`, delete the Service -- If spec is nil and `deletionPolicy: Retain`, remove owner reference (orphan) - -**HTTPRoute:** -- Generated as `unstructured.Unstructured` (avoids importing Gateway API Go modules) -- Backend targets `{group}-external` Service -- If CRD not installed (no Gateway API), sets `NetworkingReady` condition to False/CRDNotInstalled - -**AuthorizationPolicy:** -- Generated as `unstructured.Unstructured` (avoids importing Istio Go modules) -- Selector: `sei.io/group: {groupName}` -- Action: ALLOW with specified principals/namespaces -- **Controller SA auto-injection:** The controller always adds its own ServiceAccount principal to the AuthorizationPolicy, ensuring sidecar communication (port 7777) is never blocked. This is injected at generation time, not visible in the user's spec. Without this, a SeiNode controller running in a different namespace (e.g. `sei-system`) would be unable to drive node initialization via the sidecar API. 
-- If CRD not installed (no Istio), sets `IsolationReady` condition to False - -### 7.5 Monitoring Reconciliation (`monitoring.go`) - -**ServiceMonitor:** -- Generated as `unstructured.Unstructured` -- Selector: `sei.io/group: {groupName}` (scrapes all replica pods) -- Port: `prometheus` (26660) -- If CRD not installed, sets `ServiceMonitorReady` condition to False - -### 7.6 Status Aggregation (`status.go`) - -```go -func (r *SeiNodeDeploymentReconciler) updateStatus(ctx, group) error { - // List child SeiNodes by label - nodeList := r.listChildSeiNodes(ctx, group) - - // Count ready/total - var readyReplicas int32 - for _, node := range nodeList { - if node.Status.Phase == PhaseRunning { readyReplicas++ } - } - - // Determine group phase - phase := groupPhase(readyReplicas, group.Spec.Replicas, nodeList) - - // Read external Service for LB status - networkingStatus := r.readNetworkingStatus(ctx, group) - - // Patch status - group.Status.Replicas = group.Spec.Replicas - group.Status.ReadyReplicas = readyReplicas - group.Status.Phase = phase - group.Status.Nodes = nodeStatuses(nodeList) - group.Status.ObservedGeneration = group.Generation - group.Status.NetworkingStatus = networkingStatus -} -``` - -**Phase logic:** The `groupPhase` function differentiates between scaling-up (some nodes in Initializing/PreInitializing) and actual failures: - -| Condition | Phase | -|-----------|-------| -| All replicas Running | `Ready` | -| Some replicas Running, rest progressing (Pending/Initializing) | `Initializing` | -| Some replicas Running, some Failed | `Degraded` | -| All replicas Failed | `Failed` | -| No replicas exist yet | `Pending` | - -The `NodesReady` condition provides detail: `"2/3 nodes ready (1 initializing)"`. 
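The phase table above can be sketched as a small pure function. This is a minimal illustration, not the controller's actual `groupPhase`: it takes only a slice of child phases instead of the `(readyReplicas, replicas, nodeList)` arguments shown in `updateStatus`, the phase strings are assumed values, and it assumes that any mix containing at least one failed node (but not all failed) is `Degraded`, a case the table does not fully spell out.

```go
package main

import "fmt"

// Child node phases; the exact string values are assumptions for this sketch.
const (
	nodeRunning = "Running"
	nodeFailed  = "Failed"
)

// groupPhase maps child SeiNode phases to a group phase following the
// table in Section 7.6. Any phase other than Running or Failed is
// treated as "progressing".
func groupPhase(nodePhases []string) string {
	if len(nodePhases) == 0 {
		return "Pending" // no replicas exist yet
	}
	var running, failed int
	for _, p := range nodePhases {
		switch p {
		case nodeRunning:
			running++
		case nodeFailed:
			failed++
		}
	}
	switch {
	case running == len(nodePhases):
		return "Ready"
	case failed == len(nodePhases):
		return "Failed"
	case failed > 0:
		return "Degraded" // assumption: any partial failure is Degraded
	default:
		return "Initializing" // remaining nodes are still progressing
	}
}

func main() {
	fmt.Println(groupPhase([]string{"Running", "Running", "Initializing"})) // prints "Initializing"
}
```

Keeping the decision in a pure function like this makes the phase rules easy to cover with table-driven unit tests, independent of the Kubernetes client.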
- -### 7.7 Deletion Handling - -The `DeletionPolicy` governs both networking resources AND child SeiNodes: - -| DeletionPolicy | Networking resources | Child SeiNodes | -|----------------|---------------------|----------------| -| `Delete` | Deleted | Deleted (via owner ref GC) | -| `Retain` | Orphaned (owner ref removed) | Orphaned (owner ref removed, continue running independently) | - -> **Production recommendation:** Always use `deletionPolicy: Retain` in production. The `Delete` default is convenient for development but dangerous in production because deleting a SeiNodeDeployment will cascade-delete all child SeiNodes (and their PVCs if configured), causing irreversible data loss. - -When `Retain`, the finalizer removes owner references from all managed resources before allowing the SeiNodeDeployment to be deleted. This prevents Kubernetes GC from cascading the deletion. - -### 7.8 RBAC - -```go -// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments,verbs=get;list;watch;create;update;patch;delete -// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments/status,verbs=get;update;patch -// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments/finalizers,verbs=update -// +kubebuilder:rbac:groups=sei.io,resources=seinodes,verbs=get;list;watch;create;update;patch;delete -// +kubebuilder:rbac:groups=sei.io,resources=seinodes/status,verbs=get -// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete -// +kubebuilder:rbac:groups=gateway.networking.k8s.io,resources=httproutes,verbs=get;list;watch;create;update;patch;delete -// +kubebuilder:rbac:groups=security.istio.io,resources=authorizationpolicies,verbs=get;list;watch;create;update;patch;delete -// +kubebuilder:rbac:groups=monitoring.coreos.com,resources=servicemonitors,verbs=get;list;watch;create;update;patch;delete -``` - -### 7.9 SetupWithManager - -```go -func (r *SeiNodeDeploymentReconciler) SetupWithManager(mgr ctrl.Manager) error { - return 
ctrl.NewControllerManagedBy(mgr). - For(&seiv1alpha1.SeiNodeDeployment{}). - Owns(&seiv1alpha1.SeiNode{}). - Owns(&corev1.Service{}). - Named("seinodedeployment"). - Complete(r) -} -``` - -Note: HTTPRoute, AuthorizationPolicy, and ServiceMonitor are unstructured, so we don't add `Owns()` for them. Their reconciliation is idempotent and driven by the SeiNodeDeployment reconcile loop; the periodic requeue in Section 7.2 catches drift on these resources. - ---- - -## 8. AWS Topology Patterns - -### Pattern 1: Istio Gateway + Network Isolation (recommended) - -Traffic flow: `Client → ALB (WAF) → Istio Gateway → Envoy sidecar → seid` - -The controller manages the HTTPRoute and AuthorizationPolicy. The ALB and Istio Gateway are platform-level resources. - -```yaml -networking: - deletionPolicy: Retain - service: - type: ClusterIP - ports: ["rpc", "rest", "evm-http", "evm-ws"] - gateway: - parentRef: - name: sei-gateway - namespace: istio-system - hostnames: - - rpc.sei-archive.pacific-1.seinetwork.io - isolation: - authorizationPolicy: - allowedSources: - - principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway"] - - namespaces: ["sei-nodes"] -``` - -### Pattern 2: NLB for TCP (p2p, gRPC) - -```yaml -networking: - service: - type: LoadBalancer - ports: ["p2p", "grpc"] - annotations: - service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing - service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip -``` - -### WAF - -WAF is provisioned by the platform team (Terraform) and applied externally (e.g. an AWS WAF web ACL associated with the ALB fronting the Istio ingress gateway; AWS WAF cannot be attached to an NLB, so Pattern 2 traffic is not WAF-protected). This is a two-way door: WAF configuration lives outside the controller and can be toggled independently. - -### DNS - -DNS is handled by external-dns, which reads HTTPRoute `hostnames` fields and creates Route53 records. Prerequisites: -- external-dns deployed with `--source=gateway-httproute` -- `--domain-filter` matching the target domain -- IAM permissions for Route53 - ---- - -## 9. 
Complete Sample Manifest - -```yaml -apiVersion: sei.io/v1alpha1 -kind: SeiNodeDeployment -metadata: - name: pacific-1-archive-rpc - namespace: sei-nodes -spec: - replicas: 3 - - template: - metadata: - labels: - sei.io/chain: pacific-1 - sei.io/role: archive - spec: - chainId: pacific-1 - image: "ghcr.io/sei-protocol/sei:v6.3.0" - sidecar: - image: ghcr.io/sei-protocol/seictl@sha256:64f92fb... - resources: - requests: - cpu: "500m" - memory: "256Mi" - entrypoint: - command: ["seid"] - args: ["start", "--home", "/sei"] - storage: - retainOnDelete: true - archive: - peers: - - ec2Tags: - region: eu-central-1 - tags: - ChainIdentifier: pacific-1 - Component: state-syncer - snapshotGeneration: - keepRecent: 5 - destination: - s3: - bucket: pacific-1-snapshots - prefix: state-sync/ - region: eu-central-1 - - networking: - deletionPolicy: Retain - service: - type: ClusterIP - ports: ["rpc", "rest", "evm-http", "evm-ws"] - gateway: - parentRef: - name: sei-gateway - namespace: istio-system - hostnames: - - rpc.sei-archive.pacific-1.seinetwork.io - isolation: - authorizationPolicy: - allowedSources: - - principals: - - "cluster.local/ns/istio-system/sa/istio-ingressgateway" - - namespaces: - - "sei-nodes" - - monitoring: - serviceMonitor: - interval: "30s" - labels: - release: prometheus -``` - -**Generated resources:** - -| Resource | Name | Kind | -|----------|------|------| -| SeiNode | pacific-1-archive-rpc-0 | sei.io/v1alpha1/SeiNode | -| SeiNode | pacific-1-archive-rpc-1 | sei.io/v1alpha1/SeiNode | -| SeiNode | pacific-1-archive-rpc-2 | sei.io/v1alpha1/SeiNode | -| Service | pacific-1-archive-rpc-external | v1/Service (ClusterIP) | -| HTTPRoute | pacific-1-archive-rpc | gateway.networking.k8s.io/v1/HTTPRoute | -| AuthorizationPolicy | pacific-1-archive-rpc | security.istio.io/v1/AuthorizationPolicy | -| ServiceMonitor | pacific-1-archive-rpc | monitoring.coreos.com/v1/ServiceMonitor | - -Plus each SeiNode creates its own StatefulSet, PVC, and headless Service 
(managed by the existing SeiNode controller). - ---- - -## 10. What Changes vs. What Stays the Same - -| Concern | Status | -|---------|--------| -| SeiNode CRD | **SMALL CHANGE** — add optional `spec.podLabels` field (prerequisite) | -| SeiNode controller | **SMALL CHANGE** — merge `podLabels` into StatefulSet pod template | -| SeiNodePool controller | **UNCHANGED** | -| SeiNodePool CRD | **UNCHANGED** | -| SeiNodeDeployment CRD | **NEW** | -| Networking types | **NEW** (used by SeiNodeDeployment) | -| Monitoring types | **NEW** (used by SeiNodeDeployment) | -| SeiNodeDeployment controller | **NEW** | - -Existing SeiNode manifests without `podLabels` continue to work identically. SeiNodeDeployment is additive. - ---- - -## 11. Reversibility Analysis - -| Decision | How to reverse | Impact | -|----------|---------------|--------| -| `spec.podLabels` on SeiNode | Remove field, regenerate CRD. Existing nodes unaffected (nil defaults to empty map). | None | -| New SeiNodeDeployment CRD | Delete SeiNodeDeployment with `DeletionPolicy: Retain`. Child SeiNodes and networking resources are orphaned and keep running. | SeiNodes become standalone | -| `sei.io/group` label on child SeiNodes | Remove label. SeiNode controller doesn't read this label. | None | -| Unstructured HTTPRoute / AuthorizationPolicy / ServiceMonitor | Switch to typed imports later. Same apply semantics. | Code change only | -| `DeletionPolicy` (covers nodes + networking) | Change per-group. Existing groups unaffected. | Per-resource | -| WAF | External to controller. Managed by platform team via Terraform/annotations. | Two-way door | -| Network isolation via AuthorizationPolicy | Optional field. Remove to disable. Istio defaults to ALLOW-all when no policy exists. | Two-way door | -| Controller SA auto-injection in AuthPolicy | Implementation detail. User never sees it in spec. | Transparent | - ---- - -## 12. 
Implementation Plan - -### Phase 0: SeiNode Prerequisite (standalone PR) -- [ ] Add `spec.podLabels` field to `SeiNodeSpec` -- [ ] Update `resourceLabelsForNode()` to merge `podLabels` into pod template -- [ ] Unit tests for label propagation -- [ ] Regenerate CRD manifests - -### Phase 1: SeiNodeDeployment CRD + Node Orchestration -- [ ] `api/v1alpha1/seinodedeployment_types.go` — SeiNodeDeployment CRD types -- [ ] `api/v1alpha1/networking_types.go` — Networking types -- [ ] `api/v1alpha1/monitoring_types.go` — Monitoring types -- [ ] `internal/controller/nodedeployment/controller.go` — Reconcile loop with `RequeueAfter` -- [ ] `internal/controller/nodedeployment/nodes.go` — SeiNode create/update/scaleDown with guard -- [ ] `internal/controller/nodedeployment/labels.go` — Label helpers, naming, merge-order safety -- [ ] `internal/controller/nodedeployment/status.go` — Status aggregation with `observedGeneration` -- [ ] Wire into `cmd/main.go` -- [ ] `make manifests` to generate CRD + RBAC -- [ ] Unit tests for node orchestration and status - -### Phase 2: Shared Networking -- [ ] `internal/controller/nodedeployment/networking.go` — External Service, HTTPRoute, AuthorizationPolicy -- [ ] Controller SA auto-injection into AuthorizationPolicy -- [ ] DeletionPolicy logic (delete vs orphan, covers nodes + networking) -- [ ] Status conditions for each networking resource -- [ ] LB ingress reporting -- [ ] Unit tests for resource generation -- [ ] Integration tests - -### Phase 3: Monitoring + Samples + Documentation -- [ ] `internal/controller/nodedeployment/monitoring.go` — ServiceMonitor -- [ ] Sample manifests (Istio pattern, ALB pattern, NLB pattern) -- [ ] Documentation (external-dns prerequisites, Istio prerequisites) -- [ ] Printer columns on SeiNodeDeployment -- [ ] Update `production-deployment-analysis.md` gap table - ---- - -## 13. 
Resolved Questions - -| Question | Resolution | -|----------|-----------| -| Controller SA in AuthorizationPolicy | Auto-injected (Section 7.4). The controller always adds its own SA to prevent sidecar communication being blocked. | -| Scaling-up vs degraded phase | Differentiated (Section 7.6). `Initializing` = some nodes progressing. `Degraded` = some nodes failed. `NodesReady` condition provides detail. | -| Label propagation to pods | Resolved via `spec.podLabels` prerequisite (Section 4). | -| Label merge order | System labels overwrite user labels (Section 6). | -| Child SeiNode GC on group deletion | `DeletionPolicy` covers child SeiNodes (Section 7.7). `Retain` orphans everything. | - -## 14. Future Scope (explicitly not blocked) - -| Feature | Why deferred | How current design accommodates | -|---------|-------------|--------------------------------| -| **Rolling update strategy** | Whole feature in itself; needs careful design around archive node sync times | `ensureSeiNode` updates one node at a time naturally; adding `maxUnavailable` / ordered rollout is additive to the reconcile loop | -| **Heterogeneous groups** | Current use case is homogeneous replicas | Separate SeiNodeDeployments with different templates can share an external Service via matching labels. A future `overrides` per-ordinal field is additive. | -| **WAF provisioning from K8s** | WAF WebACL is a platform concern | Annotation passthrough makes WAF ARN a two-way door. AWS ACK or Crossplane can manage the WebACL separately. 
| -| **Gateway API route rules** | HTTPRoute with no explicit rules routes all traffic to the Service | A `rules` field can be added to `GatewayRouteConfig` without breaking existing manifests | -| **Multi-listener Gateway** | Current design targets a single Gateway listener | `GatewayParentRef` can be extended with `sectionName *string` for listener targeting | -| **SeiNodePool + SeiNodeDeployment unification** | Different lifecycle needs (genesis vs fleet) | Both create SeiNodes but don't share children. Could merge in a future major version. | diff --git a/docs/design-progressive-rollout.md b/docs/design-progressive-rollout.md deleted file mode 100644 index 01235b23..00000000 --- a/docs/design-progressive-rollout.md +++ /dev/null @@ -1,791 +0,0 @@ -# Design: Automated Progressive Rollout — EC2 to K8s RPC Migration - -## Overview - -Automated, zero-manual-intervention migration of Sei RPC traffic from EC2 to Kubernetes using Route53 weighted routing, in-cluster load generation, and a confidence-score-driven progression loop. - -### Architecture (simplified from previous Istio-centric design) - -``` -DNS: rpc.pacific-1.sei.io - | - Route53 weighted record set - / \ -EC2 ALB K8s NLB -(weight: W_ec2) (weight: W_k8s) - | | -EC2 RPC nodes Istio IngressGateway - | - K8s Service - (SeiNodeDeployment) -``` - -Key simplification: Istio fronts K8s only. It does not sit in the EC2 path. Traffic splitting is done at DNS level via Route53 weighted record sets. This avoids ServiceEntry complexity, mTLS termination issues to EC2, and keeps the EC2 path completely unchanged during migration. - -### Trade-offs vs. 
Istio-only weights - -| Factor | Route53 weighted | Istio VirtualService | -|--------|-----------------|---------------------| -| EC2 path impact | None | Must route through mesh | -| Rollback speed | ~60s (DNS TTL) | ~1s (Envoy push) | -| Split granularity | Per-DNS-resolution | Per-request | -| Client caching | Some clients cache DNS | No client caching | -| Complexity | Low (aws CLI) | Medium (ServiceEntry + DestinationRule) | - -For blockchain RPC clients, DNS caching is a real concern. Mitigations: (1) set TTL to 10s on the weighted records, (2) the progression holds at each step for hours, so transient DNS caching does not affect steady-state measurements. The simplicity wins. - ---- - -## 1. Automated Weight Progression - -### Tool choice: CronJob + shell script (not a custom controller) - -A custom Go controller is overkill for a one-time migration. Argo Rollouts and Flagger both assume they own the rollout object (Deployment/Rollout) and are designed for in-cluster traffic splitting, not Route53 manipulation. The right tool is a Kubernetes CronJob running a shell script that: - -1. Queries Prometheus for the confidence score -2. Evaluates gate conditions -3. Calls `aws route53 change-resource-record-sets` to adjust weights -4. Posts status to a Slack webhook - -This runs as a CronJob with `schedule: "*/5 * * * *"` (every 5 minutes). The script is idempotent: if conditions are not met, it does nothing. If conditions are met and the current weight is below the next step, it advances. - -### Implementation - -Container image: Alpine + `aws-cli` + `curl` + `jq`. No custom Go code. 
- -```yaml -apiVersion: batch/v1 -kind: CronJob -metadata: - name: rpc-migration-controller - namespace: sei-infra -spec: - schedule: "*/5 * * * *" - concurrencyPolicy: Forbid - jobTemplate: - spec: - backoffLimit: 0 - template: - spec: - serviceAccountName: rpc-migration-controller - containers: - - name: controller - image: amazon/aws-cli:2.15 - command: ["/bin/bash", "/scripts/progress.sh"] - env: - - name: PROMETHEUS_URL - value: "http://prometheus.monitoring:9090" - - name: HOSTED_ZONE_ID - value: "Z0123456789ABCDEF" - - name: RECORD_NAME - value: "rpc.pacific-1.sei.io" - - name: EC2_ALB_DNS - value: "ec2-rpc-alb-123456.us-east-1.elb.amazonaws.com" - - name: K8S_NLB_DNS - value: "k8s-rpc-nlb-789012.us-east-1.elb.amazonaws.com" - - name: SLACK_WEBHOOK_URL - valueFrom: - secretKeyRef: - name: rpc-migration-secrets - key: slack-webhook-url - - name: WEIGHT_STEPS - value: "0,1,10,50,100" - - name: MIN_HOLD_MINUTES - value: "240" # 4 hours at each step - - name: ROLLBACK_THRESHOLD - value: "40" # confidence score below this triggers rollback - volumeMounts: - - name: scripts - mountPath: /scripts - volumes: - - name: scripts - configMap: - name: rpc-migration-scripts - restartPolicy: Never -``` - -### The progression script (`progress.sh`) - -Core logic (pseudocode — the real script is straightforward bash): - -```bash -#!/bin/bash -set -euo pipefail - -STEPS=(${WEIGHT_STEPS//,/ }) -CURRENT_K8S_WEIGHT=$(get_current_route53_weight "k8s") -CONFIDENCE=$(query_prometheus_confidence_score) -LAST_CHANGE_TIME=$(get_annotation_from_configmap "last-weight-change") -MINUTES_AT_CURRENT=$(minutes_since "$LAST_CHANGE_TIME") - -# Rollback check — runs before progression -if (( CONFIDENCE < ROLLBACK_THRESHOLD )) && (( CURRENT_K8S_WEIGHT > 0 )); then - previous_step=$(find_previous_step "$CURRENT_K8S_WEIGHT") - set_route53_weight "$previous_step" - notify_slack ":rotating_light: ROLLBACK: confidence=$CONFIDENCE, weight $CURRENT_K8S_WEIGHT -> $previous_step" - exit 0 -fi - -# 
Progression check -if (( MINUTES_AT_CURRENT < MIN_HOLD_MINUTES )); then - echo "Holding at weight=$CURRENT_K8S_WEIGHT for $MINUTES_AT_CURRENT/$MIN_HOLD_MINUTES minutes" - exit 0 -fi - -next_step=$(find_next_step "$CURRENT_K8S_WEIGHT") -if [[ -z "$next_step" ]]; then - echo "Already at final weight. Migration complete." - exit 0 -fi - -# Gate: confidence must be above threshold for progression -if (( CONFIDENCE >= 80 )); then - set_route53_weight "$next_step" - record_change_time - notify_slack ":white_check_mark: PROGRESS: confidence=$CONFIDENCE, weight $CURRENT_K8S_WEIGHT -> $next_step" -else - echo "Confidence=$CONFIDENCE below 80, holding at weight=$CURRENT_K8S_WEIGHT" -fi -``` - -State is stored in a ConfigMap (`rpc-migration-state`) with keys: -- `current-k8s-weight`: redundant with Route53 but avoids API calls for reads -- `last-weight-change`: ISO 8601 timestamp -- `rollback-count`: number of rollbacks (alarm if > 2) - -### Route53 weight mechanics - -Route53 weighted records use relative weights, not percentages. To achieve "1% K8s": - -| Step | K8s weight | EC2 weight | Effective K8s % | -|------|-----------|-----------|----------------| -| 0% | 0 | 100 | 0% | -| 1% | 1 | 99 | ~1% | -| 10% | 10 | 90 | ~10% | -| 50% | 50 | 50 | 50% | -| 100% | 100 | 0 | 100% | - -TTL on both records: 10 seconds. This is the minimum practical TTL for Route53 and ensures DNS resolvers pick up weight changes within seconds. - -### Timing and cadence - -| Step | Min hold | Rationale | -|------|----------|-----------| -| 0% -> 1% | Requires passing load test (see section 2) | First real traffic | -| 1% -> 10% | 4 hours | Detect issues at low blast radius | -| 10% -> 50% | 4 hours | Significant traffic, covers edge cases | -| 50% -> 100% | 12 hours (overnight) | Full confidence before cutover | - -The 4-hour hold is configurable via `MIN_HOLD_MINUTES`. 
The 50% -> 100% step uses a longer hold, implemented as a special case in the script (check if current step is 50, use 720 minutes). - ---- - -## 2. Load Generation for 0% Phase - -### Tool choice: k6 - -k6 over vegeta or custom Go: -- Native JavaScript scripting for complex RPC query patterns -- Built-in Prometheus remote write (metrics go straight to our stack) -- Thresholds that can fail the test programmatically -- Runs as a Kubernetes Job, no persistent infrastructure -- Handles ramping, stages, and per-endpoint breakdowns natively - -### Deriving traffic patterns from EC2 ALB access logs - -Before writing k6 scripts, extract the real query distribution: - -```bash -# Enable ALB access logging to S3 if not already enabled -# Then analyze the logs: - -# 1. Download recent access logs (24h sample) -aws s3 sync s3://sei-infra-alb-logs/AWSLogs/.../elasticloadbalancing/ ./alb-logs/ \ - --exclude "*" --include "*.log.gz" - -# 2. Extract RPC method distribution -zcat alb-logs/*.gz | \ - awk -F'"' '{print $2}' | \ # extract request field - grep -oP '(GET|POST) [^ ]+' | \ # method + path - sort | uniq -c | sort -rn | head -30 - -# 3. For JSON-RPC POST bodies, enable ALB request logging or -# sample from application-level logs on EC2 nodes. -# CometBFT logs the method in its access log. 
-``` - -Expected distribution for a Sei RPC node (typical from sei-infra patterns): - -| Method | Approx % | Type | -|--------|---------|------| -| `abci_query` (bank balances, wasm state) | 35% | POST JSON-RPC | -| `block` / `block_results` | 20% | GET or POST | -| `tx_search` | 15% | GET | -| `status` | 10% | GET | -| `eth_call` (EVM JSON-RPC) | 8% | POST JSON-RPC on :8545 | -| `eth_getBlockByNumber` | 5% | POST JSON-RPC on :8545 | -| `broadcast_tx_sync` | 3% | POST (skip in load test) | -| `validators` / `consensus_state` | 2% | GET | -| Other | 2% | Mixed | - -### k6 load test script - -```javascript -// k6-rpc-load-test.js -import http from 'k6/http'; -import { check, sleep } from 'k6'; -import { Counter, Trend } from 'k6/metrics'; - -const RPC_URL = __ENV.RPC_URL || 'http://k8s-rpc-svc.sei.svc.cluster.local:26657'; -const EVM_URL = __ENV.EVM_URL || 'http://k8s-rpc-svc.sei.svc.cluster.local:8545'; - -const errorCount = new Counter('rpc_errors'); -const blockHeightLag = new Trend('block_height_lag'); - -// Ramping profile: warm up over 10 min, hold at target for 1h, cool down -export const options = { - stages: [ - { duration: '5m', target: 50 }, // warm up - { duration: '10m', target: 200 }, // ramp to target - { duration: '60m', target: 200 }, // sustained load - { duration: '5m', target: 0 }, // cool down - ], - thresholds: { - 'http_req_failed': ['rate<0.01'], // <1% errors - 'http_req_duration{method:status}': ['p(99)<500'], // p99 < 500ms - 'http_req_duration{method:abci_query}': ['p(99)<2000'], - 'rpc_errors': ['count<50'], - }, -}; - -// Weighted method selection matching real traffic distribution -const methods = [ - { weight: 35, fn: abciQuery }, - { weight: 20, fn: blockQuery }, - { weight: 15, fn: txSearch }, - { weight: 10, fn: statusQuery }, - { weight: 8, fn: ethCall }, - { weight: 5, fn: ethGetBlock }, - { weight: 5, fn: validatorsQuery }, - // broadcast_tx intentionally excluded -]; - -const totalWeight = methods.reduce((sum, m) => sum + 
m.weight, 0); - -export default function () { - const rand = Math.random() * totalWeight; - let cumulative = 0; - for (const m of methods) { - cumulative += m.weight; - if (rand < cumulative) { - m.fn(); - break; - } - } - sleep(0.1); // 100ms think time -} - -function statusQuery() { - const res = http.get(`${RPC_URL}/status`, { tags: { method: 'status' } }); - check(res, { 'status 200': (r) => r.status === 200 }); - if (res.status !== 200) errorCount.add(1); - - // Track block height lag vs EC2 - if (res.status === 200) { - try { - const height = parseInt(res.json().result.sync_info.latest_block_height, 10); - // This script defines no setup() hook, so globalThis.ec2Height is only - // available if the test harness injects it. Skip the sample when it is - // missing instead of recording NaN. - const ec2Height = Number(globalThis.ec2Height); - if (Number.isFinite(ec2Height)) { - blockHeightLag.add(Math.abs(height - ec2Height)); - } - } catch (e) { /* ignore parse errors */ } - } -} - -function abciQuery() { - const payload = JSON.stringify({ - jsonrpc: '2.0', id: 1, method: 'abci_query', - params: { - path: '/cosmos.bank.v1beta1.Query/AllBalances', - data: '', // empty query = recent state - height: '0', prove: false, - }, - }); - const res = http.post(RPC_URL, payload, { - headers: { 'Content-Type': 'application/json' }, - tags: { method: 'abci_query' }, - }); - if (res.status !== 200) errorCount.add(1); -} - -function blockQuery() { - const res = http.get(`${RPC_URL}/block`, { tags: { method: 'block' } }); - if (res.status !== 200) errorCount.add(1); -} - -function txSearch() { - // Search for recent transactions (last 100 blocks) - const res = http.get( - `${RPC_URL}/tx_search?query="tx.height>0"&per_page=10&page=1&order_by="desc"`, - { tags: { method: 'tx_search' } } - ); - if (res.status !== 200) errorCount.add(1); -} - -function ethCall() { - const payload = JSON.stringify({ - jsonrpc: '2.0', id: 1, method: 'eth_call', - params: [{ to: '0x0000000000000000000000000000000000001002', data: '0x' }, 'latest'], - }); - const res = http.post(EVM_URL, payload, { - headers: { 'Content-Type': 'application/json' }, - tags: { method: 'eth_call' }, - }); - if (res.status !== 
200) errorCount.add(1); -} - -function ethGetBlock() { - const payload = JSON.stringify({ - jsonrpc: '2.0', id: 1, method: 'eth_getBlockByNumber', - params: ['latest', false], - }); - const res = http.post(EVM_URL, payload, { - headers: { 'Content-Type': 'application/json' }, - tags: { method: 'eth_getBlockByNumber' }, - }); - if (res.status !== 200) errorCount.add(1); -} - -function validatorsQuery() { - const res = http.get(`${RPC_URL}/validators`, { tags: { method: 'validators' } }); - if (res.status !== 200) errorCount.add(1); -} -``` - -### k6 Kubernetes Job - -```yaml -apiVersion: batch/v1 -kind: Job -metadata: - name: rpc-load-test - namespace: sei-infra -spec: - backoffLimit: 0 - template: - spec: - containers: - - name: k6 - image: grafana/k6:0.49.0 - command: ["k6", "run", "--out", "experimental-prometheus-rw", "/scripts/load-test.js"] - env: - - name: K6_PROMETHEUS_RW_SERVER_URL - value: "http://prometheus.monitoring:9090/api/v1/write" - - name: K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM - value: "true" - - name: RPC_URL - value: "http://sei-rpc-pacific-1.sei:26657" - - name: EVM_URL - value: "http://sei-rpc-pacific-1.sei:8545" - volumeMounts: - - name: scripts - mountPath: /scripts - volumes: - - name: scripts - configMap: - name: k6-rpc-load-test - restartPolicy: Never -``` - -### Duration and intensity - -| Parameter | Value | Rationale | -|-----------|-------|-----------| -| Ramp-up | 15 minutes to 200 VUs | Avoid cold-start spike | -| Sustained | 60 minutes at 200 VUs | ~2000 req/s, matches EC2 ALB peak | -| Pass criteria | All k6 thresholds green | Automated gate | -| Runs required | 3 consecutive passes | Eliminate flakes | - -The target RPS (200 VUs * ~10 req/s/VU = 2000 req/s) should match or exceed the p95 traffic level observed on the EC2 ALB. Adjust VU count based on actual ALB CloudWatch `RequestCount` metrics. - ---- - -## 3. 
Observability During Rollout - -### Metrics sources - -| Metric | K8s source | EC2 source | -|--------|-----------|-----------| -| Request rate | Istio `istio_requests_total` | CloudWatch ALB `RequestCount` | -| Error rate | Istio `istio_requests_total{response_code=~"5.."}` | CloudWatch ALB `HTTPCode_Target_5XX_Count` | -| Latency p50/p99 | Istio `istio_request_duration_milliseconds` | CloudWatch ALB `TargetResponseTime` | -| Block height | Prometheus scraping K8s seid `:26657/status` | CloudWatch custom metric or Prometheus remote-write agent on EC2 | -| Pod health | `sei_controller_seinode_phase` | N/A | - -### EC2 metrics strategy - -Two options, prefer option A for simplicity: - -**Option A: CloudWatch only (recommended)** - -EC2 ALB already publishes metrics to CloudWatch. Use the `yet-another-cloudwatch-exporter` (YACE) running in K8s to scrape CloudWatch metrics into Prometheus. This avoids touching EC2 infrastructure. - -```yaml -# yace config for EC2 ALB metrics -discovery: - jobs: - - type: AWS/ApplicationELB - regions: [us-east-1] - searchTags: - - key: Name - value: sei-rpc-* - metrics: - - name: RequestCount - statistics: [Sum] - period: 60 - - name: TargetResponseTime - statistics: [p50, p99] - period: 60 - - name: HTTPCode_Target_5XX_Count - statistics: [Sum] - period: 60 -``` - -**Option B: Prometheus agent on EC2** - -Run `prometheus-agent` on one EC2 node, remote-write to the K8s Prometheus. More accurate latency data but requires EC2 changes. Only do this if CloudWatch TargetResponseTime granularity is insufficient. - -### Block height comparison - -The existing controller already uses CometBFT `/status` to track `latest_block_height` and `catching_up`. 
For the migration dashboard, add a recording rule that computes lag: - -```yaml -groups: -- name: rpc-migration - rules: - # K8s block height (from seid metrics or ServiceMonitor scrape) - - record: sei:rpc:block_height:k8s - expr: max(cometbft_consensus_latest_block_height{namespace="sei"}) - - # EC2 block height (scraped via YACE custom metric or a simple curl probe) - - record: sei:rpc:block_height:ec2 - expr: sei_ec2_rpc_block_height - - - record: sei:rpc:block_height_lag - expr: abs(sei:rpc:block_height:k8s - sei:rpc:block_height:ec2) -``` - -For the EC2 block height, deploy a simple CronJob that curls the EC2 ALB `/status` endpoint every 15s and pushes the height to Prometheus via pushgateway or a `/metrics` endpoint: - -```yaml -apiVersion: batch/v1 -kind: CronJob -metadata: - name: ec2-height-probe - namespace: sei-infra -spec: - schedule: "* * * * *" # every minute (finest CronJob granularity) - jobTemplate: - spec: - template: - spec: - containers: - - name: probe - image: curlimages/curl:8.7.1 - command: - - /bin/sh - - -c - - | - HEIGHT=$(curl -sf http://EC2_ALB_DNS:26657/status | \ - jq -r '.result.sync_info.latest_block_height') - curl -sf -X POST "http://pushgateway.monitoring:9091/metrics/job/ec2-height-probe" \ - --data-binary "sei_ec2_rpc_block_height $HEIGHT" - restartPolicy: Never -``` - -Better alternative: use a Prometheus Blackbox Exporter probe target that hits the EC2 ALB `/status` and parses height. Configure as a ServiceMonitor probe. 
- -### Dashboard layout (Grafana) - -Single dashboard: **"RPC Migration: EC2 vs K8s"** - -``` -Row 1: Migration Status - [Current Weight: 10% K8s] [Confidence Score: 87] [Time at Step: 2h 15m] [Rollback Count: 0] - -Row 2: Traffic Split (stacked time series) - [Requests/sec — EC2 vs K8s, stacked area] [Error Rate — EC2 vs K8s, line] - -Row 3: Latency Comparison - [p50 Latency — EC2 vs K8s, line] [p99 Latency — EC2 vs K8s, line] - -Row 4: Chain Health - [Block Height — EC2 vs K8s, line] [Block Height Lag, line with threshold at 5] - -Row 5: K8s Platform Health - [SeiNode Phase, state timeline] [Pod Restarts, bar] [Istio 5xx Rate, line] - -Row 6: Load Test Results (only during 0% phase) - [k6 RPS, line] [k6 Error Rate, line] [k6 p99 Latency, line] -``` - -The dashboard JSON will be stored in a ConfigMap and auto-provisioned via Grafana's sidecar provisioner. - ---- - -## 4. The Confidence Score - -### Definition - -A single number from 0 to 100 representing migration readiness. Computed every 5 minutes as a Prometheus recording rule. The progression script queries this single metric. - -### Formula - -``` -confidence = w_uptime * S_uptime - + w_error * S_error - + w_latency * S_latency - + w_height * S_height - + w_load * S_load -``` - -Where each sub-score is 0-100 and weights sum to 1.0: - -| Component | Weight | Score function | 100 (perfect) | 0 (fail) | -|-----------|--------|---------------|----------------|----------| -| `S_uptime` | 0.20 | K8s pod uptime over last 1h | 100% uptime | <95% uptime | -| `S_error` | 0.30 | Error rate comparison | K8s error rate <= EC2 | K8s error rate > 5x EC2 | -| `S_latency` | 0.25 | p99 latency parity | K8s p99 <= 1.1x EC2 | K8s p99 > 2x EC2 | -| `S_height` | 0.15 | Block height parity | Lag <= 1 block | Lag > 10 blocks | -| `S_load` | 0.10 | Load test pass (0% phase only) | All thresholds pass | Any threshold fail | - -Error rate gets the highest weight because RPC correctness is the primary concern. 
Latency is second because blockchain clients are sensitive to query timeouts. - -### Prometheus recording rules - -```yaml -groups: -- name: rpc-migration-confidence - interval: 60s - rules: - # Sub-score: uptime (fraction of time pods were Ready in last 1h) - - record: sei:migration:score:uptime - expr: | - clamp( - 100 * ( - 1 - (sum(rate(kube_pod_status_ready{namespace="sei",condition="false"}[1h])) - / max(sum(kube_pod_status_ready{namespace="sei"}), 1)) - ), - 0, 100 - ) - - # Sub-score: error rate - # 100 when K8s error rate <= EC2, linear decay to 0 when K8s >= 5x EC2 - - record: sei:migration:score:error_rate - expr: | - clamp( - 100 * (1 - clamp_min( - ( - sum(rate(istio_requests_total{reporter="destination",namespace="sei",response_code=~"5.."}[10m])) - / clamp_min(sum(rate(istio_requests_total{reporter="destination",namespace="sei"}[10m])), 0.001) - ) - / - clamp_min( - (sei:ec2:error_rate_5xx OR on() vector(0.001)), - 0.001 - ) - - 1, 0 - ) / 4), - 0, 100 - ) - - # Sub-score: latency parity - # 100 when K8s p99 <= 1.1x EC2, linear decay to 0 when >= 2x EC2 - - record: sei:migration:score:latency - expr: | - clamp( - 100 * (1 - clamp_min( - histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="sei"}[10m])) by (le)) - / clamp_min(sei:ec2:latency_p99_ms, 1) - - 1.1, 0 - ) / 0.9), - 0, 100 - ) - - # Sub-score: block height parity - # 100 when lag <= 1, linear decay to 0 when lag >= 10 - - record: sei:migration:score:block_height - expr: | - clamp( - 100 * (1 - clamp_min(sei:rpc:block_height_lag - 1, 0) / 9), - 0, 100 - ) - - # Sub-score: load test (set by k6 job completion) - # Stored in pushgateway: 100 if last test passed, 0 if failed - - record: sei:migration:score:load_test - expr: sei_migration_load_test_score OR on() vector(0) - - # Composite confidence score - - record: sei:migration:confidence_score - expr: | - 0.20 * sei:migration:score:uptime - + 0.30 * sei:migration:score:error_rate - + 
0.25 * sei:migration:score:latency - + 0.15 * sei:migration:score:block_height - + 0.10 * sei:migration:score:load_test -``` - -### Score interpretation - -| Range | Meaning | Automation action | -|-------|---------|-------------------| -| 80-100 | Green: all signals nominal | Advance to next weight step | -| 60-79 | Yellow: minor degradation | Hold at current step, alert to Slack | -| 40-59 | Orange: significant issues | Hold, page on-call | -| 0-39 | Red: rollback conditions | Automatic rollback to previous step | - -### Rollback conditions (any one triggers) - -These are evaluated independently from the composite score as hard circuit breakers: - -1. **Error rate spike**: K8s 5xx rate > 5% for 5 consecutive minutes -2. **Latency regression**: K8s p99 > 3x EC2 p99 for 10 consecutive minutes -3. **Block height stall**: K8s block height not advancing for 2 minutes -4. **Pod failure**: SeiNodeDeployment phase == Degraded or Failed - -The progression script checks these independently: - -```bash -check_circuit_breakers() { - # Hard error rate check - error_rate=$(promql 'sum(rate(istio_requests_total{namespace="sei",response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{namespace="sei"}[5m]))') - if (( $(echo "$error_rate > 0.05" | bc -l) )); then - echo "BREAKER: error_rate=$error_rate > 5%" - return 1 - fi - - # Block height stall - height_change=$(promql 'changes(cometbft_consensus_latest_block_height{namespace="sei"}[2m])') - if (( $(echo "$height_change < 1" | bc -l) )); then - echo "BREAKER: block height stalled" - return 1 - fi - - # SeiNodeDeployment health - group_phase=$(kubectl get sng -n sei -o jsonpath='{.items[0].status.phase}') - if [[ "$group_phase" == "Failed" || "$group_phase" == "Degraded" ]]; then - echo "BREAKER: SeiNodeDeployment phase=$group_phase" - return 1 - fi - - return 0 -} -``` - ---- - -## 5. 
Timeline - -### Prerequisites (before clock starts) - -- SeiNodeDeployment deployed, all nodes synced (`catching_up: false`) -- Istio sidecar injection on, Gateway + HTTPRoute working -- ServiceMonitor scraping K8s nodes -- YACE exporting EC2 ALB CloudWatch metrics -- EC2 block height probe running -- Grafana dashboard provisioned -- Confidence score recording rules deployed -- CronJob + scripts deployed (weight at 0%) - -### Automated progression timeline - -| Day | Phase | K8s Weight | Activity | -|-----|-------|-----------|----------| -| 0 | Load test | 0% | k6 Job runs 3x against K8s-internal endpoint. Automated: passes or fails. | -| 0-1 | Soak at 0% | 0% | Confidence score baking. ExportAndCompare running. Score must reach 80+. | -| 1 | First traffic | 1% | Automation advances when score >= 80 and load tests pass. Hold 4h. | -| 1-2 | Low canary | 10% | Automation advances. Hold 4h. | -| 2-3 | Mid canary | 50% | Hold overnight (12h minimum). | -| 3-4 | Full cutover | 100% | EC2 weight = 0. EC2 stays running as hot standby. | -| 4-7 | Hot standby | 100% | EC2 keeps syncing. Manual decommission after 72h of 100% K8s. | -| 7 | Decommission | 100% | Remove Route53 weighted records, switch to simple alias. Terminate EC2. | - -**Total: 7 days from "nodes synced" to "EC2 decommissioned".** - -This compresses the original 6-week plan to 1 week because: -1. No manual gate reviews (automation drives progression) -2. No shadow traffic phase (k6 load test replaces Istio mirroring) -3. Confidence score eliminates waiting for human judgment -4. Route53 is operationally simpler than Istio VirtualService weight management - -The 7-day timeline can extend automatically if the confidence score drops below threshold at any step. The automation holds until conditions improve, with no human intervention needed. - -### When human intervention IS required - -The system pages on-call (PagerDuty) for: -1. More than 2 rollbacks at the same weight step (potential systemic issue) -2. 
Confidence score below 40 for more than 1 hour -3. Complete K8s cluster failure (all pods down) - -These scenarios indicate problems that a weight-shifting automation cannot fix. - ---- - -## 6. File Layout - -``` -manifests/samples/migration/ - cronjob.yaml # Migration controller CronJob - configmap-scripts.yaml # progress.sh and helper functions - configmap-state.yaml # Mutable state (current weight, timestamps) - rbac.yaml # ServiceAccount + IAM role for Route53 - k6-load-test-job.yaml # Load test Job template - k6-load-test-configmap.yaml # k6 script - ec2-height-probe.yaml # Block height comparison CronJob - yace-config.yaml # CloudWatch exporter for EC2 ALB metrics - prometheus-rules.yaml # Recording rules for confidence score - grafana-dashboard-configmap.yaml # Migration dashboard JSON -``` - -No custom Go code. No new CRDs. No Argo/Flagger dependencies. The entire system is: -- 1 CronJob (progression controller) -- 1 Job (k6 load test, run on demand) -- 1 CronJob (EC2 height probe) -- 1 Deployment (YACE, likely already running) -- Recording rules + dashboard (declarative config) - ---- - -## 7. Relation to Existing Infrastructure - -### What changes in sei-k8s-controller: nothing - -The controller already provisions: -- `SeiNodeDeployment` with `networking.service` (type: LoadBalancer) -- this creates the K8s NLB -- `networking.gateway` with HTTPRoute -- Istio routes internal mesh traffic -- `monitoring.serviceMonitor` -- Prometheus scrapes K8s nodes -- `NetworkingStatus.LoadBalancerIngress` -- reports the NLB address - -The migration automation reads these outputs but does not modify the controller. The Route53 records are managed externally by the CronJob, pointing at the NLB address from `NetworkingStatus.LoadBalancerIngress`. - -### What changes in EC2: nothing - -EC2 ALB continues serving traffic at its current DNS name. Route53 weighted records sit above both the ALB and NLB. EC2 infrastructure is untouched until decommission day. 
- -### What changes in DNS - -Before migration: -``` -rpc.pacific-1.sei.io -> ALIAS -> EC2 ALB -``` - -During migration: -``` -rpc.pacific-1.sei.io -> WEIGHTED (TTL=10s) - ├── SetId=ec2, Weight=W, ALIAS -> EC2 ALB - └── SetId=k8s, Weight=W, ALIAS -> K8s NLB -``` - -After migration: -``` -rpc.pacific-1.sei.io -> ALIAS -> K8s NLB -``` diff --git a/docs/design-rpc-migration-istio.md b/docs/design-rpc-migration-istio.md deleted file mode 100644 index abe7fb48..00000000 --- a/docs/design-rpc-migration-istio.md +++ /dev/null @@ -1,119 +0,0 @@ -# Design: EC2-to-K8s RPC Node Migration via Istio Traffic Mirroring - -## Overview - -Progressive migration of Sei blockchain RPC infrastructure from EC2 to sei-k8s-controller-managed nodes using Istio traffic mirroring and weighted routing. - -## Architecture - -``` -DNS: rpc.pacific-1.sei.io - | - AWS NLB (L4) - | - Istio IngressGateway - (Envoy, in-mesh) - | - +------+------+ - | | -[primary] [mirror] - | | -ServiceEntry K8s Service -(EC2 ALB) (SeiNodeDeployment) -``` - -Replace ALB with NLB + Istio Gateway. The gateway terminates L7 and applies VirtualService routing. EC2 is the primary backend; K8s receives mirrored (fire-and-forget) traffic. - -## Istio Manifests - -All at `manifests/samples/istio/pacific-1-rpc-mirror/`: - -- **`service-entry.yaml`** — EC2 ALB as `ec2-rpc.pacific-1.internal`, DNS resolution, ports 26657/8545/9090 -- **`destination-rule.yaml`** — mTLS disabled to EC2 (outside mesh), HTTP/2 upgrade disabled (CometBFT is HTTP/1.1), outlier detection -- **`virtual-service.yaml`** — Phase 2 mirror config: 100% to EC2, 100% mirror to K8s. 
WebSocket routes to EC2 only (Istio cannot mirror WebSocket) -- **`virtual-service-cutover.yaml`** — Phase 3 template: weighted routing between EC2 and K8s -- **`peer-authentication.yaml`** — STRICT mTLS on K8s RPC pods -- **`telemetry.yaml`** — Access logging for error/latency analysis - -## Migration Phases - -### Phase 0: Isolated Validation (Week 1-2) -- Deploy SeiNodeDeployment for RPC, sync from S3 snapshot -- Enable `ExportAndCompare` with `canonicalRpc` pointing at EC2 -- Run 48h with zero app-hash divergence (Layer 0 + Layer 1) -- Validate all alerts fire correctly -- **Blast radius: zero** - -### Phase 1: Synthetic Load (Week 2) -- Replay recorded RPC queries against both EC2 and K8s -- Compare responses byte-for-byte (normalize node ID, peer list) -- Deploy synthetic WebSocket client subscribing to NewBlock on both -- **Gate: 100% response parity, latency within 20%** - -### Phase 2: Shadow Traffic (Week 3) -- Point DNS at NLB, VirtualService mirrors 100% to K8s -- Responses discarded — clients only see EC2 -- Monitor: mirror acceptance rate, K8s error rate, latency delta, block height lag -- **Gate: 48h clean metrics** -- **Rollback: DNS back to ALB (60s)** - -### Phase 3: Canary (Week 4) -- Switch from mirror to weighted routing: 1% → 5% → 10% → 25% → 50% -- Hold 4h minimum at each step, overnight at 50% -- Exclude `/broadcast_tx*` from early canary (add after 10%) -- **Gate: 24h at 50% with no degradation** -- **Rollback: Set K8s weight to 0 (seconds)** - -### Phase 4: Full Cutover (Week 5) -- 100% to K8s, EC2 hot standby -- Keep EC2 syncing for 48h -- **Rollback: Set EC2 weight to 100 (seconds)** - -### Phase 5: Decommission (Week 6) -- Remove ServiceEntry, comparison CronJob -- Decommission EC2 instances -- Optionally simplify ingress (remove Istio gateway if not needed long-term) - -## Key Design Decisions - -### Istio route weights over DNS -DNS caching by blockchain clients makes Route53 splits non-deterministic. 
Istio applies weights per-request at the proxy, with immediate propagation and rollback in seconds. - -### WebSocket handled separately -Istio cannot mirror WebSocket (persistent bidirectional stream). HTTP RPC is mirrored; WebSocket gets weighted routing during cutover. The sidecar `ExportAndCompare` validates execution correctness independently. - -### STRICT mTLS on K8s pods -EC2 traffic enters through the ingress gateway (which terminates external TLS and originates mTLS). No reason for non-mTLS traffic to reach K8s pods directly. - -### ExportAndCompare is the correctness oracle -Istio mirroring provides realistic query load. But execution correctness is validated by the sidecar's block-by-block Layer 0/Layer 1 comparison, which uploads DivergenceReport artifacts to S3. Access logs and metrics are supporting signals, not source of truth. - -### Do not mirror /broadcast_tx -Mirroring write endpoints would double-broadcast transactions. Mempool dedup handles it, but it wastes resources and creates confusing logs. 
- -## Confidence Criteria - -| Category | Signal | Pass | -|----------|--------|------| -| Chain correctness | App-hash agreement (L0+L1) | Zero divergences over 10k blocks | -| Chain correctness | Block height parity | Within 2 blocks of EC2 | -| Performance | RPC latency p99 | Within 20% of EC2 | -| Operations | Automated pod recovery | Recovers in < 5 min | -| Operations | Blue-green deployment | Works without manual steps | -| Data plane | Gateway healthy | ConditionNetworkingReady == True | - -## Prerequisites Checklist - -- [ ] SeiNodeDeployment for RPC deployed, all nodes synced (`catching_up: false`) -- [ ] Istio sidecar injection enabled on RPC namespace -- [ ] Gateway + ServiceEntry deployed, reachable from mesh -- [ ] ExportAndCompare running 48h with zero divergence -- [ ] Monitoring: block height lag alert, comparison divergence alert, gateway error rate alert -- [ ] Dashboard: chain health + traffic + operator health -- [ ] DNS TTL lowered to 60s -- [ ] Rollback procedure documented and rehearsed -- [ ] On-call briefed on the migration - -## Timeline - -6 weeks (compressible to 4). Do not compress below 4 — the canary ramp alone needs 5-7 days. diff --git a/docs/design-seinode-import-volume-lld.md b/docs/design-seinode-import-volume-lld.md deleted file mode 100644 index 44df9bc3..00000000 --- a/docs/design-seinode-import-volume-lld.md +++ /dev/null @@ -1,479 +0,0 @@ -# Design: SeiNode — Import Existing Storage (LLD) - -**Status:** Draft / LLD -**Date:** 2026-04-21 -**Tracks:** [#105](https://github.com/sei-protocol/sei-k8s-controller/issues/105) -**Related:** [#104](https://github.com/sei-protocol/sei-k8s-controller/issues/104), [`docs/design-seinode-import-volume.md`](design-seinode-import-volume.md) - -This is the companion LLD to the direction doc. The direction doc fixes *what* and *why*; this doc fixes *how*. Scope is strictly bounded to Shape A (import PVC by name) plus the #104 create-path fix. No new shapes, no new fields, no new use cases. 
- -## 1. CRD schema changes - -A single new optional sub-struct is added to `SeiNodeSpec` in `api/v1alpha1/seinode_types.go`. Field naming follows the spec sketch in the direction doc (`spec.dataVolume.import.pvcName`) and k8s API conventions (lowerCamelCase JSON tags, PascalCase Go fields, no acronyms beyond `PVC`). - -```go -// SeiNodeSpec additions (api/v1alpha1/seinode_types.go) - -// DataVolume configures the data PersistentVolumeClaim for this node. -// When omitted, the controller creates a PVC using the node's mode-default -// storage class and size (see noderesource.DefaultStorageForMode). -// +optional -DataVolume *DataVolumeSpec `json:"dataVolume,omitempty"` - -// DataVolumeSpec configures how the data PVC is sourced. -type DataVolumeSpec struct { - // Import references a pre-existing PersistentVolumeClaim in the same - // namespace as the SeiNode, instead of creating a new one. The - // controller validates the referenced PVC but never mutates it. - // - // When Import is set, the controller never deletes the referenced PVC - // on SeiNode deletion — storage lifecycle is the operator's responsibility. - // +optional - Import *DataVolumeImport `json:"import,omitempty"` -} - -// DataVolumeImport names a pre-existing PVC to adopt as this node's data volume. -type DataVolumeImport struct { - // PVCName is the name of a PersistentVolumeClaim in the SeiNode's - // namespace. The PVC must be Bound, ReadWriteOnce, and sized at or above - // the node mode's default storage size. Immutable after creation. - // - // +kubebuilder:validation:MinLength=1 - // +kubebuilder:validation:MaxLength=253 - // +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` - // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="pvcName is immutable" - PVCName string `json:"pvcName"` -} -``` - -### Why these markers - -- Length + pattern mirror the k8s DNS1123Label constraint on PVC names — catches typos at admission time. 
-- `XValidation: self == oldSelf` makes the reference immutable. Swapping an imported PVC out from under a running SeiNode has no defined semantics; force delete-and-recreate if the operator wants to re-point. -- `DataVolume` as pointer keeps `omitempty` clean so existing SeiNodes serialize identically to today. -- No default `storageClassName`/`size` sub-fields — Shape A is "name a PVC, nothing else" per the direction doc. - -### Regenerated artifacts - -- `zz_generated.deepcopy.go` — new `DeepCopy`/`DeepCopyInto` methods for `DataVolumeSpec` and `DataVolumeImport`; updated `SeiNodeSpec.DeepCopyInto` to copy the pointer. Produced by `make generate`. -- `manifests/sei.io_seinodes.yaml` and `config/crd/sei.io_seinodes.yaml` — new `spec.dataVolume.import.pvcName` subtree with validation constraints. Produced by `make manifests`. - -### Backward compatibility - -Existing SeiNodes serialize `SeiNodeSpec` without the `dataVolume` key. The new CRD schema makes `dataVolume` optional, so old objects remain valid. The controller reads `node.Spec.DataVolume` as nil and takes the create path unchanged. - -### Spec-unset vs. spec-empty - -Per the direction doc's idiomatic-k8s guidance: `spec.dataVolume == nil`, `spec.dataVolume.import == nil`, and `spec.dataVolume.import.pvcName == ""` all mean "no import." The task's branch check is a single helper: - -```go -func importPVCName(node *seiv1alpha1.SeiNode) string { - if node.Spec.DataVolume == nil || node.Spec.DataVolume.Import == nil { - return "" - } - return node.Spec.DataVolume.Import.PVCName -} -``` - -## 2. Task changes to `internal/task/ensure_pvc.go` - -The task splits into two internal paths under one public API. The struct, params, and deserializer stay the same; only `Execute()` and `Status()` change. - -### New structure - -The existing `ensureDataPVCExecution` struct adds two ephemeral fields (`lastReason`, `lastMessage`) for condition propagation within one reconcile. 
The executor re-deserializes the task on every reconcile (executor.go:150), so per-instance state does not survive. No other state is added; the validation path is stateless across reconciles. - -### Execute() — branching structure - -```go -func (e *ensureDataPVCExecution) Execute(ctx context.Context) error { - node, err := ResourceAs[*seiv1alpha1.SeiNode](e.cfg) - if err != nil { - return Terminal(err) - } - - if name := importPVCName(node); name != "" { - return e.executeImport(ctx, node, name) - } - return e.executeCreate(ctx, node) -} -``` - -### Create path (fixes #104) - -```go -func (e *ensureDataPVCExecution) executeCreate(ctx context.Context, node *seiv1alpha1.SeiNode) error { - desired := noderesource.GenerateDataPVC(node, e.cfg.Platform) - if err := ctrl.SetControllerReference(node, desired, e.cfg.Scheme); err != nil { - return Terminal(fmt.Errorf("setting owner reference: %w", err)) - } - - existing := &corev1.PersistentVolumeClaim{} - key := types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace} - switch err := e.cfg.KubeClient.Get(ctx, key, existing); { - case apierrors.IsNotFound(err): - // proceed to Create - case err != nil: - return fmt.Errorf("checking for existing data PVC: %w", err) - default: - // PVC exists. Accept if we own it (crash-recovery); else fail. - if metav1.IsControlledBy(existing, node) { - e.complete() - return nil - } - return Terminal(fmt.Errorf( - "data PVC %q already exists and is not owned by SeiNode %q; "+ - "set spec.dataVolume.import.pvcName to adopt, or delete the PVC", - existing.Name, node.Name)) - } - - if err := e.cfg.KubeClient.Create(ctx, desired); err != nil { - if apierrors.IsAlreadyExists(err) { - // Lost the race with another actor between Get and Create; - // requeue so the next reconcile's Get resolves ownership. 
- return fmt.Errorf("data PVC created concurrently: %w", err) - } - return fmt.Errorf("creating data PVC: %w", err) - } - e.complete() - return nil -} -``` - -Key changes from today's behavior: - -| Before (bug #104) | After | -|---|---| -| `Create()` → swallow `AlreadyExists` as success | `Get()` first; `IsNotFound` is the happy prelude to `Create` | -| Any pre-existing PVC → task Complete | Pre-existing PVC owned by this SeiNode → Complete; otherwise `Terminal` failure | -| No way to distinguish "our crash-recovery" from "someone else's PVC" | `metav1.IsControlledBy` check separates the two | - -### Import path — state machine - -`Status()` remains: - -```go -func (e *ensureDataPVCExecution) Status(_ context.Context) ExecutionStatus { - return e.DefaultStatus() -} -``` - -`executeImport` leaves `e.status == ExecutionRunning` (the `taskBase` default) on transient validation failures and returns `nil`. The executor sees Running and requeues after `TaskPollInterval` (5 s, executor.go:178) — we use this interval as-is. Each reconcile does one Get against the controller-runtime cache, which is a ~free operation; no custom backoff, no state, no arithmetic. - -```go -func (e *ensureDataPVCExecution) executeImport(ctx context.Context, node *seiv1alpha1.SeiNode, name string) error { - reason, msg, state := e.validateImport(ctx, node, name) - recordTransient(node, reason, msg) // writes "reason: msg" into PlannedTask.Error - switch state { - case importValid: - e.complete() - return nil - case importTerminal: - return Terminal(fmt.Errorf("%s: %s", reason, msg)) - default: // transient - return nil - } -} -``` - -`importValid`/`importTransient`/`importTerminal` are internal enums; `validateImport` returns one per requirement. - -### Which reasons are transient vs. 
terminal - -| # | Requirement | Validation failure | State | -|---|---|---|---| -| 1 | PVC exists | `IsNotFound` on Get | transient — operator may be about to apply the PVC | -| 2 | `deletionTimestamp == nil` | PVC being deleted | transient — may resolve if finalizers complete and PVC re-appears under external management (rare, but cheaper to retry than to fail) | -| 3 | `phase == Bound` | `Pending` | transient — binder may complete | -| 3 | `phase == Bound` | `Lost` | terminal — no recovery path for Lost | -| 3 | `phase == Bound` | `Released` | transient — operator may rebind to a new claim | -| 4 | Contains `ReadWriteOnce` | wrong access mode | terminal — PVC spec is immutable for accessModes | -| 5 | `status.capacity.storage >= default` | too small | terminal — a smaller PVC cannot grow to required size without operator action; operator must expand or re-provision | -| 6 | PV `spec.capacity == PVC status.capacity` | mismatch | terminal — indicates misconfigured static PV | -| 7 | PV exists and not `Failed` | PV missing | transient — lookup race during bind | -| 7 | PV exists and not `Failed` | PV `Failed` | terminal — CSI/provisioner declared the volume unusable | - -**Rationale for the transient/terminal split:** the direction doc says "retry indefinitely with exponential backoff." A strict reading would make every failure mode transient. The LLD refines: failures that cannot recover without the operator changing the PVC spec (immutable accessModes, a too-small capacity) become terminal and surface as `plan.Phase = Failed` + `SeiNode Phase = Failed` via the existing `FailedPhase` from `buildBasePlan` (planner.go:447). This matches the direction doc's "seid will fail to start… the operator gets a clear signal from the Failed plan" — applied pre-flight where we know the failure is unrecoverable. A reviewer who prefers "always transient" can flip these at the cost of stuck-Initializing on operator typos. 
- -### Reason strings - -CamelCase, stable, and part of the public alerting contract (see §4). Exact strings: - -```go -const ( - ReasonImportValidated = "PVCValidated" - ReasonImportPVCNotFound = "PVCNotFound" - ReasonImportPVCTerminating = "PVCTerminating" - ReasonImportPVCNotBound = "PVCNotBound" // Pending/Released - ReasonImportPVCLost = "PVCLost" // terminal - ReasonImportAccessModeInvalid = "AccessModeInvalid" // terminal - ReasonImportCapacityTooSmall = "CapacityTooSmall" // terminal - ReasonImportPVMissing = "UnderlyingPVMissing" - ReasonImportPVCapacityMismatch = "UnderlyingPVCapacityMismatch" // terminal - ReasonImportPVFailed = "UnderlyingPVFailed" // terminal -) -``` - -Message format is human-readable and includes the PVC name plus the concrete defect: - -``` -PVC "data-archive-0-0" not found in namespace "default" -PVC "data-archive-0-0" phase is Pending, waiting for bind -PVC "data-archive-0-0" capacity 500Gi is less than required 2Ti -underlying PV "pv-abc" for PVC "data-archive-0-0" is in phase Failed -``` - -## 3. Validation path in detail - -`validateImport(ctx, node, name) (reason, message, state)` runs the seven checks in order, returning on the first defect. Sketch: - -```go -// 1. Get PVC in node.Namespace → IsNotFound → (PVCNotFound, transient) -// 2. deletionTimestamp != nil → (PVCTerminating, transient) -// 3. switch pvc.Status.Phase { -// case Bound: continue -// case Lost: (PVCLost, terminal) -// default: (PVCNotBound, transient) // Pending, Released, "" -// } -// 4. !containsAccessMode(..., ReadWriteOnce) → (AccessModeInvalid, terminal) -// 5. required := resource.MustParse(DefaultStorageForMode(mode, platform).size) -// actual := pvc.Status.Capacity[ResourceStorage] -// !ok → (CapacityTooSmall, transient) // capacity not yet reported -// actual