From 08e2f184b16f770e70068dbd71ba6ecdc72ddcdc Mon Sep 17 00:00:00 2001
From: bdchatham <bdchatham@gmail.com>
Date: Sat, 13 Jun 2026 12:10:40 -0700
Subject: [PATCH] burn-down(docs): relocate 13 design artifacts to
 bdchatham-designs (Design 05 Phase-C, PLT-497)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move design artifacts to the DRI's designs repo (Design 05). Remove the 13
relocated docs from docs/ and repoint the 3 inbound doc-links that targeted
docs/controller-app-config.md (README.md, CLAUDE.md, and a comment in
config/manager/manager.yaml) to the new location.

Relocated → bdchatham-designs (paired add: PR #21), 12 as-is + 1 translated
(design-networking-monitoring).

NOT removed:
- Sonar-gated (Design 05 §5) — real consumers cite these by path/anchor:
    docs/design/seinode-task-lld.md   (Go + generated CRD YAML)
    docs/design/test-harness-lld.md   (Go + CI workflow)
    docs/seinode-task.md              (section headings are cited anchors for the
                                       PLT-489 gov-ops skill — renaming breaks it)
- docs/design-e2e-kind-harness.md     (Amir-authored → amir-deris-designs, separate)

docs/ is therefore NOT fully evacuated this round (3 gated + 1 cross-DRI remain).
Inbound audit (canonical origin/main 73701f1): controller-app-config had 3 doc-links
(repointed here); the other 12 relocated docs had 0 inbound refs. Prepared in an
isolated worktree off origin/main — active branches untouched.

Paired with bdchatham-designs#21 (merge that first). Linear: PLT-497.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 CLAUDE.md                                     |    2 +-
 README.md                                     |    2 +-
 config/manager/manager.yaml                   |    5 +-
 .../stale-target-height-bootstrap-failure.md  |   88 -
 docs/controller-app-config.md                 |   80 -
 docs/design-networking-monitoring.md          |  901 ----------
 docs/design-progressive-rollout.md            |  791 --------
 docs/design-rpc-migration-istio.md            |  119 --
 docs/design-seinode-import-volume-lld.md      |  479 -----
 docs/design-seinode-import-volume.md          |  144 --
 ...esign-seinode-validator-signing-key-lld.md |  427 -----
 docs/design-snapshot-publish-config.md        |  234 ---
 docs/design/composable-genesis.md             |  442 -----
 docs/design/validation-run-lld.md             | 1588 -----------------
 docs/known-issues-node-alarms.md              |   33 -
 docs/production-deployment-analysis.md        |  166 --
 16 files changed, 5 insertions(+), 5496 deletions(-)
 delete mode 100644 docs/bugs/stale-target-height-bootstrap-failure.md
 delete mode 100644 docs/controller-app-config.md
 delete mode 100644 docs/design-networking-monitoring.md
 delete mode 100644 docs/design-progressive-rollout.md
 delete mode 100644 docs/design-rpc-migration-istio.md
 delete mode 100644 docs/design-seinode-import-volume-lld.md
 delete mode 100644 docs/design-seinode-import-volume.md
 delete mode 100644 docs/design-seinode-validator-signing-key-lld.md
 delete mode 100644 docs/design-snapshot-publish-config.md
 delete mode 100644 docs/design/composable-genesis.md
 delete mode 100644 docs/design/validation-run-lld.md
 delete mode 100644 docs/known-issues-node-alarms.md
 delete mode 100644 docs/production-deployment-analysis.md
diff --git a/CLAUDE.md b/CLAUDE.md
index c062f639..6727997f 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -120,6 +120,6 @@ make docker-push IMG=<image>  # Push container image
 - **Condition ownership** — The planner owns all condition management on the owning resource. It sets conditions when creating plans (e.g., `NodeUpdateInProgress=True`) and when observing terminal plans (e.g., `NodeUpdateInProgress=False`). The executor does not set conditions — it only mutates plan/task state and phase transitions.
 - **Single-patch model** — All status mutations (plan state, conditions, phase, currentImage) accumulate in-memory during a reconcile and are flushed in a single `Status().Patch()` at the end. Tasks mutate owned resources (StatefulSets, Services, PVCs); the executor mutates plan state in-memory; the reconciler flushes once.
 - **Resource generators** live in `internal/noderesource/` — pure functions that produce StatefulSets, Services, and PVCs from a SeiNode spec. Used by both the controller and plan tasks.
-- **Platform config** is resolved by `platform.Load` (`internal/platform/load.go`). Infra fields (scheduling, storage, resources, snapshot/genesis/result-export buckets, images) come from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — a required field unset in the file fails `Config.Validate` at startup. Networking/gateway fields (`SEI_GATEWAY_*`, `SEI_P2P_ENDPOINT_DOMAIN`, `SEI_NLB_TARGET_TYPE`) stay env-sourced pending their removal from the controller in PLT-451. The file is read once at startup for infra fields (an infra change needs a restart); the `stateSync` section is re-read per reconcile (it hot-reloads). See `internal/platform/platform.go` for the field list and `docs/controller-app-config.md` for the file schema.
+- **Platform config** is resolved by `platform.Load` (`internal/platform/load.go`). Infra fields (scheduling, storage, resources, snapshot/genesis/result-export buckets, images) come from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — a required field unset in the file fails `Config.Validate` at startup. Networking/gateway fields (`SEI_GATEWAY_*`, `SEI_P2P_ENDPOINT_DOMAIN`, `SEI_NLB_TARGET_TYPE`) stay env-sourced pending their removal from the controller in PLT-451. The file is read once at startup for infra fields (an infra change needs a restart); the `stateSync` section is re-read per reconcile (it hot-reloads). See `internal/platform/platform.go` for the field list and the [controller-app-config schema](https://github.com/sei-protocol/bdchatham-designs/blob/main/designs/controller-app-config/controller-app-config.md) (in bdchatham-designs — relocated per Design 05 / PLT-497) for the file schema.
 - **Genesis resolution** is handled by the sidecar autonomously: embedded sei-config for well-known chains, S3 fallback at `{SEI_GENESIS_BUCKET}/{chainID}/genesis.json` for custom chains.
 - Config keys in seid's `config.toml` use **hyphens** (e.g., `persistent-peers`, `trust-height`), not underscores.
diff --git a/README.md b/README.md
index 6fada0bd..ff579a06 100644
--- a/README.md
+++ b/README.md
@@ -85,7 +85,7 @@ lifecycle, signing topology, idempotency, and operational gotchas.
 
 ## Platform Configuration
 
-Infrastructure-level settings (node pools, storage, resources, snapshot/genesis/result-export buckets, sidecar images) are read from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — the controller fails fast at startup if a required field is unset. See [`docs/controller-app-config.md`](docs/controller-app-config.md) for the schema.
+Infrastructure-level settings (node pools, storage, resources, snapshot/genesis/result-export buckets, sidecar images) are read from the mounted app-config file (`SEI_CONTROLLER_CONFIG` → `platform.FileConfig`), which is authoritative — the controller fails fast at startup if a required field is unset. See the [controller-app-config schema](https://github.com/sei-protocol/bdchatham-designs/blob/main/designs/controller-app-config/controller-app-config.md) (in bdchatham-designs — relocated per Design 05 / PLT-497).
 
 Gateway config (`SEI_GATEWAY_NAME`, `SEI_GATEWAY_NAMESPACE`, `SEI_GATEWAY_DOMAIN`) and the config-file path (`SEI_CONTROLLER_CONFIG`) remain environment variables.
 
diff --git a/config/manager/manager.yaml b/config/manager/manager.yaml
index 472f7ef4..a7e727e2 100644
--- a/config/manager/manager.yaml
+++ b/config/manager/manager.yaml
@@ -37,8 +37,9 @@ spec:
         env:
           # Infra config (node pools, storage, resources, snapshot/genesis/
           # result-export buckets, images) is sourced from the mounted
-          # app-config ConfigMap (SEI_CONTROLLER_CONFIG below); see
-          # docs/controller-app-config.md. Gateway config stays env-sourced
+          # app-config ConfigMap (SEI_CONTROLLER_CONFIG below); see the
+          # controller-app-config schema in bdchatham-designs (relocated per
+          # Design 05 / PLT-497). Gateway config stays env-sourced
           # pending its removal from the controller in PLT-451.
           - name: SEI_GATEWAY_NAME
             value: sei-gateway
diff --git a/docs/bugs/stale-target-height-bootstrap-failure.md b/docs/bugs/stale-target-height-bootstrap-failure.md
deleted file mode 100644
index e6022267..00000000
--- a/docs/bugs/stale-target-height-bootstrap-failure.md
+++ /dev/null
@@ -1,88 +0,0 @@
-# Bug: Stale `targetHeight` Causes Bootstrap Job Immediate Halt
-
-**Date:** 2026-03-29
-**Severity:** Medium — blocks shadow replayer (and any bootstrap node) from initializing
-**Status:** Mitigated (manual bump); root cause requires design fix
-
----
-
-## Summary
-
-When a `SeiNode` with a bootstrap Job has a `snapshot.s3.targetHeight` that is
-lower than the height of the S3 snapshot it restores, the bootstrap `seid`
-process immediately halts on its first block and exits with code 130 (SIGINT).
-Because the Job is created with `backoffLimit: 0`, this single failure marks the
-entire init plan as `Failed`.
-
-## Reproduction
-
-1. Deploy a `SeiNode` with `replayer.snapshot.s3.targetHeight: 198740000`
-2. The `snapshot-restore` task downloads the latest snapshot from S3 (currently
-   at height 200045000 — ~1.3M blocks ahead of the target)
-3. The bootstrap Job starts seid with `--halt-height 198740000`
-4. seid begins at height 200045000 (already past the halt height), processes
-   one block, and triggers the Cosmos SDK halt-height check
-5. seid sends itself SIGINT → exit code 130
-6. Job fails immediately (`backoffLimit: 0`), `await-bootstrap-complete` marks
-   the plan as `Failed`
-
-## Root Cause
-
-`targetHeight` serves two purposes in `bootstrap_resources.go`:
-
-1. **Snapshot selection** — passed to `snapshot-restore` to find the right S3
-   object. But the seictl restore task downloads the *closest available*
-   snapshot, which may be far newer if snapshots have been regenerated since the
-   manifest was written.
-
-2. **Halt height** — used verbatim as `--halt-height` for the bootstrap seid
-   process. This assumes the restored snapshot is at or below `targetHeight`,
-   which becomes false as new snapshots are uploaded to S3.
-
-The coupling between these two concerns means `targetHeight` silently becomes
-stale as the chain advances and new snapshots replace old ones.
-
-Relevant code path:
-
-```
-bootstrap_resources.go:142  haltHeight := snap.S3.TargetHeight
-bootstrap_resources.go:143  seidCmd, seidArgs := bootstrapWaitCommand(bootstrapSidecarPort(node), haltHeight)
-bootstrap_resources.go:189  exec seid start --home %s --halt-height %d
-```
-
-## Impact
-
-- Any `SeiNode` with a bootstrap flow (shadow replayer, full node with
-  `bootstrapImage`, validator with snapshot) will fail to initialize if
-  `targetHeight` falls behind the latest available snapshot in S3.
-- The failure is silent from the user's perspective — the node simply goes to
-  `Failed` phase. Diagnosing it requires checking bootstrap Job exit codes and
-  seid logs for the halt message.
-
-## Mitigation
-
-Bump `targetHeight` in the manifest to a value ahead of the latest S3 snapshot.
-For the shadow replayer, this was changed from `198740000` → `200100000`.
-
-This is a manual fix that will need to be repeated as the chain advances.
-
-## Potential Fixes
-
-1. **Decouple snapshot selection from halt height.** Add a separate
-   `haltHeight` field (or compute it dynamically) so `targetHeight` only
-   controls which snapshot to download. The halt height could be derived from
-   the actual restored snapshot height + a configurable offset.
-
-2. **Treat exit code 130 as success in the bootstrap Job.** The halt-height
-   exit is an expected shutdown. The `await-bootstrap-complete` task (or the
-   Job spec itself) could treat exit code 130 as a successful completion rather
-   than a failure.
-
-3. **Resolve `targetHeight` dynamically.** Read `latest.txt` from S3 at plan
-   time and set `--halt-height` to `latest + offset`, similar to what the
-   original sei-infra scripts did.
-
-4. **Skip halt-height if restored height exceeds it.** The bootstrap wait
-   script could query seid's current height after state sync and skip the
-   `--halt-height` flag if already past it. This avoids the immediate-halt
-   scenario entirely.
diff --git a/docs/controller-app-config.md b/docs/controller-app-config.md
deleted file mode 100644
index 2addcd58..00000000
--- a/docs/controller-app-config.md
+++ /dev/null
@@ -1,80 +0,0 @@
-# Controller app-config file
-
-The controller reads a single read-only application-config file, pointed at by
-`SEI_CONTROLLER_CONFIG` and mounted as a directory (a GitOps-written ConfigMap,
-typically `sei-controller-config`). It is decoded into `platform.FileConfig`
-(`internal/platform/platform.go`).
-
-Two read paths, by design:
-
-- **Infra sections** (`scheduling`, `storage`, `resources`, `snapshot`,
-  `resultExport`, `genesis`, `images`) are resolved **once at startup** by
-  `platform.Load`. Editing them in the live ConfigMap propagates to the mount
-  but has **no effect until the controller pod restarts**
-  (`kubectl rollout restart`) — only `stateSync` hot-reloads.
-- **`stateSync`** is re-read **per reconcile** so syncer changes hot-reload
-  without a restart (the directory mount swaps atomically).
-
-## Source of truth
-
-The file is **authoritative** for infra config: a required field unset in the
-file fails `Config.Validate` at startup (the controller does not boot). There is
-no env-var fallback for these fields.
-
-Networking/gateway config (`SEI_GATEWAY_*`, `SEI_P2P_ENDPOINT_DOMAIN`,
-`SEI_NLB_TARGET_TYPE`) is **not** in the file — it stays env-sourced pending its
-removal from the controller in the GitOps networking move (PLT-451).
-
-## Schema
-
-```yaml
-# State-sync canonical syncers, keyed by chain-id. Bare host:port (no scheme).
-# Read per-reconcile; >=2 entries per chain or the node fails closed.
-stateSync:
-  syncers:
-    pacific-1:
-      - rpc-1.example.net:26657
-      - rpc-2.example.net:26657
-
-# --- infra (authoritative; read once at startup) ---
-
-scheduling:
-  nodepoolName: sei-node
-  nodepoolArchive: sei-archive
-  tolerationKey: sei.io/workload
-  serviceAccount: seid-node
-
-storage:                  # no sizePerf — perf is a storage-class tier only
-  classPerf: gp3-10k-750
-  classDefault: gp3
-  classArchive: gp3-archive
-  sizeDefault: 2000Gi
-  sizeArchive: 40Ti
-
-resources:
-  cpuArchive: "48"
-  memArchive: 448Gi
-  cpuDefault: "4"
-  memDefault: 32Gi
-
-snapshot:
-  bucket: sei-snapshots
-  region: us-east-2
-
-resultExport:
-  bucket: sei-shadow-results
-  region: us-east-2
-  prefix: shadow-results/
-
-genesis:
-  bucket: sei-k8s-genesis
-  region: us-east-2
-
-images:
-  sidecar: ghcr.io/sei-protocol/seictl@sha256:...
-  kubeRBACProxy: quay.io/brancz/kube-rbac-proxy:v0.19.1
-  cosmosExporter: ghcr.io/sei-protocol/sei-cosmos-exporter@sha256:...
-```
-
-A present-but-unparseable file is a hard startup error. A required infra field
-unset in the file fails `Config.Validate` at startup, naming the file key.
diff --git a/docs/design-networking-monitoring.md b/docs/design-networking-monitoring.md
deleted file mode 100644
index 73d98ec7..00000000
--- a/docs/design-networking-monitoring.md
+++ /dev/null
@@ -1,901 +0,0 @@
-# Design: SeiNodeDeployment — Fleet Orchestration with Networking & Monitoring
-
-**Branch:** `feature/networking-monitoring`
-**Status:** Final (v3) — review findings incorporated
-**Goal:** Introduce a `SeiNodeDeployment` CRD that orchestrates N SeiNodes behind shared networking (Service, Istio Gateway routing, network isolation) and monitoring (ServiceMonitor), closing the gap between the current controller and sei-infra's EC2 production deployment.
-
----
-
-## 1. Problem Statement
-
-The SeiNode controller manages single-node lifecycle (bootstrap → init → running → snapshots). Production deployments require:
-
-- **Multiple nodes** behind a shared load balancer (sei-infra runs 3 instances per role)
-- **External Service** (ClusterIP/LoadBalancer) for RPC, REST, and EVM traffic
-- **Ingress routing** via Kubernetes Gateway API (HTTPRoute) with TLS for public endpoints
-- **Network isolation** so that only the ingress gateway and authorized peers can reach node APIs
-- **DNS** (Route53 via external-dns) for stable hostnames
-- **Monitoring** (Prometheus ServiceMonitor) for observability
-
-These are fleet-level concerns: a load balancer sits in front of N nodes, not one. Putting networking on each SeiNode would create N independent load balancers instead of one shared entry point.
-
----
-
-## 2. Architecture Overview
-
-```
-┌──────────────────────────────────────────────────────────────────┐
-│                        SeiNodeDeployment                               │
-│  "pacific-1-archive-rpc"                                          │
-│                                                                   │
-│  Owns:                                                            │
-│  ├── SeiNode "pacific-1-archive-rpc-0"  ─┐                       │
-│  │   └── (SeiNode controller manages     │                       │
-│  │       StatefulSet, PVC, headless Svc, │  shared label:         │
-│  │       sidecar tasks)                  │  sei.io/group: ...     │
-│  ├── SeiNode "pacific-1-archive-rpc-1"  ─┤                       │
-│  ├── SeiNode "pacific-1-archive-rpc-2"  ─┘                       │
-│  │                                                                │
-│  ├── Service "...-external"  (selects all 3 pods by group label)  │
-│  ├── HTTPRoute "..."         (routes to shared Service)           │
-│  ├── AuthorizationPolicy "..." (applied to all 3 pods)           │
-│  └── ServiceMonitor "..."    (scrapes all 3 pods)                 │
-└──────────────────────────────────────────────────────────────────┘
-```
-
-**SeiNode** is unchanged — it manages single-node lifecycle.
-**SeiNodeDeployment** is the new orchestration layer that composes SeiNodes with shared infrastructure.
-
-This follows the same pattern as the existing `SeiNodePool → SeiNode` relationship, where SeiNodePool creates child SeiNode CRs and aggregates their status.
-
----
-
-## 3. Design Principles
-
-1. **SeiNode stays single-responsibility** — One small prerequisite change (`spec.podLabels`) is needed on SeiNode to support label propagation to pods. No networking logic is added. SeiNodeDeployment owns the fleet and exposure layer.
-
-2. **Same patterns as SeiNodePool** — The SeiNodeDeployment controller follows the same `ensureSeiNode` / `updateStatus` / owner-reference patterns already established by SeiNodePool. No new controller patterns to learn.
-
-3. **Passthrough over abstraction** — Service annotations, HTTPRoute annotations, and Istio config use Kubernetes-native values. No DSL wrappers.
-
-4. **Safe by default** — `DeletionPolicy` governs both networking resources and child SeiNodes. Network isolation is an additive feature, not a breaking change. **Important:** The default `DeletionPolicy` is `Delete`, which cascades deletion of _all_ child SeiNodes and networking resources when the group is removed. Production deployments should use `deletionPolicy: Retain` to prevent accidental data loss.
-
-5. **Two-way doors only** — Every field is optional. WAF is just an annotation on the Service or HTTPRoute. Update strategy for rolling out changes across replicas is a future concern that the current design does not block.
-
-6. **SeiNodePool vs SeiNodeDeployment** — SeiNodePool is for genesis network bootstrapping (prep jobs, shared genesis PVC, then SeiNodes). SeiNodeDeployment is for production fleet management (N nodes from a template + shared networking/monitoring). They target different use cases and should not manage the same SeiNodes.
-
----
-
-## 4. Prerequisite: SeiNode `podLabels` Field
-
-The shared external Service selects pods by `sei.io/group: {groupName}`. For this label to reach the pod template, the SeiNode controller must propagate it. Today, `resourceLabelsForNode()` only sets `sei.io/node: {name}` on the pod template — SeiNode metadata labels are ignored.
-
-**Change:** Add an optional `podLabels` field to `SeiNodeSpec`. The SeiNode controller merges these into the StatefulSet pod template labels alongside the existing `sei.io/node` label.
-
-```go
-type SeiNodeSpec struct {
-    // ... existing fields ...
-
-    // PodLabels are additional labels merged into the StatefulSet pod template.
-    // The controller always sets sei.io/node; these are additive.
-    // +optional
-    PodLabels map[string]string `json:"podLabels,omitempty"`
-}
-```
-
-In `resources.go`:
-
-```go
-func resourceLabelsForNode(node *seiv1alpha1.SeiNode) map[string]string {
-    labels := make(map[string]string, len(node.Spec.PodLabels)+1)
-    maps.Copy(labels, node.Spec.PodLabels)   // user/group labels first
-    labels[nodeLabel] = node.Name             // system label wins
-    return labels
-}
-```
-
-The SeiNodeDeployment controller sets `podLabels: {"sei.io/group": groupName}` on each child SeiNode. This is a small, backward-compatible change — existing SeiNodes without `podLabels` behave identically.
-
-This is scoped as a standalone prerequisite PR before Phase 1.
-
----
-
-## 5. API Types
-
-### 5.1 SeiNodeDeployment (`api/v1alpha1/seinodedeployment_types.go`)
-
-```go
-// SeiNodeDeploymentSpec defines the desired state of a SeiNodeDeployment.
-type SeiNodeDeploymentSpec struct {
-    // Replicas is the number of SeiNode instances to create.
-    // +kubebuilder:validation:Minimum=1
-    // +kubebuilder:validation:Maximum=16
-    // +kubebuilder:default=1
-    Replicas int32 `json:"replicas"`
-
-    // Template defines the SeiNode spec stamped out for each replica.
-    // Each SeiNode is named "{group-name}-{ordinal}".
-    Template SeiNodeTemplate `json:"template"`
-
-    // Networking controls how the group is exposed to traffic.
-    // Networking resources are shared across all replicas.
-    // +optional
-    Networking *NetworkingConfig `json:"networking,omitempty"`
-
-    // Monitoring configures observability resources shared across
-    // all replicas.
-    // +optional
-    Monitoring *MonitoringConfig `json:"monitoring,omitempty"`
-}
-
-// SeiNodeTemplate wraps a SeiNodeSpec for use in the group template.
-type SeiNodeTemplate struct {
-    // Metadata allows setting labels and annotations on child SeiNodes.
-    // The controller always adds sei.io/group and sei.io/group-ordinal
-    // labels; user-specified labels are merged.
-    // +optional
-    Metadata *SeiNodeTemplateMeta `json:"metadata,omitempty"`
-
-    // Spec is the SeiNodeSpec applied to each replica.
-    Spec SeiNodeSpec `json:"spec"`
-}
-
-// SeiNodeTemplateMeta defines metadata for templated SeiNodes.
-type SeiNodeTemplateMeta struct {
-    // Labels are merged onto each child SeiNode's metadata.
-    // +optional
-    Labels map[string]string `json:"labels,omitempty"`
-
-    // Annotations are merged onto each child SeiNode's metadata.
-    // +optional
-    Annotations map[string]string `json:"annotations,omitempty"`
-}
-```
-
-### 4.2 SeiNodeDeployment Status
-
-```go
-// SeiNodeDeploymentPhase represents the high-level lifecycle state.
-// +kubebuilder:validation:Enum=Pending;Initializing;Ready;Degraded;Failed;Terminating
-type SeiNodeDeploymentPhase string
-
-const (
-    GroupPhasePending      SeiNodeDeploymentPhase = "Pending"
-    GroupPhaseInitializing SeiNodeDeploymentPhase = "Initializing"
-    GroupPhaseReady        SeiNodeDeploymentPhase = "Ready"
-    GroupPhaseDegraded     SeiNodeDeploymentPhase = "Degraded"
-    GroupPhaseFailed       SeiNodeDeploymentPhase = "Failed"
-    GroupPhaseTerminating  SeiNodeDeploymentPhase = "Terminating"
-)
-
-// SeiNodeDeploymentStatus defines the observed state of a SeiNodeDeployment.
-type SeiNodeDeploymentStatus struct {
-    // ObservedGeneration is the most recent generation observed by the controller.
-    // Clients can check this against metadata.generation to know if the
-    // status reflects the current spec.
-    // +optional
-    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
-
-    // Phase is the high-level lifecycle state.
-    Phase SeiNodeDeploymentPhase `json:"phase,omitempty"`
-
-    // Replicas is the desired number of SeiNodes.
-    Replicas int32 `json:"replicas,omitempty"`
-
-    // ReadyReplicas is the number of SeiNodes in Running phase.
-    ReadyReplicas int32 `json:"readyReplicas,omitempty"`
-
-    // Nodes reports the status of each child SeiNode.
-    // +optional
-    Nodes []GroupNodeStatus `json:"nodes,omitempty"`
-
-    // NetworkingStatus reports the observed state of networking resources.
-    // +optional
-    NetworkingStatus *NetworkingStatus `json:"networkingStatus,omitempty"`
-
-    // +listType=map
-    // +listMapKey=type
-    // +optional
-    Conditions []metav1.Condition `json:"conditions,omitempty"`
-}
-
-// GroupNodeStatus is a summary of a child SeiNode's state.
-type GroupNodeStatus struct {
-    // Name is the SeiNode resource name.
-    Name string `json:"name"`
-
-    // Phase is the SeiNode's current phase.
-    Phase SeiNodePhase `json:"phase,omitempty"`
-}
-
-// NetworkingStatus reports the observed state of networking resources.
-type NetworkingStatus struct {
-    // ExternalServiceName is the name of the managed external Service.
-    // +optional
-    ExternalServiceName string `json:"externalServiceName,omitempty"`
-
-    // LoadBalancerIngress contains the hostname/IP assigned by the cloud
-    // provider once the LoadBalancer is provisioned.
-    // +optional
-    LoadBalancerIngress []corev1.LoadBalancerIngress `json:"loadBalancerIngress,omitempty"`
-}
-```
-
-### 4.3 SeiNodeDeployment CRD markers
-
-```go
-// +kubebuilder:object:root=true
-// +kubebuilder:subresource:status
-// +kubebuilder:resource:shortName=snd
-// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
-// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.status.replicas`
-// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase`
-// +kubebuilder:printcolumn:name="Host",type=string,JSONPath=`.spec.networking.gateway.hostnames[0]`,priority=1
-// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
-
-type SeiNodeDeployment struct {
-    metav1.TypeMeta   `json:",inline"`
-    metav1.ObjectMeta `json:"metadata,omitempty"`
-    Spec   SeiNodeDeploymentSpec   `json:"spec,omitempty"`
-    Status SeiNodeDeploymentStatus `json:"status,omitempty"`
-}
-```
-
-### 4.4 Networking Types (`api/v1alpha1/networking_types.go`)
-
-```go
-// PortName is a well-known sei-node port identifier from sei-config.
-// +kubebuilder:validation:Enum=rpc;rest;evm-http;evm-ws;grpc;p2p;prometheus
-type PortName string
-
-// DeletionPolicy controls what happens to managed networking resources
-// when their spec is removed.
-// +kubebuilder:validation:Enum=Delete;Retain
-type DeletionPolicy string
-
-const (
-    DeletionPolicyDelete DeletionPolicy = "Delete"
-    DeletionPolicyRetain DeletionPolicy = "Retain"
-)
-
-// NetworkingConfig controls how the group is exposed to traffic.
-//
-// Routing uses the Kubernetes Gateway API exclusively; the platform must
-// install the Gateway API CRDs (v1+) and a Gateway implementation such
-// as Istio before HTTPRoute resources will take effect.
-// +kubebuilder:validation:XValidation:rule="!has(self.gateway) || has(self.service)",message="gateway requires service to be configured"
-type NetworkingConfig struct {
-    // Service creates a non-headless Service shared across all replicas.
-    // Each SeiNode still gets its own headless Service for pod DNS.
-    // +optional
-    Service *ExternalServiceConfig `json:"service,omitempty"`
-
-    // Gateway creates a gateway.networking.k8s.io/v1 HTTPRoute
-    // targeting a shared Gateway (e.g. Istio ingress gateway).
-    // +optional
-    Gateway *GatewayRouteConfig `json:"gateway,omitempty"`
-
-    // Isolation configures network-level access control for node pods.
-    // +optional
-    Isolation *NetworkIsolationConfig `json:"isolation,omitempty"`
-}
-
-// ExternalServiceConfig defines the shared non-headless Service.
-type ExternalServiceConfig struct {
-    // Type is the Kubernetes Service type. Defaults to ClusterIP.
-    // +optional
-    // +kubebuilder:default=ClusterIP
-    // +kubebuilder:validation:Enum=ClusterIP;LoadBalancer;NodePort
-    Type corev1.ServiceType `json:"type,omitempty"`
-
-    // Ports selects which node ports to expose. When empty, all
-    // standard sei-config ports are exposed.
-    // +optional
-    Ports []PortName `json:"ports,omitempty"`
-
-    // Annotations are merged onto the Service metadata.
-    // +optional
-    Annotations map[string]string `json:"annotations,omitempty"`
-}
-
-// GatewayRouteConfig creates a gateway.networking.k8s.io/v1 HTTPRoute
-// that references a shared Gateway resource.
-type GatewayRouteConfig struct {
-    // ParentRef identifies the shared Gateway.
-    ParentRef GatewayParentRef `json:"parentRef"`
-
-    // Hostnames are the DNS hostnames for the HTTPRoute.
-    // +kubebuilder:validation:MinItems=1
-    Hostnames []string `json:"hostnames"`
-
-    // Annotations are merged onto the HTTPRoute metadata.
-    // +optional
-    Annotations map[string]string `json:"annotations,omitempty"`
-}
-
-// GatewayParentRef identifies a Gateway resource.
-type GatewayParentRef struct {
-    // +kubebuilder:validation:MinLength=1
-    Name string `json:"name"`
-
-    // +kubebuilder:validation:MinLength=1
-    Namespace string `json:"namespace"`
-}
-
-// NetworkIsolationConfig defines network-level access control.
-type NetworkIsolationConfig struct {
-    // AuthorizationPolicy creates an Istio AuthorizationPolicy
-    // restricting which identities can reach node pods.
-    // +optional
-    AuthorizationPolicy *AuthorizationPolicyConfig `json:"authorizationPolicy,omitempty"`
-}
-
-// AuthorizationPolicyConfig defines allowed traffic sources.
-type AuthorizationPolicyConfig struct {
-    // AllowedSources defines who can reach this group's pods.
-    // The controller generates an ALLOW policy; traffic from
-    // sources not listed here is denied.
-    // +kubebuilder:validation:MinItems=1
-    AllowedSources []TrafficSource `json:"allowedSources"`
-}
-
-// TrafficSource identifies a set of callers by Istio identity.
-// +kubebuilder:validation:XValidation:rule="has(self.principals) || has(self.namespaces)",message="at least one of principals or namespaces must be set"
-type TrafficSource struct {
-    // Principals are SPIFFE identities (e.g.
-    // "cluster.local/ns/istio-system/sa/istio-ingressgateway").
-    // +optional
-    Principals []string `json:"principals,omitempty"`
-
-    // Namespaces allows all pods in these namespaces.
-    // +optional
-    Namespaces []string `json:"namespaces,omitempty"`
-}
-```
-
-### 4.5 Monitoring Types (`api/v1alpha1/monitoring_types.go`)
-
-```go
-// MonitoringConfig controls observability resources.
-type MonitoringConfig struct {
-    // ServiceMonitor creates a monitoring.coreos.com/v1 ServiceMonitor.
-    // Presence (non-nil) enables it; set to nil to disable.
-    // +optional
-    ServiceMonitor *ServiceMonitorConfig `json:"serviceMonitor,omitempty"`
-}
-
-// ServiceMonitorConfig defines the ServiceMonitor.
-type ServiceMonitorConfig struct {
-    // Interval is the Prometheus scrape interval.
-    // +optional
-    // +kubebuilder:default="30s"
-    // +kubebuilder:validation:Pattern="^[0-9]+(ms|s|m|h)$"
-    Interval string `json:"interval,omitempty"`
-
-    // Labels are added to the ServiceMonitor metadata.
-    // +optional
-    Labels map[string]string `json:"labels,omitempty"`
-}
-```
-
-### 4.6 Status Conditions
-
-```go
-const (
-    ConditionNodesReady            = "NodesReady"
-    ConditionExternalServiceReady  = "ExternalServiceReady"
-    ConditionNetworkingReady       = "NetworkingReady"     // HTTPRoute
-    ConditionIsolationReady        = "IsolationReady"      // AuthorizationPolicy
-    ConditionServiceMonitorReady   = "ServiceMonitorReady"
-)
-```
-
----
-
-## 6. Labels and Naming
-
-### Labels injected by SeiNodeDeployment controller
-
-| Label | Value | Set on | Purpose |
-|-------|-------|--------|---------|
-| `sei.io/group` | `{groupName}` | SeiNode metadata + `podLabels` | Shared Service selector, AuthorizationPolicy selector |
-| `sei.io/group-ordinal` | `"0"`, `"1"`, ... | SeiNode metadata | Identify individual replicas |
-| `sei.io/node` | `{nodeName}` | Pod template (by SeiNode controller) | Existing per-node label |
-
-The SeiNodeDeployment controller sets `sei.io/group` on both the child SeiNode's metadata labels AND `spec.podLabels`. The `podLabels` mechanism (Section 4) ensures the label propagates to the StatefulSet pod template. The shared external Service selects on `sei.io/group: {groupName}`, so all replica pods are endpoints of the same Service.
-
-### Label merge order (system labels win)
-
-When building child SeiNode labels, user-specified template labels are applied first, then system labels overwrite. This prevents a user from accidentally breaking the group selector:
-
-```go
-// User labels first
-maps.Copy(labels, group.Spec.Template.Metadata.Labels)
-// System labels overwrite
-labels["sei.io/group"] = group.Name
-labels["sei.io/group-ordinal"] = strconv.Itoa(ordinal)
-```
-
-### Resource naming
-
-| Resource | Name | Why |
-|----------|------|-----|
-| SeiNode | `{group}-{ordinal}` | Matches SeiNodePool pattern |
-| External Service | `{group}-external` | Distinguishes from per-node headless Services |
-| HTTPRoute | `{group}` | One route per group |
-| AuthorizationPolicy | `{group}` | Applied to all group pods |
-| ServiceMonitor | `{group}` | Scrapes all group pods |
-
----
-
-## 7. Controller Reconciliation
-
-### 7.1 File Organization
-
-```
-internal/controller/
-├── node/                        # SeiNode controller (UNCHANGED)
-│   ├── controller.go
-│   ├── resources.go
-│   ├── plan_execution.go
-│   ├── ...
-│
-├── nodedeployment/                   # NEW: SeiNodeDeployment controller
-│   ├── controller.go            # Reconcile loop, phase transitions
-│   ├── nodes.go                 # ensureSeiNode, scaleDown
-│   ├── networking.go            # External Service, HTTPRoute,
-│   │                            # AuthorizationPolicy generation + reconcile
-│   ├── monitoring.go            # ServiceMonitor generation + reconcile
-│   ├── status.go                # Status aggregation
-│   ├── labels.go                # Label helpers, naming
-│   ├── networking_test.go
-│   ├── monitoring_test.go
-│   ├── nodes_test.go
-│   └── status_test.go
-│
-└── nodepool/                    # SeiNodePool controller (UNCHANGED)
-    ├── controller.go
-    └── ...
-```
-
-### 7.2 Reconcile Flow
-
-```go
-func (r *SeiNodeDeploymentReconciler) Reconcile(ctx, req) (Result, error) {
-    group := &SeiNodeDeployment{}
-    r.Get(ctx, req.NamespacedName, group)
-
-    // Deletion handling (respects DeletionPolicy for networking AND child SeiNodes)
-    if !group.DeletionTimestamp.IsZero() {
-        return r.handleDeletion(ctx, group)
-    }
-    r.ensureFinalizer(ctx, group)
-
-    // 1. Ensure N SeiNodes exist from template
-    r.reconcileSeiNodes(ctx, group)
-
-    // 2. Networking (independent of SeiNode readiness)
-    r.reconcileNetworking(ctx, group)
-
-    // 3. Monitoring
-    r.reconcileMonitoring(ctx, group)
-
-    // 4. Status aggregation (sets observedGeneration)
-    r.updateStatus(ctx, group)
-
-    // Periodic requeue to catch drift on unstructured resources
-    // (HTTPRoute, AuthorizationPolicy, ServiceMonitor) that lack Owns() watches
-    return ctrl.Result{RequeueAfter: statusPollInterval}, nil
-}
-```
-
-### 7.3 SeiNode Management (`nodes.go`)
-
-Follows the SeiNodePool pattern:
-
-```go
-func (r *SeiNodeDeploymentReconciler) reconcileSeiNodes(ctx, group) error {
-    for i := range group.Spec.Replicas {
-        r.ensureSeiNode(ctx, group, i)
-    }
-    return r.scaleDown(ctx, group)
-}
-
-func (r *SeiNodeDeploymentReconciler) ensureSeiNode(ctx, group, ordinal) error {
-    desired := generateSeiNode(group, ordinal)
-    // Set owner reference, create-or-update
-    // On update: sync Image, Entrypoint, Sidecar (same as SeiNodePool)
-}
-
-func generateSeiNode(group, ordinal) *SeiNode {
-    // User labels first, then system labels overwrite
-    labels := make(map[string]string)
-    if group.Spec.Template.Metadata != nil {
-        maps.Copy(labels, group.Spec.Template.Metadata.Labels)
-    }
-    labels["sei.io/group"] = group.Name
-    labels["sei.io/group-ordinal"] = strconv.Itoa(ordinal)
-
-    spec := group.Spec.Template.Spec.DeepCopy()
-    // Inject podLabels so the SeiNode controller propagates sei.io/group to pods
-    if spec.PodLabels == nil {
-        spec.PodLabels = make(map[string]string)
-    }
-    spec.PodLabels["sei.io/group"] = group.Name
-
-    return &SeiNode{
-        ObjectMeta: ObjectMeta{
-            Name:      fmt.Sprintf("%s-%d", group.Name, ordinal),
-            Namespace: group.Namespace,
-            Labels:    labels,
-        },
-        Spec: *spec,
-    }
-}
-```
-
-**Scale-down guard:** The `scaleDown` function refuses to delete SeiNodes if the desired replica count is zero or negative (defensive against uninitialized fields or controller bugs). The `Minimum=1` validation on `replicas` already rejects 0 at admission; the guard catches code-level errors:
-
-```go
-func (r *SeiNodeDeploymentReconciler) scaleDown(ctx, group) error {
-    if group.Spec.Replicas <= 0 {
-        log.Error("refusing scale-down: desired replicas is zero or negative")
-        return nil
-    }
-    // Delete SeiNodes with ordinal >= group.Spec.Replicas
-}
-```
-
-### 7.4 Networking Reconciliation (`networking.go`)
-
-Each networking resource is managed independently:
-
-```go
-func (r *SeiNodeDeploymentReconciler) reconcileNetworking(ctx, group) error {
-    r.reconcileExternalService(ctx, group)
-    r.reconcileRoute(ctx, group)          // HTTPRoute
-    r.reconcileIsolation(ctx, group)      // AuthorizationPolicy
-}
-```
-
-**External Service:**
-- Uses server-side apply with `fieldOwner: seinodedeployment-controller`
-- Selector: `sei.io/group: {groupName}` (matches all replica pods via `podLabels`)
-- Does NOT set `PublishNotReadyAddresses` (natural readiness gating)
-- If spec is nil and `deletionPolicy: Delete`, delete the Service
-- If spec is nil and `deletionPolicy: Retain`, remove owner reference (orphan)
-
-**HTTPRoute:**
-- Generated as `unstructured.Unstructured` (avoids importing Gateway API Go modules)
-- Backend targets `{group}-external` Service
-- If CRD not installed (no Gateway API), sets `NetworkingReady` condition to False/CRDNotInstalled
-
-**AuthorizationPolicy:**
-- Generated as `unstructured.Unstructured` (avoids importing Istio Go modules)
-- Selector: `sei.io/group: {groupName}`
-- Action: ALLOW with specified principals/namespaces
-- **Controller SA auto-injection:** The controller always adds its own ServiceAccount principal to the AuthorizationPolicy, ensuring sidecar communication (port 7777) is never blocked. This is injected at generation time, not visible in the user's spec. Without this, a SeiNode controller running in a different namespace (e.g. `sei-system`) would be unable to drive node initialization via the sidecar API.
-- If CRD not installed (no Istio), sets `IsolationReady` condition to False
-
-### 7.5 Monitoring Reconciliation (`monitoring.go`)
-
-**ServiceMonitor:**
-- Generated as `unstructured.Unstructured`
-- Selector: `sei.io/group: {groupName}` (scrapes all replica pods)
-- Port: `prometheus` (26660)
-- If CRD not installed, sets `ServiceMonitorReady` condition to False
-
-### 7.6 Status Aggregation (`status.go`)
-
-```go
-func (r *SeiNodeDeploymentReconciler) updateStatus(ctx, group) error {
-    // List child SeiNodes by label
-    nodeList := r.listChildSeiNodes(ctx, group)
-
-    // Count ready/total
-    var readyReplicas int32
-    for _, node := range nodeList {
-        if node.Status.Phase == PhaseRunning { readyReplicas++ }
-    }
-
-    // Determine group phase
-    phase := groupPhase(readyReplicas, group.Spec.Replicas, nodeList)
-
-    // Read external Service for LB status
-    networkingStatus := r.readNetworkingStatus(ctx, group)
-
-    // Patch status
-    group.Status.Replicas = group.Spec.Replicas
-    group.Status.ReadyReplicas = readyReplicas
-    group.Status.Phase = phase
-    group.Status.Nodes = nodeStatuses(nodeList)
-    group.Status.ObservedGeneration = group.Generation
-    group.Status.NetworkingStatus = networkingStatus
-}
-```
-
-**Phase logic:** The `groupPhase` function differentiates between scaling-up (some nodes in Initializing/PreInitializing) and actual failures:
-
-| Condition | Phase |
-|-----------|-------|
-| All replicas Running | `Ready` |
-| Some replicas Running, rest progressing (Pending/Initializing) | `Initializing` |
-| Some replicas Running, some Failed | `Degraded` |
-| All replicas Failed | `Failed` |
-| No replicas exist yet | `Pending` |
-
-The `NodesReady` condition provides detail: `"2/3 nodes ready (1 initializing)"`.
-
-### 7.7 Deletion Handling
-
-The `DeletionPolicy` governs both networking resources AND child SeiNodes:
-
-| DeletionPolicy | Networking resources | Child SeiNodes |
-|----------------|---------------------|----------------|
-| `Delete` | Deleted | Deleted (via owner ref GC) |
-| `Retain` | Orphaned (owner ref removed) | Orphaned (owner ref removed, continue running independently) |
-
-> **Production recommendation:** Always use `deletionPolicy: Retain` in production. The `Delete` default is convenient for development but dangerous in production because deleting a SeiNodeDeployment will cascade-delete all child SeiNodes (and their PVCs if configured), causing irreversible data loss.
-
-When `Retain`, the finalizer removes owner references from all managed resources before allowing the SeiNodeDeployment to be deleted. This prevents Kubernetes GC from cascading the deletion.
-
-### 7.8 RBAC
-
-```go
-// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments/status,verbs=get;update;patch
-// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments/finalizers,verbs=update
-// +kubebuilder:rbac:groups=sei.io,resources=seinodes,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=sei.io,resources=seinodes/status,verbs=get
-// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=gateway.networking.k8s.io,resources=httproutes,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=security.istio.io,resources=authorizationpolicies,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=monitoring.coreos.com,resources=servicemonitors,verbs=get;list;watch;create;update;patch;delete
-```
-
-### 7.9 SetupWithManager
-
-```go
-func (r *SeiNodeDeploymentReconciler) SetupWithManager(mgr ctrl.Manager) error {
-    return ctrl.NewControllerManagedBy(mgr).
-        For(&seiv1alpha1.SeiNodeDeployment{}).
-        Owns(&seiv1alpha1.SeiNode{}).
-        Owns(&corev1.Service{}).
-        Named("seinodedeployment").
-        Complete(r)
-}
-```
-
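-Note: HTTPRoute, AuthorizationPolicy, and ServiceMonitor are generated as unstructured objects whose CRDs may not be installed, so we don't add `Owns()` watches for them. Their reconciliation is idempotent and driven by the SeiNodeDeployment reconcile loop; drift on these resources is caught by the periodic `RequeueAfter` rather than by watch events.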
-
----
-
-## 8. AWS Topology Patterns
-
-### Pattern 1: Istio Gateway + Network Isolation (recommended)
-
-Traffic flow: `Client → ALB (WAF) → Istio Gateway → Envoy sidecar → seid`
-
-The controller manages the HTTPRoute and AuthorizationPolicy. The ALB and Istio Gateway are platform-level resources.
-
-```yaml
-networking:
-  deletionPolicy: Retain
-  service:
-    type: ClusterIP
-    ports: ["rpc", "rest", "evm-http", "evm-ws"]
-  gateway:
-    parentRef:
-      name: sei-gateway
-      namespace: istio-system
-    hostnames:
-      - rpc.sei-archive.pacific-1.seinetwork.io
-  isolation:
-    authorizationPolicy:
-      allowedSources:
-        - principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway"]
-        - namespaces: ["sei-nodes"]
-```
-
-### Pattern 2: NLB for TCP (p2p, gRPC)
-
-```yaml
-networking:
-  service:
-    type: LoadBalancer
-    ports: ["p2p", "grpc"]
-    annotations:
-      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
-      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
-```
-
-### WAF
-
-WAF is provisioned by the platform team (Terraform) and applied externally, e.g. by associating an AWS WAF web ACL with the ALB fronting the Istio ingress gateway. AWS WAF cannot be attached to an NLB, so the NLB pattern above is not WAF-protected. This is a two-way door: WAF configuration lives outside the controller and can be toggled independently.
-
-### DNS
-
-DNS is handled by external-dns, which reads HTTPRoute `hostnames` fields and creates Route53 records. Prerequisites:
-- external-dns deployed with `--source=gateway-httproute`
-- `--domain-filter` matching the target domain
-- IAM permissions for Route53
-
----
-
-## 9. Complete Sample Manifest
-
-```yaml
-apiVersion: sei.io/v1alpha1
-kind: SeiNodeDeployment
-metadata:
-  name: pacific-1-archive-rpc
-  namespace: sei-nodes
-spec:
-  replicas: 3
-
-  template:
-    metadata:
-      labels:
-        sei.io/chain: pacific-1
-        sei.io/role: archive
-    spec:
-      chainId: pacific-1
-      image: "ghcr.io/sei-protocol/sei:v6.3.0"
-      sidecar:
-        image: ghcr.io/sei-protocol/seictl@sha256:64f92fb...
-        resources:
-          requests:
-            cpu: "500m"
-            memory: "256Mi"
-      entrypoint:
-        command: ["seid"]
-        args: ["start", "--home", "/sei"]
-      storage:
-        retainOnDelete: true
-      archive:
-        peers:
-          - ec2Tags:
-              region: eu-central-1
-              tags:
-                ChainIdentifier: pacific-1
-                Component: state-syncer
-        snapshotGeneration:
-          keepRecent: 5
-          destination:
-            s3:
-              bucket: pacific-1-snapshots
-              prefix: state-sync/
-              region: eu-central-1
-
-  networking:
-    deletionPolicy: Retain
-    service:
-      type: ClusterIP
-      ports: ["rpc", "rest", "evm-http", "evm-ws"]
-    gateway:
-      parentRef:
-        name: sei-gateway
-        namespace: istio-system
-      hostnames:
-        - rpc.sei-archive.pacific-1.seinetwork.io
-    isolation:
-      authorizationPolicy:
-        allowedSources:
-          - principals:
-              - "cluster.local/ns/istio-system/sa/istio-ingressgateway"
-          - namespaces:
-              - "sei-nodes"
-
-  monitoring:
-    serviceMonitor:
-      interval: "30s"
-      labels:
-        release: prometheus
-```
-
-**Generated resources:**
-
-| Resource | Name | API version / Kind |
-|----------|------|------|
-| SeiNode | pacific-1-archive-rpc-0 | sei.io/v1alpha1/SeiNode |
-| SeiNode | pacific-1-archive-rpc-1 | sei.io/v1alpha1/SeiNode |
-| SeiNode | pacific-1-archive-rpc-2 | sei.io/v1alpha1/SeiNode |
-| Service | pacific-1-archive-rpc-external | v1/Service (ClusterIP) |
-| HTTPRoute | pacific-1-archive-rpc | gateway.networking.k8s.io/v1/HTTPRoute |
-| AuthorizationPolicy | pacific-1-archive-rpc | security.istio.io/v1/AuthorizationPolicy |
-| ServiceMonitor | pacific-1-archive-rpc | monitoring.coreos.com/v1/ServiceMonitor |
-
-Plus each SeiNode creates its own StatefulSet, PVC, and headless Service (managed by the existing SeiNode controller).
-
----
-
-## 10. What Changes vs. What Stays the Same
-
-| Concern | Status |
-|---------|--------|
-| SeiNode CRD | **SMALL CHANGE** — add optional `spec.podLabels` field (prerequisite) |
-| SeiNode controller | **SMALL CHANGE** — merge `podLabels` into StatefulSet pod template |
-| SeiNodePool controller | **UNCHANGED** |
-| SeiNodePool CRD | **UNCHANGED** |
-| SeiNodeDeployment CRD | **NEW** |
-| Networking types | **NEW** (used by SeiNodeDeployment) |
-| Monitoring types | **NEW** (used by SeiNodeDeployment) |
-| SeiNodeDeployment controller | **NEW** |
-
-Existing SeiNode manifests without `podLabels` continue to work identically. SeiNodeDeployment is additive.
-
----
-
-## 11. Reversibility Analysis
-
-| Decision | How to reverse | Impact |
-|----------|---------------|--------|
-| `spec.podLabels` on SeiNode | Remove field, regenerate CRD. Existing nodes unaffected (nil defaults to empty map). | None |
-| New SeiNodeDeployment CRD | Delete SeiNodeDeployment with `DeletionPolicy: Retain`. Child SeiNodes and networking resources are orphaned and keep running. | SeiNodes become standalone |
-| `sei.io/group` label on child SeiNodes | Remove label. SeiNode controller doesn't read this label. | None |
-| Unstructured HTTPRoute / AuthorizationPolicy / ServiceMonitor | Switch to typed imports later. Same apply semantics. | Code change only |
-| `DeletionPolicy` (covers nodes + networking) | Change per-group. Existing groups unaffected. | Per-resource |
-| WAF | External to controller. Managed by platform team via Terraform/annotations. | Two-way door |
-| Network isolation via AuthorizationPolicy | Optional field. Remove to disable. Istio defaults to ALLOW-all when no policy exists. | Two-way door |
-| Controller SA auto-injection in AuthPolicy | Implementation detail. User never sees it in spec. | Transparent |
-
----
-
-## 12. Implementation Plan
-
-### Phase 0: SeiNode Prerequisite (standalone PR)
-- [ ] Add `spec.podLabels` field to `SeiNodeSpec`
-- [ ] Update `resourceLabelsForNode()` to merge `podLabels` into pod template
-- [ ] Unit tests for label propagation
-- [ ] Regenerate CRD manifests
-
-### Phase 1: SeiNodeDeployment CRD + Node Orchestration
-- [ ] `api/v1alpha1/seinodedeployment_types.go` — SeiNodeDeployment CRD types
-- [ ] `api/v1alpha1/networking_types.go` — Networking types
-- [ ] `api/v1alpha1/monitoring_types.go` — Monitoring types
-- [ ] `internal/controller/nodedeployment/controller.go` — Reconcile loop with `RequeueAfter`
-- [ ] `internal/controller/nodedeployment/nodes.go` — SeiNode create/update/scaleDown with guard
-- [ ] `internal/controller/nodedeployment/labels.go` — Label helpers, naming, merge-order safety
-- [ ] `internal/controller/nodedeployment/status.go` — Status aggregation with `observedGeneration`
-- [ ] Wire into `cmd/main.go`
-- [ ] `make manifests` to generate CRD + RBAC
-- [ ] Unit tests for node orchestration and status
-
-### Phase 2: Shared Networking
-- [ ] `internal/controller/nodedeployment/networking.go` — External Service, HTTPRoute, AuthorizationPolicy
-- [ ] Controller SA auto-injection into AuthorizationPolicy
-- [ ] DeletionPolicy logic (delete vs orphan, covers nodes + networking)
-- [ ] Status conditions for each networking resource
-- [ ] LB ingress reporting
-- [ ] Unit tests for resource generation
-- [ ] Integration tests
-
-### Phase 3: Monitoring + Samples + Documentation
-- [ ] `internal/controller/nodedeployment/monitoring.go` — ServiceMonitor
-- [ ] Sample manifests (Istio pattern, ALB pattern, NLB pattern)
-- [ ] Documentation (external-dns prerequisites, Istio prerequisites)
-- [ ] Printer columns on SeiNodeDeployment
-- [ ] Update `production-deployment-analysis.md` gap table
-
----
-
-## 13. Resolved Questions
-
-| Question | Resolution |
-|----------|-----------|
-| Controller SA in AuthorizationPolicy | Auto-injected (Section 7.4). The controller always adds its own SA to prevent sidecar communication being blocked. |
-| Scaling-up vs degraded phase | Differentiated (Section 7.6). `Initializing` = some nodes progressing. `Degraded` = some nodes failed. `NodesReady` condition provides detail. |
-| Label propagation to pods | Resolved via `spec.podLabels` prerequisite (Section 4). |
-| Label merge order | System labels overwrite user labels (Section 6). |
-| Child SeiNode GC on group deletion | `DeletionPolicy` covers child SeiNodes (Section 7.7). `Retain` orphans everything. |
-
-## 14. Future Scope (explicitly not blocked)
-
-| Feature | Why deferred | How current design accommodates |
-|---------|-------------|--------------------------------|
-| **Rolling update strategy** | Whole feature in itself; needs careful design around archive node sync times | `ensureSeiNode` updates one node at a time naturally; adding `maxUnavailable` / ordered rollout is additive to the reconcile loop |
-| **Heterogeneous groups** | Current use case is homogeneous replicas | Separate SeiNodeDeployments with different templates can share an external Service via matching labels. A future `overrides` per-ordinal field is additive. |
-| **WAF provisioning from K8s** | WAF WebACL is a platform concern | Annotation passthrough makes WAF ARN a two-way door. AWS ACK or Crossplane can manage the WebACL separately. |
-| **Gateway API route rules** | HTTPRoute with no explicit rules routes all traffic to the Service | A `rules` field can be added to `GatewayRouteConfig` without breaking existing manifests |
-| **Multi-listener Gateway** | Current design targets a single Gateway listener | `GatewayParentRef` can be extended with `sectionName *string` for listener targeting |
-| **SeiNodePool + SeiNodeDeployment unification** | Different lifecycle needs (genesis vs fleet) | Both create SeiNodes but don't share children. Could merge in a future major version. |
diff --git a/docs/design-progressive-rollout.md b/docs/design-progressive-rollout.md
deleted file mode 100644
index 01235b23..00000000
--- a/docs/design-progressive-rollout.md
+++ /dev/null
@@ -1,791 +0,0 @@
-# Design: Automated Progressive Rollout — EC2 to K8s RPC Migration
-
-## Overview
-
-Automated, zero-manual-intervention migration of Sei RPC traffic from EC2 to Kubernetes using Route53 weighted routing, in-cluster load generation, and a confidence-score-driven progression loop.
-
-### Architecture (simplified from previous Istio-centric design)
-
-```
-DNS: rpc.pacific-1.sei.io
-         |
-   Route53 weighted record set
-    /                    \
-EC2 ALB                 K8s NLB
-(weight: W_ec2)         (weight: W_k8s)
-   |                       |
-EC2 RPC nodes         Istio IngressGateway
-                           |
-                      K8s Service
-                      (SeiNodeDeployment)
-```
-
-Key simplification: Istio fronts K8s only. It does not sit in the EC2 path. Traffic splitting is done at DNS level via Route53 weighted record sets. This avoids ServiceEntry complexity, mTLS termination issues to EC2, and keeps the EC2 path completely unchanged during migration.
-
-### Trade-offs vs. Istio-only weights
-
-| Factor | Route53 weighted | Istio VirtualService |
-|--------|-----------------|---------------------|
-| EC2 path impact | None | Must route through mesh |
-| Rollback speed | ~10-60s (10s DNS TTL plus resolver and client caching) | ~1s (Envoy push) |
-| Split granularity | Per-DNS-resolution | Per-request |
-| Client caching | Some clients cache DNS | No client caching |
-| Complexity | Low (aws CLI) | Medium (ServiceEntry + DestinationRule) |
-
-For blockchain RPC clients, DNS caching is a real concern. Mitigations: (1) set TTL to 10s on the weighted records, (2) the progression holds at each step for hours, so transient DNS caching does not affect steady-state measurements. The simplicity wins.
-
----
-
-## 1. Automated Weight Progression
-
-### Tool choice: CronJob + shell script (not a custom controller)
-
-A custom Go controller is overkill for a one-time migration. Argo Rollouts and Flagger both assume they own the rollout object (Deployment/Rollout) and are designed for in-cluster traffic splitting, not Route53 manipulation. The right tool is a Kubernetes CronJob running a shell script that:
-
-1. Queries Prometheus for the confidence score
-2. Evaluates gate conditions
-3. Calls `aws route53 change-resource-record-sets` to adjust weights
-4. Posts status to a Slack webhook
-
-This runs as a CronJob with `schedule: "*/5 * * * *"` (every 5 minutes). The script is idempotent: if conditions are not met, it does nothing. If conditions are met and the current weight is below the next step, it advances.
-
-### Implementation
-
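-Container image: `aws-cli` + `curl` + `jq` on a small base image. No custom Go code. The `amazon/aws-cli` image in the manifest below is a stand-in; it is Amazon Linux based rather than Alpine, so confirm `curl` and `jq` are present or build a derived image.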
-
-```yaml
-apiVersion: batch/v1
-kind: CronJob
-metadata:
-  name: rpc-migration-controller
-  namespace: sei-infra
-spec:
-  schedule: "*/5 * * * *"
-  concurrencyPolicy: Forbid
-  jobTemplate:
-    spec:
-      backoffLimit: 0
-      template:
-        spec:
-          serviceAccountName: rpc-migration-controller
-          containers:
-          - name: controller
-            image: amazon/aws-cli:2.15
-            command: ["/bin/bash", "/scripts/progress.sh"]
-            env:
-            - name: PROMETHEUS_URL
-              value: "http://prometheus.monitoring:9090"
-            - name: HOSTED_ZONE_ID
-              value: "Z0123456789ABCDEF"
-            - name: RECORD_NAME
-              value: "rpc.pacific-1.sei.io"
-            - name: EC2_ALB_DNS
-              value: "ec2-rpc-alb-123456.us-east-1.elb.amazonaws.com"
-            - name: K8S_NLB_DNS
-              value: "k8s-rpc-nlb-789012.us-east-1.elb.amazonaws.com"
-            - name: SLACK_WEBHOOK_URL
-              valueFrom:
-                secretKeyRef:
-                  name: rpc-migration-secrets
-                  key: slack-webhook-url
-            - name: WEIGHT_STEPS
-              value: "0,1,10,50,100"
-            - name: MIN_HOLD_MINUTES
-              value: "240"  # 4 hours at each step
-            - name: ROLLBACK_THRESHOLD
-              value: "40"   # confidence score below this triggers rollback
-            volumeMounts:
-            - name: scripts
-              mountPath: /scripts
-          volumes:
-          - name: scripts
-            configMap:
-              name: rpc-migration-scripts
-          restartPolicy: Never
-```
-
-### The progression script (`progress.sh`)
-
-Core logic (pseudocode — the real script is straightforward bash):
-
-```bash
-#!/bin/bash
-set -euo pipefail
-
-STEPS=(${WEIGHT_STEPS//,/ })
-CURRENT_K8S_WEIGHT=$(get_current_route53_weight "k8s")
-CONFIDENCE=$(query_prometheus_confidence_score | cut -d. -f1)  # integer part only: bash (( )) cannot compare floats
-LAST_CHANGE_TIME=$(get_annotation_from_configmap "last-weight-change")
-MINUTES_AT_CURRENT=$(minutes_since "$LAST_CHANGE_TIME")
-
-# Rollback check — runs before progression
-if (( CONFIDENCE < ROLLBACK_THRESHOLD )) && (( CURRENT_K8S_WEIGHT > 0 )); then
-    previous_step=$(find_previous_step "$CURRENT_K8S_WEIGHT")
-    set_route53_weight "$previous_step"; record_change_time  # restart the hold timer so the next run does not re-advance immediately
-    notify_slack ":rotating_light: ROLLBACK: confidence=$CONFIDENCE, weight $CURRENT_K8S_WEIGHT -> $previous_step"
-    exit 0
-fi
-
-# Progression check
-if (( MINUTES_AT_CURRENT < MIN_HOLD_MINUTES )); then
-    echo "Holding at weight=$CURRENT_K8S_WEIGHT for $MINUTES_AT_CURRENT/$MIN_HOLD_MINUTES minutes"
-    exit 0
-fi
-
-next_step=$(find_next_step "$CURRENT_K8S_WEIGHT")
-if [[ -z "$next_step" ]]; then
-    echo "Already at final weight. Migration complete."
-    exit 0
-fi
-
-# Gate: confidence must be above threshold for progression
-if (( CONFIDENCE >= 80 )); then
-    set_route53_weight "$next_step"
-    record_change_time
-    notify_slack ":white_check_mark: PROGRESS: confidence=$CONFIDENCE, weight $CURRENT_K8S_WEIGHT -> $next_step"
-else
-    echo "Confidence=$CONFIDENCE below 80, holding at weight=$CURRENT_K8S_WEIGHT"
-fi
-```
-
-State is stored in a ConfigMap (`rpc-migration-state`) with keys:
-- `current-k8s-weight`: redundant with Route53 but avoids API calls for reads
-- `last-weight-change`: ISO 8601 timestamp
-- `rollback-count`: number of rollbacks (alarm if > 2)
-
-### Route53 weight mechanics
-
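-Route53 weighted records use relative weights, not percentages: the share of DNS responses a record receives is its weight divided by the sum of all weights in the set. With the two weights chosen to sum to 100, the K8s weight reads directly as a percentage: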
-
-| Step | K8s weight | EC2 weight | Effective K8s % |
-|------|-----------|-----------|----------------|
-| 0%   | 0         | 100       | 0%             |
-| 1%   | 1         | 99        | ~1%            |
-| 10%  | 10        | 90        | ~10%           |
-| 50%  | 50        | 50        | 50%            |
-| 100% | 100       | 0         | 100%           |
-
-TTL on both records: 10 seconds. Route53 accepts lower values, but 10 seconds is a practical floor, and resolvers that honor the TTL pick up weight changes within seconds.
-
-### Timing and cadence
-
-| Step | Min hold | Rationale |
-|------|----------|-----------|
-| 0% -> 1% | Requires passing load test (see section 2) | First real traffic |
-| 1% -> 10% | 4 hours | Detect issues at low blast radius |
-| 10% -> 50% | 4 hours | Significant traffic, covers edge cases |
-| 50% -> 100% | 12 hours (overnight) | Full confidence before cutover |
-
-The 4-hour hold is configurable via `MIN_HOLD_MINUTES`. The 50% -> 100% step uses a longer hold, implemented as a special case in the script (check if current step is 50, use 720 minutes).
-
----
-
-## 2. Load Generation for 0% Phase
-
-### Tool choice: k6
-
-k6 over vegeta or custom Go:
-- Native JavaScript scripting for complex RPC query patterns
-- Built-in Prometheus remote write (metrics go straight to our stack)
-- Thresholds that can fail the test programmatically
-- Runs as a Kubernetes Job, no persistent infrastructure
-- Handles ramping, stages, and per-endpoint breakdowns natively
-
-### Deriving traffic patterns from EC2 ALB access logs
-
-Before writing k6 scripts, extract the real query distribution:
-
-```bash
-# Enable ALB access logging to S3 if not already enabled
-# Then analyze the logs:
-
-# 1. Download recent access logs (24h sample)
-aws s3 sync s3://sei-infra-alb-logs/AWSLogs/.../elasticloadbalancing/ ./alb-logs/ \
-  --exclude "*" --include "*.log.gz"
-
-# 2. Extract RPC method distribution
-zcat alb-logs/*.gz |
-  awk -F'"' '{print $2}' |          # extract the quoted request field
-  grep -oE '(GET|POST) [^ ]+' |     # HTTP method + URL
-  sort | uniq -c | sort -rn | head -30
-
-# 3. ALB access logs never include request bodies, so JSON-RPC method
-#    names in POST payloads must be sampled from application-level logs
-#    on the EC2 nodes (CometBFT logs the method in its access log).
-```
-
-Expected distribution for a Sei RPC node (typical from sei-infra patterns):
-
-| Method | Approx % | Type |
-|--------|---------|------|
-| `abci_query` (bank balances, wasm state) | 35% | POST JSON-RPC |
-| `block` / `block_results` | 20% | GET or POST |
-| `tx_search` | 15% | GET |
-| `status` | 10% | GET |
-| `eth_call` (EVM JSON-RPC) | 8% | POST JSON-RPC on :8545 |
-| `eth_getBlockByNumber` | 5% | POST JSON-RPC on :8545 |
-| `broadcast_tx_sync` | 3% | POST (skip in load test) |
-| `validators` / `consensus_state` | 2% | GET |
-| Other | 2% | Mixed |
-
-### k6 load test script
-
-```javascript
-// k6-rpc-load-test.js
-import http from 'k6/http';
-import { check, sleep } from 'k6';
-import { Counter, Trend } from 'k6/metrics';
-
-const RPC_URL = __ENV.RPC_URL || 'http://k8s-rpc-svc.sei.svc.cluster.local:26657';
-const EVM_URL = __ENV.EVM_URL || 'http://k8s-rpc-svc.sei.svc.cluster.local:8545';
-
-const errorCount = new Counter('rpc_errors');
-const blockHeightLag = new Trend('block_height_lag');
-
-// Ramping profile: ramp up over 15 min (5m warm-up + 10m ramp), hold at target for 1h, cool down over 5 min
-export const options = {
-  stages: [
-    { duration: '5m',  target: 50 },   // warm up
-    { duration: '10m', target: 200 },   // ramp to target
-    { duration: '60m', target: 200 },   // sustained load
-    { duration: '5m',  target: 0 },     // cool down
-  ],
-  thresholds: {
-    'http_req_failed': ['rate<0.01'],        // <1% errors
-    'http_req_duration{method:status}': ['p(99)<500'],  // p99 < 500ms
-    'http_req_duration{method:abci_query}': ['p(99)<2000'],
-    'rpc_errors': ['count<50'],
-  },
-};
-
-// Weighted method selection matching real traffic distribution
-const methods = [
-  { weight: 35, fn: abciQuery },
-  { weight: 20, fn: blockQuery },
-  { weight: 15, fn: txSearch },
-  { weight: 10, fn: statusQuery },
-  { weight: 8,  fn: ethCall },
-  { weight: 5,  fn: ethGetBlock },
-  { weight: 5,  fn: validatorsQuery },
-  // broadcast_tx intentionally excluded
-];
-
-const totalWeight = methods.reduce((sum, m) => sum + m.weight, 0);
-
-export default function () {
-  const rand = Math.random() * totalWeight;
-  let cumulative = 0;
-  for (const m of methods) {
-    cumulative += m.weight;
-    if (rand < cumulative) {
-      m.fn();
-      break;
-    }
-  }
-  sleep(0.1); // 100ms think time
-}
-
-function statusQuery() {
-  const res = http.get(`${RPC_URL}/status`, { tags: { method: 'status' } });
-  check(res, { 'status 200': (r) => r.status === 200 });
-  if (res.status !== 200) errorCount.add(1);
-
-  // Track block height lag vs EC2
-  if (res.status === 200) {
-    try {
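-      const height = parseInt(res.json().result.sync_info.latest_block_height, 10);
-      // NOTE: nothing in this script sets globalThis.ec2Height (k6 passes setup() data as the default function's argument)
-      if (Number.isFinite(globalThis.ec2Height)) blockHeightLag.add(Math.abs(height - globalThis.ec2Height));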
-    } catch (e) { /* ignore parse errors */ }
-  }
-}
-
-function abciQuery() {
-  const payload = JSON.stringify({
-    jsonrpc: '2.0', id: 1, method: 'abci_query',
-    params: {
-      path: '/cosmos.bank.v1beta1.Query/AllBalances',
-      data: '',  // empty query = recent state
-      height: '0', prove: false,
-    },
-  });
-  const res = http.post(RPC_URL, payload, {
-    headers: { 'Content-Type': 'application/json' },
-    tags: { method: 'abci_query' },
-  });
-  if (res.status !== 200) errorCount.add(1);
-}
-
-function blockQuery() {
-  const res = http.get(`${RPC_URL}/block`, { tags: { method: 'block' } });
-  if (res.status !== 200) errorCount.add(1);
-}
-
-function txSearch() {
-  // Search all transactions (tx.height>0), newest first, 10 per page
-  const res = http.get(
-    `${RPC_URL}/tx_search?query="tx.height>0"&per_page=10&page=1&order_by="desc"`,
-    { tags: { method: 'tx_search' } }
-  );
-  if (res.status !== 200) errorCount.add(1);
-}
-
-function ethCall() {
-  const payload = JSON.stringify({
-    jsonrpc: '2.0', id: 1, method: 'eth_call',
-    params: [{ to: '0x0000000000000000000000000000000000001002', data: '0x' }, 'latest'],
-  });
-  const res = http.post(EVM_URL, payload, {
-    headers: { 'Content-Type': 'application/json' },
-    tags: { method: 'eth_call' },
-  });
-  if (res.status !== 200) errorCount.add(1);
-}
-
-function ethGetBlock() {
-  const payload = JSON.stringify({
-    jsonrpc: '2.0', id: 1, method: 'eth_getBlockByNumber',
-    params: ['latest', false],
-  });
-  const res = http.post(EVM_URL, payload, {
-    headers: { 'Content-Type': 'application/json' },
-    tags: { method: 'eth_getBlockByNumber' },
-  });
-  if (res.status !== 200) errorCount.add(1);
-}
-
-function validatorsQuery() {
-  const res = http.get(`${RPC_URL}/validators`, { tags: { method: 'validators' } });
-  if (res.status !== 200) errorCount.add(1);
-}
-```
-
-### k6 Kubernetes Job
-
-```yaml
-apiVersion: batch/v1
-kind: Job
-metadata:
-  name: rpc-load-test
-  namespace: sei-infra
-spec:
-  backoffLimit: 0
-  template:
-    spec:
-      containers:
-      - name: k6
-        image: grafana/k6:0.49.0
-        command: ["k6", "run", "--out", "experimental-prometheus-rw", "/scripts/load-test.js"]
-        env:
-        - name: K6_PROMETHEUS_RW_SERVER_URL
-          value: "http://prometheus.monitoring:9090/api/v1/write"
-        - name: K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM
-          value: "true"
-        - name: RPC_URL
-          value: "http://sei-rpc-pacific-1.sei:26657"
-        - name: EVM_URL
-          value: "http://sei-rpc-pacific-1.sei:8545"
-        volumeMounts:
-        - name: scripts
-          mountPath: /scripts
-      volumes:
-      - name: scripts
-        configMap:
-          name: k6-rpc-load-test
-      restartPolicy: Never
-```
-
-### Duration and intensity
-
-| Parameter | Value | Rationale |
-|-----------|-------|-----------|
-| Ramp-up | 15 minutes to 200 VUs | Avoid cold-start spike |
-| Sustained | 60 minutes at 200 VUs | Up to ~2000 req/s, sized to the EC2 ALB peak |
-| Pass criteria | All k6 thresholds green | Automated gate |
-| Runs required | 3 consecutive passes | Eliminate flakes |
-
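-The target RPS should match or exceed the p95 traffic level observed on the EC2 ALB. With 100 ms of think time, each VU issues at most ~10 req/s, so 200 VUs give a ceiling of 2000 req/s; real throughput is lower because every iteration also waits for its response. Adjust the VU count based on actual ALB CloudWatch `RequestCount` metrics.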
-
----
-
-## 3. Observability During Rollout
-
-### Metrics sources
-
-| Metric | K8s source | EC2 source |
-|--------|-----------|-----------|
-| Request rate | Istio `istio_requests_total` | CloudWatch ALB `RequestCount` |
-| Error rate | Istio `istio_requests_total{response_code=~"5.."}` | CloudWatch ALB `HTTPCode_Target_5XX_Count` |
-| Latency p50/p99 | Istio `istio_request_duration_milliseconds` | CloudWatch ALB `TargetResponseTime` |
-| Block height | Prometheus scraping K8s seid `:26657/status` | CloudWatch custom metric or Prometheus remote-write agent on EC2 |
-| Pod health | `sei_controller_seinode_phase` | N/A |
-
-### EC2 metrics strategy
-
-Two options, prefer option A for simplicity:
-
-**Option A: CloudWatch only (recommended)**
-
-EC2 ALB already publishes metrics to CloudWatch. Use the `yet-another-cloudwatch-exporter` (YACE) running in K8s to scrape CloudWatch metrics into Prometheus. This avoids touching EC2 infrastructure.
-
-```yaml
-# yace config for EC2 ALB metrics
-discovery:
-  jobs:
-  - type: AWS/ApplicationELB
-    regions: [us-east-1]
-    searchTags:
-    - key: Name
-      value: sei-rpc-.*  # searchTags values are regexes, not globs
-    metrics:
-    - name: RequestCount
-      statistics: [Sum]
-      period: 60
-    - name: TargetResponseTime
-      statistics: [p50, p99]
-      period: 60
-    - name: HTTPCode_Target_5XX_Count
-      statistics: [Sum]
-      period: 60
-```
-
-**Option B: Prometheus agent on EC2**
-
-Run `prometheus-agent` on one EC2 node, remote-write to the K8s Prometheus. More accurate latency data but requires EC2 changes. Only do this if CloudWatch TargetResponseTime granularity is insufficient.
-
-### Block height comparison
-
-The existing controller already uses CometBFT `/status` to track `latest_block_height` and `catching_up`. For the migration dashboard, add a recording rule that computes lag:
-
-```yaml
-groups:
-- name: rpc-migration
-  rules:
-  # K8s block height (from seid metrics or ServiceMonitor scrape)
-  - record: sei:rpc:block_height:k8s
-    expr: max(cometbft_consensus_latest_block_height{namespace="sei"})
-
-  # EC2 block height (scraped via YACE custom metric or a simple curl probe)
-  - record: sei:rpc:block_height:ec2
-    expr: max(sei_ec2_rpc_block_height)  # drop job/instance labels so the subtraction below matches the label-free K8s series
-
-  - record: sei:rpc:block_height_lag
-    expr: abs(sei:rpc:block_height:k8s - sei:rpc:block_height:ec2)
-```
-
-For the EC2 block height, deploy a simple CronJob that curls the EC2 ALB `/status` endpoint once a minute (the finest granularity a CronJob schedule supports) and pushes the height to Prometheus via pushgateway or a `/metrics` endpoint:
-
-```yaml
-apiVersion: batch/v1
-kind: CronJob
-metadata:
-  name: ec2-height-probe
-  namespace: sei-infra
-spec:
-  schedule: "* * * * *"  # every minute (finest CronJob granularity)
-  jobTemplate:
-    spec:
-      template:
-        spec:
-          containers:
-          - name: probe
-            image: curlimages/curl:8.7.1
-            command:
-            - /bin/sh
-            - -c
-            - |
-              HEIGHT=$(curl -sf http://EC2_ALB_DNS:26657/status | \
-                sed -n 's/.*"latest_block_height": *"\([0-9]*\)".*/\1/p')  # curlimages/curl ships no jq
-              echo "sei_ec2_rpc_block_height $HEIGHT" | curl -sf --data-binary @- \
-                "http://pushgateway.monitoring:9091/metrics/job/ec2-height-probe"
-          restartPolicy: Never
-```
-
-Better alternative: use a Prometheus Blackbox Exporter probe target that hits the EC2 ALB `/status` and parses height. Configure it through a Prometheus Operator `Probe` resource.
-
-### Dashboard layout (Grafana)
-
-Single dashboard: **"RPC Migration: EC2 vs K8s"**
-
-```
-Row 1: Migration Status
-  [Current Weight: 10% K8s] [Confidence Score: 87] [Time at Step: 2h 15m] [Rollback Count: 0]
-
-Row 2: Traffic Split (stacked time series)
-  [Requests/sec — EC2 vs K8s, stacked area]  [Error Rate — EC2 vs K8s, line]
-
-Row 3: Latency Comparison
-  [p50 Latency — EC2 vs K8s, line]  [p99 Latency — EC2 vs K8s, line]
-
-Row 4: Chain Health
-  [Block Height — EC2 vs K8s, line]  [Block Height Lag, line with threshold at 5]
-
-Row 5: K8s Platform Health
-  [SeiNode Phase, state timeline]  [Pod Restarts, bar]  [Istio 5xx Rate, line]
-
-Row 6: Load Test Results (only during 0% phase)
-  [k6 RPS, line]  [k6 Error Rate, line]  [k6 p99 Latency, line]
-```
-
-The dashboard JSON will be stored in a ConfigMap and auto-provisioned via Grafana's sidecar provisioner.
-
----
-
-## 4. The Confidence Score
-
-### Definition
-
-A single number from 0 to 100 representing migration readiness. Computed every 60 seconds as a Prometheus recording rule (the rule group's `interval`). The progression script queries this single metric.
-
-### Formula
-
-```
-confidence = w_uptime * S_uptime
-           + w_error  * S_error
-           + w_latency * S_latency
-           + w_height * S_height
-           + w_load   * S_load
-```
-
-Where each sub-score is 0-100 and weights sum to 1.0:
-
-| Component | Weight | Score function | 100 (perfect) | 0 (fail) |
-|-----------|--------|---------------|----------------|----------|
-| `S_uptime` | 0.20 | K8s pod uptime over last 1h | 100% uptime | <95% uptime |
-| `S_error` | 0.30 | Error rate comparison | K8s error rate <= EC2 | K8s error rate > 5x EC2 |
-| `S_latency` | 0.25 | p99 latency parity | K8s p99 <= 1.1x EC2 | K8s p99 > 2x EC2 |
-| `S_height` | 0.15 | Block height parity | Lag <= 1 block | Lag > 10 blocks |
-| `S_load` | 0.10 | Load test pass (0% phase only) | All thresholds pass | Any threshold fail |
-
-Error rate gets the highest weight because RPC correctness is the primary concern. Latency is second because blockchain clients are sensitive to query timeouts.
-
-### Prometheus recording rules
-
-```yaml
-groups:
-- name: rpc-migration-confidence
-  interval: 60s
-  rules:
-  # Sub-score: uptime (fraction of the last 1h that pods were Ready; 100 at 100%, 0 at <= 95%)
-  - record: sei:migration:score:uptime
-    expr: |
-      clamp(
-        100 * (
-          avg(avg_over_time(kube_pod_status_ready{namespace="sei",condition="true"}[1h]))
-          - 0.95
-        ) / 0.05,
-        0, 100
-      )
-
-  # Sub-score: error rate
-  # 100 when K8s error rate <= EC2, linear decay to 0 when K8s >= 5x EC2
-  - record: sei:migration:score:error_rate
-    expr: |
-      clamp(
-        100 * (1 - clamp_min(
-          (
-            sum(rate(istio_requests_total{reporter="destination",namespace="sei",response_code=~"5.."}[10m]))
-            / clamp_min(sum(rate(istio_requests_total{reporter="destination",namespace="sei"}[10m])), 0.001)
-          )
-          /
-          clamp_min(
-            (sei:ec2:error_rate_5xx OR on() vector(0.001)),
-            0.001
-          )
-          - 1, 0
-        ) / 4),
-        0, 100
-      )
-
-  # Sub-score: latency parity
-  # 100 when K8s p99 <= 1.1x EC2, linear decay to 0 when >= 2x EC2
-  - record: sei:migration:score:latency
-    expr: |
-      clamp(
-        100 * (1 - clamp_min(
-          histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="sei"}[10m])) by (le))
-          / clamp_min(sei:ec2:latency_p99_ms, 1)
-          - 1.1, 0
-        ) / 0.9),
-        0, 100
-      )
-
-  # Sub-score: block height parity
-  # 100 when lag <= 1, linear decay to 0 when lag >= 10
-  - record: sei:migration:score:block_height
-    expr: |
-      clamp(
-        100 * (1 - clamp_min(sei:rpc:block_height_lag - 1, 0) / 9),
-        0, 100
-      )
-
-  # Sub-score: load test (set by k6 job completion)
-  # Stored in pushgateway: 100 if last test passed, 0 if failed
-  - record: sei:migration:score:load_test
-    expr: max(sei_migration_load_test_score) or on() vector(0)
-
-  # Composite confidence score
-  - record: sei:migration:confidence_score
-    expr: |
-      0.20 * sei:migration:score:uptime
-      + 0.30 * sei:migration:score:error_rate
-      + 0.25 * sei:migration:score:latency
-      + 0.15 * sei:migration:score:block_height
-      + 0.10 * sei:migration:score:load_test
-```
-
-### Score interpretation
-
-| Range | Meaning | Automation action |
-|-------|---------|-------------------|
-| 80-100 | Green: all signals nominal | Advance to next weight step |
-| 60-79 | Yellow: minor degradation | Hold at current step, alert to Slack |
-| 40-59 | Orange: significant issues | Hold, page on-call |
-| 0-39 | Red: rollback conditions | Automatic rollback to previous step |
-
-### Rollback conditions (any one triggers)
-
-These are evaluated independently from the composite score as hard circuit breakers:
-
-1. **Error rate spike**: K8s 5xx rate > 5% for 5 consecutive minutes
-2. **Latency regression**: K8s p99 > 3x EC2 p99 for 10 consecutive minutes
-3. **Block height stall**: K8s block height not advancing for 2 minutes
-4. **Pod failure**: SeiNodeDeployment phase == Degraded or Failed
-
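-The progression script checks these independently (this excerpt covers the error-rate, block-height, and phase breakers; the latency breaker is not shown):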
-
-```bash
-check_circuit_breakers() {
-    # Hard error rate check
-    error_rate=$(promql 'sum(rate(istio_requests_total{namespace="sei",response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{namespace="sei"}[5m]))')
-    if (( $(echo "$error_rate > 0.05" | bc -l) )); then
-        echo "BREAKER: error_rate=$error_rate > 5%"
-        return 1
-    fi
-
-    # Block height stall
-    height_change=$(promql 'changes(cometbft_consensus_latest_block_height{namespace="sei"}[2m])')
-    if (( $(echo "$height_change < 1" | bc -l) )); then
-        echo "BREAKER: block height stalled"
-        return 1
-    fi
-
-    # SeiNodeDeployment health
-    group_phase=$(kubectl get sng -n sei -o jsonpath='{.items[0].status.phase}')
-    if [[ "$group_phase" == "Failed" || "$group_phase" == "Degraded" ]]; then
-        echo "BREAKER: SeiNodeDeployment phase=$group_phase"
-        return 1
-    fi
-
-    return 0
-}
-```
-
----
-
-## 5. Timeline
-
-### Prerequisites (before clock starts)
-
-- SeiNodeDeployment deployed, all nodes synced (`catching_up: false`)
-- Istio sidecar injection on, Gateway + HTTPRoute working
-- ServiceMonitor scraping K8s nodes
-- YACE exporting EC2 ALB CloudWatch metrics
-- EC2 block height probe running
-- Grafana dashboard provisioned
-- Confidence score recording rules deployed
-- CronJob + scripts deployed (weight at 0%)
-
-### Automated progression timeline
-
-| Day | Phase | K8s Weight | Activity |
-|-----|-------|-----------|----------|
-| 0 | Load test | 0% | k6 Job runs 3x against K8s-internal endpoint. Automated: passes or fails. |
-| 0-1 | Soak at 0% | 0% | Confidence score baking. ExportAndCompare running. Score must reach 80+. |
-| 1 | First traffic | 1% | Automation advances when score >= 80 and load tests pass. Hold 4h. |
-| 1-2 | Low canary | 10% | Automation advances. Hold 4h. |
-| 2-3 | Mid canary | 50% | Hold overnight (12h minimum). |
-| 3-4 | Full cutover | 100% | EC2 weight = 0. EC2 stays running as hot standby. |
-| 4-7 | Hot standby | 100% | EC2 keeps syncing. Manual decommission after 72h of 100% K8s. |
-| 7 | Decommission | 100% | Remove Route53 weighted records, switch to simple alias. Terminate EC2. |
-
-**Total: 7 days from "nodes synced" to "EC2 decommissioned".**
-
-This compresses the original 6-week plan to 1 week because:
-1. No manual gate reviews (automation drives progression)
-2. No shadow traffic phase (k6 load test replaces Istio mirroring)
-3. Confidence score eliminates waiting for human judgment
-4. Route53 is operationally simpler than Istio VirtualService weight management
-
-The 7-day timeline can extend automatically if the confidence score drops below threshold at any step. The automation holds until conditions improve, with no human intervention needed.
-
-### When human intervention IS required
-
-The system pages on-call (PagerDuty) for:
-1. More than 2 rollbacks at the same weight step (potential systemic issue)
-2. Confidence score below 40 for more than 1 hour
-3. Complete K8s cluster failure (all pods down)
-
-These scenarios indicate problems that a weight-shifting automation cannot fix.
-
----
-
-## 6. File Layout
-
-```
-manifests/samples/migration/
-  cronjob.yaml                    # Migration controller CronJob
-  configmap-scripts.yaml          # progress.sh and helper functions
-  configmap-state.yaml            # Mutable state (current weight, timestamps)
-  rbac.yaml                       # ServiceAccount + IAM role for Route53
-  k6-load-test-job.yaml           # Load test Job template
-  k6-load-test-configmap.yaml     # k6 script
-  ec2-height-probe.yaml           # Block height comparison CronJob
-  yace-config.yaml                # CloudWatch exporter for EC2 ALB metrics
-  prometheus-rules.yaml           # Recording rules for confidence score
-  grafana-dashboard-configmap.yaml # Migration dashboard JSON
-```
-
-No custom Go code. No new CRDs. No Argo/Flagger dependencies. The entire system is:
-- 1 CronJob (progression controller)
-- 1 Job (k6 load test, run on demand)
-- 1 CronJob (EC2 height probe)
-- 1 Deployment (YACE, likely already running)
-- Recording rules + dashboard (declarative config)
-
----
-
-## 7. Relation to Existing Infrastructure
-
-### What changes in sei-k8s-controller: nothing
-
-The controller already provisions:
-- `SeiNodeDeployment` with `networking.service` (type: LoadBalancer) -- this creates the K8s NLB
-- `networking.gateway` with HTTPRoute -- Istio routes internal mesh traffic
-- `monitoring.serviceMonitor` -- Prometheus scrapes K8s nodes
-- `NetworkingStatus.LoadBalancerIngress` -- reports the NLB address
-
-The migration automation reads these outputs but does not modify the controller. The Route53 records are managed externally by the CronJob, pointing at the NLB address from `NetworkingStatus.LoadBalancerIngress`.
-
-### What changes in EC2: nothing
-
-EC2 ALB continues serving traffic at its current DNS name. Route53 weighted records sit above both the ALB and NLB. EC2 infrastructure is untouched until decommission day.
-
-### What changes in DNS
-
-Before migration:
-```
-rpc.pacific-1.sei.io  ->  ALIAS  ->  EC2 ALB
-```
-
-During migration:
-```
-rpc.pacific-1.sei.io  ->  WEIGHTED (TTL=10s)
-                            ├── SetId=ec2, Weight=100-W, ALIAS -> EC2 ALB
-                            └── SetId=k8s, Weight=W,     ALIAS -> K8s NLB
-```
-
-After migration:
-```
-rpc.pacific-1.sei.io  ->  ALIAS  ->  K8s NLB
-```
diff --git a/docs/design-rpc-migration-istio.md b/docs/design-rpc-migration-istio.md
deleted file mode 100644
index abe7fb48..00000000
--- a/docs/design-rpc-migration-istio.md
+++ /dev/null
@@ -1,119 +0,0 @@
-# Design: EC2-to-K8s RPC Node Migration via Istio Traffic Mirroring
-
-## Overview
-
-Progressive migration of Sei blockchain RPC infrastructure from EC2 to sei-k8s-controller-managed nodes using Istio traffic mirroring and weighted routing.
-
-## Architecture
-
-```
-DNS: rpc.pacific-1.sei.io
-         |
-    AWS NLB (L4)
-         |
-  Istio IngressGateway
-    (Envoy, in-mesh)
-         |
-  +------+------+
-  |             |
-[primary]   [mirror]
-  |             |
-ServiceEntry  K8s Service
-(EC2 ALB)    (SeiNodeDeployment)
-```
-
-Replace ALB with NLB + Istio Gateway. The gateway terminates L7 and applies VirtualService routing. EC2 is the primary backend; K8s receives mirrored (fire-and-forget) traffic.
-
-## Istio Manifests
-
-All at `manifests/samples/istio/pacific-1-rpc-mirror/`:
-
-- **`service-entry.yaml`** — EC2 ALB as `ec2-rpc.pacific-1.internal`, DNS resolution, ports 26657/8545/9090
-- **`destination-rule.yaml`** — mTLS disabled to EC2 (outside mesh), HTTP/2 upgrade disabled (CometBFT is HTTP/1.1), outlier detection
-- **`virtual-service.yaml`** — Phase 2 mirror config: 100% to EC2, 100% mirror to K8s. WebSocket routes to EC2 only (Istio cannot mirror WebSocket)
-- **`virtual-service-cutover.yaml`** — Phase 3 template: weighted routing between EC2 and K8s
-- **`peer-authentication.yaml`** — STRICT mTLS on K8s RPC pods
-- **`telemetry.yaml`** — Access logging for error/latency analysis
-
-## Migration Phases
-
-### Phase 0: Isolated Validation (Week 1-2)
-- Deploy SeiNodeDeployment for RPC, sync from S3 snapshot
-- Enable `ExportAndCompare` with `canonicalRpc` pointing at EC2
-- Run 48h with zero app-hash divergence (Layer 0 + Layer 1)
-- Validate all alerts fire correctly
-- **Blast radius: zero**
-
-### Phase 1: Synthetic Load (Week 2)
-- Replay recorded RPC queries against both EC2 and K8s
-- Compare responses byte-for-byte (normalize node ID, peer list)
-- Deploy synthetic WebSocket client subscribing to NewBlock on both
-- **Gate: 100% response parity, latency within 20%**
-
-### Phase 2: Shadow Traffic (Week 3)
-- Point DNS at NLB, VirtualService mirrors 100% to K8s
-- Responses discarded — clients only see EC2
-- Monitor: mirror acceptance rate, K8s error rate, latency delta, block height lag
-- **Gate: 48h clean metrics**
-- **Rollback: DNS back to ALB (60s)**
-
-### Phase 3: Canary (Week 4)
-- Switch from mirror to weighted routing: 1% → 5% → 10% → 25% → 50%
-- Hold 4h minimum at each step, overnight at 50%
-- Exclude `/broadcast_tx*` from early canary (add after 10%)
-- **Gate: 24h at 50% with no degradation**
-- **Rollback: Set K8s weight to 0 (seconds)**
-
-### Phase 4: Full Cutover (Week 5)
-- 100% to K8s, EC2 hot standby
-- Keep EC2 syncing for 48h
-- **Rollback: Set EC2 weight to 100 (seconds)**
-
-### Phase 5: Decommission (Week 6)
-- Remove ServiceEntry, comparison CronJob
-- Decommission EC2 instances
-- Optionally simplify ingress (remove Istio gateway if not needed long-term)
-
-## Key Design Decisions
-
-### Istio route weights over DNS
-DNS caching by blockchain clients makes Route53 splits non-deterministic. Istio applies weights per-request at the proxy, with immediate propagation and rollback in seconds.
-
-### WebSocket handled separately
-Istio cannot mirror WebSocket (persistent bidirectional stream). HTTP RPC is mirrored; WebSocket gets weighted routing during cutover. The sidecar `ExportAndCompare` validates execution correctness independently.
-
-### STRICT mTLS on K8s pods
-EC2 traffic enters through the ingress gateway (which terminates external TLS and originates mTLS). No reason for non-mTLS traffic to reach K8s pods directly.
-
-### ExportAndCompare is the correctness oracle
-Istio mirroring provides realistic query load. But execution correctness is validated by the sidecar's block-by-block Layer 0/Layer 1 comparison, which uploads DivergenceReport artifacts to S3. Access logs and metrics are supporting signals, not source of truth.
-
-### Do not mirror /broadcast_tx
-Mirroring write endpoints would double-broadcast transactions. Mempool dedup handles it, but it wastes resources and creates confusing logs.
-
-## Confidence Criteria
-
-| Category | Signal | Pass |
-|----------|--------|------|
-| Chain correctness | App-hash agreement (L0+L1) | Zero divergences over 10k blocks |
-| Chain correctness | Block height parity | Within 2 blocks of EC2 |
-| Performance | RPC latency p99 | Within 20% of EC2 |
-| Operations | Automated pod recovery | Recovers in < 5 min |
-| Operations | Blue-green deployment | Works without manual steps |
-| Data plane | Gateway healthy | ConditionNetworkingReady == True |
-
-## Prerequisites Checklist
-
-- [ ] SeiNodeDeployment for RPC deployed, all nodes synced (`catching_up: false`)
-- [ ] Istio sidecar injection enabled on RPC namespace
-- [ ] Gateway + ServiceEntry deployed, reachable from mesh
-- [ ] ExportAndCompare running 48h with zero divergence
-- [ ] Monitoring: block height lag alert, comparison divergence alert, gateway error rate alert
-- [ ] Dashboard: chain health + traffic + operator health
-- [ ] DNS TTL lowered to 60s
-- [ ] Rollback procedure documented and rehearsed
-- [ ] On-call briefed on the migration
-
-## Timeline
-
-6 weeks (compressible to 4). Do not compress below 4 — the canary ramp alone needs 5-7 days.
diff --git a/docs/design-seinode-import-volume-lld.md b/docs/design-seinode-import-volume-lld.md
deleted file mode 100644
index 44df9bc3..00000000
--- a/docs/design-seinode-import-volume-lld.md
+++ /dev/null
@@ -1,479 +0,0 @@
-# Design: SeiNode — Import Existing Storage (LLD)
-
-**Status:** Draft / LLD
-**Date:** 2026-04-21
-**Tracks:** [#105](https://github.com/sei-protocol/sei-k8s-controller/issues/105)
-**Related:** [#104](https://github.com/sei-protocol/sei-k8s-controller/issues/104), [`docs/design-seinode-import-volume.md`](design-seinode-import-volume.md)
-
-This is the companion LLD to the direction doc. The direction doc fixes *what* and *why*; this doc fixes *how*. Scope is strictly bounded to Shape A (import PVC by name) plus the #104 create-path fix. No new shapes, no new fields, no new use cases.
-
-## 1. CRD schema changes
-
-A single new optional sub-struct is added to `SeiNodeSpec` in `api/v1alpha1/seinode_types.go`. Field naming follows the spec sketch in the direction doc (`spec.dataVolume.import.pvcName`) and k8s API conventions (lowerCamelCase JSON tags, PascalCase Go fields, no acronyms beyond `PVC`).
-
-```go
-// SeiNodeSpec additions (api/v1alpha1/seinode_types.go)
-
-// DataVolume configures the data PersistentVolumeClaim for this node.
-// When omitted, the controller creates a PVC using the node's mode-default
-// storage class and size (see noderesource.DefaultStorageForMode).
-// +optional
-DataVolume *DataVolumeSpec `json:"dataVolume,omitempty"`
-
-// DataVolumeSpec configures how the data PVC is sourced.
-type DataVolumeSpec struct {
-    // Import references a pre-existing PersistentVolumeClaim in the same
-    // namespace as the SeiNode, instead of creating a new one. The
-    // controller validates the referenced PVC but never mutates it.
-    //
-    // When Import is set, the controller never deletes the referenced PVC
-    // on SeiNode deletion — storage lifecycle is the operator's responsibility.
-    // +optional
-    Import *DataVolumeImport `json:"import,omitempty"`
-}
-
-// DataVolumeImport names a pre-existing PVC to adopt as this node's data volume.
-type DataVolumeImport struct {
-    // PVCName is the name of a PersistentVolumeClaim in the SeiNode's
-    // namespace. The PVC must be Bound, ReadWriteOnce, and sized at or above
-    // the node mode's default storage size. Immutable after creation.
-    //
-    // +kubebuilder:validation:MinLength=1
-    // +kubebuilder:validation:MaxLength=253
-    // +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
-    // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="pvcName is immutable"
-    PVCName string `json:"pvcName"`
-}
-```
-
-### Why these markers
-
-- The 253-character limit is the k8s DNS-1123 subdomain limit that applies to PVC names; the pattern is the stricter DNS-1123 label form (no dots). Together they catch typos at admission time.
-- `XValidation: self == oldSelf` makes the reference immutable. Swapping an imported PVC out from under a running SeiNode has no defined semantics; force delete-and-recreate if the operator wants to re-point.
-- `DataVolume` as pointer keeps `omitempty` clean so existing SeiNodes serialize identically to today.
-- No default `storageClassName`/`size` sub-fields — Shape A is "name a PVC, nothing else" per the direction doc.
-
-### Regenerated artifacts
-
-- `zz_generated.deepcopy.go` — new `DeepCopy`/`DeepCopyInto` methods for `DataVolumeSpec` and `DataVolumeImport`; updated `SeiNodeSpec.DeepCopyInto` to copy the pointer. Produced by `make generate`.
-- `manifests/sei.io_seinodes.yaml` and `config/crd/sei.io_seinodes.yaml` — new `spec.dataVolume.import.pvcName` subtree with validation constraints. Produced by `make manifests`.
-
-### Backward compatibility
-
-Existing SeiNodes serialize `SeiNodeSpec` without the `dataVolume` key. The new CRD schema makes `dataVolume` optional, so old objects remain valid. The controller reads `node.Spec.DataVolume` as nil and takes the create path unchanged.
-
-### Spec-unset vs. spec-empty
-
-Per the direction doc's idiomatic-k8s guidance: `spec.dataVolume == nil`, `spec.dataVolume.import == nil`, and `spec.dataVolume.import.pvcName == ""` all mean "no import." The task's branch check is a single helper:
-
-```go
-func importPVCName(node *seiv1alpha1.SeiNode) string {
-    if node.Spec.DataVolume == nil || node.Spec.DataVolume.Import == nil {
-        return ""
-    }
-    return node.Spec.DataVolume.Import.PVCName
-}
-```
-
-## 2. Task changes to `internal/task/ensure_pvc.go`
-
-The task splits into two internal paths under one public API. The struct, params, and deserializer stay the same; only `Execute()` changes, and `Status()` keeps its current body.
-
-### New structure
-
-The existing `ensureDataPVCExecution` struct adds two ephemeral fields (`lastReason`, `lastMessage`) for condition propagation within one reconcile. The executor re-deserializes the task on every reconcile (executor.go:150), so per-instance state does not survive. No other state is added; the validation path is stateless across reconciles.
-
-### Execute() — branching structure
-
-```go
-func (e *ensureDataPVCExecution) Execute(ctx context.Context) error {
-    node, err := ResourceAs[*seiv1alpha1.SeiNode](e.cfg)
-    if err != nil {
-        return Terminal(err)
-    }
-
-    if name := importPVCName(node); name != "" {
-        return e.executeImport(ctx, node, name)
-    }
-    return e.executeCreate(ctx, node)
-}
-```
-
-### Create path (fixes #104)
-
-```go
-func (e *ensureDataPVCExecution) executeCreate(ctx context.Context, node *seiv1alpha1.SeiNode) error {
-    desired := noderesource.GenerateDataPVC(node, e.cfg.Platform)
-    if err := ctrl.SetControllerReference(node, desired, e.cfg.Scheme); err != nil {
-        return Terminal(fmt.Errorf("setting owner reference: %w", err))
-    }
-
-    existing := &corev1.PersistentVolumeClaim{}
-    key := types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}
-    switch err := e.cfg.KubeClient.Get(ctx, key, existing); {
-    case apierrors.IsNotFound(err):
-        // proceed to Create
-    case err != nil:
-        return fmt.Errorf("checking for existing data PVC: %w", err)
-    default:
-        // PVC exists. Accept if we own it (crash-recovery); else fail.
-        if metav1.IsControlledBy(existing, node) {
-            e.complete()
-            return nil
-        }
-        return Terminal(fmt.Errorf(
-            "data PVC %q already exists and is not owned by SeiNode %q; "+
-                "set spec.dataVolume.import.pvcName to adopt, or delete the PVC",
-            existing.Name, node.Name))
-    }
-
-    if err := e.cfg.KubeClient.Create(ctx, desired); err != nil {
-        if apierrors.IsAlreadyExists(err) {
-            // Lost the race with another actor between Get and Create;
-            // requeue so the next reconcile's Get resolves ownership.
-            return fmt.Errorf("data PVC created concurrently: %w", err)
-        }
-        return fmt.Errorf("creating data PVC: %w", err)
-    }
-    e.complete()
-    return nil
-}
-```
-
-Key changes from today's behavior:
-
-| Before (bug #104) | After |
-|---|---|
-| `Create()` → swallow `AlreadyExists` as success | `Get()` first; `IsNotFound` is the happy prelude to `Create` |
-| Any pre-existing PVC → task Complete | Pre-existing PVC owned by this SeiNode → Complete; otherwise `Terminal` failure |
-| No way to distinguish "our crash-recovery" from "someone else's PVC" | `metav1.IsControlledBy` check separates the two |
-
-### Import path — state machine
-
-`Status()` remains:
-
-```go
-func (e *ensureDataPVCExecution) Status(_ context.Context) ExecutionStatus {
-    return e.DefaultStatus()
-}
-```
-
-`executeImport` leaves `e.status == ExecutionRunning` (the `taskBase` default) on transient validation failures and returns `nil`. The executor sees Running and requeues after `TaskPollInterval` (5 s, executor.go:178); we use this interval as-is. Each reconcile does one Get for the PVC (and, once the PVC is Bound, one for its PV) against the controller-runtime cache, which is essentially free; no custom backoff, no state, no arithmetic.
-
-```go
-func (e *ensureDataPVCExecution) executeImport(ctx context.Context, node *seiv1alpha1.SeiNode, name string) error {
-    reason, msg, state := e.validateImport(ctx, node, name)
-    recordTransient(node, reason, msg) // writes "reason: msg" into PlannedTask.Error
-    switch state {
-    case importValid:
-        e.complete()
-        return nil
-    case importTerminal:
-        return Terminal(fmt.Errorf("%s: %s", reason, msg))
-    default: // transient
-        return nil
-    }
-}
-```
-
-`importValid`, `importTransient`, and `importTerminal` are values of an internal enum; `validateImport` returns exactly one of them per call, for the first requirement that fails (or `importValid` if none does).
-
-### Which reasons are transient vs. terminal
-
-| # | Requirement | Validation failure | State |
-|---|---|---|---|
-| 1 | PVC exists | `IsNotFound` on Get | transient — operator may be about to apply the PVC |
-| 2 | `deletionTimestamp == nil` | PVC being deleted | transient — may resolve if finalizers complete and PVC re-appears under external management (rare, but cheaper to retry than to fail) |
-| 3 | `phase == Bound` | `Pending` | transient — binder may complete |
-| 3 | `phase == Bound` | `Lost` | terminal — no recovery path for Lost |
-| 3 | `phase == Bound` | `Released` | transient — operator may rebind to a new claim |
-| 4 | Contains `ReadWriteOnce` | wrong access mode | terminal — PVC spec is immutable for accessModes |
-| 5 | `status.capacity.storage >= default` | capacity not yet reported | transient (PVC status may not carry a capacity yet; see the sketch in §3) |
-| 5 | `status.capacity.storage >= default` | too small | terminal — a smaller PVC cannot grow to required size without operator action; operator must expand or re-provision |
-| 6 | PV `spec.capacity == PVC status.capacity` | mismatch | terminal — indicates misconfigured static PV |
-| 7 | PV exists and not `Failed` | PV missing | transient — lookup race during bind |
-| 7 | PV exists and not `Failed` | PV `Failed` | terminal — CSI/provisioner declared the volume unusable |
-
-**Rationale for the transient/terminal split:** the direction doc says "retry indefinitely with exponential backoff." A strict reading would make every failure mode transient. The LLD refines: failures that cannot recover without the operator changing the PVC spec (immutable accessModes, a too-small capacity) become terminal and surface as `plan.Phase = Failed` + `SeiNode Phase = Failed` via the existing `FailedPhase` from `buildBasePlan` (planner.go:447). This matches the direction doc's "seid will fail to start… the operator gets a clear signal from the Failed plan" — applied pre-flight where we know the failure is unrecoverable. A reviewer who prefers "always transient" can flip these at the cost of stuck-Initializing on operator typos.
-
-### Reason strings
-
-CamelCase, stable, and part of the public alerting contract (see §4). Exact strings:
-
-```go
-const (
-    ReasonImportValidated          = "PVCValidated"
-    ReasonImportPVCNotFound        = "PVCNotFound"
-    ReasonImportPVCTerminating     = "PVCTerminating"
-    ReasonImportPVCNotBound        = "PVCNotBound"       // Pending/Released
-    ReasonImportPVCLost            = "PVCLost"           // terminal
-    ReasonImportAccessModeInvalid  = "AccessModeInvalid" // terminal
-    ReasonImportCapacityTooSmall   = "CapacityTooSmall"  // terminal
-    ReasonImportPVMissing          = "UnderlyingPVMissing"
-    ReasonImportPVCapacityMismatch = "UnderlyingPVCapacityMismatch" // terminal
-    ReasonImportPVFailed           = "UnderlyingPVFailed"           // terminal
-)
-```
-
-Message format is human-readable and includes the PVC name plus the concrete defect:
-
-```
-PVC "data-archive-0-0" not found in namespace "default"
-PVC "data-archive-0-0" phase is "Pending", waiting for Bound
-PVC "data-archive-0-0" capacity 500Gi is less than required 2Ti
-underlying PV "pv-abc" for PVC "data-archive-0-0" is in phase Failed
-```
-
-## 3. Validation path in detail
-
-`validateImport(ctx, node, name) (reason, message, state)` runs the seven checks in order, returning on the first defect. Sketch:
-
-```go
-// 1. Get PVC in node.Namespace → IsNotFound → (PVCNotFound, transient)
-// 2. deletionTimestamp != nil → (PVCTerminating, transient)
-// 3. switch pvc.Status.Phase {
-//      case Bound: continue
-//      case Lost:  (PVCLost, terminal)
-//      default:    (PVCNotBound, transient)   // Pending, Released, ""
-//    }
-// 4. !containsAccessMode(..., ReadWriteOnce) → (AccessModeInvalid, terminal)
-// 5. required := resource.MustParse(DefaultStorageForMode(mode, platform).size)
-//    actual := pvc.Status.Capacity[ResourceStorage]
-//    !ok       → (CapacityTooSmall, transient)   // capacity not yet reported
-//    actual<req → (CapacityTooSmall, terminal)
-// 6/7. Get PV by pvc.Spec.VolumeName:
-//      IsNotFound          → (UnderlyingPVMissing, transient)
-//      pv.Status.Phase==Failed → (UnderlyingPVFailed, terminal)
-//      pv.Spec.Capacity != actual → (UnderlyingPVCapacityMismatch, terminal)
-// default: (PVCValidated, "", valid)
-```
-
-Helper `containsAccessMode` is a one-line slice scan. Messages are the templates in §4's table.
-
-### Client operations summary
-
-| Call | Verb | Resource | Purpose |
-|---|---|---|---|
-| Get PVC | `get` | `persistentvolumeclaims` | requirements 1–5 |
-| Get PV | `get` | `persistentvolumes` | requirements 6–7 |
-
-Both are reads. The task never mutates the imported PVC or PV (explicit per direction doc §77). RBAC already grants `get` on PVCs (controller.go:55). A new RBAC marker is required on the node controller to read PVs; `list` and `watch` are included because the cached client serves Get from an informer:
-
-```go
-// +kubebuilder:rbac:groups="",resources=persistentvolumes,verbs=get;list;watch
-```
-
-This is the only RBAC addition. `make manifests` regenerates `manifests/role.yaml`.
-
-### Requeue interval
-
-We use the executor's existing `TaskPollInterval` (5 s, `executor.go:178`) unchanged for Running tasks. No custom backoff, no per-task state, no attempt counter. Each reconcile cycle on a still-transient import performs one cache-served Get against the named PVC (and possibly its PV) and returns Running.
-
-**Why this is fine**: controller-runtime's default client serves Get from an informer cache; these are essentially free local-memory reads, not round trips to the API server. At one Get per 5 s per stuck import, even a fleet of stuck imports produces a trivial load footprint. There is no cost saving to be had from stretching the interval, so we don't.
-
-**Stuck-import visibility**: the plan task's `error` field records the latest failure reason+message (§6), and the `ImportPVCReady` Condition (§4) carries the current state on the SeiNode. A human reading `kubectl describe seinode` sees the current reason directly; there is no hidden schedule state to reason about.
-
-**Transient vs. terminal**: unrecoverable failures (wrong access modes, too-small capacity — see the table in §2) bypass this entirely and mark the plan Failed, giving the operator a clear signal. The 5 s-polling path is only for genuinely transient conditions (PVC still binding, PV provisioning race, operator about to apply the PVC).
-
-## 4. Condition schema on SeiNode status
-
-### Type constant
-
-```go
-// api/v1alpha1/seinode_types.go
-const (
-    ConditionNodeUpdateInProgress = "NodeUpdateInProgress"
-    ConditionImportPVCReady       = "ImportPVCReady" // new
-)
-```
-
-**Naming decision:** `ImportPVCReady` follows the k8s convention `<Subject>Ready` (cf. `PodReady`, `ContainersReady`, `NodeReady`). Alternatives considered and rejected:
-
-| Name | Rejection reason |
-|---|---|
-| `DataVolumeImported` | Past tense implies one-shot event; condition should be a gate ("is the imported PVC ready right now?"). |
-| `ImportedPVCBound` | Too narrow — Bound is only one of seven requirements. |
-| `ImportPVCValid` | "Valid" is about spec correctness; "Ready" encompasses both validity and runtime state (Bound, PV present). Matches upstream usage. |
-
-### Status / Reason / Message matrix
-
-| Status | Reason | Message template | When set |
-|---|---|---|---|
-| `True` | `PVCValidated` | `PVC "<name>" passes all import requirements` | task transitions to Complete |
-| `False` | `PVCNotFound` | `PVC "<name>" not found in namespace "<ns>"` | Get returns NotFound |
-| `False` | `PVCTerminating` | `PVC "<name>" is being deleted (deletionTimestamp=<ts>)` | deletionTimestamp set |
-| `False` | `PVCNotBound` | `PVC "<name>" phase is "<phase>", waiting for Bound` | phase Pending/Released/"" |
-| `False` | `PVCLost` | `PVC "<name>" phase is Lost; underlying PV is gone` | phase Lost |
-| `False` | `AccessModeInvalid` | `PVC "<name>" accessModes <modes> does not include ReadWriteOnce` | accessModes wrong |
-| `False` | `CapacityTooSmall` | `PVC "<name>" capacity <actual> is less than required <required>` | status.capacity too small |
-| `False` | `UnderlyingPVMissing` | `underlying PV "<pv>" for PVC "<name>" not found` | PV Get NotFound |
-| `False` | `UnderlyingPVCapacityMismatch` | `underlying PV "<pv>" capacity <pv-cap> does not match PVC "<name>" capacity <pvc-cap>` | PV capacity mismatch |
-| `False` | `UnderlyingPVFailed` | `underlying PV "<pv>" for PVC "<name>" is in phase Failed` | PV phase Failed |
-| `Unknown` | — | — | never set; we always resolve to True/False after a reconcile |
-
-The condition is **not** set on SeiNodes without `spec.dataVolume.import.pvcName`. Its presence in `.status.conditions` is itself a signal that the node is an import node.
-
-### Public contract for alerting
-
-The `Reason` strings are a **public contract**. Any Prometheus alert, audit tool, or operator script that keys on these strings (e.g., `kube_seinode_status_condition{type="ImportPVCReady",reason="PVCNotFound"}`) must keep working across controller versions. Adding a new Reason is a minor-version addition; renaming or removing an existing Reason is a breaking change and requires a deprecation window. This is called out in a Go doc comment on the `ReasonImport*` constant block.
-
-### ObservedGeneration
-
-Every `meta.SetStatusCondition` call passes `ObservedGeneration: node.Generation`, matching the existing `NodeUpdateInProgress` pattern at planner.go:195.
-
-## 5. Finalizer interaction
-
-The only change is an early-return guard at the top of `deleteNodeDataPVC` (`internal/controller/node/controller.go:218`):
-
-```go
-func (r *SeiNodeReconciler) deleteNodeDataPVC(ctx context.Context, node *seiv1alpha1.SeiNode) error {
-    // Imported PVCs are managed externally — never delete them.
-    if node.Spec.DataVolume != nil && node.Spec.DataVolume.Import != nil &&
-        node.Spec.DataVolume.Import.PVCName != "" {
-        log.FromContext(ctx).Info("skipping data PVC delete for imported volume",
-            "pvc", node.Spec.DataVolume.Import.PVCName)
-        return nil
-    }
-
-    pvc := &corev1.PersistentVolumeClaim{}
-    err := r.Get(ctx, types.NamespacedName{Name: noderesource.DataPVCName(node), Namespace: node.Namespace}, pvc)
-    if apierrors.IsNotFound(err) {
-        return nil
-    }
-    if err != nil {
-        return err
-    }
-    return r.Delete(ctx, pvc)
-}
-```
-
-Notes:
-
-- The check uses the full nested-nil guard rather than the `importPVCName` helper because the finalizer path is in a different package (`internal/controller/node`) and the helper lives in `internal/task`. Duplicating the short guard is cheaper than introducing a cross-package helper for a single caller.
-- `cleanupNodeMetrics` (controller.go:212) still runs — metrics cleanup is orthogonal to data lifecycle.
-
-### PV reclaim policy interaction
-
-Since the import path never touches the PVC on deletion, it never touches the PV either. The PV's `persistentVolumeReclaimPolicy` — whatever the operator configured — is unaffected. If the operator set `Retain`, the PV persists when the PVC is eventually deleted out-of-band (their responsibility). If they set `Delete`, the PV is cleaned up when they delete the PVC. The controller has no opinion. This is the symmetric behavior the direction doc calls for: "touch nothing the controller didn't create."
-
-The separate orphan-PV + finalizer bug (direction doc §140) is unrelated to the import path and remains out of scope here.
-
-## 6. Condition management
-
-Per the controller's established pattern (planner.go:165-179 for `NodeUpdateInProgress`), **the planner owns all condition mutations**. The executor does not touch `SeiNode.Status.Conditions`.
-
-### Where the condition is set
-
-`planner.ReconcileImportPVCCondition(node)` is called from the reconciler *after* plan execution and *before* the status flush (a single new line in `Reconcile`, near controller.go:123). It is idempotent and reads the plan's `ensure-data-pvc` task:
-
-- No import configured → `meta.RemoveStatusCondition(..., ConditionImportPVCReady)`.
-- Task `Complete` → `True` / `PVCValidated` / success message.
-- Task `Pending`, `Running`, or `Failed` → `False` / Reason and message both parsed from `PlannedTask.Error`.
-- No task in plan (plan already terminal) → leave existing condition as-is (once True on successful init, it stays True).
-
-### Propagating task state to the condition
-
-The task's `lastReason`/`lastMessage` fields are ephemeral (re-deserialized per reconcile). We persist them by **reusing `PlannedTask.Error`**: transient validations write `t.Error = reason + ": " + message` directly on the in-memory plan, and the planner parses on the prefix. Terminal failures already land in `plan.FailedTaskDetail.Error` via `failTask` (executor.go:209) with the same shape.
-
-The task mutates `plan.Tasks[i].Error` from inside `Execute` — a deliberate narrow mutation, consistent with the single-patch model (the reconciler flushes the plan + conditions together) and with `Error` being a task-owned field.
-
-### Clearing
-
-- `ReasonImportValidated` is set on Complete (see the matrix in §4).
-- The condition is removed on: deletion (handled by finalizer path and is moot because the object is going away), or `spec.dataVolume.import` being unset (removed via `meta.RemoveStatusCondition`).
-
-The condition is **never cleared** while the task is Running and the PVC still fails validation; it just gets its Reason/Message updated. This matches the direction doc's "persistent Condition that is visible in `kubectl describe seinode`."
-
-## 7. Tests
-
-### Unit tests (`internal/task/ensure_pvc_test.go` — new file)
-
-Table-driven, using the same fake-client pattern as `observe_image_test.go` (reference lines 33-52 for the scheme+client boilerplate).
-
-**Create-path cases** (all with `spec.dataVolume.import` unset):
-
-- `PVCMissing_CreatesAndCompletes` — no pre-existing PVC → Execute succeeds, Complete, PVC exists after.
-- `PVCExistsOwnedByUs_Completes` — pre-existing PVC with controller ref to this SeiNode → Complete (crash-recovery).
-- `PVCExistsNotOwned_TerminalError` — pre-existing PVC without owner ref (the #104 bug) → `*TerminalError`.
-- `CreateRaceAlreadyExists_Requeues` — Get NotFound but Create returns AlreadyExists → non-terminal error, Running.
-
-**Import-path happy path:** `AllRequirementsMet_Completes` — PVC Bound/RWO/sized, PV present and matching.
-
-**Import-path per-requirement failures** (one case per row of §2's transient/terminal table), asserting task `Running`/`Failed` + `t.Error` prefix matching the expected Reason:
-
-- req 1: `PVCNotFound_Transient`; `PVCNotFound_ThenAppears_Completes` (apply PVC between reconciles).
-- req 2: `PVCTerminating_Transient`.
-- req 3: `PVCPending_Transient`, `PVCReleased_Transient`, `PVCLost_Terminal`.
-- req 4: `WrongAccessMode_Terminal`.
-- req 5: `CapacityTooSmall_Terminal`, `CapacityUnset_Transient`.
-- req 6: `PVCapacityMismatch_Terminal`.
-- req 7: `PVMissing_Transient`, `PVFailed_Terminal`.
-
-**Poll cadence** (regression guard): `TransientValidationRepeats`: a stuck-in-transient state produces one Get per reconcile, matching the executor's `TaskPollInterval`. Asserted via a counting client decorator over N reconciles.
-
-### Integration / e2e-style (extends existing harnesses in `internal/controller/node/`)
-
-- `Controller_Import_Happy_ReachesRunning` — import.pvcName references a Bound PVC → Phase=Running, `ImportPVCReady=True/PVCValidated`.
-- `Controller_Import_LatePVC_Converges` — SeiNode applied first (no PVC) → Initializing+`PVCNotFound`. Apply PVC → Running+`True`. Validates the ordering-race promise in the direction doc.
-- `Controller_Import_TerminalFailure_MarksFailed` — wrong accessMode → Plan Failed, SeiNode Phase=Failed, `False/AccessModeInvalid`.
-- `Controller_Import_Deletion_PreservesPVC` — delete SeiNode with import → finalizer runs, PVC still exists.
-- `Controller_NoImport_Deletion_DeletesPVC` — regression guard for existing behavior.
-
-## 8. Observability
-
-### Metrics
-
-Add to `internal/controller/observability/`:
-
-- `importPVCValidationTotal` (counter, attrs: controller, namespace, result∈{valid,transient,terminal}) — emitted every `validateImport` call.
-- `importPVCTerminalTotal` (counter, attrs: controller, namespace, reason) — emitted on Terminal, so `CapacityTooSmall` vs. `UnderlyingPVFailed` alert independently.
-- `importPVCTimeToValid` (histogram, attrs: controller, namespace) — observed at Complete from `SubmittedAt` → now. SLO: time-from-apply-to-imported-PVC-validated.
-
-The existing `planDuration` histogram (planner.go:214) continues to record end-to-end init duration — no change needed beyond `classifyPlan` which already recognizes `ensure-data-pvc` as the "init" plan marker.
-
-### Events
-
-Emitted from the reconciler after status flush (tasks have no Recorder access via `ExecutionConfig`). `PVCImportValidated` (Normal) on transition to True; `PVCImportFailed` (Warning) when the False-condition Reason changes. De-duplication: compare against the previous condition Reason captured from the `DeepCopy` taken for the patch base (controller.go:91).
-
-### Status updates
-
-The latest validation error appears both on the `ImportPVCReady` condition (stable alerting contract) and on the `ensure-data-pvc` task's `.error` field (operator-debugging view). Both carry identical content; the task's field is `"<Reason>: <Message>"`.
-
-## 9. Migration / rollout plan
-
-**No migration.** Existing SeiNodes have `spec.dataVolume == nil`; the controller reads it as nil and takes the create path unchanged. The #104 create-path fix changes behavior in one edge case: an existing SeiNode whose data PVC was accidentally created out-of-band will now fail the plan on next reconcile with `data PVC "foo" already exists and is not owned by SeiNode "foo"`. This is intentional — the operator must either set `import.pvcName` to adopt it, or delete the conflicting PVC. Document this in the PR description and in the changelog.
-
-**Controller upgrade is idempotent.** Any running SeiNode whose plan has already completed `ensure-data-pvc` is unaffected. A SeiNode whose plan is still Active and currently on `ensure-data-pvc` will re-enter the new task code on the next reconcile; if it's a plain create path and the PVC exists and is owned, the new path sees `IsControlledBy` and marks Complete.
-
-**Operator rollout:**
-
-```bash
-make manifests generate
-make test
-make lint
-make docker-build IMG=<registry>/sei-k8s-controller:<sha>
-make docker-push IMG=<registry>/sei-k8s-controller:<sha>
-# update Helm/flux values; controller restart is disruption-free
-```
-
-No CRD data migration is required since the new fields are additive and optional. Applying the new CRD on a cluster with old SeiNode objects is safe.
-
-## 10. What this LLD does NOT cover
-
-- **Shape D** (`importFromPV`) — deferred per direction doc.
-- **Shape E** (VolumeSnapshot dataSource) — deferred per direction doc.
-- **External AMI/snapshot → PV+PVC pipeline** — operator/platform concern.
-- **Orphan-PV + finalizer interaction bug** — separate issue; import path does not touch PVs, so it is not made worse by this change.
-- **Expanding an imported PVC after import** — not addressed; if capacity changes, operator edits the PVC directly. Controller does not reconcile size.
-- **Re-pointing `import.pvcName` after creation** — explicitly blocked by the immutability marker (§1).
-
-## Related work
-
-- [#104](https://github.com/sei-protocol/sei-k8s-controller/issues/104) — the `AlreadyExists` swallow; fixed in §2's create path.
-- [#105](https://github.com/sei-protocol/sei-k8s-controller/issues/105) — the import feature tracked here.
-- `internal/task/ensure_pvc.go` — modified per §2.
-- `internal/noderesource/noderesource.go:176` — unchanged.
-- `api/v1alpha1/seinode_types.go` — new fields per §1.
-- `internal/controller/node/controller.go:218` — finalizer skip per §5.
-- `internal/planner/bootstrap.go` — unchanged.
-- `internal/planner/executor.go:150-200` — retry model reused without modification.
-- `internal/planner/planner.go:165-197` — condition pattern mirrored for `ImportPVCReady`.
diff --git a/docs/design-seinode-import-volume.md b/docs/design-seinode-import-volume.md
deleted file mode 100644
index 88eb0d70..00000000
--- a/docs/design-seinode-import-volume.md
+++ /dev/null
@@ -1,144 +0,0 @@
-# Design: SeiNode — Import Existing Storage
-
-**Status:** Draft / RFC
-**Date:** 2026-04-20
-**Tracks:** [#105](https://github.com/sei-protocol/sei-k8s-controller/issues/105)
-**Related:** [#104](https://github.com/sei-protocol/sei-k8s-controller/issues/104)
-
-## Problem
-
-Today `internal/task/ensure_pvc.go` always calls `Create()` for a SeiNode's data PVC. There is no first-class way to express "use this pre-populated volume." The controller *accidentally* tolerates pre-existing PVCs — `Create()` returns `AlreadyExists`, the task marks Complete, and reconciliation proceeds. This is the bug in #104.
-
-Three use cases want this behavior to be a real feature, not an accident:
-
-1. **AMI/snapshot-based bootstrap.** Weekly AMIs of running pacific-1 archives exist; new k8s archives should bootstrap from the latest snapshot (catching up hours of blocks) instead of resync-from-genesis (months).
-2. **One-off migrations.** Moving an existing archive into k8s without re-syncing — currently done by staging a PVC with the expected name and relying on the accidental path.
-3. **Orphan adoption.** With `deletionPolicy: Retain`, the SeiNode finalizer's `deleteNodeDataPVC` leaves orphaned PVs behind. A re-created SeiNode should be able to explicitly adopt an orphan.
-
-The feature and the bug share a refactor: splitting `ensure-data-pvc` into a create-path and an import-path forces us to write the validation logic that #104's `AlreadyExists` swallow is currently hiding.
-
-## Shapes evaluated
-
-| Axis | **A** — import PVC by name | **B** — import EBS volume ID | **C** — import EBS snapshot ID |
-|---|---|---|---|
-| User ergonomics | Two-step (pre-create PV+PVC, then reference). Familiar to k8s operators. | One-step from user POV. Controller owns PV. | One-step; covers weekly-AMI use case end-to-end. |
-| Controller complexity | Near-zero — conditional skip of `ensure-data-pvc`, plus a validating read. | Moderate — controller synthesizes a static PV with `csi.volumeHandle`, sets `claimRef`. No cloud SDK. | High — AWS SDK, IRSA creds, region plumbing, async `CreateVolume` polling, idempotency, AZ selection, backoff/retry. |
-| Cloud-provider awareness | None | Thin — names a volume handle but never calls AWS | Crosses into cloud-provider controller territory |
-| Failure modes | PVC missing / wrong size / wrong AZ / already bound elsewhere — all observable via k8s API | Volume in wrong AZ, attached elsewhere, wrong fsType — observable when CSI fails to attach | All of B, plus: snapshot doesn't exist, quota exceeded, AWS throttling, half-created volumes on controller crash, orphan-volume billing leaks |
-| Safety | Strong — operator explicitly staged everything | Medium — controller creates the PV but the volume pre-exists and is user-supplied | Weak by default — controller creates real cloud resources that can leak money or conflict with out-of-band automation |
-| Interaction with #104 | Forces the fix: if `import` is set, we validate bind state and spec match. If unset, the create path must assert `NotFound`. | Requires #104 fix plus PV-state validation | Same as B plus new failure surface during volume creation |
-| Interaction with Retain-orphan | Solves adoption cleanly — reference the orphaned PVC, or recreate a PVC bound to the orphaned PV | Helps only if orphan retains the EBS volume (it does) | Irrelevant — snapshot path is orthogonal |
-
-### Additional options considered
-
-- **D** — `importFromPV` (reference a PV by name). Operator pre-creates the PV, controller creates the PVC with `volumeName:` pointing at it. Covers orphan adoption with the controller owning the PVC object. Evaluated but **not adopted** in this pass — the orphan-adoption use case is already reachable via Shape A (operator creates a PVC with `volumeName:` pointing at the orphan PV, then references that PVC from the SeiNode). Adding D is a 2nd CRD field and a 2nd task path for a use case that Shape A already covers with one extra operator step. Revisit if orphan adoption becomes frequent enough that the extra step is operational friction.
-- **E** — `VolumeSnapshot` CR as the snapshot source. If a team wants snapshot-restore, they materialize a `VolumeSnapshot` (or `VolumeSnapshotContent` pre-provisioned from an EBS snap ID), then reference it via the standard k8s `dataSource`. The external-snapshotter + CSI driver do the heavy lifting. **No AWS SDK in our controller.**
-- **F** — Bootstrap Job pattern (separate `SeiNodeBootstrap` CR). Over-engineered for the current use cases; adds a resource type to reason about.
-
-E is the k8s-native equivalent of C. Same user outcome, zero cloud SDK in our controller. The operator or an external platform pipeline produces the `VolumeSnapshotContent`; we consume it.
-
-## Decision
-
-**Adopt Shape A. Keep E on the roadmap. Do not build C. Defer D.**
-
-### Rationale
-
-1. **A covers all three stated use cases today.** The manual migration was A minus the validation. Orphan adoption is A (operator creates a PVC bound to the orphaned PV, then references it). AMI bootstrap can be staged into A by a human or an external tool that produces the PV+PVC — later upgraded to E when snapshot-restore is worth the investment.
-
-2. **A is the smallest defensible fix to #104.** The bug exists because we conflate "PVC present" with "PVC correct." Splitting the task into a create-path and an import-path forces the validation we're missing: the import path must assert the PVC is present, bound, and matches node requirements. Same validation catches the orphan-adoption race where a Retain PV is still `Released`.
-
-3. **A avoids the cloud-provider precedent.** Once the controller has AWS creds for `CreateVolume`, gravity pulls toward more cloud calls inside the controller. The team is explicitly shallow on k8s; the operator already runs alongside tooling (Karpenter, EBS CSI driver, Terraform/CDK) better-placed for cloud-resource lifecycle.
-
-4. **One shape, one field, one task path.** Keeping the surface area minimal means less to document, test, and maintain. D can be added later without invalidating A.
-
-5. **E becomes the snapshot story when it's needed.** Reuses the existing k8s snapshot ecosystem rather than building a parallel one. Until a second team asks for snapshot-restore, A + external tooling is fine.
-
-## Spec sketch
-
-```yaml
-spec:
-  dataVolume:
-    import:
-      pvcName: data-archive-0-0      # name of a pre-existing PVC in the SeiNode's namespace
-```
-
-Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today.
-
-This is a deliberate "no extra fluff" choice: import is a PVC-source substitution, not a bootstrap off-ramp. The operator is trusted to provide a PVC whose contents are compatible with the rest of the init progression. If the imported data is from an incompatible seid version, the wrong chain, or in an unexpected on-disk format, seid will fail to start on the pod and the operator gets a clear signal from the Failed plan — same failure channel as any other init problem.
-
-The controller does **not**:
-- Detect that the PVC is pre-populated and skip any steps
-- Refuse to run `configure-genesis` or `config-apply` against imported data
-- Add an `import` state to the init plan
-- Take any responsibility for the data's contents
-
-This keeps the controller small and the contract with the operator honest: "give me a PVC, I'll use it; making it the right PVC is your job."
-
-## Requirements for an imported PVC
-
-For the import branch of `ensure-data-pvc` to succeed, the PVC must satisfy **all** of the following. The controller never mutates the PVC — it reads and either accepts or fails the task.
-
-| # | Requirement | Rationale |
-|---|---|---|
-| 1 | PVC exists in the same namespace as the SeiNode | Namespace-scoped resource lookup |
-| 2 | `metadata.deletionTimestamp` is nil | A PVC being deleted cannot be relied on to persist |
-| 3 | `status.phase == Bound` | Pending/Lost PVCs have no data we can read; Released PVCs aren't attachable |
-| 4 | `spec.accessModes` contains `ReadWriteOnce` | Matches what pods attached to a SeiNode require |
-| 5 | `status.capacity.storage >=` whatever default the node's role demands (see `noderesource.DefaultStorageForMode`) | Prevent attaching too-small volumes |
-| 6 | The underlying PV's `spec.capacity.storage` matches `status.capacity.storage` | Sanity — catch misconfigured PVs |
-| 7 | The underlying PV exists and is not in a terminal state (`Failed`) | Same |
-
-The controller does **not** check:
-- The filesystem type (CSI mount will fail loudly if wrong)
-- The data contents (seid startup / sidecar `configure-genesis` catches data-content problems)
-- Labels on the PVC (those are operator concerns; we shouldn't gate on them)
-- The `storageClassName` (operators may intentionally use empty SC for static PVs; we trust them)
-
-If any check fails, the task enters a retry loop with the current error in the task's `error` field. A PVC that becomes valid later (e.g., finishes binding) resolves the task on the next reconcile.
-
-## Deletion semantics (decided)
-
-**When a SeiNode with `import.pvcName` is deleted, the controller does NOT delete the imported PVC.** The `deleteNodeDataPVC` finalizer is a no-op for imported volumes.
-
-Rationale: the operator opted into `import.pvcName` precisely because they are managing the storage lifecycle externally (via Terraform, by hand, via another operator). Having the SeiNode controller silently delete the PVC on `kubectl delete seinode` would surprise that external lifecycle and risk data loss. The safer default for an import feature is "touch nothing the controller didn't create."
-
-If the operator wants the PVC gone, they delete it explicitly after the SeiNode is gone. This is symmetric with `kubectl delete deployment`, which does not delete the PVCs its pods were using.
-
-## Validation retry semantics (decided)
-
-When the import branch of `ensure-data-pvc` finds the PVC in a transient-bad state (missing, `Pending` during initial bind, `Released` mid-adoption, etc.), **the task retries indefinitely with exponential backoff and surfaces the latest validation error via a `Condition` on the SeiNode status.**
-
-Rationale:
-- Matches the existing pattern for other reconcilable-but-might-take-a-while tasks in the controller — tasks return `Running` until the external state is acceptable, and only mark `Failed` on truly terminal errors.
-- Resilient to ordering races: an operator who applies the SeiNode and the PVC in close succession (or in either order) sees the controller converge once both resources exist.
-- Typo detection remains feasible: a misspelled PVC name produces a persistent `Condition: ImportPVCReady=False, Reason: PVCNotFound` that is visible in `kubectl describe seinode`, `kubectl get seinode -o wide`, and any Prometheus alert built on `kube_seinode_status_condition`. Operator fixes the spec; reconciliation converges.
-- A pod stuck in `Pending` because its PVC isn't bound will already fire the cluster's existing `KubePodNotReady` alerts, so we don't lose failure signal by retrying rather than giving up.
-
-The LLD pins down the specifics: exact backoff curve, exact Condition name and Reason strings for each failure mode (`PVCNotFound`, `PVCTerminating`, `PVCNotBound`, `CapacityTooSmall`, `UnderlyingPVMissing`, etc.), and whether those Reasons become part of the public contract for alerting.
-
-## Open questions (for the LLD)
-
-None remaining at the direction level. All prior open questions are now decided:
-- **Scope**: Shape A only (PVC by name)
-- **Deletion semantics**: controller never deletes imported PVCs
-- **Validation requirements**: seven-item table above
-- **Init plan interaction**: unchanged — import is a PVC-source substitution inside `ensure-data-pvc`, nothing else
-- **Retry semantics**: retry indefinitely with exponential backoff, surface via Condition
-
-The LLD is now about implementation specifics (exact backoff curve, Condition schema, unit-test matrix, CRD schema details) rather than direction.
-
-## What this design does NOT cover
-
-- **Shape E / VolumeSnapshot integration.** Deferred; revisit when a team asks for snapshot-restore directly through the controller.
-- **Automated AMI → PV+PVC pipeline.** External tooling concern (platform repo or a separate operator). This design just accepts a PVC/PV someone else produced.
-- **Cross-cluster imports.** Out of scope.
-- **Migration path for existing SeiNodes.** Import is opt-in via `spec.dataVolume.import`; SeiNodes created without it go through the normal `ensure-data-pvc` path unchanged.
-
-## Related work
-
-- [#104](https://github.com/sei-protocol/sei-k8s-controller/issues/104) — `ensure-data-pvc` marks Complete on `AlreadyExists`. Import-path design forces the underlying fix.
-- Retain-reclaim-policy + finalizer orphan behavior (separate issue to file; the import feature is safe to ship before it lands, but together they close the loop on SeiNode deletion + re-creation flows).
-- `internal/task/ensure_pvc.go` — site of the create-path/import-path split
-- `internal/noderesource/noderesource.go:176` (`GenerateDataPVC`) — unchanged; still generates the create-path PVC
-- `internal/planner/bootstrap.go` — unchanged (init plan is unchanged); `ensure-data-pvc` internally branches on `spec.dataVolume.import.pvcName`
-- `api/v1alpha1/seinode_types.go` — where `spec.dataVolume.import` gets added
diff --git a/docs/design-seinode-validator-signing-key-lld.md b/docs/design-seinode-validator-signing-key-lld.md
deleted file mode 100644
index 10acfe79..00000000
--- a/docs/design-seinode-validator-signing-key-lld.md
+++ /dev/null
@@ -1,427 +0,0 @@
-# Design: SeiNode — Validator Signing Key (LLD)
-
-**Status:** Draft / LLD
-**Date:** 2026-04-27
-**Tracks:** validator migration from sei-infra (EC2) → sei-k8s-controller
-**Related:** [`docs/design-seinode-import-volume-lld.md`](design-seinode-import-volume-lld.md), [`docs/design/composable-genesis.md`](design/composable-genesis.md)
-
-This LLD specifies how an existing external validator identity is mounted onto a `SeiNode` so the node, once synced to the chain tip, takes over signing from a previously-running validator instance with zero risk of double-signing.
-
-This LLD fixes the API shape, mount mechanics, validation lifecycle, and v1 scope. No new keystore variants, no remote-signer integration, no automated cutover orchestration — those are explicitly deferred (§11).
-
-## 0. Background: what is already in tree
-
-The bootstrap-Job + StatefulSet pattern that powers this design is already implemented:
-
-- `ValidatorSpec.Snapshot *SnapshotSource` carries `BootstrapImage` + `S3.TargetHeight` (`api/v1alpha1/validator_types.go:9`, `api/v1alpha1/common_types.go:85-101`).
-- The validator planner dispatches to `buildBootstrapPlan` when these are set (`internal/planner/validator.go:50-51`), validating the combo upfront (`internal/planner/validator.go:30-34`).
-- `buildBootstrapPlan` runs a one-shot Job with `BootstrapImage`, halts at `TargetHeight`, tears down, then creates the production StatefulSet against the **same** `data-<name>` PVC (`internal/planner/bootstrap.go:47-95`).
-
-So "Phase 1: pre-sync without keys" from the direction doc is already operational. This LLD adds the missing piece: how to land validator signing keys on the production StatefulSet pod.
-
-## 1. CRD schema changes
-
-A single new optional sub-struct is added to `ValidatorSpec` in `api/v1alpha1/validator_types.go`. The shape is a discriminated union with one v1 variant; future variants (TMKMS, remote signer, Vault) slot in as sibling pointer fields.
-
-```go
-// api/v1alpha1/validator_types.go
-
-type ValidatorSpec struct {
-    // Snapshot, GenesisCeremony — existing fields, both optional.
-    // +optional
-    Snapshot        *SnapshotSource             `json:"snapshot,omitempty"`
-    // +optional
-    GenesisCeremony *GenesisCeremonyNodeConfig  `json:"genesisCeremony,omitempty"`
-
-    // SigningKey declares the source of this validator's consensus signing
-    // key (priv_validator_key.json). When omitted, the node runs as a
-    // non-signing observer — suitable for pre-sync (Phase 1 of the
-    // validator-migration runbook) or for genesis-ceremony bootstraps that
-    // produce keys on-cluster. Mutually exclusive with GenesisCeremony
-    // (cross-field check in the validator planner; see §6).
-    // +optional
-    SigningKey *SigningKeySource `json:"signingKey,omitempty"`
-}
-
-// SigningKeySource declares where a validator's consensus signing key
-// material comes from. Exactly one variant must be set. Variants are
-// mutually exclusive — a validator has one consensus identity.
-// +kubebuilder:validation:XValidation:rule="(has(self.secret) ? 1 : 0) == 1",message="exactly one signing key source must be set"
-type SigningKeySource struct {
-    // Secret loads signing material from a Kubernetes Secret in the
-    // SeiNode's namespace.
-    // +optional
-    Secret *SecretSigningKeySource `json:"secret,omitempty"`
-
-    // Future siblings: TMKMS *TMKMSSigningKeySource,
-    // Remote *RemoteSigningKeySource, Vault *VaultSigningKeySource.
-    // When added, update the XValidation exactly-one rule.
-}
-
-// SecretSigningKeySource references a Kubernetes Secret containing the
-// validator's consensus signing key. The Secret must contain a data key
-// `priv_validator_key.json` holding the Tendermint validator key (consensus
-// identity), mounted read-only at $SEI_HOME/config/priv_validator_key.json.
-// Rotating this value without a paired on-chain MsgEditValidator will cause
-// the validator to miss blocks.
-//
-// priv_validator_state.json (CometBFT's slashing-protection ledger) is owned
-// by seid on the data PVC and is created automatically on first start. The
-// controller does not inject this file — see LLD §11.
-type SecretSigningKeySource struct {
-    // SecretName is the name of a Secret in the SeiNode's namespace.
-    // The controller never creates, mutates, or deletes this Secret —
-    // its lifecycle is fully external (kubectl, ESO, CSI Secrets Store, etc.).
-    //
-    // +kubebuilder:validation:MinLength=1
-    // +kubebuilder:validation:MaxLength=253
-    // +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
-    // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="secretName is immutable"
-    SecretName string `json:"secretName"`
-}
-```
-
-### Why these markers
-
-- **`XValidation: exactly-one` on `SigningKeySource`** ships in v1 with one variant. The cost is one CEL expression that, with a single variant, reduces to `has(self.secret)`; the benefit is that the rule body is reviewed once and v2 is a one-line edit (`+ (has(self.tmkms) ? 1 : 0)`). Skipping it now means a v2 PR adds the rule for the first time and a reviewer has to re-derive the variant invariant from scratch.
-- **`secretName` immutable.** Re-pointing a running validator's consensus key is a slashing risk — force delete-and-recreate. Matches the immutability discipline applied to `import.pvcName` in the import-volume LLD §1.
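-- **Pattern + length** catch typos at admission time: the pattern is the DNS-1123 label character set, and the 253-character cap is the DNS-1123 subdomain limit Kubernetes applies to Secret names. Secret names containing dots are valid in Kubernetes but are rejected by this pattern.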
-- **`SigningKey` itself is not made immutable.** The cutover use case explicitly requires *adding* `SigningKey` to a pre-sync SeiNode that booted without it. The field can be added; once added, the variant choice and `secretName` are pinned.
-- **No `Key` field for the data-key name.** The well-known name `priv_validator_key.json` matches the seid filename; an override field is YAGNI in v1.
-
-### Backward compatibility
-
-Existing SeiNodes serialize without `signingKey`. The new field is optional; old objects remain valid. The validator planner's existing genesis-ceremony and snapshot-bootstrap paths continue to behave identically when `SigningKey` is unset.
-
-### Regenerated artifacts
-
-- `zz_generated.deepcopy.go` — new `DeepCopy`/`DeepCopyInto` for `SigningKeySource` and `SecretSigningKeySource`; updated `ValidatorSpec.DeepCopyInto`. Produced by `make generate`.
-- `manifests/sei.io_seinodes.yaml` and `config/crd/sei.io_seinodes.yaml` — new `spec.validator.signingKey` subtree with validation. Produced by `make manifests`.
-
-## 2. Pod spec changes — production StatefulSet only
-
-All changes land in `buildNodePodSpec` (`internal/noderesource/noderesource.go:229-270`), which generates the StatefulSet pod template. The bootstrap Job pod-spec generator (`task.GenerateBootstrapJob`) is **deliberately unchanged** — see §3.
-
-When `node.Spec.Validator != nil && node.Spec.Validator.SigningKey != nil && node.Spec.Validator.SigningKey.Secret != nil`:
-
-### 2.1 Add a Secret volume
-
-```yaml
-volumes:
-- name: data
-  persistentVolumeClaim: { claimName: data-<name> }   # existing
-- name: signing-key                                    # NEW
-  secret:
-    secretName: <signingKey.secret.secretName>
-    defaultMode: 0400
-    items:
-    - key: priv_validator_key.json
-      path: priv_validator_key.json
-```
-
-One Secret-backed volume, scoped to a single data key. Mounted into the seid container via `subPath` — see §2.2.
-
-### 2.2 Mount the key into the seid container
-
-In `buildNodeMainContainer` (`internal/noderesource/noderesource.go:397`):
-
-```yaml
-volumeMounts:
-- { name: data, mountPath: /sei }                                       # existing
-- name: signing-key                                                     # NEW
-  mountPath: /sei/config/priv_validator_key.json
-  subPath: priv_validator_key.json
-  readOnly: true
-```
-
-`subPath` is deliberate. Normally, the kubelet refreshes a Secret-volume mount after the Secret changes, with a delay that depends on the kubelet sync period and its Secret cache. A `subPath` mount is **not** refreshed: the file is pinned at pod start until the next pod restart. For most workloads this is a footgun; for a consensus key, it is the safety property. A `kubectl edit secret` cannot swap the consensus key under a running seid (which would risk signing two different blocks at the same height under two different keys). Rotating the consensus key requires a deliberate pod restart paired with an on-chain `MsgEditValidator`, which is out of scope here.
-
-The sidecar container does **not** mount `signing-key`. It has no business reading consensus material.
-
-### 2.3 Pod security context
-
-The seid container runs as a non-root user. The Secret volume's `defaultMode: 0400` plus the StatefulSet's existing `fsGroup` make the mounted file readable by the seid process. Test on the actual EKS CSI driver before staging — some CSI implementations interact poorly with `defaultMode` + `fsGroup`, leaving files as `root:root 0400` and locking out the non-root container.
-
-### 2.4 The underlying-PVC stale-key concern
-
-If a SeiNode previously bootstrapped without `SigningKey` (Phase 1) and later gets `SigningKey` added (Phase 2 cutover), the `seid-init` container during Phase 1 may have created a generated `priv_validator_key.json` on the PVC. The `subPath` mount overlays this file at runtime, so seid reads the Secret-mounted key. But the stale generated file remains on the PVC underneath the mount.
-
-For v1, this is acceptable: `secretName` is immutable once set (per §1) and unsetting `SigningKey` is not a supported workflow, so the overlay is effectively permanent for the SeiNode's lifetime. The stale file is invisible to seid and never used. If a future operator does unset `SigningKey` (no field-level immutability rule currently prevents it), the stale key would resurface; this is flagged in §11.
-
-## 3. Bootstrap Job invariant
-
-The bootstrap Job pod-spec is generated by `task.GenerateBootstrapJob`, a **separate code path** from `buildNodePodSpec`. The bootstrap Job mounts only the data PVC and runs `seid` to restore a snapshot and sync to halt-height. It does not consult `SigningKey` and never mounts the Secret.
-
-This is the v1 safety property: **the bootstrap Job is physically incapable of signing**, because it has no key file in `$SEI_HOME/config/`. Even if a misconfigured cluster wedged a bootstrap Job into a running validator role, it would refuse to sign for lack of key material.
-
-A one-line code comment at the head of `GenerateBootstrapJob` will pin this invariant explicitly so a future refactor doesn't accidentally share volume code with `buildNodePodSpec`:
-
-```go
-// GenerateBootstrapJob deliberately omits ValidatorSpec.SigningKey volumes.
-// The bootstrap pod must never have access to consensus signing material —
-// see docs/design-seinode-validator-signing-key-lld.md §3.
-```
-
-## 4. Pre-flight validation task
-
-Like the import-volume LLD's `ensure-data-pvc` validation pattern (§2-3), we want the operator to learn that a referenced Secret is missing or malformed *before* the StatefulSet pod hits `ContainerCreating`-stuck purgatory. A new controller-side task validates the Secret's existence and shape, surfacing failures via a Status condition.
-
-### Task: `validate-signing-key`
-
-A new task type at `internal/task/validate_signing_key.go`. It is inserted into the validator plan after `ensure-data-pvc` and ahead of any pod-creating task (see "Plan integration" below), so neither the bootstrap Job nor the StatefulSet attempts pod creation unless the Secret is valid.
-
-```go
-// internal/task/validate_signing_key.go (sketch)
-
-type ValidateSigningKeyParams struct {
-    NodeName   string `json:"nodeName"`
-    Namespace  string `json:"namespace"`
-    SecretName string `json:"secretName"`
-}
-```
-
-### Validation rules (in order)
-
-| # | Check | Failure | State |
-|---|---|---|---|
-| 1 | Secret exists | `IsNotFound` | transient — operator may be about to apply it |
-| 2 | `deletionTimestamp == nil` | being deleted | transient |
-| 3 | `data["priv_validator_key.json"]` non-empty | missing/empty | terminal — no recovery without operator action |
-| 4 | Value parses as valid JSON | malformed | terminal |
-| 5 | Has expected Tendermint shape (top-level `address`, `pub_key.type`, `priv_key.type`) | malformed | terminal |
-
-Transient failures keep the task `Running` and requeue at the executor's `TaskPollInterval` (5s), matching the `ensure-data-pvc` import path (import-volume LLD §3). Terminal failures mark the plan `Failed`.
-
-### Reason strings (alerting contract)
-
-```go
-const (
-    ReasonSigningKeyValidated         = "SigningKeyValidated"
-    ReasonSigningKeySecretNotFound    = "SecretNotFound"
-    ReasonSigningKeySecretTerminating = "SecretTerminating"
-    ReasonSigningKeyMissingKeyFile    = "MissingPrivValidatorKey"   // terminal
-    ReasonSigningKeyMalformedKey      = "MalformedPrivValidatorKey" // terminal
-)
-```
-
-These mirror the `ImportPVC*` reason taxonomy from the import-volume LLD §2 and are part of the public alerting contract — adding a Reason is a minor-version addition; renaming or removing one is a breaking change.
-
-### Plan integration
-
-The validator plan is amended to insert `validate-signing-key` only when `SigningKey != nil`. Both bootstrap and base paths get the insertion:
-
-- `buildBootstrapPlan`: insert after `EnsureDataPVC`, before `DeployBootstrapSvc`, so a failed validation prevents the bootstrap Job resource from being created at all.
-- `buildBasePlan` (validator path): insert after `EnsureDataPVC`, before `ApplyStatefulSet`.
-
-For genesis-ceremony validators, `SigningKey` is mutually exclusive with `GenesisCeremony` (validation rule §6). Genesis-ceremony validators generate keys on-cluster and never reference an external Secret.
-
-### RBAC
-
-A new marker on the SeiNode controller:
-
-```go
-// +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch
-```
-
-This is the only RBAC addition. `make manifests` regenerates `manifests/role.yaml`. The verbs are read-only — the controller never creates, updates, patches, or deletes Secrets. Operators provision the Secret out-of-band.
-
-## 5. Status condition: `SigningKeyReady`
-
-Mirrors `ImportPVCReady` from the import-volume LLD §4.
-
-```go
-// api/v1alpha1/seinode_types.go
-const (
-    ConditionNodeUpdateInProgress = "NodeUpdateInProgress"
-    ConditionImportPVCReady       = "ImportPVCReady"
-    ConditionSigningKeyReady      = "SigningKeyReady"  // new
-)
-```
-
-### Status / Reason / Message matrix
-
-| Status | Reason | Message template | When set |
-|---|---|---|---|
-| `True` | `SigningKeyValidated` | `Secret "<name>" passes all signing-key validation rules` | task transitions to Complete |
-| `False` | `SecretNotFound` | `Secret "<name>" not found in namespace "<ns>"` | Get returns NotFound |
-| `False` | `SecretTerminating` | `Secret "<name>" is being deleted` | deletionTimestamp set |
-| `False` | `MissingPrivValidatorKey` | `Secret "<name>" missing required data key "priv_validator_key.json"` | data key missing/empty |
-| `False` | `MalformedPrivValidatorKey` | `Secret "<name>" data key "priv_validator_key.json" is not a valid Tendermint validator key` | shape check fails |
-
-### Condition lifecycle
-
-- Set on every reconcile after `validate-signing-key` runs, by the planner (matching the planner-owns-conditions pattern from CLAUDE.md).
-- Removed via `meta.RemoveStatusCondition` when `SigningKey` is unset on the spec.
-- The condition's presence in `.status.conditions` is itself a signal that the node is configured to mount external signing keys.
-- `ObservedGeneration` is set to `node.Generation` on every `meta.SetStatusCondition` call, mirroring the `NodeUpdateInProgress` setCondition discipline in `internal/planner/planner.go`.
-
-### Why no `SigningKeyMounted=True/False` condition
-
-We could express "the pod has actually picked up the key files," but that is harder to observe: it would require the sidecar to check filesystem state and report back, or a webhook into seid's logs. For v1, `SigningKeyReady=True` + the pod being `Ready` per its standard probes is sufficient signal that mounting succeeded; mount failure surfaces as `kubectl describe pod` events on the StatefulSet pod. Deferred: a richer `Signing=True/False` condition driven by sidecar observation of seid's RPC `signing_info` or proposer-presence on chain (§11).
-
-## 6. Validator planner validation additions
-
-In `validatorPlanner.Validate` (`internal/planner/validator.go:17-36`), add:
-
-```go
-if v.SigningKey != nil {
-    if v.GenesisCeremony != nil {
-        return fmt.Errorf("validator: signingKey is mutually exclusive with genesisCeremony")
-    }
-    if v.SigningKey.Secret == nil {
-        return fmt.Errorf("validator: signingKey requires a variant (secret); none set")
-    }
-    if v.SigningKey.Secret.SecretName == "" {
-        return fmt.Errorf("validator: signingKey.secret.secretName is required")
-    }
-}
-```
-
-The CRD-level `XValidation: exactly-one` covers `SigningKeySource` variants but not cross-field exclusion with `GenesisCeremony` — that goes here. The full legality of `SigningKey` against the other validator-source fields:
-
-| `SigningKey` combined with… | Verdict | Why |
-|---|---|---|
-| `GenesisCeremony` set | **Rejected** (this check) | A fresh-chain validator generates keys via the genesis ceremony; it must not also import keys from a Secret. |
-| `Snapshot.BootstrapImage` set | **Valid and expected** | The migration use case: bootstrap-Job sync + signing key on the production StatefulSet. No additional validation needed. |
-| neither `GenesisCeremony` nor `Snapshot` | **Valid** | A validator joining an existing chain via block-sync from genesis (slow but correct). A snapshot mechanism is not required alongside `SigningKey`. |
-
-## 7. Finalizer interaction
-
-Like imported PVCs (import-volume LLD §5), Secrets referenced by `SigningKey` are managed externally and must never be deleted by the controller. The finalizer path in `internal/controller/node/controller.go` requires no changes: it currently deletes the data PVC and cleans up metrics, neither of which references the signing-key Secret. An explicit code comment is added to the finalizer to prevent future accidental coupling:
-
-```go
-// Note: SigningKey-referenced Secrets are managed externally (operator,
-// ESO, CSI Secrets Store). The controller only reads them (for validation, §4);
-// it never writes or deletes them. See docs/design-seinode-validator-signing-key-lld.md §7.
-```
-
-## 8. Cutover flow (operational)
-
-v1 supports a single shape: **deploy a SeiNode with `validator.signingKey` set from creation**. Mid-life patching of `SigningKey` onto a Running validator is not supported — `buildRunningPlan` only detects image drift and would not reconcile the StatefulSet pod-spec on a SigningKey change. Adding that capability is a follow-up (see §11).
-
-```
-Cutover (single-shot deployment):
-  [stop old EC2 validator at height H; confirm process is dead]
-  [scrape priv_validator_key.json from old EC2]
-  [wait for chain to advance ≥ M blocks past H — defense against re-org at the cutover boundary]
-  [create K8s Secret <name>-signing-key containing priv_validator_key.json]
-  kubectl apply <SeiNode with validator.snapshot.bootstrapImage + validator.signingKey>
-    → controller: buildBootstrapPlan
-    → ensure-data-pvc, validate-signing-key (Secret exists, well-formed → SigningKeyReady=True)
-    → bootstrap Job runs with halt-height (no key material on bootstrap pod — safety invariant in §3)
-    → bootstrap Job exits, teardown
-    → production StatefulSet starts with signing-key Secret mounted at config/priv_validator_key.json
-    → seid starts, creates fresh priv_validator_state.json (height=0), block-syncs to chain tip
-    → first signing opportunity is at chain tip ≫ H → seid signs (not a re-sign of any height the old validator signed)
-
-Verify:
-  Operator checks on-chain signing_info; controller reports SigningKeyReady=True
-```
-
-### Operational tradeoff
-
-The single-shot deployment trades cutover downtime (the bootstrap-Job sync window) for implementation simplicity. There is no Phase 1 pre-sync; the old EC2 validator must stop before the K8s validator is created. For arctic-1 and most pacific-1 timelines this is acceptable. Operators needing zero-downtime cutover (deploy without keys → pre-sync → patch keys in) require the drift detection deferred in §11.
-
-### Slashing protection
-
-CometBFT auto-creates `priv_validator_state.json` on first start with zero state. The new validator's first signing opportunity is at chain tip, well past the old validator's last-signed height once the chain has advanced ≥ M blocks (M=20 is generous for Sei). The runbook's "wait M blocks past last-signed height before applying the SeiNode" step is the operational protection against the narrow case where a chain re-org at the cutover boundary could produce a height-collision with the old validator. The controller does not inject `priv_validator_state.json` — see §11.
-
-### Why NodeUpdate naturally handles this
-
-The existing NodeUpdate plan path (`internal/planner/planner.go`, when `spec.image != status.currentImage` — but more generally, when the StatefulSet pod template differs from the desired template) already orchestrates `apply-statefulset → observe-image → mark-ready`. Adding signing-key volumes to the pod template is a structurally-equivalent diff. We rely on this mechanism unchanged for v1; richer "validator-promotion" plan semantics are deferred (§11).
-
-## 9. Tests
-
-### Unit tests
-
-**API validation** (`api/v1alpha1/validator_types_test.go`):
-
-- `SigningKey_SecretOnly_Valid`
-- `SigningKey_NoVariant_RejectedByXValidation`
-- `SigningKey_SecretNameImmutable` — apply with one name, patch with another, expect rejection
-- `SigningKey_WithGenesisCeremony_RejectedByPlannerValidate`
-
-**Pod spec generation** (`internal/noderesource/noderesource_test.go`):
-
-- `SigningKeySet_SecretVolumePresent` — `signing-key` volume exists on the StatefulSet pod template, scoped to `priv_validator_key.json`
-- `SigningKeySet_SeidContainerHasSubPathMount` — seid container has `subPath: priv_validator_key.json` mount at `/sei/config/priv_validator_key.json`
-- `SigningKeySet_SidecarHasNoSigningMount` — sidecar container does not mount the signing volume
-- `SigningKeyUnset_NoSigningVolume` — regression guard for the absence path
-- `BootstrapJob_NeverHasSigningVolume` — even when `SigningKey` is set on the SeiNode, `task.GenerateBootstrapJob` produces a Job with no signing-related volume (the §3 invariant)
-
-**Validate-signing-key task** (`internal/task/validate_signing_key_test.go`):
-
-Table-driven, fake-client pattern matching `ensure_pvc_test.go`. One case per row of §4's validation table (transient + terminal variants), plus:
-
-- `ValidSecret_Completes`
-- `SecretNotFound_Transient_ThenAppears_Completes` (apply Secret between reconciles)
-
-### Integration / e2e
-
-- `Controller_ValidatorWithSigningKey_ReachesRunningSigning` — apply SeiNode with SigningKey, Secret pre-applied; expect Phase=Running, SigningKeyReady=True, key file present at expected mount path inside pod
-- `Controller_ValidatorPhase1ThenPhase2_PodRestartsWithKeys` — apply without SigningKey, wait for Running, patch SigningKey in, observe pod restart with mounts
-- `Controller_ValidatorWithMissingSecret_StuckTransient` — apply with SigningKey but no Secret; expect SigningKeyReady=False/SecretNotFound; apply Secret; expect convergence
-- `Controller_ValidatorWithMalformedSecret_PlanFailed` — Secret exists but `priv_validator_key.json` is not valid JSON or wrong shape; expect plan Failed and SigningKeyReady=False with terminal Reason
-- `Controller_ValidatorDeletion_PreservesSecret` — delete SeiNode, confirm referenced Secret still exists
-
-## 10. Observability
-
-### Metrics (`internal/controller/observability/`)
-
-- `signingKeyValidationTotal` (counter, attrs: controller, namespace, result∈{valid,transient,terminal})
-- `signingKeyTerminalTotal` (counter, attrs: controller, namespace, reason) — independent counters per terminal Reason for alerting
-- `signingKeyTimeToValid` (histogram, attrs: controller, namespace) — observed at task Complete, from `SubmittedAt` → now
-
-### Events
-
-Emitted by the reconciler after status flush:
-
-- `SigningKeyValidated` (Normal) on transition to True
-- `SigningKeyValidationFailed` (Warning) when False-condition Reason changes
-
-### Dashboards / alerts
-
-Out of scope for this LLD — they live in `~/workspace/platform/clusters/prod/monitoring/`. The Reason strings in §4 and §5 are the public contract those alerts will key on.
-
-## 11. What this LLD does NOT cover
-
-- **Mid-life `SigningKey` patch on a Running validator (drift detection).** v1 supports SigningKey set from SeiNode creation. Patching SigningKey onto an already-Running validator is a no-op today because `buildRunningPlan` only checks image drift. Adding this would require: (a) tracking the currently-mounted secretName in `SeiNodeStatus`, (b) extending `buildRunningPlan` to detect SigningKey drift, (c) building a re-apply plan that includes `validate-signing-key` + `apply-statefulset` + `observe-image`-equivalent rollout-watch. The plumbing already exists; the gap is just the drift trigger.
-- **Variants beyond `Secret`.** TMKMS, Horcrux, remote signer (Web3Signer / Vault / AWS-KMS-fronted), Tendermint KMS protocol over a Unix socket. Add as sibling fields under `SigningKeySource` when in-house validators need them; the union is shaped to accept them additively.
-- **Automated cutover orchestration.** The cutover in §8 is operator-driven (manual stop of EC2, manual scrape of keys, manual `kubectl apply`). Automating this end-to-end is explicitly deferred until the first manual cutover succeeds.
-- **`priv_validator_state.json` injection.** This file is CometBFT's slashing-protection ledger. seid auto-creates it (height=0) on first start and owns it on the PVC thereafter; on pod restart it's read from the PVC, not from any external source. For the migration use case, transferring the old validator's state file is unnecessary because the cutover runbook already enforces a hard halt of the old validator and a wait for chain advance ≥ M blocks past `last_signed_height` before activating the new instance — at that point the new validator's first signing opportunity is far past anything the old validator signed. The file is also operational data, not secret material (height/round/step plus the most recent signature, all of which are public on-chain). If a future use case needs explicit state injection (e.g., automated chain-rollback tooling), add a separate ConfigMap-based source distinct from `SigningKey`.
-- **Double-sign detection.** Out-of-band monitoring (slashing-info polling, sentry comparisons) is the right venue. Not a controller responsibility.
-- **Sentry-node topology** (private validator behind public sentries). Architecture decision orthogonal to keying; deferred.
-- **Consensus key rotation.** Rotating `priv_validator_key.json` requires an on-chain `MsgEditValidator` and is a coordinated operation, not a `kubectl edit secret`. Out of scope; runbook concern.
-- **Unsetting `SigningKey` after it's been set.** Currently no field-level immutability rule prevents unset-after-set, but the workflow is unsupported (pod would restart without keys, validator would stop signing, miss blocks → jail). If a future use case requires demote-to-non-signing, add a controlled demotion plan rather than relying on unset.
-- **Cross-namespace Secret references.** SecretName resolves in the SeiNode's namespace only.
-- **HSM integration.** The natural variant for HSM is a remote-signer-style sibling under `SigningKeySource`, not a Secret-based shape. Defer until a concrete HSM platform is selected.
-
-## 12. Migration / rollout
-
-**No migration required.** Existing SeiNodes serialize without `signingKey`; the controller reads it as nil and follows existing code paths unchanged. The new CRD schema is additive.
-
-**Controller upgrade is idempotent.** Any running SeiNode whose plan completed before this version is unaffected. A SeiNode with `SigningKey` already set in spec but plan not yet started will pick up the new `validate-signing-key` task on the next reconcile; if the Secret is valid, the task completes immediately and the plan continues.
-
-**Operator rollout:**
-
-```bash
-make manifests generate
-make test
-make lint
-make docker-build IMG=<registry>/sei-k8s-controller:<sha>
-make docker-push IMG=<registry>/sei-k8s-controller:<sha>
-# update Helm/flux values; controller restart is disruption-free
-```
-
-No CRD data migration. Applying the new CRD on a cluster with old SeiNode objects is safe.
-
-## Related work
-
-- `docs/design-seinode-import-volume-lld.md` — adjacent LLD; sets in-house style for K8s API design (immutability, validation, condition lifecycle).
-- `docs/design/composable-genesis.md` — sketches `register-validator` for the *new-chain validator joining* case (deferred for migration use case).
-- `api/v1alpha1/validator_types.go` — new fields per §1.
-- `api/v1alpha1/seinode_types.go` — new condition constant per §5.
-- `internal/noderesource/noderesource.go` (`buildNodePodSpec`, `buildNodeMainContainer`) — pod-spec changes per §2.
-- `internal/task/validate_signing_key.go` — new task per §4.
-- `internal/planner/validator.go` (`validatorPlanner.Validate`) — planner Validate additions per §6.
-- `internal/planner/bootstrap.go` (`buildBootstrapPlan`), `internal/planner/planner.go` (`buildBasePlan` validator path) — plan integration per §4.
diff --git a/docs/design-snapshot-publish-config.md b/docs/design-snapshot-publish-config.md
deleted file mode 100644
index 91173587..00000000
--- a/docs/design-snapshot-publish-config.md
+++ /dev/null
@@ -1,234 +0,0 @@
-# Design: SeiNode — Decoupling Snapshot Generation from Publishing
-
-**Status:** Draft / RFC
-**Date:** 2026-04-20
-**Tracks:** (issue TBD)
-**Related:** `api/v1alpha1/common_types.go` (`SnapshotGenerationConfig`), `api/v1alpha1/full_node_types.go`, `api/v1alpha1/archive_types.go`
-
-## Problem
-
-Today `SnapshotGenerationConfig` is a flat type with a single field — `KeepRecent` — and its presence implicitly toggles *three* distinct behaviors:
-
-1. Set archival pruning in app.toml (`pruning = "nothing"`).
-2. Set `snapshot-interval` / `snapshot-keep-recent` in app.toml so seid produces Tendermint state-sync snapshots on disk.
-3. Cause the sidecar to upload completed snapshots to `{SEI_SNAPSHOT_BUCKET}/{chainID}/` in S3.
-
-There is no way to opt into (1)+(2) without also getting (3). A full node operator who wants N local Tendermint snapshots on disk — for fast internal restarts, for a local peer to state-sync from, or while a bucket isn't provisioned yet — has no first-class way to express that. Turning the feature off entirely is the only alternative, which also disables local generation.
-
-The shape is also **not extensible**. `SnapshotGenerationConfig` is a flat record keyed to Tendermint state-sync snapshots specifically. Any future snapshot flavor (e.g., seidb snapshots, app-layer snapshots, a different upload target, or a second publish protocol) would require overloading field names or renaming existing fields. The current type is shaped for one mode and one destination.
-
-Both gaps trace to the same underlying issue: the config conflates *what kind of snapshot* with *what to do with it*. Pulling those apart is the refactor.
-
-## What the user can configure today
-
-```yaml
-spec:
-  fullNode:
-    snapshotGeneration:
-      keepRecent: 5        # <-- only knob
-```
-
-Behavior: produce Tendermint snapshots on disk, retain the last 5, and upload every completed snapshot to S3. The upload is not separately addressable.
-
-## Shapes evaluated
-
-| Axis | **A** — add `publishToS3: bool` to flat struct | **B** — nest under `tendermint:` + boolean toggle | **C** — nest under `tendermint:` + present/absent `publish` struct | **D** — polymorphic `snapshotMethods: []` list |
-|---|---|---|---|---|
-| Extensibility to new snapshot modes | Poor — adding seidb/app-layer requires renaming or mode discriminators on the flat struct | Good — new modes = new sibling sub-structs (`seidb:`, `appLayer:`) | Good — same as B | Excellent — list is ordered and naturally polymorphic |
-| Extensibility of `publish` config (bucket, prefix, future protocols) | Poor — boolean can't carry config; renaming later is churn | Poor — same as A | Good — struct can grow fields (`publish.bucket`, `publish.prefix`) without renaming | Good — each item carries its own config block |
-| Consistency with existing SeiNode patterns | Low | Medium | **High** — matches `stateSync: {}` and `snapshot: { s3: {...} }` presence-as-toggle used throughout `common_types.go` | Low — no precedent for polymorphic lists in this CRD |
-| Ergonomics for the "local-only snapshots" case | `publishToS3: false` (explicit negation reads awkwardly) | `publish: { enabled: false }` (double-negative adjacent to an `enabled` field) | Omit `publish:` entirely — the absence is the signal | Omit the S3-publish entry |
-| Ergonomics for the default / common case (publish-enabled) | Most compact (`publishToS3: true`) | Compact (`publish: { enabled: true }`) | Compact (`publish: {}`) | Verbose (full method entry) |
-| CRD validation surface | Minimal | Minimal | Minimal | High — list-of-discriminated-unions requires oneOf enforcement in OpenAPI |
-| Reviewability of the diff | One-line spec change | New type + one field rename | New type + new leaf type + one field rename | New list type + method discriminator + union validation |
-
-### Why not a flat struct with a stringly-typed mode?
-
-An alternative to B/C is a single `mode: enum` field (`TendermintStateSync`, `Seidb`, …) on a flat struct. This keeps the shape flat but moves extensibility into enum values — future modes add enum cases rather than fields. Rejected: enum discriminators force every mode to share the same config fields (`keepRecent` may not even be meaningful for seidb), defeating the purpose. Nested sub-structs keep each mode's config co-located with the mode.
-
-## Decision
-
-**Adopt Shape C**: reshape `SnapshotGenerationConfig` (the type and the `snapshotGeneration` field keep their names) to wrap a `tendermint:` sub-struct. That sub-struct carries the mode-specific fields (`keepRecent`) and an optional `publish:` struct whose *presence* enables S3 upload. Absence of `publish` means "generate and retain locally; do not upload."
-
-```go
-// SnapshotGenerationConfig configures snapshot generation. One or more snapshot
-// modes may be enabled by setting the corresponding sub-struct. A mode sub-struct
-// being absent means that snapshot type is not produced.
-type SnapshotGenerationConfig struct {
-    // Tendermint configures Tendermint state-sync snapshot generation.
-    // +optional
-    Tendermint *TendermintSnapshotGenerationConfig `json:"tendermint,omitempty"`
-}
-
-// TendermintSnapshotGenerationConfig configures a node to produce Tendermint
-// state-sync snapshots. The controller sets archival pruning and a
-// system-default snapshot-interval in app.toml. Snapshots are written to the
-// node's data volume.
-type TendermintSnapshotGenerationConfig struct {
-    // KeepRecent is the number of recent snapshots to retain on disk.
-    // When Publish is set, must be at least 2 so the upload algorithm can
-    // select the second-to-latest completed snapshot. Otherwise must be at
-    // least 1.
-    // +kubebuilder:validation:Minimum=1
-    KeepRecent int32 `json:"keepRecent"`
-
-    // Publish, when set, causes the sidecar to upload completed snapshots
-    // to {SEI_SNAPSHOT_BUCKET}/{chainID}/. Absence means snapshots are kept
-    // on disk only and are not uploaded.
-    // +optional
-    Publish *TendermintSnapshotPublishConfig `json:"publish,omitempty"`
-}
-
-// TendermintSnapshotPublishConfig configures how completed Tendermint
-// snapshots are uploaded. Currently an empty struct — its presence on
-// TendermintSnapshotGenerationConfig enables upload to the platform
-// snapshot bucket. Fields may be added here in the future (e.g., bucket
-// override, prefix) without a breaking change.
-type TendermintSnapshotPublishConfig struct{}
-```
-
-YAML surface:
-
-```yaml
-# Generate Tendermint snapshots, keep 5 locally, upload to S3 (current default use case)
-spec:
-  fullNode:
-    snapshotGeneration:
-      tendermint:
-        keepRecent: 5
-        publish: {}
-
-# Generate Tendermint snapshots, keep 3 locally, DO NOT upload (new capability)
-spec:
-  fullNode:
-    snapshotGeneration:
-      tendermint:
-        keepRecent: 3
-
-# No snapshot generation (unchanged)
-spec:
-  fullNode: {}
-```
-
-### Why present/absent struct instead of `enabled: bool`
-
-Three reasons, in order of weight:
-
-1. **Existing precedent in this CRD.** `StateSyncSource struct{}` (`common_types.go:107`) is already used this way — setting `spec.fullNode.snapshot.stateSync: {}` enables state sync. `spec.fullNode.snapshot.s3: {...}` is the sibling case where presence enables the S3 source path. Using the same pattern here keeps the CRD internally consistent; mixing `stateSync: {}` with `publish: { enabled: true }` would be needless schema drift.
-2. **Extensibility without renames.** When `publish` eventually grows fields — bucket override per-node, snapshot-prefix override, a second publish target — those fields go into the existing struct. `enabled: bool` would need to either be dropped or coexist with the new fields, producing an awkward `enabled: true, bucket: "..."` shape.
-3. **One way to say "off."** With an `enabled` field, `publish: { enabled: false }` would mean the same thing as leaving `publish:` unset, and operators would mix the two inconsistently. With presence-as-toggle, omitting the struct is the single canonical way to disable publishing.
-
-The Kubernetes API conventions don't mandate one form over the other — both appear in core APIs — but "presence-as-enabler for an optional nested feature" is idiomatic (e.g., `PodSpec.securityContext`, `ServiceSpec.sessionAffinityConfig`). `enabled: bool` is reserved for cases where presence of the parent struct is *also* meaningful and the boolean distinguishes a substate.
-
-### Why the outer wrapper (`SnapshotGenerationConfig`) still exists
-
-A thinner alternative is to drop the wrapper and put `tendermint:` directly on the full-node / archive spec:
-
-```yaml
-spec:
-  fullNode:
-    tendermintSnapshotGeneration:
-      keepRecent: 5
-```
-
-Rejected: the outer `snapshotGeneration:` groups all snapshot-producing behavior under one namespace. A future `spec.fullNode.snapshotGeneration.seidb: {...}` has an obvious home; `spec.fullNode.seidbSnapshotGeneration` would proliferate top-level fields. The wrapper is one nesting level of YAML for consistent grouping — cheap.
-
-## Controller behavior
-
-The controller changes are confined to three call sites that read the config plus one helper in `internal/planner/planner.go`.
-
-### 1. Planner override logic
-
-`internal/planner/full.go:44` and `internal/planner/archive.go:35` currently read:
-
-```go
-sg := node.Spec.FullNode.SnapshotGeneration
-if sg == nil { return nil }
-return seiconfig.SnapshotGenerationOverrides(sg.KeepRecent)
-```
-
-After the refactor:
-
-```go
-sg := node.Spec.FullNode.SnapshotGeneration
-if sg == nil || sg.Tendermint == nil { return nil }
-return seiconfig.SnapshotGenerationOverrides(sg.Tendermint.KeepRecent)
-```
-
-Archival pruning + `snapshot-interval` overrides apply whenever `tendermint` is set, which matches today's behavior whenever `SnapshotGeneration` is set. The override code does not care whether publishing is enabled; the overrides control what seid writes to disk.
-
-### 2. Reuse the existing `snapshot-upload` sidecar task
-
-The sidecar already defines `TaskTypeSnapshotUpload` and `SnapshotUploadTask{}` (seictl `sidecar/client/tasks.go`, present in pinned version v0.0.30). The task is a fire-and-forget RPC that tells the sidecar to start its continuous upload loop at its configured interval. Bucket/region/prefix are derived by the sidecar from its env — the existing `SEI_SNAPSHOT_BUCKET` / `SEI_SNAPSHOT_REGION` wiring at `internal/noderesource/noderesource.go:300-301` is unchanged.
-
-The controller-side plumbing that used to submit this task (`planner.SnapshotUploadMonitorTask`, `planner.TaskSnapshotUpload`, the monitor-task reconcile loop) was deleted in #89 (`358a216 feat: remove monitor task system`) when the monitor-task subsystem was removed in favor of plan-driven reconciliation. The net change here is re-introducing that wiring, but as a **plan task** instead of a separate monitor-task surface:
-
-- Register `sidecar.TaskTypeSnapshotUpload` in the controller's task registry (`internal/task/task.go`) with a `SnapshotUploadParams` wrapper, as fire-and-forget (matching `mark-ready`, `config-validate`).
-- In `internal/planner/bootstrap.go`, append the `snapshot-upload` task to the init progression when `spec.*.snapshotGeneration.tendermint.publish` is present. Absence of `publish` means the task is not emitted; the sidecar never starts uploading.
-
-Placement in the init progression: after `config-apply` / `config-validate` (so the snapshot-interval overrides are in place first) and before `mark-ready` (so the sidecar's upload loop is armed before seid starts producing snapshots).
-
-This keeps the plan self-describing — a reader can tell from `.status.plan` whether a node is configured to publish — and it does so by using the sidecar task type that already exists, rather than inventing a new one.
-
-### 3. `SnapshotGeneration` helper
-
-`internal/planner/planner.go:312` currently extracts `*SnapshotGenerationConfig` from either `FullNode` or `Archive`. That helper stays; its return type doesn't change. Downstream callers reach through `.Tendermint` when they need the mode-specific config.
-
-## Validation
-
-Cross-field validation lives in each per-mode planner's `Validate(node)` hook — `fullNodePlanner.Validate` (`internal/planner/full.go:17`), `archiveNodePlanner.Validate` (`internal/planner/archive.go:18`). That hook already enforces semantic rules on sub-specs (e.g., `snap.BootstrapImage requires s3.TargetHeight > 0`) and is the established home for rules that the OpenAPI schema cannot express.
-
-Not a job for the sidecar's `config-validate` task (`internal/task/config.go:33`): that task validates the **rendered** `app.toml` / `config.toml` against seid's requirements. `snapshot-keep-recent=1` is perfectly legal in seid — the "must be ≥ 2 when publishing" rule is ours, not seid's, so it does not belong at the rendered-config layer.
-
-Rules:
-
-- `snapshotGeneration.tendermint.keepRecent >= 1` at the schema level via kubebuilder marker (was `>= 2`).
-- When `snapshotGeneration.tendermint.publish` is set, the planner `Validate(node)` enforces `keepRecent >= 2`. Rationale: the upload algorithm picks the second-to-latest completed snapshot. Without publish, retaining only the most recent is fine.
-- `snapshotGeneration` with neither `tendermint` nor any other sub-struct set is treated as a validation error in `Validate(node)` — an empty `snapshotGeneration: {}` is almost certainly a user typo. Schema-level `oneOf` is an option but not required; planner-level validation is consistent with how other "pick one sub-mode" invariants are handled in this codebase.
-
-No admission webhook is introduced.
-
-## Migration
-
-Per the skill brief, there is no launched product, so backward compatibility is not required. The existing field name (`snapshotGeneration`) is preserved; only its inner shape changes. All four in-tree samples that reference it (`manifests/samples/seinode/pacific-1-snapshotter.yaml`, `manifests/samples/seinode/pacific-1-state-syncer.yaml`, test fixtures) are updated in the same PR that lands the type change.
-
-`make manifests generate` regenerates the CRD YAML and DeepCopy methods; no hand-editing is needed. The tests run by `make test` are updated to reference the new shape.
-
-## Rollout steps
-
-1. Land type changes in `api/v1alpha1/common_types.go`: new `TendermintSnapshotGenerationConfig`, new empty `TendermintSnapshotPublishConfig`, rewritten `SnapshotGenerationConfig`.
-2. `make manifests generate` — regenerates CRDs and DeepCopy.
-3. Update planner override call sites (`internal/planner/full.go`, `internal/planner/archive.go`) to read `.Tendermint.KeepRecent`.
-4. Add cross-field checks to `fullNodePlanner.Validate` and `archiveNodePlanner.Validate`: reject `snapshotGeneration: {}` with no sub-struct; require `keepRecent >= 2` when `publish` is set.
-5. Re-wire the existing seictl `TaskTypeSnapshotUpload` on the controller side: add a `SnapshotUploadParams` wrapper in `internal/task/`, register it in the task registry at `internal/task/task.go` as fire-and-forget, and thread it into the init progression in `internal/planner/bootstrap.go` when `publish` is present. This restores controller-side plumbing that was deleted in #89 (`358a216`), now as a plan task rather than a monitor task.
-6. Update sample manifests and tests to the new shape.
-7. Update `internal/planner/planner.go:312` doc comment to note the indirection through `.Tendermint`.
-
-## Open questions (for the LLD)
-
-1. **Should `publish: {}` later grow a `bucket` override?** Out of scope for this design; the struct exists precisely to absorb that field later without a rename. Flag if the operator wants per-node bucket overrides imminently — affects whether `TendermintSnapshotPublishConfig` should be defined with future fields in mind now. Note: `SnapshotUploadTask` in seictl v0.0.30 currently takes no params; per-node overrides would likely require a minor bump on the sidecar as well.
-
-2. **Archive node behavior when `publish` is absent.** Archives already disable pruning independent of `SnapshotGeneration` (per `archive_types.go` comment). Does omitting `publish` on an archive that was previously a "snapshotter" orphan the S3 history? Operational, not schema-level — note in release notes / migration instructions for any already-deployed dev archives.
-
-## What this design does NOT cover
-
-- **Other snapshot flavors** (seidb, app-layer, … ). The wrapper is shaped to accept them as sibling sub-structs on `SnapshotGenerationConfig`; actual design for those modes ships when the use case appears.
-- **Per-node bucket / prefix overrides on `publish`**. Deferred; `TendermintSnapshotPublishConfig` is empty today precisely so these can land as field additions later.
-- **A sidecar-managed on-disk GC policy beyond `keepRecent`.** The sidecar currently relies on seid's `snapshot-keep-recent`; no separate retention policy is introduced here.
-- **Metrics / events for snapshot upload success & failure.** Existing snapshot-related events (`SnapshotScheduled`, `SnapshotComplete`, `SnapshotFailed`) are unchanged by this design.
-
-## Related work
-
-- `api/v1alpha1/common_types.go:109` — `SnapshotGenerationConfig` (type being refactored)
-- `api/v1alpha1/full_node_types.go:11`, `api/v1alpha1/archive_types.go:8` — consumers of the type
-- `api/v1alpha1/common_types.go:107` — `StateSyncSource struct{}` precedent for the presence-as-toggle pattern
-- `internal/planner/full.go:17`, `internal/planner/archive.go:18` — `Validate(node)` hooks; home for the new cross-field checks
-- `internal/planner/full.go:44`, `internal/planner/archive.go:35` — override call sites
-- `internal/planner/planner.go:312` — `SnapshotGeneration()` helper
-- `internal/task/bootstrap.go:25` — `ConfigureStateSyncParams`, an example of the one-shot "enable a background sidecar behavior" task pattern
-- `internal/task/config.go:33` — `ConfigValidateParams`, the rendered-config validator (distinct from spec-level `Validate(node)`)
-- `internal/task/task.go:180` — sidecar task registry where `TaskTypeSnapshotUpload` is re-registered
-- `internal/noderesource/noderesource.go:300-301` — sidecar `SEI_SNAPSHOT_BUCKET` / `SEI_SNAPSHOT_REGION` env (unchanged; sidecar reads upload target from these)
-- seictl `sidecar/client/tasks.go` — `TaskTypeSnapshotUpload` / `SnapshotUploadTask{}` (v0.0.30, already present; the controller side is what's missing)
-- Commit `358a216` (PR #89) — removed the monitor-task subsystem, including the old `SnapshotUploadMonitorTask` plumbing that this design reinstates as a plan task
-- `manifests/samples/seinode/pacific-1-snapshotter.yaml`, `pacific-1-state-syncer.yaml` — sample manifests updated alongside the type change
diff --git a/docs/design/composable-genesis.md b/docs/design/composable-genesis.md
deleted file mode 100644
index b0771536..00000000
--- a/docs/design/composable-genesis.md
+++ /dev/null
@@ -1,442 +0,0 @@
-# Composable Genesis Ceremony
-
-## Status
-
-Proposal — not yet scheduled for implementation.
-
-## Problem
-
-The current `SeiNodePool` genesis ceremony is a monolithic, centralized process:
-
-1. A single Kubernetes Job (`{pool}-genesis`) runs an embedded shell script that
-   sequentially calls `seid init` for every node, creates keys and gentxs, collects
-   the gentxs, patches genesis parameters, and distributes the final `genesis.json`.
-2. Per-node prep Jobs copy each node's directory from a shared EFS PVC to the
-   node's data PVC, rewrite peer addresses, and set validator-specific config.
-3. SeiNodes are created with pre-populated PVCs. Their sidecars do almost nothing.
-
-This approach has several drawbacks:
-
-- **Rigid**: the entire ceremony is a single shell script. Adding a step (e.g.,
-  oracle config, IBC relayer keys) means editing the script.
-- **Sequential**: all N node identities are created in series within one Job pod.
-- **EFS dependency**: a ReadWriteMany PVC (EFS) is required for the shared genesis
-  artifact, adding infrastructure cost and operational complexity.
-- **Not reusable**: the validator initialization logic is locked inside the pool's
-  genesis Job. A standalone SeiNode joining an existing chain as a validator cannot
-  reuse any of it.
-
-## Proposal
-
-Decompose the genesis ceremony into per-node sidecar tasks orchestrated by the
-`SeiNodePool` controller. Each `SeiNode` independently bootstraps its own validator
-identity. The pool controller coordinates the single fan-in step (genesis assembly)
-that no individual node can perform alone.
-
-### Architecture
-
-```
-  SeiNodePool Controller (orchestrator)
-      │
-      ├── Creates N SeiNodes with Validator config
-      │
-      ├── Watches SeiNode statuses for identity completion
-      │
-      ├── Triggers assembly when all identities are ready
-      │   (lightweight Job or designated-node sidecar task)
-      │
-      └── Patches each SeiNode's Genesis.S3 to point to assembled genesis
-          └── Each node's existing init plan machinery takes over
-```
-
-### Per-Node Init Plan (pool-created validator)
-
-```
-init-validator          → create keys, gentx, publish identity to S3
-configure-genesis       → download assembled genesis.json from S3 (retries until available)
-configure-state-sync    → (only if StateSync is set)
-config-patch            → apply TOML config patches
-mark-ready              → signal bootstrap complete
-register-validator      → (only on existing chains) submit create-validator tx
-```
-
-### Per-Node Init Plan (standalone validator joining existing chain)
-
-```
-init-validator          → create keys (gentx not needed; genesis already exists)
-configure-genesis       → download existing chain genesis from S3 (immediately available)
-configure-state-sync    → sync to chain tip
-config-patch            → apply TOML config patches
-mark-ready              → signal bootstrap complete
-register-validator      → submit create-validator tx on-chain
-```
-
-### S3 Ceremony Artifact Layout
-
-Each node publishes its identity artifacts to a shared S3 prefix. The assembler
-reads from this prefix to build genesis.
-
-The ceremony prefix includes a unique identifier (the pool's UID or a user-supplied
-value) so that multiple pools with the same `chainId` do not collide. For example,
-two separate `arctic-1` test networks created at different times each get their own
-isolated ceremony namespace.
-
-```
-s3://{bucket}/{chainId}/ceremony/{ceremonyId}/
-├── node-0/
-│   ├── node-id.json       {"node_id": "7ea2fc...", "address": "pool-0.pool.ns.svc:26656"}
-│   ├── gentx.json         validator genesis transaction
-│   └── account.json       {"address": "sei1abc...", "coins": ["10000000usei"]}
-├── node-1/
-│   ├── node-id.json
-│   ├── gentx.json
-│   └── account.json
-├── ...
-└── genesis.json           written by assembler after collect-gentxs
-```
-
-The `ceremonyId` is derived from the `SeiNodePool` resource UID by default, ensuring
-uniqueness across pool recreations. Users can override it via the pool spec for
-deterministic paths in CI/CD workflows.
-
-### Determining Whether Gentx Is Needed
-
-No explicit flag is required. The controller infers this from the existing spec:
-
-```go
-func needsGentx(node *SeiNode) bool {
-    return node.Spec.Validator != nil &&
-           node.Spec.Genesis.S3 == nil &&
-           node.Spec.Genesis.PVC == nil
-}
-```
-
-If there is no pre-existing genesis source, the node is bootstrapping a new network
-and must produce a gentx for the assembler. If genesis is already available, the node
-only needs its key material.
-
-## CRD Changes
-
-### SeiNodeSpec additions
-
-```go
-// Validator configures this node to initialize as a validator.
-// When set, the sidecar generates key material and (for new chains)
-// a genesis transaction during bootstrap.
-// +optional
-Validator *ValidatorConfig `json:"validator,omitempty"`
-```
-
-### New types
-
-```go
-type ValidatorConfig struct {
-    // Moniker is the human-readable validator name.
-    // Defaults to the SeiNode name if unset.
-    // +optional
-    Moniker string `json:"moniker,omitempty"`
-
-    // StakeDenom is the token denomination for staking (e.g. "usei").
-    StakeDenom string `json:"stakeDenom"`
-
-    // StakeAmount is the self-delegation amount in the gentx (e.g. "1000000usei").
-    StakeAmount string `json:"stakeAmount"`
-
-    // Identity configures where this node publishes its validator artifacts.
-    Identity ValidatorIdentityStore `json:"identity"`
-}
-
-type ValidatorIdentityStore struct {
-    // S3 is the prefix where identity artifacts are published.
-    // Each node writes to {prefix}/node-{name}/ under this location.
-    S3 S3Location `json:"s3"`
-}
-```
-
-## New Sidecar Tasks
-
-### `init-validator`
-
-Creates validator key material and, when the node is bootstrapping a new chain,
-produces a gentx and publishes identity artifacts to S3.
-
-**Inputs**: moniker, chainId, stakeDenom, stakeAmount, S3 prefix
-**Actions**:
-1. `seid init {moniker} --chain-id {chainId}`
-2. `seid keys add validator`
-3. If gentx is needed:
-   - `seid add-genesis-account`
-   - `seid gentx validator {stakeAmount}`
-   - Upload `node-id.json`, `gentx.json`, `account.json` to S3
-
-### `assemble-genesis` (assembler only)
-
-Collects all published identities and produces the final genesis.json.
-
-**Inputs**: S3 ceremony prefix, expected node count, genesis parameter patches
-**Actions**:
-1. Download all `node-{N}/` artifacts from S3
-2. `seid add-genesis-account` for each node
-3. Copy all gentx into `config/gentx/`
-4. `seid collect-gentxs`
-5. Apply genesis parameter patches (staking, oracle, gov, etc.)
-6. Upload final `genesis.json` to the ceremony prefix root
-
-### `register-validator` (post-sync, existing chains only)
-
-Submits a `create-validator` transaction on-chain after the node is synced.
-
-**Inputs**: key name, stakeAmount, commission params
-**Actions**:
-1. Wait for the node to finish syncing (its status reports `catching_up = false`)
-2. `seid tx staking create-validator ...`
-
-## Prerequisite: Task Retry Policy
-
-**The current controller marks the entire init plan as `Failed` when any task
-execution fails.** This is the single-failure-kills-plan policy established early
-in the design.
-
-The composable genesis model depends on tasks that naturally retry until an external
-precondition is met (e.g., `configure-genesis` retrying until the assembler uploads
-`genesis.json`). This requires changing the failure policy.
-
-### Proposed change
-
-Add a retry semantic to the task plan. When a task fails, instead of immediately
-marking the plan as `Failed`:
-
-1. Reset the task status from `Failed` back to `Pending`.
-2. Increment a retry counter on the task.
-3. Requeue with a backoff interval.
-4. Only mark the plan as `Failed` after exceeding a configurable retry limit
-   (or use unlimited retries for specific task types).
-
-The simplest initial approach: **retry all failed tasks indefinitely** with the
-existing poll interval. A failed task goes back to `Pending` and is resubmitted on
-the next reconcile. This matches the user's intent that tasks like `configure-genesis`
-should keep trying until the precondition (genesis file exists) is satisfied.
-
-A future refinement could introduce per-task retry limits or distinguishable error
-categories (retryable vs. permanent), but unlimited retry is the correct starting
-behavior for the scenarios this design enables.
-
-### Affected code
-
-`failTask()` in `plan_execution.go` currently sets `task.Status = PlannedTaskFailed`
-and `plan.Phase = TaskPlanFailed`. The change would instead set
-`task.Status = PlannedTaskPending` (clearing the task ID and error) and requeue.
-
-## SeiNodePool Redesign
-
-The current `SeiNodePool` CRD and controller are entirely unused in production and
-carry no established contracts. This design replaces the pool implementation
-wholesale, reshaping it around the composable SeiNode and sidecar architecture.
-
-### New SeiNodePool spec (sketch)
-
-A `SeiNodePool` is a generic grouping of N identical `SeiNode` instances. It is
-not inherently a "validator pool" — it can stamp out any type of node: snapshotters,
-RPC full nodes, validators, etc. The `NodeTemplate` determines what kind of nodes
-the pool creates.
-
-The optional `Ceremony` field is what triggers the genesis ceremony flow. A pool of
-5 snapshotters joining an existing chain would have no `Ceremony` at all — just a
-`NodeTemplate` with `StateSync` and `SnapshotGeneration` configured. A pool of 10
-validators bootstrapping a fresh test network would include `Ceremony` to orchestrate
-identity generation and genesis assembly.
-
-```go
-type SeiNodePoolSpec struct {
-    // ChainID is the chain identifier shared by all nodes in the pool.
-    ChainID string `json:"chainId"`
-
-    // NodeCount is the number of nodes in the pool.
-    // +kubebuilder:validation:Minimum=1
-    NodeCount int32 `json:"nodeCount"`
-
-    // NodeTemplate is the full SeiNodeSpec used to generate each node.
-    // The pool controller copies this into every SeiNode it creates,
-    // overriding ChainID and injecting per-node values (name, ordinal,
-    // PVC references). When Ceremony is set, the controller also injects
-    // Validator config and defers Genesis.S3 until assembly completes.
-    NodeTemplate SeiNodeSpec `json:"nodeTemplate"`
-
-    // Ceremony configures a genesis ceremony for the pool. When set, the pool
-    // controller orchestrates validator identity generation across all nodes
-    // and triggers genesis assembly when all identities are ready.
-    // When nil, nodes bootstrap independently using the sources already
-    // configured in the NodeTemplate (e.g., StateSync, Genesis.S3).
-    // +optional
-    Ceremony *CeremonyConfig `json:"ceremony,omitempty"`
-
-    // Storage controls PVC lifecycle for pool-managed nodes.
-    // +optional
-    Storage SeiNodeStorageConfig `json:"storage,omitempty"`
-}
-
-// SeiNodeSpec is used directly as the template — every field available to a
-// standalone SeiNode is available to pool-created nodes. There is no parallel
-// type to maintain. The pool controller overrides ChainID from the pool-level
-// field and injects per-node values; everything else passes through as-is.
-
-type CeremonyConfig struct {
-    // Validator configures the validator identity for each node.
-    // The pool controller sets ValidatorConfig.Identity on each SeiNode
-    // using the ceremony S3 prefix and the node's ordinal.
-    Validator ValidatorCeremonyConfig `json:"validator"`
-
-    // S3 is the bucket and prefix used for the ceremony artifact exchange.
-    // A unique ceremonyId is appended automatically (derived from pool UID).
-    S3 S3Location `json:"s3"`
-
-    // CeremonyID is an optional user-supplied identifier appended to the S3
-    // prefix. When empty, the pool's resource UID is used. Useful for
-    // deterministic paths in CI/CD.
-    // +optional
-    CeremonyID string `json:"ceremonyId,omitempty"`
-
-    // GenesisParams configures chain-level genesis parameters applied during
-    // assembly (staking denom, gov params, oracle config, etc.).
-    // +optional
-    GenesisParams *GenesisParams `json:"genesisParams,omitempty"`
-}
-
-type ValidatorCeremonyConfig struct {
-    // StakeDenom is the token denomination for staking (e.g. "usei").
-    StakeDenom string `json:"stakeDenom"`
-
-    // StakeAmount is the self-delegation amount per validator (e.g. "1000000usei").
-    StakeAmount string `json:"stakeAmount"`
-}
-```
-
-### Controller reconciliation
-
-The pool controller is a pure orchestrator with no data-plane responsibilities:
-
-1. **Create**: stamp out N SeiNodes from `NodeTemplate`, injecting per-node values
-   (name, ordinal, PVC names). If `Ceremony` is set, each node also gets a
-   `Validator` config pointing to the ceremony S3 prefix, and no genesis source
-   is set yet. If `Ceremony` is nil, nodes are created exactly as the template
-   specifies and bootstrap independently (e.g., a pool of snapshotters with
-   `StateSync` and `SnapshotGeneration` already configured).
-
-2. **Watch** (ceremony only): observe child SeiNode statuses. Each node's init
-   plan completes `init-validator` and then retries `configure-genesis` against
-   S3 (which 404s until assembly is done).
-
-3. **Assemble** (ceremony only): when all N nodes show `init-validator` as
-   `Complete`, the pool controller triggers genesis assembly. This is a lightweight
-   Job that runs `assemble-genesis` using the ceremony S3 prefix. It collects all
-   identities, applies `GenesisParams` patches, and uploads the final `genesis.json`.
-
-4. **Status**: aggregate child SeiNode statuses into pool-level status and
-   conditions. This applies to all pools regardless of whether a ceremony is
-   configured.
-
-### What the current pool implementation becomes
-
-The existing pool code (genesis Job, prep Jobs, EFS PVC, genesis script ConfigMap,
-`reconcileDataPreparation`) is removed entirely. The embedded `generate.sh` script
-is replaced by the `assemble-genesis` sidecar task / assembly Job. The prep Jobs
-are unnecessary because each node owns its data PVC from creation and populates
-it through its own sidecar tasks.
-
-### How the pool detects identity completion
-
-The SeiNode status already tracks per-task status in `InitPlan.Tasks`. The pool
-controller checks whether `init-validator` is `Complete` for each of its child
-SeiNodes. A dedicated status condition (e.g., `ValidatorIdentityReady`) could
-make this more explicit and decouple the pool from init plan internals.
-
-## Examples
-
-### Pool of snapshotters (no ceremony)
-
-```yaml
-apiVersion: sei.io/v1alpha1
-kind: SeiNodePool
-metadata:
-  name: pacific-1-snapshotters
-spec:
-  chainId: pacific-1
-  nodeCount: 5
-  nodeTemplate:
-    image: sei:latest
-    entrypoint:
-      command: ["seid"]
-      args: ["start", "--home", "/sei"]
-    genesis:
-      chainId: pacific-1
-      s3:
-        uri: s3://sei-genesis/pacific-1/genesis.json
-        region: us-east-2
-    snapshotRestore:
-      bucket:
-        uri: s3://pacific-1-snapshots/state-sync/
-      region: eu-central-1
-    peers:
-      sources:
-        - ec2Tags:
-            region: eu-central-1
-            tags: { ChainIdentifier: pacific-1, Component: snapshotter }
-    snapshotGeneration:
-      interval: 2000
-      keepRecent: 5
-      destination:
-        s3:
-          bucket: sei-node-mvp
-          prefix: snapshot/
-          region: eu-central-1
-```
-
-No `ceremony` — each of the 5 nodes boots independently using snapshot restore,
-state sync, and peer discovery. The pool controller just creates them and
-aggregates status.
-
-### Pool of validators bootstrapping a fresh chain (with ceremony)
-
-```yaml
-apiVersion: sei.io/v1alpha1
-kind: SeiNodePool
-metadata:
-  name: arctic-1-validators
-spec:
-  chainId: arctic-1
-  nodeCount: 10
-  nodeTemplate:
-    image: sei:latest
-    entrypoint:
-      command: ["seid"]
-      args: ["start", "--home", "/sei"]
-  ceremony:
-    validator:
-      stakeDenom: usei
-      stakeAmount: "10000000usei"
-    s3:
-      bucket: sei-ceremony
-      prefix: arctic-1/
-      region: us-east-2
-```
-
-The pool controller creates 10 SeiNodes, each with a `Validator` config and no
-genesis source. Each node runs `init-validator`, publishes identity to S3, and
-retries `configure-genesis` until the assembler completes. The pool triggers
-assembly after all 10 identities are ready.
-
-## What This Enables
-
-- **Generic pooling**: `SeiNodePool` works for any node type — validators,
-  snapshotters, RPC nodes — not just genesis ceremonies.
-- **Parallel identity generation**: N nodes init concurrently instead of one Job
-  doing N sequential `seid init` calls.
-- **No EFS dependency**: nodes publish to S3; no ReadWriteMany storage needed.
-- **No prep jobs**: each node owns its own data PVC from the start.
-- **Reusable validator bootstrap**: the same `init-validator` task works for pool
-  nodes and standalone validators joining an existing chain.
-- **Extensible**: new ceremony steps (oracle config, IBC relayer keys, etc.) are
-  just additional sidecar tasks in the plan, not edits to a shell script.
-- **Post-sync hooks**: tasks like `register-validator` use the same plan/retry
-  machinery. No new "hook" abstraction needed — a task that depends on an external
-  condition simply retries until the condition is met.
diff --git a/docs/design/validation-run-lld.md b/docs/design/validation-run-lld.md
deleted file mode 100644
index ca7dac35..00000000
--- a/docs/design/validation-run-lld.md
+++ /dev/null
@@ -1,1588 +0,0 @@
-# ValidationRun — Composable Validation Workloads against Ephemeral Sei Chains (LLD)
-
-## Status
-
-Draft — under council review (`sei-protocol/sei-k8s-controller#139`). Council gate closed 2026-04-28. Architectural refinement (sub-controller rails) added 2026-04-29. Cross-review pending; PR #143 open.
-
-This LLD is the substrate for one or more handoff implementation issues. It does **not** itself authorize implementation — it is the artifact the gate reviews.
-
-## Problem
-
-Three differently shaped validation workloads coexist on the Harbor cluster today with no shared abstraction: the seiload nightly performance run, the qa-testing TS/mocha suite (Phase 1 target), and the in-pod `SeiNode.spec.replayer` shadow-result sidecar. Only the third is first-class on the control plane; the first two are bash glue around `kubectl`, `envsubst`, `kubectl wait`, `kubectl exec seid status`, and `aws s3 cp`. The bash pattern's documented sharp edges (missing `OwnerReferences`, no heartbeat, no structured report, race against `sei-chain` CI handled in shell) are symptoms of orchestrating outside the controller boundary. They don't grow into an abstraction; they compound as more workload shapes (fuzzers, soak, chaos suites, pectra-upgrade tests) come online.
-
-Critically, "validation" in this context isn't just "did the workload exit 0." Many failure modes are infrastructure-level invariants the workload can't see: validators dropped peer connections mid-run, `seid` memory grew unbounded, an alert fired during the load window. The current pattern can't express **observability-as-test-oracle** — composing workload signal with alert/PromQL signal into a single pass/fail.
-
-A second axis of expansion is also already visible: workloads and sequences and chaos injection all want to operate against the same ephemeral chain substrate, but they are different actors with different lifecycles. Hard-coding "ValidationRun owns one Job" forecloses on sequencing (apply N steps in order) and chaos (inject failures during the window). The architecture below treats the chain substrate as one concern and the actors as composable, additively-extensible sub-controllers.
-
-See `sei-protocol/sei-k8s-controller#139` for the full problem statement, OSS survey, and Phase 1 contract dependency. This LLD picks up where #139's open questions left off, integrating Round 1 specialist input, the Round 2 user gate decisions through 2026-04-28, and the Round 3 architectural-refinement pass through 2026-04-29.
-
-## Goals
-
-1. **One CRD, two cooperating sub-controllers in one binary that drive a validation Run to a terminal verdict against an ephemeral chain.** ValidationOrchestrationReconciler always runs; ValidationLoadGenerationReconciler is the v1 actor; ValidationSequenceReconciler and ValidationChaosReconciler are designed-as-extension and gated off by default. The CR carries one plan per controller in `.status.plans.<controllerName>`.
-2. **Workload contract parity with Phase 1** — `ValidationRun.spec.load.workload` adopts the env-vars / exit-codes / termination-message / S3 contract from `sei-protocol/platform#235` verbatim. A Job manifest that runs under Phase 1 GHA glue runs unchanged under the controller.
-3. **Observability-as-test-oracle** — the verdict is `workload_exit_code AND ⋀rules`, where rules are typed (`alert` and `query` in v1) and evaluated continuously over the load window with a deterministic, auditable Prometheus contract. v1 ships continuous polling with stop-on-failure as a first-class behavior. Rules-only Runs (no `load`) are valid: the Run becomes passive monitoring of an existing chain.
-4. **GitOps- and ad-hoc-friendly** — a `ValidationRun` is a self-contained, idempotent resource. Apply from GHA, Flux, kubectl, or a future bot; the controllers take it to a terminal phase exactly once.
-5. **No cross-tenant blast radius** — same-namespace by construction (CEL-enforced) with one narrow exception (`alert.ruleRef` into namespaces labeled `sei.io/validation-shared-rules=true`). Tenants pre-provision SAs; controller is never an IAM controller.
-6. **v2 actor expansion is purely additive.** Adding sequence or chaos = registering a new sub-controller behind a deployment-time opt-in, plus stamping a new plan slot on `.status.plans`. No refactor of ValidationOrchestrationReconciler or ValidationLoadGenerationReconciler.
-
-## Non-goals
-
-Adopted from #139 "Out of scope" plus the Round 1 + user refinements:
-
-- **`ValidationSuite` and `ValidationSchedule` controllers.** Kind names are reserved (CRDs install with v1) and the types are sketched here so v1 `ValidationRun` doesn't paint them into a corner. No reconcilers in v1.
-- **`SequenceSpec` and `ChaosSpec` reconcilers.** Field names reserved on `ValidationRunSpec` so the composable-block union does not have to change shape later. Admission-rejected at v1 until the sub-controllers register.
-- **`IntegrationTest` as a separate kind/discriminator.** The composable-blocks model collapses it: "stand up a chain + run a single container that validates" is just `load + rules` with the workload acting as a verifier. No separate kind needed.
-- **`container` rule type.** Reserved in the rule schema discussion; not in v1. Re-defer until a real heuristic ships outside Prometheus.
-- **`window-end` and `edge` rule modes.** v1 ships `continuous` semantics via the polling loop. Mode field collapses to a single behavior in v1 (no `mode` field on the wire); expansion is additive.
-- **Shadow-replayer migration.** Stays typed on `SeiNode.spec.replayer.resultExport.shadowResult`. Different shape.
-- **Kueue admission control / multi-tenant fairness.** Defer until ≥3 suite-consumers contend.
-- **DAG orchestration in `ValidationSuite`.** Flat sequential / parallel / `stopOnFailure` only when implemented.
-- **Run-vs-Template split / `ValidationDefinition` CR.** Embed inline; reserve `spec.runRef` field name only.
-- **Per-PR triggers, regression diffing, structured-report registry, multi-workload composition, mid-run RPC fleet refresh.** All deferred per #139.
-- **Pushgateway / Prometheus push of report metrics.** v1 pattern is `.status.report.s3Url` only — S3 is authoritative. Aggregation lives downstream.
-- **The controller minting ServiceAccounts or Pod Identity Associations.** Tenant pre-provisions. Forever.
-- **`status.report.raw` (or any termination-message echo into status).** Cut at gate; consumers fetch from the S3 URL.
-- **`pods/exec` for sequence steps.** Rejected one-way door — when ValidationSequenceReconciler ships, transactions submit via short-lived `Job`s targeting RPC, not `kubectl exec` into validator pods.
-
-## Architecture overview
-
-The headline shift from earlier drafts: ValidationRun is reconciled by **two cooperating controllers in one binary**, each owning a distinct slice of the lifecycle and writing a distinct slice of `.status`. v2 actor controllers register identically alongside.
-
-```
-┌──────────────────────────────────────────────────────────────────────────┐
-│                        ValidationRun (CR)                                │
-│  spec.chain.deployments[]   spec.load   spec.rules[]   spec.timeouts     │
-│  status.phase   status.verdict   status.chain   status.rules[]           │
-│  status.report   status.workloadExitCode                                 │
-│  status.plans.orchestration   status.plans.loadGeneration                │
-│  status.conditions[TestRunning, TestComplete, Succeeded,                 │
-│                    TestCancelled, LoadComplete]                          │
-└──────────────────────────────────────────────────────────────────────────┘
-            ▲                                       ▲
-            │ Reconcile()                           │ Reconcile() (predicate-gated)
-            │                                       │
-┌───────────┴──────────────┐         ┌──────────────┴────────────┐
-│ ValidationOrchestrationReconciler  │  ──IPC──▶ ValidationLoadGenerationReconciler  │
-│ (required if validation  │  via    │ (opt-in; default-on within │
-│  is enabled)             │  status │  validation slice)         │
-│ ControllerName=          │  status │ ControllerName=           │
-│   "validationrun-        │  conds  │   "validationrun-         │
-│    orchestration"        │  only   │    loadgeneration"        │
-│                          │         │                           │
-│ Owns: SND children       │         │ Owns: Job, ConfigMap      │
-│ Plan slot:               │         │ Plan slot:                │
-│   .status.plans.         │         │   .status.plans.          │
-│      orchestration       │         │      loadGeneration       │
-│ Field manager:           │         │ Field manager:            │
-│   validationrun-         │         │   validationrun-          │
-│    orchestration         │         │    loadgeneration         │
-│ Writes:                  │         │ Writes:                   │
-│  Conditions[TestRunning, │         │  Conditions[LoadComplete] │
-│             TestComplete,│         │  status.workloadExitCode  │
-│             Succeeded,   │         │  status.plans.            │
-│             TestCancelled│         │     loadGeneration        │
-│            ],            │         │                           │
-│  status.chain.*,         │         │                           │
-│  status.report.*,        │         │                           │
-│  status.rules[],         │         │                           │
-│  status.phase,           │         │                           │
-│  status.verdict,         │         │                           │
-│  status.plans.           │         │                           │
-│     orchestration        │         │                           │
-└──────────────────────────┘         └───────────────────────────┘
-            │                                       │
-            ▼                                       ▼
-   SeiNodeDeployments                     batch/v1.Job + ConfigMap
-   (one per spec.chain.deployments[i])    workload pod(s)
-   role=validator | fullNode              envs from contract
-            │                                       │
-            └─── owns SeiNodes ───┐                 │
-                                  ▼                 ▼
-                    Prometheus /api/v1/query   S3 upload via Pod Identity
-                    (per rule per interval)
-```
-
-The two sub-controllers communicate **only through the CR's `.status`**. There is no direct in-process API between them; both watch the same CR and react to condition flips. This keeps each controller individually testable, individually opt-out-able by deployment configuration, and gives v2 actor sub-controllers a clean integration point.
-
-**Phase machine** (mirrors Tekton/Argo, capitalized — *not* Testkube's lowercase):
-
-```
-                         ┌──────────────┐
-            ┌────────────│   Pending    │
-            │            └──────┬───────┘
-            │                   │ ValidationOrchestrationReconciler builds plan,
-            │                   │ persists, sets phase=Running on persist
-            │                   ▼
-            │            ┌──────────────┐
-            │            │   Running    │── Orchestration plan tasks ──┐
-            │            └──────┬───────┘   (and, gated by             │
-            │                   │            Conditions[TestRunning],  │
-            │                   │            LoadGeneration tasks)     │
-            │       ┌───────────┼─────────────┬────────────┐           │
-            │       ▼           ▼             ▼            ▼           │
-            │  ┌─────────┐ ┌─────────┐  ┌─────────┐  ┌─────────┐       │
-            └─▶│Cancelled│ │Succeeded│  │ Failed  │  │  Error  │◀──────┘
-               └─────────┘ └─────────┘  └─────────┘  └─────────┘
-```
-
-Terminal phases: `Succeeded` / `Failed` / `Error` / `Cancelled`. `Failed` = workload or rules said SUT misbehaved (test verdict). `Error` = controller couldn't ask (Prometheus 5xx, Job-create denial, infra failure that the workload signaled with exit code 2). `Cancelled` = `metadata.deletionTimestamp != nil` OR `Conditions[TestCancelled]=True` (terminal-failure short-circuit). The distinction matters for heartbeat alerting and for retry policy if/when added. **Phase is owned exclusively by ValidationOrchestrationReconciler**; ValidationLoadGenerationReconciler never sets phase, only conditions.
-
-## CRD types
-
-All three kinds (`ValidationRun`, `ValidationSuite`, `ValidationSchedule`) live in `api/v1alpha1/` (group `validation.sei.io`, version `v1alpha1`). All three are namespaced.
-
-### `ValidationRun` — the load-bearing kind
-
-```go
-// validationrun_types.go
-
-// +kubebuilder:object:root=true
-// +kubebuilder:subresource:status
-// +kubebuilder:resource:shortName=vr
-// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase`
-// +kubebuilder:printcolumn:name="Chain",type=string,JSONPath=`.spec.chain.chainId`
-// +kubebuilder:printcolumn:name="Started",type=date,JSONPath=`.status.startTime`
-// +kubebuilder:printcolumn:name="Duration",type=string,JSONPath=`.status.duration`
-// +kubebuilder:printcolumn:name="Verdict",type=string,JSONPath=`.status.verdict`,priority=1
-// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
-type ValidationRun struct {
-    metav1.TypeMeta   `json:",inline"`
-    metav1.ObjectMeta `json:"metadata,omitempty"`
-
-    Spec   ValidationRunSpec   `json:"spec,omitempty"`
-    Status ValidationRunStatus `json:"status,omitempty"`
-}
-```
-
-#### `ValidationRunSpec` — composable optional blocks (no discriminator)
-
-```go
-// +kubebuilder:validation:XValidation:rule="has(self.load) || has(self.sequence) || has(self.chaos) || (has(self.rules) && size(self.rules) > 0)",message="at least one of spec.load, spec.sequence, spec.chaos must be set, or spec.rules must be non-empty (rules-only Runs are valid as passive chain monitoring)"
-// +kubebuilder:validation:XValidation:rule="!has(self.sequence)",message="spec.sequence is reserved for v2; the ValidationSequenceReconciler is not registered in this controller version"
-// +kubebuilder:validation:XValidation:rule="!has(self.chaos)",message="spec.chaos is reserved for v2; the ValidationChaosReconciler is not registered in this controller version"
-type ValidationRunSpec struct {
-    // Chain is the ephemeral Sei chain this run executes against. Required.
-    Chain ChainSpec `json:"chain"`
-
-    // Rules is the set of typed validation rules evaluated alongside any actor
-    // (load/sequence/chaos) over the run window. Pass = ⋀(rule.Passed) AND
-    // (workload exit 0 if load is set). A rules-only Run (rules but no actor)
-    // becomes passive monitoring of an existing chain.
-    // +optional
-    // +listType=map
-    // +listMapKey=name
-    // +kubebuilder:validation:MaxItems=32
-    Rules []ValidationRule `json:"rules,omitempty"`
-
-    // Results configures where artifacts are uploaded. Workloads upload via
-    // Pod Identity; the controller stamps `status.report.s3Url`.
-    // +optional
-    Results *ResultsSpec `json:"results,omitempty"`
-
-    // Timeouts bound the run's wall-clock at named lifecycle points.
-    // +optional
-    Timeouts *RunTimeouts `json:"timeouts,omitempty"`
-
-    // ----- Composable optional actors. Multiple may coexist
-    //       (e.g., load + sequence; chaos + load).
-
-    // Load is the v1 actor — a containerized load generator owned by the
-    // ValidationLoadGenerationReconciler. Reconciler-gated; field is admissible in v1.
-    // +optional
-    Load *LoadSpec `json:"load,omitempty"`
-
-    // Sequence is reserved for v2 — an ordered list of state-change steps
-    // (governance proposal, validator-set churn, IBC bring-up). Admission
-    // rejects spec.sequence in v1 until ValidationSequenceReconciler ships.
-    // +optional
-    Sequence *SequenceSpec `json:"sequence,omitempty"`
-
-    // Chaos is reserved for v2 — a chaos-mesh-driven fault injection actor.
-    // Admission rejects spec.chaos in v1 until ValidationChaosReconciler ships AND
-    // tenant namespace carries label sei.io/chaos-allowed=true AND controller
-    // is built with --enable-chaos-plan.
-    // +optional
-    Chaos *ChaosSpec `json:"chaos,omitempty"`
-}
-```
-
-##### Composable-blocks rationale
-
-The earlier draft used a single discriminator `spec.type ∈ {LoadTest, SequenceTest, IntegrationTest}` with a body wrapper per kind. Three problems surfaced:
-
-1. **Sequence + load + chaos compose naturally.** Real test shapes mix actors: "submit a governance proposal (sequence), then drive load (load), then check rules (rules)." A discriminator forces separate Runs and an external orchestrator; composable blocks let one Run carry all three actors against one chain.
-2. **`IntegrationTest` collapses into `load + rules`.** "Stand up a chain, run one container that validates" is identical to a load actor whose Job just performs assertions instead of generating load. Splitting it into a separate kind was duplication.
-3. **The discriminator wrapper produced a one-way door.** If `spec.type` is enum-locked, adding "load + sequence in one Run" later requires either a fourth discriminator value or a schema migration. Composable blocks expand additively.
-
-CEL admission keeps the safety net: at least one actor block must be set, or `rules` must be non-empty (a rules-only Run). v2 reconcilers register additively. v1's two reserved blocks (`sequence`, `chaos`) are CEL-rejected at admission so users see a deterministic error rather than silently ignored fields.
-
-#### `ChainSpec` — list of named SeiNodeDeployment configs
-
-The chain substrate is one or more `SeiNodeDeployment` children (one per entry in `chain.deployments[]`). Each entry names a deployment, declares its role (`validator` or `fullNode`), and embeds the full `SeiNodeDeploymentSpec` shape — operators write the same `template.spec.image / entrypoint / overrides / dataVolume` they already know from the existing nightly template.
-
-```go
-// +kubebuilder:validation:XValidation:rule="self.chainId.matches('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')",message="chainId must be lowercase alphanumeric with optional interior hyphens"
-// +kubebuilder:validation:XValidation:rule="self.deployments.exists_one(d, d.role == 'validator')",message="exactly one deployment with role=validator is required"
-// +kubebuilder:validation:XValidation:rule="self.deployments.exists(d, d.role == 'fullNode')",message="at least one deployment with role=fullNode is required"
-// +kubebuilder:validation:XValidation:rule="self.deployments.all(d, d.role != 'fullNode' || !has(d.spec.genesis))",message="role=fullNode deployments must not declare genesis (full nodes inherit genesis from the validator ceremony)"
-type ChainSpec struct {
-    // ChainID for the ephemeral chain. The controller injects this into
-    // every materialized SND's genesis and into every SeiNode's chainId.
-    // +kubebuilder:validation:MinLength=1
-    // +kubebuilder:validation:MaxLength=64
-    ChainID string `json:"chainId"`
-
-    // Deployments is the list of named SeiNodeDeployment configs that
-    // constitute this chain. Exactly one must have role=validator (the
-    // genesis-ceremony fleet); at least one must have role=fullNode (the
-    // RPC surface workloads connect to). Names are unique within the list.
-    //
-    // The controller materializes one SeiNodeDeployment per entry, named
-    // {chainId}-{deployments[i].name}, with OwnerReferences to the
-    // ValidationRun. Cascade-delete works.
-    //
-    // +listType=map
-    // +listMapKey=name
-    // +kubebuilder:validation:MinItems=2
-    // +kubebuilder:validation:MaxItems=8
-    Deployments []ChainDeployment `json:"deployments"`
-}
-
-// +kubebuilder:validation:XValidation:rule="self.role != 'validator' || !has(self.spec.template.spec.fullNode)",message="role=validator deployment must not set template.spec.fullNode"
-// +kubebuilder:validation:XValidation:rule="self.role != 'fullNode' || !has(self.spec.template.spec.validator)",message="role=fullNode deployment must not set template.spec.validator"
-type ChainDeployment struct {
-    // Name is unique within deployments[]. DNS-label. Used as the suffix
-    // on the materialized SND name and as the value of the
-    // sei.io/deployment-name label on owned objects.
-    // +kubebuilder:validation:MinLength=1
-    // +kubebuilder:validation:MaxLength=63
-    // +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
-    Name string `json:"name"`
-
-    // Role discriminates the deployment's purpose for genesis-ceremony
-    // wiring and default peer-selector injection.
-    // +kubebuilder:validation:Enum=validator;fullNode
-    Role ChainDeploymentRole `json:"role"`
-
-    // Spec is the embedded SeiNodeDeploymentSpec. Controller injects:
-    //  - genesis.chainId = chainId (validator role only; if unset)
-    //  - template.spec.chainId = chainId (if unset)
-    //  - template.spec.validator: {} (validator role; if unset)
-    //  - template.spec.fullNode: {} (fullNode role; if unset)
-    //  - default peers: validators peer by sei.io/chain-id label;
-    //    fullNodes peer by sei.io/nodedeployment label pointing at
-    //    the validator-role SND. Injected when user omits peers.
-    //  - labels: sei.io/managed-by=validationrun, sei.io/run-id=<runName>,
-    //    sei.io/deployment-name=<name>, sei.io/chain-id=<chainId>
-    //
-    // User-set fields take precedence on every field except the role
-    // discriminator and chainId — those two are controller-owned.
-    Spec sei_v1alpha1.SeiNodeDeploymentSpec `json:"spec"`
-}
-
-// +kubebuilder:validation:Enum=validator;fullNode
-type ChainDeploymentRole string
-
-const (
-    ChainDeploymentRoleValidator ChainDeploymentRole = "validator"
-    ChainDeploymentRoleFullNode  ChainDeploymentRole = "fullNode"
-)
-```
-
-##### Why exactly one validator + ≥1 fullNode
-
-Three things drove the gate decision:
-
-1. **The workload always wants RPC parity with users.** Validators run with mempool admission rules and consensus duty cycles that surface latency artifacts unrelated to the SUT under load. Full nodes are the realistic RPC surface. The workload always connects to fullNode-role endpoints; there is no validator-fleet-endpoint mode in v1.
-2. **One genesis ceremony per chain.** Allowing N validator deployments multiplies the genesis ceremony into a coordination problem the SND controller doesn't model. v1 codifies "exactly one validator deployment" so the existing genesis-ceremony plan applies cleanly.
-3. **List-of-deployments cleanly absorbs heterogeneity.** Some chains want a second fullNode fleet (e.g., archive-mode fullNodes for historical RPC, regular fullNodes for the workload). The list shape lets that compose without a new top-level field; both fleets are role=fullNode but differ in their embedded `Spec`.
-
-This generalizes — but does not remove — the prior "two-SNDs-always-materialized" rule.
-
-##### What the prior `endpointPolicy` knob was
-
-The previous draft considered an `endpointPolicy: validators|fullNodes` field for which fleet the workload connects to. It is **dropped permanently**. With the deployments-list shape, the answer is always "the deployment(s) with `role: fullNode`." `resolve-endpoints` (Orchestration plan task) selects all fullNode-role SNDs' headless Services; if multiple fullNode-role deployments exist, the workload sees the union of their pod DNS names. Controllers don't pick "which fullNode fleet" — that's a future expansion handled by adding a `chain.deployments[i].endpointSelector` field if and when needed.
-
-##### Embedded-SND trade-off (load-bearing)
-
-The embedded `SeiNodeDeploymentSpec` couples ValidationRun's CRD schema to SeiNodeDeployment's CRD schema, so SND breaking changes break ValidationRun. Mitigation: the same controller binary owns both, so they version together. If/when validation gets its own controller binary or a separate version cadence, switch to a `ChainTemplate` projection that copies fields explicitly. Document this as a re-evaluation trigger: revisit once at least one SND breaking change has forced ValidationRun's hand.
-
-#### `LoadSpec` — the v1 actor body
-
-`spec.load.workload` is the Phase 1 contract envelope, adopted **verbatim**.
-
-```go
-type LoadSpec struct {
-    // Workload is the containerized load generator. The
-    // ValidationLoadGenerationReconciler materializes this as a batch/v1.Job named
-    // {runName} owned by the ValidationRun.
-    Workload WorkloadSpec `json:"workload"`
-
-    // Duration is the run's load window. Surfaced to the workload as
-    // DURATION_SECONDS. The controller enforces wallclock via
-    // spec.timeouts.runDuration (defaults from this).
-    // +kubebuilder:validation:MinDuration=1s
-    Duration metav1.Duration `json:"duration"`
-
-    // Replicas is the load-gen pod parallelism. Controller injects
-    // SHARD_INDEX and SHARD_COUNT env per k6's execution-segment idiom
-    // (CLI-arg / env-injected, not magic discovery). Default 1.
-    // +optional
-    // +kubebuilder:default=1
-    // +kubebuilder:validation:Minimum=1
-    // +kubebuilder:validation:Maximum=64
-    Replicas int32 `json:"replicas,omitempty"`
-}
-
-// WorkloadSpec is the Phase 1 contract (sei-protocol/platform#235), one-to-one.
-//
-// Reserved-env-var rejection: any user-supplied env entry whose `name` collides
-// with a controller-injected variable is rejected at admission via CEL.
-//
-// +kubebuilder:validation:XValidation:rule="!self.env.exists(e, e.name in ['CHAIN_RPC_URL','CHAIN_WS_URL','CHAIN_ID','RUN_ID','RESULT_DIR','DURATION_SECONDS','NAMESPACE','SHARD_INDEX','SHARD_COUNT'])",message="env names CHAIN_RPC_URL, CHAIN_WS_URL, CHAIN_ID, RUN_ID, RESULT_DIR, DURATION_SECONDS, NAMESPACE, SHARD_INDEX, SHARD_COUNT are reserved by the validation controller and must not be set in spec.load.workload.env"
-type WorkloadSpec struct {
-    // Name is the workload identity used in S3 paths and metric labels
-    // (e.g. "evm_transfer", "tokens"). Lowercase alphanumerics and underscores only; the Pattern below rejects hyphens.
-    // +kubebuilder:validation:MinLength=1
-    // +kubebuilder:validation:MaxLength=63
-    // +kubebuilder:validation:Pattern=`^[a-z0-9][a-z0-9_]*$`
-    Name string `json:"name"`
-
-    // Image is the workload container image (digest-pinned recommended).
-    // +kubebuilder:validation:MinLength=1
-    Image string `json:"image"`
-
-    // Command and Args override the image entrypoint.
-    // +optional
-    Command []string `json:"command,omitempty"`
-    // +optional
-    Args []string `json:"args,omitempty"`
-
-    // Env are user-supplied env vars merged with controller-injected ones.
-    // Controller-injected vars (CHAIN_RPC_URL, CHAIN_WS_URL, CHAIN_ID, RUN_ID,
-    // RESULT_DIR, DURATION_SECONDS, NAMESPACE, SHARD_INDEX, SHARD_COUNT) are
-    // reserved names — the CRD-level XValidation rule above hard-rejects any
-    // user attempt to set them.
-    // +optional
-    Env []corev1.EnvVar `json:"env,omitempty"`
-
-    // EnvFrom is forwarded verbatim. Tenant SAs control what's mountable.
-    // +optional
-    EnvFrom []corev1.EnvFromSource `json:"envFrom,omitempty"`
-
-    // Resources is forwarded verbatim onto the Job's Pod template.
-    // +optional
-    Resources corev1.ResourceRequirements `json:"resources,omitempty"`
-
-    // ServiceAccountName names the Pod Identity-bound SA. Defaults to
-    // "{namespace}-runner". Tenant pre-provisions; controller validates
-    // existence at the apply-job task.
-    // +optional
-    ServiceAccountName string `json:"serviceAccountName,omitempty"`
-
-    // Config is rendered to a ConfigMap and mounted at /etc/validation/config/.
-    // Controller-side fixed-substitution applied: ${chainId}, ${rpcEndpoints},
-    // ${runId}, ${namespace} are replaced inline before the ConfigMap is
-    // applied. Other ${...} sequences pass through verbatim.
-    // +optional
-    Config *WorkloadConfigSpec `json:"config,omitempty"`
-
-    // PodTemplate exposes a narrow set of pod-level fields (nodeSelector,
-    // tolerations, securityContext, volumes/volumeMounts merged with the
-    // controller's). Locked to a small allowlist; not a full PodSpec.
-    // +optional
-    PodTemplate *WorkloadPodTemplate `json:"podTemplate,omitempty"`
-}
-
-// WorkloadConfigSpec is rendered to a ConfigMap with one key per file.
-// File contents are arbitrary text (JSON, YAML, INI, plain).
-// +kubebuilder:validation:XValidation:rule="size(self.files) >= 1",message="config.files must have at least one entry"
-type WorkloadConfigSpec struct {
-    Files     map[string]string `json:"files"`
-    // +optional
-    MountPath string            `json:"mountPath,omitempty"`
-}
-
-type WorkloadPodTemplate struct {
-    // +optional
-    NodeSelector    map[string]string         `json:"nodeSelector,omitempty"`
-    // +optional
-    Tolerations     []corev1.Toleration       `json:"tolerations,omitempty"`
-    // +optional
-    SecurityContext *corev1.PodSecurityContext `json:"securityContext,omitempty"`
-    // +optional
-    Volumes         []corev1.Volume           `json:"volumes,omitempty"`
-    // +optional
-    VolumeMounts    []corev1.VolumeMount      `json:"volumeMounts,omitempty"`
-    // +optional
-    Affinity        *corev1.Affinity          `json:"affinity,omitempty"`
-}
-```
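-
-The fixed substitution described on `Config` above can be sketched with a single-pass replacer. This is a minimal illustration, not the controller's implementation: `renderConfigFile` and the `substitutions` struct are hypothetical names, and joining `${rpcEndpoints}` with commas is an assumption, since the source does not say how the endpoint list is rendered.
-
-```go
-package main
-
-import (
-    "fmt"
-    "strings"
-)
-
-// substitutions holds the four values the controller is described as
-// substituting. The struct itself is hypothetical.
-type substitutions struct {
-    ChainID      string
-    RunID        string
-    Namespace    string
-    RPCEndpoints []string
-}
-
-// renderConfigFile replaces the four known placeholders and leaves every
-// other ${...} sequence untouched.
-func renderConfigFile(content string, s substitutions) string {
-    r := strings.NewReplacer(
-        "${chainId}", s.ChainID,
-        "${rpcEndpoints}", strings.Join(s.RPCEndpoints, ","), // assumption: comma-joined
-        "${runId}", s.RunID,
-        "${namespace}", s.Namespace,
-    )
-    return r.Replace(content)
-}
-
-func main() {
-    out := renderConfigFile(
-        "chain=${chainId} rpc=${rpcEndpoints} keep=${other}",
-        substitutions{
-            ChainID:      "sei-test-1",
-            RunID:        "run-1",
-            Namespace:    "qa",
-            RPCEndpoints: []string{"a:26657", "b:26657"},
-        },
-    )
-    fmt.Println(out)
-}
-```
-
-Because `strings.NewReplacer` makes one pass over the input, unknown `${...}` sequences pass through untouched and substituted values are not expanded a second time.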
-
-##### Reserved environment variables (controller-injected)
-
-Adopted verbatim from the Phase 1 contract (sei-protocol/platform#235):
-
-| Var | Value |
-|---|---|
-| `CHAIN_RPC_URL` | First resolved RPC endpoint (always fullNode-role deployment(s)). Tendermint port. |
-| `CHAIN_WS_URL`  | Same host, EVM WebSocket port (8546). |
-| `CHAIN_ID` | `spec.chain.chainId` |
-| `RUN_ID` | `metadata.name` of the ValidationRun (no UID — name is unique within ns) |
-| `RESULT_DIR` | `/var/run/validation/results` (mounted emptyDir) |
-| `DURATION_SECONDS` | `spec.load.duration.Seconds()` |
-| `NAMESPACE` | `metadata.namespace` |
-| `SHARD_INDEX` | Injected per pod (0..replicas-1) via Indexed-Job + downward-API on `JOB_COMPLETION_INDEX` |
-| `SHARD_COUNT` | `spec.load.replicas` |
-
-These names are CRD-rejected when supplied in `spec.load.workload.env` — see the `XValidation` rule on `WorkloadSpec` above.
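-
-The CEL rule rejects collisions at admission; a controller could repeat the same check in Go as defense in depth before injecting its own variables. The sketch below assumes a hypothetical helper `reservedCollisions` that takes only the user-supplied env names, and it is not taken from the controller source.
-
-```go
-package main
-
-import (
-    "fmt"
-    "sort"
-)
-
-// reservedEnv mirrors the controller-injected names in the table above.
-var reservedEnv = map[string]struct{}{
-    "CHAIN_RPC_URL": {}, "CHAIN_WS_URL": {}, "CHAIN_ID": {},
-    "RUN_ID": {}, "RESULT_DIR": {}, "DURATION_SECONDS": {},
-    "NAMESPACE": {}, "SHARD_INDEX": {}, "SHARD_COUNT": {},
-}
-
-// reservedCollisions (hypothetical) returns the user-supplied names that
-// collide with a reserved name, sorted so error messages are stable.
-// Matching is case-sensitive, like the CEL rule.
-func reservedCollisions(userEnvNames []string) []string {
-    var hits []string
-    for _, name := range userEnvNames {
-        if _, reserved := reservedEnv[name]; reserved {
-            hits = append(hits, name)
-        }
-    }
-    sort.Strings(hits)
-    return hits
-}
-
-func main() {
-    // prints [CHAIN_ID RUN_ID]
-    fmt.Println(reservedCollisions([]string{"LOG_LEVEL", "RUN_ID", "CHAIN_ID"}))
-}
-```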
-
-##### Exit-code semantics
-
-Adopted verbatim from Phase 1:
-
-- `0` → workload Succeeded path (rules still gate the verdict)
-- `1` → workload Failed path (`verdict=Failed`, `Reason=WorkloadAssertionFailed`)
-- `2` → infra-level failure (`verdict=Error`, `Reason=WorkloadInfraFailure`). Run terminates in `Error`, *not* `Failed`. Heartbeat alert ignores `Error` so blast radius is bounded.
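-
-The mapping above is small enough to express as a single switch. The following is a hedged sketch: `classifyExit` and the `outcome` struct are hypothetical, and returning an error for codes other than 0, 1, and 2 is an assumption, because the contract as quoted here does not define them.
-
-```go
-package main
-
-import "fmt"
-
-// outcome is a hypothetical carrier for the verdict and reason strings
-// named in the list above.
-type outcome struct {
-    Verdict   string
-    Reason    string
-    RulesGate bool // true when the rules, not the workload, decide the verdict
-}
-
-// classifyExit maps a workload exit code onto the documented semantics.
-func classifyExit(code int32) (outcome, error) {
-    switch code {
-    case 0:
-        // Workload succeeded; the verdict is still gated by the rules.
-        return outcome{RulesGate: true}, nil
-    case 1:
-        return outcome{Verdict: "Failed", Reason: "WorkloadAssertionFailed"}, nil
-    case 2:
-        return outcome{Verdict: "Error", Reason: "WorkloadInfraFailure"}, nil
-    default:
-        // Assumption: codes outside the contract are surfaced to the caller.
-        return outcome{}, fmt.Errorf("exit code %d is outside the 0/1/2 contract", code)
-    }
-}
-
-func main() {
-    for _, code := range []int32{0, 1, 2} {
-        o, _ := classifyExit(code)
-        fmt.Printf("%d -> %+v\n", code, o)
-    }
-}
-```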
-
-#### `SequenceSpec` and `ChaosSpec` — reserved v2 placeholders
-
-```go
-// SequenceSpec is reserved for v2. Field name and outer struct shape are
-// load-bearing one-way doors; the body is intentionally empty until
-// ValidationSequenceReconciler ships. Admission rejects spec.sequence in v1.
-//
-// Sketch of v2 shape: ordered list of typed steps {applyPatch, submitTx,
-// awaitHeight, registerValidator, …}. Each step submits its work via a
-// short-lived Job targeting the chain RPC service — pods/exec is rejected
-// (see one-way doors). Reconciler is gated by predicate:
-// spec.sequence != nil AND Conditions[TestRunning]=True.
-type SequenceSpec struct {
-    // Reserved. Adding fields here is additive; renaming SequenceSpec is breaking.
-}
-
-// ChaosSpec is reserved for v2. Body intentionally empty. Admission rejects
-// spec.chaos in v1; in v2 admission additionally requires the run's namespace
-// to carry label sei.io/chaos-allowed=true AND controller built with
-// --enable-chaos-plan flag. Reconciler is gated by predicate:
-// spec.chaos != nil AND Conditions[TestRunning]=True.
-type ChaosSpec struct {
-    // Reserved. v2 sketch: list of chaos-mesh experiments (NetworkChaos,
-    // PodChaos, IOChaos) scoped to the chain's SND children, applied during
-    // the run window with cleanup at finalize.
-}
-```
-
-#### `ValidationRule` — alert + query schema
-
-```go
-// +kubebuilder:validation:XValidation:rule="(has(self.alert)?1:0) + (has(self.query)?1:0) == 1",message="exactly one of alert or query must be set"
-// +kubebuilder:validation:XValidation:rule="!has(self.alert) || self.type == 'alert'",message="alert body requires type=alert"
-// +kubebuilder:validation:XValidation:rule="!has(self.query) || self.type == 'query'",message="query body requires type=query"
-type ValidationRule struct {
-    // Name is unique within spec.rules. DNS-label.
-    // +kubebuilder:validation:MinLength=1
-    // +kubebuilder:validation:MaxLength=63
-    // +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
-    Name string `json:"name"`
-
-    // Type discriminates the rule body. v1: alert, query.
-    // +kubebuilder:validation:Enum=alert;query
-    Type ValidationRuleType `json:"type"`
-
-    // +optional
-    Alert *AlertRule `json:"alert,omitempty"`
-    // +optional
-    Query *QueryRule `json:"query,omitempty"`
-
-    // RunProperties tunes evaluation behavior. Defaults documented per-type.
-    // +optional
-    RunProperties *RuleRunProperties `json:"runProperties,omitempty"`
-}
-
-// +kubebuilder:validation:Enum=alert;query
-type ValidationRuleType string
-
-const (
-    ValidationRuleTypeAlert ValidationRuleType = "alert"
-    ValidationRuleTypeQuery ValidationRuleType = "query"
-)
-
-type AlertRule struct {
-    RuleRef AlertRuleRef     `json:"ruleRef"`
-    // +optional
-    // +kubebuilder:default="30s"
-    Timeout *metav1.Duration `json:"timeout,omitempty"`
-}
-
-// +kubebuilder:validation:XValidation:rule="self.alertname.size() > 0",message="alertname is required"
-type AlertRuleRef struct {
-    // +kubebuilder:validation:MinLength=1
-    Name      string `json:"name"`
-    // Same-namespace as the Run, OR a namespace labeled
-    // sei.io/validation-shared-rules=true (e.g., monitoring).
-    // +kubebuilder:validation:MinLength=1
-    Namespace string `json:"namespace"`
-    // +kubebuilder:validation:MinLength=1
-    Alertname string `json:"alertname"`
-}
-
-type QueryRule struct {
-    // +kubebuilder:validation:MinLength=1
-    PromQL    string           `json:"promql"`
-    // +kubebuilder:validation:Enum=">=";"<=";">";"<";"==";"!="
-    Op        QueryComparator  `json:"op"`
-    // OTel Round 1: numeric→string is breaking; string→numeric is additive.
-    // +kubebuilder:validation:Pattern=`^-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$`
-    Threshold string           `json:"threshold"`
-    // +optional
-    // +kubebuilder:default="30s"
-    Timeout   *metav1.Duration `json:"timeout,omitempty"`
-}
-
-type RuleRunProperties struct {
-    // Interval is the polling cadence. Default 30s. Min 5s, max 5m.
-    // +optional
-    // +kubebuilder:default="30s"
-    // +kubebuilder:validation:XValidation:rule="duration(self) >= duration('5s') && duration(self) <= duration('5m')",message="interval must be between 5s and 5m"
-    Interval *metav1.Duration `json:"interval,omitempty"`
-
-    // StopOnFailure: a Failed verdict short-circuits the run. The
-    // ValidationOrchestrationReconciler sets Conditions[TestCancelled]=True,
-    // which actor reconcilers (ValidationLoadGenerationReconciler) observe and
-    // halt cooperatively. The orchestration plan's monitor-task-completion
-    // task reads this condition and exits Failed.
-    // +optional
-    // +kubebuilder:default=false
-    StopOnFailure bool `json:"stopOnFailure,omitempty"`
-
-    // Retry is the number of consecutive Errors tolerated before
-    // a rule matures into a permanent Error verdict. Defaults to 3.
-    // Does NOT retry on Failed (verdicts of Failed are monotonic).
-    // +optional
-    // +kubebuilder:default=3
-    // +kubebuilder:validation:Minimum=0
-    // +kubebuilder:validation:Maximum=10
-    Retry int32 `json:"retry,omitempty"`
-}
-```
-
-There is no `mode` field on `ValidationRule` in v1: there is exactly one evaluation cadence (continuous polling, with a final sweep at run terminal). `edge` and `window-end` are additive future modes.
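-
-Because `QueryRule.Threshold` is a string, the evaluator has to parse it before applying `Op`. A minimal sketch of that step follows. The `compare` function is hypothetical, and reading a `true` result as a passing rule is an assumption that the schema alone does not settle.
-
-```go
-package main
-
-import (
-    "fmt"
-    "math"
-    "strconv"
-)
-
-// compare reports whether "actual op threshold" holds. The threshold is
-// parsed from its string form, matching the CRD's string-typed field.
-func compare(actual float64, op, threshold string) (bool, error) {
-    t, err := strconv.ParseFloat(threshold, 64)
-    if err != nil {
-        return false, fmt.Errorf("threshold %q: %w", threshold, err)
-    }
-    if math.IsNaN(actual) {
-        // RuleStatus lists NaN as an Error reason, so it is not compared.
-        return false, fmt.Errorf("actual value is NaN")
-    }
-    switch op {
-    case ">=":
-        return actual >= t, nil
-    case "<=":
-        return actual <= t, nil
-    case ">":
-        return actual > t, nil
-    case "<":
-        return actual < t, nil
-    case "==":
-        return actual == t, nil
-    case "!=":
-        return actual != t, nil
-    default:
-        return false, fmt.Errorf("unsupported op %q", op)
-    }
-}
-
-func main() {
-    fmt.Println(compare(99.2, ">=", "99.0"))
-}
-```
-
-Keeping the parse inside the evaluator means a malformed threshold surfaces as an error rather than as a silent comparison against zero.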
-
-#### `ResultsSpec` and `RunTimeouts`
-
-```go
-type ResultsSpec struct {
-    // Reserved for future bucket overrides. v1 has no user-tunable fields.
-    // +optional
-    S3 *S3ResultsSpec `json:"s3,omitempty"`
-}
-
-type S3ResultsSpec struct {
-    // Reserved. v1 ignores user-supplied bucket/prefix.
-}
-
-type RunTimeouts struct {
-    // ChainReady caps wait-chain-ready (Orchestration plan task #2). Default 20m.
-    // +optional
-    // +kubebuilder:default="20m"
-    ChainReady *metav1.Duration `json:"chainReady,omitempty"`
-
-    // RunDuration caps the workload Job's wall-clock from apply to terminal.
-    // When omitted and spec.load is set, defaults to spec.load.duration + 5m.
-    // For rules-only Runs (no spec.load), defaults to 1h.
-    // +optional
-    RunDuration *metav1.Duration `json:"runDuration,omitempty"`
-}
-```
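-
-The `RunDuration` defaulting described in the comment above reduces to three cases. The sketch below uses plain `time.Duration` pointers instead of `metav1.Duration` to stay self-contained, and `effectiveRunDuration` is a hypothetical name.
-
-```go
-package main
-
-import (
-    "fmt"
-    "time"
-)
-
-// effectiveRunDuration resolves the run's wall-clock cap.
-// explicit models spec.timeouts.runDuration; loadDuration models
-// spec.load.duration and is nil for a rules-only Run.
-func effectiveRunDuration(explicit, loadDuration *time.Duration) time.Duration {
-    if explicit != nil {
-        return *explicit
-    }
-    if loadDuration != nil {
-        return *loadDuration + 5*time.Minute
-    }
-    return time.Hour
-}
-
-func main() {
-    load := 10 * time.Minute
-    fmt.Println(effectiveRunDuration(nil, &load)) // prints 15m0s
-    fmt.Println(effectiveRunDuration(nil, nil))   // prints 1h0m0s
-}
-```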
-
-#### `ValidationRunStatus` — partitioned across plan slots and conditions
-
-```go
-type ValidationRunStatus struct {
-    // ObservedGeneration tracks the spec generation processed by the latest
-    // reconcile. Set by ValidationOrchestrationReconciler — it is the single
-    // generation-tracking authority across both controllers.
-    // +optional
-    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
-
-    // Phase is the high-level lifecycle state. Pascal-case.
-    // OWNER: ValidationOrchestrationReconciler.
-    // +kubebuilder:validation:Enum=Pending;Running;Succeeded;Failed;Error;Cancelled
-    // +optional
-    Phase ValidationRunPhase `json:"phase,omitempty"`
-
-    // Verdict is a denormalized human-friendly summary that mirrors a
-    // condition. Computed at terminal phase entry.
-    // OWNER: ValidationOrchestrationReconciler.
-    // +optional
-    // +kubebuilder:validation:Enum=Passed;Failed;Awaited;Error
-    Verdict ValidationVerdict `json:"verdict,omitempty"`
-
-    // StartTime / CompletionTime / Duration. OWNER: ValidationOrchestrationReconciler.
-    // +optional
-    StartTime      *metav1.Time `json:"startTime,omitempty"`
-    // +optional
-    CompletionTime *metav1.Time `json:"completionTime,omitempty"`
-    // +optional
-    Duration       string       `json:"duration,omitempty"`
-
-    // Plans carries one TaskPlan slot per registered controller. Each
-    // controller writes ONLY its own slot via its dedicated SSA field
-    // manager. There is no shared "active plan" — Orchestration and
-    // LoadGeneration progress in parallel against their own slots.
-    // +optional
-    Plans ValidationRunPlans `json:"plans,omitempty"`
-
-    // Chain reports the materialized SND children and resolved endpoints.
-    // OWNER: ValidationOrchestrationReconciler.
-    // +optional
-    Chain *ChainStatus `json:"chain,omitempty"`
-
-    // WorkloadExitCode is the exit code captured from the workload Job's
-    // pod once the Job reaches terminal state. 0/1/2 per Phase 1 contract.
-    // OWNER: ValidationLoadGenerationReconciler.
-    // Unset until Conditions[LoadComplete]=True.
-    // +optional
-    WorkloadExitCode *int32 `json:"workloadExitCode,omitempty"`
-
-    // Rules carries per-rule verdicts and supporting evidence. Updated
-    // continuously by the orchestration plan's monitor-task-completion task.
-    // OWNER: ValidationOrchestrationReconciler.
-    // +listType=map
-    // +listMapKey=name
-    // +optional
-    Rules []RuleStatus `json:"rules,omitempty"`
-
-    // Report carries the resolved S3 URL. OWNER: ValidationOrchestrationReconciler.
-    // +optional
-    Report *ReportStatus `json:"report,omitempty"`
-
-    // FailedPlan names the plan slot that caused the run to terminate in
-    // Failed/Error. v1 only ever stamps "loadGeneration" or empty (when the
-    // orchestration plan itself failed in chain bring-up). Reserved now so
-    // v2 reconcilers (sequence, chaos) can stamp "sequence"/"chaos" without
-    // schema churn. OWNER: ValidationOrchestrationReconciler.
-    // +optional
-    FailedPlan string `json:"failedPlan,omitempty"`
-
-    // Conditions includes Tekton-style Succeeded plus the run's coordination
-    // signals. See "Conditions and the single-writer table".
-    // +listType=map
-    // +listMapKey=type
-    // +optional
-    Conditions []metav1.Condition `json:"conditions,omitempty"`
-}
-
-type ValidationRunPlans struct {
-    // Orchestration is the chain-bring-up + monitoring + finalize plan.
-    // OWNER: ValidationOrchestrationReconciler (field manager: validationrun-orchestration).
-    // +optional
-    Orchestration *sei_v1alpha1.TaskPlan `json:"orchestration,omitempty"`
-
-    // LoadGeneration is the workload-Job lifecycle plan.
-    // OWNER: ValidationLoadGenerationReconciler (field manager: validationrun-loadgeneration).
-    // Nil when spec.load is unset (rules-only Run) or when the
-    // ValidationLoadGenerationReconciler is disabled by deployment opt-in.
-    // +optional
-    LoadGeneration *sei_v1alpha1.TaskPlan `json:"loadGeneration,omitempty"`
-
-    // Reserved for v2 actor controllers. Adding fields here is additive.
-    // +optional
-    // Sequence *sei_v1alpha1.TaskPlan `json:"sequence,omitempty"`
-    // +optional
-    // Chaos    *sei_v1alpha1.TaskPlan `json:"chaos,omitempty"`
-}
-
-// +kubebuilder:validation:Enum=Pending;Running;Succeeded;Failed;Error;Cancelled
-type ValidationRunPhase string
-
-const (
-    PhasePending   ValidationRunPhase = "Pending"
-    PhaseRunning   ValidationRunPhase = "Running"
-    PhaseSucceeded ValidationRunPhase = "Succeeded"
-    PhaseFailed    ValidationRunPhase = "Failed"
-    PhaseError     ValidationRunPhase = "Error"
-    PhaseCancelled ValidationRunPhase = "Cancelled"
-)
-
-// +kubebuilder:validation:Enum=Passed;Failed;Awaited;Error
-type ValidationVerdict string
-
-type ChainStatus struct {
-    // Deployments mirrors spec.chain.deployments[].name with the resolved
-    // SND child name and its current phase.
-    // +listType=map
-    // +listMapKey=name
-    Deployments  []ChainDeploymentStatus `json:"deployments"`
-    // RPCEndpoints is the union of fullNode-role deployments' resolved
-    // headless-Service pod DNS names + ports. Stamped at resolve-endpoints.
-    // +optional
-    RPCEndpoints []string `json:"rpcEndpoints,omitempty"`
-}
-
-type ChainDeploymentStatus struct {
-    Name    string                  `json:"name"`
-    Role    ChainDeploymentRole     `json:"role"`
-    SNDName string                  `json:"sndName"`
-    Phase   string                  `json:"phase,omitempty"` // mirror of SND.status.phase
-}
-
-// +kubebuilder:validation:XValidation:rule="(has(self.alert)?1:0) + (has(self.query)?1:0) <= 1",message="rule status carries at most one of alert or query"
-type RuleStatus struct {
-    Name             string             `json:"name"`
-    Type             ValidationRuleType `json:"type"`
-    Verdict          ValidationVerdict  `json:"verdict"`
-    // Reason categorizes Error verdicts (PrometheusRuleNotFound, NoSamples,
-    // AmbiguousResult, NaN, PrometheusUnavailable, RBACDenied, Timeout) and
-    // Failed verdicts (ThresholdViolated, AlertFired).
-    // +optional
-    Reason           string             `json:"reason,omitempty"`
-    // +optional
-    LastEvaluatedAt  *metav1.Time       `json:"lastEvaluatedAt,omitempty"`
-    // +optional
-    NextEvaluationAt *metav1.Time       `json:"nextEvaluationAt,omitempty"`
-    // +optional
-    Query            *QueryRuleStatus   `json:"query,omitempty"`
-    // +optional
-    Alert            *AlertRuleStatus   `json:"alert,omitempty"`
-}
-
-type QueryRuleStatus struct {
-    ActualValue string `json:"actualValue"` // string, lossless
-    Threshold   string `json:"threshold"`
-    Op          string `json:"op"`
-}
-
-type AlertRuleStatus struct {
-    FiredCount  int32        `json:"firedCount"`
-    LastFiredAt *metav1.Time `json:"lastFiredAt,omitempty"`
-}
-
-type ReportStatus struct {
-    // S3URL = s3://harbor-validation-results/{namespace}/{job}/{runId}/
-    // OWNER: ValidationOrchestrationReconciler. Stamped at finalize.
-    // +optional
-    S3URL string `json:"s3Url,omitempty"`
-}
-```
-
-##### Conditions and the single-writer table
-
-Each condition has exactly one writing controller. SSA field-manager isolation enforces this — `validationrun-orchestration` and `validationrun-loadgeneration` never touch each other's owned condition keys. The orchestrator's status patch lists only the conditions it owns; the loadgenerator's patch lists only `LoadComplete`. controller-runtime + Kubernetes server-side apply preserves both sets correctly because they are partitioned by `type`.
-
-| Condition / status field | Owner controller | Field manager | Purpose |
-|---|---|---|---|
-| `Conditions[TestRunning]` | ValidationOrchestrationReconciler | `validationrun-orchestration` | Set True at the `mark-test-running` task; gates ValidationLoadGenerationReconciler's predicate |
-| `Conditions[TestComplete]` | ValidationOrchestrationReconciler | `validationrun-orchestration` | Set True at finalize; monotonic |
-| `Conditions[Succeeded]` | ValidationOrchestrationReconciler | `validationrun-orchestration` | Tekton-style; set at finalize |
-| `Conditions[TestCancelled]` | ValidationOrchestrationReconciler | `validationrun-orchestration` | Set True on stop-on-failure rule trip; observed by actor reconcilers |
-| `Conditions[LoadComplete]` | ValidationLoadGenerationReconciler | `validationrun-loadgeneration` | Set True when wait-job-terminal captures exit code |
-| `status.chain.*` | ValidationOrchestrationReconciler | `validationrun-orchestration` | |
-| `status.report.*` | ValidationOrchestrationReconciler | `validationrun-orchestration` | |
-| `status.rules[]` | ValidationOrchestrationReconciler | `validationrun-orchestration` | |
-| `status.phase` / `.verdict` / `.startTime` / `.completionTime` / `.duration` | ValidationOrchestrationReconciler | `validationrun-orchestration` | |
-| `status.failedPlan` | ValidationOrchestrationReconciler | `validationrun-orchestration` | |
-| `status.workloadExitCode` | ValidationLoadGenerationReconciler | `validationrun-loadgeneration` | Stamped at wait-job-terminal |
-| `status.plans.orchestration` | ValidationOrchestrationReconciler | `validationrun-orchestration` | |
-| `status.plans.loadGeneration` | ValidationLoadGenerationReconciler | `validationrun-loadgeneration` | |
-
-`Succeeded` matches Tekton TaskRun/PipelineRun semantics — `kubectl wait --for=condition=Succeeded validationrun/X` is the contract.
-
-`TestRunning` is the gate the ValidationLoadGenerationReconciler watches: the controller's predicate passes create and update events only while this condition is True. It is set by the orchestration plan's `mark-test-running` task after chain readiness and endpoint resolution succeed. Before that point the ValidationLoadGenerationReconciler short-circuits (no plan, no work).
-
-`TestCancelled` is the cross-controller stop signal: when monitor-task-completion observes a stop-on-failure rule trip, it sets this condition; ValidationLoadGenerationReconciler observes it on its next reconcile and halts (stops submitting tasks; in-flight work proceeds; chain teardown via cascade-delete is the rollback).
-
-| `Succeeded` status | reason | meaning |
-|---|---|---|
-| `Unknown` | `Pending` / `Running` | Not yet terminal |
-| `True` | `RunSucceeded` | Phase=Succeeded |
-| `False` | `WorkloadFailed` | Workload exit 1 |
-| `False` | `RuleFailed:{ruleName}` | One or more rules verdict=Failed |
-| `False` | `RuleError` | Run terminated with Phase=Error from rule eval |
-| `False` | `WorkloadInfraFailure` | Workload exit 2 |
-| `False` | `ChainNotReady` | wait-chain-ready timed out |
-| `False` | `Cancelled` | Phase=Cancelled |
-
-| `TestComplete` status | reason | meaning |
-|---|---|---|
-| `Unknown` | `Pending` / `Running` | Run not yet terminal |
-| `True` | `RunFinalized` | Orchestration plan reached finalize task; phase + verdict + S3 URL set |
-
-| `TestRunning` status | reason | meaning |
-|---|---|---|
-| `Unknown` | `Pending` | Chain not yet ready |
-| `True` | `ChainReady` | Endpoints resolved; actor reconcilers may proceed |
-| `False` | `Cancelled` | Set False when the run terminates (closes the gate behind us) |
-
-| `LoadComplete` status | reason | meaning |
-|---|---|---|
-| `Unknown` | `Pending` / `Running` | Workload not yet terminal |
-| `True` | `JobTerminated` | Job reached Complete or Failed; `status.workloadExitCode` populated |
-| `True` | `JobTimedOut` | `spec.timeouts.runDuration` exceeded; controller cancelled the Job; exit code synthesized as 2 |
-
-| `TestCancelled` status | reason | meaning |
-|---|---|---|
-| `Unknown` | `Pending` / `Running` | Not signalled |
-| `True` | `RuleStopOnFailure:{ruleName}` | Stop-on-failure tripped; actor reconcilers halt |
-
-### Future kinds (one paragraph each, NOT detailed)
-
-#### `ValidationSuite` — designed but NOT implemented in v1
-
-Designed at field-level so `ValidationRun`'s shape doesn't lock out the suite parent. CRD ships with v1; no controller implements it. Attempting to create one returns an admission-time error (`ValidationSuite reconciler not registered in this controller version`).
-
-```go
-// +kubebuilder:object:root=true
-// +kubebuilder:subresource:status
-// +kubebuilder:resource:shortName=vsuite
-type ValidationSuite struct {
-    metav1.TypeMeta   `json:",inline"`
-    metav1.ObjectMeta `json:"metadata,omitempty"`
-    Spec   ValidationSuiteSpec   `json:"spec,omitempty"`
-    Status ValidationSuiteStatus `json:"status,omitempty"`
-}
-
-type ValidationSuiteSpec struct {
-    // +listType=map
-    // +listMapKey=name
-    // +kubebuilder:validation:MinItems=1
-    Runs []SuiteRunSpec `json:"runs"`
-    // +optional
-    // +kubebuilder:default=Sequential
-    // +kubebuilder:validation:Enum=Sequential;Parallel
-    Concurrency SuiteConcurrency `json:"concurrency,omitempty"`
-    // +optional
-    // +kubebuilder:default=true
-    StopOnFailure bool `json:"stopOnFailure,omitempty"`
-}
-
-type SuiteRunSpec struct {
-    // +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
-    Name string            `json:"name"`
-    Spec ValidationRunSpec `json:"spec"`
-}
-```
-
-No DAG. Users wanting a DAG wrap suites in Argo (Litmus's "we outsource multi-experiment to Argo" lesson — [Litmus probes](https://docs.litmuschaos.io/docs/concepts/probes)).
-
-#### `ValidationSchedule` — designed but NOT implemented in v1
-
-Modeled on Argo `CronWorkflow` ([Argo CronWorkflow](https://argo-workflows.readthedocs.io/en/latest/cron-workflows/)) plus Chaos Mesh's annotation-pause refinement.
-
-```go
-// +kubebuilder:object:root=true
-// +kubebuilder:subresource:status
-// +kubebuilder:resource:shortName=vsched
-type ValidationSchedule struct {
-    metav1.TypeMeta   `json:",inline"`
-    metav1.ObjectMeta `json:"metadata,omitempty"`
-    Spec   ValidationScheduleSpec   `json:"spec,omitempty"`
-    Status ValidationScheduleStatus `json:"status,omitempty"`
-}
-
-type ValidationScheduleSpec struct {
-    // +kubebuilder:validation:MinItems=1
-    Schedules []string `json:"schedules"`
-    // +optional
-    // +kubebuilder:default=UTC
-    Timezone string `json:"timezone,omitempty"`
-    RunTemplate ValidationRunSpec `json:"runTemplate"`
-    // +optional
-    // +kubebuilder:default=Forbid
-    // +kubebuilder:validation:Enum=Allow;Replace;Forbid
-    ConcurrencyPolicy SchedConcurrency `json:"concurrencyPolicy,omitempty"`
-    // +optional
-    SuccessfulRunsHistoryLimit *int32 `json:"successfulRunsHistoryLimit,omitempty"`
-    // +optional
-    FailedRunsHistoryLimit *int32 `json:"failedRunsHistoryLimit,omitempty"`
-    // +optional
-    // +kubebuilder:default=60
-    StartingDeadlineSeconds *int64 `json:"startingDeadlineSeconds,omitempty"`
-    // Pause via annotation (validation.sei.io/paused: "true").
-}
-```
-
-## Lifecycle and plan tasks
-
-The two sub-controllers each integrate with `internal/planner/` (see `internal/planner/doc.go`) but produce **independent plans** that progress in parallel. There is no barrier task; the ValidationLoadGenerationReconciler is gated **at event-delivery time** by a controller-runtime predicate.
-
-### ValidationOrchestrationReconciler — required when validation is enabled
-
-**Purpose:** chain bring-up, endpoint resolution, gate-flip, monitoring, finalize. Owns every status field except `status.workloadExitCode`, `status.plans.loadGeneration`, and `Conditions[LoadComplete]`.
-
-**Builder location:** `internal/planner/validationrun_orchestration.go`. Resolver in the controller package follows the same `ResolvePlan → persist → ExecutePlan` shape as `NodeResolver` and `ForGroup`.
-
-**Plan tasks (ordered, sequential — 6 tasks):**
-
-```
-1. ensure-chain               (controller-side; SSA all SND children)
-2. wait-chain-ready           (controller-side, async; SND.status.phase=Ready)
-3. resolve-endpoints          (controller-side; LIST fullNode-role headless Services)
-4. mark-test-running          (controller-side; sets Conditions[TestRunning]=True;
-                               this is the gate-flip that lets actor reconcilers
-                               start their plans)
-5. monitor-task-completion    (controller-side, async; reads Conditions[LoadComplete],
-                               status.rules[], status.workloadExitCode; polls rules
-                               per RunProperties.Interval; computes stop-on-failure)
-6. finalize                   (controller-side; computes phase + verdict + report.s3Url;
-                               sets Conditions[TestComplete]=True, Conditions[Succeeded];
-                               flips Conditions[TestRunning]=False)
-
-Plan TargetPhase: ignored (see "TargetPhase decoupling" below)
-Plan FailedPhase: ignored
-```
-
-The 6-task collapse from the prior 7-task design pulls the workload-side concerns (`render-config` and `apply-job`) out of the orchestration plan and into ValidationLoadGenerationReconciler's plan. The orchestration plan no longer touches the Job at all; it observes condition state and aggregates verdicts.
-
-### ValidationLoadGenerationReconciler — opt-in (default on within validation in v1)
-
-**Purpose:** materialize the workload Job and ConfigMap, watch to terminal, capture the exit code, set `Conditions[LoadComplete]=True`. That is the entire surface area.
-
-**Builder location:** `internal/planner/validationrun_loadgen.go`.
-
-**Plan tasks (ordered, sequential — 3 tasks):**
-
-```
-1. render-config            (controller-side; SSA ConfigMap with substitutions
-                             ${chainId}, ${rpcEndpoints}, ${runId}, ${namespace};
-                             skipped when spec.load.workload.config is unset)
-2. apply-job                (controller-side; SSA batch/v1.Job with Indexed
-                             completion mode, reserved-env injection, OwnerRef→Run)
-3. wait-job-terminal        (controller-side, async; watches Job to terminal,
-                             captures pod exit code, stamps status.workloadExitCode,
-                             sets Conditions[LoadComplete]=True)
-```
-
-**Predicate-gated reconciliation.** The reconciler's `SetupWithManager` attaches a predicate to its `For` watch (via `builder.WithPredicates`) that passes a create or update event only when both:
-
-- `spec.load != nil`, AND
-- `Conditions[TestRunning]` exists with `Status=True`
-
-```go
-// internal/controller/validationrun/loadgen_predicate.go
-var loadActorPredicate = predicate.Funcs{
-    CreateFunc: func(e event.CreateEvent) bool {
-        return shouldReconcileLoadGen(e.Object.(*validationv1alpha1.ValidationRun))
-    },
-    UpdateFunc: func(e event.UpdateEvent) bool {
-        return shouldReconcileLoadGen(e.ObjectNew.(*validationv1alpha1.ValidationRun))
-    },
-    DeleteFunc: func(e event.DeleteEvent) bool { return true },
-    GenericFunc: func(_ event.GenericEvent) bool { return false },
-}
-
-func shouldReconcileLoadGen(vr *validationv1alpha1.ValidationRun) bool {
-    if vr.Spec.Load == nil {
-        return false
-    }
-    cond := meta.FindStatusCondition(vr.Status.Conditions, "TestRunning")
-    return cond != nil && cond.Status == metav1.ConditionTrue
-}
-```
-
-Until `Conditions[TestRunning]=True`, controller-runtime drops create and update events for the CR before invoking Reconcile, so the reconciler does no work for it (delete events always pass). This is the controller-runtime-native "wait for upstream readiness" pattern: no barrier task, no polling, no synchronization primitive.
-
-`ValidationLoadGenerationReconciler.ResolvePlan` also returns `(nil, nil)` when `spec.load == nil`, even if an event passes the predicate (defense in depth), so reconciling a rules-only Run with the controller enabled is a no-op rather than an error.
-
-### Plan ownership and the executor
-
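-Each controller passes its own `Named()` to the `controller-runtime` builder, so the two reconcilers register under distinct controller names and their metric labels and log fields are partitioned. Leader election is configured on the manager, so both controllers share the manager's single lock: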
-
-```go
-// ValidationOrchestrationReconciler
-ctrl.NewControllerManagedBy(mgr).
-    Named("validationrun-orchestration").
-    For(&validationv1alpha1.ValidationRun{}).
-    Owns(&seiv1alpha1.SeiNodeDeployment{}).
-    Complete(r)
-
-// ValidationLoadGenerationReconciler
-ctrl.NewControllerManagedBy(mgr).
-    Named("validationrun-loadgeneration").
-    For(&validationv1alpha1.ValidationRun{}, builder.WithPredicates(loadActorPredicate)).
-    Owns(&batchv1.Job{}).
-    Owns(&corev1.ConfigMap{}).
-    Complete(r)
-```
-
-The `planner.Executor[*ValidationRun]` is parameterized by the plan slot: each controller passes its own `slotAccessor` to the executor (a function `func(*ValidationRun) **TaskPlan` returning a pointer-to-pointer for the right slot, e.g., `&vr.Status.Plans.Orchestration`). The executor's existing single-patch model writes only the slot it was given. SSA's field-manager isolation handles the rest.
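-
-The pointer-to-pointer accessor can be illustrated with stand-in types. This is a sketch under stated assumptions: `TaskPlan`, `Plans`, and `ValidationRun` below are simplified local stand-ins, and `setPlan` is a hypothetical helper rather than the executor's real API.
-
-```go
-package main
-
-import "fmt"
-
-// Simplified stand-ins for the real API types.
-type TaskPlan struct{ Phase string }
-
-type Plans struct {
-    Orchestration  *TaskPlan
-    LoadGeneration *TaskPlan
-}
-
-type ValidationRun struct {
-    Status struct{ Plans Plans }
-}
-
-// slotAccessor returns the address of the plan slot a controller owns.
-type slotAccessor func(*ValidationRun) **TaskPlan
-
-func orchestrationSlot(vr *ValidationRun) **TaskPlan {
-    return &vr.Status.Plans.Orchestration
-}
-
-func loadGenerationSlot(vr *ValidationRun) **TaskPlan {
-    return &vr.Status.Plans.LoadGeneration
-}
-
-// setPlan (hypothetical) writes only the slot selected by the accessor,
-// leaving the other controller's slot untouched.
-func setPlan(vr *ValidationRun, slot slotAccessor, plan *TaskPlan) {
-    *slot(vr) = plan
-}
-
-func main() {
-    vr := &ValidationRun{}
-    setPlan(vr, loadGenerationSlot, &TaskPlan{Phase: "Active"})
-    // prints: true Active
-    fmt.Println(vr.Status.Plans.Orchestration == nil, vr.Status.Plans.LoadGeneration.Phase)
-}
-```
-
-Returning `**TaskPlan` rather than `*TaskPlan` lets the caller both read the slot and replace or clear it, which a plain pointer to the plan value would not allow when the slot is nil.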
-
-### Plan-driven invariants inherited from the existing controllers
-
-- Each plan persisted before any task executes (atomic creation).
-- Single-patch model per controller per reconcile: tasks mutate owned resources; executor mutates plan/phase in-memory; reconciler flushes once via SSA with its dedicated field manager.
-- Planner owns conditions; executor never sets conditions.
-- Terminal plan observed by next reconcile, which (for ValidationOrchestrationReconciler only) sets the terminal phase. ValidationLoadGenerationReconciler does NOT set phase on terminal — it only sets `Conditions[LoadComplete]=True`; the orchestrator's monitor-task-completion task observes that condition and continues.
-
-### TargetPhase decoupling for v1
-
-`TaskPlan.TargetPhase` is typed `seiv1alpha1.SeiNodePhase` — its enum values (`Pending|Initializing|Running|Failed|Terminating`) do not align with `ValidationRunPhase` (`Pending|Running|Succeeded|Failed|Error|Cancelled`). For v1:
-
-- Both ValidationRun controllers leave `TaskPlan.TargetPhase` and `FailedPhase` empty in their built plans.
-- The executor's `setTargetPhase` short-circuit (when TargetPhase is empty, skip the phase write) handles this cleanly — already exercised by the existing NodeUpdate plans.
-- Phase transitions are owned exclusively by the orchestration plan's `finalize` task, which computes `Phase + Verdict + status.failedPlan` from `Conditions` + `status.workloadExitCode` + `status.rules[]` and writes them in the orchestrator's status patch.
-- ValidationLoadGenerationReconciler does NOT write `status.phase` ever. Its terminal-plan observation only sets `Conditions[LoadComplete]=True` and clears its own plan slot.
-
-This is documented as Open Dependency #2 — when v2 lands a third actor, we may need a `TaskPlan` type with controller-typed `TargetPhase`/`FailedPhase`. v1 sidesteps the issue.
-
-### Per-task contract
-
-Each task lives at `internal/task/validation/{name}.go`, follows the existing controller-side task pattern (`Submit / Poll / Apply` interface — see `internal/task/observe_image.go`), and is registered in `internal/task/task.go`'s deserialize map.
-
-#### Orchestration plan tasks
-
-##### 1. `ensure-chain`
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeEnsureChain` |
-| Sync/Async | Sync |
-| Idempotent op | Server-side apply one SND per `spec.chain.deployments[]` entry, named `{chainId}-{name}`, with `OwnerReferences` to the Run. Field manager `validationrun-orchestration`. |
-| Inputs | `Run.spec.chain` |
-| Success | All SNDs exist with the expected spec hash |
-| Failure | API error (terminal after 3 retries → `Reason=ChainApplyFailed`); SND validation reject (terminal → `Reason=ChainSpecInvalid`) |
-
-Injected fields the task fills before SSA:
-
-- `genesis.chainId = chain.chainId` on the validator-role SND if unset.
-- `template.spec.chainId = chain.chainId` on every template if unset.
-- `template.spec.validator: {}` on validator-role SND templates; `template.spec.fullNode: {}` on fullNode-role templates.
-- Default `peers: [{label: {selector: {sei.io/chain-id: chainId}}}]` for validator-role and `[{label: {selector: {sei.io/nodedeployment: <validator-snd-name>}}}]` for fullNode-role — when the user omitted `peers`.
-- Labels on every materialized object: `sei.io/managed-by=validationrun`, `sei.io/run-id=<runName>`, `sei.io/chain-id=<chainId>`, `sei.io/deployment-name=<chainDeployment.name>`.
-
-##### 2. `wait-chain-ready`
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeWaitChainReady` |
-| Sync/Async | Async (RequeueAfter pattern) |
-| Idempotent op | GET each materialized SND; check `status.phase == Ready` AND `status.conditions[NodesReady].status == True`. |
-| Timeout | `spec.timeouts.chainReady` (default 20m) |
-| Failure | Timeout → `Reason=ChainNotReady`. Any SND `phase=Failed` → `Reason=ChainFailed` |
-
-**SND readiness already gates on chain catch-up.** `internal/noderesource/noderesource.go:366-377` wires `readinessProbe.httpGet` at `seid:RPC_PORT/lag_status` (sei-tendermint's threshold-configured endpoint that returns non-200 while lagging behind chain tip). kubelet sees probe failure → Pod stays NotReady → Service excludes it → SND `phase=Ready` waits via the existing `ConditionNodesReady` aggregation. The `wait-chain-ready` task's poll on SND `phase=Ready` therefore does include the catching_up precondition transitively. (Earlier drafts of this LLD claimed this was a gap and proposed extending the seictl sidecar; that claim was based on incorrect investigation. See `sei-protocol/sei-k8s-controller#144` for the closed-as-already-implemented record.)
-
-##### 3. `resolve-endpoints`
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeResolveEndpoints` |
-| Sync/Async | Sync |
-| Idempotent op | LIST headless Services with selector `sei.io/managed-by=validationrun, sei.io/run-id=<runName>, sei.io/role=fullNode`; build `{podDNSName}:{port}` list across all fullNode-role deployments. Stamps `Run.status.chain.rpcEndpoints`. |
-| Selection | Always the union of all fullNode-role deployments. |
-| Failure | Empty result after 5 retries → `Reason=EndpointsNotResolvable` |
-
-##### 4. `mark-test-running`
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeMarkTestRunning` |
-| Sync/Async | Sync (one-shot) |
-| Idempotent op | `meta.SetStatusCondition` with `Type=TestRunning, Status=True, Reason=ChainReady`. |
-| Side effect | Triggers ValidationLoadGenerationReconciler's predicate-gated event delivery on the next informer dispatch — controller-runtime's update event (the conditions array changed) is delivered through the predicate; if `spec.load != nil`, ValidationLoadGenerationReconciler reconciles and starts its plan. |
-| Failure | None (in-memory condition write; persisted via the normal status patch) |
-
-##### 5. `monitor-task-completion`
-
-This is the central observation task. It is the renamed-and-reduced successor to the prior `monitor-run` task. It does **not** watch the workload Job — that is ValidationLoadGenerationReconciler's job. It reads condition + status state and decides whether the run is done.
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeMonitorTaskCompletion` |
-| Sync/Async | Async (RequeueAfter; refines requeue cadence by min rule interval) |
-| Inputs | `Run.spec.rules`, `Run.status.rules`, `Run.status.conditions[LoadComplete, ...]`, `Run.status.workloadExitCode` |
-| Timeout | `spec.timeouts.runDuration` |
-
-**Per-iteration behavior:**
-
-```
-1. Condition gathering:
-     loadComplete = Conditions[LoadComplete].Status == True (or spec.load == nil → treat as True at runStart+0)
-     // For rules-only Runs: monitor-task-completion's loadComplete defaults to
-     // True at task start; the run's "completion" is purely rule-driven.
-
-2. Rule-side advance (for each rule with verdict != Failed):
-     if rule.LastEvaluatedAt + Interval <= now (or unset and now >= runStart + Interval):
-       issue Prometheus instant query at time=now
-       classify into Passed | Failed | Error (per per-rule semantics)
-       update status.rules[i] with verdict, reason, query/alert evidence,
-         LastEvaluatedAt = now,
-         NextEvaluationAt = now + Interval
-
-3. Stop-on-failure:
-     if any rule with RunProperties.StopOnFailure=true tripped to Failed
-       in this iteration:
-       set Conditions[TestCancelled] = True, Reason=RuleStopOnFailure:{ruleName}
-       exit task as Failed → planner advances to finalize, which sets
-         Phase=Failed and FailedPlan="" (orchestration's own ruling, no actor blame)
-
-4. Exit conditions:
-     if loadComplete AND no rule has verdict=Failed:
-       do a final sweep — for any rule whose NextEvaluationAt <= now+Interval
-       (i.e., still pending) issue one last evaluation now
-       exit task Complete → planner advances to finalize
-
-     if loadComplete AND some rule has verdict=Failed:
-       exit task Failed → planner advances to finalize
-
-5. Else (still running):
-     RequeueAfter min(
-       earliest rule's NextEvaluationAt - now,
-       loadCompletePollInterval (10s — bound on observing a condition flip
-                                    even though informer events drive most updates)
-     )
-```
-
-**Idempotency.** The task carries no in-memory state across reconciles. Every iteration recomputes from `.status` (rules) and `.status.conditions` (LoadComplete). All advance is monotonic.
-
-**Implementation invariants for this task** (load-bearing — see Implementation invariants subsection below):
-
-- Transient errors (Prometheus 5xx, network timeout, RBAC denial) return `RequeueAfter`, NEVER `task.TerminalError`. The existing executor's `RetryCount/MaxRetries` was designed for short-running sidecar tasks that re-execute on retry; a long-running poller would mature `MaxRetries` into a permanent failure incorrectly.
-- Per-rule `Retry` (in `RuleRunProperties`) governs the rule's *own* error budget — that is the right per-rule retry surface; the task-level retry is not used here.
-
-##### Per-rule evaluation semantics
-
-For an `alert` rule (instant query against the synthetic `ALERTS` series):
-
-```promql
-max_over_time(ALERTS{alertname="X",alertstate="firing"}[<runDuration-so-far>s]) > 0
-```
-
-- Result > 0 → `verdict=Failed`, `Reason=AlertFired`, fill `AlertRuleStatus.FiredCount` + `LastFiredAt`
-- Result == 0 → `verdict=Passed`
-- Prometheus 5xx / network error → `verdict=Error`, `Reason=PrometheusUnavailable` (counted against `Retry` budget; resets to Awaited on next iteration if budget remains)
-
-For a `query` rule (instant query of user PromQL):
-
-- Result must be **scalar** or **vector of length 1**. Vector length 0 → `verdict=Error`, `Reason=NoSamples`. Length > 1 → `Error/AmbiguousResult`.
-- `NaN` → `Error/NaN`. Never silent pass/fail.
-- Float comparators: the user-facing docs should recommend `>=`/`<=`, because `==`/`!=` are exact comparisons with no tolerance.
-- Threshold coerced from string to float64 (`strconv.ParseFloat`); CEL admission already gated the regex.
-- Comparator passes → `verdict=Passed`. Comparator fails → `verdict=Failed`, `Reason=ThresholdViolated`.
-- Store `actualValue: string` and `threshold: string` losslessly in `.status.rules[].query`.
-
-##### Failure model (per rule)
-
-- `Failed` = "signal said SUT misbehaved" (alert fired in window, query violated threshold). Monotonic.
-- `Error` = "couldn't ask" (Prometheus 5xx, network timeout, RBAC denial, unresolved ref, NaN, NoSamples, AmbiguousResult). Decays — successful evaluation overwrites Error with Passed/Failed; matures into permanent Error after `Retry` consecutive Errors.
-
-##### PrometheusRule resolution (folded into monitor-task-completion's first iteration)
-
-For each rule with `type=alert`, the first evaluation iteration GETs the referenced `monitoring.coreos.com/v1.PrometheusRule`; verifies `alertname` exists in the rule's spec; validates cross-namespace policy. Per-rule failure of resolution writes `verdict=Error` with `Reason=PrometheusRuleNotFound` or `Reason=CrossNamespaceForbidden`; does NOT block other rules. If **all** rules fail resolution on first iteration, the task exits Error with `Reason=AllRulesUnresolved`.
-
-##### 6. `finalize`
-
-Computes the final phase + verdict + condition from accumulated state, stamps the S3 URL, and closes the `TestRunning` gate. Pure in-memory aggregation plus one status patch:
-
-```
-loadComplete = Conditions[LoadComplete].Status == True (or spec.load == nil)
-exit         = *status.workloadExitCode (set by ValidationLoadGenerationReconciler; nil if load unset)
-ruleVerdicts = status.rules[]
-cancelled    = Conditions[TestCancelled].Status == True
-
-case cancelled:                                  → Phase=Failed, Reason=RuleStopOnFailure:{ruleName}, FailedPlan="" (orchestration's ruling)
-case spec.load != nil && exit == 2:              → Phase=Error, Reason=WorkloadInfraFailure, FailedPlan="loadGeneration"
-case spec.load != nil && exit == 1 && any(rule.Failed):
-                                                 → Phase=Failed, Reason=WorkloadFailed (primary), FailedPlan="loadGeneration", append RuleFailed conditions
-case spec.load != nil && exit == 1:              → Phase=Failed, Reason=WorkloadFailed, FailedPlan="loadGeneration"
-case any(rule.Failed):                           → Phase=Failed, Reason=RuleFailed:{firstFailedRule}, FailedPlan=""
-case all(rule.Passed) && (spec.load==nil || exit==0):
-                                                 → Phase=Succeeded, Reason=RunSucceeded
-case any(rule.Error) && all others Passed:       → Phase=Error,  Reason=RuleError, FailedPlan=""
-default:                                         → Phase=Error,  Reason=Unknown
-```
-
-Sets `status.completionTime`, `status.duration`, the `Succeeded` condition, and `Conditions[TestComplete]=True, Reason=RunFinalized`; flips `Conditions[TestRunning]=False, Reason=Cancelled` (closes the gate so any in-flight ValidationLoadGenerationReconciler reconcile observes the cancellation); computes `status.report.s3Url = s3://harbor-validation-results/{ns}/{job}/{runId}/`; and emits the `validation_run_terminal_total{verdict}` metric (heartbeat-alert input).
-
-The S3 URL stamp is trivially derivable; the controller never reads from S3.
-
-#### LoadGeneration plan tasks
-
-##### 1. `render-config`
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeRenderConfig` |
-| Sync/Async | Sync |
-| Idempotent op | Apply ConfigMap with files from `load.workload.config.files`, fixed-substitutions applied (`${chainId}`, `${rpcEndpoints}`, `${runId}`, `${namespace}` — sourced from `Run.status.chain.rpcEndpoints`). Field manager `validationrun-loadgeneration`. OwnerRef → Run. Name `{runName}-config`. |
-| Skipped when | `load.workload.config` unset |
-| Failure | API error → terminal `Reason=ConfigApplyFailed`, plan FailedPlan="loadGeneration" stamped by orchestrator's finalize |
-
-The substitution sources `${rpcEndpoints}` from `Run.status.chain.rpcEndpoints`, which ValidationOrchestrationReconciler stamped during `resolve-endpoints` before flipping `Conditions[TestRunning]`. The predicate guarantees this status field is populated before ValidationLoadGenerationReconciler reconciles.
-
-##### 2. `apply-job`
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeApplyJob` |
-| Sync/Async | Sync |
-| Idempotent op | SSA `batch/v1.Job` named `{runName}` with `spec.parallelism=replicas`, `spec.completions=replicas`, `spec.completionMode=Indexed`, `backoffLimit=0`. Pod template carries reserved env vars (with `JOB_COMPLETION_INDEX` exposed as `SHARD_INDEX` via env-from-fieldRef), mounted ConfigMap, RESULT_DIR emptyDir, `serviceAccountName=spec.load.workload.serviceAccountName \|\| {ns}-runner`. Field manager `validationrun-loadgeneration`. OwnerRef → Run. |
-| Failure | Forbidden (no SA) → terminal `Reason=ServiceAccountMissing`. Other API errors retried. |
-
-The Job carries no Flux labels (`kustomize.toolkit.fluxcd.io/*`, `app.kubernetes.io/managed-by=flux`). Hard invariant — see One-Way Doors.
-
-##### 3. `wait-job-terminal`
-
-| Property | Value |
-|---|---|
-| Type constant | `validation.TaskTypeWaitJobTerminal` |
-| Sync/Async | Async (RequeueAfter; informer events on Owns(Job) refine cadence) |
-| Idempotent op | GET workload Job; if condition `Complete=True OR Failed=True`, GET the workload pod (label-selected), parse `pod.status.containerStatuses[].state.terminated.exitCode`, stamp `status.workloadExitCode` and set `Conditions[LoadComplete]=True, Reason=JobTerminated`. If `Conditions[TestCancelled]=True` was set by orchestrator, halt cooperatively (delete the Job; the cascade-delete on Run terminal will clean up regardless). |
-| Timeout | `spec.timeouts.runDuration` (default `load.duration + 5m`). On timeout, delete Job, synthesize exit code 2, set `Conditions[LoadComplete]=True, Reason=JobTimedOut`. |
-| Failure | API error → RequeueAfter (transient); plan-level terminal failure if Job spec is invalid (would have been caught at apply-job). |
-
-**Cancellation observation.** The task's idempotent loop checks `Conditions[TestCancelled]` first; when set True (by ValidationOrchestrationReconciler's monitor-task-completion stop-on-failure path), the task deletes the Job, sets `Conditions[LoadComplete]=True, Reason=Cancelled`, and exits Complete. This is the cooperative-halt contract: actor reconcilers do NOT undo work already done; the chain teardown via cascade-delete handles cleanup.
-
-### Cancellation contract
-
-`metadata.deletionTimestamp != nil` triggers the standard finalizer flow. ValidationOrchestrationReconciler clears its finalizer last; the cascade-delete OwnerReferences purge SNDs/Job/ConfigMap.
-
-`Conditions[TestCancelled]=True` is the **stop-on-failure** signal, set exclusively by ValidationOrchestrationReconciler's monitor-task-completion task when a stop-on-failure rule trips. Per-actor reconcilers (ValidationLoadGenerationReconciler) observe this condition on their next reconcile and halt cooperatively:
-
-- They stop submitting new tasks (the predicate continues to fire; the reconciler's plan resolver sees `TestCancelled=True` and short-circuits the build of new plans).
-- In-flight tasks proceed naturally to completion, but their `wait-job-terminal`-shaped tasks observe the cancellation and delete their owned Job to halt the actor's work.
-- They do NOT undo work already done. Chain teardown via cascade-delete is the rollback when the Run itself is deleted.
-- "Halt" is best-effort cooperative; the Run reaches `Phase=Failed`, NOT `Phase=Cancelled`. (`Cancelled` is reserved for `metadata.deletionTimestamp != nil`.)
-
-This contract scales to v2: when ValidationSequenceReconciler observes `TestCancelled=True`, it stops applying new sequence steps; when ValidationChaosReconciler observes it, it stops scheduling new fault injections.
-
-### Implementation invariants
-
-These are runtime rules every controller and task implementation must follow. They are not visible in the CRD schema but are load-bearing for correctness.
-
-1. **Transient errors return `RequeueAfter`, never `task.TerminalError`.** `monitor-task-completion` (Prometheus poller) and `wait-job-terminal` (Job watcher) are long-running; the executor's `MaxRetries` budget would mature transient errors into permanent failures. Implementation invariant: only schema-violation, RBAC-denial, and missing-resource errors are terminal in these tasks; everything else requeues.
-2. **Plan-creation idempotency uses optimistic concurrency.** Status patches use resourceVersion-checked update, not blind SSA on the status subresource. Existing controllers (`SeiNodeReconciler`, etc.) already do this; both ValidationRun controllers reinforce the pattern. Two reconciles in the same generation that race on plan creation see a conflict; one retries.
-3. **Each controller's status patch carries only its owned fields.** Field-manager isolation is preserved by listing only the fields the controller owns in its SSA patch — never echoing back the other controller's conditions or plan slot.
-4. **No controller writes another controller's plan slot.** The executor is parameterized by slot; passing the wrong slot is a compile-time error in Go (different field type per slot).
-5. **`spec.load == nil` short-circuits ValidationLoadGenerationReconciler.** The predicate gate is the primary defense; the resolver's `(nil, nil)` return is the secondary defense.
-6. **`monitor-task-completion` treats absent `spec.load` as `Conditions[LoadComplete]=True` from t=0.** Rules-only Runs are well-formed; the orchestrator does not wait for a never-arriving load-complete signal.
-
-## Validation rule semantics
-
-(Concise restatement; see `monitor-task-completion` task above for operative detail and OTel Round 1 for rationale.)
-
-- v1 evaluates rules continuously: each rule polls at `RunProperties.Interval` (default 30s, min 5s, max 5m) starting at `runStart + Interval`, with a final sweep at run terminal time before finalize aggregates.
-- v1 types: `alert`, `query`. Future `container` reserved.
-- Both types use the same Prometheus HTTP client (`internal/monitoring/prom.go`). One connection pool, one `otelhttp.NewTransport`, one set of metrics.
-- Prometheus URL via env var (`VALIDATION_PROMETHEUS_URL`, default `http://prometheus-k8s.monitoring.svc:9090`) with `--prometheus-url` flag override.
-- AND-of-rules pass semantics. OR/weighted aggregation explicitly out of scope.
-- Per-rule `verdict ∈ {Passed, Failed, Awaited, Error}` (4-state). `Awaited` is the pre-first-evaluation state.
-- `Failed` is monotonic per rule. `Error` is recoverable (decays to Passed/Failed; matures into permanent Error after `Retry` consecutive Errors).
-- `StopOnFailure: true` on a rule that trips → ValidationOrchestrationReconciler sets `Conditions[TestCancelled]=True`; ValidationLoadGenerationReconciler halts; Run terminates Failed.
-
-## Controller registration and opt-in deployment
-
-The validation machinery — `ValidationOrchestrationReconciler`, `ValidationLoadGenerationReconciler`, and any future actor controllers — is **opt-in at deployment time**. The default `sei-k8s-controller` deployment runs `SeiNodeReconciler` and `SeiNodeDeploymentReconciler` only; cluster operators running production validators with Kubernetes do not need the validation slice.
-
-Validation is opted in by chain developers and release engineers who own ephemeral test environments. When opted in, `ValidationOrchestrationReconciler` is the required orchestrator; opting into specific actor controllers (`ValidationLoadGenerationReconciler` in v1, future `ValidationSequenceReconciler` and `ValidationChaosReconciler` in v2) is independent and additive within the validation slice.
-
-**The specific deployment-time opt-in mechanism is left to the implementation** — values flags, build tags, separate Deployment manifests, env-var-gated controller registration, or any combination. What this LLD locks in is the architectural property: each validation controller's `SetupWithManager` call is conditional on a deployment-time signal, v2 actor controllers register identically, and nothing in the existing `SeiNodeReconciler` / `SeiNodeDeploymentReconciler` code path depends on the validation controllers being registered.
-
-Sketch (mechanism-agnostic):
-
-```go
-// cmd/main.go
-if cfg.Validation.Enabled {
-    if err := (&validationcontroller.ValidationOrchestrationReconciler{...}).SetupWithManager(mgr); err != nil {
-        setupLog.Error(err, "Failed to create controller", "controller", "ValidationRun-Orchestration")
-        os.Exit(1)
-    }
-
-    if cfg.Validation.LoadGenerationEnabled { // default-on within validation
-        if err := (&validationcontroller.ValidationLoadGenerationReconciler{...}).SetupWithManager(mgr); err != nil {
-            setupLog.Error(err, "Failed to create controller", "controller", "ValidationRun-LoadGeneration")
-            os.Exit(1)
-        }
-    }
-
-    // v2 actor controllers register identically with their own opt-in signals.
-    // Until v2 ships, attempting to enable Sequence or Chaos errors at startup.
-}
-```
-
-The "v2 expansion is purely additive" claim becomes a deployment-config-level invariant: adding `ValidationSequenceReconciler` is a new conditional `SetupWithManager` block plus an opt-in signal, with admission un-rejecting `spec.sequence` and a new `status.plans.sequence` slot. Existing v1 controllers (`ValidationOrchestrationReconciler`, `ValidationLoadGenerationReconciler`) do not change.
-
-## Observability
-
-Per OTel Round 1, with cardinality discipline. Metrics are emitted by both reconcilers; the controller name in metric labels distinguishes them.
-
-### Controller metrics (emitted from the reconciler binary)
-
-| Instrument | Type | Labels | Cardinality | Owner |
-|---|---|---|---|---|
-| `validationrun_phase_transitions_total` | Counter | `from`, `to`, `reason` | ~200 series | ValidationOrchestrationReconciler |
-| `validationrun_active` | UpDownCounter | `phase` | 5 series | ValidationOrchestrationReconciler |
-| `validationrun_duration_seconds` | Histogram (`30,60,300,600,1800,3600,7200`) | `terminal_phase`, `actor` | bounded | ValidationOrchestrationReconciler |
-| `validation_rule_evaluation_duration_seconds` | Histogram (`0.05,0.1,0.5,1,5,10,30`) | `type`, `verdict` | 12 series | ValidationOrchestrationReconciler |
-| `validation_rule_evaluations_total` | Counter | `type`, `verdict`, `reason` | bounded | ValidationOrchestrationReconciler |
-| `validation_prometheus_query_errors_total` | Counter | `endpoint`, `error_type` | ≤8 series | ValidationOrchestrationReconciler |
-| `validationrun_loadgen_jobs_terminal_total` | Counter | `exit_code` (0/1/2/timeout) | 4 series | ValidationLoadGenerationReconciler |
-| `validationrun_loadgen_active_jobs` | UpDownCounter | (none) | 1 series | ValidationLoadGenerationReconciler |
-| `validation_run_terminal_total` | Counter | `namespace`, `name`, `verdict` | **per-run** — see note below | ValidationOrchestrationReconciler |
-
-`actor` label values: `load`, `sequence`, `chaos`, `rules-only` (and combinations joined by `+`, e.g., `load+rules`). Bounded at ~8 distinct values.
-
-**Note on `validation_run_terminal_total` cardinality.** `name` is unbounded and would normally be a hard reject. Exception: this is the heartbeat-alert input (per platform-engineer), which queries `increase(... [24h])`. The cardinality is bounded by the alerting path's retention. If/when Prometheus pressure shows, drop `name` and switch heartbeat to a `validation_run_terminal_total{namespace, verdict}` aggregate. Document the upgrade path; ship the loud version first.
-
-**Note on `mode` label.** Round 1's OTel sketch carried a `mode` label on rule-evaluation metrics. v1 has no `mode` field on the wire (continuous is the only behavior), so the label is dropped. Re-add with care for cardinality if a future mode lands.
-
-**Labels rejected (cardinality bombs):** `chain_id`, `image_sha`, `run_id`. These go into trace attributes and structured log fields, not metrics.
-
-### Traces
-
-One span per reconcile per controller (`controller.kind=ValidationRun, controller.name=validationrun-orchestration|loadgeneration` + standard controller-runtime attributes). Child span per planner task. `monitor-task-completion`'s per-iteration loop emits one span per Prometheus query. Use `otelhttp.NewTransport` on the Prometheus client (single-flight client lives in `internal/monitoring/`). Inject `traceID`/`spanID` into structured logs.
-
-### Structured-report aggregation
-
-**S3-only in v1.** No Pushgateway, no remote_write of report metrics, no `.status.report.raw`. Reasons (per OTel Round 1):
-
-- Pushgateway is the wrong tool for batch-job metrics (Prometheus docs explicitly warn).
-- The report shape isn't yet stable across `seiload` and `qa-testing`; locking before two real consumers is a trap.
-- Cardinality of `{run_id, image_sha, profile}` is the bomb the controller avoids.
-- `.status.report.raw` (a CEL-capped echo of the termination message) was cut at the Round 2 gate. Consumers fetch from `status.report.s3Url`; the URL is programmatically derivable from the bucket layout.
-
-**Un-defer when:** ≥2 distinct report consumers want the same numeric AND a stable report-schema field has shipped. Add a separate `ReportExporter` controller that subscribes to terminal Runs and emits curated low-cardinality metrics. *Not* the Run controllers' job.
-
-## RBAC and tenancy
-
-### Controller-side ClusterRole (generated from kubebuilder markers)
-
-The two controllers share one ClusterRole — they live in one binary and one ServiceAccount. Markers go on each reconciler's package, but verbs are unioned at generation time.
-
-```go
-// Both controllers
-// +kubebuilder:rbac:groups=validation.sei.io,resources=validationruns,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=validation.sei.io,resources=validationruns/status,verbs=get;update;patch
-// +kubebuilder:rbac:groups=validation.sei.io,resources=validationruns/finalizers,verbs=update
-// +kubebuilder:rbac:groups=validation.sei.io,resources=validationsuites;validationschedules,verbs=get;list;watch
-// +kubebuilder:rbac:groups=validation.sei.io,resources=validationsuites/status;validationschedules/status,verbs=get
-// ValidationOrchestrationReconciler
-// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups=sei.io,resources=seinodedeployments/status,verbs=get
-// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch
-// +kubebuilder:rbac:groups="",resources=namespaces,verbs=get;list;watch
-// +kubebuilder:rbac:groups=monitoring.coreos.com,resources=prometheusrules,verbs=get;list;watch
-// ValidationLoadGenerationReconciler
-// +kubebuilder:rbac:groups=batch,resources=jobs,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
-// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch
-// +kubebuilder:rbac:groups="",resources=pods/log,verbs=get
-// +kubebuilder:rbac:groups="",resources=serviceaccounts,verbs=get;list;watch
-// Both
-// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
-```
-
-ValidationLoadGenerationReconciler deletes Jobs (cooperative cancel for stop-on-failure and run-duration timeout); `delete` verb on `batch/jobs` is load-bearing. SNDs are deleted only via cascade on Run delete — controller-initiated SND delete is reserved for finalizer cleanup.
-
-**Not granted:** `secrets` write, `rbac.*`, `eks.*` (pod-identity AWS calls), **`pods/exec`** (one-way-door — see Resolved decisions). Per platform-engineer: controller never mints SAs or PIAs.
-
-### Tenant-side ClusterRole bundle (shipped in controller's manifests; bound per-namespace by tenants)
-
-- `validation-tenant-author`: full CRUD on `validation.sei.io/*` + `seinodedeployments` + view on rule-eval status.
-- `validation-tenant-viewer`: get/list/watch on validation kinds only.
-
-### CEL admission rules (CRD-level)
-
-- Same-namespace enforcement on every reference except `alert.ruleRef.namespace` (which is allowed cross-namespace into label-allowlisted namespaces only — checked at admission via webhook because CEL can't see namespace labels).
-- `chain.deployments[role=fullNode].spec.genesis` is forbidden (full nodes inherit; CEL on the list-map).
-- `chain.deployments` must contain exactly one `role=validator` entry and ≥1 `role=fullNode` entry (CEL on the list-map).
-- `chain.deployments[].spec.template.spec.{validator|fullNode}` if set must align with role; if unset, controller injects `{}`.
-- Reserved env-var names rejected on `spec.load.workload.env` via the `WorkloadSpec` `XValidation` rule.
-- Rule names unique within `spec.rules` (list-map convention).
-- At least one of `spec.load`, `spec.sequence`, `spec.chaos` must be set OR `spec.rules` non-empty.
-- `spec.sequence` and `spec.chaos` admission-rejected in v1.
-- Spec mutation: substantive fields immutable on a Run after creation (CEL: `self == oldSelf` on `spec.chain`, `spec.load`, `spec.rules`, `spec.sequence`, `spec.chaos`). Only metadata, finalizers, and status mutate.
-
-### Workload SA convention
-
-Tenant pre-provisions SA named `{namespace}-runner` (override via `spec.load.workload.serviceAccountName`). ValidationLoadGenerationReconciler validates existence at apply-job; missing SA terminates with `Reason=ServiceAccountMissing`. Controller never creates SAs.
-
-## IRSA / Pod Identity / S3
-
-Per platform-engineer Round 1:
-
-- Pod Identity Association lifecycle: Terraform, **per-namespace**, never per-Run. Adding a tenant = ~30-line Terraform block in `terraform/aws/.../harbor/validation.tf`.
-- One IAM Role per namespace → one S3 prefix.
-- **Bucket layout LOCKED:**
-  ```
-  s3://harbor-validation-results/{namespace}/{job}/{runId}/
-    report.log
-    report.html        # qa-testing mochawesome
-    report.junit.xml
-    termination-message.json
-    rules/{ruleName}.json
-  ```
-  `{job}` is `spec.load.workload.name` (or `"rules-only"` for rules-only Runs that skip the workload). New components only by *appending* — no new path levels.
-- Workloads upload via Pod Identity (already wired for nightly). Controller never writes to S3, never reads from S3, only computes the URL.
-
-## Migration: nightly GHA → ValidationRun
-
-Per platform-engineer Round 1:
-
-- `clusters/harbor/nightly/` Flux config is **structurally unchanged**. Same namespace, SAs, RBAC, PodMonitors. Pod labels (set by controller from `spec.load.workload.podTemplate` plus chain-id + run-id + deployment-name) keep selector compatibility.
-- `.github/workflows/k8s_nightly.yml` shrinks ~50%: each `kubectl wait`/`kubectl exec`/`envsubst`/`kubectl logs`/`aws s3 cp` step becomes a controller plan task (split across the orchestration plan and the loadgeneration plan). The workflow becomes:
-  1. `cat <<EOF | envsubst | kubectl apply -f -` (one ValidationRun manifest)
-  2. `kubectl wait --for=condition=Succeeded validationrun/${RUN_ID} --timeout=2h`
-  3. Read `.status.report.s3Url` and `.status.verdict` for the GHA summary; pull the report artifacts from S3 directly.
-- The race against `sei-chain` CI moves into the controller's `ensure-chain` task.
-- `templates/seinodedeployment.yaml` and `templates/seiload-job.yaml` retire (subsumed into `ValidationRun.spec`). Both fleets (validators and fullNodes) lift into `chain.deployments[]` directly — the prior YAML's two SND templates become two list entries.
-
-Migration is opt-in per tenant: nightly can flip to ValidationRun while qa-testing still runs Phase 1's GHA-orchestrated bash, and vice versa. The Phase 1 contract guarantees the workload manifests remain identical.
-
-The plan-task labels in this migration narrative align with the new task split:
-
-| Old workflow step | New plan task | Plan slot |
-|---|---|---|
-| `kubectl apply -f seinodedeployment.yaml` (×2) | `ensure-chain` | orchestration |
-| `kubectl wait --for=condition=Ready snd/...` | `wait-chain-ready` | orchestration |
-| `kubectl get services -l sei.io/nodedeployment=...` | `resolve-endpoints` | orchestration |
-| (gate flip — implicit in old workflow) | `mark-test-running` | orchestration |
-| `envsubst \| kubectl apply -f profile.cm.yaml` | `render-config` | loadgeneration |
-| `kubectl apply -f seiload-job.yaml` | `apply-job` | loadgeneration |
-| `kubectl wait --for=condition=complete job/...` + `kubectl logs` tail | `wait-job-terminal` | loadgeneration |
-| (rule polling — new) | `monitor-task-completion` | orchestration |
-| `aws s3 cp` summary upload (workload-side via Pod Identity) | URL stamped at `finalize` | orchestration |
-
-## Six open questions from #139 — answers
-
-| # | Question | Decision | Source |
-|---|---|---|---|
-| 1 | Embed workload spec, or reference a separate `ValidationWorkload` CR? | **Embed.** Reserve `spec.runRef` field name only. | PM Round 1 Q1 |
-| 2 | Probe `mode` enum: 3-mode, 5-mode, or defer? | **Defer.** v1 has no `mode` field; only behavior is continuous polling. | PM Round 1 Q2 + user input |
-| 3 | `ValidationSuite`'s own scheduler vs `ValidationSchedule`-only? | **Moot in v1** — both deferred. Schedule is the lever; Suite is a child orchestrator. | PM Round 1 Q3 |
-| 4 | S3 upload: top-level field, sidecar, or controller-side? | **Workload-side via Pod Identity.** Controller stamps URL only. | PM Round 1 Q4 + Phase 1 contract |
-| 5 | Kueue integration day 1? | **No.** Defer until ≥3 concurrent suites contend. | PM Round 1 Q5 |
-| 6 | Numeric vs stringified result aggregation? | **Stringified.** Adding numerics later is additive; removing them is breaking. | PM Round 1 Q6 + OTel one-way doors |
-
-The Round 3 architectural pass also resolved one unstated question: **how to orchestrate "chain bring-up + workload + rules" without a barrier task**. Answer: predicate-gated event delivery on a second sub-controller; `Conditions[TestRunning]` is the gate. (See Resolved decisions #19 and #20.)
-
-## Five one-way-door warnings from OSS survey — how design avoids each
-
-1. **Don't bake workload I/O abstractions into the Run (Tekton's `PipelineResource`).** ValidationRun has no typed I/O abstraction. The workload contract is env vars + exit codes + termination message + S3 — text-shaped, not typed Kubernetes objects. Avoided.
-2. **Don't lowercase your phase enum (Testkube's `passed`/`failed`).** All phase and verdict enums are Pascal-case. Tekton/Argo precedent. Avoided.
-3. **Don't pick a kind name twice (K6's `K6→TestRun`, Litmus's `Experiment→Fault`).** ValidationRun / ValidationSuite / ValidationSchedule chosen with explicit OSS-precedent alignment. Composable-blocks shape (`load`, `sequence`, `chaos`) absorbs new test shapes inside one kind name. Avoided.
-4. **Don't proliferate per-workload-type CRDs (Testkube's `TestWorkflow` consolidation).** One CRD with composable optional blocks. Per-runner-type kinds explicitly rejected. Avoided.
-5. **Don't orchestrate complex DAGs in your suite CR (Litmus outsourced to Argo).** `ValidationSuite` (when implemented) is flat — sequential or parallel, with `stopOnFailure`. Users needing DAG wrap suites in Argo. Documented as non-goal. Avoided.
-
-## Resolved one-way-door decisions
-
-These are decisions baked into the LLD that the council orchestrator surfaced for explicit user approval. Round 2 closed 2026-04-28; the Round 3 architectural-refinement pass added rows 19–26 and modified rows 3, 4, 18 on 2026-04-29.
-
-| # | Decision | Rationale | Reverse cost |
-|---|---|---|---|
-| 1 | API group **`validation.sei.io`** (separate from `sei.io`) | Validation versions independently of node-platform CRDs | High — group rename = full migration |
-| 2 | **Composable optional actor blocks** (`spec.load`, `spec.sequence`, `spec.chaos`) **instead of `spec.type` discriminator** | Real test shapes mix actors; discriminator was a one-way door | High — schema migration; v1 ships composable from start |
-| 3 | **`chain.deployments[]` list of named SeiNodeDeployment configs** with `role: validator\|fullNode` | Generalizes the "validator + fullNodes" topology; absorbs heterogeneous fullNode fleets without new top-level fields | High — list shape rename = stored-object migration |
-| 4 | **Exactly one `role: validator` deployment, ≥1 `role: fullNode` deployment.** Workload always connects to fullNode-role endpoints. **No `endpointPolicy` knob — permanently dropped.** | Two-fleet topology codifies what both real consumers run; one validator + N fullNodes generalizes; removing endpointPolicy eliminates a one-way enum | Medium — adding a validator-only or selectable mode later requires new shape |
-| 5 | **`Failed` vs `Error` distinction** at phase level (heartbeat ignores Error) | Two terminal failure modes carry different operational signals | Low (additive) |
-| 6 | **Bucket prefix `s3://harbor-validation-results/{namespace}/{job}/{runId}/`** locked verbatim | Phase 1 already wires this | High — bucket migration |
-| 7 | **No Flux labels on owned children** (controller invariant) | Flux would prune controller-managed objects | High — Flux drift incident |
-| 8 | **Tenant pre-provisions SAs**; controller never IAM | Stay off the IAM control plane | High — auth model creep |
-| 9 | **`alert.ruleRef` cross-namespace** allowed only into label-allowlisted namespaces | Multi-tenancy by namespace; controller-mediated escape hatch | Low (additive) |
-| 10 | **Reserved env-var names rejected at admission via CEL XValidation** on `WorkloadSpec.env` | Hard reject is deterministic; avoids the silent-override surprise | Low (additive — names list extends additively) |
-| 11 | **`query.threshold` typed as `string`** with regex CEL | OTel Round 1 schema-typing; switch numeric→string is breaking | High |
-| 12 | **No `mode` field on `ValidationRule` in v1** | Avoids locking in an enum value before semantics solidify | Low (additive) |
-| 13 | **`runProperties.interval` (default 30s, min 5s, max 5m)** is load-bearing for continuous polling | Polling cadence per rule | Low (additive — bounds may relax) |
-| 14 | **`runProperties.stopOnFailure` implemented in v1** (default `false`); orchestrator sets `Conditions[TestCancelled]=True` and actor reconcilers halt cooperatively | Real fast-fail path; no barrier task needed | Low (additive — semantics expansion) |
-| 15 | **Five conditions: `TestRunning`, `TestComplete`, `Succeeded`, `TestCancelled`, `LoadComplete`** with single-writer field-manager isolation | Gate-flip + finalize + Tekton-style pass/fail + cross-controller cancel signal + actor-side complete signal | Low (additive — extra conditions stay additive) |
-| 16 | **`status.report.raw` cut entirely.** S3 URL is the only artifact pointer in `.status`. | Bounded `.status` size; URL is programmatically derivable | Low (additive — re-add capped raw if a consumer demands without S3 access) |
-| 17 | **Spec immutability** (CEL `self == oldSelf` on `spec.chain`, `spec.load`, `spec.rules`, `spec.sequence`, `spec.chaos`) | Re-runs are new ValidationRun resources | Medium — relaxation requires versioning |
-| 18 | **Two plans per Run, partitioned across `.status.plans.<controllerName>` slots.** Orchestration plan: `ensure-chain → wait-chain-ready → resolve-endpoints → mark-test-running → monitor-task-completion → finalize` (6 tasks). LoadGeneration plan: `render-config → apply-job → wait-job-terminal` (3 tasks). | Each controller owns its own plan; no cross-controller plan mutation; v2 actor controllers append plan slots additively | Medium — collapsing back to one plan would require status migration AND re-introducing a barrier task |
-| 19 | **Two cooperating sub-controllers in one binary** (`ValidationOrchestrationReconciler` required when validation is enabled, `ValidationLoadGenerationReconciler` opt-in default-on within the validation slice). v2 actor controllers register identically. | Each controller is independently testable, opt-out-able, and v2-extensible. Replaces the prior single-controller-with-7-tasks design. | High — unwinding to one controller would re-introduce the cross-actor-orchestration problem composable blocks are designed to absorb |
-| 20 | **Predicate-gated event delivery** on `ValidationLoadGenerationReconciler` (`spec.load != nil AND Conditions[TestRunning]=True`). No barrier task; controller-runtime native gating. | Idiomatic; survives controller restart; no synchronization primitive required | Low (additive — predicate refines additively) |
-| 21 | **Field-manager isolation per controller** (`validationrun-orchestration`, `validationrun-loadgeneration`). Single-writer table enforces no two controllers ever write the same condition or status field. | SSA preserves both partitions correctly; concurrent writes are conflict-free by construction | Low (additive — adding fields keeps isolation) |
-| 22 | **`TaskPlan.TargetPhase` ignored by both ValidationRun controllers** (typed as `SeiNodePhase`, doesn't fit ValidationRunPhase). Phase transitions owned exclusively by orchestration plan's `finalize` task. | Avoids forcing a `TaskPlan` schema migration in v1; phase logic centralizes in one place | Medium — if v2 needs typed TargetPhase across actor plans, file the SND/SeiNode-side common-types update |
-| 23 | **Deployment-time opt-in pattern** for validation controllers; specific mechanism (values flags, build tags, env vars, separate Deployment manifests) deferred to implementation. v1 binary errors at startup if `sequence` or `chaos` is signaled enabled. SeiNode / SeiNodeDeployment controllers are the always-on default; the validation slice is layered on top. | Makes "v2 expansion is additive" a deployment-config-level invariant; node operators running production validators don't carry validation machinery | Low (additive — new actor opt-in signals append) |
-| 24 | **`pods/exec` REJECTED for v2 sequence steps.** When ValidationSequenceReconciler ships, sequence steps that submit chain transactions do so via short-lived `Job`s targeting RPC, NOT via `kubectl exec` into validator pods. Controller's RBAC will not include `pods/exec` ever. | `pods/exec` is a privileged escape hatch with audit-trail and security implications. Job-based sequence steps reuse the same Pod Identity / RBAC story already established. | Very High — adding `pods/exec` later requires a separate security review and an explicit RBAC promotion |
-| 25 | **Chaos plan namespace-label gate (v2).** When ValidationChaosReconciler ships, chaos-mesh CRDs in tenant namespaces require namespace label `sei.io/chaos-allowed=true` (admission-webhook-enforced) AND controller built with `--enable-chaos-plan` (compile-time opt-in for v1 deployments). | Chaos is a privileged operation; namespace-label is the multi-tenancy gate; compile-time flag prevents accidental enablement | Low (additive — relaxing the label gate is an additive admission-rule change) |
-| 26 | **`status.failedPlan` field reserved in v1, populated by orchestration's `finalize` task.** Stamps `"loadGeneration"` (or `""` for orchestration's own ruling) on terminal Failed/Error in v1. v2 reconcilers stamp `"sequence"` / `"chaos"` without schema churn. | 2am on-call sees at a glance which actor caused the run to fail; v2-additive | Low (additive — new actor names append) |
-
-## Open dependencies
-
-These are companion sub-issues the LLD discovers. **None block v1 ValidationRun.**
-
-1. ~~**SND-readiness includes catching_up.**~~ **Resolved — already implemented.** The `seid:RPC_PORT/lag_status` readiness probe wired at `internal/noderesource/noderesource.go:366-377` (probing sei-tendermint's threshold-configured endpoint) gates Pod readiness on chain catch-up; SND `phase=Ready` waits via `ConditionNodesReady` aggregation. `sei-protocol/sei-k8s-controller#144` closed-as-already-implemented. Original draft of this LLD specified a multi-day seictl sidecar `/health` extension based on incorrect investigation; the existing seid-side mechanism is functionally equivalent and architecturally cleaner (no sidecar↔seid hop).
-2. **`TaskPlan.TargetPhase` typing.** v1 sidesteps by leaving the field empty in both ValidationRun controllers' plans and centralizing phase transitions in `finalize`. Re-evaluate when v2 lands a third actor plan that wants phase-shape parity with SND/SeiNode plans.
-3. ~~**SND inline `genesis` rejection on full-node-role SND.**~~ **Resolved by `sei-protocol/sei-k8s-controller#148`.** CEL XValidation rule on `SeiNodeDeploymentSpec` rejects `genesis` when `template.spec.validator` is unset — covers fullNode/archive/replayer roles, broader than the original ValidationRun-side framing because the SND has a 4-role discriminator. Closes #145.
-4. ~~**PodMonitor for `sei-k8s-controller` itself.**~~ **Resolved — already implemented.** The upstream `sei-k8s-controller/config/monitoring/service-monitor.yaml` (a **ServiceMonitor**, not PodMonitor — functionally equivalent) is pulled in transitively via the platform's `clusters/harbor/sei-k8s-controller/kustomization.yaml → config/default → config/monitoring/`. Verified via `kubectl kustomize` rendering both the metrics Service and the ServiceMonitor. `sei-protocol/platform#243` closed-as-already-implemented. Original draft was based on incorrect investigation — same pattern as Open Dependency #1.
-5. **Heartbeat PrometheusRule.** One global `clusters/harbor/monitoring/alerts/validation-heartbeat.yaml` querying `validation_run_terminal_total`. Per platform-engineer Round 1 — generic across tenants. Tracked at `sei-protocol/platform#244`; **deferred until the ValidationRun controller ships and emits the input metric.**
-6. ~~**`validation` shared-rules monitoring namespace label.**~~ **Resolved by `sei-protocol/platform#269`.** `monitoring/` namespace now carries `sei.io/validation-shared-rules=true`. Closes platform#245.
-7. **Shard env-var injection mechanism.** `SHARD_INDEX`/`SHARD_COUNT` per pod via Indexed-Job (`spec.completionMode=Indexed`) with `JOB_COMPLETION_INDEX` mapped to `SHARD_INDEX` via env-from-fieldRef. Already a Phase 1 contract reservation.
-8. **Controller registration opt-in plumbing.** Deployment-time mechanism (values flags, env vars, build tags, separate Deployment manifests, or any combination) for gating each validation controller's `SetupWithManager` call. Specific mechanism TBD — not shipping a Helm chart in v1; lands alongside the v1 controller manifest delivery.
-
-## Future work
-
-Each future item names the trigger that un-defers it.
-
-| Item | Un-defer trigger |
-|---|---|
-| **`ValidationSequenceReconciler` sub-controller** | First Cosmos-SDK upgrade test that needs ordered state changes. Lands as additive sub-controller; deployment-time opt-in signal; admission un-rejects `spec.sequence`; new `status.plans.sequence` slot. Sequence steps submit via short-lived Jobs (not `pods/exec`). |
-| **`ValidationChaosReconciler` sub-controller** | First chaos-mesh-driven failure-injection test. Lands as additive sub-controller; deployment-time opt-in signal + compile-time `--enable-chaos-plan` flag + tenant namespace label `sei.io/chaos-allowed=true`. New `status.plans.chaos` slot. |
-| `ValidationSuite` reconciler | First multi-Run flow that one human launching kubectl can't drive (≥3 chained Runs in CI). |
-| `ValidationSchedule` reconciler | Tenant wants to retire their GHA cron. Track first request. |
-| `container` rule type | Real heuristic ships outside Prometheus (e.g., balance reconciliation across two RPCs). |
-| `window-end` rule mode (single instant query at run end) | Use case where continuous polling cost is undesirable AND the result is only meaningful at run end. Re-add `mode` field with `continuous` default. |
-| `edge` rule mode (start-vs-end snapshot) | Start-vs-end snapshot use case. |
-| **`status.failedPlan` populated in ValidationOrchestrationReconciler when v2 actor plans land** | Already reserved in v1 schema; v1 stamps `"loadGeneration"` only. v2 reconcilers stamp `"sequence"` / `"chaos"` for at-a-glance fault attribution. |
-| `ReportExporter` controller (Pushgateway-shaped) | ≥2 consumers want the same numeric AND a stable report-schema field ships. |
-| Kueue admission control | ≥3 simultaneous suites contending; Prometheus shows queueing. |
-| `spec.runRef` (separate `ValidationDefinition` CR) | ≥10 distinct Runs sharing a workload definition; ConfigMap reuse insufficient. |
-| Per-PR `ValidationTrigger` event-driven CR | CI cost budget approved; Testkube-style triggers explicitly demanded. |
-| Numeric `actualValue`/`threshold` typed fields on rule status | Regression-detection consumer ships and benefits from type safety. |
-| Multi-workload composition in one Run | Cross-workload barrier/sync requirements emerge. |
-| Live mid-run RPC fleet refresh | HPA on RPC fleet during a run becomes desired. |
-| Validator-only chain topology (no fullNode-role deployments) | Consensus-poking workload demands it AND ≥2 consumers justify the operational cost of optional-fullNodes mode. |
-
-**v2 expansion is purely additive — by design.** Adding a sequence or chaos sub-controller does not refactor ValidationOrchestrationReconciler or ValidationLoadGenerationReconciler. The integration touch-points are: (a) new deployment-time opt-in signal, (b) admission un-rejects the corresponding `spec.<actor>` block, (c) new `status.plans.<name>` slot, (d) new condition `<Actor>Complete` written by the new controller via its own field manager, (e) the new controller's predicate watches `Conditions[TestRunning]=True` exactly like ValidationLoadGenerationReconciler does. No existing controller code changes.
-
-### Argo Workflows re-evaluation triggers
-
-The /coral subcouncil unanimously decided to ship the custom CRD over Argo Workflows for v1: the per-Run cost of a custom controller is small, and the surface area Argo would absorb (chain-readiness orchestration, structured rule verdicts, S3 URL stamping, stop-on-failure semantics) is the surface area the validation domain wants to own first-class. The Argo conversation should re-open under any of the following triggers:
-
-- **Argo Workflows lands on Harbor for unrelated reasons** (data pipelines, CI consolidation). Marginal cost of a `WorkflowTemplate` + `CronWorkflow` drops to ~zero; `ValidationSchedule` becomes a free wrapping layer.
-- **A fourth distinct test-shape with real DAG needs lands** (chaos-mesh-driven sequences, multi-cluster fanouts). DAG-shaped orchestration is what Argo is genuinely better at.
-- **`ValidationSuite` needs DAG semantics, not flat sequential/parallel orchestration.**
-- **300-pod polling cost becomes a non-issue (i.e., we drop continuous mode and revert to window-end only).** The custom controller's main value-add over Argo is the per-rule polling loop with stop-on-failure; if the polling story collapses to a single instant query, Argo's `Workflow` shape becomes broadly competitive.
-
-If any of these triggers fires, the conversation re-opens with a fresh /coral pass, not a unilateral pivot.
-
-## References
-
-### Round 0 + Round 1 + Round 2 + Round 3 artifacts (this council cycle)
-
-- `sei-protocol/sei-k8s-controller#139` — entry brief, OSS survey, six open questions, five one-way-door warnings (`/tmp/validationrun-issue-body.md`)
-- `sei-protocol/platform#235` — Phase 1 workload contract issue (`/tmp/phase1-issue-body.md`)
-- Round 1 PM scope cuts (`/tmp/round1/pm-scope-cuts.md`)
-- Round 1 OTel rule semantics + observability (`/tmp/round1/otel-rule-semantics.md`)
-- Round 1 platform-engineer Harbor integration (`/tmp/round1/platform-integration.md`)
-- Round 2 user gate decisions (closed 2026-04-28) — encoded in the "Resolved one-way-door decisions" table rows 1–18
-- Round 3 architectural-refinement pass (closed 2026-04-29) — encoded in rows 2, 3, 4, 18, 19–26 above
-
-### In-repo (sei-k8s-controller)
-
-- `CLAUDE.md` — controller patterns, RBAC marker convention, plan-driven reconciler
-- `internal/planner/doc.go` — plan lifecycle, condition ownership, single-patch model
-- `internal/planner/{planner.go,group.go,full.go,validator.go,replay.go,executor.go}` — existing builder idiom; `validationrun_orchestration.go` and `validationrun_loadgen.go` will follow the same pattern
-- `internal/task/{await_nodes_running.go,deployment.go,observe_image.go,task.go}` — controller-side task pattern, deserialize-map registration
-- `api/v1alpha1/{seinode_types.go,seinodedeployment_types.go,common_types.go,validator_types.go,full_node_types.go,replayer_types.go}` — existing CRD type idioms (CEL XValidation, kubebuilder markers, listType=map listMapKey)
-- `cmd/main.go` — controller registration pattern (the new `ValidationOrchestrationReconciler` and `ValidationLoadGenerationReconciler` register here, the latter behind `cfg.LoadGenerationEnabled`)
-- `docs/design/composable-genesis.md` — LLD style template
-
-### In-repo (platform)
-
-- `clusters/harbor/nightly/templates/seinodedeployment.yaml` — the working two-SND topology this design generalizes into `chain.deployments[]`
-- `.github/workflows/k8s_nightly.yml` — bash glue this CRD subsumes
-
-### OSS survey — direct precedents (only patterns cited above)
-
-- Tekton TaskRun docs — `Succeeded` condition shape — https://github.com/tektoncd/pipeline/blob/main/docs/taskruns.md
-- Tekton TEP-0074 — PipelineResource deprecation (one-way-door warning) — https://github.com/tektoncd/community/blob/main/teps/0074-deprecate-pipelineresources.md
-- Argo CronWorkflow — `ValidationSchedule` shape — https://argo-workflows.readthedocs.io/en/latest/cron-workflows/
-- Litmus probes overview — `ValidationRule` analog (interval, runProperties) — https://docs.litmuschaos.io/docs/concepts/probes
-- K6 Operator TestRun — execution-segment sharding via env injection — https://github.com/grafana/k6-operator/blob/main/api/v1alpha1/testrun_types.go
-- Chaos Mesh Schedule types — annotation-based pause pattern — https://github.com/chaos-mesh/chaos-mesh/blob/master/api/v1alpha1/schedule_types.go
-- Testkube TestExecution types — what NOT to do (lowercase enum, per-runner CRDs) — https://github.com/kubeshop/testkube-operator/blob/main/api/tests/v3/test_types.go
-- Sonobuoy plugins doc — done-file convention (referenced for future `container` rule type) — https://github.com/vmware-tanzu/sonobuoy/blob/main/site/content/docs/main/plugins.md
-- controller-runtime `WithEventFilter` predicate pattern — predicate-gated reconciliation idiom (used for ValidationLoadGenerationReconciler's `Conditions[TestRunning]` gate) — https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/predicate
diff --git a/docs/known-issues-node-alarms.md b/docs/known-issues-node-alarms.md
deleted file mode 100644
index 2f622d62..00000000
--- a/docs/known-issues-node-alarms.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Known Issues: Node Startup Alarms
-
-Recurring alerts observed during SeiNode and SeiNodeDeployment deployments. These are expected during the bootstrap lifecycle but represent areas for improvement.
-
-## 1. SeiNodeFailed: Shadow Replayer on Dev
-
-**Alert:** `SeiNodeFailed` for `pacific-1-shadow-replayer`
-**Environment:** dev
-**Severity:** critical (alert), expected during iteration
-
-**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode. _(Historical: `discover-peers` was a sidecar bootstrap task when this incident occurred; peering is now controller-owned via the config-apply `persistent_peers` override and is no longer a distinct task.)_
-
-**Root causes encountered:**
-1. **Pruned peers (resolved):** State-syncer EC2 nodes pruned blocks below the snapshot height (200440000). `configure-state-sync` queries peers for a block hash at the trust height and gets empty responses. Fix: use a snapshot at a height within peers' retention window.
-2. **Proposer priority divergence (intermittent):** One snapshotter node (`3.75.235.199`) returns different validator set proposer priorities than the other 5. When CometBFT picks divergent nodes as primary vs witness, state sync fails with "proposer priority hashes do not match."
-3. **Seid flat JSON format (fixed in PR #63):** `rpc.Client.Get()` assumed JSON-RPC envelopes but seid returns flat JSON. Peer discovery silently failed, reporting "no reachable peers."
-
-**Current mitigation:**
-- Use snapshot at height 200940000 (within state-syncer retention)
-- Use `Component: state-syncer` peers (consistent proposer priorities at recent heights)
-
-**Future improvements:**
-- Use same host for both primary and witness when `useLocalSnapshot: true` (state sync only needs trust hash verification, not independent witnesses)
-- Add peer fallback: try multiple peers for block hash queries instead of failing on first error
-- Automate snapshot pipeline to keep recent snapshots available
-
----
-
-## General Notes
-
-- **Failed nodes are terminal:** Once a SeiNode enters `PhaseFailed`, no further reconciliation occurs. The operator must `kubectl delete seinode <name>` and let the group controller recreate it (or recreate manually for standalone nodes).
-- **PVCs persist across node recreation:** Deleting a SeiNode does not delete its PVC (by default). A new node with the same name reuses the existing PVC, which may contain stale sidecar SQLite state. The sidecar's `rehydrateStaleTasks` handles stale `running` tasks, and the cloud-API `Submit` model handles stale `failed` tasks.
-- **Controller image matters:** The `/bin/sh` fix (PR #53), predicate fix (PR #58), and plan ID changes (PR #50) all require the controller image to be updated. Check `config/manager/manager.yaml` matches the latest ECR build.
diff --git a/docs/production-deployment-analysis.md b/docs/production-deployment-analysis.md
deleted file mode 100644
index 7271fc2f..00000000
--- a/docs/production-deployment-analysis.md
+++ /dev/null
@@ -1,166 +0,0 @@
-# Production Deployment Analysis: SeiNode CRD vs sei-infra Snapshotter
-
-## Current State: sei-infra Snapshotter (EC2/Terraform)
-
-The existing snapshotter is entirely EC2-based with no Kubernetes involvement:
-
-- **3 instances** (`m7i.8xlarge`) in `eu-central-1` running `seid v6.3.0`
-- **32 TiB EBS** storage (RAID0 across 5 disks)
-- **Weekly AMI snapshots** (Mondays 16:00 UTC via cron + `aws ec2 create-image`)
-- **S3 bucket**: `pacific-1-snapshots/state-sync/` for Tendermint snapshots
-- **DynamoDB** metadata table: `pacific-1_snapshot_metadata`
-- **ALB** with Route53 DNS (`rpc.sei-archive.pacific-1.seinetwork.io`)
-- **WAF** rate limiting (300 req/5min per IP)
-- **IAM role**: `pacific-1-snapshot-iam-role` with S3 + DynamoDB permissions
-
-### Bootstrap Flow
-
-1. `ec2_init.sh` downloads sei-infra from S3, runs `generate_configs.py`
-2. `mount_ebs_volumes.sh` sets up RAID0 or JBOD
-3. `post_init.sh` installs seid binary, configures, starts systemd service
-4. `init_configure.sh` runs `seid init`, fetches peers, sets pruning to `nothing`, enables SeiDB + OCC
-5. Crontab schedules `snapshot.sh` for weekly AMI creation
-
-### Snapshot Generation
-
-| Setting | Value |
-|---------|-------|
-| Cron | `0 16 * * 1` (Mondays 16:00 UTC) |
-| Mechanism | `snapshot.sh` → `aws ec2 create-image` (no-reboot) |
-| Retention | AMIs older than 30 days removed |
-| Metadata | DynamoDB with `last_update`, `height`, `imageName` |
-
-### State-Syncer (Snapshot Source)
-
-- 3 instances in `eu-central-1`
-- Snapshot interval: 4000 blocks
-- Script: `create_snapshot.sh` (halt-height loop, `seid tendermint snapshot`, keep 5 recent)
-- Peers: `primaryEndpoint: https://sei-rpc.polkachu.com`
-
----
-
-## SeiNode CRD: Production Snapshotter
-
-### Example Manifest (Archive Mode)
-
-```yaml
-apiVersion: sei.io/v1alpha1
-kind: SeiNode
-metadata:
-  name: pacific-1-snapshotter
-  namespace: sei-nodes
-spec:
-  chainId: pacific-1
-  image: "ghcr.io/sei-protocol/sei:v6.3.0"
-  sidecar:
-    image: ghcr.io/sei-protocol/seictl@sha256:8bfef078409c160f03c62fcd969702b3edc9d957369fb56dca9e34e09ac6c99a
-    resources:
-      requests:
-        cpu: "500m"
-        memory: "256Mi"
-  entrypoint:
-    command: ["seid"]
-    args: ["start", "--home", "/sei"]
-  storage:
-    retainOnDelete: true
-  archive:
-    peers:
-      - ec2Tags:
-          region: eu-central-1
-          tags:
-            ChainIdentifier: pacific-1
-            Component: state-syncer
-    snapshotGeneration:
-      keepRecent: 5
-      destination:
-        s3:
-          bucket: pacific-1-snapshots
-          prefix: state-sync/
-          region: eu-central-1
-```
-
-### Generated Kubernetes Resources
-
-| Resource | Details |
-|----------|---------|
-| StatefulSet | 1 replica, `seid` + `seictl` sidecar |
-| Service | Headless (`ClusterIP: None`), `PublishNotReadyAddresses: true` |
-| PVC | `data-{nodeName}`, StorageClass `gp3-10k-750`, 2000Gi for archive |
-| Init containers | `seid-init` (chain init), `sei-sidecar` (restartable) |
-
-### PlatformConfig (Controller Environment Variables)
-
-| Env Var | Default | Purpose |
-|---------|---------|---------|
-| `SEI_NODEPOOL_NAME` | `sei-node` | Karpenter NodePool for scheduling |
-| `SEI_TOLERATION_KEY` | `sei.io/workload` | Taint key to tolerate |
-| `SEI_TOLERATION_VALUE` | `sei-node` | Taint value |
-| `SEI_SERVICE_ACCOUNT` | `seid-node` | ServiceAccount for node pods |
-| `SEI_STORAGE_CLASS_PERF` | `gp3-10k-750` | StorageClass for full/validator/archive |
-| `SEI_STORAGE_CLASS_DEFAULT` | `gp3` | StorageClass for other modes |
-| `SEI_STORAGE_SIZE_DEFAULT` | `1000Gi` | PVC size for full/validator |
-| `SEI_STORAGE_SIZE_ARCHIVE` | `2000Gi` | PVC size for archive |
-| `SEI_RESOURCE_CPU_ARCHIVE` | `8` | CPU request for archive |
-| `SEI_RESOURCE_MEM_ARCHIVE` | `48Gi` | Memory request for archive |
-| `SEI_RESOURCE_CPU_DEFAULT` | `4` | CPU request for non-archive modes (full, validator, replayer) |
-| `SEI_RESOURCE_MEM_DEFAULT` | `32Gi` | Memory request for non-archive modes (full, validator, replayer) |
-| `SEI_SNAPSHOT_REGION` | `eu-central-1` | Default S3 region for snapshots |
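-
-As a rough illustration of how these defaults might be overridden, the sketch below is a strategic-merge patch that sets a few of the variables from the table on the controller Deployment. The Deployment name, namespace, and container name are assumptions (typical kubebuilder defaults, not confirmed by this document), and the values are illustrative rather than recommended sizes.
-
-```yaml
-# Hypothetical patch: metadata and container name are assumed, not taken from this repo.
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: controller-manager     # assumed Deployment name
-  namespace: sei-system        # assumed controller namespace
-spec:
-  template:
-    spec:
-      containers:
-        - name: manager        # assumed container name
-          env:
-            # Illustrative values only; size these for the actual workload.
-            - name: SEI_STORAGE_SIZE_ARCHIVE
-              value: "4000Gi"
-            - name: SEI_RESOURCE_CPU_ARCHIVE
-              value: "16"
-            - name: SEI_RESOURCE_MEM_ARCHIVE
-              value: "64Gi"
-```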
-
-### Resource Sizing by Mode
-
-| Mode | StorageClass | Size | CPU | Memory |
-|------|-------------|------|-----|--------|
-| archive | gp3-10k-750 | 2000Gi | 8 | 48Gi |
-| full, validator | gp3-10k-750 | 1000Gi | 4 | 32Gi |
-| replayer | gp3 | 1000Gi | 4 | 32Gi |
-
-### Hardcoded Values
-
-| Setting | Value |
-|---------|-------|
-| Data dir | `/sei` |
-| Default sidecar image | `ghcr.io/sei-protocol/seictl@sha256:...` |
-| Default sidecar port | 7777 |
-| Snapshot upload cron | `0 0 * * *` (daily at midnight) |
-| Snapshot interval | 2000 blocks (in config-apply) |
-
----
-
-## Gap Analysis: sei-infra vs SeiNode CRD
-
-| Aspect | sei-infra (EC2) | SeiNode CRD | Gap? |
-|--------|-----------------|-------------|------|
-| Compute | m7i.8xlarge (32 vCPU, 128 GiB) | 8 CPU, 48Gi | Tunable via PlatformConfig env vars |
-| Storage | 32 TiB EBS RAID0 | 2000Gi PVC (gp3-10k-750) | Need a larger PVC for a full archive |
-| S3 bucket | `pacific-1-snapshots` | Configurable per-node | No gap |
-| ALB/DNS | ALB + Route53 + WAF | Not managed by controller | External concern (Ingress/Gateway API) |
-| DynamoDB metadata | Snapshot metadata tracking | Not implemented | Gap -- could add to seictl |
-| AMI snapshots | EC2 AMI creation | N/A (Tendermint snapshots to S3) | Different approach, arguably better |
-| IAM | Instance profile | ServiceAccount + IRSA | Via PlatformConfig `SEI_SERVICE_ACCOUNT` |
-| Monitoring | Prometheus EC2 SD + alerts | Need ServiceMonitor/PodMonitor | External concern |
-| Snapshot verification | Dedicated verifier EC2 | Not implemented | Gap |
-| Multi-instance | 3 EC2 instances | 1 replica StatefulSet | Could create multiple SeiNodes |
-
-## What's Needed for Production
-
-### Already handled by the controller
-- Node lifecycle (bootstrap, init, running)
-- Snapshot generation and S3 upload
-- Peer discovery via EC2 tags
-- Genesis configuration (embedded or S3)
-- Config generation (pruning, state-sync intervals, etc.)
-- PVC provisioning and cleanup
-- Tolerations, affinity, service account assignment
-
-### Needs external setup
-- **Controller in prod Flux kustomization** (currently dev only)
-- **Prod PlatformConfig tuning** (storage sizes and CPU/memory requests for prod workloads)
-- **IRSA ServiceAccount** with S3 permissions in the node namespace
-- **Networking** (ALB/Ingress, DNS, WAF -- separate from controller)
-- **Monitoring** (ServiceMonitor for Prometheus, alerts ported from sei-infra)
-- **Storage sizing** -- the `SEI_STORAGE_SIZE_ARCHIVE` default of `2000Gi` is far below the 32 TiB that sei-infra provisions, so a full archive likely needs a much larger value
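-
-For the IRSA ServiceAccount item above, a minimal sketch of what the resource could look like on EKS follows. The name and namespace reuse the `SEI_SERVICE_ACCOUNT` default and the example manifest's namespace; the role ARN is a placeholder, and the IAM role and its S3 policy are assumed to be created separately.
-
-```yaml
-apiVersion: v1
-kind: ServiceAccount
-metadata:
-  name: seid-node            # matches the SEI_SERVICE_ACCOUNT default
-  namespace: sei-nodes       # namespace used in the example manifest
-  annotations:
-    # Placeholder ARN (hypothetical); substitute the IAM role that grants
-    # access to the snapshot bucket.
-    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-snapshotter-role
-```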
-
-### Future enhancements
-- Snapshot verification (automated post-upload check)
-- DynamoDB metadata tracking for snapshot catalog
-- Multi-replica support (if needed for HA/load distribution)