Skip to content

HYPERFLEET-827 - docs: Adapter Resource Lifecycle Management Design#158

Draft
rh-amarin wants to merge 1 commit into
openshift-hyperfleet:mainfrom
rh-amarin:HYPERLFEET-827
Draft

HYPERFLEET-827 - docs: Adapter Resource Lifecycle Management Design#158
rh-amarin wants to merge 1 commit into
openshift-hyperfleet:mainfrom
rh-amarin:HYPERLFEET-827

Conversation

@rh-amarin

@rh-amarin rh-amarin commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Design document for the adapter resource lifecycle management epic (HYPERFLEET-827), covering eight areas:

  • Lifecycle gates — resource-level when (debounce), lifecycle.apply.when (create-only), lifecycle.patch (conditional JSON
    merge patches), lifecycle.orphan.when (declarative resource disownment)
  • Automatic label stamping — framework injects hyperfleet.io/adapter and hyperfleet.io/resource-id on all managed resources at apply time
  • Resilience model — documents why inner retry is not implemented and how Available=False closes the Sentinel drift-check gap
  • Stuck detection — unified Reconciled=False-based metrics planned in HYPERFLEET-1205
  • Periodic execution — Sentinel drift-check as implicit scheduler; CEL timestamp arithmetic for finer-grained control
  • API-level deletion policy — restrict/cascade from HYPERFLEET-896; orphan case identified as requiring a follow-up proposal
  • Resource retention — historyLimit, ttl, label-based multi-version discovery, replace vs. retain mode for recreate
  • Ownership model — label-based sweep controller for API force-delete scenarios; K8s ownerReferences for native parent child cascade within an adapter

@openshift-ci openshift-ci Bot requested review from Mischulee and ldornele June 15, 2026 08:40
@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ciaranroche for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Summary by CodeRabbit

  • Documentation
    • Expanded adapter lifecycle management proposals with draft metadata (status, owner, last-updated) and clarified the end-to-end lifecycle design.
    • Added new guidance on lifecycle gates (apply/patch/orphan), periodic execution, and resilience/stuck detection during reconciliation.
    • Documented automatic hyperfleet.io/* label stamping, API-layer deletion policy semantics, adapter ownership model, and adapter resource retention/version pruning behavior.

Walkthrough

Eight linked design documents specify HyperFleet adapter lifecycle management: resource-scoped when defers apply/patch/recreate but exempts delete/orphan; lifecycle.apply.when gates creation; lifecycle.patch applies ordered, gated RFC 7396 JSON merge patches with mutual exclusivity vs standard apply; lifecycle.orphan.when strips hyperfleet.io/* labels, sets Finalized=True, preserving generation, with precedence over delete. Automatic fill-gaps-only label/annotation stamping injects standard keys at apply time via stampLabels() in resource_executor.go before ApplyResource(). Resilience delegates to Sentinel reconciliation (~10s convergence, ~30m drift-check); no per-resource retry; Available=False/Reconciled=False signal non-convergence. Stuck detection uses Reconciled=False metrics with is_delete discriminator, computed API-side at scrape time. API-level deletion policies (restrict, cascade) layer with label-sweep discovery (calls /resources/{id} API) and K8s ownerReferences cascade. Weak resource references allow deletion and atomically soft-delete referencing resource with Reconciled=False mark. Periodic execution gates operations via CEL on condition lastTransitionTime. Resource retention distinguishes replace (delete-before-create) from retain (accumulate versions) with multi-result discovery via hyperfleet.io/resource-id, pruning by historyLimit/ttl, and debouncing via when gate to limit rapid version creation. Orphan disownment takes precedence over deletion when multiple versions exist.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: a design document for adapter resource lifecycle management covering eight areas (gates, stamping, resilience, stuck detection, periodic execution, deletion policy, retention, ownership).
Description check ✅ Passed The description is directly relevant to the changeset, providing a structured summary of the eight design areas, linking related epic/issue IDs, and matching the eight documentation files added in the PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output ✅ Passed PR contains only markdown documentation files (.md). No source code files (.go, .ts, .py, etc.) with log statements present. Check is not applicable.
No Hardcoded Secrets ✅ Passed All 9 markdown files added by this PR contain zero hardcoded secrets, API keys, tokens, passwords, or private keys. Matches for "secret" reference K8s Secret resources or documentation examples (ce...
No Weak Cryptography ✅ Passed PR is documentation-only (9 markdown files, 989 lines added, 0 code changes). No cryptographic code present: zero imports of crypto/md5/des/rc4, no SHA1 for security, no weak primitives, no impleme...
No Injection Vectors ✅ Passed PR contains only markdown documentation files; no executable code, Go source files, or injection-vulnerable patterns (CWE-89/78/79/502) detected.
No Privileged Containers ✅ Passed PR contains only Markdown documentation (9 design docs, +849 lines). No Kubernetes manifests, Helm templates, Dockerfiles, or container configs present to evaluate.
No Pii Or Sensitive Data In Logs ✅ Passed PR contains only design documentation (9 .md files, 1049 lines total); no logging statements found. Check not applicable to documentation-only changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 775-784: The "Responsibilities & Trade-offs" section does not
follow the required component document format. Separate the current content into
two sections: keep responsibilities as a distinct section, then expand the
Trade-offs section with the required structured subsections: What We Gain
(benefits with measurable impact), What We Lose / What Gets Harder (costs and
capability losses), Technical Debt Incurred (debt items with description,
impact, and remediation plan), and Acceptable Because (justification for
MVP/timeline/scope). The current trade-off about no activeGenerationPolicy
should be incorporated into the expanded Trade-offs structure rather than left
as a brief paragraph.
- Around line 1-784: The design document lacks required Mermaid diagrams for
complex state machine and flow descriptions. Add three Mermaid diagrams to the
document: (1) a flowchart showing lifecycle gate evaluation order for the six
gates (resource-level `when`, `lifecycle.apply.when`, `lifecycle.patch`,
`lifecycle.orphan.when`, `lifecycle.recreate.when`, `lifecycle.delete.when`)
including precedence rules and conditional paths in section §1; (2) a flowchart
in section §7 illustrating how multi-version discovery selects the current
version by `resource-id` label and generation matching or timestamp fallback;
(3) a sequence diagram in section §8A showing the sweep controller flow:
discover resources with `hyperfleet.io/adapter` label, extract `resource-id`,
verify via API GET, and delete if 404 is returned. Each diagram should include
labels for conditional branches and include references to the relevant
configuration fields or API endpoints mentioned in the prose.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5ba0d011-7561-452c-8dc1-38bc610e8363

📥 Commits

Reviewing files that changed from the base of the PR and between 3bde2f1 and afee454.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment on lines +1 to +784
---
Status: Draft
Owner: HyperFleet Team
Last Updated: 2026-06-15
---

# Adapter Resource Lifecycle Management Design

**Jira**: [HYPERFLEET-827](https://issues.redhat.com/browse/HYPERFLEET-827) · [HYPERFLEET-1065](https://issues.redhat.com/browse/HYPERFLEET-1065)

## What & Whys

**What**: Add per-resource lifecycle configuration to the adapter framework so adapter authors can control what happens after initial resource creation: how resources are updated, how failures are handled, and how resources are cleaned up.

**Why**: The adapter framework creates real infrastructure (Kubernetes objects, Maestro ManifestWork resources) but treats each event as a fresh operation with no memory of prior state:
- No configurable update strategy — adapter authors cannot express "recreate on generation change" vs "apply in place" vs "never write once created"
- No declarative way to stop managing a resource — adapter authors cannot express "release this resource from HyperFleet management without deleting it"
- No retry for resource operations — a transient K8s API server error fails the entire event and depends on broker redelivery
- No persistent state — execution context is in-memory per event; no state survives across event executions
- No automatic label stamping — standard `hyperfleet.io/*` labels are convention, not enforcement, so sweep-based cleanup cannot be built reliably

**Related Documentation:**
- [Adapter Framework Design](./adapter-frame-design.md) — Core executor architecture
- [Adapter Recreation Flow Design](./adapter-recreation-flow-design.md) — Recreation flow specifics (HYPERFLEET-837)
- [Adapter Status Contract](./adapter-status-contract.md) — Status reporting patterns

### Scope

This design covers six proposals:

1. **Lifecycle gates** (`§1`) — four new per-resource CEL-gated operations following the existing `lifecycle.recreate.when` / `lifecycle.delete.when` pattern:
- resource-level `when` (`§1.1`) — evaluated before any lifecycle gate; if false, all modification actions (apply, patch, recreate) are skipped for that resource in the current event; primary use case is debouncing using CEL timestamp arithmetic
- `lifecycle.apply.when` (`§1.2`) — gate whether the apply step runs (create-only use case)
- `lifecycle.patch` (`§1.3`) — array of conditional JSON merge patches on an already-existing resource
- `lifecycle.orphan.when` (`§1.4`) — conditional resource disownment: strip `hyperfleet.io/*` management labels and report `Finalized=True` without deleting

2. **Automatic label and annotation stamping** (`§2`) — the framework injects standard `hyperfleet.io/*` labels on all adapter-created resources at apply time, making them consistently discoverable for tooling and sweep-based cleanup

3. **Resilience model** (`§3`) — documents why per-resource retry is not implemented in the adapter and how the Sentinel reconciliation cycle (convergence at ~10s, drift-check at 30 minutes) covers both failure cases, with `Available=False` closing the drift-check corner case

4. **Stuck detection** (`§4`) — unified reconciliation metrics at the API layer via [HYPERFLEET-1205](https://redhat.atlassian.net/browse/HYPERFLEET-1205), covering both deletion and create/update flows via `Reconciled=False` condition tracking

5. **Periodic execution** (`§5`) — explains how the Sentinel drift-check cycle provides implicit periodic execution at per-resource-type granularity, and how lifecycle CEL expressions and API condition timestamps enable finer-grained control within that cadence

6. **API-level deletion policy for owned resources** (`§6`) — documents the `OnParentDelete` descriptor-driven policy (`restrict` / `cascade`) from the Generic Resource Registry (HYPERFLEET-896), its interaction with adapter delete events, and the unimplemented `orphan` case where children are left in place when a parent is deleted

7. **Resource retention** (`§7`) — per-resource `retention:` configuration that governs how many historical versions of a resource accumulate and for how long; enables recreate-with-history semantics (versus the default delete-before-create replace mode), multi-result label-based discovery, and debounce protection against rapid version creation

8. **Ownership model** (`§8`) — two complementary mechanisms: label-based ownership with a sweep controller that detects and removes resources orphaned by API force-deletes (`§8A`), and K8s-native `ownerReferences` between resources managed by the same adapter for cascade deletion via Kubernetes GC (`§8B`)

### Out of Scope

- **Sweep controller implementation** — the design is described in this document; implementation is a separate workstream pending the HYPERFLEET-896 generic `/resources` endpoint being deployed
- **Retention history UI** — surfacing version history via the HyperFleet API or dashboard is a separate concern; the adapter framework records versions on the cluster but does not model them as first-class API resources
- **Active generation policy** (Cancel / Coexist / Queue) — broker-level ordering is the current approach; Cancel and Queue require distributed coordination
- **Full persistent state machine** — ConfigMap/CRD backend with transition guards and history

---

## Current State

The adapter executes a 4-phase sequential pipeline per CloudEvent:

1. **Parameter Extraction** — extracts params from event data and env vars
2. **Preconditions** — evaluates CEL or structured conditions; supports API calls with retry/backoff
3. **Resources** — applies, discovers, and deletes resources sequentially
4. **Post-Actions** — CEL-gated HTTP calls or log entries; always runs for error reporting

### What Already Exists

| Capability | Location | Notes |
|---|---|---|
| `recreate_on_change: bool` | `internal/configloader/types.go` | Delete+create when generation changes; superseded by `lifecycle.recreate.when` (HYPERFLEET-837) |
| `lifecycle.delete.when` CEL expression | `resource_executor.go` | Gated deletion ordering per resource |
| `lifecycle.delete.propagationPolicy` | `transportclient/types.go` | Background / Foreground / Orphan (K8s GC sense only) |
| Generation-aware idempotent apply | `internal/manifest/generation.go` | Create / Skip (same gen) / Update / Recreate |
| Retry/backoff for precondition API calls | `internal/hyperfleetapi/client.go` | Exponential, linear, constant + ±10% jitter; 3 attempts default |
| `hyperfleet.io/*` label constants | `pkg/constants/constants.go` | String constants for label/annotation keys (`hyperfleet.io/generation`, `managed-by`, `cluster-id`, `created-by`). Used as a shared reference — the framework does **not** inject these automatically; each adapter's YAML manifest template must add them explicitly |
| `hyperfleet.io/generation` enforcement | `internal/manifest/generation.go` | The one exception: this annotation is mandatory and validated at apply time. The adapter refuses to apply any manifest that is missing it or has a non-numeric value |
| Label/annotation stamping | — | **Not automatic.** No framework mechanism injects `hyperfleet.io/*` labels into manifests. Adapter authors must add them to their YAML templates by hand. This design adds automatic stamping — see [Automatic Label and Annotation Stamping](#automatic-label-and-annotation-stamping) below |

---

## Design

### 1. Lifecycle Gates for Apply Behaviour

The existing lifecycle gate pattern (`lifecycle.recreate.when`, `lifecycle.delete.when`) drives resource operations via per-resource CEL expressions evaluated against discovered resource state. The new gates follow the same pattern rather than introducing a separate `updateStrategy` enum.

#### 1.1 Resource-Level When Gate

A resource-level `when` expression is evaluated before any lifecycle gate. If it evaluates to false, the executor skips all modification actions for that resource in the current event — no apply, no patch, no recreate. The resource is still discovered and available as `resources.myAlias` in sibling expressions. Delete and orphan are exempt: they are evaluated regardless of the `when` result, so a freshness check never accidentally blocks a deletion.

```yaml
resources:
- alias: "provisioningJob"
when:
expression: "..."
lifecycle:
...
```

This gate follows the same CEL evaluation context as all other lifecycle expressions: extracted params, discovered resources, and adapter metadata are all available.

##### Debouncing with timestamp arithmetic

The primary use case is rate-limiting resource modifications when Sentinel sends a burst of convergence events — for example, preventing a new provisioning job from being created on every ~10s event when the previous one was just launched.

CEL's standard library provides `timestamp()` and `duration()` types with full arithmetic support. Combined with the existing `now()` function, debounce conditions require no new framework functions:

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
lifecycle:
recreate:
when:
expression: >
resources.provisioningJob.status.conditions.exists(
c, c.type == "Complete" && c.status == "True")
```

The `when` expression passes if no current version exists, or if the current version is older than 10 minutes. During a convergence burst, all events within the 10-minute window skip the resource entirely — `lifecycle.recreate.when` is never reached. Once the interval elapses, the next event evaluates `lifecycle.recreate.when` normally.

`metadata.creationTimestamp` arrives from Kubernetes as an RFC3339 string and is parsed correctly by `timestamp()`. `timestamp(now()) - duration("10m")` returns a timestamp 10 minutes in the past; comparison with `<` is native to CEL timestamp types. Duration strings follow Go conventions: `h`, `m`, `s`, `ms`.

The resource-level `when` is also useful outside retention scenarios — any condition that should gate the entire resource rather than a specific lifecycle action belongs here.

#### 1.2 lifecycle.apply.when (CreateOnly)

`lifecycle.apply.when` gates whether the apply step runs at all for a given resource. If the expression evaluates to false, the apply step is skipped and the executor continues to the next resource.

The primary use case is create-only behaviour: only write the resource if it does not already exist.

```yaml
resources:
- name: "bootstrapNamespace"
transport:
client: "kubernetes"
manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
apply:
when:
expression: "!resources.?bootstrapNamespace.hasValue()"
```

The expression has access to the same CEL context as `lifecycle.recreate.when`: extracted params, discovered resources, and adapter metadata. Discovery must be configured so that the current resource state is available in `resources.*` for the expression to evaluate against.

**Why not preconditions**: preconditions are event-scoped — a failing precondition skips the entire event, not just the resource it was intended to guard. `lifecycle.apply.when` is per-resource and does not affect sibling resources.

#### 1.3 lifecycle.patch (Patch)

Patch is an operation on an already-existing resource, not a creation strategy. `lifecycle.patch` is an array — each entry has its own `when` gate and `document`. The executor evaluates every entry independently and applies all whose `when` evaluates to true, in order.

```yaml
resources:
- name: "clusterNamespace"
transport:
client: "kubernetes"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
patch:
- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.annotations, "example.io/status") != clusterStatus
document:
metadata:
annotations:
example.io/status: "{{ .clusterStatus }}"

- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.labels, "example.io/tier") != tier
document:
metadata:
labels:
example.io/tier: "{{ .tier }}"
```

The first entry fires only when the `example.io/status` annotation on the discovered resource differs from the current `clusterStatus` param — the expression reads the live value off the discovered resource and compares it to the desired value. The second entry fires independently when the label drifts. Both can apply in the same event; neither fires if the discovered state already matches.

Each `document` is the patch body, rendered through the normal Go template pipeline and sent as a JSON merge patch (RFC 7396): fields present are set, fields absent are left untouched on the server, fields explicitly set to `null` are removed.

`lifecycle.patch` and the normal apply path are mutually exclusive per resource: if any patch entry fires, the standard apply step is skipped for that resource. This is consistent with how `lifecycle.recreate.when` and `lifecycle.delete.when` take precedence over the default path.

**Transport**: Maestro's `PatchManifestWork` already uses JSON merge patch internally. For K8s, the transport client issues a `Patch` call with `types.MergePatchType`. The `PatchResourceLabels` primitive added for `lifecycle.orphan.when` provides the same transport method.

**Note on recreation**: `recreate_on_change: true` is superseded by `lifecycle.recreate.when` from [Adapter Recreation Flow Design](./adapter-recreation-flow-design.md) (HYPERFLEET-837). Recreation is a lifecycle gate, not a strategy.

#### 1.4 Resource Disownment (lifecycle.orphan.when)

An adapter can already release a resource without deleting it: skip the delete step in `lifecycle.delete.when` and report `Finalized=True` in the post-action status conditions. The platform sees `Finalized=True` and stops sending events for that resource.

`lifecycle.orphan.when` makes this a first-class lifecycle gate. The expression is typically gated on the event being a deletion and the resource still carrying its management labels — confirming the resource is still managed and this event is the handoff point.

```yaml
resources:
- name: "provisioningJob"
transport:
client: "kubernetes"
discovery:
by_name: "{{ .clusterId }}-job"
lifecycle:
orphan:
when:
expression: >
is_deleting
&& resources.?provisioningJob.hasValue()
&& "hyperfleet.io/adapter" in resources.provisioningJob.metadata.labels
```

When `lifecycle.orphan.when` evaluates to true, the executor:

1. Skips the delete step entirely
2. Issues a label-strip patch via the transport client, removing `hyperfleet.io/adapter` and `hyperfleet.io/resource-id`
3. Sets `Finalized=True` in the execution result; the post-action phase reports it automatically
4. Preserves `hyperfleet.io/generation` (it belongs to the resource, not the management relationship)

The label presence check in the expression (`"hyperfleet.io/adapter" in resources.provisioningJob.metadata.labels`) ensures the gate is idempotent: if the event is redelivered after labels have already been stripped, the expression evaluates to false and neither delete nor orphan fires.

**Precedence**: `lifecycle.orphan.when` is evaluated before `lifecycle.delete.when`. If orphan fires, delete is skipped.

**Implementation**: Add `orphan.when` CEL field to `LifecycleConfig`. Add a `PatchResourceLabels(ctx, gvk, namespace, name, remove []string, target)` method to the `TransportClient` interface for the label-strip operation.

**Maestro compatibility**: For Maestro ManifestWork resources, the label-strip is applied to the ManifestWork object itself, not to the nested manifests. The sweep controller must account for this when reconciling Maestro-managed resources.

### 2. Automatic Label and Annotation Stamping

**Current state**: The framework defines a set of standard `hyperfleet.io/*` label and annotation keys as string constants in `pkg/constants/constants.go` (`hyperfleet.io/generation`, `managed-by`, `cluster-id`, `created-by`). These are a shared reference for adapter authors — the framework does not inject them. Each adapter's manifest template must add them explicitly by hand. The only exception is `hyperfleet.io/generation`: this annotation is mandatory and validated at apply time by `internal/manifest/generation.go`, which refuses to apply any manifest missing it or carrying a non-numeric value.

This means label coverage across adapter-created resources is inconsistent: adapters that omit `hyperfleet.io/managed-by` or `hyperfleet.io/cluster-id` from their templates are invisible to any tooling that relies on those labels for discovery, ownership tracking, or sweep-based cleanup.

**Proposal**: The framework injects the standard set of `hyperfleet.io/*` labels and annotations at apply time, before the manifest reaches the transport client. Adapter authors no longer need to add them manually. The merge strategy is fill-gaps-only: labels already present in the adapter's manifest template take precedence, so intentional overrides are preserved and no existing adapter is broken.

**Standard labels stamped on every resource:**

| Label / Annotation | Value Source | Notes |
|---|---|---|
| `hyperfleet.io/adapter` | Adapter config: `adapter.name` | Identifies the adapter instance managing this resource |
| `hyperfleet.io/resource-id` | TBD — pending HYPERFLEET-896 alignment | Stable identifier across recreations |
| `hyperfleet.io/generation` | Event param: `generation` | Already enforced; automatic stamping is a no-op if already present |

**Implementation**: A `stampLabels(manifest, frameworkLabels)` function called in `resource_executor.go` before `ApplyResource()`. Works identically for K8s and Maestro.

> **Blocked on HYPERFLEET-896**: The `hyperfleet.io/resource-id` value is blocked on alignment with the CRD-Driven Generic Resource API team. Stamping can be shipped without this label and extended once the identifier model is agreed.

### 3. Resilience Model

The adapter does not implement per-resource retry for apply and delete operations. The resilience model for HyperFleet is the Sentinel reconciliation cycle: Sentinel continuously compares observed resource state against the desired spec and generates a new event approximately every 10 seconds for as long as the state has not converged. Once the state converges, Sentinel switches to a periodic drift-check cadence (default: every 30 minutes).

**Convergence events**: a transient apply failure leaves the resource state unchanged and unconverged. Sentinel observes continued divergence and sends the next event at the ~10s cadence. No inner retry is needed.

**Drift-check events**: Sentinel also sends periodic events (default every 30 minutes) to detect and correct drift on resources that were previously converged. This introduces a corner case: if the apply fails during a drift-check event, the resource state is unchanged and still appears converged to Sentinel, which would otherwise not retrigger for another 30 minutes. This gap is closed by the adapter reporting `Available=False` on the API resource whenever an apply fails. `Available=False` signals to Sentinel that the resource is not reconciled, causing it to switch from the drift cadence back to the ~10s convergence cadence. The adapter must always surface apply failures via status conditions — this is what keeps the gap closed regardless of which Sentinel cadence triggered the event.

Adding an inner retry loop inside a single event execution creates a two-loop problem:

- **Concurrent execution risk**: a retry with a long `activeDeadline` holds the event execution open while Sentinel generates new events because the resource has not converged (or has reported `Available=False`). Multiple executions attempt the same apply against an already-stressed API server.
- **Redundancy**: the failure classes an inner retry would handle are already covered by the Sentinel cycle, provided the adapter surfaces failures correctly.
- **Backoff amplification**: exponential backoff inside the adapter does not coordinate with Sentinel's cadence. Under sustained API server pressure, both loops fire independently, increasing load rather than reducing it.

The adapter's responsibility is to drive state convergence and always report failure via `Available=False` and status conditions. Retry is a property of the reconciliation loop, not of a single event execution.

The existing retry mechanism in `internal/hyperfleetapi/client.go` (used for precondition API calls) is retained as-is. Precondition calls are synchronous blocking checks where a short backoff before concluding failure is semantically correct. Resource apply and delete do not share this property.

### 6. API-Level Deletion Policy for Owned Resources

This proposal is defined at the HyperFleet API layer, not in the adapter framework. It is documented here because it directly determines what events adapters receive when a parent resource is deleted, and therefore what cleanup the adapter is expected to perform.

#### What HYPERFLEET-896 defines

The [Generic Resource Registry design](../../../docs/generic-resource-registry-design.md) introduces a descriptor-driven `OnParentDelete` policy declared on each **child** entity type. When a parent resource is deleted, the API applies each child type's policy before proceeding:

| Policy | Behaviour |
|---|---|
| `restrict` (default) | Parent deletion is rejected with `409 Conflict` if any active children of this type exist. The caller must delete children first. |
| `cascade` | All children of this type are soft-deleted recursively (DFS, innermost first) before the parent is deleted. The entire tree is soft-deleted atomically. |

With `cascade`, each child soft-deletion triggers the normal adapter reconciliation flow — adapters receive delete events and use `lifecycle.delete.when` to clean up the corresponding K8s or Maestro resources. No special adapter behaviour is required beyond what already exists.

#### The Orphan case

The ticket specifies a third policy — **`orphan`** — where parent deletion proceeds but children are left in place, neither deleted nor blocked. The children remain in the API as active resources without a parent; no delete events are sent to adapters.

This policy is **not currently in the HYPERFLEET-896 design** and would need to be added. Its implications:

- Children remain as API resources with no owner reference
- Adapters do not receive delete events; the K8s/Maestro resources they manage also remain in place
- The sweep controller (`§8A`) cannot clean them up either — those K8s resources still have a live API counterpart (the orphaned child API resource returns 200, not 404)
- Cleanup would require either manual deletion of each child, or a separate sweep mechanism that queries for parentless resources and acts on them

**Relationship to `lifecycle.orphan.when` (`§1.4`)**: these are distinct mechanisms operating at different layers. `lifecycle.orphan.when` is an adapter-level gate for planned disownment during a normal event lifecycle — the API resource still exists. The API-level `orphan` deletion policy operates at the API layer when a parent is deleted, independent of any adapter event.

#### Action required

A proposal to add `OnParentDeleteOrphan` to the Generic Resource Registry descriptor model is needed before this policy can be implemented. The adapter framework requires no changes for the `restrict` and `cascade` cases.

### 4. Stuck Detection

**Current state**: The HyperFleet API already exposes deletion-specific stuck detection via `hyperfleet_api_resource_pending_deletion_stuck` — a gauge computed on each Prometheus scrape by querying the database for resources with `deleted_time` set beyond a configurable threshold (default 30 minutes). This covers the deletion flow only and has no equivalent for create and update reconciliations.

**Planned** ([HYPERFLEET-1205](https://redhat.atlassian.net/browse/HYPERFLEET-1205)): A unified reconciliation metrics system that extends the same pattern to all reconciliation flows using the `Reconciled=False` condition already stored on API resources.

New metrics:

| Metric | Type | Description |
|---|---|---|
| `hyperfleet_api_reconciliation_requests_total` | Counter | Incremented when `Reconciled` transitions to `False` |
| `hyperfleet_api_resource_pending_reconciliation` | Gauge | Resources currently in `Reconciled=False` state |
| `hyperfleet_api_resource_pending_reconciliation_stuck` | Gauge | Resources where `Reconciled=False` has persisted beyond the stuck threshold |
| `hyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds` | Gauge | Maximum duration any resource has been stuck |

All metrics carry an `is_delete` label (`true` when `deleted_time IS NOT NULL`) to differentiate deletion flows from create/update flows. The existing deletion metrics become redundant and will be removed once HYPERFLEET-1205 ships.

**Why no adapter-side annotation is needed**: the API stores the `Reconciled=False` condition with a transition timestamp in its database. Duration and stuck detection are computed from that timestamp via DB queries at scrape time — the API is the authoritative source. The adapter's responsibility is to correctly report `Available=False` on apply failure, which drives `Reconciled=False` on the API resource. Sentinel picks up the signal and retriggers; if `Reconciled=False` persists beyond the threshold, the API metrics surface it. No annotation on the managed K8s resource is required.

### 5. Periodic Execution

The adapter has no built-in scheduler. Periodic execution is provided by the Sentinel reconciliation cycle, which operates at two cadences:

- **Convergence cadence (~10s)**: triggered while `Reconciled=False` — the resource has not yet reached the desired state. Every event execution is an opportunity to drive the resource toward convergence.
- **Drift-check cadence (default 30 minutes)**: triggered when the resource is considered reconciled. Sentinel re-evaluates the resource to detect and correct configuration drift.

Both cadences are configured per resource type, not per individual resource. That is the granularity available: all resources of a given type share the same drift-check interval. There is no mechanism to set a different cadence for individual resources within the same type.

#### Finer-grained control with lifecycle CEL expressions

Every `when` expression — `lifecycle.apply.when`, `lifecycle.patch`, `lifecycle.orphan.when`, `lifecycle.recreate.when`, `lifecycle.delete.when` — is evaluated on every event, including drift-check events. This makes lifecycle gates the natural mechanism for expressing finer-grained periodic behaviour within the drift cadence.

An adapter author can gate an operation on drift-check events using conditions that express time-awareness or state-awareness without any framework changes:

```yaml
lifecycle:
patch:
- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.annotations, "example.io/rotated-at") < now()
document:
metadata:
annotations:
example.io/rotated-at: "{{ .now }}"
```

The `now()` CEL function returns the current time as an RFC3339 string, enabling expressions that compare timestamps read from the discovered resource against the current time.

#### Using last-reported time from adapter conditions

An adapter can retrieve the timestamp of its own last successful reconciliation — or that of another adapter — by reading the status conditions on the HyperFleet API resource. Conditions carry a `lastTransitionTime` field that records when `Reconciled` last changed. This can be fetched via a precondition API call and injected as a param, making it available in CEL expressions across the entire event execution.

This allows an adapter to express "only perform this operation if it has not been done in the last N hours," even though Sentinel sends drift events every 30 minutes:

```yaml
preconditions:
- name: "fetchClusterStatus"
type: hyperfleet_api
resource: cluster
id: "{{ .clusterId }}"
output_param: clusterResource

resources:
- name: "certSecret"
lifecycle:
patch:
- when:
expression: >
resources.?certSecret.hasValue()
&& dig(clusterResource, "status.conditions")
.filter(c, c.type == "CertRotated")
.all(c, c.lastTransitionTime < timestamp(now()).toUnixTime() - 86400)
document:
metadata:
annotations:
example.io/cert-rotated-at: "{{ .now }}"
```

The same pattern applies to conditions reported by other adapters managing the same cluster resource — an adapter can make its drift-check behaviour conditional on what a peer adapter last reported, without any direct coupling between the adapters.

### 7. Resource Retention

Some adapter resources should accumulate as history rather than be replaced in-place — provisioning jobs, pipeline runs, audit records. The `retention:` block at the resource level enables this: it configures how many historical versions are kept and for how long.

**Retention is not a lifecycle gate.** It is a resource-level policy that runs as a post-apply cleanup pass. It is independent of `lifecycle.recreate.when` — the gate decides *whether* to create a new version; retention decides *how many old versions to keep* after the new one exists.

#### Replace mode vs. retain mode

The presence or absence of `retention:` determines the recreation strategy when `lifecycle.recreate.when` fires:

- **Replace mode** (no `retention:` block): the old resource is deleted first, then the new manifest is applied with the same name. One resource exists at a time.
- **Retain mode** (`retention:` block present): the new manifest is applied without deleting the old one first. The adapter author controls the naming to ensure no collision — the name should vary per creation (e.g., include a timestamp or generation suffix). Old versions accumulate and are pruned by retention rules.

This distinction matters because the two modes have incompatible semantics: replace requires the old name to be free before creating; retain requires the old resource to remain alive until pruned.

#### Multi-result discovery

In retain mode, discovery must find all existing versions of the logical resource, not a single resource by name. The standard mechanism is a label selector on `hyperfleet.io/resource-id`, which is stamped automatically on every resource (see `§2`):

```yaml
discovery:
by_selectors:
label_selector:
hyperfleet.io/resource-id: "{{ .resourceId }}"
```

This returns a list of all versions sharing the same API resource identity. The executor resolves the **current version** — the entry whose `hyperfleet.io/generation` annotation matches the current event's generation param, or the newest by `metadata.creationTimestamp` as fallback — and exposes only that single resource as `resources.myJob` in CEL expressions. The full version list is an internal executor concern.

**Authors write lifecycle expressions exactly as they do today.** `resources.myJob` is always a single optional resource, never a list. The multi-version behavior (pruning historical versions, applying delete or orphan gates across all versions) is performed by the executor transparently, without the author needing to iterate over versions in CEL.

#### Retention rules

```yaml
retention:
historyLimit: 3 # keep at most N historical versions; oldest pruned first
ttl: 24h # delete versions older than this duration
```

Both rules apply independently and are evaluated against historical versions only. The current version is never pruned. If a historical version exceeds `ttl` OR the count of historical versions exceeds `historyLimit`, it is deleted. The pruning pass runs after the apply step on every event where `retention:` is configured.

#### Debouncing

In retain mode, rapid convergence events can cause rapid version creation — each event creating a new version before the previous one has had a chance to run. Debouncing is expressed via the resource-level `when` gate (`§1.4`), which skips all modification actions when the current version is too fresh:

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
retention:
historyLimit: 5
ttl: 48h
lifecycle:
recreate:
when:
expression: "..."
```

The resolution is bounded by the Sentinel event cadence. With Sentinel sending convergence events every ~10 seconds and a 10-minute debounce:

```
t=0:00 — event arrives, no current version exists → new job created (v1)
t=0:10 — event arrives, v1 is 10s old < 10m → resource skipped, v1 retained
t=0:20 — event arrives, v1 is 20s old < 10m → resource skipped, v1 retained
...
t=10:00 — event arrives, v1 is 10m old ≥ 10m → recreate.when evaluated → new job created (v2)
```

Creation is limited to at most once per interval regardless of how many events Sentinel sends.

#### Delete and orphan semantics with multiple versions

When a delete event arrives and multi-result discovery is active, all discovered versions of the resource are affected — not only the current version. Leaving historical versions in place after a delete would orphan them with no cleanup path, since the API resource they reference no longer exists.

- **`lifecycle.delete.when`**: evaluated once as a gate; if true, the executor deletes all discovered versions.
- **`lifecycle.orphan.when`**: evaluated once as a gate; if true, ownership labels are stripped from all discovered versions before reporting `Finalized=True`. This ensures no version remains visible to the sweep controller.

The precedence rule is unchanged: `lifecycle.orphan.when` is evaluated before `lifecycle.delete.when`. If orphan fires, the delete step is skipped for all versions.

#### Example — provisioning job with history

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
transport:
client: "kubernetes"
manifest:
apiVersion: batch/v1
kind: Job
metadata:
name: "{{ .clusterId }}-job-{{ .now | replace \":\" \"-\" }}"
namespace: "{{ .namespace }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
discovery:
namespace: "{{ .namespace }}"
by_selectors:
label_selector:
hyperfleet.io/resource-id: "{{ .resourceId }}"
retention:
historyLimit: 5
ttl: 48h
lifecycle:
recreate:
when:
expression: >
resources.provisioningJob.status.conditions.exists(
c, c.type == "Complete" && c.status == "True")
|| resources.provisioningJob.status.conditions.exists(
c, c.type == "Failed" && c.status == "True")
```

The resource-level `when` prevents a burst of convergence events from creating multiple jobs: if the current version is younger than 10 minutes the entire resource is skipped. Once the interval elapses, `lifecycle.recreate.when` is evaluated — it fires when the job has completed or failed, triggering a new run. Retention keeps at most 5 historical runs and discards any older than 48 hours.

### 8. Ownership Model

"Ownership" in the adapter context means that the framework can determine which managed resources belong to a given HyperFleet API resource, and can clean them up when that API resource is deleted or disowned. Two complementary mechanisms serve this purpose, each suited to a different kind of parent-child relationship.

#### Option A — Label-based ownership (sweep controller)

Every managed resource carries `hyperfleet.io/adapter` and `hyperfleet.io/resource-id` labels stamped automatically by the framework (see `§2`). The sweep controller is a separate background job that uses these labels to detect and clean up resources that were orphaned when their corresponding HyperFleet API resource was force-deleted.

**When to use**: any resource whose logical owner lives in the HyperFleet API — Clusters, NodePools, or any future API resource type. The sweep controller is the correct mechanism because the owner has no K8s representation to reference.

##### Why it cannot use the API as its source of truth

After a force-delete, the API has no record of the resource. The sweep controller cannot ask "what resources should exist" — it must instead ask "does this resource that exists on the cluster still have a live API counterpart?"

##### Flow

1. **Discover**: list all K8s resources (and Maestro ManifestWork objects) carrying the `hyperfleet.io/adapter` label, across all namespaces and resource kinds
2. **Verify**: for each discovered resource, extract `hyperfleet.io/resource-id` and call `GET /api/hyperfleet/v1/resources/{id}` on the HyperFleet API
3. **Act**: if the API returns 404 — the API resource was force-deleted — delete the K8s resource; if the API returns the resource, skip it

The generic `/resources` endpoint from [HYPERFLEET-896](https://issues.redhat.com/browse/HYPERFLEET-896) is the key enabler: the sweep controller works across all resource kinds without kind-specific logic.

##### Prerequisites

| Prerequisite | Provided by |
|---|---|
| `hyperfleet.io/adapter` label on all managed resources | Automatic label stamping (`§2`) |
| `hyperfleet.io/resource-id` carrying the API UUID | Automatic label stamping (`§2`) |
| Labels stripped on intentional disownment | `lifecycle.orphan.when` (`§1.4`) — orphaned resources are invisible to the sweep controller |
| Generic `/resources` endpoint | HYPERFLEET-896 — required for the verification call |

##### Operational considerations

- **Interval**: configurable; a few minutes is sufficient — force-deleted resources are not immediately harmful, just wasteful
- **Leader election**: required for multi-replica deployments to avoid concurrent deletions
- **Maestro**: ManifestWork objects carry the labels on the ManifestWork itself; the sweep controller deletes the ManifestWork, not the nested manifests
- **Dry-run mode**: should support a read-only mode that reports what would be deleted without acting

**Limitations**: cleanup is eventual (bounded by the sweep interval), and requires the sweep controller to be deployed and operational.

#### Option B — K8s ownerReferences

Kubernetes garbage collection cascades deletions through `ownerReferences`: if object B has an `ownerReference` pointing to object A, deleting A causes Kubernetes to delete B automatically. This happens synchronously at the K8s API level — no external component required.

The adapter can set an `ownerReference` on a managed resource pointing to another resource managed by the same adapter. This is expressed as a per-resource `owner:` config field naming another resource alias in the same adapter config:

```yaml
resources:
- alias: clusterNamespace
transport:
client: "kubernetes"
manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"

- alias: provisioningJob
owner:
resource: clusterNamespace # ownerReference points to the Namespace above
transport:
client: "kubernetes"
manifest:
apiVersion: batch/v1
kind: Job
metadata:
name: "{{ .clusterId }}-job"
namespace: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
```

When the executor applies `provisioningJob`, it sets an `ownerReference` on the Job pointing to the already-applied Namespace. If the adapter later deletes the Namespace (via `lifecycle.delete.when`), Kubernetes automatically deletes the Job — the adapter does not need to issue a separate delete call for it.

**When to use**: resources that have a natural K8s parent-child relationship within the same adapter — for example, Jobs or Secrets scoped to a Namespace that the adapter also manages. The parent and child must live in the same namespace (K8s cross-namespace ownerReferences are not supported for namespaced resources).

**Limitations**:
- Only applies to K8s direct transport. Maestro ManifestWork objects do not participate in K8s garbage collection — `ownerReferences` on a ManifestWork are not propagated to the nested manifests that Maestro applies on the spoke cluster.
- The referenced `owner.resource` alias must appear earlier in the resources list (it must already be applied so its UID is known).
- Does not replace label-based ownership for the HyperFleet API force-delete case — `ownerReferences` only cascade within the cluster; they cannot respond to a deletion in an external API.

#### Relationship between the two mechanisms

They are complementary, not alternatives:

| Scenario | Mechanism |
|---|---|
| HyperFleet API resource force-deleted, K8s resources orphaned | Label-based sweep controller (`§8A`) |
| Adapter deletes a parent K8s resource; child K8s resources should follow | K8s `ownerReferences` (`§8B`) |
| Adapter explicitly disowns a resource without deleting it | `lifecycle.orphan.when` strips labels; sweep controller ignores it (`§1.4`) |

An adapter author can use both simultaneously: label-based ownership ensures the sweep controller can find the resource if the API record disappears, while `ownerReferences` ensure that a parent deletion within the cluster cascades to children without requiring the adapter to enumerate and delete each child explicitly.

### Configuration Reference

Full resource configuration with all new lifecycle and retention fields:

```yaml
resources:
- alias: "example"
# NEW: resource-level gate evaluated before any lifecycle action.
# If false, apply/patch/recreate are skipped. Delete and orphan are unaffected.
when:
expression: >
!resources.?example.hasValue()
|| timestamp(resources.example.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")

transport:
client: "kubernetes" # or "maestro"
maestro:
target_cluster: "{{ .clusterId }}"

manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
labels:
# Framework stamps hyperfleet.io/adapter and hyperfleet.io/resource-id automatically.
# Explicit values in the template take precedence (fill-gaps-only merge).
annotations:
hyperfleet.io/generation: "{{ .generation }}"

discovery:
namespace: "{{ .namespace }}"
by_selectors:
label_selector:
hyperfleet.io/resource-id: "{{ .resourceId }}"

# NEW: ownerReference to another resource alias in this adapter config (K8s only).
# Kubernetes GC deletes this resource when the referenced owner is deleted.
owner:
resource: anotherAlias

# NEW: retention — enables retain mode for recreate and governs version cleanup.
# Omit this block for replace mode (delete-before-create, single resource).
retention:
historyLimit: 3 # keep at most N historical versions; oldest pruned first
ttl: 24h # delete versions older than this duration

lifecycle:
# NEW: gate whether the apply step runs (e.g. create-only)
apply:
when:
expression: "!resources.?example.hasValue()"

# NEW: array of JSON merge patches; each entry evaluated independently
patch:
- when:
expression: >
resources.?example.hasValue()
&& dig(resources.example.metadata.annotations, "example.io/status") != status
document:
metadata:
annotations:
example.io/status: "{{ .status }}"

# NEW: strip management labels + report Finalized=True without deleting
orphan:
when:
expression: >
is_deleting
&& resources.?example.hasValue()
&& "hyperfleet.io/adapter" in resources.example.metadata.labels

# Existing gates
recreate:
when:
expression: "resources.?example.hasValue() && ..."

delete:
when:
expression: "is_deleting"
propagationPolicy: "Background"
```

---

## Alternatives Considered

### hyperfleet.io/last-updated Annotation

During the design of the Periodic Execution section, a `hyperfleet.io/last-updated` annotation was considered as a framework-stamped RFC3339 timestamp recording when the adapter last successfully applied a resource. This would make time-aware CEL expressions in drift-check cycles straightforward without requiring a precondition API call:

```yaml
lifecycle:
patch:
- when:
expression: >
resources.?certSecret.hasValue()
&& dig(resources.certSecret.metadata.annotations, "hyperfleet.io/last-updated") < now()
document:
metadata:
annotations:
example.io/rotated-at: "{{ .now }}"
```

Other uses where this pattern is useful:

- **Rate-limiting expensive operations** — only trigger a heavyweight recreation or API call if the last successful apply was more than N hours ago, even though Sentinel sends drift events every 30 minutes
- **Audit trail on the resource** — operators can `kubectl describe` a resource and see when the adapter last touched it without querying the HyperFleet API
- **Cross-adapter coordination** — a downstream adapter can read `hyperfleet.io/last-updated` stamped by an upstream adapter on a shared resource to condition its own actions on recency

**Why this was not adopted at framework level**: no specific use case in the current design requires it. The HyperFleet API already stores `lastTransitionTime` on the `Reconciled` condition, which is the authoritative timestamp for reconciliation events. Stamping an additional annotation on every managed resource adds a write on every apply cycle for a capability that may never be used by most adapters.

No ecosystem precedent exists for a standalone last-applied timestamp annotation on managed resources — ArgoCD, Flux, and Helm all avoid it, relying on `metadata.managedFields[].time` or status conditions on controller objects instead. Crossplane is the closest precedent, using RFC3339 timestamps on managed resources for creation-lifecycle tracking only.

**For adapter authors who need this**: add the annotation manually to the manifest template. The framework will not overwrite it (fill-gaps-only merge strategy). A custom key is equally valid — there is no requirement to use the `hyperfleet.io/` prefix for adapter-specific bookkeeping:

```yaml
manifest:
metadata:
annotations:
hyperfleet.io/generation: "{{ .generation }}"
my-adapter.io/last-updated: "{{ .now }}" # custom key, stamped by adapter author
```

The `now()` template function is available in Go templates used for manifest rendering, providing the current RFC3339 timestamp at event execution time.

---

### Per-Resource Retry with Backoff

An earlier version of this design proposed a per-resource retry configuration for apply and delete operations, modeled on the existing retry mechanism for precondition API calls:

```yaml
resources:
- name: "managedCluster"
retry:
maxRetries: 3
retryBackoff: exponential # exponential | linear | constant
activeDeadline: 60s
transport:
client: "kubernetes"
...
```

The proposed behavior: on a transient transport error (connection refused, timeout, 429, 503), the executor would wait for an exponential/linear/constant backoff interval (±10% jitter) and retry up to `maxRetries` times before surfacing the error as a resource failure. The `activeDeadline` would be enforced via `context.WithDeadline`. Implementation would extract the existing retry loop from `internal/hyperfleetapi/client.go` into a new `pkg/retry` package.

**Why this was not adopted**: The Sentinel reconciliation cycle covers both failure scenarios. For convergence events, a failed apply leaves the resource unconverged and Sentinel retriggers at ~10s. For periodic drift-check events (default every 30 minutes), a failed apply could create a gap — but the adapter closes it by reporting `Available=False` on failure, which signals Sentinel to treat the resource as not reconciled and switch back to the ~10s cadence. Provided the adapter always reports failure correctly via status conditions, no inner retry is needed. An inner retry loop also creates concurrent execution risk: if the resource has not converged (or has reported `Available=False`), Sentinel generates new events while the retry holds the current execution open, causing multiple executions to contend against an already-stressed API server.

The existing retry in `internal/hyperfleetapi/client.go` is not affected — precondition calls are synchronous blocking checks where a short backoff before concluding failure is semantically correct, and they are not subject to Sentinel retriggering.

**If this is revisited**: Any inner retry must complete well within the Sentinel cycle (< 5s total backoff). `maxRetries` should be at most 1–2, with no `activeDeadline` longer than a few seconds. It should be framed as "avoid unnecessary Sentinel cycles for sub-second blips," not as a resilience primitive.

---

## Risks

**HYPERFLEET-896 alignment** — `hyperfleet.io/resource-id` semantics must be confirmed before label stamping is shipped. Stamping can be prototyped without this label and extended once the identifier model is agreed with the CRD-Driven Generic Resource API team.

**Maestro compatibility** — Label-strip for `lifecycle.orphan.when` and sweep controller queries behave differently for Maestro ManifestWork vs K8s direct resources. Every new lifecycle gate must be validated against both transports.

**Sweep controller** — requires the HYPERFLEET-896 generic `/resources` endpoint to be deployed before the verification call can be implemented. Leader election is required for multi-replica deployments. The sweep controller design is documented in this document; implementation is a separate workstream.

---

## Responsibilities & Trade-offs

**What the framework provides**: Declarative lifecycle configuration via YAML. All new fields are optional with backward-compatible defaults. Label stamping is automatic and merge-safe. Resilience is owned by the Sentinel reconciliation cycle — the adapter drives state convergence and surfaces failures clearly; Sentinel decides when to retrigger based on observed divergence.

**What config authors own**:
- Choose the correct lifecycle gate for each resource (`lifecycle.apply.when` for create-only, `lifecycle.patch` for partial writes, `lifecycle.recreate.when` for immutable-field resources, `lifecycle.orphan.when` for handoff without deletion)
- Validate orphan behavior in staging — once `lifecycle.orphan.when` fires and labels are stripped, the resource is no longer tracked by HyperFleet; a redelivered event will not re-manage it
- Ensure `lifecycle.orphan.when` and `lifecycle.delete.when` expressions are not both active for the same resource — orphan takes precedence, but a delete expression that also evaluates to true on the same event indicates a config error

**Trade-off — no activeGenerationPolicy**: Concurrent generation events continue to be handled by broker-level ordering (current behavior). Cancel and Queue require distributed coordination that is disproportionately complex relative to the immediate need.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

REQUIRED: Add Mermaid diagrams for lifecycle gate flow and multi-version resolution logic.

Component documents MUST include Mermaid diagrams. This document describes complex state machines (lifecycle gate precedence in §1, multi-version retention and pruning in §7, sweep controller flow in §8A) entirely in prose. Add diagrams showing:

  1. Lifecycle gate evaluation order — how resource-level when, lifecycle.apply.when, lifecycle.patch, lifecycle.orphan.when, lifecycle.recreate.when, and lifecycle.delete.when precedence is ordered during an event.
  2. Multi-version discovery and current-version selection (§7, lines 408–421) — how the executor finds all versions by resource-id label and selects the current version (by generation match or newest).
  3. Sweep controller verification flow (§8A, lines 529–535) — list resources with hyperfleet.io/adapter label → extract resource-id → verify via API → delete if 404.
🧰 Tools
🪛 LanguageTool

[grammar] ~268-~268: Use a hyphen to join words.
Context: ...cadence. The adapter must always surface apply failures via status conditions — t...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 1 - 784, The design document lacks required Mermaid diagrams for
complex state machine and flow descriptions. Add three Mermaid diagrams to the
document: (1) a flowchart showing lifecycle gate evaluation order for the six
gates (resource-level `when`, `lifecycle.apply.when`, `lifecycle.patch`,
`lifecycle.orphan.when`, `lifecycle.recreate.when`, `lifecycle.delete.when`)
including precedence rules and conditional paths in section §1; (2) a flowchart
in section §7 illustrating how multi-version discovery selects the current
version by `resource-id` label and generation matching or timestamp fallback;
(3) a sequence diagram in section §8A showing the sweep controller flow:
discover resources with `hyperfleet.io/adapter` label, extract `resource-id`,
verify via API GET, and delete if 404 is returned. Each diagram should include
labels for conditional branches and include references to the relevant
configuration fields or API endpoints mentioned in the prose.

Source: Coding guidelines

Comment on lines +775 to +784
## Responsibilities & Trade-offs

**What the framework provides**: Declarative lifecycle configuration via YAML. All new fields are optional with backward-compatible defaults. Label stamping is automatic and merge-safe. Resilience is owned by the Sentinel reconciliation cycle — the adapter drives state convergence and surfaces failures clearly; Sentinel decides when to retrigger based on observed divergence.

**What config authors own**:
- Choose the correct lifecycle gate for each resource (`lifecycle.apply.when` for create-only, `lifecycle.patch` for partial writes, `lifecycle.recreate.when` for immutable-field resources, `lifecycle.orphan.when` for handoff without deletion)
- Validate orphan behavior in staging — once `lifecycle.orphan.when` fires and labels are stripped, the resource is no longer tracked by HyperFleet; a redelivered event will not re-manage it
- Ensure `lifecycle.orphan.when` and `lifecycle.delete.when` expressions are not both active for the same resource — orphan takes precedence, but a delete expression that also evaluates to true on the same event indicates a config error

**Trade-off — no activeGenerationPolicy**: Concurrent generation events continue to be handled by broker-level ordering (current behavior). Cancel and Queue require distributed coordination that is disproportionately complex relative to the immediate need.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Restructure Trade-offs section to match required component document format.

The "Responsibilities & Trade-offs" section is a brief paragraph. Component documents REQUIRE a structured Trade-offs section with these exact subsections (per coding guidelines):

  • What We Gain — list benefits with measurable impact (e.g., "Eliminates single point of failure," "Reduces from 6 to 5 components")
  • What We Lose / What Gets Harder — cost and capability loss
  • Technical Debt Incurred — list debt items: description, impact, remediation plan
  • Acceptable Because — why the trade-offs justify the design for MVP/timeline/scope

Current text (lines 775–784) is a summary of responsibilities; move it to a separate section and expand Trade-offs with the required structure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 775 - 784, The "Responsibilities & Trade-offs" section does not
follow the required component document format. Separate the current content into
two sections: keep responsibilities as a distinct section, then expand the
Trade-offs section with the required structured subsections: What We Gain
(benefits with measurable impact), What We Lose / What Gets Harder (costs and
capability losses), Technical Debt Incurred (debt items with description,
impact, and remediation plan), and Acceptable Because (justification for
MVP/timeline/scope). The current trade-off about no activeGenerationPolicy
should be incorporated into the expanded Trade-offs structure rather than left
as a brief paragraph.

Source: Coding guidelines

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md (1)

1-784: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

REQUIRED: Add three Mermaid diagrams for complex state machine and flow descriptions.

Component documents MUST include diagrams per coding guidelines. This document describes three complex flows entirely in prose:

  1. Lifecycle gate evaluation order (§1, lines 79–196) — precedence among resource-level when, lifecycle.apply.when, lifecycle.patch, lifecycle.orphan.when, lifecycle.recreate.when, lifecycle.delete.when with conditional skip paths.
  2. Multi-version discovery and current-version resolution (§7, lines 386–506) — how the executor discovers all versions by resource-id label, selects the current version by generation or timestamp, and applies retention pruning.
  3. Sweep controller verification flow (§8A, lines 512–547) — list resources with hyperfleet.io/adapter label → extract resource-id → GET /api/hyperfleet/v1/resources/{id} → delete if 404.

Add flowchart or sequence diagrams with clear labels for conditional branches and references to configuration fields/API endpoints.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 1 - 784, Add three Mermaid flowchart diagrams to the design
document per the coding guidelines: (1) a lifecycle gate evaluation order
diagram in §1 showing the precedence and skip paths among resource-level when,
lifecycle.apply.when, lifecycle.patch, lifecycle.orphan.when,
lifecycle.recreate.when, and lifecycle.delete.when with conditional branches;
(2) a multi-version discovery and current-version resolution diagram in §7
showing how the executor discovers all versions by resource-id label, selects
the current version by generation or timestamp, applies retention pruning, and
handles delete/orphan semantics across versions; (3) a sweep controller
verification flow diagram in §8A showing the sequence of listing resources with
hyperfleet.io/adapter label, extracting resource-id, calling GET
/api/hyperfleet/v1/resources/{id}, and acting on 404 vs 200 responses. Each
diagram should clearly label configuration fields, API endpoints, and
conditional decision points.

Source: Coding guidelines

🧹 Nitpick comments (1)
hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md (1)

260-270: 💤 Low value

Grammar: use hyphen to join compound adjectives before noun.

Line ~268: "...surface apply failures via status conditions — the adapter..." Use hyphen to clarify if this is intended as a compound adjective (e.g., "status-condition report") or reword for clarity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 260 - 270, The phrase "surface apply failures via status
conditions" uses an em-dash but needs clearer grammar for compound adjectives.
In the sentence starting with "The adapter must always surface apply failures
via status conditions — this is what keeps the gap closed...", either hyphenate
the compound adjective as "status-condition" if that is the intended meaning
(e.g., "surface apply failures via status-condition reports"), or reword the
sentence to make the grammatical structure clearer without ambiguity about
whether "status conditions" functions as a compound adjective modifying a noun.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 689-754: The document is missing a mandatory Trade-offs section
separate from Alternatives Considered. Create a new Trade-offs section with four
required subsections: What We Gain (measurable benefits of chosen design), What
We Lose / What Gets Harder (costs and capabilities given up), Technical Debt
Incurred (debt items with description, impact, and remediation plan), and
Acceptable Because (justification for MVP/timeline/scope). Extract and expand
the current trade-off discussions embedded within the Alternatives Considered
section (particularly the explanations for why the hyperfleet.io/last-updated
annotation and per-resource retry mechanisms were rejected) into this structured
Trade-offs format, moving this content out of the Alternatives section and
consolidating it into the new section with fuller detail on impact and
remediation.
- Around line 512-546: The sweep controller design (Option A — Label-based
ownership) lacks security controls to prevent tampering with the
`hyperfleet.io/resource-id` labels that identify resources for cleanup. Add
explicit security controls to the design by specifying one of the three
approaches mentioned in the review comment: cryptographic protection (HMAC or
signature validation), RBAC restrictions on label mutation, or an admission
webhook validating `hyperfleet.io/*` labels. Document the chosen approach in the
Prerequisites section (updating or adding entries as needed) and consider adding
it to the Operational considerations section if relevant to deployment or
configuration. This prevents unauthorized users with patch permissions from
forging labels to trigger unintended resource deletions.
- Around line 79-196: The design document for lifecycle management does not
document the CEL expression validation and sandboxing constraints, which could
lead readers to incorrectly assume expressions are unsandboxed or capable of
executing arbitrary code. Add a new section after the lifecycle.patch section
(around line 196) titled something like "CEL Expression Validation and
Sandboxing" that explicitly states: (1) expressions are validated at config load
time via Parse() in TaskConfigValidator, (2) the CEL evaluator environment
(cel_evaluator.go) is restricted to only ext.Strings() and three custom
functions (toJson, dig, now), (3) network-capable functions like HTTP, DNS, gRPC
and filesystem access are not available, and (4) expressions cannot execute
arbitrary code or access cluster credentials outside the explicitly defined CEL
context. This section should clarify the threat model and security boundaries so
readers understand the safety constraints of the expression environment.

---

Outside diff comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 1-784: Add three Mermaid flowchart diagrams to the design document
per the coding guidelines: (1) a lifecycle gate evaluation order diagram in §1
showing the precedence and skip paths among resource-level when,
lifecycle.apply.when, lifecycle.patch, lifecycle.orphan.when,
lifecycle.recreate.when, and lifecycle.delete.when with conditional branches;
(2) a multi-version discovery and current-version resolution diagram in §7
showing how the executor discovers all versions by resource-id label, selects
the current version by generation or timestamp, applies retention pruning, and
handles delete/orphan semantics across versions; (3) a sweep controller
verification flow diagram in §8A showing the sequence of listing resources with
hyperfleet.io/adapter label, extracting resource-id, calling GET
/api/hyperfleet/v1/resources/{id}, and acting on 404 vs 200 responses. Each
diagram should clearly label configuration fields, API endpoints, and
conditional decision points.

---

Nitpick comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 260-270: The phrase "surface apply failures via status conditions"
uses an em-dash but needs clearer grammar for compound adjectives. In the
sentence starting with "The adapter must always surface apply failures via
status conditions — this is what keeps the gap closed...", either hyphenate the
compound adjective as "status-condition" if that is the intended meaning (e.g.,
"surface apply failures via status-condition reports"), or reword the sentence
to make the grammatical structure clearer without ambiguity about whether
"status conditions" functions as a compound adjective modifying a noun.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: aac35f06-a47d-4157-a561-86eedffd428d

📥 Commits

Reviewing files that changed from the base of the PR and between afee454 and 29f5365.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment on lines +79 to +196
### 1. Lifecycle Gates for Apply Behaviour

The existing lifecycle gate pattern (`lifecycle.recreate.when`, `lifecycle.delete.when`) drives resource operations via per-resource CEL expressions evaluated against discovered resource state. The new gates follow the same pattern rather than introducing a separate `updateStrategy` enum.

#### 1.1 Resource-Level When Gate

A resource-level `when` expression is evaluated before any lifecycle gate. If it evaluates to false, the executor skips all modification actions for that resource in the current event — no apply, no patch, no recreate. The resource is still discovered and available as `resources.myAlias` in sibling expressions. Delete and orphan are exempt: they are evaluated regardless of the `when` result, so a freshness check never accidentally blocks a deletion.

```yaml
resources:
- alias: "provisioningJob"
when:
expression: "..."
lifecycle:
...
```

This gate follows the same CEL evaluation context as all other lifecycle expressions: extracted params, discovered resources, and adapter metadata are all available.

##### Debouncing with timestamp arithmetic

The primary use case is rate-limiting resource modifications when Sentinel sends a burst of convergence events — for example, preventing a new provisioning job from being created on every ~10s event when the previous one was just launched.

CEL's standard library provides `timestamp()` and `duration()` types with full arithmetic support. Combined with the existing `now()` function, debounce conditions require no new framework functions:

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
lifecycle:
recreate:
when:
expression: >
resources.provisioningJob.status.conditions.exists(
c, c.type == "Complete" && c.status == "True")
```

The `when` expression passes if no current version exists, or if the current version is older than 10 minutes. During a convergence burst, all events within the 10-minute window skip the resource entirely — `lifecycle.recreate.when` is never reached. Once the interval elapses, the next event evaluates `lifecycle.recreate.when` normally.

`metadata.creationTimestamp` arrives from Kubernetes as an RFC3339 string and is parsed correctly by `timestamp()`. `timestamp(now()) - duration("10m")` returns a timestamp 10 minutes in the past; comparison with `<` is native to CEL timestamp types. Duration strings follow Go conventions: `h`, `m`, `s`, `ms`.

The resource-level `when` is also useful outside retention scenarios — any condition that should gate the entire resource rather than a specific lifecycle action belongs here.

#### 1.2 lifecycle.apply.when (CreateOnly)

`lifecycle.apply.when` gates whether the apply step runs at all for a given resource. If the expression evaluates to false, the apply step is skipped and the executor continues to the next resource.

The primary use case is create-only behaviour: only write the resource if it does not already exist.

```yaml
resources:
- name: "bootstrapNamespace"
transport:
client: "kubernetes"
manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
apply:
when:
expression: "!resources.?bootstrapNamespace.hasValue()"
```

The expression has access to the same CEL context as `lifecycle.recreate.when`: extracted params, discovered resources, and adapter metadata. Discovery must be configured so that the current resource state is available in `resources.*` for the expression to evaluate against.

**Why not preconditions**: preconditions are event-scoped — a failing precondition skips the entire event, not just the resource it was intended to guard. `lifecycle.apply.when` is per-resource and does not affect sibling resources.

#### 1.3 lifecycle.patch (Patch)

Patch is an operation on an already-existing resource, not a creation strategy. `lifecycle.patch` is an array — each entry has its own `when` gate and `document`. The executor evaluates every entry independently and applies all whose `when` evaluates to true, in order.

```yaml
resources:
- name: "clusterNamespace"
transport:
client: "kubernetes"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
patch:
- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.annotations, "example.io/status") != clusterStatus
document:
metadata:
annotations:
example.io/status: "{{ .clusterStatus }}"

- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.labels, "example.io/tier") != tier
document:
metadata:
labels:
example.io/tier: "{{ .tier }}"
```

The first entry fires only when the `example.io/status` annotation on the discovered resource differs from the current `clusterStatus` param — the expression reads the live value off the discovered resource and compares it to the desired value. The second entry fires independently when the label drifts. Both can apply in the same event; neither fires if the discovered state already matches.

Each `document` is the patch body, rendered through the normal Go template pipeline and sent as a JSON merge patch (RFC 7396): fields present are set, fields absent are left untouched on the server, fields explicitly set to `null` are removed.

`lifecycle.patch` and the normal apply path are mutually exclusive per resource: if any patch entry fires, the standard apply step is skipped for that resource. This is consistent with how `lifecycle.recreate.when` and `lifecycle.delete.when` take precedence over the default path.

**Transport**: Maestro's `PatchManifestWork` already uses JSON merge patch internally. For K8s, the transport client issues a `Patch` call with `types.MergePatchType`. The `PatchResourceLabels` primitive added for `lifecycle.orphan.when` provides the same transport method.

**Note on recreation**: `recreate_on_change: true` is superseded by `lifecycle.recreate.when` from [Adapter Recreation Flow Design](./adapter-recreation-flow-design.md) (HYPERFLEET-837). Recreation is a lifecycle gate, not a strategy.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify CEL expression validation in adapter framework

# Search for validation or ast-inspection of CEL expressions at config load
rg -n 'ValidateExpression|CheckExpression|ValidateCEL|ast|Compile' \
  --type=go -g '!*test*' \
  -C 3 | head -50

# Check if there's any allowlist of permitted CEL functions
rg -n 'allowlist|whitelist|builtin|func.*register' \
  --type=go -g '!*test*' -C 2 | head -30

# Verify Criteria Evaluator initialization
rg -n 'NewEvaluator|cel-go|CEL engine' --type=go -g '!*test*' -C 3

Repository: openshift-hyperfleet/architecture

Length of output: 524


🏁 Script executed:

#!/bin/bash
# Find CEL validation in adapter config loader

# Search for CEL expression validation/compilation logic
rg -n 'ValidateExpression|CheckExpression|ValidateCEL|ast\.|Compile' \
  --type=go -g '!*test*' -C 3

# Look for cel-go usage and environment setup
rg -n 'cel\.NewEnv|cel\.Program|cel\.Check' \
  --type=go -g '!*test*' -C 3

# Search for validation in config loader
fd -t f -n 'loader|config' --type=go | head -10

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 3798


🏁 Script executed:

#!/bin/bash
# Find CEL engine setup and validation in Sentinel

# Search for CEL environment initialization
rg -n 'cel\.NewEnv|NewProgram|CELEnvironment|eval' \
  --type=go -g '!*test*' -C 3

# Look for expression validation patterns
rg -n 'validat|Validat|compile|Compile' \
  --type=go | grep -i 'cel\|expr' -C 2

Repository: openshift-hyperfleet/hyperfleet-sentinel

Length of output: 22760


🏁 Script executed:

#!/bin/bash
# Search for lifecycle expression compilation and validation

# Find where lifecycle expressions are processed
rg -n 'lifecycle|Lifecycle' --type=go -g '!*test*' -C 2 | grep -i 'when\|expr\|compile\|valid' -B 2 -A 2

# Look for config loader entry points
fd -t f 'loader\|config' --type=go -g '!*test*' 

# Check configloader/validator.go for CEL validation
cat internal/configloader/validator.go | head -100

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 15931


🏁 Script executed:

#!/bin/bash
# Check for lifecycle expression validation in types
rg -n 'type.*Lifecycle|type.*When' --type=go -C 3

# Look at CEL compilation in validator
rg -n 'env\.Compile|ast\.|CheckExpression' --type=go -C 3 | head -60

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 2865


🏁 Script executed:

#!/bin/bash
# Get the actual validateCELExpression function
rg -n 'func.*validateCELExpression|validateCELExpressions' -A 20 internal/configloader/validator.go

# Check the cel.NewEnv setup in validator
rg -n 'buildCELOptions|cel.NewEnv' -B 5 -A 10 internal/configloader/validator.go

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 2495


🏁 Script executed:

#!/bin/bash
# Check the criteria evaluator for CEL function restrictions
cat internal/criteria/cel_evaluator.go | head -80

# Look for buildCELOptions function
rg -n 'buildCELOptions' -A 30 internal/criteria/cel_evaluator.go

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 4179


🏁 Script executed:

#!/bin/bash
# Check what cel-go extensions are being used
rg -n 'ext\.|cel\.Function' internal/criteria/cel_evaluator.go -A 3

# Look for any documentation on CEL function restrictions
rg -n 'http|dns|grpc|unsafe|builtin' -i internal/criteria/

# Check if there's any security documentation about CEL
find . -type f -name '*.md' | xargs grep -l 'CEL\|cel' 2>/dev/null | head -5

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 922


🏁 Script executed (no clone):

# Check cel-go library documentation for available functions
curl -s https://pkg.go.dev/github.com/google/cel-go@latest | grep -i 'http\|grpc\|dns' 2>/dev/null | head -20 || echo "Could not fetch online docs"

Length of output: 1290


Document the CEL expression validation boundary and sandboxing constraints.

CEL expressions in lifecycle.when fields are validated at config load time via Parse() in the TaskConfigValidator (lines 578–622 of validator.go). The evaluator environment (internal/criteria/cel_evaluator.go) is explicitly restricted: only ext.Strings() and three custom functions (toJson, dig, now) are registered. No HTTP, DNS, gRPC, or other network-capable functions are available in the cel-go environment.

However, the design document does not document these safety constraints. Add a section explicitly stating:

  • Expressions are validated at config load time (syntax via Parse)
  • The CEL environment permits only safe built-in functions and explicitly registered helpers
  • Functions like HTTP, DNS, gRPC, and filesystem access are not available
  • Expressions cannot execute arbitrary code or access cluster credentials outside the defined context

This prevents readers from assuming expressions are unsandboxed and clarifies the threat model.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 79 - 196, The design document for lifecycle management does not
document the CEL expression validation and sandboxing constraints, which could
lead readers to incorrectly assume expressions are unsandboxed or capable of
executing arbitrary code. Add a new section after the lifecycle.patch section
(around line 196) titled something like "CEL Expression Validation and
Sandboxing" that explicitly states: (1) expressions are validated at config load
time via Parse() in TaskConfigValidator, (2) the CEL evaluator environment
(cel_evaluator.go) is restricted to only ext.Strings() and three custom
functions (toJson, dig, now), (3) network-capable functions like HTTP, DNS, gRPC
and filesystem access are not available, and (4) expressions cannot execute
arbitrary code or access cluster credentials outside the explicitly defined CEL
context. This section should clarify the threat model and security boundaries so
readers understand the safety constraints of the expression environment.

Source: Linters/SAST tools

Comment on lines +689 to +754
## Alternatives Considered

### hyperfleet.io/last-updated Annotation

During the design of the Periodic Execution section, a `hyperfleet.io/last-updated` annotation was considered as a framework-stamped RFC3339 timestamp recording when the adapter last successfully applied a resource. This would make time-aware CEL expressions in drift-check cycles straightforward without requiring a precondition API call:

```yaml
lifecycle:
patch:
- when:
expression: >
resources.?certSecret.hasValue()
&& dig(resources.certSecret.metadata.annotations, "hyperfleet.io/last-updated") < now()
document:
metadata:
annotations:
example.io/rotated-at: "{{ .now }}"
```

Other uses where this pattern is useful:

- **Rate-limiting expensive operations** — only trigger a heavyweight recreation or API call if the last successful apply was more than N hours ago, even though Sentinel sends drift events every 30 minutes
- **Audit trail on the resource** — operators can `kubectl describe` a resource and see when the adapter last touched it without querying the HyperFleet API
- **Cross-adapter coordination** — a downstream adapter can read `hyperfleet.io/last-updated` stamped by an upstream adapter on a shared resource to condition its own actions on recency

**Why this was not adopted at framework level**: no specific use case in the current design requires it. The HyperFleet API already stores `lastTransitionTime` on the `Reconciled` condition, which is the authoritative timestamp for reconciliation events. Stamping an additional annotation on every managed resource adds a write on every apply cycle for a capability that may never be used by most adapters.

No ecosystem precedent exists for a standalone last-applied timestamp annotation on managed resources — ArgoCD, Flux, and Helm all avoid it, relying on `metadata.managedFields[].time` or status conditions on controller objects instead. Crossplane is the closest precedent, using RFC3339 timestamps on managed resources for creation-lifecycle tracking only.

**For adapter authors who need this**: add the annotation manually to the manifest template. The framework will not overwrite it (fill-gaps-only merge strategy). A custom key is equally valid — there is no requirement to use the `hyperfleet.io/` prefix for adapter-specific bookkeeping:

```yaml
manifest:
metadata:
annotations:
hyperfleet.io/generation: "{{ .generation }}"
my-adapter.io/last-updated: "{{ .now }}" # custom key, stamped by adapter author
```

The `now()` template function is available in Go templates used for manifest rendering, providing the current RFC3339 timestamp at event execution time.

---

### Per-Resource Retry with Backoff

An earlier version of this design proposed a per-resource retry configuration for apply and delete operations, modeled on the existing retry mechanism for precondition API calls:

```yaml
resources:
- name: "managedCluster"
retry:
maxRetries: 3
retryBackoff: exponential # exponential | linear | constant
activeDeadline: 60s
transport:
client: "kubernetes"
...
```

The proposed behavior: on a transient transport error (connection refused, timeout, 429, 503), the executor would wait for an exponential/linear/constant backoff interval (±10% jitter) and retry up to `maxRetries` times before surfacing the error as a resource failure. The `activeDeadline` would be enforced via `context.WithDeadline`. Implementation would extract the existing retry loop from `internal/hyperfleetapi/client.go` into a new `pkg/retry` package.

**Why this was not adopted**: The Sentinel reconciliation cycle covers both failure scenarios. For convergence events, a failed apply leaves the resource unconverged and Sentinel retriggers at ~10s. For periodic drift-check events (default every 30 minutes), a failed apply could create a gap — but the adapter closes it by reporting `Available=False` on failure, which signals Sentinel to treat the resource as not reconciled and switch back to the ~10s cadence. Provided the adapter always reports failure correctly via status conditions, no inner retry is needed. An inner retry loop also creates concurrent execution risk: if the resource has not converged (or has reported `Available=False`), Sentinel generates new events while the retry holds the current execution open, causing multiple executions to contend against an already-stressed API server.

The existing retry in `internal/hyperfleetapi/client.go` is not affected — precondition calls are synchronous blocking checks where a short backoff before concluding failure is semantically correct, and they are not subject to Sentinel retriggering.

**If this is revisited**: Any inner retry must complete well within the Sentinel cycle (< 5s total backoff). `maxRetries` should be at most 1–2, with no `activeDeadline` longer than a few seconds. It should be framed as "avoid unnecessary Sentinel cycles for sub-second blips," not as a resilience primitive.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

REQUIRED: Separate and expand Trade-offs section with mandatory subsections.

Component documents REQUIRE a structured Trade-offs section distinct from Alternatives Considered. Current document has "Alternatives Considered" (lines 689–754) which explains why certain approaches were rejected, but the Trade-offs section is missing with these required subsections:

  • What We Gain — benefits of the chosen design with measurable impact
  • What We Lose / What Gets Harder — costs and capabilities given up
  • Technical Debt Incurred — debt items: description, impact, remediation plan
  • Acceptable Because — justification for MVP/timeline/scope

Current design trade-offs (e.g., "no per-resource retry," "no framework timestamp annotation") are discussed within Alternatives sections but not consolidated into the structured Trade-offs format. Move them to a separate section and expand with the required structure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 689 - 754, The document is missing a mandatory Trade-offs section
separate from Alternatives Considered. Create a new Trade-offs section with four
required subsections: What We Gain (measurable benefits of chosen design), What
We Lose / What Gets Harder (costs and capabilities given up), Technical Debt
Incurred (debt items with description, impact, and remediation plan), and
Acceptable Because (justification for MVP/timeline/scope). Extract and expand
the current trade-off discussions embedded within the Alternatives Considered
section (particularly the explanations for why the hyperfleet.io/last-updated
annotation and per-resource retry mechanisms were rejected) into this structured
Trade-offs format, moving this content out of the Alternatives section and
consolidating it into the new section with fuller detail on impact and
remediation.

Source: Coding guidelines

@rh-amarin rh-amarin force-pushed the HYPERLFEET-827 branch 2 times, most recently from eb436af to aec94d4 Compare June 15, 2026 09:50

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md (4)

512-546: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Lock down the hyperfleet.io/* label keys before using them for deletes.

The sweep controller turns hyperfleet.io/resource-id and hyperfleet.io/adapter into deletion authority. Without RBAC, admission, or signature protection, a user with patch rights can forge ownership and trigger cleanup of unrelated resources — a CWE-639 / CWE-548 path. As per static analysis hints, add an explicit anti-forgery control before relying on these labels.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 512 - 546, The sweep controller design in the "Option A —
Label-based ownership (sweep controller)" section relies on the
hyperfleet.io/adapter and hyperfleet.io/resource-id labels to make deletion
decisions, but the design lacks security controls to prevent label forgery. Add
explicit anti-forgery protections to the design by documenting RBAC restrictions
that prevent users from patching or creating resources with these labels, or
alternatively describe an admission controller that validates and stamps these
labels only through authorized pathways, or a signature verification mechanism.
Include this security control in the Prerequisites section and add an
operational consideration explaining how label protection prevents CWE-639 and
CWE-548 vulnerabilities where users could forge ownership labels to trigger
unintended resource deletions.

Source: Linters/SAST tools


79-196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Document the CEL validation and sandbox boundary.

The lifecycle gates rely on CEL, but the doc still does not say expressions are validated at config-load time or that the evaluator is restricted to the registered functions. That leaves the threat model ambiguous and hides the fact that HTTP/DNS/filesystem access is out of scope. As per static analysis hints, document the CEL sandboxing boundary explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 79 - 196, Add explicit documentation about CEL expression
validation and sandboxing boundaries to the design document. Create a new
subsection (such as "CEL Validation and Sandboxing") that clarifies: (1) CEL
expressions in lifecycle gates are validated at config-load time to catch syntax
errors early, (2) the evaluator is restricted to a registered set of CEL
functions and does not allow arbitrary function calls, (3) dangerous operations
such as HTTP requests, DNS lookups, and filesystem access are explicitly out of
scope and unavailable, and (4) the threat model assumes only the functions
documented in the adapter framework are accessible. This documentation should
appear early in the lifecycle gates section so readers understand the security
boundary before encountering individual gate examples like the debounce scenario
with timestamp arithmetic.

Source: Linters/SAST tools


689-875: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Split out the mandatory Trade-offs section.

Alternatives Considered is still carrying the design trade-off rationale. Move that material into a dedicated Trade-offs section with What We Gain, What We Lose / What Gets Harder, Technical Debt Incurred, and Acceptable Because; alternatives should stay separate. As per coding guidelines, Trade-offs is mandatory and cannot be skipped.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 689 - 875, The current Alternatives Considered section is mixing
design trade-offs rationale with alternative proposals, which violates the
coding guidelines requiring a mandatory Trade-offs section. Extract the design
rationale content from the Alternatives Considered section and reorganize it
into a new mandatory Trade-offs section positioned before Alternatives
Considered. Structure the Trade-offs section with four subsections: What We
Gain, What We Lose / What Gets Harder, Technical Debt Incurred, and Acceptable
Because. This trade-off rationale currently appears throughout subsections like
hyperfleet.io/last-updated Annotation, Per-Resource Retry with Backoff, and the
ownership model discussions. Keep Alternatives Considered focused solely on the
alternative proposals themselves without the justification of why the chosen
design is better.

Source: Coding guidelines


79-196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required Mermaid diagrams.

These sections still describe lifecycle-gate precedence, retention/current-version selection, and sweep-controller flow only in prose. As per coding guidelines, component documents MUST include Mermaid diagrams for complex flows.

Also applies to: 386-504, 512-546

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 79 - 196, Add Mermaid diagrams to the
adapter-lifecycle-management-design.md document to visualize the complex flows
currently described in prose only. At lines 79-196 (anchor), add a diagram
showing lifecycle-gate precedence that illustrates the decision flow for
resource-level when gates, lifecycle.apply.when, lifecycle.patch evaluation
order, and how lifecycle.delete.when and lifecycle.orphan.when remain exempt
from resource-level when gates. At lines 386-504 (sibling), add a diagram
illustrating retention and current-version selection logic, showing how versions
are compared and which is kept versus removed. At lines 512-546 (sibling), add a
diagram depicting the sweep-controller flow including the sequence of events,
resource discovery, and how lifecycle gates are evaluated at each stage. Each
diagram should clearly show decision points, flow paths, and the interaction
between different lifecycle gates to help readers understand the execution
model.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 426-430: In the Debouncing section, the cross-reference `§1.4` is
incorrect and points to the orphan/disownment section instead of the
resource-level `when` debounce gate being referenced. Locate the correct section
number in the document that describes the resource-level `when` gate for
debouncing, and replace the `§1.4` reference with the correct section number to
ensure readers are directed to the appropriate lifecycle rule.

---

Duplicate comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 512-546: The sweep controller design in the "Option A —
Label-based ownership (sweep controller)" section relies on the
hyperfleet.io/adapter and hyperfleet.io/resource-id labels to make deletion
decisions, but the design lacks security controls to prevent label forgery. Add
explicit anti-forgery protections to the design by documenting RBAC restrictions
that prevent users from patching or creating resources with these labels, or
alternatively describe an admission controller that validates and stamps these
labels only through authorized pathways, or a signature verification mechanism.
Include this security control in the Prerequisites section and add an
operational consideration explaining how label protection prevents CWE-639 and
CWE-548 vulnerabilities where users could forge ownership labels to trigger
unintended resource deletions.
- Around line 79-196: Add explicit documentation about CEL expression validation
and sandboxing boundaries to the design document. Create a new subsection (such
as "CEL Validation and Sandboxing") that clarifies: (1) CEL expressions in
lifecycle gates are validated at config-load time to catch syntax errors early,
(2) the evaluator is restricted to a registered set of CEL functions and does
not allow arbitrary function calls, (3) dangerous operations such as HTTP
requests, DNS lookups, and filesystem access are explicitly out of scope and
unavailable, and (4) the threat model assumes only the functions documented in
the adapter framework are accessible. This documentation should appear early in
the lifecycle gates section so readers understand the security boundary before
encountering individual gate examples like the debounce scenario with timestamp
arithmetic.
- Around line 689-875: The current Alternatives Considered section is mixing
design trade-offs rationale with alternative proposals, which violates the
coding guidelines requiring a mandatory Trade-offs section. Extract the design
rationale content from the Alternatives Considered section and reorganize it
into a new mandatory Trade-offs section positioned before Alternatives
Considered. Structure the Trade-offs section with four subsections: What We
Gain, What We Lose / What Gets Harder, Technical Debt Incurred, and Acceptable
Because. This trade-off rationale currently appears throughout subsections like
hyperfleet.io/last-updated Annotation, Per-Resource Retry with Backoff, and the
ownership model discussions. Keep Alternatives Considered focused solely on the
alternative proposals themselves without the justification of why the chosen
design is better.
- Around line 79-196: Add Mermaid diagrams to the
adapter-lifecycle-management-design.md document to visualize the complex flows
currently described in prose only. At lines 79-196 (anchor), add a diagram
showing lifecycle-gate precedence that illustrates the decision flow for
resource-level when gates, lifecycle.apply.when, lifecycle.patch evaluation
order, and how lifecycle.delete.when and lifecycle.orphan.when remain exempt
from resource-level when gates. At lines 386-504 (sibling), add a diagram
illustrating retention and current-version selection logic, showing how versions
are compared and which is kept versus removed. At lines 512-546 (sibling), add a
diagram depicting the sweep-controller flow including the sequence of events,
resource discovery, and how lifecycle gates are evaluated at each stage. Each
diagram should clearly show decision points, flow paths, and the interaction
between different lifecycle gates to help readers understand the execution
model.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 294b9435-452f-48a1-80da-13da8e4ccd89

📥 Commits

Reviewing files that changed from the base of the PR and between 29f5365 and eb436af.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md (1)

1-891: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

REQUIRED: Separate and expand Trade-offs section with mandatory subsections.

Component documents require a standalone Trade-offs section distinct from Alternatives Considered, with these four required subsections (per coding guidelines):

  • What We Gain — benefits with measurable impact (e.g., "unified ownership via two mechanisms eliminates single point of failure in API force-delete case")
  • What We Lose / What Gets Harder — costs and capabilities foregone (e.g., "sweep controller eventual consistency adds cleanup latency bounded by interval")
  • Technical Debt Incurred — debt items with description, impact, and remediation plan (e.g., "HYPERFLEET-896 blocks hyperfleet.io/resource-id stamping until identifier model is agreed")
  • Acceptable Because — justification for MVP/timeline/scope

Current document has extensive Alternatives Considered (lines 704–891, ~188 lines) but no Trade-offs section. Trade-off discussions embedded in Alternatives (e.g., lines 765–769 rejecting per-resource retry based on Sentinel cycle coverage) must be extracted and consolidated into the structured Trade-offs format.

Create a new section before "Alternatives Considered" with the required subsections; move and expand trade-off content accordingly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 1 - 891, The design document lacks a standalone Trade-offs section
as required by coding guidelines. Insert a new "Trade-offs" section immediately
before the existing "Alternatives Considered" section (after the "Configuration
Reference" subsection ends and before the "Alternatives Considered" header). The
Trade-offs section must include four required subsections: "What We Gain"
(measurable benefits), "What We Lose / What Gets Harder" (costs and foregone
capabilities), "Technical Debt Incurred" (debt items with description, impact,
and remediation), and "Acceptable Because" (MVP/timeline/scope justification).
Extract and consolidate trade-off discussions currently embedded throughout the
Alternatives Considered section — particularly the per-resource retry discussion
(rejecting retry in favor of relying on Sentinel cycle), sweep controller
eventual consistency concerns, dual ownership mechanism complexity — and reframe
them using the Trade-offs structured format with specific examples and
measurable impacts (e.g., "sweep controller latency bounded by configurable
interval" under What We Lose, "eliminates single point of failure in API
force-delete case" under What We Gain). Include technical debt items referencing
blocking tickets like HYPERFLEET-896 for resource-id stamping and
HYPERFLEET-1205 for reconciliation metrics. Ensure the section provides clear
justification for why each trade-off is acceptable for this design's scope and
timeline.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 426-430: The section reference at line 426 incorrectly cites §1.4,
which covers resource disownment with lifecycle.orphan.when, but the current
section is describing resource-level when debounce gates that apply to lifecycle
expressions. Update the cross-reference from §1.4 to §1.1 Resource-Level When
Gate to point readers to the correct lifecycle rule that describes the when gate
behavior being referenced in this section.
- Around line 79-700: The document describes three complex state machines and
control flows entirely in prose, but component documents must include Mermaid
diagrams per coding guidelines. Add three Mermaid flowchart diagrams to address
this requirement: first, a diagram showing the lifecycle gate evaluation order
and precedence rules for resource-level `when`, `lifecycle.apply.when`,
`lifecycle.patch`, `lifecycle.orphan.when`, `lifecycle.recreate.when`, and
`lifecycle.delete.when` with conditional paths based on gate results; second, a
diagram illustrating multi-version discovery and current-version resolution
logic showing how selector-based discovery returns all versions by
`hyperfleet.io/resource-id` label, then resolves the current version by
generation match or timestamp fallback; and third, a diagram for the sweep
controller verification flow showing the steps of listing K8s resources with the
`hyperfleet.io/adapter` label, extracting `hyperfleet.io/resource-id`, calling
the API GET endpoint to verify the resource exists, and deleting the K8s
resource on 404. Each diagram should include labels for conditional branches and
reference the configuration fields and CEL expressions mentioned in the
surrounding prose.
- Around line 527-561: The Option A — Label-based ownership section lacks
security controls to prevent tampering with the hyperfleet.io/resource-id label,
which could allow users with patch permissions to forge labels and trick the
sweep controller into deleting unrelated resources. Choose one of the three
specified security approaches (cryptographic protection via HMAC/signature
validation, RBAC restrictions to prevent modification of hyperfleet.io/* labels,
or validating admission webhook to control label changes), document the chosen
approach in a new subsection within the Prerequisites table in §8A (Option A),
and update the operational considerations if the chosen approach affects
deployment requirements.
- Line 276: Replace the em-dash (—) with a hyphen (-) on line 276 where the text
reads "apply failures via status conditions — this is what keeps the gap
closed". Use the hyphen to properly join the phrase according to standard
grammar convention for connecting related clauses.
- Around line 79-196: The design document lacks explicit documentation about CEL
expression validation, available functions, and security boundaries for
lifecycle gates (lifecycle.when, lifecycle.apply.when, lifecycle.patch.when,
lifecycle.orphan.when, lifecycle.recreate.when, lifecycle.delete.when). Add a
new section titled "CEL Expression Validation and Sandboxing" after section 1.3
or within section 1's introduction that documents: expressions are validated at
config load time via Parse() in TaskConfigValidator, the CEL evaluator
environment is restricted to ext.Strings() plus custom functions (toJson, dig,
now) with no HTTP, DNS, gRPC, or filesystem functions available, expressions
cannot execute arbitrary code or access cluster credentials outside the defined
context, and a note addressing how secrets in expressions are handled to prevent
readers from assuming unsandboxed evaluation.

---

Outside diff comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 1-891: The design document lacks a standalone Trade-offs section
as required by coding guidelines. Insert a new "Trade-offs" section immediately
before the existing "Alternatives Considered" section (after the "Configuration
Reference" subsection ends and before the "Alternatives Considered" header). The
Trade-offs section must include four required subsections: "What We Gain"
(measurable benefits), "What We Lose / What Gets Harder" (costs and foregone
capabilities), "Technical Debt Incurred" (debt items with description, impact,
and remediation), and "Acceptable Because" (MVP/timeline/scope justification).
Extract and consolidate trade-off discussions currently embedded throughout the
Alternatives Considered section — particularly the per-resource retry discussion
(rejecting retry in favor of relying on Sentinel cycle), sweep controller
eventual consistency concerns, dual ownership mechanism complexity — and reframe
them using the Trade-offs structured format with specific examples and
measurable impacts (e.g., "sweep controller latency bounded by configurable
interval" under What We Lose, "eliminates single point of failure in API
force-delete case" under What We Gain). Include technical debt items referencing
blocking tickets like HYPERFLEET-896 for resource-id stamping and
HYPERFLEET-1205 for reconciliation metrics. Ensure the section provides clear
justification for why each trade-off is acceptable for this design's scope and
timeline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 67a22278-f792-418b-88ee-a64e9b62e9f1

📥 Commits

Reviewing files that changed from the base of the PR and between eb436af and aec94d4.

📒 Files selected for processing (1)
  • hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment on lines +79 to +196
### 1. Lifecycle Gates for Apply Behaviour

The existing lifecycle gate pattern (`lifecycle.recreate.when`, `lifecycle.delete.when`) drives resource operations via per-resource CEL expressions evaluated against discovered resource state. The new gates follow the same pattern rather than introducing a separate `updateStrategy` enum.

#### 1.1 Resource-Level When Gate

A resource-level `when` expression is evaluated before any lifecycle gate. If it evaluates to false, the executor skips all modification actions for that resource in the current event — no apply, no patch, no recreate. Delete and orphan are exempt: they are evaluated regardless of the `when` result, so a freshness check never accidentally blocks a deletion.

```yaml
resources:
- alias: "provisioningJob"
when:
expression: "..."
lifecycle:
...
```

This gate follows the same CEL evaluation context as all other lifecycle expressions: extracted params, discovered resources, and adapter metadata are all available.

##### Discovery always runs

Discovery is **not** skipped when `when` evaluates to false. The executor still discovers the current state of the resource and makes it available as `resources.provisioningJob` (an optional value, absent if nothing was found). This means:

- Sibling resources can read `resources.provisioningJob` in their own `when` expressions, `lifecycle.*` expressions, and manifest templates, regardless of whether this resource's `when` gate passed.
- The resource-level `when` expression itself can read `resources.provisioningJob` — which is the typical pattern for debouncing: inspect the current resource's `creationTimestamp` and skip writes if it is too fresh.

The sequencing is: **discover → evaluate `when` → if true, apply/patch/recreate; if false, skip writes → evaluate delete/orphan regardless**.

##### Finer-grained control than preconditions

A precondition failure is event-scoped: if a precondition evaluates to false or errors, the executor aborts the entire resources phase — all remaining resources in the event are skipped, not just the one the precondition was intended to guard. This makes preconditions the right tool for event-wide guards ("abort if the cluster is not yet registered") but the wrong tool for resource-specific guards.

The resource-level `when` gate is resource-scoped: if it evaluates to false, only that resource's modification actions are skipped. The executor continues to the next resource in the list without interruption. Other resources are discovered, evaluated, and applied normally. This matters when resources in the same event are logically independent — a debounce on a provisioning job should not prevent an annotation patch on a separate namespace resource from running in the same event.

##### Debouncing with timestamp arithmetic

The primary use case is rate-limiting resource modifications when Sentinel sends a burst of convergence events — for example, preventing a new provisioning job from being created on every ~10s event when the previous one was just launched.

CEL's standard library provides `timestamp()` and `duration()` types with full arithmetic support. Combined with the existing `now()` function, debounce conditions require no new framework functions:

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
lifecycle:
recreate:
when:
expression: >
resources.provisioningJob.status.conditions.exists(
c, c.type == "Complete" && c.status == "True")
```

The `when` expression passes if no current version exists, or if the current version is older than 10 minutes. During a convergence burst, all events within the 10-minute window skip the resource entirely — `lifecycle.recreate.when` is never reached. Once the interval elapses, the next event evaluates `lifecycle.recreate.when` normally.

`metadata.creationTimestamp` arrives from Kubernetes as an RFC3339 string and is parsed correctly by `timestamp()`. `timestamp(now()) - duration("10m")` returns a timestamp 10 minutes in the past; comparison with `<` is native to CEL timestamp types. Duration strings follow Go conventions: `h`, `m`, `s`, `ms`.

The resource-level `when` is also useful outside retention scenarios — any condition that should gate the entire resource rather than a specific lifecycle action belongs here.

#### 1.2 lifecycle.apply.when (CreateOnly)

`lifecycle.apply.when` gates whether the apply step runs at all for a given resource. If the expression evaluates to false, the apply step is skipped and the executor continues to the next resource.

The primary use case is create-only behaviour: only write the resource if it does not already exist.

```yaml
resources:
- name: "bootstrapNamespace"
transport:
client: "kubernetes"
manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
apply:
when:
expression: "!resources.?bootstrapNamespace.hasValue()"
```

The expression has access to the same CEL context as `lifecycle.recreate.when`: extracted params, discovered resources, and adapter metadata. Discovery must be configured so that the current resource state is available in `resources.*` for the expression to evaluate against.

**Why not preconditions**: preconditions are event-scoped — a failing precondition skips the entire event, not just the resource it was intended to guard. `lifecycle.apply.when` is per-resource and does not affect sibling resources.

#### 1.3 lifecycle.patch (Patch)

Patch is an operation on an already-existing resource, not a creation strategy. `lifecycle.patch` is an array — each entry has its own `when` gate and `document`. The executor evaluates every entry independently and applies all whose `when` evaluates to true, in order.

```yaml
resources:
- name: "clusterNamespace"
transport:
client: "kubernetes"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
patch:
- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.annotations, "example.io/status") != clusterStatus
document:
metadata:
annotations:
example.io/status: "{{ .clusterStatus }}"

- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.labels, "example.io/tier") != tier

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Document CEL expression validation boundaries and sandboxing constraints.

Lifecycle gates (lifecycle.when, lifecycle.apply.when, lifecycle.patch, lifecycle.orphan.when, lifecycle.recreate.when, lifecycle.delete.when) and periodic execution (§5) all rely on CEL expressions validated at config load time. The design does not document what functions are available, what is restricted, or the threat model.

Verification against linked repository [hyperfleet-adapter/internal/criteria/cel_evaluator.go] confirms: CEL environment restricted to ext.Strings() plus three custom functions (toJson, dig, now). No HTTP, DNS, gRPC, or filesystem functions available. Expressions cannot execute arbitrary code or access cluster credentials outside the defined context.

Add a section (after §1.3 or in §1 introduction) titled "CEL Expression Validation and Sandboxing" explicitly stating:

  1. Expressions are validated at config load time via Parse() in TaskConfigValidator
  2. CEL evaluator environment restricted to safe built-in functions and explicit helpers only
  3. Network-capable functions (HTTP, DNS, gRPC) and filesystem access not available
  4. Expressions cannot execute arbitrary code or escape the defined context
  5. Secrets in expressions are logged (clarify or fix if logging SIL policy applies)

This prevents readers from assuming unsandboxed evaluation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 79 - 196, The design document lacks explicit documentation about
CEL expression validation, available functions, and security boundaries for
lifecycle gates (lifecycle.when, lifecycle.apply.when, lifecycle.patch.when,
lifecycle.orphan.when, lifecycle.recreate.when, lifecycle.delete.when). Add a
new section titled "CEL Expression Validation and Sandboxing" after section 1.3
or within section 1's introduction that documents: expressions are validated at
config load time via Parse() in TaskConfigValidator, the CEL evaluator
environment is restricted to ext.Strings() plus custom functions (toJson, dig,
now) with no HTTP, DNS, gRPC, or filesystem functions available, expressions
cannot execute arbitrary code or access cluster credentials outside the defined
context, and a note addressing how secrets in expressions are handled to prevent
readers from assuming unsandboxed evaluation.

Source: Learnings

Comment on lines +79 to +700
### 1. Lifecycle Gates for Apply Behaviour

The existing lifecycle gate pattern (`lifecycle.recreate.when`, `lifecycle.delete.when`) drives resource operations via per-resource CEL expressions evaluated against discovered resource state. The new gates follow the same pattern rather than introducing a separate `updateStrategy` enum.

#### 1.1 Resource-Level When Gate

A resource-level `when` expression is evaluated before any lifecycle gate. If it evaluates to false, the executor skips all modification actions for that resource in the current event — no apply, no patch, no recreate. Delete and orphan are exempt: they are evaluated regardless of the `when` result, so a freshness check never accidentally blocks a deletion.

```yaml
resources:
- alias: "provisioningJob"
when:
expression: "..."
lifecycle:
...
```

This gate follows the same CEL evaluation context as all other lifecycle expressions: extracted params, discovered resources, and adapter metadata are all available.

##### Discovery always runs

Discovery is **not** skipped when `when` evaluates to false. The executor still discovers the current state of the resource and makes it available as `resources.provisioningJob` (an optional value, absent if nothing was found). This means:

- Sibling resources can read `resources.provisioningJob` in their own `when` expressions, `lifecycle.*` expressions, and manifest templates, regardless of whether this resource's `when` gate passed.
- The resource-level `when` expression itself can read `resources.provisioningJob` — which is the typical pattern for debouncing: inspect the current resource's `creationTimestamp` and skip writes if it is too fresh.

The sequencing is: **discover → evaluate `when` → if true, apply/patch/recreate; if false, skip writes → evaluate delete/orphan regardless**.

##### Finer-grained control than preconditions

A precondition failure is event-scoped: if a precondition evaluates to false or errors, the executor aborts the entire resources phase — all remaining resources in the event are skipped, not just the one the precondition was intended to guard. This makes preconditions the right tool for event-wide guards ("abort if the cluster is not yet registered") but the wrong tool for resource-specific guards.

The resource-level `when` gate is resource-scoped: if it evaluates to false, only that resource's modification actions are skipped. The executor continues to the next resource in the list without interruption. Other resources are discovered, evaluated, and applied normally. This matters when resources in the same event are logically independent — a debounce on a provisioning job should not prevent an annotation patch on a separate namespace resource from running in the same event.

##### Debouncing with timestamp arithmetic

The primary use case is rate-limiting resource modifications when Sentinel sends a burst of convergence events — for example, preventing a new provisioning job from being created on every ~10s event when the previous one was just launched.

CEL's standard library provides `timestamp()` and `duration()` types with full arithmetic support. Combined with the existing `now()` function, debounce conditions require no new framework functions:

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
lifecycle:
recreate:
when:
expression: >
resources.provisioningJob.status.conditions.exists(
c, c.type == "Complete" && c.status == "True")
```

The `when` expression passes if no current version exists, or if the current version is older than 10 minutes. During a convergence burst, all events within the 10-minute window skip the resource entirely — `lifecycle.recreate.when` is never reached. Once the interval elapses, the next event evaluates `lifecycle.recreate.when` normally.

`metadata.creationTimestamp` arrives from Kubernetes as an RFC3339 string and is parsed correctly by `timestamp()`. `timestamp(now()) - duration("10m")` returns a timestamp 10 minutes in the past; comparison with `<` is native to CEL timestamp types. Duration strings follow Go conventions: `h`, `m`, `s`, `ms`.

The resource-level `when` is also useful outside retention scenarios — any condition that should gate the entire resource rather than a specific lifecycle action belongs here.

#### 1.2 lifecycle.apply.when (CreateOnly)

`lifecycle.apply.when` gates whether the apply step runs at all for a given resource. If the expression evaluates to false, the apply step is skipped and the executor continues to the next resource.

The primary use case is create-only behaviour: only write the resource if it does not already exist.

```yaml
resources:
- name: "bootstrapNamespace"
transport:
client: "kubernetes"
manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
apply:
when:
expression: "!resources.?bootstrapNamespace.hasValue()"
```

The expression has access to the same CEL context as `lifecycle.recreate.when`: extracted params, discovered resources, and adapter metadata. Discovery must be configured so that the current resource state is available in `resources.*` for the expression to evaluate against.

**Why not preconditions**: preconditions are event-scoped — a failing precondition skips the entire event, not just the resource it was intended to guard. `lifecycle.apply.when` is per-resource and does not affect sibling resources.

#### 1.3 lifecycle.patch (Patch)

Patch is an operation on an already-existing resource, not a creation strategy. `lifecycle.patch` is an array — each entry has its own `when` gate and `document`. The executor evaluates every entry independently and applies all whose `when` evaluates to true, in order.

```yaml
resources:
- name: "clusterNamespace"
transport:
client: "kubernetes"
discovery:
by_name: "{{ .clusterId }}"
lifecycle:
patch:
- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.annotations, "example.io/status") != clusterStatus
document:
metadata:
annotations:
example.io/status: "{{ .clusterStatus }}"

- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.labels, "example.io/tier") != tier
document:
metadata:
labels:
example.io/tier: "{{ .tier }}"
```

The first entry fires only when the `example.io/status` annotation on the discovered resource differs from the current `clusterStatus` param — the expression reads the live value off the discovered resource and compares it to the desired value. The second entry fires independently when the label drifts. Both can apply in the same event; neither fires if the discovered state already matches.

Each `document` is the patch body, rendered through the normal Go template pipeline and sent as a JSON merge patch (RFC 7396): fields present are set, fields absent are left untouched on the server, fields explicitly set to `null` are removed.

`lifecycle.patch` and the normal apply path are mutually exclusive per resource: if any patch entry fires, the standard apply step is skipped for that resource. This is consistent with how `lifecycle.recreate.when` and `lifecycle.delete.when` take precedence over the default path.

**Transport**: Maestro's `PatchManifestWork` already uses JSON merge patch internally. For K8s, the transport client issues a `Patch` call with `types.MergePatchType`. The `PatchResourceLabels` primitive added for `lifecycle.orphan.when` provides the same transport method.

**Note on recreation**: `recreate_on_change: true` is superseded by `lifecycle.recreate.when` from [Adapter Recreation Flow Design](./adapter-recreation-flow-design.md) (HYPERFLEET-837). Recreation is a lifecycle gate, not a strategy.

#### 1.4 Resource Disownment (lifecycle.orphan.when)

An adapter can already release a resource without deleting it: skip the delete step in `lifecycle.delete.when` and report `Finalized=True` in the post-action status conditions. The platform sees `Finalized=True` and stops sending events for that resource.

`lifecycle.orphan.when` makes this a first-class lifecycle gate. The expression is typically gated on the event being a deletion and the resource still carrying its management labels — confirming the resource is still managed and this event is the handoff point.

```yaml
resources:
- name: "provisioningJob"
transport:
client: "kubernetes"
discovery:
by_name: "{{ .clusterId }}-job"
lifecycle:
orphan:
when:
expression: >
is_deleting
&& resources.?provisioningJob.hasValue()
&& "hyperfleet.io/adapter" in resources.provisioningJob.metadata.labels
```

When `lifecycle.orphan.when` evaluates to true, the executor:

1. Skips the delete step entirely
2. Issues a label-strip patch via the transport client, removing `hyperfleet.io/adapter` and `hyperfleet.io/resource-id`
3. Sets `Finalized=True` in the execution result; the post-action phase reports it automatically
4. Preserves `hyperfleet.io/generation` (it belongs to the resource, not the management relationship)

The label presence check in the expression (`"hyperfleet.io/adapter" in resources.provisioningJob.metadata.labels`) ensures the gate is idempotent: if the event is redelivered after labels have already been stripped, the expression evaluates to false and neither delete nor orphan fires.

**Precedence**: `lifecycle.orphan.when` is evaluated before `lifecycle.delete.when`. If orphan fires, delete is skipped.

**Implementation**: Add `orphan.when` CEL field to `LifecycleConfig`. Add a `PatchResourceLabels(ctx, gvk, namespace, name, remove []string, target)` method to the `TransportClient` interface for the label-strip operation.

**Maestro compatibility**: For Maestro ManifestWork resources, the label-strip is applied to the ManifestWork object itself, not to the nested manifests. The sweep controller must account for this when reconciling Maestro-managed resources.

### 2. Automatic Label and Annotation Stamping

**Current state**: The framework defines a set of standard `hyperfleet.io/*` label and annotation keys as string constants in `pkg/constants/constants.go` (`hyperfleet.io/generation`, `managed-by`, `cluster-id`, `created-by`). These are a shared reference for adapter authors — the framework does not inject them. Each adapter's manifest template must add them explicitly by hand. The only exception is `hyperfleet.io/generation`: this annotation is mandatory and validated at apply time by `internal/manifest/generation.go`, which refuses to apply any manifest missing it or carrying a non-numeric value.

This means label coverage across adapter-created resources is inconsistent: adapters that omit `hyperfleet.io/managed-by` or `hyperfleet.io/cluster-id` from their templates are invisible to any tooling that relies on those labels for discovery, ownership tracking, or sweep-based cleanup.

**Proposal**: The framework injects the standard set of `hyperfleet.io/*` labels and annotations at apply time, before the manifest reaches the transport client. Adapter authors no longer need to add them manually. The merge strategy is fill-gaps-only: labels already present in the adapter's manifest template take precedence, so intentional overrides are preserved and no existing adapter is broken.

**Standard labels stamped on every resource:**

| Label / Annotation | Value Source | Notes |
|---|---|---|
| `hyperfleet.io/adapter` | Adapter config: `adapter.name` | Identifies the adapter instance managing this resource |
| `hyperfleet.io/resource-id` | TBD — pending HYPERFLEET-896 alignment | Stable identifier across recreations |
| `hyperfleet.io/generation` | Event param: `generation` | Already enforced; automatic stamping is a no-op if already present |

**Implementation**: A `stampLabels(manifest, frameworkLabels)` function called in `resource_executor.go` before `ApplyResource()`. Works identically for K8s and Maestro.

> **Blocked on HYPERFLEET-896**: The `hyperfleet.io/resource-id` value is blocked on alignment with the CRD-Driven Generic Resource API team. Stamping can be shipped without this label and extended once the identifier model is agreed.

### 3. Resilience Model

The adapter does not implement per-resource retry for apply and delete operations. The resilience model for HyperFleet is the Sentinel reconciliation cycle: Sentinel continuously compares observed resource state against the desired spec and generates a new event approximately every 10 seconds for as long as the state has not converged. Once the state converges, Sentinel switches to a periodic drift-check cadence (default: every 30 minutes).

**Convergence events**: a transient apply failure leaves the resource state unchanged and unconverged. Sentinel observes continued divergence and sends the next event at the ~10s cadence. No inner retry is needed.

**Drift-check events**: Sentinel also sends periodic events (default every 30 minutes) to detect and correct drift on resources that were previously converged. This introduces a corner case: if the apply fails during a drift-check event, the resource state is unchanged and still appears converged to Sentinel, which would otherwise not retrigger for another 30 minutes. This gap is closed by the adapter reporting `Available=False` on the API resource whenever an apply fails. `Available=False` signals to Sentinel that the resource is not reconciled, causing it to switch from the drift cadence back to the ~10s convergence cadence. The adapter must always surface apply failures via status conditions — this is what keeps the gap closed regardless of which Sentinel cadence triggered the event.

Adding an inner retry loop inside a single event execution creates a two-loop problem:

- **Concurrent execution risk**: a retry with a long `activeDeadline` holds the event execution open while Sentinel generates new events because the resource has not converged (or has reported `Available=False`). Multiple executions attempt the same apply against an already-stressed API server.
- **Redundancy**: the failure classes an inner retry would handle are already covered by the Sentinel cycle, provided the adapter surfaces failures correctly.
- **Backoff amplification**: exponential backoff inside the adapter does not coordinate with Sentinel's cadence. Under sustained API server pressure, both loops fire independently, increasing load rather than reducing it.

The adapter's responsibility is to drive state convergence and always report failure via `Available=False` and status conditions. Retry is a property of the reconciliation loop, not of a single event execution.

The existing retry mechanism in `internal/hyperfleetapi/client.go` (used for precondition API calls) is retained as-is. Precondition calls are synchronous blocking checks where a short backoff before concluding failure is semantically correct. Resource apply and delete do not share this property.

### 6. API-Level Deletion Policy for Owned Resources

This proposal is defined at the HyperFleet API layer, not in the adapter framework. It is documented here because it directly determines what events adapters receive when a parent resource is deleted, and therefore what cleanup the adapter is expected to perform.

#### What HYPERFLEET-896 defines

The [Generic Resource Registry design](../../../docs/generic-resource-registry-design.md) introduces a descriptor-driven `OnParentDelete` policy declared on each **child** entity type. When a parent resource is deleted, the API applies each child type's policy before proceeding:

| Policy | Behaviour |
|---|---|
| `restrict` (default) | Parent deletion is rejected with `409 Conflict` if any active children of this type exist. The caller must delete children first. |
| `cascade` | All children of this type are soft-deleted recursively (DFS, innermost first) before the parent is deleted. The entire tree is soft-deleted atomically. |

With `cascade`, each child soft-deletion triggers the normal adapter reconciliation flow — adapters receive delete events and use `lifecycle.delete.when` to clean up the corresponding K8s or Maestro resources. No special adapter behaviour is required beyond what already exists.

#### The Orphan case

The ticket specifies a third policy — **`orphan`** — where parent deletion proceeds but children are left in place, neither deleted nor blocked. The children remain in the API as active resources without a parent; no delete events are sent to adapters.

This policy is **not currently in the HYPERFLEET-896 design** and would need to be added. Its implications:

- Children remain as API resources with no owner reference
- Adapters do not receive delete events; the K8s/Maestro resources they manage also remain in place
- The sweep controller (`§8A`) cannot clean them up either — those K8s resources still have a live API counterpart (the orphaned child API resource returns 200, not 404)
- Cleanup would require either manual deletion of each child, or a separate sweep mechanism that queries for parentless resources and acts on them

**Relationship to `lifecycle.orphan.when` (`§1.4`)**: these are distinct mechanisms operating at different layers. `lifecycle.orphan.when` is an adapter-level gate for planned disownment during a normal event lifecycle — the API resource still exists. The API-level `orphan` deletion policy operates at the API layer when a parent is deleted, independent of any adapter event.

#### Action required

A proposal to add `OnParentDeleteOrphan` to the Generic Resource Registry descriptor model is needed before this policy can be implemented. The adapter framework requires no changes for the `restrict` and `cascade` cases.

### 4. Stuck Detection

**Current state**: The HyperFleet API already exposes deletion-specific stuck detection via `hyperfleet_api_resource_pending_deletion_stuck` — a gauge computed on each Prometheus scrape by querying the database for resources with `deleted_time` set beyond a configurable threshold (default 30 minutes). This covers the deletion flow only and has no equivalent for create and update reconciliations.

**Planned** ([HYPERFLEET-1205](https://redhat.atlassian.net/browse/HYPERFLEET-1205)): A unified reconciliation metrics system that extends the same pattern to all reconciliation flows using the `Reconciled=False` condition already stored on API resources.

New metrics:

| Metric | Type | Description |
|---|---|---|
| `hyperfleet_api_reconciliation_requests_total` | Counter | Incremented when `Reconciled` transitions to `False` |
| `hyperfleet_api_resource_pending_reconciliation` | Gauge | Resources currently in `Reconciled=False` state |
| `hyperfleet_api_resource_pending_reconciliation_stuck` | Gauge | Resources where `Reconciled=False` has persisted beyond the stuck threshold |
| `hyperfleet_api_resource_pending_reconciliation_stuck_duration_seconds` | Gauge | Maximum duration any resource has been stuck |

All metrics carry an `is_delete` label (`true` when `deleted_time IS NOT NULL`) to differentiate deletion flows from create/update flows. The existing deletion metrics become redundant and will be removed once HYPERFLEET-1205 ships.

**Why no adapter-side annotation is needed**: the API stores the `Reconciled=False` condition with a transition timestamp in its database. Duration and stuck detection are computed from that timestamp via DB queries at scrape time — the API is the authoritative source. The adapter's responsibility is to correctly report `Available=False` on apply failure, which drives `Reconciled=False` on the API resource. Sentinel picks up the signal and retriggers; if `Reconciled=False` persists beyond the threshold, the API metrics surface it. No annotation on the managed K8s resource is required.

### 5. Periodic Execution

The adapter has no built-in scheduler. Periodic execution is provided by the Sentinel reconciliation cycle, which operates at two cadences:

- **Convergence cadence (~10s)**: triggered while `Reconciled=False` — the resource has not yet reached the desired state. Every event execution is an opportunity to drive the resource toward convergence.
- **Drift-check cadence (default 30 minutes)**: triggered when the resource is considered reconciled. Sentinel re-evaluates the resource to detect and correct configuration drift.

Both cadences are configured per resource type, not per individual resource. That is the granularity available: all resources of a given type share the same drift-check interval. There is no mechanism to set a different cadence for individual resources within the same type.

#### Finer-grained control with lifecycle CEL expressions

Every `when` expression — `lifecycle.apply.when`, `lifecycle.patch`, `lifecycle.orphan.when`, `lifecycle.recreate.when`, `lifecycle.delete.when` — is evaluated on every event, including drift-check events. This makes lifecycle gates the natural mechanism for expressing finer-grained periodic behaviour within the drift cadence.

An adapter author can gate an operation on drift-check events using conditions that express time-awareness or state-awareness without any framework changes:

```yaml
lifecycle:
patch:
- when:
expression: >
resources.?clusterNamespace.hasValue()
&& dig(resources.clusterNamespace.metadata.annotations, "example.io/rotated-at") < now()
document:
metadata:
annotations:
example.io/rotated-at: "{{ .now }}"
```

The `now()` CEL function returns the current time as an RFC3339 string, enabling expressions that compare timestamps read from the discovered resource against the current time.

#### Using last-reported time from adapter conditions

An adapter can retrieve the timestamp of its own last successful reconciliation — or that of another adapter — by reading the status conditions on the HyperFleet API resource. Conditions carry a `lastTransitionTime` field that records when `Reconciled` last changed. This can be fetched via a precondition API call and injected as a param, making it available in CEL expressions across the entire event execution.

This allows an adapter to express "only perform this operation if it has not been done in the last N hours," even though Sentinel sends drift events every 30 minutes:

```yaml
preconditions:
- name: "fetchClusterStatus"
type: hyperfleet_api
resource: cluster
id: "{{ .clusterId }}"
output_param: clusterResource

resources:
- name: "certSecret"
lifecycle:
patch:
- when:
expression: >
resources.?certSecret.hasValue()
&& dig(clusterResource, "status.conditions")
.filter(c, c.type == "CertRotated")
.all(c, c.lastTransitionTime < timestamp(now()).toUnixTime() - 86400)
document:
metadata:
annotations:
example.io/cert-rotated-at: "{{ .now }}"
```

The same pattern applies to conditions reported by other adapters managing the same cluster resource — an adapter can make its drift-check behaviour conditional on what a peer adapter last reported, without any direct coupling between the adapters.

### 7. Resource Retention

Some adapter resources should accumulate as history rather than be replaced in-place — provisioning jobs, pipeline runs, audit records. The `retention:` block at the resource level enables this: it configures how many historical versions are kept and for how long.

**Retention is not a lifecycle gate.** It is a resource-level policy that runs as a post-apply cleanup pass. It is independent of `lifecycle.recreate.when` — the gate decides *whether* to create a new version; retention decides *how many old versions to keep* after the new one exists.

#### Replace mode vs. retain mode

The presence or absence of `retention:` determines the recreation strategy when `lifecycle.recreate.when` fires:

- **Replace mode** (no `retention:` block): the old resource is deleted first, then the new manifest is applied with the same name. One resource exists at a time.
- **Retain mode** (`retention:` block present): the new manifest is applied without deleting the old one first. The adapter author controls the naming to ensure no collision — the name should vary per creation (e.g., include a timestamp or generation suffix). Old versions accumulate and are pruned by retention rules.

This distinction matters because the two modes have incompatible semantics: replace requires the old name to be free before creating; retain requires the old resource to remain alive until pruned.

#### Multi-result discovery

In retain mode, discovery must find all existing versions of the logical resource, not a single resource by name. The standard mechanism is a label selector on `hyperfleet.io/resource-id`, which is stamped automatically on every resource (see `§2`):

```yaml
discovery:
by_selectors:
label_selector:
hyperfleet.io/resource-id: "{{ .resourceId }}"
```

This returns a list of all versions sharing the same API resource identity. The executor resolves the **current version** — the entry whose `hyperfleet.io/generation` annotation matches the current event's generation param, or the newest by `metadata.creationTimestamp` as fallback — and exposes only that single resource as `resources.myJob` in CEL expressions. The full version list is an internal executor concern.

**Authors write lifecycle expressions exactly as they do today.** `resources.myJob` is always a single optional resource, never a list. The multi-version behavior (pruning historical versions, applying delete or orphan gates across all versions) is performed by the executor transparently, without the author needing to iterate over versions in CEL.

#### Retention rules

```yaml
retention:
historyLimit: 3 # keep at most N historical versions; oldest pruned first
ttl: 24h # delete versions older than this duration
```

Both rules apply independently and are evaluated against historical versions only. The current version is never pruned. If a historical version exceeds `ttl` OR the count of historical versions exceeds `historyLimit`, it is deleted. The pruning pass runs after the apply step on every event where `retention:` is configured.

#### Debouncing

In retain mode, rapid convergence events can cause rapid version creation — each event creating a new version before the previous one has had a chance to run. Debouncing is expressed via the resource-level `when` gate (`§1.4`), which skips all modification actions when the current version is too fresh:

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
retention:
historyLimit: 5
ttl: 48h
lifecycle:
recreate:
when:
expression: "..."
```

The resolution is bounded by the Sentinel event cadence. With Sentinel sending convergence events every ~10 seconds and a 10-minute debounce:

```
t=0:00 — event arrives, no current version exists → new job created (v1)
t=0:10 — event arrives, v1 is 10s old < 10m → resource skipped, v1 retained
t=0:20 — event arrives, v1 is 20s old < 10m → resource skipped, v1 retained
...
t=10:00 — event arrives, v1 is 10m old ≥ 10m → recreate.when evaluated → new job created (v2)
```

Creation is limited to at most once per interval regardless of how many events Sentinel sends.

#### Delete and orphan semantics with multiple versions

When a delete event arrives and multi-result discovery is active, all discovered versions of the resource are affected — not only the current version. Leaving historical versions in place after a delete would orphan them with no cleanup path, since the API resource they reference no longer exists.

- **`lifecycle.delete.when`**: evaluated once as a gate; if true, the executor deletes all discovered versions.
- **`lifecycle.orphan.when`**: evaluated once as a gate; if true, ownership labels are stripped from all discovered versions before reporting `Finalized=True`. This ensures no version remains visible to the sweep controller.

The precedence rule is unchanged: `lifecycle.orphan.when` is evaluated before `lifecycle.delete.when`. If orphan fires, the delete step is skipped for all versions.

#### Example — provisioning job with history

```yaml
resources:
- alias: provisioningJob
when:
expression: >
!resources.?provisioningJob.hasValue()
|| timestamp(resources.provisioningJob.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")
transport:
client: "kubernetes"
manifest:
apiVersion: batch/v1
kind: Job
metadata:
name: "{{ .clusterId }}-job-{{ .now | replace \":\" \"-\" }}"
namespace: "{{ .namespace }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
discovery:
namespace: "{{ .namespace }}"
by_selectors:
label_selector:
hyperfleet.io/resource-id: "{{ .resourceId }}"
retention:
historyLimit: 5
ttl: 48h
lifecycle:
recreate:
when:
expression: >
resources.provisioningJob.status.conditions.exists(
c, c.type == "Complete" && c.status == "True")
|| resources.provisioningJob.status.conditions.exists(
c, c.type == "Failed" && c.status == "True")
```

The resource-level `when` prevents a burst of convergence events from creating multiple jobs: if the current version is younger than 10 minutes the entire resource is skipped. Once the interval elapses, `lifecycle.recreate.when` is evaluated — it fires when the job has completed or failed, triggering a new run. Retention keeps at most 5 historical runs and discards any older than 48 hours.

### 8. Ownership Model

"Ownership" in the adapter context means that the framework can determine which managed resources belong to a given HyperFleet API resource, and can clean them up when that API resource is deleted or disowned. Two complementary mechanisms serve this purpose, each suited to a different kind of parent-child relationship.

#### Option A — Label-based ownership (sweep controller)

Every managed resource carries `hyperfleet.io/adapter` and `hyperfleet.io/resource-id` labels stamped automatically by the framework (see `§2`). The sweep controller is a separate background job that uses these labels to detect and clean up resources that were orphaned when their corresponding HyperFleet API resource was force-deleted.

**When to use**: any resource whose logical owner lives in the HyperFleet API — Clusters, NodePools, or any future API resource type. The sweep controller is the correct mechanism because the owner has no K8s representation to reference.

##### Why it cannot use the API as its source of truth

After a force-delete, the API has no record of the resource. The sweep controller cannot ask "what resources should exist" — it must instead ask "does this resource that exists on the cluster still have a live API counterpart?"

##### Flow

1. **Discover**: list all K8s resources (and Maestro ManifestWork objects) carrying the `hyperfleet.io/adapter` label, across all namespaces and resource kinds
2. **Verify**: for each discovered resource, extract `hyperfleet.io/resource-id` and call `GET /api/hyperfleet/v1/resources/{id}` on the HyperFleet API
3. **Act**: if the API returns 404 — the API resource was force-deleted — delete the K8s resource; if the API returns the resource, skip it

The generic `/resources` endpoint from [HYPERFLEET-896](https://issues.redhat.com/browse/HYPERFLEET-896) is the key enabler: the sweep controller works across all resource kinds without kind-specific logic.

##### Prerequisites

| Prerequisite | Provided by |
|---|---|
| `hyperfleet.io/adapter` label on all managed resources | Automatic label stamping (`§2`) |
| `hyperfleet.io/resource-id` carrying the API UUID | Automatic label stamping (`§2`) |
| Labels stripped on intentional disownment | `lifecycle.orphan.when` (`§1.4`) — orphaned resources are invisible to the sweep controller |
| Generic `/resources` endpoint | HYPERFLEET-896 — required for the verification call |

##### Operational considerations

- **Interval**: configurable; a few minutes is sufficient — force-deleted resources are not immediately harmful, just wasteful
- **Leader election**: required for multi-replica deployments to avoid concurrent deletions
- **Maestro**: ManifestWork objects carry the labels on the ManifestWork itself; the sweep controller deletes the ManifestWork, not the nested manifests
- **Dry-run mode**: should support a read-only mode that reports what would be deleted without acting

**Limitations**: cleanup is eventual (bounded by the sweep interval), and requires the sweep controller to be deployed and operational.

#### Option B — K8s ownerReferences

Kubernetes garbage collection cascades deletions through `ownerReferences`: if object B has an `ownerReference` pointing to object A, deleting A causes Kubernetes to delete B automatically. This happens synchronously at the K8s API level — no external component required.

The adapter can set an `ownerReference` on a managed resource pointing to another resource managed by the same adapter. This is expressed as a per-resource `owner:` config field naming another resource alias in the same adapter config:

```yaml
resources:
- alias: clusterNamespace
transport:
client: "kubernetes"
manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"

- alias: provisioningJob
owner:
resource: clusterNamespace # ownerReference points to the Namespace above
transport:
client: "kubernetes"
manifest:
apiVersion: batch/v1
kind: Job
metadata:
name: "{{ .clusterId }}-job"
namespace: "{{ .clusterId }}"
annotations:
hyperfleet.io/generation: "{{ .generation }}"
```

When the executor applies `provisioningJob`, it sets an `ownerReference` on the Job pointing to the already-applied Namespace. If the adapter later deletes the Namespace (via `lifecycle.delete.when`), Kubernetes automatically deletes the Job — the adapter does not need to issue a separate delete call for it.

**When to use**: resources that have a natural K8s parent-child relationship within the same adapter — for example, Jobs or Secrets scoped to a Namespace that the adapter also manages. The parent and child must live in the same namespace (K8s cross-namespace ownerReferences are not supported for namespaced resources).

**Limitations**:
- Only applies to K8s direct transport. Maestro ManifestWork objects do not participate in K8s garbage collection — `ownerReferences` on a ManifestWork are not propagated to the nested manifests that Maestro applies on the spoke cluster.
- The referenced `owner.resource` alias must appear earlier in the resources list (it must already be applied so its UID is known).
- Does not replace label-based ownership for the HyperFleet API force-delete case — `ownerReferences` only cascade within the cluster; they cannot respond to a deletion in an external API.

#### Relationship between the two mechanisms

They are complementary, not alternatives:

| Scenario | Mechanism |
|---|---|
| HyperFleet API resource force-deleted, K8s resources orphaned | Label-based sweep controller (`§8A`) |
| Adapter deletes a parent K8s resource; child K8s resources should follow | K8s `ownerReferences` (`§8B`) |
| Adapter explicitly disowns a resource without deleting it | `lifecycle.orphan.when` strips labels; sweep controller ignores it (`§1.4`) |

An adapter author can use both simultaneously: label-based ownership ensures the sweep controller can find the resource if the API record disappears, while `ownerReferences` ensure that a parent deletion within the cluster cascades to children without requiring the adapter to enumerate and delete each child explicitly.

### Configuration Reference

Full resource configuration with all new lifecycle and retention fields:

```yaml
resources:
- alias: "example"
# NEW: resource-level gate evaluated before any lifecycle action.
# If false, apply/patch/recreate are skipped. Delete and orphan are unaffected.
when:
expression: >
!resources.?example.hasValue()
|| timestamp(resources.example.metadata.creationTimestamp)
< timestamp(now()) - duration("10m")

transport:
client: "kubernetes" # or "maestro"
maestro:
target_cluster: "{{ .clusterId }}"

manifest:
apiVersion: v1
kind: Namespace
metadata:
name: "{{ .clusterId }}"
labels:
# Framework stamps hyperfleet.io/adapter and hyperfleet.io/resource-id automatically.
# Explicit values in the template take precedence (fill-gaps-only merge).
annotations:
hyperfleet.io/generation: "{{ .generation }}"

discovery:
namespace: "{{ .namespace }}"
by_selectors:
label_selector:
hyperfleet.io/resource-id: "{{ .resourceId }}"

# NEW: ownerReference to another resource alias in this adapter config (K8s only).
# Kubernetes GC deletes this resource when the referenced owner is deleted.
owner:
resource: anotherAlias

# NEW: retention — enables retain mode for recreate and governs version cleanup.
# Omit this block for replace mode (delete-before-create, single resource).
retention:
historyLimit: 3 # keep at most N historical versions; oldest pruned first
ttl: 24h # delete versions older than this duration

lifecycle:
# NEW: gate whether the apply step runs (e.g. create-only)
apply:
when:
expression: "!resources.?example.hasValue()"

# NEW: array of JSON merge patches; each entry evaluated independently
patch:
- when:
expression: >
resources.?example.hasValue()
&& dig(resources.example.metadata.annotations, "example.io/status") != status
document:
metadata:
annotations:
example.io/status: "{{ .status }}"

# NEW: strip management labels + report Finalized=True without deleting
orphan:
when:
expression: >
is_deleting
&& resources.?example.hasValue()
&& "hyperfleet.io/adapter" in resources.example.metadata.labels

# Existing gates
recreate:
when:
expression: "resources.?example.hasValue() && ..."

delete:
when:
expression: "is_deleting"
propagationPolicy: "Background"
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

REQUIRED: Add Mermaid diagrams for lifecycle gate flow, multi-version resolution, and sweep controller.

Component documents MUST include Mermaid diagrams per coding guidelines. This document describes three complex state machines entirely in prose:

  1. Lifecycle gate evaluation order (§1.1–§1.4): resource-level when, lifecycle.apply.when, lifecycle.patch, lifecycle.orphan.when, lifecycle.recreate.when, lifecycle.delete.when with precedence rules and conditional paths.
  2. Multi-version discovery and current-version resolution (§7, lines 416–429): selector-based discovery returning all versions by hyperfleet.io/resource-id, resolving current version by generation match or timestamp fallback.
  3. Sweep controller verification flow (§8A, lines 537–541): list K8s resources with label → extract resource-id → verify via API GET → delete on 404.

Add three Mermaid diagrams with labels for conditional branches and references to configuration fields mentioned in the prose.

🧰 Tools
🪛 LanguageTool

[grammar] ~276-~276: Use a hyphen to join words.
Context: ...cadence. The adapter must always surface apply failures via status conditions — t...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 79 - 700, The document describes three complex state machines and
control flows entirely in prose, but component documents must include Mermaid
diagrams per coding guidelines. Add three Mermaid flowchart diagrams to address
this requirement: first, a diagram showing the lifecycle gate evaluation order
and precedence rules for resource-level `when`, `lifecycle.apply.when`,
`lifecycle.patch`, `lifecycle.orphan.when`, `lifecycle.recreate.when`, and
`lifecycle.delete.when` with conditional paths based on gate results; second, a
diagram illustrating multi-version discovery and current-version resolution
logic showing how selector-based discovery returns all versions by
`hyperfleet.io/resource-id` label, then resolves the current version by
generation match or timestamp fallback; and third, a diagram for the sweep
controller verification flow showing the steps of listing K8s resources with the
`hyperfleet.io/adapter` label, extracting `hyperfleet.io/resource-id`, calling
the API GET endpoint to verify the resource exists, and deleting the K8s
resource on 404. Each diagram should include labels for conditional branches and
reference the configuration fields and CEL expressions mentioned in the
surrounding prose.

Source: Coding guidelines


**Convergence events**: a transient apply failure leaves the resource state unchanged and unconverged. Sentinel observes continued divergence and sends the next event at the ~10s cadence. No inner retry is needed.

**Drift-check events**: Sentinel also sends periodic events (default every 30 minutes) to detect and correct drift on resources that were previously converged. This introduces a corner case: if the apply fails during a drift-check event, the resource state is unchanged and still appears converged to Sentinel, which would otherwise not retrigger for another 30 minutes. This gap is closed by the adapter reporting `Available=False` on the API resource whenever an apply fails. `Available=False` signals to Sentinel that the resource is not reconciled, causing it to switch from the drift cadence back to the ~10s convergence cadence. The adapter must always surface apply failures via status conditions — this is what keeps the gap closed regardless of which Sentinel cadence triggered the event.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Grammar: Use a hyphen to connect words.

Line 276: "...apply failures via status conditions — t..." should use a hyphen (-) not an em-dash to join the phrase, per grammar convention. Rewrite as "...apply failures via status conditions–the resource...".

🧰 Tools
🪛 LanguageTool

[grammar] ~276-~276: Use a hyphen to join words.
Context: ...cadence. The adapter must always surface apply failures via status conditions — t...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
at line 276, Replace the em-dash (—) with a hyphen (-) on line 276 where the
text reads "apply failures via status conditions — this is what keeps the gap
closed". Use the hyphen to properly join the phrase according to standard
grammar convention for connecting related clauses.

Comment thread hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md Outdated
Comment on lines +527 to +561
#### Option A — Label-based ownership (sweep controller)

Every managed resource carries `hyperfleet.io/adapter` and `hyperfleet.io/resource-id` labels stamped automatically by the framework (see `§2`). The sweep controller is a separate background job that uses these labels to detect and clean up resources that were orphaned when their corresponding HyperFleet API resource was force-deleted.

**When to use**: any resource whose logical owner lives in the HyperFleet API — Clusters, NodePools, or any future API resource type. The sweep controller is the correct mechanism because the owner has no K8s representation to reference.

##### Why it cannot use the API as its source of truth

After a force-delete, the API has no record of the resource. The sweep controller cannot ask "what resources should exist" — it must instead ask "does this resource that exists on the cluster still have a live API counterpart?"

##### Flow

1. **Discover**: list all K8s resources (and Maestro ManifestWork objects) carrying the `hyperfleet.io/adapter` label, across all namespaces and resource kinds
2. **Verify**: for each discovered resource, extract `hyperfleet.io/resource-id` and call `GET /api/hyperfleet/v1/resources/{id}` on the HyperFleet API
3. **Act**: if the API returns 404 — the API resource was force-deleted — delete the K8s resource; if the API returns the resource, skip it

The generic `/resources` endpoint from [HYPERFLEET-896](https://issues.redhat.com/browse/HYPERFLEET-896) is the key enabler: the sweep controller works across all resource kinds without kind-specific logic.

##### Prerequisites

| Prerequisite | Provided by |
|---|---|
| `hyperfleet.io/adapter` label on all managed resources | Automatic label stamping (`§2`) |
| `hyperfleet.io/resource-id` carrying the API UUID | Automatic label stamping (`§2`) |
| Labels stripped on intentional disownment | `lifecycle.orphan.when` (`§1.4`) — orphaned resources are invisible to the sweep controller |
| Generic `/resources` endpoint | HYPERFLEET-896 — required for the verification call |

##### Operational considerations

- **Interval**: configurable; a few minutes is sufficient — force-deleted resources are not immediately harmful, just wasteful
- **Leader election**: required for multi-replica deployments to avoid concurrent deletions
- **Maestro**: ManifestWork objects carry the labels on the ManifestWork itself; the sweep controller deletes the ManifestWork, not the nested manifests
- **Dry-run mode**: should support a read-only mode that reports what would be deleted without acting

**Limitations**: cleanup is eventual (bounded by the sweep interval), and requires the sweep controller to be deployed and operational.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Label-based ownership verification lacks security controls for label forgery (CWE-639/CWE-548).

§8A sweep controller relies on hyperfleet.io/resource-id labels to identify resources for cleanup after API force-deletes. The design does not specify protection against tampering. A user with patch permissions could forge or modify the label on an unrelated resource, causing the sweep controller to call GET /api/hyperfleet/v1/resources/{id}, receive 404, and delete the target resource.

Specify one of these before implementation:

  • Cryptographic protection: Labels use HMAC or signature validation (e.g., signed UUID derived from resource identity) to prevent forgery
  • RBAC restrictions: Role-based access control prevents modification of hyperfleet.io/* label keys on resources (e.g., ClusterRole denying patch on hyperfleet.io/* labels)
  • Admission webhook: Validating webhook on the management cluster prevents creation/modification of hyperfleet.io/* labels except by authorized service accounts

Document the chosen approach in a new Prerequisites subsection under §8A and update operational considerations if relevant to deployment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 527 - 561, The Option A — Label-based ownership section lacks
security controls to prevent tampering with the hyperfleet.io/resource-id label,
which could allow users with patch permissions to forge labels and trick the
sweep controller into deleting unrelated resources. Choose one of the three
specified security approaches (cryptographic protection via HMAC/signature
validation, RBAC restrictions to prevent modification of hyperfleet.io/* labels,
or validating admission webhook to control label changes), document the chosen
approach in a new subsection within the Prerequisites table in §8A (Option A),
and update the operational considerations if the chosen approach affects
deployment requirements.

Source: Linters/SAST tools

@ciaranRoche ciaranRoche left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is proving pretty hard to review, I would break this up if possible please 🙏

One thing I am not able to see clearly, is an evaluation ordering lifecycle. It would be nice to see some sort of decision tree


### Scope

This design covers six proposals:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Six proposals but lists 8 👀

- `lifecycle.patch` (`§1.3`) — array of conditional JSON merge patches on an already-existing resource
- `lifecycle.orphan.when` (`§1.4`) — conditional resource disownment: strip `hyperfleet.io/*` management labels and report `Finalized=True` without deleting

2. **Automatic label and annotation stamping** (`§2`) — the framework injects standard `hyperfleet.io/*` labels on all adapter-created resources at apply time, making them consistently discoverable for tooling and sweep-based cleanup

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mliptak0 this might impact the work you are looking at for the sweep based cleanup

Comment on lines +57 to +60
1. **Parameter Extraction** — extracts params from event data and env vars
2. **Preconditions** — evaluates CEL or structured conditions; supports API calls with retry/backoff
3. **Resources** — applies, discovers, and deletes resources sequentially
4. **Post-Actions** — CEL-gated HTTP calls or log entries; always runs for error reporting

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should align on these terminologies, i see in some places it is called post-processing, it might be better to have a single source of truth for the sequential pipeline and link to that source of truth in this document

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md (2)

1-91: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

CRITICAL: Missing required Trade-offs section with mandatory subsections.

Component documents REQUIRE a structured Trade-offs section distinct from Alternatives Considered. Current document links to eight proposals but does not consolidate trade-offs into the required format with these subsections:

  • What We Gain — measurable benefits (e.g., "enables sweep-based cleanup," "closes drift-check corner case")
  • What We Lose / What Gets Harder — costs and capability losses (e.g., "no per-resource retry logic," "stamp-time window for label forgery if RBAC not configured")
  • Technical Debt Incurred — debt items with description, impact, remediation plan (e.g., "no framework timestamp annotation" per adapter-periodic-execution.md alternatives)
  • Acceptable Because — why trade-offs justify the design for MVP/timeline/scope

Move existing trade-off discussions from individual proposal documents (e.g., "no per-resource retry" rationale, "no hyperfleet.io/last-updated annotation" rationale) into this consolidated, structured section at the main design document level. This is a blocking requirement for component document acceptance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 1 - 91, The adapter-lifecycle-management-design.md document is
missing a mandatory Trade-offs section that must be distinct from Alternatives
Considered. Add a new Trade-offs section (preferably before the Design section
that lists the eight proposals) with four required subsections: "What We Gain"
(listing measurable benefits like sweep-based cleanup and drift-check corner
case closure), "What We Lose / What Gets Harder" (documenting capability losses
and increased complexity such as no per-resource retry logic and potential label
forgery windows), "Technical Debt Incurred" (listing debt items with
description, impact, and remediation plan, such as the missing framework
timestamp annotation discussed in adapter-periodic-execution.md), and
"Acceptable Because" (explaining why these trade-offs justify the design for MVP
timeline and scope). Consolidate relevant trade-off discussions that currently
exist in the individual proposal documents (adapter-lifecycle-gates.md,
adapter-label-stamping.md, adapter-resilience-model.md,
adapter-periodic-execution.md, and others) into this consolidated structured
format at the main design document level.

Source: Coding guidelines


1-91: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CRITICAL: Missing required Mermaid diagrams for complex state machines.

Component documents MUST include Mermaid diagrams per coding guidelines. This index document (and lifecycle-gates.md) describes three complex state machines entirely in prose and lacks visual representation:

  1. Lifecycle gate evaluation order — precedence of resource-level when, lifecycle.apply.when, lifecycle.patch, lifecycle.orphan.when, lifecycle.recreate.when, lifecycle.delete.when with conditional paths
  2. Multi-version discovery and current-version resolution — how selector-based discovery by hyperfleet.io/resource-id returns all versions, then resolves current version by generation match or timestamp fallback
  3. Sweep controller verification flow — list K8s resources with hyperfleet.io/adapter label → extract resource-id → API GET verification → delete on 404

Add three Mermaid flowchart diagrams with labeled branches and references to config fields from the prose. This is a blocking requirement for component document acceptance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 1 - 91, Add three Mermaid flowchart diagrams to the design document
to visually represent the complex state machines currently described only in
prose. First, create a lifecycle gate evaluation order diagram showing the
precedence and conditional paths for resource-level when, lifecycle.apply.when,
lifecycle.patch, lifecycle.orphan.when, lifecycle.recreate.when, and
lifecycle.delete.when. Second, add a multi-version discovery diagram
illustrating how selector-based discovery by hyperfleet.io/resource-id returns
all versions and resolves the current version by generation match or timestamp
fallback. Third, add a sweep controller verification flow diagram showing the
sequence of listing K8s resources with hyperfleet.io/adapter label, extracting
resource-id, performing API GET verification, and deleting on 404. Ensure all
three diagrams include labeled branches and reference the relevant config fields
from the prose documentation.

Source: Coding guidelines

hyperfleet/components/adapter/framework/adapter-resource-retention.md (1)

1-141: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Component document is missing four required structural sections: "What & Why", "How (with Mermaid diagrams)", "Trade-offs", and "Alternatives Considered".

Per HyperFleet component document guidelines, every component design document must include these four sections. Your document currently lacks all of them. The guidelines explicitly state "DO NOT SKIP ANY" and flag this as a common mistake.

Required additions:

  1. What & Why (after line 6, before current replace/retain discussion): Explicitly state the component's purpose (retain multiple versions of adapted resources as history rather than replace in-place) and the problem it solves (audit trails, failure replay, provisioning pipeline history).

  2. How (with Mermaid diagram): Describe the mechanical flow — resource-level when gate evaluates on every event, multi-result discovery via hyperfleet.io/resource-id selector, current-version resolution logic, retention pruning (historyLimit + ttl), and precedence of orphan.when over delete.when across all discovered versions. Include a Mermaid flowchart visualizing: creation event → debounce gate (resource when) → apply current → discover all versions → prune historical → evaluate delete/orphan gates once across all versions.

  3. Trade-offs (new section): Honest assessment of what is gained and lost. For example:

    • Gain: Audit trail, replay capability, generation history visibility across multiple resource versions.
    • Lose/Harder: Cleanup complexity (retain mode forces manual naming to avoid collisions; replace mode is simpler), discovery overhead (selector-based queries on every event to find all versions), retention enforcement complexity (pruning logic must respect both historyLimit and ttl).
    • Risk: Unbounded resource accumulation if pruning is misconfigured; orphaned resources if adapters fail to update lifecycle.orphan.when appropriately.
    • Acceptable because: Solves real operational need for audit/replay without forking adapter behavior into separate codepaths.
  4. Alternatives Considered (new section): Compare replace-only (simpler, no pruning overhead, no collision handling; loses audit trail and replay), TTL-only (simpler, no historyLimit count logic; less predictable cleanup), and a dedicated history API service (decouples retention from lifecycle; more complex, adds service dependency).

[major_issue]

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hyperfleet/components/adapter/framework/adapter-resource-retention.md` around
lines 1 - 141, The component design document adapter-resource-retention.md is
missing four required structural sections per HyperFleet guidelines. Add a "What
& Why" section after the introductory paragraph (before the "Replace mode vs.
retain mode" section) that explicitly states the component's purpose (retaining
multiple versions of adapted resources as history) and the problem it solves
(audit trails, failure replay, provisioning pipeline history). Add a "How"
section after retention rules that describes the mechanical flow including the
resource-level when gate evaluation, multi-result discovery via
hyperfleet.io/resource-id selector, current-version resolution logic, retention
pruning with historyLimit and ttl, and precedence of orphan.when over
delete.when, including a Mermaid flowchart diagram showing the sequence from
creation event through debounce gate, apply, discover all versions, prune
historical, and evaluate delete/orphan gates. Add a "Trade-offs" section that
honestly assesses gains (audit trail, replay capability, generation history) and
losses (cleanup complexity, discovery overhead, retention enforcement
complexity) along with risks and justification. Add an "Alternatives Considered"
section comparing replace-only mode (simpler but loses audit trail), TTL-only
approach (simpler but less predictable), and a dedicated history API service
(decouples retention but adds complexity). Place these sections in logical order
within the document before the existing example.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hyperfleet/components/adapter/framework/adapter-lifecycle-gates.md`:
- Around line 176-233: The design document in sections 1.4 and 8A relies on
HyperFleet labels as the source of truth for resource management, but does not
specify protection against label forgery attacks that could escalate patch
permissions to delete permissions. Choose one of the three security approaches
(cryptographic protection via HMAC/signed UUID, RBAC restrictions on
hyperfleet.io/* label patches, or admission webhook validation) and document the
selected security model in adapter-ownership-model.md section 8A with clear
operational deployment prerequisites such as RBAC policies or webhook
configurations. Ensure the chosen approach is referenced in both the lifecycle
gates documentation and the ownership model documentation for consistency.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 29-49: Two documentation gaps need to be addressed: (1) In
adapter-lifecycle-gates.md (lines 16-233), add a "CEL Expression Validation and
Sandboxing" section to §1 introduction documenting that the CEL environment is
restricted to ext.Strings(), ext.Lists(), and three custom functions only
(toJson, dig, now), with no HTTP, DNS, gRPC, or filesystem access, expressions
are validated at config load time via TaskConfigValidator.Parse(), and
expressions cannot escape the defined context. (2) In adapter-ownership-model.md
§8A, add a "Prerequisites" subsection that chooses and documents a security
model to prevent label forgery on hyperfleet.io/adapter and
hyperfleet.io/resource-id labels—select one approach (cryptographic
HMAC/signature validation, RBAC via ClusterRole denying patch on hyperfleet.io/*
keys, or validating admission webhook) and explain the deployment constraints
and how it prevents the sweep controller from deleting unrelated resources via
forged labels.

---

Outside diff comments:
In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`:
- Around line 1-91: The adapter-lifecycle-management-design.md document is
missing a mandatory Trade-offs section that must be distinct from Alternatives
Considered. Add a new Trade-offs section (preferably before the Design section
that lists the eight proposals) with four required subsections: "What We Gain"
(listing measurable benefits like sweep-based cleanup and drift-check corner
case closure), "What We Lose / What Gets Harder" (documenting capability losses
and increased complexity such as no per-resource retry logic and potential label
forgery windows), "Technical Debt Incurred" (listing debt items with
description, impact, and remediation plan, such as the missing framework
timestamp annotation discussed in adapter-periodic-execution.md), and
"Acceptable Because" (explaining why these trade-offs justify the design for MVP
timeline and scope). Consolidate relevant trade-off discussions that currently
exist in the individual proposal documents (adapter-lifecycle-gates.md,
adapter-label-stamping.md, adapter-resilience-model.md,
adapter-periodic-execution.md, and others) into this consolidated structured
format at the main design document level.
- Around line 1-91: Add three Mermaid flowchart diagrams to the design document
to visually represent the complex state machines currently described only in
prose. First, create a lifecycle gate evaluation order diagram showing the
precedence and conditional paths for resource-level when, lifecycle.apply.when,
lifecycle.patch, lifecycle.orphan.when, lifecycle.recreate.when, and
lifecycle.delete.when. Second, add a multi-version discovery diagram
illustrating how selector-based discovery by hyperfleet.io/resource-id returns
all versions and resolves the current version by generation match or timestamp
fallback. Third, add a sweep controller verification flow diagram showing the
sequence of listing K8s resources with hyperfleet.io/adapter label, extracting
resource-id, performing API GET verification, and deleting on 404. Ensure all
three diagrams include labeled branches and reference the relevant config fields
from the prose documentation.

In `@hyperfleet/components/adapter/framework/adapter-resource-retention.md`:
- Around line 1-141: The component design document adapter-resource-retention.md
is missing four required structural sections per HyperFleet guidelines. Add a
"What & Why" section after the introductory paragraph (before the "Replace mode
vs. retain mode" section) that explicitly states the component's purpose
(retaining multiple versions of adapted resources as history) and the problem it
solves (audit trails, failure replay, provisioning pipeline history). Add a
"How" section after retention rules that describes the mechanical flow including
the resource-level when gate evaluation, multi-result discovery via
hyperfleet.io/resource-id selector, current-version resolution logic, retention
pruning with historyLimit and ttl, and precedence of orphan.when over
delete.when, including a Mermaid flowchart diagram showing the sequence from
creation event through debounce gate, apply, discover all versions, prune
historical, and evaluate delete/orphan gates. Add a "Trade-offs" section that
honestly assesses gains (audit trail, replay capability, generation history) and
losses (cleanup complexity, discovery overhead, retention enforcement
complexity) along with risks and justification. Add an "Alternatives Considered"
section comparing replace-only mode (simpler but loses audit trail), TTL-only
approach (simpler but less predictable), and a dedicated history API service
(decouples retention but adds complexity). Place these sections in logical order
within the document before the existing example.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7906e78b-9fd1-435d-8f14-f36c7e9aade3

📥 Commits

Reviewing files that changed from the base of the PR and between fee71b7 and ed657d6.

📒 Files selected for processing (9)
  • hyperfleet/components/adapter/framework/adapter-deletion-policy.md
  • hyperfleet/components/adapter/framework/adapter-label-stamping.md
  • hyperfleet/components/adapter/framework/adapter-lifecycle-gates.md
  • hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md
  • hyperfleet/components/adapter/framework/adapter-ownership-model.md
  • hyperfleet/components/adapter/framework/adapter-periodic-execution.md
  • hyperfleet/components/adapter/framework/adapter-resilience-model.md
  • hyperfleet/components/adapter/framework/adapter-resource-retention.md
  • hyperfleet/components/adapter/framework/adapter-stuck-detection.md
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)
✅ Files skipped from review due to trivial changes (4)
  • hyperfleet/components/adapter/framework/adapter-periodic-execution.md
  • hyperfleet/components/adapter/framework/adapter-ownership-model.md
  • hyperfleet/components/adapter/framework/adapter-resilience-model.md
  • hyperfleet/components/adapter/framework/adapter-deletion-policy.md

Comment on lines +176 to +233
## 1.4 Detach resources (lifecycle.orphan.when)

An adapter can already release a resource without deleting it: skip the delete step in `lifecycle.delete.when` and report `Finalized=True` in the post-action status conditions. The platform sees `Finalized=True` and stops sending events for that resource.

But the existing resource is left with HyperFleet specific labels, which can cause problems with cleaning jobs identifying resources as orphans to be deleted.

`lifecycle.orphan.when` makes this a first-class lifecycle operation. The expression is typically gated on the event being a deletion and the resource still carrying its management labels — confirming the resource is still managed and this event is the handoff point.

```yaml
resources:
- name: "provisioningJob"
transport:
client: "kubernetes"
discovery:
by_name: "{{ .clusterId }}-job"
lifecycle:
orphan:
when:
expression: >
is_deleting
&& resources.?provisioningJob.hasValue()
&& "hyperfleet.io/adapter" in resources.provisioningJob.metadata.labels
```

When `lifecycle.orphan.when` evaluates to true, the executor:

1. Skips the delete step entirely
2. Issues a label-strip patch via the transport client, removing `hyperfleet.io/adapter` and `hyperfleet.io/resource-id`

The label presence check in the expression (`"hyperfleet.io/adapter" in resources.provisioningJob.metadata.labels`) ensures the gate is idempotent: if the event is redelivered after labels have already been stripped, the expression evaluates to false and neither delete nor orphan fires.

**Precedence**: `lifecycle.orphan.when` is evaluated before `lifecycle.delete.when`. If orphan fires, delete is skipped.

**Implementation**: Add `orphan.when` CEL field to `LifecycleConfig`. Add a `PatchResourceLabels(ctx, gvk, namespace, name, remove []string, target)` method to the `TransportClient` interface for the label-strip operation.

**Maestro compatibility**: For Maestro ManifestWork resources, the label-strip is applied to the ManifestWork object itself, not to the nested manifests. The sweep controller must account for this when reconciling Maestro-managed resources.

### 1.4.1 Alternatives

#### API-Level Disown Operation

**What**: Instead of adapter-side label stripping, the adapter calls a `POST /api/hyperfleet/v1/resources/{id}/disown` endpoint. The API marks the resource as unmanaged, stops sending events, and the adapter's role ends there. Label cleanup is the API's responsibility.

**Why Rejected**: Adds an API round-trip at the critical path of a delete event, and introduces a new API endpoint with its own lifecycle and versioning. The adapter already has a transport client capable of issuing a label-strip patch; routing through the HyperFleet API adds coupling without material benefit for the documented use case. This alternative becomes attractive if the platform needs a centralized audit trail of disownment events — at that point the API layer is the right place to record the handoff.

#### Alternative Names to orphan

The term "orphan" carries two conflicting meanings in this domain — Kubernetes uses it for the GC `propagationPolicy: Orphan` (leave children in place when parent is deleted), while this design uses it to mean "release a resource from HyperFleet management." This ambiguity is a readability risk for adapter authors familiar with K8s.

| Name | Rationale | Concern |
|---|---|---|
| `lifecycle.release.when` | "Release from management" reads naturally in context | `release` is also used for software releases; ambiguous in a CI/CD platform context |
| `lifecycle.disown.when` | Precise: strip ownership, leave in place | Less common English; may read as aggressive/negative |
| `lifecycle.handoff.when` | Conveys transfer of responsibility to another owner | Implies a receiver exists; disownment is unconditional |
| `lifecycle.finalize.when` | Mirrors the `Finalized=True` condition it produces | `Finalized` in K8s means object is being deleted (finalizer cleared) — opposite meaning |

**Recommendation to revisit**: `lifecycle.release.when` is the least ambiguous in context and avoids the K8s GC collision. If the project glossary settles on "release" for this concept, rename accordingly. Until then, `orphan.when` is retained as the design name with an explicit note in the glossary entry distinguishing it from `propagationPolicy: Orphan`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find ownership model and security documentation
find . -name "*ownership*" -o -name "*security*" | grep -E "\.(md|yaml)$"
cat hyperfleet/components/adapter/framework/adapter-ownership-model.md 2>/dev/null | head -200

Repository: openshift-hyperfleet/architecture

Length of output: 11270


🏁 Script executed:

#!/bin/bash
# Search for sweep controller and label validation
rg "sweep|hyperfleet\.io/(adapter|resource-id)" -A 3 -B 3 --max-count 20
rg "GET.*resources" -A 3 -B 3 --max-count 10

Repository: openshift-hyperfleet/hyperfleet-sentinel

Length of output: 66


🏁 Script executed:

#!/bin/bash
# Search for label usage and protection mechanisms
rg "hyperfleet\.io/(adapter|resource-id)" -A 5 -B 2 --max-count 20
rg "metadata\.labels|PatchResourceLabels" -A 3 -B 2 --max-count 15

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 13789


🏁 Script executed:

#!/bin/bash
# Check security standards document
cat hyperfleet/standards/code-review/security.md

Repository: openshift-hyperfleet/architecture

Length of output: 5897


🏁 Script executed on selected repositories:

#!/bin/bash
# Search for RBAC, admission webhook, or label validation
rg "ClusterRole|admissionWebhook|ValidatingWebhook|label.*valid|patch.*hyperfleet" -A 3 -B 2 --max-count 15

Repositories: openshift-hyperfleet/architecture, openshift-hyperfleet/hyperfleet-adapter, openshift-hyperfleet/hyperfleet-sentinel

Length of output: 14940


🏁 Script executed:

#!/bin/bash
# Search for admission webhook or label protection in API
rg "hyperfleet\.io/(adapter|resource-id)|ValidatingWebhook|admission" -A 3 -B 2 --max-count 15

Repository: openshift-hyperfleet/hyperfleet-api

Length of output: 61


🏁 Script executed:

#!/bin/bash
# Search for sweep controller implementation
rg "sweep|resource-id|hyperfleet\.io" -A 5 -B 2 --max-count 20
find . -name "*sweep*" -type f

Repository: openshift-hyperfleet/hyperfleet-sentinel

Length of output: 66


🏁 Script executed:

#!/bin/bash
# Check if sweep controller or label validation exists in API
rg "sweep|orphan|resource-id" -A 3 -B 2 --max-count 15

Repository: openshift-hyperfleet/hyperfleet-api

Length of output: 61


🏁 Script executed:

#!/bin/bash
# Check if this design doc is new in the PR or being modified
git log --oneline --all -- hyperfleet/components/adapter/framework/adapter-lifecycle-gates.md hyperfleet/components/adapter/framework/adapter-ownership-model.md 2>/dev/null | head -5 || echo "Files may be new to this PR"

Repository: openshift-hyperfleet/architecture

Length of output: 153


🏁 Script executed:

#!/bin/bash
# Verify the exact content of adapter-lifecycle-gates.md section 1.4
sed -n '176,233p' hyperfleet/components/adapter/framework/adapter-lifecycle-gates.md

Repository: openshift-hyperfleet/architecture

Length of output: 4516


CWE-639 / CWE-548: Label forgery enables privilege escalation — design must specify security model before sweep controller implementation.

§1.4 and §8A document the sweep controller using hyperfleet.io/adapter and hyperfleet.io/resource-id labels as the source of truth. A user with patch permissions on K8s resources can forge the resource-id label to an arbitrary UUID. The sweep controller extracts this label and calls GET /api/hyperfleet/v1/resources/{forged-id}, receives 404, and deletes the target K8s resource — escalating patch to delete.

The design document does not specify any protection against label tampering. Address one of the three approaches before implementing the sweep controller:

  1. Cryptographic protection: Labels use HMAC or signed UUID derived from resource identity, making forgery cryptographically infeasible.
  2. RBAC restrictions: Deny patch on hyperfleet.io/* label keys except by authorized service accounts (e.g., adapter framework, sweep controller).
  3. Admission webhook: ValidatingWebhook on the management cluster prevents creation/modification of hyperfleet.io/* labels except by authorized actors.

Document the chosen security model in adapter-ownership-model.md §8A (or cross-reference) and specify operational deployment prerequisites (e.g., RBAC policy, webhook configuration).

🧰 Tools
🪛 LanguageTool

[grammar] ~180-~180: Use a hyphen to join words.
Context: ...xisting resource is left with HyperFleet specific labels, which can cause problem...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hyperfleet/components/adapter/framework/adapter-lifecycle-gates.md` around
lines 176 - 233, The design document in sections 1.4 and 8A relies on HyperFleet
labels as the source of truth for resource management, but does not specify
protection against label forgery attacks that could escalate patch permissions
to delete permissions. Choose one of the three security approaches
(cryptographic protection via HMAC/signed UUID, RBAC restrictions on
hyperfleet.io/* label patches, or admission webhook validation) and document the
selected security model in adapter-ownership-model.md section 8A with clear
operational deployment prerequisites such as RBAC policies or webhook
configurations. Ensure the chosen approach is referenced in both the lifecycle
gates documentation and the ownership model documentation for consistency.

Source: Linters/SAST tools

Comment on lines +29 to +49
This design covers eight proposals:

1. **Lifecycle gates** (`§1`) — [adapter-lifecycle-gates.md](./adapter-lifecycle-gates.md): four new per-resource CEL-gated operations following the existing `lifecycle.recreate.when` / `lifecycle.delete.when` pattern:
- resource-level `when` (`§1.1`) — evaluated before any lifecycle gate; if false, all modification actions (apply, patch, recreate) are skipped for that resource in the current event; primary use case is debouncing using CEL timestamp arithmetic
- `lifecycle.apply.when` (`§1.2`) — gate whether the apply step runs (create-only use case)
- `lifecycle.patch` (`§1.3`) — array of conditional JSON merge patches on an already-existing resource
- `lifecycle.detach.when` (`§1.4`) — conditional resource disownment: strip `hyperfleet.io/*` management labels and report `Finalized=True` without deleting

2. **Automatic label and annotation stamping** (`§2`) — [adapter-label-stamping.md](./adapter-label-stamping.md): the framework injects standard `hyperfleet.io/*` labels on all adapter-created resources at apply time, making them consistently discoverable for tooling and sweep-based cleanup

3. **Resilience model** (`§3`) — [adapter-resilience-model.md](./adapter-resilience-model.md): documents why per-resource retry is not implemented in the adapter and how the Sentinel reconciliation cycle (convergence at ~10s, drift-check at 30 minutes) covers both failure cases, with `Available=False` closing the drift-check corner case

4. **Stuck detection** (`§4`) — [adapter-stuck-detection.md](./adapter-stuck-detection.md): unified reconciliation metrics at the API layer via [HYPERFLEET-1205](https://redhat.atlassian.net/browse/HYPERFLEET-1205), covering both deletion and create/update flows via `Reconciled=False` condition tracking

5. **Periodic execution** (`§5`) — [adapter-periodic-execution.md](./adapter-periodic-execution.md): explains how the Sentinel drift-check cycle provides implicit periodic execution at per-resource-type granularity, and how lifecycle CEL expressions and API condition timestamps enable finer-grained control within that cadence

6. **API-level deletion policy for owned resources** (`§6`) — [adapter-deletion-policy.md](./adapter-deletion-policy.md): documents the `OnParentDelete` descriptor-driven policy (`restrict` / `cascade`) from the Generic Resource Registry (HYPERFLEET-896), its interaction with adapter delete events, and how resource references (non-ownership weak links) cover use cases where resources need to survive their logical counterpart's deletion

7. **Resource retention** (`§7`) — [adapter-resource-retention.md](./adapter-resource-retention.md): per-resource `retention:` configuration that governs how many historical versions of a resource accumulate and for how long; enables recreate-with-history semantics (versus the default delete-before-create replace mode), multi-result label-based discovery, and debounce protection against rapid version creation

8. **Ownership model** (`§8`) — [adapter-ownership-model.md](./adapter-ownership-model.md): two complementary mechanisms: label-based ownership with a sweep controller that detects and removes resources orphaned by API force-deletes (`§8A`), and K8s-native `ownerReferences` between resources managed by the same adapter for cascade deletion via Kubernetes GC (`§8B`)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for TaskConfigValidator and CEL validation
echo "=== TaskConfigValidator and CEL validation ==="
rg "TaskConfigValidator|CEL.*validation|Parse\(" --type go -A 3 | head -100

echo -e "\n=== Sweep controller implementation ==="
rg "sweep|orphan.*label" --type go -i | head -50

echo -e "\n=== Label validation/security ==="
rg "hyperfleet\.io/adapter|hyperfleet\.io/resource-id" --type go -B 2 -A 2 | head -80

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 6547


🏁 Script executed:

#!/bin/bash
# Search for RBAC or webhook controls on hyperfleet.io labels
echo "=== RBAC or webhook controls on hyperfleet.io labels ==="
rg "hyperfleet\.io|ValidatingWebhook|ClusterRole.*patch" --type go -i | head -100

echo -e "\n=== Label security patterns ==="
rg "label.*validation|forbidden.*label|HMAC|signature" --type go -i | head -50

Repository: openshift-hyperfleet/hyperfleet-api

Length of output: 1231


🏁 Script executed on selected repositories:

#!/bin/bash
# Cross-repo search for CEL environment/sandboxing configuration
echo "=== CEL sandboxing setup ==="
rg "ext\\.Strings|CEL.*env|EnvOption.*Strings" --type go | head -100

echo -e "\n=== Custom CEL functions ==="
rg "toJson|dig.*function|now.*CEL" --type go | head -80

Repositories: openshift-hyperfleet/hyperfleet-adapter, openshift-hyperfleet/hyperfleet-api, openshift-hyperfleet/hyperfleet-sentinel, openshift-hyperfleet/hyperfleet-broker

Length of output: 3695


🏁 Script executed:

#!/bin/bash
# Search for sweep controller more specifically
echo "=== Sweep controller files ==="
fd -type f -name "*sweep*" | head -20

echo -e "\n=== Label-stripping or orphan logic ==="
rg "Finalized|orphan|PatchResourceLabels|remove.*label" --type go -A 2 | head -100

echo -e "\n=== CEL environment setup details ==="
cat -n internal/criteria/cel_evaluator.go | head -80

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 4071


🏁 Script executed:

#!/bin/bash
# Check label validation function details
echo "=== Label validation in API ==="
rg "validateLabelKey" --type go -B 2 -A 10 | head -100

echo -e "\n=== Webhook or RBAC controls ==="
fd -type f -name "*webhook*" -o -name "*rbac*" | head -20

echo -e "\n=== Resource ownership/security controls ==="
rg "hyperfleet\.io/adapter.*validation|forbidden|permission.*label" --type go -i | head -50

Repository: openshift-hyperfleet/hyperfleet-api

Length of output: 3563


🏁 Script executed:

#!/bin/bash
# Check if security controls are documented in the design doc itself
echo "=== Design doc content - check for security sections ==="
cat -n hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md | grep -i "security\|threat\|validation\|RBAC\|webhook" | head -20

echo -e "\n=== Check adapter-ownership-model.md for security ==="
cat -n hyperfleet/components/adapter/framework/adapter-ownership-model.md 2>/dev/null | grep -i "security\|threat\|validation\|label.*protect" | head -20

Repository: openshift-hyperfleet/architecture

Length of output: 189


🏁 Script executed:

#!/bin/bash
# Find the customCELFunctions implementation
echo "=== Custom CEL functions implementation ==="
rg "customCELFunctions" --context 50 | head -150

echo -e "\n=== All functions available in CEL environment ==="
rg "cel\.Function|ext\." | grep -v test | head -80

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 15085


🏁 Script executed:

#!/bin/bash
# Look for the dig function implementation
echo "=== dig function ==="
rg "func.*dig|\"dig\"" -A 5 | head -100

# Check if there's any sweep or cleanup logic elsewhere
echo -e "\n=== Files mentioning resource cleanup, orphan, or management ==="
find . -name "*.go" ! -path "*/test/*" ! -path "*/vendor/*" | xargs grep -l "orphan\|cleanup\|sweep\|management.*label" | head -20

Repository: openshift-hyperfleet/hyperfleet-adapter

Length of output: 1151


🏁 Script executed:

#!/bin/bash
# Check the full adapter-ownership-model.md to see if it documents security
echo "=== adapter-ownership-model.md full scan ==="
cat hyperfleet/components/adapter/framework/adapter-ownership-model.md 2>/dev/null | head -300

echo -e "\n=== Check adapter-lifecycle-gates.md for security sections ==="
rg "CEL.*validation|sandboxing|threat|security" hyperfleet/components/adapter/framework/ -i | head -50

Repository: openshift-hyperfleet/architecture

Length of output: 11935


CEL sandboxing and label forgery protection require design documentation before implementation.

1. CEL sandboxing documentation is missing — The design leverages CEL expressions in lifecycle.when, lifecycle.apply.when, lifecycle.patch.when, lifecycle.orphan.when, and lifecycle.recreate.when. The implementation restricts the environment to ext.Strings(), ext.Lists(), and three custom functions (toJson, dig, now) — no HTTP, DNS, gRPC, or filesystem functions. This is correctly sandboxed. The design document must document this:

  • adapter-lifecycle-gates.md §1 introduction: Add "CEL Expression Validation and Sandboxing" section stating:
    • Environment restricted to ext.Strings() plus three custom functions only
    • No HTTP, DNS, gRPC, or filesystem functions available
    • Expressions validated at config load time via TaskConfigValidator.Parse()
    • Expressions cannot escape the defined context

2. Label forgery protection is unaddressed — The sweep controller uses hyperfleet.io/adapter and hyperfleet.io/resource-id labels to identify orphaned resources. Users with patch permissions can forge these labels, causing the sweep controller to delete unrelated resources (CWE-639 authorization escalation, CWE-548 label tampering).

Choose a security model and document in adapter-ownership-model.md §8A:

  1. Cryptographic: HMAC/signature validation on label writes
  2. RBAC: ClusterRole denying patch on hyperfleet.io/* label keys (recommended for K8s-native platforms)
  3. Webhook: Validating admission webhook controlling label mutations

Add "Prerequisites" subsection documenting the chosen approach and any deployment constraints.

📍 Affects 2 files
  • hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md#L29-L49 (this comment)
  • hyperfleet/components/adapter/framework/adapter-lifecycle-gates.md#L16-L233
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@hyperfleet/components/adapter/framework/adapter-lifecycle-management-design.md`
around lines 29 - 49, Two documentation gaps need to be addressed: (1) In
adapter-lifecycle-gates.md (lines 16-233), add a "CEL Expression Validation and
Sandboxing" section to §1 introduction documenting that the CEL environment is
restricted to ext.Strings(), ext.Lists(), and three custom functions only
(toJson, dig, now), with no HTTP, DNS, gRPC, or filesystem access, expressions
are validated at config load time via TaskConfigValidator.Parse(), and
expressions cannot escape the defined context. (2) In adapter-ownership-model.md
§8A, add a "Prerequisites" subsection that chooses and documents a security
model to prevent label forgery on hyperfleet.io/adapter and
hyperfleet.io/resource-id labels—select one approach (cryptographic
HMAC/signature validation, RBAC via ClusterRole denying patch on hyperfleet.io/*
keys, or validating admission webhook) and explain the deployment constraints
and how it prevents the sweep controller from deleting unrelated resources via
forged labels.

Sources: Coding guidelines, Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants