HYPERFLEET-1112 - docs: add GCP developer cluster lifecycle policy by rafabene · Pull Request #160 · openshift-hyperfleet/architecture

rafabene · 2026-06-15T16:21:52Z

Summary

Add gcp-developer-cluster-lifecycle.md defining lifecycle rules for developer GKE clusters in hcm-hyperfleet: naming conventions, required labels (owner, environment, ttl), 5-day TTL with renewal via terraform apply, nightly shutdown, missing owner enforcement, event cluster rules, and orphaned resource cleanup procedures
Add cross-reference note to prow-cicd-cluster.md clarifying the Prow cluster is excluded from lifecycle enforcement
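
The label requirements above can be illustrated with a small check. This is a minimal sketch, not the enforcement code from the PR: the function name `missing_labels` is hypothetical, and the rule that `ttl` is required only when `environment` is `dev` is taken from the review walkthrough rather than from the policy document itself.

```python
# Labels every developer cluster is expected to carry (per the PR summary).
REQUIRED_LABELS = ("owner", "environment")


def missing_labels(labels: dict[str, str]) -> list[str]:
    """Return the required label keys that are absent or empty.

    Hypothetical helper: the real policy check may differ.
    """
    required = list(REQUIRED_LABELS)
    # Assumption: ttl is mandatory only for environment=dev clusters.
    if labels.get("environment") == "dev":
        required.append("ttl")
    return [key for key in required if not labels.get(key)]
```

A cluster labelled only with `owner` and `environment: dev` would be reported as missing `ttl` under these assumptions.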

Context

The HYPERFLEET-1112 audit of the hcm-hyperfleet GCP project found ~$1,142/mo in wasted resources from clusters without lifecycle controls. This document establishes the policy. Automated enforcement is tracked in HYPERFLEET-1229.

Test plan

Verify Markdown renders correctly
Verify cross-reference link in prow-cicd-cluster.md resolves to the new doc
Verify HYPERFLEET-1229 callout note renders correctly in the "Automated Daily Enforcement" section
Verify all fenced code blocks have language identifiers (MD040)

coderabbitai · 2026-06-15T16:22:09Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

A new documentation file hyperfleet/docs/gcp-developer-cluster-lifecycle.md defines the GCP Developer Cluster Lifecycle Policy for developer GKE clusters in hcm-hyperfleet. The policy specifies scope exclusions for CI/CD and Prow-managed ephemeral clusters, mandatory naming conventions and labels (ttl required for environment: dev), Terraform provisioning requirements, default 5-day TTL with renewal via re-applying Terraform, scale-to-zero shutdown before deletion, event/hackathon cluster handling, standalone VM restrictions outside GKE, intended Cloud Scheduler automation for nightly shutdown and enforcement (explicitly marked as not yet implemented with tracking ticket; enforcement remains manual until rollout), orphaned resource cleanup procedures with monthly audit checklist and Prow E2E guidance, and audit history dated 2026-06-15. The existing hyperfleet/docs/prow-cicd-cluster.md receives a clarifying note that the Prow CI/CD cluster is excluded from this policy.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 11

✅ Passed checks (11 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the primary change: adding documentation for the GCP developer cluster lifecycle policy.
Description check	✅ Passed	The description is directly related to the changeset, detailing what was added and providing relevant context about the underlying audit and linked tracking tickets.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output	✅ Passed	PR modifies only documentation files (.md). SEC-02 check targets Go log statements (slog, log, logr, zap, fmt.Print*); no Go code present in this PR.
No Hardcoded Secrets	✅ Passed	Documentation-only change with no hardcoded secrets, API keys, tokens, passwords, base64 strings, or embedded credentials detected.
No Weak Cryptography	✅ Passed	PR contains only Markdown documentation files with no cryptographic code, weak primitives (MD5, DES, RC4, SHA1 for security), ECB mode, custom crypto implementations, or non-constant-time secret co...
No Injection Vectors	✅ Passed	Both files are documentation (Markdown) containing only policy descriptions and example commands with no SQL, exec.Command, template.HTML, or yaml.Unmarshal patterns.
No Privileged Containers	✅ Passed	PR contains only documentation files (.md); no Kubernetes manifests, Helm templates, or Dockerfiles present. Check not applicable.
No Pii Or Sensitive Data In Logs	✅ Passed	PR contains only Markdown documentation files with no logging code. No logging statements present to expose PII or sensitive data.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hyperfleet/docs/gcp-developer-cluster-lifecycle.md`:
- Around line 132-141: Add a reference to ticket HYPERFLEET-1229 at the
beginning of the "Future Automation (Proposed)" section per coding guideline
HYG-02. Expand the section to document the enforcement specifications missing
from the current table: clarify how the Cloud Function validates the owner label
(e.g., against Kerberos/LDAP or just existence check), specify the service
account identity and required IAM permissions for the enforcement job, document
audit logging and dry-run requirements before cluster shutdown or deletion, and
detail safeguards to prevent label spoofing or privilege escalation by the Cloud
Function. These details should be added as separate subsections or a detailed
specification paragraph following the component table.
- Around line 128-131: Expand the "Missing Owner Label Enforcement" section
(lines 128-131) to specify the validation criteria for a valid owner label. Add
documentation that defines: (1) the exact format or criteria an owner label must
meet to be considered valid (for example, whether it must match a Kerberos
principal format or be cross-checked against a directory service like LDAP), (2)
how the enforcement job authenticates its own identity to prevent unauthorized
bypass or tampering, and (3) any exemption mechanisms, manual override
procedures, or escalation paths available to cluster operators. This
clarification is necessary to establish a secure enforcement boundary before
HYPERFLEET-1229 implementation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 627ad06a-352f-4ec3-a004-279e23df336f

📥 Commits

Reviewing files that changed from the base of the PR and between 0bb0fc2 and 1882eda.

📒 Files selected for processing (2)

hyperfleet/docs/gcp-developer-cluster-lifecycle.md
hyperfleet/docs/prow-cicd-cluster.md

🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

openshift-hyperfleet/architecture (manual)
openshift-hyperfleet/hyperfleet-api (manual)
openshift-hyperfleet/hyperfleet-sentinel (manual)
openshift-hyperfleet/hyperfleet-adapter (manual)
openshift-hyperfleet/hyperfleet-broker (manual)

openshift-ci · 2026-06-15T16:37:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign crizzo71 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ma-hill · 2026-06-15T18:14:16Z

+### Monthly Audit
+
+A monthly audit should be performed to identify and clean up orphaned resources:
+


https://console.cloud.google.com/iam-admin/asset-inventory/dashboard?project=hcm-hyperfleet

I'm not sure whether we should be looking at this, but the asset-inventory list can be helpful for seeing which resources are left over. I'm sure there are different filtering techniques that could be helpful.

Good call! Added a reference to the GCP Asset Inventory dashboard in the Monthly Audit section — it gives a consolidated view of all project resources which is great for spotting leftovers.

ma-hill · 2026-06-15T18:20:30Z

+A monthly audit should be performed to identify and clean up orphaned resources:
+
+| Resource Type | How to Check | Cleanup |
+|---|---|---|


https://github.com/cloud-custodian/cloud-custodian
Could be useful for cleaning up resources. I'm sure there are other solutions as well.

Nice find! Added a mention of Cloud Custodian as a viable option for automated policy-as-code resource cleanup, alongside the daily enforcement automation we already have planned.

kuudori · 2026-06-15T20:18:50Z

+
+### Nightly Shutdown
+
+A Cloud Scheduler job runs daily at 19:00 UTC and scales down all developer cluster node pools to zero nodes. Developers must manually scale back up in the morning. This avoids paying for idle compute overnight and on weekends.


Interesting. In NA/SA, that will be around 11 AM-2 PM. Should we expect a scale-down during work hours?

What if the scheduled job runs every hour and shuts down node pools that have been running for more than 12 hours since creation?

This makes the ttl tag unnecessary, unless we want a mechanism to bypass the shutdown, in which case a TTL would keep the node pool alive until it is reached.

Good catch! Changed the shutdown time from 19:00 UTC to 01:00 UTC, which is after work hours across all team timezones (6 PM PDT / 10 PM BRT / 3 AM CEST). Also added a note in the doc explaining the timezone rationale.
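
The timezone concern raised above can be checked with a few lines of Python. This is only a sketch: it uses fixed summer-time offsets (PDT = UTC-7, BRT = UTC-3, CEST = UTC+2) instead of a timezone database, and the helper name `local_times` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer-time UTC offsets in hours (assumed; real code would use
# zoneinfo so that daylight-saving changes are handled automatically).
OFFSETS = {"PDT": -7, "BRT": -3, "CEST": 2}


def local_times(utc_hour: int) -> dict[str, str]:
    """Map a UTC hour on an arbitrary June date to local wall-clock times."""
    moment = datetime(2026, 6, 16, utc_hour, 0, tzinfo=timezone.utc)
    return {
        name: moment.astimezone(timezone(timedelta(hours=offset))).strftime("%H:%M")
        for name, offset in OFFSETS.items()
    }
```

Calling `local_times(19)` for the original schedule gives 12:00 PDT and 16:00 BRT, which is the working-hours overlap the reviewer pointed out.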

Interesting idea! I think they serve different purposes though:

Nightly shutdown is about saving compute costs during off-hours. A fixed daily schedule is simpler to implement (single Cloud Scheduler cron) and predictable for developers — they know exactly when it happens and can plan around it.

TTL is about cluster lifecycle — when to delete the cluster entirely, not just scale it down. Even with a 12h-based shutdown, we'd still need TTL to prevent clusters from accumulating indefinitely.

The 12h-since-creation approach adds complexity (tracking node pool scale-up timestamps, hourly execution cost) without eliminating TTL. The fixed nightly shutdown + TTL combo keeps both concerns separate and simple.

That said, if the team prefers the hourly approach, we can revisit. What do you think?

Thinking more about it, you're right — creation-time based is simpler and eliminates the timezone issue entirely. Updated the doc:

Replaced the fixed 01:00 UTC nightly shutdown with an hourly job that scales down node pools whose nodes have been running for more than 12 hours (based on node creationTimestamp)

Every developer gets the same 12h uptime window regardless of timezone

Updated the Future Automation table accordingly
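
The uptime rule described above could be sketched as follows. This is a hedged illustration, not the planned Cloud Function: the function name and input shape are hypothetical, timestamps are assumed to be RFC 3339 strings like a Kubernetes `creationTimestamp`, and a pool is assumed to qualify once its oldest node has been up for more than 12 hours.

```python
from datetime import datetime, timedelta, timezone

MAX_UPTIME = timedelta(hours=12)


def pools_to_scale_down(nodes: list[tuple[str, str]], now: datetime) -> list[str]:
    """Return node pool names whose oldest node exceeds MAX_UPTIME.

    nodes: (pool_name, creation_timestamp) pairs; the shape is hypothetical.
    """
    oldest: dict[str, datetime] = {}
    for pool, created in nodes:
        # Normalise the trailing "Z" so older Python versions can parse it.
        ts = datetime.fromisoformat(created.replace("Z", "+00:00"))
        if pool not in oldest or ts < oldest[pool]:
            oldest[pool] = ts
    # Strictly greater than 12 hours, matching "more than 12 hours".
    return sorted(pool for pool, ts in oldest.items() if now - ts > MAX_UPTIME)
```

Because the decision depends only on elapsed time since node creation, the same 12-hour window applies regardless of the developer's timezone.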

kuudori · 2026-06-15T20:30:35Z

+| Component | Purpose |
+|---|---|
+| Cloud Scheduler | Cron trigger at 19:00 UTC daily |
+| Cloud Function | Iterates clusters, checks labels and TTL, resizes node pools to 0 or deletes |
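
The per-cluster decision in the Cloud Function row could look roughly like the sketch below. Everything here is an assumption made for illustration: the function name, the `ttl` label holding an ISO expiry date (`YYYY-MM-DD`), and the mapping of a missing owner to scale-to-zero and an expired TTL to deletion. The actual behaviour is tracked in HYPERFLEET-1229.

```python
from datetime import date


def enforcement_action(labels: dict[str, str], today: date) -> str:
    """Return 'delete', 'scale_to_zero', or 'none' for one cluster.

    Hypothetical decision logic, not the implementation from the PR.
    """
    # Assumption: a cluster without an owner is scaled down, not deleted.
    if not labels.get("owner"):
        return "scale_to_zero"
    ttl = labels.get("ttl")
    if ttl is None:
        return "none"
    # Assumption: the ttl label stores the expiry date as YYYY-MM-DD.
    if date.fromisoformat(ttl) < today:
        return "delete"
    return "none"
```

Keeping the decision as a pure function of labels and the current date would also make a dry-run mode straightforward, since the caller can log the returned action without acting on it.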


In case of deletion, should we propose adding a Slack notification about the imminent deletion?
In the current logic, only terraform apply updates ttl, so it might be unexpected to see your resources deleted even if you scaled your cluster back up.

IMO connection to Slack adds too much complexity here.
We already work with spot instances, so the dev node pools can disappear at any given moment.

I would say that if you need to persist some resources... better write a script to create them

Indeed. Initially I considered a Slack notification, but I removed it before asking for a review. Since we already work with spot instances, devs are used to ephemeral node pools. Adding Slack integration would be unnecessary complexity for this use case.

coderabbitai Bot reviewed Jun 15, 2026

Comment thread hyperfleet/docs/gcp-developer-cluster-lifecycle.md

Comment thread hyperfleet/docs/gcp-developer-cluster-lifecycle.md

rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch from 1882eda to f17f85a Compare June 15, 2026 16:37

openshift-ci Bot requested review from crizzo71 and jsell-rh June 15, 2026 16:46

ma-hill reviewed Jun 15, 2026

rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch from f17f85a to bb453af Compare June 15, 2026 20:08

kuudori reviewed Jun 15, 2026

rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch 4 times, most recently from ccd05a5 to a3123f8 Compare June 16, 2026 14:26

HYPERFLEET-1112 - docs: add GCP developer cluster lifecycle policy

8a26107

rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch from a3123f8 to 8a26107 Compare June 16, 2026 14:34

4 participants
