HYPERFLEET-1112 - docs: add GCP developer cluster lifecycle policy #160

Open
rafabene wants to merge 1 commit into openshift-hyperfleet:main from rafabene:HYPERFLEET-1112-gcp-developer-cluster-lifecycle

@rafabene rafabene commented Jun 15, 2026

Contributor

Summary

  • Add gcp-developer-cluster-lifecycle.md defining lifecycle rules for developer GKE clusters in hcm-hyperfleet: naming conventions, required labels (owner, environment, ttl), 5-day TTL with renewal via terraform apply, nightly shutdown, missing owner enforcement, event cluster rules, and orphaned resource cleanup procedures
  • Add cross-reference note to prow-cicd-cluster.md clarifying the Prow cluster is excluded from lifecycle enforcement

Context

The HYPERFLEET-1112 audit of the hcm-hyperfleet GCP project found ~$1,142/mo in wasted resources from clusters without lifecycle controls. This document establishes the policy. Automated enforcement is tracked in HYPERFLEET-1229.

Test plan

  • Verify Markdown renders correctly
  • Verify cross-reference link in prow-cicd-cluster.md resolves to the new doc
  • Verify HYPERFLEET-1229 callout note renders correctly in the "Automated Daily Enforcement" section
  • Verify all fenced code blocks have language identifiers (MD040)

coderabbitai Bot commented Jun 15, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

A new documentation file, hyperfleet/docs/gcp-developer-cluster-lifecycle.md, defines the GCP Developer Cluster Lifecycle Policy for developer GKE clusters in hcm-hyperfleet. The policy covers:

  • Scope exclusions for CI/CD and Prow-managed ephemeral clusters
  • Mandatory naming conventions and labels (ttl required for environment: dev)
  • Terraform provisioning requirements
  • A default 5-day TTL, renewed by re-applying Terraform
  • Scale-to-zero shutdown before deletion
  • Event/hackathon cluster handling
  • Restrictions on standalone VMs outside GKE
  • Intended Cloud Scheduler automation for nightly shutdown and enforcement, explicitly marked as not yet implemented, with a tracking ticket; enforcement remains manual until rollout
  • Orphaned resource cleanup procedures, with a monthly audit checklist and Prow E2E guidance
  • Audit history dated 2026-06-15

The existing hyperfleet/docs/prow-cicd-cluster.md receives a clarifying note that the Prow CI/CD cluster is excluded from this policy.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 11

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the primary change: adding documentation for the GCP developer cluster lifecycle policy. |
| Description check | ✅ Passed | The description is directly related to the changeset, detailing what was added and providing relevant context about the underlying audit and linked tracking tickets. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Sec-02: Secrets In Log Output | ✅ Passed | PR modifies only documentation files (.md). SEC-02 check targets Go log statements (slog, log, logr, zap, fmt.Print*); no Go code present in this PR. |
| No Hardcoded Secrets | ✅ Passed | Documentation-only change with no hardcoded secrets, API keys, tokens, passwords, base64 strings, or embedded credentials detected. |
| No Weak Cryptography | ✅ Passed | PR contains only Markdown documentation files with no cryptographic code, weak primitives (MD5, DES, RC4, SHA1 for security), ECB mode, custom crypto implementations, or non-constant-time secret co... |
| No Injection Vectors | ✅ Passed | Both files are documentation (Markdown) containing only policy descriptions and example commands with no SQL, exec.Command, template.HTML, or yaml.Unmarshal patterns. |
| No Privileged Containers | ✅ Passed | PR contains only documentation files (.md); no Kubernetes manifests, Helm templates, or Dockerfiles present. Check not applicable. |
| No Pii Or Sensitive Data In Logs | ✅ Passed | PR contains only Markdown documentation files with no logging code. No logging statements present to expose PII or sensitive data. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hyperfleet/docs/gcp-developer-cluster-lifecycle.md`:
- Around line 132-141: Add a reference to ticket HYPERFLEET-1229 at the
beginning of the "Future Automation (Proposed)" section per coding guideline
HYG-02. Expand the section to document the enforcement specifications missing
from the current table: clarify how the Cloud Function validates the owner label
(e.g., against Kerberos/LDAP or just existence check), specify the service
account identity and required IAM permissions for the enforcement job, document
audit logging and dry-run requirements before cluster shutdown or deletion, and
detail safeguards to prevent label spoofing or privilege escalation by the Cloud
Function. These details should be added as separate subsections or a detailed
specification paragraph following the component table.
- Around line 128-131: Expand the "Missing Owner Label Enforcement" section
(lines 128-131) to specify the validation criteria for a valid owner label. Add
documentation that defines: (1) the exact format or criteria an owner label must
meet to be considered valid (for example, whether it must match a Kerberos
principal format or be cross-checked against a directory service like LDAP), (2)
how the enforcement job authenticates its own identity to prevent unauthorized
bypass or tampering, and (3) any exemption mechanisms, manual override
procedures, or escalation paths available to cluster operators. This
clarification is necessary to establish a secure enforcement boundary before
HYPERFLEET-1229 implementation.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 627ad06a-352f-4ec3-a004-279e23df336f

📥 Commits

Reviewing files that changed from the base of the PR and between 0bb0fc2 and 1882eda.

📒 Files selected for processing (2)
  • hyperfleet/docs/gcp-developer-cluster-lifecycle.md
  • hyperfleet/docs/prow-cicd-cluster.md
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

@rafabene rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch from 1882eda to f17f85a Compare June 15, 2026 16:37
openshift-ci Bot commented Jun 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign crizzo71 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from crizzo71 and jsell-rh June 15, 2026 16:46
### Monthly Audit

A monthly audit should be performed to identify and clean up orphaned resources:

Contributor

https://console.cloud.google.com/iam-admin/asset-inventory/dashboard?project=hcm-hyperfleet

Not sure whether we should be looking at this, but the Asset Inventory list can be helpful for seeing which resources are left over. I'm sure there are different filtering techniques that could help.

Contributor Author

Good call! Added a reference to the GCP Asset Inventory dashboard in the Monthly Audit section — it gives a consolidated view of all project resources which is great for spotting leftovers.

A monthly audit should be performed to identify and clean up orphaned resources:

| Resource Type | How to Check | Cleanup |
|---|---|---|

Contributor

https://github.com/cloud-custodian/cloud-custodian
Could be useful for cleaning up resources. I'm sure there are other solutions as well.

Contributor Author

Nice find! Added a mention of Cloud Custodian as a viable option for automated policy-as-code resource cleanup, alongside the daily enforcement automation we already have planned.

@rafabene rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch from f17f85a to bb453af Compare June 15, 2026 20:08

### Nightly Shutdown

A Cloud Scheduler job runs daily at 19:00 UTC and scales down all developer cluster node pools to zero nodes. Developers must manually scale back up in the morning. This avoids paying for idle compute overnight and on weekends.

Contributor

Interesting. In the NA/SA, it'll be around 11 AM-2 PM. Should we expect a scale-down during work hours?

@rh-amarin rh-amarin Jun 16, 2026

Collaborator

What if the scheduled job runs every hour and shuts down node pools that have been running for more than 12h since creation?

This would make the ttl tag unnecessary, unless we want a mechanism to bypass the shutdown, in which case a TTL would keep the cluster alive until it is reached.

Contributor Author

Good catch! Changed the shutdown time from 19:00 UTC to 01:00 UTC, which is after work hours across all team timezones (6 PM PDT / 10 PM BRT / 3 AM CEST). Also added a note in the doc explaining the timezone rationale.
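
For anyone who wants to double-check the timezone arithmetic, here is a minimal sketch that converts a UTC hour into local wall-clock times. It assumes fixed mid-June offsets (PDT = UTC-7, BRT = UTC-3, CEST = UTC+2) rather than real daylight-saving rules, and the helper name `local_times` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed mid-June offsets (assumptions; daylight-saving rules are not modelled).
OFFSETS = {
    "PDT": timezone(timedelta(hours=-7), "PDT"),
    "BRT": timezone(timedelta(hours=-3), "BRT"),
    "CEST": timezone(timedelta(hours=2), "CEST"),
}


def local_times(utc_hour: int, day: datetime) -> dict[str, str]:
    """Return the local HH:MM in each zone for a given UTC hour on the given date."""
    moment = day.replace(
        hour=utc_hour, minute=0, second=0, microsecond=0, tzinfo=timezone.utc
    )
    return {name: moment.astimezone(tz).strftime("%H:%M") for name, tz in OFFSETS.items()}


if __name__ == "__main__":
    # Compare the original 19:00 UTC schedule with the 01:00 UTC proposal.
    for hour in (19, 1):
        print(hour, local_times(hour, datetime(2026, 6, 16)))
```

Running it for 19:00 UTC shows why the original schedule landed inside working hours for the Americas: under these offsets it falls at midday in PDT and mid-afternoon in BRT.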

Contributor Author

Interesting idea! I think they serve different purposes though:

  • Nightly shutdown is about saving compute costs during off-hours. A fixed daily schedule is simpler to implement (single Cloud Scheduler cron) and predictable for developers — they know exactly when it happens and can plan around it.
  • TTL is about cluster lifecycle — when to delete the cluster entirely, not just scale it down. Even with a 12h-based shutdown, we'd still need TTL to prevent clusters from accumulating indefinitely.

The 12h-since-creation approach adds complexity (tracking node pool scale-up timestamps, hourly execution cost) without eliminating TTL. The fixed nightly shutdown + TTL combo keeps both concerns separate and simple.

That said, if the team prefers the hourly approach, we can revisit. What do you think?

Contributor Author

Thinking more about it, you're right — creation-time based is simpler and eliminates the timezone issue entirely. Updated the doc:

  • Replaced the fixed 01:00 UTC nightly shutdown with an hourly job that scales down node pools whose nodes have been running for more than 12 hours (based on node creationTimestamp)
  • Every developer gets the same 12h uptime window regardless of timezone
  • Updated the Future Automation table accordingly
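
The uptime check described above could look roughly like the sketch below. It assumes node creation times arrive as RFC 3339 strings in UTC (the form Kubernetes uses for creationTimestamp) and that the oldest node in a pool decides whether the pool is scaled down; both the function names and that "oldest node" rule are assumptions for illustration, not something the doc specifies.

```python
from datetime import datetime, timedelta, timezone

MAX_UPTIME = timedelta(hours=12)  # uptime window from the comment above


def parse_rfc3339(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2026-06-16T02:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def should_scale_down(node_creation_timestamps: list[str], now: datetime) -> bool:
    """Return True when the oldest node in the pool has run longer than MAX_UPTIME.

    Assumption: the oldest node decides. An empty pool is already at zero
    nodes, so there is nothing to scale down.
    """
    if not node_creation_timestamps:
        return False
    oldest = min(parse_rfc3339(ts) for ts in node_creation_timestamps)
    return now - oldest > MAX_UPTIME
```

Because the comparison is made on elapsed time rather than wall-clock time, the result is the same whichever timezone the developer works in.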

| Component | Purpose |
|---|---|
| Cloud Scheduler | Cron trigger at 19:00 UTC daily |
| Cloud Function | Iterates clusters, checks labels and TTL, resizes node pools to 0 or deletes |
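
To make the Cloud Function row concrete, here is a minimal sketch of the per-cluster decision only, with no GCP API calls. Several things in it are assumptions rather than policy: the ttl label is treated as an ISO expiry date (YYYY-MM-DD), a missing owner or an unreadable ttl leads to scale-to-zero instead of deletion, and the names `Action` and `decide` are hypothetical.

```python
from datetime import date
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    SCALE_TO_ZERO = "scale_to_zero"
    DELETE = "delete"


def decide(labels: dict[str, str], today: date) -> Action:
    """Pick an enforcement action for one cluster from its labels.

    Assumptions (not from the policy doc): the ttl label holds an ISO expiry
    date, and a missing owner or malformed ttl results in scale-to-zero
    rather than deletion.
    """
    if not labels.get("owner"):
        return Action.SCALE_TO_ZERO
    try:
        expires = date.fromisoformat(labels.get("ttl", ""))
    except ValueError:
        return Action.SCALE_TO_ZERO
    # The cluster is deleted only once the expiry date has passed.
    return Action.DELETE if today > expires else Action.KEEP
```

Keeping the decision in a pure function like this would also make a dry-run mode straightforward: the job can log the chosen action for each cluster without acting on it.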

Contributor

In case of deletion, should we propose adding a Slack notification about the imminent deletion?
In the current logic, only terraform apply updates the ttl, so it could be a surprise to find your resources deleted even though you had scaled your cluster back up.

Collaborator

IMO connecting to Slack adds too much complexity here.
We already work with spot instances, so the dev node pools can disappear at any given moment.

I would say that if you need to persist some resources... better to write a script that creates them.

@rafabene rafabene Jun 16, 2026

Contributor Author

Indeed. Initially I considered a Slack notification, but I removed it before asking for a review. Since we already work with spot instances, devs are used to ephemeral node pools. Adding Slack integration would be unnecessary complexity for this use case.

@rafabene rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch 4 times, most recently from ccd05a5 to a3123f8 Compare June 16, 2026 14:26
@rafabene rafabene force-pushed the HYPERFLEET-1112-gcp-developer-cluster-lifecycle branch from a3123f8 to 8a26107 Compare June 16, 2026 14:34
4 participants