HYPERFLEET-1112 - docs: add GCP developer cluster lifecycle policy#160
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughA new documentation file Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hyperfleet/docs/gcp-developer-cluster-lifecycle.md`:
- Around line 132-141: Add a reference to ticket HYPERFLEET-1229 at the
beginning of the "Future Automation (Proposed)" section per coding guideline
HYG-02. Expand the section to document the enforcement specifications missing
from the current table: clarify how the Cloud Function validates the owner label
(e.g., against Kerberos/LDAP or just existence check), specify the service
account identity and required IAM permissions for the enforcement job, document
audit logging and dry-run requirements before cluster shutdown or deletion, and
detail safeguards to prevent label spoofing or privilege escalation by the Cloud
Function. These details should be added as separate subsections or a detailed
specification paragraph following the component table.
- Around line 128-131: Expand the "Missing Owner Label Enforcement" section
(lines 128-131) to specify the validation criteria for a valid owner label. Add
documentation that defines: (1) the exact format or criteria an owner label must
meet to be considered valid (for example, whether it must match a Kerberos
principal format or be cross-checked against a directory service like LDAP), (2)
how the enforcement job authenticates its own identity to prevent unauthorized
bypass or tampering, and (3) any exemption mechanisms, manual override
procedures, or escalation paths available to cluster operators. This
clarification is necessary to establish a secure enforcement boundary before
HYPERFLEET-1229 implementation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 627ad06a-352f-4ec3-a004-279e23df336f
📒 Files selected for processing (2)
hyperfleet/docs/gcp-developer-cluster-lifecycle.mdhyperfleet/docs/prow-cicd-cluster.md
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift-hyperfleet/architecture(manual)openshift-hyperfleet/hyperfleet-api(manual)openshift-hyperfleet/hyperfleet-sentinel(manual)openshift-hyperfleet/hyperfleet-adapter(manual)openshift-hyperfleet/hyperfleet-broker(manual)
1882eda to
f17f85a
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| ### Monthly Audit | ||
|
|
||
| A monthly audit should be performed to identify and clean up orphaned resources: | ||
|
|
There was a problem hiding this comment.
https://console.cloud.google.com/iam-admin/asset-inventory/dashboard?project=hcm-hyperfleet
Don't know if we should be looking at this asset-inventory list can be helpful to see what resources are left over.. I'm sure there are different filtering techniques that could be helpfule
There was a problem hiding this comment.
Good call! Added a reference to the GCP Asset Inventory dashboard in the Monthly Audit section — it gives a consolidated view of all project resources which is great for spotting leftovers.
| A monthly audit should be performed to identify and clean up orphaned resources: | ||
|
|
||
| | Resource Type | How to Check | Cleanup | | ||
| |---|---|---| |
There was a problem hiding this comment.
https://github.com/cloud-custodian/cloud-custodian
Could be useful to cleanup resources. I'm sure there are also other solutions
There was a problem hiding this comment.
Nice find! Added a mention of Cloud Custodian as a viable option for automated policy-as-code resource cleanup, alongside the daily enforcement automation we already have planned.
f17f85a to
bb453af
Compare
|
|
||
| ### Nightly Shutdown | ||
|
|
||
| A Cloud Scheduler job runs daily at 19:00 UTC and scales down all developer cluster node pools to zero nodes. Developers must manually scale back up in the morning. This avoids paying for idle compute overnight and on weekends. |
There was a problem hiding this comment.
Interesting. In the NA/SA, it'll be around 11 AM-2 PM. Should we expect a scale-down during work hours?
There was a problem hiding this comment.
What if the scheduled job runs every hour and shut downs nodepools running for more than 12h since creation?
This makes the ttl tag not required, unless we want a mechanism to bypass the shutdown, so a TTL will keep alive until reached
There was a problem hiding this comment.
Good catch! Changed the shutdown time from 19:00 UTC to 01:00 UTC, which is after work hours across all team timezones (5 PM PDT / 10 PM BRT / 3 AM CEST). Also added a note in the doc explaining the timezone rationale.
There was a problem hiding this comment.
Interesting idea! I think they serve different purposes though:
- Nightly shutdown is about saving compute costs during off-hours. A fixed daily schedule is simpler to implement (single Cloud Scheduler cron) and predictable for developers — they know exactly when it happens and can plan around it.
- TTL is about cluster lifecycle — when to delete the cluster entirely, not just scale it down. Even with a 12h-based shutdown, we'd still need TTL to prevent clusters from accumulating indefinitely.
The 12h-since-creation approach adds complexity (tracking node pool scale-up timestamps, hourly execution cost) without eliminating TTL. The fixed nightly shutdown + TTL combo keeps both concerns separate and simple.
That said, if the team prefers the hourly approach, we can revisit. What do you think?
There was a problem hiding this comment.
Thinking more about it, you're right — creation-time based is simpler and eliminates the timezone issue entirely. Updated the doc:
- Replaced the fixed 01:00 UTC nightly shutdown with an hourly job that scales down node pools whose nodes have been running for more than 12 hours (based on node
creationTimestamp) - Every developer gets the same 12h uptime window regardless of timezone
- Updated the Future Automation table accordingly
| | Component | Purpose | | ||
| |---|---| | ||
| | Cloud Scheduler | Cron trigger at 19:00 UTC daily | | ||
| | Cloud Function | Iterates clusters, checks labels and TTL, resizes node pools to 0 or deletes | |
There was a problem hiding this comment.
In case of deletion, should we add a Slack notification about the imminent deletion as a proposal?
Because in current logic, only terraform apply updates ttl, so it might be unexpected to see your resources are deleted, even if you scaled back your cluster.
There was a problem hiding this comment.
IMO connection to Slack adds too much complexity here.
We already work with spot instances, so the dev nodepools can disapear at any given moment.
I would say that if you need to persist some resources... better write a script to create them
There was a problem hiding this comment.
Indeed. Initially I considered Slack notification, but I removed before asking for a review. Since we already work with spot instances, devs are used to ephemeral nodepools. Adding Slack integration would be unnecessary complexity for this use case.
ccd05a5 to
a3123f8
Compare
a3123f8 to
8a26107
Compare
Summary
gcp-developer-cluster-lifecycle.mddefining lifecycle rules for developer GKE clusters inhcm-hyperfleet: naming conventions, required labels (owner,environment,ttl), 5-day TTL with renewal viaterraform apply, nightly shutdown, missing owner enforcement, event cluster rules, and orphaned resource cleanup proceduresprow-cicd-cluster.mdclarifying the Prow cluster is excluded from lifecycle enforcementContext
The HYPERFLEET-1112 audit of the
hcm-hyperfleetGCP project found ~$1,142/mo in wasted resources from clusters without lifecycle controls. This document establishes the policy. Automated enforcement is tracked in HYPERFLEET-1229.Test plan
prow-cicd-cluster.mdresolves to the new doc