HYPERFLEET-1185 - feat: add performance tests and tooling#127
HYPERFLEET-1185 - feat: add performance tests and tooling#127pnguyen44 wants to merge 2 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Central YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (7)
🔗 Linked repositories identifiedCodeRabbit considers these linked repositories for cross-repo context during reviews:
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughAdds in-cluster performance baseline measurement for core HyperFleet operations. Nine Ginkgo suites labeled Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@e2e/cluster/perf_create_latency.go`:
- Around line 34-38: The ginkgo.DeferCleanup closures in multiple perf test
files are currently only logging warnings when CleanupTestCluster fails, which
allows resource leaks. Modify the closure to propagate the error by returning it
from h.CleanupTestCluster instead of only logging and continuing. Apply the same
fix to all four affected files: In e2e/cluster/perf_create_latency.go (lines
34-38), change the DeferCleanup closure to return the error from
h.CleanupTestCluster. In e2e/cluster/perf_read_latency.go (lines 27-31), make
the same change to return the cleanup error. In
e2e/cluster/perf_update_latency.go (lines 29-33), make the same change to return
the cleanup error. In e2e/cluster/perf_list_latency.go (lines 27-31), make the
same change to return the cleanup error. This ensures failures during cleanup
are properly propagated rather than silently logged.
In `@e2e/cluster/perf_read_entity_size_latency.go`:
- Around line 39-43: Replace the warning-only logging pattern in the cleanup
callbacks across all affected files with explicit test failure handling. In each
ginkgo.DeferCleanup callback (CleanupTestCluster and CleanupTestNodepool calls),
use ginkgo.Fail() to surface cleanup errors as test failures instead of
downgrading them to warnings via ginkgo.GinkgoWriter.Printf. For the
perf_delete_latency.go files (both cluster and nodepool variants), implement
idempotency checks to distinguish between already-deleted resources (which
should not fail the test) and unexpected errors (which should fail via
ginkgo.Fail()). Apply this pattern consistently across
e2e/cluster/perf_read_entity_size_latency.go (39-43),
e2e/cluster/perf_list_filtered_latency.go (28-32),
e2e/cluster/perf_delete_latency.go (30-34),
e2e/cluster/perf_cascade_delete_latency.go (31-35 and 46-50),
e2e/nodepool/perf_create_latency.go (29-33 and 44-48), and
e2e/nodepool/perf_delete_latency.go (31-35 and 41-45).
In `@e2e/nodepool/perf_delete_latency.go`:
- Around line 59-60: The Eventually assertion in the nodepool hard-delete wait
is using the incorrect timeout configuration. Replace
h.Cfg.Timeouts.Adapter.Processing with h.Cfg.Timeouts.NodePool.Reconciled in the
Eventually call that checks for the http.StatusNotFound response. This ensures
the timeout budget matches what is used elsewhere in the codebase for nodepool
hard-delete operations, preventing false failures due to control-plane lag.
In `@perf/parse-report.sh`:
- Around line 15-27: When the user explicitly provides a file path as the first
argument to the script, add validation to fail immediately if that file does not
exist, rather than silently falling back to the latest results file. The current
check on line 15 only sets INPUT and SOURCE when the file exists, but should
instead validate that if an explicit path is provided (non-empty $1), the script
must verify the file exists and exit with an error if it doesn't. This prevents
accidental parsing of the wrong artifact when a user makes a typo in the
filename they explicitly provided.
In `@perf/run-in-cluster.sh`:
- Around line 44-63: The script uses a mutable image tag (DEV_TAG variable
constructed on line 44) in the IMAGE variable on line 45 and passes it to
kubectl run with imagePullPolicy=Always on line 56, which creates a security
vulnerability since tags can be overwritten. Instead, capture the immutable
image digest from the Makefile's image-dev target (which builds and pushes the
image) and modify the run-in-cluster.sh script to use the digest reference
(format: quay.io/$QUAY_USER/hyperfleet-e2e@sha256:...) instead of the tag-based
IMAGE variable in the kubectl run command.
In `@perf/seed-clusters.sh`:
- Around line 72-77: The curl DELETE request on line 75 does not validate the
HTTP response status before incrementing the deleted counter on line 76. This
causes failed deletes (4xx/5xx responses) to be incorrectly counted as
successes. Check the HTTP status code returned by the curl command (or its exit
status) and only increment the deleted counter when the delete operation
succeeds. Ensure that failures are either logged or handled appropriately so
that the final count accurately reflects only successful cluster deletions.
- Line 15: Replace the unsafe `source` command that executes the entire `.env`
file with a whitelisting approach that parses only specific, known variables
from the `.env` file. Instead of using `source "$REPO_DIR/.env"`, extract and
assign individual variables explicitly (for example, using grep or read commands
to parse specific key-value pairs). This prevents arbitrary shell code execution
while still accessing necessary configuration values. Apply this pattern
consistently across all perf scripts that currently use `source` to load `.env`
files.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 58ad3cde-a3d9-4582-aaad-658b4d5735b4
📒 Files selected for processing (21)
.env.example.gitignoreAGENTS.mdREADME.mde2e/cluster/perf_cascade_delete_latency.goe2e/cluster/perf_create_latency.goe2e/cluster/perf_delete_latency.goe2e/cluster/perf_list_filtered_latency.goe2e/cluster/perf_list_latency.goe2e/cluster/perf_read_entity_size_latency.goe2e/cluster/perf_read_latency.goe2e/cluster/perf_update_latency.goe2e/nodepool/perf_create_latency.goe2e/nodepool/perf_delete_latency.goperf/README.mdperf/parse-report.shperf/run-in-cluster.shperf/seed-clusters.shpkg/client/cluster.gotestdata/payloads/clusters/cluster-request-large.jsontestdata/payloads/clusters/cluster-request-small.json
🔗 Linked repositories identified
CodeRabbit considers these linked repositories for cross-repo context during reviews:
openshift-hyperfleet/architecture(manual)openshift-hyperfleet/hyperfleet-api(manual)openshift-hyperfleet/hyperfleet-sentinel(manual)openshift-hyperfleet/hyperfleet-adapter(manual)openshift-hyperfleet/hyperfleet-broker(manual)
|
/retest-required |
| Confirm the adapters are subscribed to the correct topic (not `placeholder`): | ||
|
|
||
| ```bash | ||
| kubectl exec -n hyperfleet deploy/adapter1-hyperfleet-adapter -- env | grep BROKER_TOPIC |
There was a problem hiding this comment.
The verification step tells users to check BROKER_TOPIC env var:
kubectl exec -n hyperfleet deploy/adapter1-hyperfleet-adapter -- env | grep BROKER_TOPICBut from the v1.0.0 breaking-changes checklist (item #15), the Sentinel's BROKER_TOPIC env var was removed in v1.0.0. These perf tests target v1.0.0. This verification command may return nothing, leading users to think the adapter isn't connected when it actually is — the topic is now configured via config file, not env var.
Needs verification: does the adapter (not Sentinel) deployment template still inject BROKER_TOPIC? If not, this step needs updating.
| for _, size := range sizes { | ||
| ginkgo.It("should read a "+size.name+" cluster within acceptable latency", func(ctx context.Context) { | ||
| ginkgo.By("creating a " + size.name + " cluster") | ||
| cluster, err := h.Client.CreateClusterFromPayload(ctx, h.TestDataPath(size.payload)) |
There was a problem hiding this comment.
The entity-size test creates clusters and reads them immediately without waiting for Reconciled. The other lifecycle tests (create, update, delete) all wait for Reconciled first.
A freshly-created cluster has only the spec — no adapter statuses, no conditions, no data from adapter reports. A reconciled cluster has all of that, making the JSON response larger. So this measures read latency on a smaller response than a production read would return.
If the intent is "does a bigger spec affect read latency" — fine as-is. If it's "realistic read latency of a fully-provisioned cluster" — it's undercounting. A one-line comment clarifying the intent would help.
| _, err := h.Client.PatchClusterFromPayload(ctx, clusterID, h.TestDataPath("payloads/clusters/cluster-patch.json")) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| Eventually(h.PollCluster(ctx, clusterID), h.Cfg.Timeouts.Cluster.Reconciled, h.Cfg.Polling.Interval). |
There was a problem hiding this comment.
Potential false near-zero latency from stale condition read.
After PATCH (line 44), generation increments from 1 to 2. Line 47-48 immediately polls for Reconciled=True. But Reconciled=True was already set from gen 1 reconciliation. HaveResourceCondition checks type and status but not observed_generation — so stale True at gen 1 is indistinguishable from fresh True at gen 2.
If the API doesn't synchronously flip Reconciled=False during the PATCH handler, the first poll reads the stale True and the test reports near-zero latency.
Suggested fix — wait for False first, then True:
// Confirm API detected generation mismatch
Eventually(h.PollCluster(ctx, clusterID), h.Cfg.Timeouts.Cluster.Reconciled, h.Cfg.Polling.Interval).
Should(helper.HaveResourceCondition(client.ConditionTypeReconciled, openapi.ResourceConditionStatusFalse))
// Then wait for re-reconciliation at new generation
Eventually(h.PollCluster(ctx, clusterID), h.Cfg.Timeouts.Cluster.Reconciled, h.Cfg.Polling.Interval).
Should(helper.HaveResourceCondition(client.ConditionTypeReconciled, openapi.ResourceConditionStatusTrue))Or add a generation-aware matcher that checks observed_generation == resource.generation.
Summary
HYPERFLEET-1185
Adds lightweight performance tests that measure baseline latencies for core HyperFleet operations (cluster and nodepool CRUD). Tests are labeled
perfand emit[PERF]log lines with measured durations, which can be parsed into a baseline report. Includes tooling to seed the database, run tests in-cluster, and generate summary reports.Note: This PR covers the test implementation and tooling only. Establishing official baselines and adding threshold assertions based on those baselines will follow in a separate PR.
Changes
ListClustersWithParamsclient method for parameterized cluster queries (search, pagination)seed-clusters.shfor database seeding,run-in-cluster.shfor in-cluster execution, andparse-report.shfor extracting baseline reportsTest plan
./perf/run-in-cluster.shand confirm[PERF]latency lines appear in output./perf/parse-report.shagainst the output and verify the baseline report is generated correctly./perf/seed-clusters.sh statusand./perf/seed-clusters.sh cleanupto verify seed tooling worksmake lintpasses