Skip to content

HYPERFLEET-1185 - feat: add performance tests and tooling#127

Open
pnguyen44 wants to merge 2 commits into
openshift-hyperfleet:mainfrom
pnguyen44:HYPERFLEET-1185/add-perf
Open

HYPERFLEET-1185 - feat: add performance tests and tooling#127
pnguyen44 wants to merge 2 commits into
openshift-hyperfleet:mainfrom
pnguyen44:HYPERFLEET-1185/add-perf

Conversation

@pnguyen44

@pnguyen44 pnguyen44 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

HYPERFLEET-1185

Adds lightweight performance tests that measure baseline latencies for core HyperFleet operations (cluster and nodepool CRUD). Tests are labeled perf and emit [PERF] log lines with measured durations, which can be parsed into a baseline report. Includes tooling to seed the database, run tests in-cluster, and generate summary reports.

Note: This PR covers the test implementation and tooling only. Establishing official baselines and adding threshold assertions based on those baselines will follow in a separate PR.

Changes

  • Added perf tests covering cluster create, read, list, filtered-list, update, delete, cascade-delete latencies, read-by-entity-size, and nodepool create/delete latencies
  • Added ListClustersWithParams client method for parameterized cluster queries (search, pagination)
  • Added perf tooling: seed-clusters.sh for database seeding, run-in-cluster.sh for in-cluster execution, and parse-report.sh for extracting baseline reports
  • Added small and large cluster payload templates for entity-size comparison tests

Test plan

  • Run perf tests in-cluster via ./perf/run-in-cluster.sh and confirm [PERF] latency lines appear in output
  • Run ./perf/parse-report.sh against the output and verify the baseline report is generated correctly
  • Run ./perf/seed-clusters.sh status and ./perf/seed-clusters.sh cleanup to verify seed tooling works
  • make lint passes
  • E2E tests passed (if cross-component or major changes)

@openshift-ci openshift-ci Bot requested review from ciaranRoche and ldornele June 16, 2026 15:27
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sherine-k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d447fec1-579f-4437-8f0c-a365b8e54819

📥 Commits

Reviewing files that changed from the base of the PR and between 1b8b8a5 and 7bb0de2.

📒 Files selected for processing (7)
  • configs/config.yaml
  • e2e/nodepool/perf_delete_latency.go
  • perf/parse-report.sh
  • perf/run-in-cluster.sh
  • perf/seed-clusters.sh
  • pkg/config/config.go
  • pkg/config/defaults.go
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)
✅ Files skipped from review due to trivial changes (1)
  • pkg/config/defaults.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • perf/run-in-cluster.sh
  • e2e/nodepool/perf_delete_latency.go
  • perf/parse-report.sh

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive performance testing suite measuring baseline latencies for cluster and nodepool operations (create, read, update, delete, list)
    • Included in-cluster testing tools with automated database seeding and result reporting
  • Documentation

    • New performance testing guide with setup prerequisites and step-by-step usage instructions

Walkthrough

Adds in-cluster performance baseline measurement for core HyperFleet operations. Nine Ginkgo suites labeled [Suite: cluster|nodepool][perf] time create-to-reconciled, read, read-by-entity-size, update-to-reconciled, list, filtered-list, delete-to-hard-delete, and cascade-delete latencies via polling and HTTP status checks. Each suite provisions test resources via BeforeEach, measures elapsed time for the targeted operation, logs [PERF] metrics, and defers cleanup. Supporting infrastructure: perf/run-in-cluster.sh builds container image via make image-dev, extracts reference, spawns transient pod in target namespace; perf/seed-clusters.sh provisions batch clusters via POST /clusters with perf-seed labels and implements targeted/full reset; perf/parse-report.sh extracts [PERF] and [FAIL] lines from timestamped results, embeds kubectl context metadata, and writes report. NodePoolTimeouts struct gains Deleted field with 2m default. ListClusters refactored to delegate to ListClustersWithParams. Two cluster JSON fixtures (small 31 lines, large 53 lines) added. Documentation extended.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
No Pii Or Sensitive Data In Logs ⚠️ Warning Internal hostnames logged to stdout: perf/seed-clusters.sh lines 138, 175 log $API_URL (internal service FQDN); line 139 logs kubectl context name; perf/run-in-cluster.sh line 65 passes internal ho... Remove or redact API_URL and kubectl context from log output; pass api-url as environment variable or config file instead of command argument.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly identifies the main change: addition of performance tests and associated tooling, which aligns with the bulk of the changeset (9 new perf test files, 3 shell scripts, documentation, and supporting infrastructure).
Description check ✅ Passed Description is directly related to the changeset, detailing specific performance test coverage (cluster CRUD, nodepool CRUD, entity-size comparisons), tooling components (seed, run, parse scripts), and scope boundaries.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Sec-02: Secrets In Log Output ✅ Passed No log statements found logging tokens, passwords, credentials, or secrets. Logging includes only non-sensitive config metadata: cluster IDs, names, paths, image references (with QUAY_USER as usern...
No Hardcoded Secrets ✅ Passed .env.example contains empty environment variables with no default values. Test data files use standard placeholder values (example.com, my-gcp-project). No hardcoded secrets, credentials, API keys,...
No Weak Cryptography ✅ Passed No banned crypto packages (MD5, DES, RC4, SHA1), no unsafe secret comparisons, no custom crypto implementations, no ECB mode, no hardcoded credentials. Environment variables used properly.
No Injection Vectors ✅ Passed No injection vectors found. Regex validates API IDs in perf/seed-clusters.sh (^[a-zA-Z0-9_-]+) before use. Go code uses openapi.GetClustersParams for parameterized queries. Shell variables properly...
No Privileged Containers ✅ Passed No privileged container settings found: Dockerfiles lack privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation flags. E2E test containers run as root...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e/cluster/perf_create_latency.go`:
- Around line 34-38: The ginkgo.DeferCleanup closures in multiple perf test
files are currently only logging warnings when CleanupTestCluster fails, which
allows resource leaks. Modify the closure to propagate the error by returning it
from h.CleanupTestCluster instead of only logging and continuing. Apply the same
fix to all four affected files: In e2e/cluster/perf_create_latency.go (lines
34-38), change the DeferCleanup closure to return the error from
h.CleanupTestCluster. In e2e/cluster/perf_read_latency.go (lines 27-31), make
the same change to return the cleanup error. In
e2e/cluster/perf_update_latency.go (lines 29-33), make the same change to return
the cleanup error. In e2e/cluster/perf_list_latency.go (lines 27-31), make the
same change to return the cleanup error. This ensures failures during cleanup
are properly propagated rather than silently logged.

In `@e2e/cluster/perf_read_entity_size_latency.go`:
- Around line 39-43: Replace the warning-only logging pattern in the cleanup
callbacks across all affected files with explicit test failure handling. In each
ginkgo.DeferCleanup callback (CleanupTestCluster and CleanupTestNodepool calls),
use ginkgo.Fail() to surface cleanup errors as test failures instead of
downgrading them to warnings via ginkgo.GinkgoWriter.Printf. For the
perf_delete_latency.go files (both cluster and nodepool variants), implement
idempotency checks to distinguish between already-deleted resources (which
should not fail the test) and unexpected errors (which should fail via
ginkgo.Fail()). Apply this pattern consistently across
e2e/cluster/perf_read_entity_size_latency.go (39-43),
e2e/cluster/perf_list_filtered_latency.go (28-32),
e2e/cluster/perf_delete_latency.go (30-34),
e2e/cluster/perf_cascade_delete_latency.go (31-35 and 46-50),
e2e/nodepool/perf_create_latency.go (29-33 and 44-48), and
e2e/nodepool/perf_delete_latency.go (31-35 and 41-45).

In `@e2e/nodepool/perf_delete_latency.go`:
- Around line 59-60: The Eventually assertion in the nodepool hard-delete wait
is using the incorrect timeout configuration. Replace
h.Cfg.Timeouts.Adapter.Processing with h.Cfg.Timeouts.NodePool.Reconciled in the
Eventually call that checks for the http.StatusNotFound response. This ensures
the timeout budget matches what is used elsewhere in the codebase for nodepool
hard-delete operations, preventing false failures due to control-plane lag.

In `@perf/parse-report.sh`:
- Around line 15-27: When the user explicitly provides a file path as the first
argument to the script, add validation to fail immediately if that file does not
exist, rather than silently falling back to the latest results file. The current
check on line 15 only sets INPUT and SOURCE when the file exists, but should
instead validate that if an explicit path is provided (non-empty $1), the script
must verify the file exists and exit with an error if it doesn't. This prevents
accidental parsing of the wrong artifact when a user makes a typo in the
filename they explicitly provided.

In `@perf/run-in-cluster.sh`:
- Around line 44-63: The script uses a mutable image tag (DEV_TAG variable
constructed on line 44) in the IMAGE variable on line 45 and passes it to
kubectl run with imagePullPolicy=Always on line 56, which creates a security
vulnerability since tags can be overwritten. Instead, capture the immutable
image digest from the Makefile's image-dev target (which builds and pushes the
image) and modify the run-in-cluster.sh script to use the digest reference
(format: quay.io/$QUAY_USER/hyperfleet-e2e@sha256:...) instead of the tag-based
IMAGE variable in the kubectl run command.

In `@perf/seed-clusters.sh`:
- Around line 72-77: The curl DELETE request on line 75 does not validate the
HTTP response status before incrementing the deleted counter on line 76. This
causes failed deletes (4xx/5xx responses) to be incorrectly counted as
successes. Check the HTTP status code returned by the curl command (or its exit
status) and only increment the deleted counter when the delete operation
succeeds. Ensure that failures are either logged or handled appropriately so
that the final count accurately reflects only successful cluster deletions.
- Line 15: Replace the unsafe `source` command that executes the entire `.env`
file with a whitelisting approach that parses only specific, known variables
from the `.env` file. Instead of using `source "$REPO_DIR/.env"`, extract and
assign individual variables explicitly (for example, using grep or read commands
to parse specific key-value pairs). This prevents arbitrary shell code execution
while still accessing necessary configuration values. Apply this pattern
consistently across all perf scripts that currently use `source` to load `.env`
files.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 58ad3cde-a3d9-4582-aaad-658b4d5735b4

📥 Commits

Reviewing files that changed from the base of the PR and between 0cd32f2 and 1b8b8a5.

📒 Files selected for processing (21)
  • .env.example
  • .gitignore
  • AGENTS.md
  • README.md
  • e2e/cluster/perf_cascade_delete_latency.go
  • e2e/cluster/perf_create_latency.go
  • e2e/cluster/perf_delete_latency.go
  • e2e/cluster/perf_list_filtered_latency.go
  • e2e/cluster/perf_list_latency.go
  • e2e/cluster/perf_read_entity_size_latency.go
  • e2e/cluster/perf_read_latency.go
  • e2e/cluster/perf_update_latency.go
  • e2e/nodepool/perf_create_latency.go
  • e2e/nodepool/perf_delete_latency.go
  • perf/README.md
  • perf/parse-report.sh
  • perf/run-in-cluster.sh
  • perf/seed-clusters.sh
  • pkg/client/cluster.go
  • testdata/payloads/clusters/cluster-request-large.json
  • testdata/payloads/clusters/cluster-request-small.json
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift-hyperfleet/architecture (manual)
  • openshift-hyperfleet/hyperfleet-api (manual)
  • openshift-hyperfleet/hyperfleet-sentinel (manual)
  • openshift-hyperfleet/hyperfleet-adapter (manual)
  • openshift-hyperfleet/hyperfleet-broker (manual)

Comment thread e2e/cluster/perf_create_latency.go
Comment thread e2e/cluster/perf_read_entity_size_latency.go
Comment thread e2e/nodepool/perf_delete_latency.go Outdated
Comment thread perf/parse-report.sh Outdated
Comment thread perf/run-in-cluster.sh Outdated
Comment thread perf/seed-clusters.sh
Comment thread perf/seed-clusters.sh
@pnguyen44

Copy link
Copy Markdown
Contributor Author

/retest-required

Comment thread perf/README.md
Confirm the adapters are subscribed to the correct topic (not `placeholder`):

```bash
kubectl exec -n hyperfleet deploy/adapter1-hyperfleet-adapter -- env | grep BROKER_TOPIC

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verification step tells users to check BROKER_TOPIC env var:

kubectl exec -n hyperfleet deploy/adapter1-hyperfleet-adapter -- env | grep BROKER_TOPIC

But from the v1.0.0 breaking-changes checklist (item #15), the Sentinel's BROKER_TOPIC env var was removed in v1.0.0. These perf tests target v1.0.0. This verification command may return nothing, leading users to think the adapter isn't connected when it actually is — the topic is now configured via config file, not env var.

Needs verification: does the adapter (not Sentinel) deployment template still inject BROKER_TOPIC? If not, this step needs updating.

for _, size := range sizes {
ginkgo.It("should read a "+size.name+" cluster within acceptable latency", func(ctx context.Context) {
ginkgo.By("creating a " + size.name + " cluster")
cluster, err := h.Client.CreateClusterFromPayload(ctx, h.TestDataPath(size.payload))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entity-size test creates clusters and reads them immediately without waiting for Reconciled. The other lifecycle tests (create, update, delete) all wait for Reconciled first.

A freshly-created cluster has only the spec — no adapter statuses, no conditions, no data from adapter reports. A reconciled cluster has all of that, making the JSON response larger. So this measures read latency on a smaller response than a production read would return.

If the intent is "does a bigger spec affect read latency" — fine as-is. If it's "realistic read latency of a fully-provisioned cluster" — it's undercounting. A one-line comment clarifying the intent would help.

_, err := h.Client.PatchClusterFromPayload(ctx, clusterID, h.TestDataPath("payloads/clusters/cluster-patch.json"))
Expect(err).NotTo(HaveOccurred())

Eventually(h.PollCluster(ctx, clusterID), h.Cfg.Timeouts.Cluster.Reconciled, h.Cfg.Polling.Interval).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential false near-zero latency from stale condition read.

After PATCH (line 44), generation increments from 1 to 2. Line 47-48 immediately polls for Reconciled=True. But Reconciled=True was already set from gen 1 reconciliation. HaveResourceCondition checks type and status but not observed_generation — so stale True at gen 1 is indistinguishable from fresh True at gen 2.

If the API doesn't synchronously flip Reconciled=False during the PATCH handler, the first poll reads the stale True and the test reports near-zero latency.

Suggested fix — wait for False first, then True:

// Confirm API detected generation mismatch
Eventually(h.PollCluster(ctx, clusterID), h.Cfg.Timeouts.Cluster.Reconciled, h.Cfg.Polling.Interval).
    Should(helper.HaveResourceCondition(client.ConditionTypeReconciled, openapi.ResourceConditionStatusFalse))

// Then wait for re-reconciliation at new generation
Eventually(h.PollCluster(ctx, clusterID), h.Cfg.Timeouts.Cluster.Reconciled, h.Cfg.Polling.Interval).
    Should(helper.HaveResourceCondition(client.ConditionTypeReconciled, openapi.ResourceConditionStatusTrue))

Or add a generation-aware matcher that checks observed_generation == resource.generation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants