ci: add testnet release orchestrator #3879

Open
ben-dz wants to merge 7 commits into main from bdz/infra-1637

@ben-dz ben-dz commented Jun 11, 2026

Summary of Changes

  • Add release.testnet.yml, a workflow_dispatch orchestrator that drives a full testnet release from one run: preflight checks (version format, devnet daily green) → automated version-bump PRs in doublezero and infra → human gate (testnet-release-gate) that verifies both PRs merged → 9-component tag push → CloudSmith package polling → Solana program build (serviceability default, telemetry/geolocation --features testnet) and staging via infra → human gate (testnet-program-deploy) around the manual program deploy → onchain version verification → infra core/client deploys and QA → Slack announce, with a failure notifier on every job. Supports dry_run (draft PRs, no tags, check-mode deploys) and skip_devnet_check inputs.
  • Add scripts/release/bump-version.sh: bumps the single workspace version in Cargo.toml, runs cargo update --workspace, and promotes the CHANGELOG ## Unreleased section into a dated, compare-linked version section with a fresh empty skeleton above it.
  • Add .github/actions/dispatch-and-wait, a composite action that dispatches a workflow_dispatch workflow in another repo with a generated correlation_id, locates the run by its run-name, and watches it to completion.
  • Add an optional skip_existing input (default false, today's behavior unchanged) to release.testnet.push.tags.yml so orchestrator re-runs succeed past already-pushed tags.
  • Add docs/testnet-release.md, the stage-by-stage runbook (what is automated vs. what the human does, dry-run mode and cleanup, recovery semantics).
  • This replaces the manual multi-step testnet release process with a single resumable run that pauses at the two points that genuinely need a human: merging/approving the version PRs and deploying the Solana programs.
  • The infra-side pieces this references (scripts/bump-testnet-versions.sh, stage-programs.testnet.yml, correlation_id inputs on the deploy/QA workflows) are runtime-only dependencies built under the sibling sub-issue of malbeclabs/infra#1636. The RELEASE_BOT_* / SLACK_TESTNET_ALERTS_WEBHOOK secrets and the testnet-release-gate / testnet-program-deploy environments will be created before first run.
  • Fixes malbeclabs/infra#1637
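
As an illustration of the CHANGELOG promotion described in the bump-version.sh bullet, here is a minimal sketch. It is not the contents of scripts/release/bump-version.sh: the function name `promote_unreleased`, its arguments, and the one-line empty skeleton are assumptions, and the compare base URL is passed in rather than guessed. Only the header shape (`## [vX.Y.Z](<base>/compare/client/vA.B.C...client/vX.Y.Z) - DATE`) follows what this PR describes.

```bash
#!/usr/bin/env bash
set -euo pipefail

# promote_unreleased FILE NEW_VERSION PREV_VERSION DATE COMPARE_BASE
# (hypothetical helper) Replaces the first "## Unreleased" header with a fresh
# empty Unreleased section followed by a dated, compare-linked version header,
# so the previous Unreleased content ends up under the new version.
promote_unreleased() {
  local file=$1 new=$2 prev=$3 date=$4 base=$5
  local header="## [v${new}](${base}/compare/client/v${prev}...client/v${new}) - ${date}"

  # Refuse to touch a CHANGELOG that has no Unreleased section.
  if ! grep -qx '## Unreleased' "$file"; then
    echo "no '## Unreleased' section in $file" >&2
    return 1
  fi

  local tmp
  tmp=$(mktemp)
  awk -v header="$header" '
    !done && $0 == "## Unreleased" {
      print "## Unreleased"   # fresh, empty section (assumed skeleton)
      print ""
      print header            # new owner of the old Unreleased content
      done = 1
      next
    }
    { print }
  ' "$file" > "$tmp"
  mv "$tmp" "$file"
}
```

A call such as `promote_unreleased CHANGELOG.md 0.99.0 0.27.1 2026-06-11 "$REPO_URL"` (with `REPO_URL` supplied by the caller) would produce the header form quoted under Testing Verification.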

Testing Verification

  • ./scripts/release/bump-version.sh 0.99.0 run on a real checkout: the diff touched exactly Cargo.toml, Cargo.lock, CHANGELOG.md; the CHANGELOG gained a fresh empty ## Unreleased skeleton above a new ## [v0.99.0](.../compare/client/v0.27.1...client/v0.99.0) - 2026-06-11 header that took ownership of the previous Unreleased content; changes then reverted.
  • shellcheck 0.11.0 on bump-version.sh: zero findings. actionlint 1.7.12 on release.testnet.yml and release.testnet.push.tags.yml (includes shellcheck of embedded run: blocks): zero errors. The composite action YAML parses and is structurally valid.
  • The verify-onchain grep was validated against the client source rather than live testnet (no testnet access from this environment): smartcontract/cli/src/version.rs prints program version: X.Y.Z, which grep -qi "program version.*$VERSION" matches. A reviewer with testnet access may want to confirm against real doublezero --env testnet version output.
  • No workflows were dispatched.
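
One detail of the verify-onchain check worth illustrating: in `grep -qi "program version.*$VERSION"`, the dots in the version are regex wildcards and nothing bounds the match, so a looser string than intended can pass. The sketch below is one possible tightening, not what the workflow does; `version_matches` is a hypothetical helper, and it assumes the CLI prints the `program version: X.Y.Z` form noted above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# version_matches OUTPUT VERSION
# (hypothetical helper) Succeeds only when OUTPUT contains
# "program version" followed by exactly VERSION.
version_matches() {
  local output=$1 version=$2
  # Escape the dots so "0.99.0" cannot match "0x99y0".
  local escaped=${version//./\\.}
  # [^0-9]* keeps "10.99.0" from matching "0.99.0";
  # the trailing group keeps "0.99.01" from matching "0.99.0".
  grep -Eqi "program version[^0-9]*${escaped}([^0-9.]|\$)" <<<"$output"
}
```

In a workflow step this could be used as `version_matches "$(doublezero --env testnet version)" "$VERSION"`, pending the confirmation against real testnet output suggested above.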

Review findings (for the spec authors)

Independent architecture and security reviews were posted to the tracking issue (malbeclabs/infra#1637). No critical findings. Two High architecture findings flag trade-offs the issue spec itself chose, so they are surfaced here rather than changed unilaterally:

  • Single release commit is not pinned: build-programs checks out floating main and each tag job tags main at its own checkout time, so a commit landing between gate approval and the builds can make tags/packages/programs diverge. Fixable by resolving the version-bump PR's merge SHA in gate-tags and threading it to the build checkout and a new ref input on the tag workflow.
  • Orchestrator re-runs can double-dispatch infra deploys: if gh run watch fails while the infra run continues, "Re-run failed jobs" dispatches a second run. Mitigable by having dispatch-and-wait adopt an in-progress run matching cid-${GITHUB_RUN_ID}-, or by requiring concurrency groups on the infra workflows.

Both can land as small follow-ups if wanted.

Size note

583 added lines, 517 excluding docs — over the repo's ~500-line guideline, as anticipated and accepted in the tracking issue ("the PR is large (~500 lines, mostly workflow YAML); that is expected and accepted for this change"). The five pieces are interdependent (the orchestrator consumes the script, the action, and the skip_existing input), so splitting would leave unmergeable fragments.

@ben-dz ben-dz left a comment

Reviewed against the plan in malbeclabs/infra#1636 / #1637. The implementation is faithful to the spec, and the cross-PR contracts with malbeclabs/infra#1640 all line up (input names, artifact name solana-programs-vX.Y.Z, staging path, bot branch names, script paths). The two disclosed architecture trade-offs are assessed below. One finding the spec didn't cover needs changes before merge.

Infra-side environment approvals (pre-merge changes requested)

infra's testnet environment has required reviewers (verified via the API). The infra jobs behind stage-programs, deploy-core, and deploy-clients all run with environment: testnet, so every run this orchestrator dispatches pauses for an approval in the infra repo before it executes. Three changes:

  1. Bump stage-programs timeout-minutes: 30 → 120. A human approval sits inside that window; 30 minutes will time out on any slow approval, and a re-run would dispatch a second staging run.
  2. Post an approval request to Slack #int-tech with a link to the waiting run. Suggested shape: give dispatch-and-wait an optional slack-webhook input; when set, after the action locates the dispatched run it posts "infra run may need testnet environment approval: <run URL>" before starting to watch. The orchestrator passes a webhook pointed at #int-tech for the three infra dispatches (new secret, e.g. SLACK_INT_TECH_WEBHOOK). Without this, the pipeline silently stalls until someone happens to look at infra's Actions tab.
  3. Document these approvals in the runbook. The stage table covers gate 1, the doublezero testnet prompt at tags, and gate 2, but not the three infra-side approvals. A real release is ~6–7 approval interactions across two repos; the runbook should say so.

Disclosed findings — agree, fine as follow-ups

  • Unpinned release commit: agree with the assessment and the proposed fix (resolve the bump PR's merge SHA in gate-tags, thread it to the build checkout and a ref input on the tag workflow). Land it before the team leans on the orchestrator heavily; an active main makes the divergence window real.
  • Double-dispatch on re-run: agree. Cheap mitigation inside dispatch-and-wait: before dispatching, adopt an in-progress run whose title matches cid-${GITHUB_RUN_ID}-. Matters most for the send-it deploys.
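
The adopt-before-dispatch idea could look roughly like the sketch below. It is an assumption-laden illustration, not the action's code: `find_adoptable_run` is a hypothetical helper, and the tab-separated `run_id<TAB>run_title` input is an assumed intermediate format that something like `gh run list --json databaseId,displayTitle --jq ...` might feed it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# find_adoptable_run PREFIX
# (hypothetical helper) Reads "run_id<TAB>run_title" lines on stdin and prints
# the id of the first run whose title contains PREFIX. Returns 1 when no run
# matches, which tells the caller to dispatch a new run instead.
find_adoptable_run() {
  local prefix=$1 id title
  while IFS=$'\t' read -r id title; do
    if [[ $title == *"$prefix"* ]]; then
      printf '%s\n' "$id"
      return 0
    fi
  done
  return 1
}
```

With the prefix set to `cid-${GITHUB_RUN_ID}-`, the trailing hyphen keeps run 42 from adopting a run that belongs to run 4242. A caller would watch the returned id when the function succeeds and fall through to a normal dispatch when it returns 1.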

Minor

  • verify-cloudsmith: with set -euo pipefail, one transient CloudSmith API error kills the job instead of counting as a failed poll attempt. Consider tolerating per-query errors inside the loop so only the timeout fails the job. (Job is re-runnable, so annoyance not breakage.)
  • verify-onchain: pin apt-get install -y doublezero=${VERSION}-1 instead of latest, so the check is self-consistent if repo metadata lags.
  • DEPLOY.md says "set the onchain version" generically, but only serviceability has a settable version account (telemetry has none; geolocation has no CLI setter). Worth tightening so the deployer isn't hunting for commands that don't exist.
  • Preflight's gh run list under the restricted GITHUB_TOKEN is fine because this repo is public; if it ever 403s, add actions: read to the top-level permissions.

Verification claims spot-checked: bump-script test/revert, actionlint/shellcheck runs, the verify-onchain grep validated against smartcontract/cli/src/version.rs output format, and the target/deploy/ path matching the devnet smartcontract workflow. All credible; the dry run covers the rest.

ben-dz commented Jun 11, 2026

Addressed in ed53cd1.

Infra-side environment approvals (the pre-merge items):

  1. stage-programs timeout bumped 30 → 120 minutes, with a comment explaining the human approval inside the window.
  2. dispatch-and-wait gains an optional slack-webhook input. After the action locates the dispatched run, it posts the run URL to the webhook before starting to watch ("<workflow> run in <repo> may need environment approval: <run URL>"). A failed ping is a warning, not a job failure. The orchestrator passes secrets.SLACK_INT_TECH_WEBHOOK on the three infra dispatches (stage-programs, deploy-core, deploy-clients). Action needed: create the SLACK_INT_TECH_WEBHOOK repo secret (incoming webhook pointed at #int-tech) before the first real run — until it exists the input is empty and the ping is silently skipped.
  3. Runbook updated: the intro now states a release is ~6–7 approval interactions across the two repos and lists them; the stage-programs, deploy-core, and deploy-clients rows in the stage table now call out the infra testnet environment approval with the #int-tech link.
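
The non-fatal ping semantics in item 2 (an empty webhook is skipped silently, a failed post is only a warning) can be sketched as follows. This is an illustration under assumptions, not the code in ed53cd1: `notify_approval` and `json_escape` are hypothetical names, and the escaping handles only backslashes and double quotes.

```bash
#!/usr/bin/env bash
set -euo pipefail

# json_escape STRING
# (hypothetical helper) Escapes backslashes and double quotes for embedding in
# a JSON string. Newlines and control characters are not handled here.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# notify_approval WEBHOOK_URL MESSAGE
# (hypothetical helper) Posts MESSAGE to a Slack incoming webhook.
# Always returns 0 so a Slack outage cannot fail the job.
notify_approval() {
  local webhook=$1 message=$2
  if [[ -z $webhook ]]; then
    return 0 # secret not configured: skip silently
  fi
  local payload
  payload=$(printf '{"text":"%s"}' "$(json_escape "$message")")
  if ! curl -fsS --max-time 10 -X POST \
    -H 'Content-Type: application/json' \
    --data "$payload" "$webhook" >/dev/null; then
    # GitHub Actions renders this workflow command as a warning annotation.
    echo "::warning::Slack approval ping failed; continuing"
  fi
  return 0
}
```

Keeping the function's exit status at 0 in every branch is the design point: the ping is advisory, so only the watched infra run should decide whether the orchestrator job fails.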

Minors, also done:

  • verify-cloudsmith: per-query CloudSmith errors now count as a failed poll attempt (|| RPM=0 / || DEB=0) instead of killing the job; only the timeout fails it.
  • verify-onchain: pins apt-get install -y doublezero=${version}-1.
  • DEPLOY.md: now gives the exact command (doublezero --env testnet global-config set-version --min-compatible-version <X.Y.Z>) and notes serviceability is the only program with a settable version account (telemetry has none, geolocation has no CLI setter).
  • Preflight permissions: left as is per your note (public repo).
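
The `|| RPM=0` / `|| DEB=0` pattern generalizes to a polling loop in which a failed query is just another unsuccessful attempt. The sketch below shows that shape under assumptions: `poll_until_published` is a hypothetical helper, and the query command is assumed to print a package count on stdout.

```bash
#!/usr/bin/env bash
set -euo pipefail

# poll_until_published QUERY_CMD MAX_ATTEMPTS SLEEP_SECONDS
# (hypothetical helper) Runs QUERY_CMD until it prints a positive integer.
# A query that exits non-zero is treated as "not published yet" rather than
# aborting under set -e; only exhausting MAX_ATTEMPTS returns failure.
poll_until_published() {
  local query_cmd=$1 max_attempts=$2 sleep_seconds=$3
  local attempt count
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    count=$("$query_cmd") || count=0
    if [[ $count =~ ^[1-9][0-9]*$ ]]; then
      return 0
    fi
    if ((attempt < max_attempts)); then
      sleep "$sleep_seconds"
    fi
  done
  return 1
}
```

The `|| count=0` on the assignment is what keeps a transient API error from killing the job, matching the behavior described in the verify-cloudsmith bullet.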

Follow-ups (not in this PR, as agreed): pinning the release commit through gate-tags → build/tags, and dispatch-and-wait adopting an in-progress cid-${GITHUB_RUN_ID}- run on re-run.

@ben-dz ben-dz marked this pull request as ready for review June 11, 2026 20:00