ci: add testnet release orchestrator by ben-dz · Pull Request #3879 · malbeclabs/doublezero

ben-dz · 2026-06-11T19:21:31Z

Summary of Changes

Add release.testnet.yml, a workflow_dispatch orchestrator that drives a full testnet release from one run: preflight checks (version format, devnet daily green) → automated version-bump PRs in doublezero and infra → human gate (testnet-release-gate) that verifies both PRs merged → 9-component tag push → CloudSmith package polling → Solana program build (serviceability default, telemetry/geolocation --features testnet) and staging via infra → human gate (testnet-program-deploy) around the manual program deploy → onchain version verification → infra core/client deploys and QA → Slack announce, with a failure notifier on every job. Supports dry_run (draft PRs, no tags, check-mode deploys) and skip_devnet_check inputs.
Add scripts/release/bump-version.sh: bumps the single workspace version in Cargo.toml, runs cargo update --workspace, and promotes the CHANGELOG ## Unreleased section into a dated, compare-linked version section with a fresh empty skeleton above it.
Add .github/actions/dispatch-and-wait, a composite action that dispatches a workflow_dispatch workflow in another repo with a generated correlation_id, locates the run by its run-name, and watches it to completion.
Add an optional skip_existing input (default false, today's behavior unchanged) to release.testnet.push.tags.yml so orchestrator re-runs succeed past already-pushed tags.
Add docs/testnet-release.md, the stage-by-stage runbook (what is automated vs. what the human does, dry-run mode and cleanup, recovery semantics).
This replaces the manual multi-step testnet release process with a single resumable run that pauses at the two points that genuinely need a human: merging/approving the version PRs and deploying the Solana programs.
The infra-side pieces this references (scripts/bump-testnet-versions.sh, stage-programs.testnet.yml, correlation_id inputs on the deploy/QA workflows) are runtime-only dependencies built under the sibling sub-issue of malbeclabs/infra#1636. The RELEASE_BOT_* / SLACK_TESTNET_ALERTS_WEBHOOK secrets and the testnet-release-gate / testnet-program-deploy environments will be created before first run.
Fixes malbeclabs/infra#1637
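
For readers who want a concrete picture of the CHANGELOG promotion described for bump-version.sh, here is a minimal sketch. It is not the script from this PR. The function names, the strict X.Y.Z version check, and the repo_url parameter are assumptions made for illustration, and the "fresh empty skeleton" is reduced to a bare ## Unreleased heading because the real skeleton is not shown here.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the contents of scripts/release/bump-version.sh.

# Succeeds when the argument looks like X.Y.Z (assumed release version format).
is_release_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Reads a changelog on stdin and writes the promoted changelog to stdout.
# The first "## Unreleased" heading is kept as an empty section, and a dated,
# compare-linked version heading is inserted below it so that the previous
# Unreleased entries now belong to the new version.
promote_changelog() {
  local prev="$1" new="$2" date="$3" repo_url="$4"
  awk -v prev="$prev" -v new="$new" -v date="$date" -v url="$repo_url" '
    !promoted && /^## Unreleased *$/ {
      print "## Unreleased"
      print ""
      printf "## [v%s](%s/compare/client/v%s...client/v%s) - %s\n", new, url, prev, new, date
      promoted = 1
      next
    }
    { print }
  '
}
```

A caller would typically validate first, then rewrite the file through a temporary copy, for example `is_release_version "$NEW" && promote_changelog "$PREV" "$NEW" "$(date -u +%F)" "$REPO_URL" < CHANGELOG.md > CHANGELOG.md.tmp`, where PREV, NEW, and REPO_URL are placeholders.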

Testing Verification

./scripts/release/bump-version.sh 0.99.0 run on a real checkout: the diff touched exactly Cargo.toml, Cargo.lock, CHANGELOG.md; the CHANGELOG gained a fresh empty ## Unreleased skeleton above a new ## [v0.99.0](.../compare/client/v0.27.1...client/v0.99.0) - 2026-06-11 header that took ownership of the previous Unreleased content; changes then reverted.
shellcheck 0.11.0 on bump-version.sh: zero findings. actionlint 1.7.12 on release.testnet.yml and release.testnet.push.tags.yml (includes shellcheck of embedded run: blocks): zero errors. The composite action YAML parses and is structurally valid.
The verify-onchain grep was validated against the client source rather than live testnet (no testnet access from this environment): smartcontract/cli/src/version.rs prints program version: X.Y.Z, which grep -qi "program version.*$VERSION" matches. A reviewer with testnet access may want to confirm against real doublezero --env testnet version output.
No workflows were dispatched.
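
One property of the verify-onchain grep is worth illustrating: because the pattern is an unanchored regex, "program version.*0.27.1" would also accept output reporting 0.27.10. The sketch below shows a stricter variant. It assumes the CLI prints the version as "program version: X.Y.Z" (as stated above for smartcontract/cli/src/version.rs); the function name version_matches is hypothetical and this is not what the workflow currently runs.

```bash
#!/usr/bin/env bash
# Sketch only: a stricter match for "program version: X.Y.Z" output.

# Succeeds when the output reports exactly the expected version.
version_matches() {
  local output="$1" version="$2" escaped
  # Escape the dots so they match literally instead of matching any character.
  escaped="$(printf '%s' "$version" | sed 's/\./\\./g')"
  # Require a non-version character (or end of line) right after the version,
  # so 0.27.1 does not match 0.27.10.
  printf '%s\n' "$output" | grep -Eqi "program version: *${escaped}([^0-9.]|\$)"
}
```

Usage would look like `version_matches "$(doublezero --env testnet version)" "$VERSION"`, still subject to the same caveat that the real output format should be confirmed by someone with testnet access.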

Review findings (for the spec authors)

Independent architecture and security reviews were posted to the tracking issue (malbeclabs/infra#1637). No critical findings. Two High architecture findings flag trade-offs the issue spec itself chose, so they are surfaced here rather than changed unilaterally:

Single release commit is not pinned: build-programs checks out floating main and each tag job tags main at its own checkout time, so a commit landing between gate approval and the builds can make tags/packages/programs diverge. Fixable by resolving the version-bump PR's merge SHA in gate-tags and threading it to the build checkout and a new ref input on the tag workflow.
Orchestrator re-runs can double-dispatch infra deploys: if gh run watch fails while the infra run continues, "Re-run failed jobs" dispatches a second run. Mitigable by having dispatch-and-wait adopt an in-progress run matching cid-${GITHUB_RUN_ID}-, or by requiring concurrency groups on the infra workflows.

Both can land as small follow-ups if wanted.
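
To make the second mitigation concrete, here is a rough sketch of how dispatch-and-wait could decide whether to adopt an existing run before dispatching a new one. The helper names and the tab-separated input format are assumptions for illustration; only the cid-${GITHUB_RUN_ID}- prefix comes from the finding above.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for adopting an in-flight run on re-run.

# Builds the correlation prefix shared by every dispatch from one orchestrator run.
cid_prefix() {
  printf 'cid-%s-' "$1"
}

# Reads lines of "<run id><TAB><status><TAB><run title>" on stdin and prints the
# id of the first run that is not completed and whose title contains the prefix.
# Returns 1 when there is nothing to adopt, in which case the caller dispatches.
pick_adoptable_run() {
  local prefix="$1" id status title tab
  tab="$(printf '\t')"
  while IFS="$tab" read -r id status title; do
    if [ "$status" = "completed" ]; then
      continue
    fi
    case "$title" in
      *"$prefix"*) printf '%s\n' "$id"; return 0 ;;
    esac
  done
  return 1
}

# Possible wiring (REPO and WORKFLOW are placeholders):
#   gh run list --repo "$REPO" --workflow "$WORKFLOW" \
#     --json databaseId,status,displayTitle \
#     --jq '.[] | [.databaseId, .status, .displayTitle] | @tsv' \
#     | pick_adoptable_run "$(cid_prefix "$GITHUB_RUN_ID")"
```

The trailing dash in the prefix matters: it keeps run 55 from adopting a run that was dispatched by run 555.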

Size note

583 added lines, 517 excluding docs — over the repo's ~500-line guideline, as anticipated and accepted in the tracking issue ("the PR is large (~500 lines, mostly workflow YAML); that is expected and accepted for this change"). The five pieces are interdependent (the orchestrator consumes the script, the action, and the skip_existing input), so splitting would leave unmergeable fragments.

ben-dz

Reviewed against the plan in malbeclabs/infra#1636 / #1637. The implementation is faithful to the spec, and the cross-PR contracts with malbeclabs/infra#1640 all line up (input names, artifact name solana-programs-vX.Y.Z, staging path, bot branch names, script paths). The two disclosed architecture trade-offs are assessed below. One finding the spec didn't cover needs changes before merge.

Infra-side environment approvals (pre-merge changes requested)

infra's testnet environment has required reviewers (verified via the API). The infra jobs behind stage-programs, deploy-core, and deploy-clients all run with environment: testnet, so every run this orchestrator dispatches pauses for an approval in the infra repo before it executes. Three changes:

Bump stage-programs timeout-minutes: 30 → 120. A human approval sits inside that window; 30 minutes will time out on any slow approval, and a re-run would dispatch a second staging run.
Post an approval request to Slack #int-tech with a link to the waiting run. Suggested shape: give dispatch-and-wait an optional slack-webhook input; when set, after the action locates the dispatched run it posts "infra run may need testnet environment approval: <run URL>" before starting to watch. The orchestrator passes a webhook pointed at #int-tech for the three infra dispatches (new secret, e.g. SLACK_INT_TECH_WEBHOOK). Without this, the pipeline silently stalls until someone happens to look at infra's Actions tab.
Document these approvals in the runbook. The stage table covers gate 1, the doublezero testnet prompt at tags, and gate 2, but not the three infra-side approvals. A real release is ~6–7 approval interactions across two repos; the runbook should say so.

Disclosed findings — agree, fine as follow-ups

Unpinned release commit: agree with the assessment and the proposed fix (resolve the bump PR's merge SHA in gate-tags, thread it to the build checkout and a ref input on the tag workflow). Land it before the team leans on the orchestrator heavily; an active main makes the divergence window real.
Double-dispatch on re-run: agree. Cheap mitigation inside dispatch-and-wait: before dispatching, adopt an in-progress run whose title matches cid-${GITHUB_RUN_ID}-. Matters most for the send-it deploys.

Minor

verify-cloudsmith: with set -euo pipefail, one transient CloudSmith API error kills the job instead of counting as a failed poll attempt. Consider tolerating per-query errors inside the loop so only the timeout fails the job. (Job is re-runnable, so annoyance not breakage.)
verify-onchain: pin apt-get install -y doublezero=${VERSION}-1 instead of latest, so the check is self-consistent if repo metadata lags.
DEPLOY.md says "set the onchain version" generically, but only serviceability has a settable version account (telemetry has none; geolocation has no CLI setter). Worth tightening so the deployer isn't hunting for commands that don't exist.
Preflight's gh run list under the restricted GITHUB_TOKEN is fine because this repo is public; if it ever 403s, add actions: read to the top-level permissions.

Verification claims spot-checked: bump-script test/revert, actionlint/shellcheck runs, the verify-onchain grep validated against smartcontract/cli/src/version.rs output format, and the target/deploy/ path matching the devnet smartcontract workflow. All credible; the dry run covers the rest.


ben-dz · 2026-06-11T19:49:03Z

Addressed in ed53cd1.

Infra-side environment approvals (the pre-merge items):

stage-programs timeout bumped 30 → 120 minutes, with a comment explaining the human approval inside the window.
dispatch-and-wait gains an optional slack-webhook input. After the action locates the dispatched run, it posts the run URL to the webhook before starting to watch ("… run in … may need environment approval: <run URL>"). A failed ping is a warning, not a job failure. The orchestrator passes secrets.SLACK_INT_TECH_WEBHOOK on the three infra dispatches (stage-programs, deploy-core, deploy-clients). Action needed: create the SLACK_INT_TECH_WEBHOOK repo secret (incoming webhook pointed at #int-tech) before the first real run. Until it exists, the input is empty and the ping is silently skipped.
Runbook updated: the intro now states a release is ~6–7 approval interactions across the two repos and lists them; the stage-programs, deploy-core, and deploy-clients rows in the stage table now call out the infra testnet environment approval with the #int-tech link.
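
The ping semantics above (skip when the webhook is empty, warn but never fail when the post fails) can be sketched roughly as follows. This is an illustration, not the code in the action: the function names and the injectable sender argument are hypothetical, the message text follows the shape suggested in the review, and the payload builder assumes the run URL contains no quotes or backslashes.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical shape of the optional Slack ping in dispatch-and-wait.

# Default sender: posts a JSON body to a Slack incoming webhook.
post_json() {
  curl -fsS -X POST -H 'Content-Type: application/json' --data "$2" "$1" >/dev/null
}

# Always returns 0 so a Slack outage cannot fail the release job.
ping_approval_channel() {
  local webhook="$1" run_url="$2" sender="${3:-post_json}" payload
  # Empty input (for example, the secret is not created yet): skip silently.
  if [ -z "$webhook" ]; then
    return 0
  fi
  # Assumption: run_url needs no JSON escaping.
  payload="$(printf '{"text":"infra run may need testnet environment approval: %s"}' "$run_url")"
  if ! "$sender" "$webhook" "$payload"; then
    # GitHub Actions workflow command: surfaces a warning annotation on the run.
    echo "::warning::Slack approval ping failed; continuing to watch the run"
  fi
  return 0
}
```

Passing the sender as an argument is only there to make the behaviour easy to exercise without a real webhook; a composite action would more likely inline the curl call.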

Minors, also done:

verify-cloudsmith: per-query CloudSmith errors now count as a failed poll attempt (|| RPM=0 / || DEB=0) instead of killing the job; only the timeout fails it.
verify-onchain: pins apt-get install -y doublezero=${version}-1.
DEPLOY.md: now gives the exact command (doublezero --env testnet global-config set-version --min-compatible-version <X.Y.Z>) and notes serviceability is the only program with a settable version account (telemetry has none, geolocation has no CLI setter).
Preflight permissions: left as is per your note (public repo).
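
For context on the verify-cloudsmith change, the pattern of treating a failed query as a missed poll attempt can be sketched as below. The function name, its arguments, and the idea of passing the query as a command are assumptions for illustration; the real job queries CloudSmith for RPM and DEB packages and uses the || RPM=0 / || DEB=0 fallbacks mentioned above.

```bash
#!/usr/bin/env bash
# Sketch only: a poll loop that stays alive under `set -euo pipefail` when a
# single query fails. The query command is supplied by the caller.

# Usage: poll_until_published <attempts> <delay seconds> <query command...>
# The query is expected to print the number of matching packages.
poll_until_published() {
  local attempts="$1" delay="$2" i=1 count
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # A failing query counts as "nothing found yet" instead of aborting the job.
    count="$("$@" 2>/dev/null)" || count=0
    # Treat empty or non-numeric output the same way.
    case "$count" in
      '' | *[!0-9]*) count=0 ;;
    esac
    if [ "$count" -gt 0 ]; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  # Only running out of attempts fails the caller.
  return 1
}
```

With this shape, a transient API error costs one poll interval, and the job fails only when the whole polling window elapses without the package appearing.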

Follow-ups (not in this PR, as agreed): pinning the release commit through gate-tags → build/tags, and dispatch-and-wait adopting an in-progress cid-${GITHUB_RUN_ID}- run on re-run.

ben-dz added 5 commits June 11, 2026 19:13

release: add testnet version bump script

0a0ddb9

ci: add dispatch-and-wait composite action

55fbe26

ci: add skip_existing input to testnet tag push workflow

623faf6

ci: add testnet release orchestrator workflow

790796b

docs: add testnet release runbook

7bd74b9

ben-dz added the skip-changelog label Jun 11, 2026

changelog: add unreleased entry for testnet release orchestrator

86d72db


ci: handle infra-side environment approvals in testnet release orchestrator

ed53cd1

ben-dz marked this pull request as ready for review June 11, 2026 20:00

ben-dz wants to merge 7 commits into main from bdz/infra-1637
