ci: add testnet release orchestrator #3879
ben-dz left a comment
Reviewed against the plan in malbeclabs/infra#1636 / #1637. The implementation is faithful to the spec, and the cross-PR contracts with malbeclabs/infra#1640 all line up (input names, artifact name solana-programs-vX.Y.Z, staging path, bot branch names, script paths). The two disclosed architecture trade-offs are assessed below. One finding the spec didn't cover needs changes before merge.
Infra-side environment approvals (pre-merge changes requested)
infra's testnet environment has required reviewers (verified via the API). The infra jobs behind stage-programs, deploy-core, and deploy-clients all run with environment: testnet, so every run this orchestrator dispatches pauses for an approval in the infra repo before it executes. Three changes:
- Bump `stage-programs` `timeout-minutes`: 30 → 120. A human approval sits inside that window; 30 minutes will time out on any slow approval, and a re-run would dispatch a second staging run.
- Post an approval request to Slack `#int-tech` with a link to the waiting run. Suggested shape: give `dispatch-and-wait` an optional `slack-webhook` input; when set, after the action locates the dispatched run it posts "infra run may need `testnet` environment approval:" followed by the run link, before starting to watch. The orchestrator passes a webhook pointed at `#int-tech` for the three infra dispatches (new secret, e.g. `SLACK_INT_TECH_WEBHOOK`). Without this, the pipeline silently stalls until someone happens to look at infra's Actions tab.
- Document these approvals in the runbook. The stage table covers gate 1, the doublezero `testnet` prompt at tags, and gate 2, but not the three infra-side approvals. A real release is ~6–7 approval interactions across two repos; the runbook should say so.
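A minimal sketch of the Slack step suggested above, assuming a standard Slack incoming webhook that accepts a JSON body with a `text` field. The function names `approval_payload` and `notify_approval` are hypothetical, not part of `dispatch-and-wait` as written.

```shell
#!/usr/bin/env bash

# Build the JSON body for the approval request.
# Backslashes and double quotes are escaped so the result stays valid JSON.
approval_payload() {
  local run_url="$1"
  local text="infra run may need testnet environment approval: ${run_url}"
  text=${text//\\/\\\\}
  text=${text//\"/\\\"}
  printf '{"text":"%s"}' "$text"
}

# Post the request only when a webhook was supplied (the input is optional).
notify_approval() {
  local webhook="$1" run_url="$2"
  if [[ -z "$webhook" ]]; then
    return 0
  fi
  # A failed Slack post is reported but does not fail the dispatch itself.
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(approval_payload "$run_url")" "$webhook" >/dev/null \
    || echo "warning: Slack approval notification failed" >&2
}
```

Keeping the payload builder separate from the `curl` call makes the message text checkable without network access. Whether a failed notification should be a warning or a hard failure is a design choice left to the action's author.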
Disclosed findings — agree, fine as follow-ups
- Unpinned release commit: agree with the assessment and the proposed fix (resolve the bump PR's merge SHA in `gate-tags`, thread it to the build checkout and a `ref` input on the tag workflow). Land it before the team leans on the orchestrator heavily; an active main makes the divergence window real.
- Double-dispatch on re-run: agree. Cheap mitigation inside `dispatch-and-wait`: before dispatching, adopt an in-progress run whose title matches `cid-${GITHUB_RUN_ID}-`. Matters most for the send-it deploys.
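The adopt-before-dispatch mitigation could look roughly like the sketch below. It assumes "matches" means the run title contains `cid-${GITHUB_RUN_ID}-`, and it treats any run whose status is not `completed` (including one waiting on an environment approval) as adoptable. `pick_adoptable_run` and `find_adoptable_run` are hypothetical names; the `gh run list` JSON fields used are `databaseId`, `status`, and `displayTitle`.

```shell
#!/usr/bin/env bash

# Read "id<TAB>status<TAB>title" lines on stdin and print the id of the first
# run that has not completed and whose title contains the given marker.
# Returns 1 when no such run exists, so the caller knows to dispatch.
pick_adoptable_run() {
  local marker="$1" id status title
  while IFS=$'\t' read -r id status title; do
    if [[ "$status" != "completed" && "$title" == *"$marker"* ]]; then
      printf '%s\n' "$id"
      return 0
    fi
  done
  return 1
}

# Hypothetical wrapper: list recent runs of the target workflow with gh and
# hand them to the filter. Requires GITHUB_RUN_ID and an authenticated gh.
find_adoptable_run() {
  local repo="$1" workflow="$2"
  gh run list --repo "$repo" --workflow "$workflow" --limit 20 \
    --json databaseId,status,displayTitle \
    --jq '.[] | [.databaseId, .status, .displayTitle] | @tsv' \
    | pick_adoptable_run "cid-${GITHUB_RUN_ID}-"
}
```

A caller would try `find_adoptable_run` first and dispatch only when it returns non-zero, then watch whichever run id it ends up with.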
Minor
- `verify-cloudsmith`: with `set -euo pipefail`, one transient CloudSmith API error kills the job instead of counting as a failed poll attempt. Consider tolerating per-query errors inside the loop so only the timeout fails the job. (Job is re-runnable, so annoyance not breakage.)
- `verify-onchain`: pin `apt-get install -y doublezero=${VERSION}-1` instead of latest, so the check is self-consistent if repo metadata lags.
- `DEPLOY.md` says "set the onchain version" generically, but only serviceability has a settable version account (telemetry has none; geolocation has no CLI setter). Worth tightening so the deployer isn't hunting for commands that don't exist.
- Preflight's `gh run list` under the restricted `GITHUB_TOKEN` is fine because this repo is public; if it ever 403s, add `actions: read` to the top-level permissions.
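For the `verify-cloudsmith` item, one way to tolerate per-query errors under `set -euo pipefail` is to run each query as the condition of an `if`, since bash exempts tested commands from `set -e`. The sketch below is illustrative only: `poll_until` is a hypothetical helper, and the actual CloudSmith query is left to the caller.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll until the given command succeeds or the deadline passes.
# Any non-zero exit from the command, including a transient API error,
# counts as one failed attempt; only running out of time fails the function.
poll_until() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local deadline=$(( SECONDS + timeout_s ))
  while true; do
    # A command tested by `if` does not trigger `set -e` when it fails.
    if "$@"; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Example call (package_published is a hypothetical check function):
#   poll_until 1800 30 package_published "$VERSION"
```

One caveat: `set -e` is also suspended inside the check function while it runs as an `if` condition, so the check should return its status explicitly instead of relying on an early abort.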
Verification claims spot-checked: bump-script test/revert, actionlint/shellcheck runs, the verify-onchain grep validated against smartcontract/cli/src/version.rs output format, and the target/deploy/ path matching the devnet smartcontract workflow. All credible; the dry run covers the rest.
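As a small illustration of the `verify-onchain` check, the sketch below wraps the PR's `grep -qi "program version.*$VERSION"` pattern in a function. It adds one change that is an assumption, not part of the PR: dots in the version are escaped so that each `.` matches only a literal dot. `reports_program_version` is a hypothetical name.

```shell
#!/usr/bin/env bash

# Succeed when the given CLI output reports the expected program version.
# Matching is case-insensitive, as in the PR's grep.
reports_program_version() {
  local output="$1" version="$2"
  # Escape dots so "0.99.0" is matched literally, not as a regex wildcard.
  local escaped=${version//./\\.}
  grep -qi "program version.*${escaped}" <<<"$output"
}
```

In the job this would be called with the captured output of `doublezero --env testnet version`. The pattern is still unanchored, so it checks that the version string appears after "program version", not that nothing follows it.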
|
Addressed in ed53cd1. Infra-side environment approvals (the pre-merge items):
Minors, also done:
Follow-ups (not in this PR, as agreed): pinning the release commit through gate-tags → build/tags, and dispatch-and-wait adopting an in-progress |
Summary of Changes
- `release.testnet.yml`, a `workflow_dispatch` orchestrator that drives a full testnet release from one run: preflight checks (version format, devnet daily green) → automated version-bump PRs in doublezero and infra → human gate (`testnet-release-gate`) that verifies both PRs merged → 9-component tag push → CloudSmith package polling → Solana program build (serviceability default, telemetry/geolocation `--features testnet`) and staging via infra → human gate (`testnet-program-deploy`) around the manual program deploy → onchain version verification → infra core/client deploys and QA → Slack announce, with a failure notifier on every job. Supports `dry_run` (draft PRs, no tags, check-mode deploys) and `skip_devnet_check` inputs.
- `scripts/release/bump-version.sh`: bumps the single workspace version in `Cargo.toml`, runs `cargo update --workspace`, and promotes the CHANGELOG `## Unreleased` section into a dated, compare-linked version section with a fresh empty skeleton above it.
- `.github/actions/dispatch-and-wait`, a composite action that dispatches a `workflow_dispatch` workflow in another repo with a generated `correlation_id`, locates the run by its run-name, and watches it to completion.
- A `skip_existing` input (default `false`, today's behavior unchanged) added to `release.testnet.push.tags.yml` so orchestrator re-runs succeed past already-pushed tags.
- `docs/testnet-release.md`, the stage-by-stage runbook (what is automated vs. what the human does, dry-run mode and cleanup, recovery semantics).
- The infra-side pieces (`scripts/bump-testnet-versions.sh`, `stage-programs.testnet.yml`, `correlation_id` inputs on the deploy/QA workflows) are runtime-only dependencies built under the sibling sub-issue of malbeclabs/infra#1636. The `RELEASE_BOT_*`/`SLACK_TESTNET_ALERTS_WEBHOOK` secrets and the `testnet-release-gate`/`testnet-program-deploy` environments will be created before first run.

Testing Verification
- `./scripts/release/bump-version.sh 0.99.0` run on a real checkout: the diff touched exactly `Cargo.toml`, `Cargo.lock`, `CHANGELOG.md`; the CHANGELOG gained a fresh empty `## Unreleased` skeleton above a new `## [v0.99.0](.../compare/client/v0.27.1...client/v0.99.0) - 2026-06-11` header that took ownership of the previous Unreleased content; changes then reverted.
- `shellcheck` 0.11.0 on `bump-version.sh`: zero findings.
- `actionlint` 1.7.12 on `release.testnet.yml` and `release.testnet.push.tags.yml` (includes shellcheck of embedded `run:` blocks): zero errors. The composite action YAML parses and is structurally valid.
- The `verify-onchain` grep was validated against the client source rather than live testnet (no testnet access from this environment): `smartcontract/cli/src/version.rs` prints `program version: X.Y.Z`, which `grep -qi "program version.*$VERSION"` matches. A reviewer with testnet access may want to confirm against real `doublezero --env testnet version` output.

Review findings (for the spec authors)
Independent architecture and security reviews were posted to the tracking issue (malbeclabs/infra#1637). No critical findings. Two High architecture findings flag trade-offs the issue spec itself chose, so they are surfaced here rather than changed unilaterally:
- Unpinned release commit: `build-programs` checks out floating `main` and each tag job tags `main` at its own checkout time, so a commit landing between gate approval and the builds can make tags/packages/programs diverge. Fixable by resolving the version-bump PR's merge SHA in `gate-tags` and threading it to the build checkout and a new `ref` input on the tag workflow.
- Double-dispatch on re-run: if `gh run watch` fails while the infra run continues, "Re-run failed jobs" dispatches a second run. Mitigable by having `dispatch-and-wait` adopt an in-progress run matching `cid-${GITHUB_RUN_ID}-`, or by requiring concurrency groups on the infra workflows.

Both can land as small follow-ups if wanted.
Size note
583 added lines (517 excluding docs), over the repo's ~500-line guideline, as anticipated and accepted in the tracking issue ("the PR is large (~500 lines, mostly workflow YAML); that is expected and accepted for this change"). The five pieces are interdependent (the orchestrator consumes the script, the action, and the `skip_existing` input), so splitting would leave unmergeable fragments.