chore: experiment with Bazel Remote Execution (BRE) on Namespace #10579
basvandijk wants to merge 57 commits into
Pull request overview
This PR introduces an experimental GitHub Actions path to run bazel test using Namespace Bazel Remote Execution (BRE), including automation to build/mirror/optimize a worker image and keep the workflow pinned to an immutable digest.
Changes:
- Adds a new experimental workflow to run `bazel test` on Namespace runners with remote execution enabled.
- Extends the container autobuild workflow with jobs to mirror `ic-build` into `nscr.io`, optimize it for BRE, and automatically update the pinned worker-image digest used by the new workflow.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| .github/workflows/container-autobuild.yml | Adds jobs to create an optimized Namespace BRE worker image and auto-update the pinned digest reference in workflows. |
| .github/workflows/bre-namespace-test.yml | New opt-in workflow to run bazel test using Namespace remote execution with a pinned worker image. |
Add a new 'bre-namespace-test.yml' workflow that runs 'bazel test' on Namespace runners using Bazel Remote Execution (BRE). Actions execute on Namespace workers booted from a custom worker image (a mirror of ic-build). Extend 'container-autobuild.yml' to mirror the freshly built ic-build image into nscr.io, optimize it for BRE, and pin the resulting digest. These jobs are decoupled from the production image-reference update so an early-access BRE failure can never block it.
Addresses Copilot review: 'nsc bazel execution setup' writes short-lived credentials into the bazelrc, so 'cat'-ing it could leak auth material into the Actions logs.
…update job

- Use the same fully-qualified nscr.io ref for the upload destination and the digest lookup, so the inspected tag is guaranteed to exist.
- Guard bre-worker-image and bre-namespace-test against fork PRs via head.repo.full_name == github.repository (matching ci-kickoff.yml), since both run on privileged Namespace runners with pre-authenticated nsc.
- Drop update-image-references from update-worker-reference's needs for true decoupling; the existing 'git pull --rebase' absorbs any concurrent push.
27f3f10 to 95ee442
…ac895bdd550cac7bacb9dad553bae
ic-build: sha256:f4c6c7e0e16da470cba7ebceb0145f588d5fd4859c04acfa607bee475ecfa914
ic-dev: sha256:2f98d344d708a1ae70938d5e777a1f141f7f2a9545687653f407a405eb1a27ea
Run URL: https://github.com/dfinity/ic/actions/runs/28442091939
New container images with tag:
update-image-references now depends on bre-worker-image (default needs semantics) and pins the worker digest unconditionally in the same commit, dropping the cancelled()-guard and the separate update-worker-reference job. Waiting on bre-worker-image before committing/pushing avoids the concurrency cancellation seen earlier.
nsc base-image upload prepends $NSC_CONTAINER_REGISTRY (nscr.io/<tenant>) to a relative name. Passing the fully-qualified nscr.io/dfinity ref caused a double-prefixed push (nscr.io/<tenant>/nscr.io/dfinity/...) and a 401 on the digest lookup. Derive the registry from $NSC_CONTAINER_REGISTRY, upload a relative name, pin the resulting full ref, and match any tenant in the pin sed. Also fixes a stale comment referencing the removed update-worker-reference job.
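The ref handling described in this commit can be sketched in plain shell. This is only an illustration under stated assumptions: the tenant name, the digest values, and the sample pinned line are hypothetical placeholders, the sed expression is a stand-in rather than the one in the workflow, and the sketch does not call `nsc` at all.

```shell
# Hard-coded here for illustration; in CI the Namespace runner sets this.
NSC_CONTAINER_REGISTRY="nscr.io/exampletenant"   # hypothetical tenant
relative_name="ic-build-worker"                  # relative name, no registry prefix
digest="sha256:0123abcd"                         # placeholder, not a real digest

# Full ref to pin: registry prefix + relative name + digest.
full_ref="${NSC_CONTAINER_REGISTRY}/${relative_name}@${digest}"

# Rewrite a pinned worker ref on stdin, whatever tenant it currently names.
pin_worker_ref() {
  sed -E "s#nscr\.io/[^/]+/ic-build-worker@sha256:[0-9a-f]+#${full_ref}#g"
}

# Hypothetical pinned line that still names a different (old) tenant.
printf '%s\n' "image: nscr.io/oldtenant/ic-build-worker@sha256:deadbeef" | pin_worker_ref
```

Matching `[^/]+` for the tenant segment is what makes the rewrite tenant-agnostic: the old tenant is never spelled out in the pattern, only in the replacement derived from `$NSC_CONTAINER_REGISTRY`.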
bazel_skylib's copy_file emits a CopyFile spawn tagged no-remote/no-cache (COPY_EXECUTION_REQUIREMENTS). Under remote execution (Namespace BRE), where the only available spawn strategy is remote, these actions have no eligible strategy and fail with:

    CopyFile spawn cannot be executed with any of the available strategies: [remote]

Force the CopyFile/CopyDirectory mnemonics to a local strategy. A per-mnemonic --strategy overrides --spawn_strategy, so it works regardless of the remote execution config, and is a no-op for non-remote-execution builds where these copies already run locally.
The Namespace remote-execution bazelrc makes 'remote' the only spawn strategy, so spawns that forbid remote execution have no eligible strategy and fail with:

    <Mnemonic> spawn cannot be executed with any of the available strategies: [remote]

This affects e.g. bazel_skylib's copy_file (tagged no-remote) and rules_python's compile_pip_requirements .test target (tagged no-remote-exec, requires-network).

Pass --spawn_strategy=remote,local on the BRE 'bazel test' command line so Bazel runs exactly those spawns locally on the runner while everything else still runs remotely. A command-line flag overrides the remote-only --spawn_strategy from the Namespace bazelrc.

This supersedes the earlier per-mnemonic --strategy=CopyFile=local workaround, so revert it from the shared bazelrc.build (it had affected non-BRE builds too).
…e-bazel-remote-execution
…373e886e00994723ae3de5e53b107
ic-build: sha256:cb929d45e83f893f4b03fde1d596dd1acc3211d367f0b8b0195c13c72ee329de
ic-dev: sha256:3e9bd73664f66ea9feb41c414731b66d71e4dc9b8febed19f46e01fe098caf60
ic-build-worker: nscr.io/c9ptjuknd7oc6/ic-build-worker@sha256:584ad7548763df6bf44fd8aa320091f8c67664c4e09111bd80d285b8c5fa8154
Genrules resolve their spawn strategy via --genrule_strategy / --strategy=Genrule, not --spawn_strategy, so the earlier --spawn_strategy=remote,local fallback did not reach them. Under the remote-only Namespace BRE config, genrules marked 'local = True' (e.g. //rs/tests:libvirtd and //rs/tests:dnsmasq, which copy host binaries) thus failed with:

    Genrule spawn cannot be executed with any of the available strategies: [remote]

Pass --strategy=Genrule=remote,local on the 'bazel test' command line. Per Bazel, --strategy=<mnemonic> overrides both --spawn_strategy and --genrule_strategy, and a command-line flag overrides the Namespace bazelrc, so local-only genrules fall back to local while everything else still runs remotely.
The update-image-references sed rewrites ghcr.io/dfinity/ic-build(:|@)... across .github/workflow*/* (which includes container-autobuild.yml), clobbering the dynamic src back into a hardcoded digest each run. Build src from an ic_build_repo variable so the literal ghcr.io/dfinity/ic-build@ never appears on the line and the sed no longer matches it.
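A small shell sketch of why moving the repository into a variable stops the rewrite. The sed expression here is a simplified stand-in (an assumption, since the real update-image-references command is not quoted in this PR text), and the `NEW` digest and the two sample lines are hypothetical; only the `ic_build_repo` variable name comes from the commit message.

```shell
# Simplified stand-in for the update-image-references rewrite: match
# ghcr.io/dfinity/ic-build followed by ':' or '@' and replace the whole ref.
rewrite_refs() {
  sed -E 's#ghcr\.io/dfinity/ic-build(:|@)[^[:space:]"]+#ghcr.io/dfinity/ic-build@sha256:NEW#g'
}

# Before: the literal "ghcr.io/dfinity/ic-build@" is on the line, so the
# dynamic ${digest} reference is clobbered by a hardcoded digest.
before='src="ghcr.io/dfinity/ic-build@${digest}"'

# After: the repo is held in ic_build_repo, so the literal is followed by a
# quote rather than ':' or '@', and the pattern no longer matches.
after='ic_build_repo="ghcr.io/dfinity/ic-build"; src="${ic_build_repo}@${digest}"'

printf '%s\n' "$before" | rewrite_refs
printf '%s\n' "$after"  | rewrite_refs
```

The first line comes out rewritten to the hardcoded `sha256:NEW` ref, while the second passes through unchanged.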
bazel/conf/.bazelrc.build sets --noexperimental_inmemory_dotd_files (forcing C++ .d dependency files to disk) to work around a DFINITY remote-cache bug (bazelbuild/bazel#22387). Under Namespace remote execution with build-without-the-bytes, intermediate .d files are not downloaded locally, so reading them from disk fails with:

    error while parsing .d file: .../infogetty.d (No such file or directory)

Pass --experimental_inmemory_dotd_files on the 'bazel test' command line so .d files are streamed in memory from the remote nodes. This overrides the bazelrc setting (command-line flags win); the cache bug it guarded against does not apply here since this job bypasses the DFINITY cache and uses Namespace's executor.
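Taken together, the command-line overrides introduced in the commits up to this point could be assembled roughly as below. This is a sketch, not the workflow's actual step: it only prints the invocation instead of running Bazel, `bre_override_flags` and `TARGETS` are hypothetical names, and the other flags the workflow passes (bazelrc selection, tag filters, and so on) are left out.

```shell
# Overrides named in the commit messages above; flags on the command line
# take precedence over the same flags set in a bazelrc.
bre_override_flags() {
  printf '%s\n' \
    "--spawn_strategy=remote,local" \
    "--strategy=Genrule=remote,local" \
    "--experimental_inmemory_dotd_files"
}

# Hypothetical variable; the workflow's targets default to //...
TARGETS="${TARGETS:-//...}"

# Print the resulting invocation rather than executing it.
echo "bazel test $(bre_override_flags | tr '\n' ' ')${TARGETS}"
```

Keeping the overrides on the command line rather than in the shared `.bazelrc.build` matches the reasoning in the commits: they apply only to the BRE job and leave non-BRE builds untouched.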
ic-os image builds run rootless 'podman build' and are forced to run locally (in this job's container) by --strategy_regexp=ic-os[:/].*=local. Rootless podman must create a user namespace, which an unprivileged container blocks:

    cannot clone: Operation not permitted
    Error: cannot re-exec process

Add --privileged (grants the capabilities podman needs) and a tmpfs at /tmp/containers (backs podman's --root/--runroot, off the container's overlay rootfs), mirroring the privileged container options used by the jobs in ci-main.yml. --cgroupns host is intentionally omitted for now; it is not implicated by this failure and can be re-added if a cgroup error appears.
…e-bazel-remote-execution
…826b789c9ea988d61e863e46d4d95
ic-build: sha256:e9f95a42acbb5dd96f36d53037129842e16f2ec628ea38f09c9d2404cba2fdff
ic-dev: sha256:cb0b750d7254a4fa280b2f0d0a62ab05649fa9b8c24eafb59bb2dc040fd8dac2
ic-build-worker: nscr.io/c9ptjuknd7oc6/ic-build-worker@sha256:b3209ba49237175d9f4339daa4b0828f1eee0cf9bd8ccd193af7be4a9663d919
…ions

rules_rust's `_symlink_sysroot_tree` iterated `target.files` instead of its `target_files` argument, so the linker target's runfiles (rust-lld's bundled `gcc-ld/*` self-contained linker wrappers, e.g. `gcc-ld/ld.lld`) were never symlinked into the generated sysroot and thus never became declared inputs of the Rustc actions.

rustc defaults to lld on x86_64-unknown-linux-gnu (it links via `-fuse-ld=lld -B<sysroot>/lib/rustlib/<target>/bin/gcc-ld`). Local (non-sandboxed) builds still found gcc-ld on disk, but Bazel Remote Execution (Namespace BRE) ships only an action's declared inputs, so every Rustc link action (starting with the bootstrap process_wrapper) failed with: "the self-contained linker was requested, but it wasn't found in the target's sysroot, or in rustc's sysroot".
…e-bazel-remote-execution
…e-bazel-remote-execution
The previous version of this patch switched _symlink_sysroot_tree to iterate only the linker target's runfiles (rust-lld's gcc-ld/* wrappers), which dropped the rust-lld binary at lib/rustlib/<target>/bin/rust-lld that rustc invokes directly to link wasm32 canisters, causing 'linker `rust-lld` not found' under remote execution. Symlink the union of the linker target's files (the rust-lld binary, used directly for wasm32) and its runfiles (the gcc-ld/* wrappers, used by rustc's default lld on x86_64-unknown-linux-gnu) so both link under remote execution.
…e-bazel-remote-execution
Overview
Experiment with Namespace Bazel Remote Execution (BRE): run `bazel test` on Namespace runners, with actions executed on remote Namespace workers booted from a custom worker image (a mirror of `ic-build`).

Changes

Extended: `.github/workflows/container-autobuild.yml`

- `bre-worker-image` job (on the Namespace runner where `nsc` is pre-authenticated): mirrors the freshly built `ic-build` image, by digest, into the Namespace tenant registry (`$NSC_CONTAINER_REGISTRY`, i.e. `nscr.io/<tenant>/ic-build-worker`) via `nsc base-image upload`, resolves the pushed digest, and optimizes it for BRE via `nsc base-image optimize`. Same fork guard as above.
- `ic-build-image`: if it fails, that's a bug to fix.
- `update-image-references` job now also pins the worker image. It `needs` `bre-worker-image` (so its commit/push, which re-triggers the workflow and would otherwise cancel the in-flight optimize via `cancel-in-progress`, only happens once the optimize completes), and rewrites the pinned ref in `bre-namespace-test.yml` in the same commit as the `ic-build`/`ic-dev` ref and `TAG` updates. The pin `sed` is tenant-agnostic, so it self-corrects if the Namespace tenant ever changes.

New workflow: `.github/workflows/bre-namespace-test.yml`

- Runs `bazel test` on a Namespace runner (`namespace-profile-amd64-linux-32x64`) using BRE. Targets default to `//...` and are overridable via a `workflow_dispatch` input.
- `nsc bazel execution setup` (writes a bazelrc with the remote executor, cache and credentials; deliberately not printed, since it contains short-lived credentials).
- `--remote_default_exec_properties=container-image=...`.
- (`--noworkspace_rc` + explicit `.bazelrc.build` + the Namespace RBE bazelrc), mirroring the existing `bazel-test-arm64` job.
- `--test_tag_filters` and runs with `--keep_going`.
- `workflow_dispatch`, pushes to `dev-gh-*`, or non-fork PRs labeled `CI_BRE`. Restricted to `dfinity/ic`; fork PRs are excluded because the job runs on a privileged Namespace runner with pre-authenticated `nsc`.

Notes / follow-ups

- `ic-build` rebuild (bump `ci/container/TAG`).
- `bre-worker-image` is required, a Namespace/BRE outage would block the production `ic-build`/`ic-dev` reference bump: a conscious trade-off (both must succeed).
- `//...` under BRE (ic-os local-strategy targets, privileged/system tests); the broad `--test_tag_filters` exclusions and `--keep_going` reduce noise while iterating.