Skip to content

Reduce CI cost#19919

Merged
metascroy merged 5 commits into
mainfrom
reduce-ci-costs
Jun 1, 2026
Merged

Reduce CI cost#19919
metascroy merged 5 commits into
mainfrom
reduce-ci-costs

Conversation

@metascroy
Copy link
Copy Markdown
Contributor

Currently every push to main runs all macOS jobs unconditionally. This PR path-filters macOS jobs on push as well as PR, but samples 25% of push commits (deterministic by SHA hash) to still run the full suite for HUD/bisection signal. A new viable-strict-gate workflow fails on the 75% non-sampled commits and is added to viable/strict's requires list, so viable/strict only advances on commits where every job ran. Estimated ~75% macOS runner savings.

CI behavior changes:

  • macOS jobs in pull.yml / trunk.yml now skip on pushes that don't touch their paths and aren't in the sample
  • A new viable-strict-gate workflow runs on every push to main/release/* and fails when the commit isn't a sampled full-run.
  • update-viablestrict now requires the gate workflow → viable/strict advances ~every 4 commits instead of every commit.
  • Maintainers can force a full run on any main/release commit by running the new "Promote commit to viable/strict" workflow from the Actions tab
  • Sampling rule lives in one place: _ci-run-decision.yml; change the rate or rule there.

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Jun 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19919

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 1, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Comment thread .github/workflows/_ci-run-decision.yml Fixed
Comment thread .github/workflows/viable-strict-gate.yml Fixed
Comment thread .github/workflows/viable-strict-gate.yml Fixed
@metascroy
Copy link
Copy Markdown
Contributor Author

@claude review this code

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @metascroy's task in 2m 14s —— View job


Code Review: Reduce CI Cost

  • Read changed workflow files
  • Analyze CI run decision logic
  • Analyze viable-strict-gate workflow
  • Analyze promote-to-viable-strict workflow
  • Review pull.yml and trunk.yml changes
  • Review update-viablestrict.yml changes
  • Post review findings

Overall Assessment

The design is sound — a deterministic 25% SHA-based sampling that's centralized in _ci-run-decision.yml, gated via viable-strict-gate.yml, with a clean manual escape hatch. The architecture is well-thought-out. A few issues to address:


Issues

1. test-llama-torchao-lowbit in trunk.yml is ungated (Medium)

trunk.yml:398-415 — This macOS job uses macos_job.yml but has no needs: run-decision or if: gate. It runs on every push to main unconditionally, bypassing the cost-saving mechanism. This seems like an oversight since all other macOS jobs in trunk.yml are gated.

Fix this →

2. Missing permissions block — CodeQL findings (Low)

As flagged by CodeQL, the following workflows lack explicit permissions:

  • _ci-run-decision.yml (line 69) — the decide job only needs read access
  • viable-strict-gate.yml (lines 32, 49) — both jobs only need read

Adding permissions: {} at the workflow level (or per-job) is a security best practice:

permissions: {}  # No token permissions needed

3. Misleading comment in pull.yml (Nit)

pull.yml:24 says "Emits is-full-run='true' for PR/dispatch/tag/sampled-push commits" but _ci-run-decision.yml actually returns 'false' for pull_request events — the PR comment incorrectly suggests PRs get is-full-run = 'true'. The mechanism works correctly (PR jobs use github.event_name == 'pull_request' || to run unconditionally on PRs), but the comment is misleading.

4. pull.yml test-coreml-bc-macos — behavior change for PRs (Verify)

pull.yml:1529 — The old condition was:

github.event_name != 'pull_request' ||
contains(...) || ...

Meaning: always run on push, path-filtered on PRs.

The new condition is:

contains(...) || ... ||
needs.run-decision.outputs.is-full-run == 'true'

Meaning: path-filtered on both push and PR. For PRs, is-full-run is 'false', so this job now only runs on PRs that touch relevant paths. Previously it ran on every non-push event unconditionally. This is a functional change to PR behavior — verify that's intentional.


Design Observations (Non-blocking)

Sampling approach is solid. Using the last hex character of SHA for deterministic 25% sampling ({0,4,8,c}) is clean and reproducible. The case statement is POSIX-compatible and fast.

Single source of truth. Having _ci-run-decision.yml as the sole logic owner and reusing it from both trunk.yml/pull.yml (for gating) and viable-strict-gate.yml (for advancement blocking) avoids drift.

Promote workflow is well-secured. SHA validation (40-char hex, must exist, must be reachable from main/release) and the idempotent tag-exists check are good defensive programming.

Concurrency control on the promote workflow (cancel-in-progress: false) is the right choice to avoid racing tag pushes.


Summary

The PR achieves its goal of ~75% macOS runner cost reduction with a clean, maintainable design. The main actionable issue is the ungated test-llama-torchao-lowbit job. The permissions and comment issues are low-priority polish.

Comment thread .github/workflows/_ci-run-decision.yml Outdated
# Keep in sync with the sample in viable-strict-gate.yml.
if [ "$IS_FULL" = "false" ] && [ "$EVENT_NAME" = "push" ]; then
case "$SHA" in
*0|*4|*8|*c) IS_FULL=true ;;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can hit 100 consecutive sha ending it 0,4,8,c in a row :p

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very unlikely, but I take the point :)

I'll change sampling to be based on git history depth so it regularly runs every 4 commits


test-coreml-bc-macos:
needs: changed-files
needs: [changed-files, run-decision]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also do it for windows?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll focus on macOS + core infra (viable strict update changes and main PR selection job) in this PR. windows and cuda enablement can be follow-ups.

Comment on lines +374 to +375
contains(needs.changed-files.outputs.changed-files, '.github/workflows/trunk.yml') ||
needs.run-decision.outputs.is-full-run == 'true'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not add the sha % 4 logic here and not kick off a new job? curious

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

B/c then we can use it in other jobs (like mlx.yml or cuda.yml) and keep the selection logic in one place

Copy link
Copy Markdown
Contributor

@digantdesai digantdesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. stamping to unblock.

@metascroy
Copy link
Copy Markdown
Contributor Author

@claude review this code

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @metascroy's task in 2m 10s —— View job


Code Review: Reduce CI Cost

  • Read changed workflow files
  • Analyze CI run decision logic
  • Analyze viable-strict-gate workflow
  • Analyze promote-to-viable-strict workflow
  • Review pull.yml and trunk.yml changes
  • Review update-viablestrict.yml changes
  • Post review findings

Overall Assessment

Well-designed approach to reducing macOS CI costs. The depth-based sampling (addressing digantdesai's review feedback) is an improvement over the original SHA-based approach since it guarantees hard bounds on gaps between full runs. The architecture is clean with _ci-run-decision.yml as a single source of truth.


Issues

1. git rev-list --count includes all merge parents (Medium)

_ci-run-decision.yml:74git rev-list --count "$SHA" counts ALL ancestors reachable from $SHA, not just the first-parent (linear) history. In a repo that merges PRs via merge commits, the count jumps by (1 + PR_commit_count) per merge, not by 1. This means:

  • The "every 4th commit" guarantee doesn't map to "every 4th merge to main" — some merges will have counts that jump by 5+ (skipping a mod-4 boundary), while multi-commit PR merges could hit multiple mod-4 boundaries
  • The sampling rate won't be exactly 25% of main-branch pushes; it could be higher or lower depending on PR sizes

If you want deterministic "every 4th push to main," use:

DEPTH=$(git rev-list --first-parent --count "$SHA")

This counts only the linear history along the default branch, giving exactly one count per push event.

2. Stale comment in pull.yml (Low)

pull.yml:18-19 — The comment says "non-PR events emit '*' so the per-job if: short-circuits via event_name != 'pull_request'" — but this pattern no longer exists for the gated jobs. The path-filtered macOS jobs now use contains(...) || is-full-run == 'true' instead of the old event_name != 'pull_request' || contains(...) pattern. The comment should be updated to reflect that non-PR events rely on is-full-run for the sampled gate.

3. viable-strict-gate promotion race: push-triggered failure persists (Verify)

When a commit is pushed to main, viable-strict-gate triggers and (75% of the time) fails. If you later promote that commit via promote-to-viable-strict.yml (pushing a ciflow/trunk/<sha> tag), a new run of viable-strict-gate triggers from the tag push and succeeds. But the original push-triggered run still shows as failed.

Whether update-viablestrict can advance depends on how pytorch/test-infra/.github/actions/update-viablestrict resolves conflicting conclusions for the same SHA. If it checks "latest run" or "any passing run" for the viable-strict-gate workflow, promotion works. If it requires ALL runs of that workflow for the SHA to pass, promoted commits would still be blocked. Worth verifying this is handled correctly by the upstream action.


Design Observations (Non-blocking)

  • Depth-based sampling is a good response to the review feedback. Using depth % 4 == 0 with --count gives deterministic results and (with --first-parent) would guarantee exactly 1-in-4 main-branch commits get full CI.
  • permissions: {} on viable-strict-gate.yml — good, addresses the CodeQL findings from the automated review.
  • Promote workflow validation is thorough: 40-char hex check, existence check, reachability check, idempotent tag-exists handling.
  • test-llama-torchao-lowbit in trunk.yml — the previous review flagged this as ungated, but looking at the current code (lines 391-395), it now correctly has needs: run-decision and the standard gate. Resolved.

Summary

The main actionable issue is using --first-parent in git rev-list --count to get a predictable linear count that maps 1:1 to pushes on main. Without it, the depth count is influenced by the size of merged PRs, making the sampling rate unpredictable. The other items are informational/low-priority.

@metascroy
Copy link
Copy Markdown
Contributor Author

@claude review this code

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @metascroy's task in 3m 7s —— View job


Code Review: Reduce CI Cost

  • Read changed workflow files
  • Analyze CI run decision logic
  • Analyze viable-strict-gate workflow
  • Analyze promote-to-viable-strict workflow
  • Review pull.yml and trunk.yml changes
  • Review mlx.yml changes
  • Review update-viablestrict.yml changes
  • Post review findings

Overall Assessment

The design is solid — depth-based deterministic 25% sampling centralized in _ci-run-decision.yml, with --first-parent for predictable linear counting (addressing digantdesai's earlier feedback). The viable-strict gate, promotion escape hatch, and per-job gating are all well-structured. Previous review issues (ungated test-llama-torchao-lowbit, missing permissions: {}) have been addressed.


Issues

1. Misleading comment about changed-files behavior (Medium)

trunk.yml:22-25 and pull.yml:18-21 both say:

Emits PR diff file list; non-PR events emit '*' (every contains() check returns true)

This is incorrect. GitHub Actions contains('*', 'backends/apple/coreml') returns false* is a literal single-character string, not a glob/wildcard in this context. The contains() function does substring matching, so * does not contain backends/apple/coreml.

The old code worked because it had event_name != 'pull_request' || to short-circuit on push. The new code works because it has needs.run-decision.outputs.is-full-run == 'true' as the final OR clause. So the mechanism is correct, but the comment incorrectly explains why.

This matters because a future reader trusting the comment might remove the is-full-run clause thinking contains() already handles push events.

Suggested fix:

# Emits PR diff file list; non-PR events emit '*'. Note: contains('*', 'path')
# is false (literal substring match, not glob), so path-filtered jobs on push
# rely on run-decision's is-full-run output to gate execution.

Fix this →

2. test-llama-runner-macos path filter is redundant on push (Low)

trunk.yml:480-493 — On push events, changed-files outputs '*'. Since contains('*', 'backends/apple/coreml') is false (see issue #1 above), ALL the contains() clauses evaluate to false on push. The job then only runs when is-full-run == 'true'.

This means these path-filtered jobs never run on non-sampled push commits, even if the push touches their relevant paths. This is different from the PR behavior where path filtering is meaningful. The same applies to test-coreml-delegate, test-static-llama-ane, test-models-macos-coreml, test-models-macos-mps, and test-huggingface-transformers-macos.

If the intent is "on push: only run on sampled commits (25%)" regardless of paths touched, then this is correct behavior — just know that path filtering has no effect on push events with the current _get-changed-files.yml implementation. If you want path filtering to also work on push (run on sampled OR path-matching pushes), you'd need _get-changed-files.yml to compute the diff against the previous commit for push events too.

3. viable-strict-gate tag trigger semantics (Verify)

viable-strict-gate.yml:25-26 triggers on tags: ciflow/trunk/*. When promote-to-viable-strict.yml pushes a ciflow/trunk/<sha> tag, this triggers a new viable-strict-gate run. The new run's github.sha will be the commit the tag points to, and the _ci-run-decision.yml will see REF=refs/tags/ciflow/trunk/... which matches the refs/tags/ciflow/* case and returns is-full-run=true. This path is sound.

However, verify that the actions/checkout@v4 in _ci-run-decision.yml (line 43) correctly checks out the tagged commit (not HEAD of the default branch) when triggered by a tag push. With fetch-depth: 0 and no explicit ref:, checkout defaults to github.sha which should be the tagged commit. This should work but is worth a smoke test.

4. Promote workflow — git branch -r parsing (Nit)

promote-to-viable-strict.yml:75:

for branch in main $(git branch -r | grep -E 'origin/release/' | sed 's|origin/||'); do

git branch -r output has leading whitespace (e.g., origin/release/0.6). The sed removes origin/ but not the leading spaces, so the variable would be release/0.6. Then origin/$branch becomes origin/ release/0.6 which won't resolve. Consider adding | tr -d ' ' or using sed 's|^[[:space:]]*origin/||'.

Fix this →


Design Observations (Non-blocking)

  • --first-parent in git rev-list --count — good fix from previous review. This guarantees exactly 1-in-4 main pushes are sampled regardless of PR merge sizes.
  • permissions: {} on viable-strict-gate.yml and permissions: contents: read on _ci-run-decision.yml — properly locked down.
  • Promotion workflow defense-in-depth (deleting failed gate runs) is a nice touch for HUD cleanliness.
  • mlx.yml gating pattern (github.event_name == 'pull_request' || is-full-run == 'true') is clean and consistent across all jobs.
  • Concurrency control (cancel-in-progress: false) on promote workflow prevents race conditions.

Summary

The PR is well-structured and the core mechanism is correct. The main actionable items are:

  1. Fix the misleading comment about contains('*', ...) behavior (risk of future misunderstanding)
  2. Fix whitespace handling in the promote workflow's branch parsing (functional bug)

The path-filter-on-push semantic (issue #2) is a design choice that's fine as-is if the intent is "push events are either full-run or skip" — just worth documenting explicitly.


@metascroy
Copy link
Copy Markdown
Contributor Author

@claude review this code

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @metascroy's task in 1m 31s —— View job


Code Review: Reduce CI Cost

  • Read changed workflow files
  • Analyze CI run decision logic
  • Analyze viable-strict-gate workflow
  • Analyze promote-to-viable-strict workflow
  • Review pull.yml and trunk.yml changes
  • Review mlx.yml changes
  • Review update-viablestrict.yml changes
  • Post review findings

Overall Assessment

This is a well-architected CI cost reduction. The depth-based deterministic 25% sampling centralized in _ci-run-decision.yml, the viable-strict gate preventing partial-signal advancement, and the manual promotion escape hatch all work together cleanly. Previous review feedback (ungated test-llama-torchao-lowbit, --first-parent, permissions: {}, branch parsing) has been addressed. The code is in good shape.


Issues

1. _get-changed-files.yml Compare API pagination may produce incomplete results (Medium)

_get-changed-files.yml:76 — The GitHub Compare API (/compare/BASE...HEAD) has a hard limit of 300 files in its response regardless of pagination (GitHub docs). If a push commit (e.g., a large merge) touches >300 files, the diff will silently be truncated. Jobs whose relevant paths are in the un-returned tail won't run even though they should.

This is unlikely for typical single-commit pushes to main, but could happen for merge commits or large automated updates. If it matters, you could add a check: if the response indicates total_commits or file count is at the limit, fall back to '*'.

2. _ci-run-decision.yml checkout ref on tag pushes (Low — verify)

_ci-run-decision.yml:43-45 — When viable-strict-gate.yml triggers from a ciflow/trunk/* tag push, actions/checkout@v4 with no explicit ref: defaults to github.sha, which should be the tagged commit. But on tag push events, github.sha is the commit the tag points to — so git rev-list --first-parent --count "$SHA" should resolve correctly. Worth a smoke test, but likely fine.

3. Race between _get-changed-files and force-push (Unlikely, informational)

If a force-push to main replaces commits (unusual for ExecuTorch), github.event.before could reference a SHA that no longer exists in the linear history. The Compare API may 404, and the || echo "" fallback would produce empty CHANGED_FILES, which then becomes '*' (line 82-84). This is actually safe — it falls through to "full" behavior. Just noting the implicit safety net works correctly.


Design Observations (Non-blocking)

  • include-push-diff: true opt-in is a nice API design. Callers like lint.yml that rely on '*' don't need changes.
  • The comment at pull.yml:18-23 and trunk.yml:22-27 is now accurate: it correctly explains that contains('*', 'path') is false and jobs rely on is-full-run.
  • promote-to-viable-strict.yml branch parsing now uses git for-each-ref instead of git branch -r, avoiding the whitespace issue from the previous review.
  • Job gating pattern is consistent across all three workflows (mlx.yml, trunk.yml, pull.yml):
    • Unconditionally-gated jobs: github.event_name == 'pull_request' || is-full-run == 'true'
    • Path-filtered jobs: contains(...) || ... || is-full-run == 'true'
  • viable-strict-gate and update-viablestrict — adding "viable-strict-gate" to the requires list is the correct integration point with PyTorch's HUD infrastructure.

Summary

The PR is clean and ready. The only potentially actionable item is the 300-file Compare API limit (#1), which is a known GitHub limitation and unlikely to matter in practice for this repo's merge patterns. All previously identified issues have been addressed. The design achieves its ~75% macOS runner cost savings goal with good ergonomics for maintainers.


@metascroy metascroy merged commit 410f930 into main Jun 1, 2026
169 checks passed
@metascroy metascroy deleted the reduce-ci-costs branch June 1, 2026 21:54
metascroy added a commit that referenced this pull request Jun 1, 2026
viable-strict-gate.yml and mlx.yml had permissions: {} but call
_ci-run-decision.yml, which needs contents: read for actions/checkout.
GitHub intersects caller permissions with callee needs ({} ∩ {contents:
read} = {}), so both workflows were rejected at registration since
#19919 landed. The gate hasn't run (so update-viablestrict has had no
signal), and mlx.yml hasn't triggered (MLX / * checks show [does not
exist] on HUD). Loosen both to permissions: contents: read. Audited all
other callers of _ci-run-decision.yml / _get-changed-files.yml; none
affected.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants