COMP: Reduce Linux CI disk pressure and harden ccache management #6478

Open
hjmjohnson wants to merge 4 commits into InsightSoftwareConsortium:main from hjmjohnson:ci/linux-azure-disk-management

@hjmjohnson

Member

Reduce Linux CI disk pressure: free preinstalled software, compact ccache before
Cache@2 tar, harden cleanup steps. Fixes recurring near-ENOSPC in ITK.Linux.Python.

Commit summary

Commit 1: COMP: Align Linux.Python CI disk management with Linux CI pattern

  • Restructure AzurePipelinesLinuxPython.yml to match the pattern already in
    AzurePipelinesLinux.yml: conditional 1d/4d ccache eviction inside the build
    step, build-tree removal after diagnostic steps.
  • Reduce CCACHE_MAXSIZE from 8G to 5G.

Commit 2: COMP: Free unused preinstalled software in Linux Azure CI jobs

  • Add "Free preinstalled software" step (first step, before checkout) to both
    Linux pipelines: remove Android SDK, GHC/GHCUP, .NET, Swift, CodeQL, Boost.
  • Prune unused Docker images.

Commit 3: COMP: Harden Linux CI ccache and disk-cleanup after code review

  • Add ccache -c to the condition: always() cleanup step so a restored
    oversized cache is compacted before Cache@2 archives it (the missing step
    that would let ENOSPC recur on the first post-merge run).
  • Extend success eviction window 1d → 3d: ccache does not refresh entry mtimes
    on cache hits, so the 1d window was evicting warm entries used this build.
  • Delete false comment "ccache refreshes timestamps on hit" from Linux.yml.
  • Move CCACHE_MAXSIZE to the pipeline variables block so ccache --show-config
    and ccache --show-stats report the operative limit.
  • Add set -e to "Free preinstalled software" and cleanup steps.
  • Mark docker image prune -af with || true (daemon absence is non-fatal).
  • Fix Boost removal path to include headers (/usr/local/include/boost) and
    libraries (/usr/local/lib/libboost_*); the old path only removed CMake
    find-package configs (~1 MB, not ~200 MB).
  • Add rm -rf $(Agent.BuildDirectory)/ITK-dashboard to cleanup step.
  • Use df -h / consistently (was bare df -h in "Free preinstalled software").
  • Mark ccache eviction calls || true to express that maintenance failure does
    not override the build exit code.

@github-actions github-actions Bot added type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct labels Jun 19, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review June 20, 2026 00:38
@greptile-apps

greptile-apps Bot commented Jun 20, 2026

Contributor

Greptile Summary

This PR hardens disk management across all Linux, macOS, and Windows Azure CI pipelines to address recurring ENOSPC failures during Cache@2 post-job archives. The core changes are: promoting CCACHE_MAXSIZE from per-step env: to pipeline-level variables:; adding a "Free preinstalled software" step to reclaim ~5–15 GB before any build work; and replacing the old one-off eviction workaround in LinuxPython with a consistent pattern across all pipelines.

  • CCACHE eviction windows extended 1d → 3d on success — ccache does not refresh entry mtimes on hits, so the 1d window was silently evicting warm entries used during the current build.
  • ccache -c added to the always() cleanup step across all five pipelines to compact the cache before Cache@2 archives it; however, all three Linux/macOS cleanup steps use set -e without guarding ccache -c with || true, so a compaction failure will abort before rm -rf runs.
  • Boost removal path corrected to include headers (/usr/local/include/boost) and libraries (/usr/local/lib/libboost_*) instead of only the ~1 MB CMake find-package configs.

Confidence Score: 3/5

The disk-pressure improvements are well-motivated, but the cleanup step across Linux, LinuxPython, and macOS pipelines has a logic gap that can reproduce the original ENOSPC failure on the first run after merge.

All five pipelines share a new condition: always() cleanup step that runs ccache -c under set -e without a || true guard. If ccache -c exits non-zero (most plausible on a full disk, the exact failure mode this PR addresses), the step aborts before rm -rf removes the build tree, leaving several gigabytes on disk and allowing Cache@2 to hit ENOSPC exactly as before. A small fix (ccache -c || true) on each of the three Linux/macOS cleanup steps removes the risk entirely.

AzurePipelinesLinux.yml, AzurePipelinesLinuxPython.yml, and AzurePipelinesMacOSPython.yml all need || true added to ccache -c in their cleanup steps. AzurePipelinesBatch.yml and AzurePipelinesWindowsPython.yml share the same ccache -c gap but are lower risk because their rm -rf calls are already protected with || true.
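
The failure mode described above can be reproduced without ccache installed. The following is a minimal sketch, not the pipeline's actual step: `compact_cache` is a hypothetical stand-in that simulates `ccache -c` exiting non-zero, and temporary directories stand in for the build tree. Each step body runs in its own shell process, roughly as a pipeline script step would.

```shell
# compact_cache is a hypothetical stand-in for `ccache -c` exiting non-zero.
unguarded_step='
set -e
compact_cache() { return 1; }
compact_cache            # fails, so set -e aborts the step here
rm -rf "$1"              # never reached
'

guarded_step='
set -e
compact_cache() { return 1; }
compact_cache || true    # failure is tolerated
rm -rf "$1"              # still runs
'

tree_a=$(mktemp -d)      # stand-ins for the build tree
tree_b=$(mktemp -d)

unguarded_rc=0
sh -c "$unguarded_step" step "$tree_a" || unguarded_rc=$?
guarded_rc=0
sh -c "$guarded_step" step "$tree_b" || guarded_rc=$?

left_behind=no
if [ -d "$tree_a" ]; then left_behind=yes; fi
removed=no
if [ ! -d "$tree_b" ]; then removed=yes; fi

echo "unguarded: exit $unguarded_rc, build tree left behind: $left_behind"
echo "guarded:   exit $guarded_rc, build tree removed: $removed"

rm -rf "$tree_a"         # tidy up the demo directory
```

The guard trades visibility for robustness: a failed compaction no longer fails the step, so logging the cache size afterwards is one way to keep the failure observable.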

Important Files Changed

| Filename | Overview |
| --- | --- |
| Testing/ContinuousIntegration/AzurePipelinesLinux.yml | Moves CCACHE_MAXSIZE to pipeline variables, adds Free preinstalled software step, hardens eviction logic with 3d/4d windows and |
| Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml | Replaces the old CCACHE_MAXSIZE=6.5G override workaround with a clean pipeline-variable declaration (5G), aligns eviction pattern with Linux.yml, adds Free preinstalled software step; same set -e + unguarded ccache -c issue in the new cleanup step. |
| Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml | Adds macOS-specific Free preinstalled software step (dotnet + CoreSimulator runtimes), moves CCACHE_MAXSIZE to pipeline variables, adds eviction logic and cleanup step; same set -e + unguarded ccache -c fragility as Linux files. |
| Testing/ContinuousIntegration/AzurePipelinesBatch.yml | Moves CCACHE_MAXSIZE to pipeline variables; adds a new cleanup step using $AGENT_JOBSTATUS for eviction, ccache -c, and rm -rf with |
| Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml | Moves CCACHE_MAXSIZE to pipeline variables and adds cleanup step with $AGENT_JOBSTATUS-based eviction, ccache -c, and |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Free preinstalled software\nrm android/ghc/dotnet/swift/boost\ndocker image prune -af"] --> B["checkout + install dependencies"]
    B --> C["Cache@2 restore ccache + ExternalData"]
    C --> D["ccache zero-stats, evict-older-than 7d, show-config"]
    D --> E["Build and test\nctest -S dashboard.cmake"]
    E --> F{ctest_rc == 0?}
    F -- success --> G["ccache evict-older-than 3d or true"]
    F -- failure --> H["ccache evict-older-than 4d or true"]
    G --> I["exit ctest_rc"]
    H --> I
    I --> J["ccache show-stats - condition always"]
    J --> K["Diagnostics + JUnit + Publish results"]
    K --> L["Free build tree - condition always\nccache -c\nrm -rf build tree + ITK-dashboard\ndf -h /"]
    L --> M["Cache@2 post-job save"]

Comments Outside Diff (1)

  1. Testing/ContinuousIntegration/AzurePipelinesLinux.yml, lines 152-154

    P2 ccache --show-stats runs before compaction, so reported size is pre-compact

    The ccache --show-stats step runs before the "Free build tree" step that runs ccache -c. Disk usage metrics visible in the logs will reflect the state after eviction but before compaction, which can look larger than the final cache that Cache@2 actually archives. The same ordering exists in AzurePipelinesLinuxPython.yml and AzurePipelinesMacOSPython.yml. A second ccache --show-stats at the end of the cleanup step would let operators confirm the final archived cache size.

Reviews (1): Last reviewed commit: "COMP: Apply ccache and disk-cleanup hard..."

Comment thread Testing/ContinuousIntegration/AzurePipelinesLinux.yml
@hjmjohnson hjmjohnson changed the title WIP: COMP: Reduce Linux CI disk pressure and harden ccache management COMP: Reduce Linux CI disk pressure and harden ccache management Jun 20, 2026
Replace the misplaced 'Free disk before post-job ccache upload' step
(which ran before diagnostic steps, used ninja clean, and patched
CCACHE_MAXSIZE with an env-var override) with the same pattern already
used by AzurePipelinesLinux.yml:

- Reduce CCACHE_MAXSIZE from 8G to 5G so the post-job Cache@2 tar
  needs less headroom to write its archive.
- Add conditional ccache eviction inside the build step: 1d on success
  (keeps only the warm entries), 4d on failure (retains fix-retry
  objects across runs).
- Move build-tree removal to after the diagnostic/JUnit/publish steps
  and replace ninja clean with rm -rf, freeing static libs, .so
  modules, and generated .cxx sources that ninja clean leaves behind.
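
The conditional eviction described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the step's actual script: `run_build` and `evict_older_than` are hypothetical stand-ins so the sketch runs without ctest or ccache, and the real step would call `ctest -S dashboard.cmake` and `ccache --evict-older-than` instead. The 1d success window is the one this commit describes; a later commit in this PR extends it to 3d.

```shell
# Hypothetical stand-ins so the sketch runs without ctest or ccache installed.
run_build() { return "$1"; }                                  # stands in for: ctest -S dashboard.cmake
evict_older_than() { echo "ccache --evict-older-than $1"; }   # prints the command instead of running it

build_and_evict() {
  ctest_rc=0
  run_build "$1" || ctest_rc=$?
  if [ "$ctest_rc" -eq 0 ]; then
    evict_older_than 1d || true   # success: keep only recently used entries
  else
    evict_older_than 4d || true   # failure: retain objects for fix-and-retry runs
  fi
  return "$ctest_rc"              # the build result, not the eviction result, decides the step
}

ok_rc=0
ok_log=$(build_and_evict 0) || ok_rc=$?
fail_rc=0
fail_log=$(build_and_evict 2) || fail_rc=$?

echo "$ok_log (step exit code $ok_rc)"
echo "$fail_log (step exit code $fail_rc)"
```

Capturing the build's exit code before any cache maintenance is what lets the step report the build result unchanged, whichever eviction branch runs.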

The ubuntu-22.04 and ubuntu-24.04 hosted agents ship Android SDK (~9 GB),
Haskell/GHCup (~5 GB), .NET (~2-3 GB), Swift (~1.5 GB), CodeQL (~2 GB),
and Boost headers (~1.2 GB). ITK's Linux builds use none of these;
removing them at job start recovers ~20 GB before checkout, ccache
restore, and the build itself consume disk.
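
A rough sketch of such a step follows, written as a function that takes a root prefix so it can be tried against a scratch directory instead of a live agent. Only the two Boost paths come from this PR. The other install locations and the function name `free_preinstalled` are assumptions and should be checked against the actual agent image.

```shell
# free_preinstalled ROOT: remove unused toolchains under ROOT ("/" on a real agent).
free_preinstalled() {
  root=${1%/}                        # "/" becomes "", so the paths below stay absolute
  df -h "${root:-/}"                 # disk usage before
  # Assumed locations, except the Boost paths, which are named in this PR.
  rm -rf "$root/usr/local/lib/android" \
         "$root/opt/ghc" "$root/usr/local/.ghcup" \
         "$root/usr/share/dotnet" \
         "$root/usr/share/swift" \
         "$root/opt/hostedtoolcache/CodeQL" \
         "$root/usr/local/include/boost" \
         "$root"/usr/local/lib/libboost_*
  df -h "${root:-/}"                 # disk usage after
  # The real step also runs: docker image prune -af || true
  # (left out here so the sketch never touches a local Docker daemon).
}

# Try it against a scratch tree rather than the real filesystem.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/usr/local/include/boost" \
         "$demo_root/usr/local/lib" \
         "$demo_root/usr/share/dotnet"
touch "$demo_root/usr/local/lib/libboost_system.so" \
      "$demo_root/usr/local/lib/libz.so"

free_preinstalled "$demo_root" >/dev/null
ls "$demo_root/usr/local/lib"        # unrelated libraries should survive
```

Because `rm -rf` succeeds on paths that do not exist, the same list can be used on images that lack some of these packages.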

- Move CCACHE_MAXSIZE to the pipeline variables block so ccache --show-config
  and --show-stats see the operative limit (was scoped only to the build step)
- Add ccache -c to the always()-conditioned cleanup step so restored oversized
  caches are compacted before Cache@2 tars them
- Extend success eviction window 1d → 3d; ccache does not refresh mtimes on
  hits, so the 1d window was evicting warm entries that were used this build
- Delete false comment "ccache refreshes timestamps on hit" from Linux.yml
- Add set -e to "Free preinstalled software" and cleanup steps; add || true to
  docker image prune so daemon absence is non-fatal
- Fix boost removal path to include headers and libs (/usr/local/include/boost,
  /usr/local/lib/libboost_*); /usr/local/share/boost held only CMake configs
- Remove $(Agent.BuildDirectory)/ITK-dashboard clone in cleanup step
- Use df -h / consistently (was bare df -h in "Free preinstalled software")
- Mark ccache eviction calls with || true to surface that cache maintenance
  failure does not override the build exit code

Mirror the fixes from the Linux pipelines (ccache -c, 3d/4d eviction,
CCACHE_MAXSIZE in variables block, build-tree cleanup) in the remaining
three Azure pipeline configurations.

macOS (AzurePipelinesMacOSPython.yml):
- Add "Free preinstalled software" step: remove .NET SDK and iOS
  simulator runtimes before checkout
- Move CCACHE_MAXSIZE: 8G to variables block
- Add ctest_rc/3d/4d eviction pattern to build step
- Add condition: always() cleanup: ccache -c, rm build tree and
  ITK-dashboard clone, df -h /

Windows Python (AzurePipelinesWindowsPython.yml):
- Move CCACHE_MAXSIZE: 8G to variables block
- Add condition: always() cleanup (bash via Git Bash): evict 3d/4d
  via $AGENT_JOBSTATUS, ccache -c, rm build tree and ITK-dashboard

Batch Windows (AzurePipelinesBatch.yml):
- Move CCACHE_MAXSIZE: 2.4G to variables block
- Same condition: always() cleanup as Windows Python
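
For the two Windows pipelines, where eviction happens in the cleanup step instead of the build step, the window can be chosen from the job status. The sketch below assumes the step compares `$AGENT_JOBSTATUS` (the environment form of Azure Pipelines' `Agent.JobStatus`) against its documented values. The helper name `eviction_window` and the choice to treat `SucceededWithIssues` as a success are assumptions, not taken from the PR.

```shell
# eviction_window STATUS: print the ccache age window for an Agent.JobStatus value.
eviction_window() {
  case "$1" in
    Succeeded|SucceededWithIssues) echo 3d ;;   # successful job: shorter window
    *)                             echo 4d ;;   # Failed, Canceled, or unset: keep more
  esac
}

window=$(eviction_window "${AGENT_JOBSTATUS:-}")
# Printed rather than executed so the sketch runs without ccache installed.
echo "would run: ccache --evict-older-than $window || true"
```

Defaulting to the longer window when the status is missing or unrecognized errs on the side of keeping cache entries.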
@hjmjohnson hjmjohnson force-pushed the ci/linux-azure-disk-management branch from a9bf82f to 6886b30 Compare June 20, 2026 04:28
@github-actions github-actions Bot added the type:Compiler Compiler support or related warnings label Jun 20, 2026
Labels

type:Compiler Compiler support or related warnings type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct
