Add Storage TSG: storage pool capacity threshold warning (fixed vs thin volumes) #300

Open
AlBurns-MSFT wants to merge 3 commits into Azure:main from AlBurns-MSFT:tsg/storage-pool-capacity-threshold

@AlBurns-MSFT

Collaborator

What

Adds a Storage TSG explaining the Storage Spaces Direct (S2D) storage pool capacity threshold warning and the supported remediation paths, gated on volume provisioning type (fixed vs thin).

The warning is not a false alarm — S2D needs free pool capacity in reserve so repair jobs can rebuild resiliency after a drive/node loss — but it is frequently misunderstood on clusters using fixed-provisioned volumes, where the pool can sit above the threshold even when the volume's file system is mostly empty.

Highlights

  • Step 1 decision gate on ProvisioningType (Fixed vs Thin) to prevent running the thin reclamation procedure on fixed volumes, where it is a no-op.
  • Path A (Fixed): add disks, convert to thin, shrink/remove, suppress alert, or raise threshold — each risk-labeled.
  • Path B (Thin): Optimize-Volume -SlabConsolidate + ReFS unmap reclamation (no -ReTrim on ReFS), with VM-suspend and CSV owner-node guidance.
  • Clarifies the two distinct threshold controls and the reserve-capacity rationale (repair headroom after drive/node loss).
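The Step 1 decision gate could be sketched roughly as follows. This is a minimal illustration, not the TSG's own script: it assumes it is run on a cluster node with the Storage module available, and the `RemediationPath` column is a hypothetical label added here only to tie each volume back to Path A or Path B.

```powershell
# Sketch only: list each virtual disk with its provisioning type and map it
# to the remediation path named in this PR (Fixed -> Path A, Thin -> Path B).
# 'RemediationPath' is a hypothetical calculated column, not a real property.
Get-VirtualDisk |
    Select-Object FriendlyName, ProvisioningType, Size, FootprintOnPool,
        @{ Name = 'RemediationPath'; Expression = {
            switch ($_.ProvisioningType) {
                'Fixed' { 'Path A' }          # thin reclamation would be a no-op
                'Thin'  { 'Path B' }          # slab consolidation applies
                default { 'Check manually' }  # anything unexpected
            }
        } } |
    Format-Table -AutoSize
```

Comparing `Size` with `FootprintOnPool` per volume also shows why a fixed volume can hold the pool above the threshold while its file system is mostly empty: the full footprint is allocated up front.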

Files

  • TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md — new TSG
  • TSG/Storage/README.md — registers the TSG in the Storage index

Adds a Storage TSG explaining the S2D storage pool capacity threshold
warning and the supported remediation paths, gated on volume
provisioning type:

- Step 1 decision gate on ProvisioningType (Fixed vs Thin) to prevent
  running the thin reclamation procedure on fixed volumes, where it is
  a no-op.
- Path A (Fixed): add disks, convert to thin, shrink/remove, suppress
  alert, or raise threshold -- each risk-labeled.
- Path B (Thin): SlabConsolidate + ReFS unmap reclamation procedure
  (no -ReTrim on ReFS), with VM-suspend and CSV owner-node guidance.
- Clarifies the two distinct threshold controls and the reserve-capacity
  rationale (repair headroom after drive/node loss).

Registers the TSG in TSG/Storage/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 19:29

Copilot AI left a comment

Contributor

Pull request overview

Adds a new Storage troubleshooting guide (TSG) to explain and remediate the S2D storage pool capacity threshold warning, with remediation explicitly gated on volume provisioning type (Fixed vs Thin) to avoid ineffective thin-reclamation steps on fixed volumes.

Changes:

  • Introduces a new TSG covering why the warning exists, how to determine provisioning type, and safe remediation options for Fixed vs Thin volumes.
  • Documents thin-volume reclamation guidance including VM workload handling, CSV owner-node execution guidance, and ReFS-specific notes.
  • Registers the new TSG in the Storage README index.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md | New TSG describing the warning, decision gate on provisioning type, remediation paths, and verification steps. |
| TSG/Storage/README.md | Adds the new TSG link to the Storage index. |

Comment on lines +36 to +40
The single most important step is to **determine whether the affected volumes are
fixed or thin provisioned before taking any action**, because the space
reclamation procedure (`Optimize-Volume -SlabConsolidate` followed by
`Optimize-StoragePool`) **does nothing on a fixed-provisioned volume** and only
applies to thin-provisioned volumes.
Comment on lines +171 to +172
If the customer accepts the capacity posture and wants to stop the alert, the
Health Service threshold alert can be disabled:
Get-StorageJob

# Health faults
Get-StorageSubSystem -FriendlyName Clus* | Debug-StorageSubSystem
@Karl-WE

Karl-WE commented Jun 18, 2026


Great addition!

… Get-HealthFault

- Overview: describe thin reclamation as SlabConsolidate + ReFS background unmap
  (Optimize-StoragePool is an optional rebalance, not the freeing step) to match
  Path B; align the summary table row too.
- Option A4: note the Health Service alert toggle is applied at the storage
  subsystem level, so it suppresses the threshold alert cluster-wide (all pools),
  not just one pool or volume.
- Verify: use the lightweight Get-HealthFault to list active faults instead of the
  heavier Debug-StorageSubSystem, matching the other Storage TSGs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@AlBurns-MSFT

Collaborator Author

Thanks for the review — addressed all three points in e25447e:

  1. Overview/Path B consistency — reworded the overview so reclamation is Optimize-Volume -SlabConsolidate + the ReFS background unmap that frees the slabs; Optimize-StoragePool is described only as the optional rebalance (matching Path B), and aligned the summary-table row.
  2. A4 alert scope — called out that the toggle is set at the storage subsystem level, so it suppresses the threshold alert cluster-wide (all pools), not just one pool/volume.
  3. Verify cmdlet — swapped Debug-StorageSubSystem for the lightweight Get-HealthFault, consistent with the other Storage TSGs.
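For reference, the verification step after point 3 might look something like the sketch below. It assumes a cluster node where `Get-HealthFault` is available; the filter on `JobState` is an illustrative choice, not something taken from the TSG.

```powershell
# Sketch only: confirm no storage jobs are still in flight, then list
# active health faults with the lightweight cmdlet.
Get-StorageJob | Where-Object JobState -ne 'Completed'

# Health faults (expect the capacity threshold fault to clear once the
# pool drops back under the threshold)
Get-HealthFault
```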

@1008covingtonlane 1008covingtonlane left a comment

Contributor

Reviewed for technical accuracy against Microsoft Learn — high-quality TSG, and the Fixed-vs-Thin decision gate (don't run Optimize-Volume -SlabConsolidate on a fixed volume) is exactly the right framing; it's a frequent real-world misread.

Verified correct against Learn:

  • 70% default thin-provisioning alert threshold.
  • ~15-minute ReFS background-unmap reclamation wait after -SlabConsolidate.
  • Cmdlets: Set-StoragePool -ThinProvisioningAlertThresholds, Set-VirtualDisk -ProvisioningType Thin, Get-VirtualDisk … ProvisioningType.
  • -SlabConsolidate (not -ReTrim) on ReFS; Arc-VM stop-from-Azure rather than host Suspend-VM; CSV owner-node / by-path handling; Optimize-StoragePool as rebalance, not the slab-free mechanism.
  • Every remediation carries a risk label; A4's subsystem-wide alert scope is clearly warned.
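As a rough illustration of the threshold control verified above, the sketch below reads each pool's current allocation and alert threshold. It assumes a cluster node with the Storage module; `PercentAllocated` is a hypothetical calculated column, and the pool name and value in the commented-out line are placeholders, not recommendations.

```powershell
# Sketch only: show how close each non-primordial pool is to its
# thin-provisioning alert threshold (default 70%).
Get-StoragePool -IsPrimordial $false |
    Select-Object FriendlyName, Size, AllocatedSize,
        ThinProvisioningAlertThresholds,
        @{ Name = 'PercentAllocated'; Expression = {
            [math]::Round(100 * $_.AllocatedSize / $_.Size, 1)
        } }

# Raising the threshold reduces repair headroom; placeholder values only.
# Set-StoragePool -FriendlyName '<pool name>' -ThinProvisioningAlertThresholds 80
```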

One substantive question inline on Option A2's "23H2 (build 2311.2) or later" conversion requirement — it appears to contradict the 22H2 conversion doc this TSG links. Everything else looks accurate and well-organized. Nice addition.

> [!IMPORTANT]
> In-place conversion of an existing fixed volume to thin requires **Azure Local
> 23H2 (build 2311.2) or later**. On earlier releases, do not attempt the
> conversion — instead create a new thin volume and migrate the data, then remove

Contributor

Thanks for the careful Fixed-vs-Thin gating throughout — this is a genuinely useful TSG.

One question on this build requirement: I couldn't find a 2311.2 floor for in-place fixed→thin conversion in either Learn doc this TSG links. The conversion doc cited here ("Convert fixed to thin provisioned volumes") documents the in-place Set-VirtualDisk -ProvisioningType Thin + remount flow on Azure Stack HCI 22H2, and the current "Storage thin provisioning in Azure Local, version 23H2" doc says conversion is supported, with no build gate.

If there's a known regression/issue that makes 2311.2 the real floor, it would be great to cite it here. Otherwise the "on earlier releases, do not attempt the conversion" guidance may send engineers down an unnecessary new-volume-and-migrate path when in-place conversion is documented as supported on 22H2. Happy to be corrected if you've hit a specific build issue.

Collaborator Author

Thanks — removing the unvalidated 2311.2 claim.

Contributor

Verified — the revised note is accurate: the Set-VirtualDisk -ProvisioningType Thin + remount procedure is archived under /previous-versions/ due to the Azure Stack HCI → Azure Local rename (not a feature removal), and there's no documented minimum build. The "confirm on the current build, otherwise create-new-and-migrate" caveat is a sensible call. Thanks for the quick turnaround!

Option A2 stated that in-place fixed-to-thin conversion "requires Azure
Local 23H2 (build 2311.2) or later" and to not attempt it on earlier
releases. That build floor has no basis in any Microsoft source:

- The cited conversion doc ("Convert fixed to thin provisioned volumes")
  states "Applies to: Azure Stack HCI, version 22H2" and documents the
  in-place Set-VirtualDisk -ProvisioningType Thin flow with no build gate.
- The thin-provisioning concept doc FAQ ("Can existing fixed volumes be
  converted to thin? Yes ... supported") notes the feature is available
  since Azure Stack HCI 21H2 -- below the claimed floor.
- The current Azure Local known-issues/release notes carry no regression
  tying conversion to 2311.2 or any 23H2 build.
- The /previous-versions/ path reflects the Azure Stack HCI -> Azure Local
  rename (archival), not feature deprecation.

Replace the fabricated build floor with the accurate, sourced caveat: no
minimum build is published; the procedure is documented for 21H2/22H2 and
archived; it is not re-published in current 23H2/24H2 volume docs, so
confirm current-build support before recommending, and fall back to
new-volume-and-migrate only if support cannot be confirmed.

Addresses PR review comment by 1008covingtonlane.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@1008covingtonlane 1008covingtonlane left a comment

Contributor

Approving. Verified the technical content against Microsoft Learn — the 70% default thin-provisioning alert threshold, the ~15-minute ReFS background-unmap reclamation wait, -SlabConsolidate (not -ReTrim) on ReFS, Arc-VM stop-from-Azure, and the CSV owner-node / by-path handling are all accurate. The Copilot bot's three points and the fixed-to-thin build-floor claim are now resolved. Clear, well-structured TSG and a genuinely useful one for the storage-pool capacity threshold misread.


4 participants