From aa4cec06bbf28f00a7d53c93911f4231facbacef Mon Sep 17 00:00:00 2001 From: Alex Burns Date: Wed, 17 Jun 2026 12:07:12 -0400 Subject: [PATCH 1/3] Add TSG: storage pool capacity threshold warning (fixed vs thin volumes) Adds a Storage TSG explaining the S2D storage pool capacity threshold warning and the supported remediation paths, gated on volume provisioning type: - Step 1 decision gate on ProvisioningType (Fixed vs Thin) to prevent running the thin reclamation procedure on fixed volumes, where it is a no-op. - Path A (Fixed): add disks, convert to thin, shrink/remove, suppress alert, or raise threshold -- each risk-labeled. - Path B (Thin): SlabConsolidate + ReFS unmap reclamation procedure (no -ReTrim on ReFS), with VM-suspend and CSV owner-node guidance. - Clarifies the two distinct threshold controls and the reserve-capacity rationale (repair headroom after drive/node loss). Registers the TSG in TSG/Storage/README.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- TSG/Storage/README.md | 1 + ...ot-Storage-StoragePoolCapacityThreshold.md | 360 ++++++++++++++++++ 2 files changed, 361 insertions(+) create mode 100644 TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md diff --git a/TSG/Storage/README.md b/TSG/Storage/README.md index b2298a9..016b7e0 100644 --- a/TSG/Storage/README.md +++ b/TSG/Storage/README.md @@ -3,3 +3,4 @@ * [Troubleshooting Storage With Support Diagnostics Tool](./Troubleshooting-Storage-With-Support-Diagnostics-Tool.md) * [How To: Add physical disks to an existing Azure Local cluster](./HowTo-Storage-AddPhysicalDisksToS2DPool.md) * [Troubleshoot: Physical disks not claimed after insertion (`CanPool=False`)](./Troubleshoot-Storage-PhysicalDiskCanPoolFalse.md) +* [Troubleshoot: Storage pool capacity threshold warning (fixed vs thin volumes)](./Troubleshoot-Storage-StoragePoolCapacityThreshold.md) diff --git a/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md 
b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md new file mode 100644 index 0000000..7dc368d --- /dev/null +++ b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md @@ -0,0 +1,360 @@ +# Troubleshoot the storage pool capacity threshold warning (fixed vs thin volumes) + + + + + + + + + + + + + + + + + + +
Component: Storage
Severity: Medium
Applicable Scenarios: Day 2 Operations: Capacity management / Update readiness
Affected Versions: All Azure Local releases (Storage Spaces Direct)
+ +## Overview + +This guide explains the Storage Spaces Direct (S2D) **storage pool capacity +threshold warning** and the supported options for resolving it. The warning +fires when pool allocation crosses the configured threshold (the thin +provisioning alert threshold defaults to **70%**). + +The warning is **not a false alarm**: S2D needs free pool capacity in reserve so +that storage repair jobs can rebuild resiliency after a drive or node is lost. It +is, however, frequently misunderstood on clusters that use **fixed-provisioned** +volumes, because a fixed volume commits its entire size to the pool the moment it +is created — so the pool can sit above the threshold even when the volume's file +system is mostly empty. + +The single most important step is to **determine whether the affected volumes are +fixed or thin provisioned before taking any action**, because the space +reclamation procedure (`Optimize-Volume -SlabConsolidate` followed by +`Optimize-StoragePool`) **does nothing on a fixed-provisioned volume** and only +applies to thin-provisioned volumes. + +## Symptoms + +**Observable behaviors:** + +- A storage pool capacity / threshold health fault is raised against the pool + (surfaced in Windows Admin Center and by the cluster Health Service). +- `Get-StoragePool` shows the pool allocated at or above the threshold + (commonly 70%+), even though volumes report significant free space inside. +- Solution update (upgrade) readiness checks flag a storage pool capacity warning. + This is a bypassable warning, but addressing it before an update is strongly + recommended (see [What and why](#what-and-why)). +- On Azure Local 23H2+ with Arc VMs, new virtual disk creation may fail with an + out-of-capacity error from the underlying virtualization layer once the pool is + near full, even though the volume looks fine in Windows Admin Center. 
+ +## What and Why + +### Why the warning exists + +When a capacity drive (or a whole node) is lost, S2D automatically starts repair +("auto-heal") jobs that re-create the missing copies of your data on the +remaining drives to restore full resiliency. Those repair jobs need somewhere to +write — they consume free pool capacity. If the pool has no reserve, repair jobs +have nowhere to rebuild and remain **suspended** until the failed drive is +physically replaced, leaving the volume running with reduced (or no) redundancy +in the meantime. + +For this reason Microsoft recommends keeping free pool capacity in reserve. The +guidance is to reserve **the equivalent of one capacity drive per server, up to a +maximum of four drives** (reserve grows for parity and multi-tier +configurations). See +[Plan volumes — reserve capacity](https://learn.microsoft.com/windows-server/storage/storage-spaces/plan-volumes). + +This is especially important on **small clusters (for example, two nodes with a +two-way mirror)**: during an update, nodes are drained and rebooted one at a +time, and a full pool leaves no headroom for the storage layer to keep data +resilient through the drain. + +### Why fixed-provisioned volumes hit this so easily + +Volumes on Azure Local are either **Thin** or **Fixed** provisioned (thin is the +default for new volumes; the default can be changed at the pool level): + +- **Fixed** — the volume reserves its full size in the pool at creation time. A + fixed volume on an N-way mirror commits **N × the volume size** of pool + footprint up front, regardless of how much data is actually written into it. + Deleting data inside the volume does **not** return capacity to the pool. +- **Thin** — the volume consumes pool capacity only as data is written, and unused + capacity can (with the procedure below) be returned to the pool. + +So a large fixed volume can push the pool over the threshold purely by design. 
+That is customer-chosen over-provisioning, not a stranded-capacity defect, and it +is not recoverable by the thin reclamation procedure. + +### Two different threshold controls (do not confuse them) + +- The **thin provisioning alert threshold** (default 70%) is the percentage the + Health Service evaluates the pool against. Change it with + `Set-StoragePool -ThinProvisioningAlertThresholds`. +- The **Health Service pool capacity alert** is a master on/off switch over that + evaluation, toggled with `Set-StorageHealthSetting` + (`System.Storage.StoragePool.ThresholdAlert.Enabled`). + +These are layered, not alternatives: raising the threshold (Option A5) has no +effect if the Health Service alert has already been disabled (Option A4), and +disabling the alert silences it regardless of the threshold value. Decide which +layer you intend to act on before changing anything. + +## Step 1 — Determine the provisioning type (required first) + +Run this on any cluster node before choosing a remediation: + +```powershell +Get-VirtualDisk | Format-Table FriendlyName, ProvisioningType, Size, FootprintOnPool -AutoSize +``` + +- `ProvisioningType = Fixed` → follow [Path A](#path-a--fixed-provisioned-volumes). +- `ProvisioningType = Thin` → follow [Path B](#path-b--thin-provisioned-volumes-reclaim-unused-capacity). + +> [!IMPORTANT] +> Do **not** run `Optimize-Volume -SlabConsolidate` or `Optimize-StoragePool` to +> "free space" on a fixed-provisioned volume. There are no unused slabs to +> consolidate on a fixed volume, so the procedure returns no capacity and can +> waste a maintenance window. 
+ +Also capture the current pool fill level so you can confirm the result later: + +```powershell +Get-StoragePool | Where-Object IsPrimordial -eq $false | + Format-Table FriendlyName, Size, AllocatedSize, + @{N='UsedPct';E={[math]::Round(100*$_.AllocatedSize/$_.Size,1)}} -AutoSize +``` + +## Path A — Fixed-provisioned volumes + +On fixed volumes the pool footprint is committed by design. Choose one or more of +the following based on the customer's goal. + +### Option A1 — Add capacity (recommended when growth is expected) — [LOW RISK] + +Add OEM-supported physical disks so total pool capacity grows and the allocation +percentage drops below the threshold. Follow +[How to add physical disks to an existing Azure Local cluster](./HowTo-Storage-AddPhysicalDisksToS2DPool.md). + +### Option A2 — Convert fixed volumes to thin — [MEDIUM RISK] + +Converting to thin lets the pool charge only for data actually written, which +usually drops allocation well below the threshold and enables the reclamation +procedure in Path B. Follow the documented procedure: +[Convert fixed to thin provisioned volumes on Azure Local](https://learn.microsoft.com/previous-versions/azure/azure-local/manage/thin-provisioning-conversion). +After conversion, run [Path B](#path-b--thin-provisioned-volumes-reclaim-unused-capacity) +to release the now-unused capacity back to the pool. + +> [!IMPORTANT] +> In-place conversion of an existing fixed volume to thin requires **Azure Local +> 23H2 (build 2311.2) or later**. On earlier releases, do not attempt the +> conversion — instead create a new thin volume and migrate the data, then remove +> the old fixed volume. Always confirm the conversion path is supported on the +> cluster's current build before recommending it to a customer. + +### Option A3 — Shrink or remove volumes — [MEDIUM RISK] + +Reduce committed footprint by removing volumes that are no longer needed, or by +recreating a volume at a smaller size. 
Note that **ReFS does not support in-place +volume shrink**, so "shrinking" a fixed ReFS volume means evacuating its data and +recreating it smaller. Plan for data movement and downtime. + +### Option A4 — Suppress the capacity alert — [MEDIUM RISK] + +If the customer accepts the capacity posture and wants to stop the alert, the +Health Service threshold alert can be disabled: + +```powershell +# Inspect current setting +Get-StorageSubSystem -FriendlyName Clus* | + Get-StorageHealthSetting -Name "System.Storage.StoragePool.ThresholdAlert.Enabled" + +# Disable the alert +Get-StorageSubSystem -FriendlyName Clus* | + Set-StorageHealthSetting -Name "System.Storage.StoragePool.ThresholdAlert.Enabled" -Value $false +``` + +> [!WARNING] +> Suppressing the alert hides a **real** safety signal. The underlying capacity +> risk (no reserve for repair jobs after a drive loss) still exists. Only do this +> when the customer has explicitly accepted that risk, and document it. + +> [!NOTE] +> Confirm the exact setting name on the live cluster first +> (`Get-StorageSubSystem -FriendlyName Clus* | Get-StorageHealthSetting`) — the +> health-setting namespace can vary by build. + +### Option A5 — Raise the alert threshold — [MEDIUM RISK] + +If the goal is to move the threshold rather than silence the alert entirely: + +```powershell +# Inspect the current threshold(s) +Get-StoragePool -FriendlyName "" | + Select-Object FriendlyName, ThinProvisioningAlertThresholds + +# Raise the threshold (value is a percentage integer; the parameter takes an array) +Set-StoragePool -FriendlyName "" -ThinProvisioningAlertThresholds @(80) +``` + +> [!WARNING] +> Raising the threshold reduces the early-warning margin before the pool runs out +> of repair headroom. The same capacity risk applies as in Option A4. 
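The effect of Options A1 and A5 on the warning can be checked with simple arithmetic before changing anything on the cluster. The sketch below is illustrative only: `Test-PoolOverThreshold` is a hypothetical helper (not a Storage module cmdlet), the sizes and the 85% threshold are made-up sample values, and it assumes the warning applies when the allocation percentage is at or above the threshold, using the same `UsedPct` formula as the pool fill query.

```powershell
# Hypothetical helper for illustration; not part of the Storage module.
function Test-PoolOverThreshold {
    param(
        [Parameter(Mandatory)] [double] $AllocatedSize,  # bytes allocated in the pool
        [Parameter(Mandatory)] [double] $Size,           # total pool size in bytes
        [double] $ThresholdPct = 70                      # default alert threshold per this guide
    )
    # Same formula as the UsedPct column in the pool fill query.
    $usedPct = [math]::Round(100 * $AllocatedSize / $Size, 1)
    [pscustomobject]@{
        UsedPct       = $usedPct
        ThresholdPct  = $ThresholdPct
        OverThreshold = ($usedPct -ge $ThresholdPct)     # assumption: alert at or above threshold
    }
}

# Sample values only: 80 TB allocated in a 100 TB pool.
$before  = Test-PoolOverThreshold -AllocatedSize 80TB -Size 100TB
# Option A1: same allocation after adding 20 TB of pool capacity.
$afterA1 = Test-PoolOverThreshold -AllocatedSize 80TB -Size 120TB
# Option A5: original pool evaluated against a raised (sample) threshold of 85%.
$afterA5 = Test-PoolOverThreshold -AllocatedSize 80TB -Size 100TB -ThresholdPct 85

$before, $afterA1, $afterA5 | Format-Table -AutoSize
```

In this sample both options clear the check, but only A1 changes the actual free capacity; A5 moves the line without adding repair headroom.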
+ +## Path B — Thin-provisioned volumes (reclaim unused capacity) + +On thin volumes, capacity that was written and later deleted can remain committed +to the pool in partially used 256 MB "slabs". A slab is only returned to the pool +once all of its blocks are free. The supported procedure consolidates the live +data into fewer slabs and releases the emptied slabs back to the pool. + +> [!NOTE] +> This procedure recovers capacity only when the volume genuinely holds far less +> data than its pool footprint. Confirm there is real interior free space first +> (`Get-Volume` / volume reports show large free space while `FootprintOnPool` is +> close to `Size × resiliency`). If footprint matches the data actually written, +> there is nothing to reclaim. + +**Procedure (requires a VM suspend window on the affected volume):** — [MEDIUM RISK] + +1. *(Optional, no downtime)* Merge Hyper-V checkpoints that are no longer needed + (`Get-VM | Get-VMSnapshot`, then `Remove-VMSnapshot`). Checkpoint files pin + extra slabs and reduce what consolidation can recover. + +2. **Suspend the VMs running on the affected volume** so their virtual disk file + handles are released (required for consolidation). First find where each VM is + running, because a VM must be suspended on its owner node: + + ```powershell + Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' | + Select-Object Name, OwnerNode, State + ``` + + ```powershell + Suspend-VM -Name "" # run on (or target) the VM's owner node + ``` + + Each suspended VM writes a saved-state file roughly the size of its assigned + memory, so confirm there is enough free space on the volume first. Putting the + cluster resource into redirected access is **not** sufficient — the VMs must + actually be suspended (or stopped). + + > [!IMPORTANT] + > For **Arc-managed VMs** (Azure Local 23H2+), stop the VM from Azure + > (portal or CLI) rather than using `Suspend-VM` on the host. 
Suspending an + > Arc VM directly on the host can desynchronize the Arc agent / Arc Resource + > Bridge view of the VM state. Once workloads on the volume are stopped or + > suspended cluster-wide, proceed with consolidation. + +3. **Consolidate slabs** on the volume. For a cluster shared volume (CSV), + address it by path and run on the CSV owner node — `-FileSystemLabel` resolves + against the local node's volume cache and can miss or mismatch a CSV owned by + another node: + + ```powershell + # Identify the CSV owner node + Get-ClusterSharedVolume | Select-Object Name, OwnerNode + + # On the owner node, consolidate by path + Optimize-Volume -Path "C:\ClusterStorage\" -SlabConsolidate -Verbose + ``` + + > [!IMPORTANT] + > Do **not** add `-ReTrim`. On thin-provisioned ReFS, `-ReTrim` does nothing + > useful — ReFS does not use the NTFS retrim mechanism; it has its own + > background unmap workitem. (Some older published examples show + > `-ReTrim -SlabConsolidate` together; for ReFS, use `-SlabConsolidate` + > alone.) Slab consolidation is the time-consuming step and can take hours on + > multi-terabyte volumes. + +4. **Wait about 15 minutes** after consolidation completes. The capacity is + returned to the pool by the **ReFS background unmap workitem**, which runs + after `Optimize-Volume -SlabConsolidate` finishes — this wait, not the next + step, is what releases the emptied slabs. + + > [!NOTE] + > VMs only need to stay suspended through the consolidation in Step 3. Once + > Step 3 reports complete, you can resume the VMs (Step 6) and run the + > remaining steps with workloads online, shortening the maintenance window. + +5. 
**(Optional) Rebalance the pool allocation:** + + ```powershell + Optimize-StoragePool -FriendlyName "" -Verbose + ``` + + `Optimize-StoragePool` rebalances Storage Spaces allocations across the pool; + it is primarily used to spread data onto newly added drives and is a finalize + step here, not the mechanism that frees the slabs (that already happened in + Step 4). Monitor with `Get-StorageJob` and wait until no `Optimize` jobs are + running before re-measuring pool fill. If it finishes in seconds with no jobs, + that is expected when there is nothing to rebalance — it does **not** mean + reclamation failed; confirm the result with the pool fill query in + [Verify](#verify). + +6. **Resume the VMs:** + + ```powershell + Resume-VM -Name "" + ``` + +> [!NOTE] +> In some cases, even after a correct consolidation pass with workloads suspended, +> a final batch of slabs may remain committed and the pool does not drop as far as +> expected. If the pool stays above the threshold after a clean consolidation +> pass, open a Microsoft support case rather than repeating the procedure. 
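The pre-check described at the start of Path B (reclamation only helps when the pool footprint is far larger than the data actually written) can be roughed out numerically. This is a sketch under stated assumptions: `Get-ReclaimEstimate` is a hypothetical helper, the inputs are sample numbers, and it assumes a mirror volume whose ideal footprint is simply data written multiplied by the number of copies. It ignores slab granularity and metadata overhead, so treat the result as an upper bound rather than a prediction.

```powershell
# Hypothetical helper for illustration; not part of the Storage module.
function Get-ReclaimEstimate {
    param(
        [Parameter(Mandatory)] [double] $FootprintOnPool,  # bytes, FootprintOnPool from Get-VirtualDisk
        [Parameter(Mandatory)] [double] $UsedBytes,        # bytes of data in the volume (Size - SizeRemaining from Get-Volume)
        [Parameter(Mandatory)] [int]    $DataCopies        # 2 for a two-way mirror, 3 for a three-way mirror
    )
    # Assumption: ideal footprint = data written x number of mirror copies.
    $idealFootprint = $UsedBytes * $DataCopies
    # Clamp at zero so the estimate is never negative.
    $reclaimable = [math]::Max([double]0, $FootprintOnPool - $idealFootprint)
    [pscustomobject]@{
        IdealFootprintTB = [math]::Round($idealFootprint / 1TB, 2)
        ReclaimableTB    = [math]::Round($reclaimable / 1TB, 2)
    }
}

# Sample values only: three-way mirror, 28 TB footprint, 4 TB of data left after deletes.
$estimate = Get-ReclaimEstimate -FootprintOnPool 28TB -UsedBytes 4TB -DataCopies 3
$estimate
```

If the estimate is close to zero, the footprint already matches the data written and a consolidation pass is unlikely to be worth a maintenance window.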
+ +## Choose the right option + +| Volume provisioning | Goal | Use | +|---|---|---| +| Fixed | Grow capacity | A1 — add physical disks | +| Fixed | Reduce committed footprint / enable reclamation | A2 — convert to thin, then Path B | +| Fixed | Remove unneeded volumes | A3 — shrink/remove (ReFS = evacuate + recreate) | +| Fixed | Stop the alert (risk accepted) | A4 — disable the Health Service alert | +| Fixed | Move the alert threshold | A5 — raise `ThinProvisioningAlertThresholds` | +| Thin | Return deleted-data capacity to the pool | Path B — SlabConsolidate + Optimize-StoragePool | + +## Verify + +After remediation, confirm the pool dropped below the threshold and the warning +cleared: + +```powershell +# Pool fill level +Get-StoragePool | Where-Object IsPrimordial -eq $false | + Format-Table FriendlyName, Size, AllocatedSize, + @{N='UsedPct';E={[math]::Round(100*$_.AllocatedSize/$_.Size,1)}} -AutoSize + +# Any in-flight storage jobs +Get-StorageJob + +# Health faults +Get-StorageSubSystem -FriendlyName Clus* | Debug-StorageSubSystem +``` + +For an upgrade, re-run the solution update readiness check and confirm the +capacity finding is resolved or accepted. 
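When re-measuring pool fill, it can help to compare the remaining free capacity with the reserve guidance quoted earlier (the equivalent of one capacity drive per server, up to a maximum of four drives). A minimal sketch, assuming only that simple rule: `Get-RecommendedReserve` is a hypothetical helper, the drive size and server counts are sample values, and the larger reserve mentioned for parity and multi-tier configurations is not modeled.

```powershell
# Hypothetical helper for illustration; not part of the Storage module.
function Get-RecommendedReserve {
    param(
        [Parameter(Mandatory)] [int]    $ServerCount,        # number of servers in the cluster
        [Parameter(Mandatory)] [double] $CapacityDriveSize   # size of one capacity drive, in bytes
    )
    # One capacity drive per server, capped at four drives.
    $reserveDrives = [math]::Min($ServerCount, 4)
    [pscustomobject]@{
        ReserveDrives = $reserveDrives
        ReserveTB     = [math]::Round($reserveDrives * $CapacityDriveSize / 1TB, 2)
    }
}

# Sample values only: 8 TB capacity drives on a two-node and a six-node cluster.
$twoNode = Get-RecommendedReserve -ServerCount 2 -CapacityDriveSize 8TB
$sixNode = Get-RecommendedReserve -ServerCount 6 -CapacityDriveSize 8TB
$twoNode, $sixNode | Format-Table -AutoSize
```

Comparing `ReserveTB` with `Size - AllocatedSize` from the pool fill query gives a quick sense of whether the pool still has repair headroom, independent of where the alert threshold is set.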
+ +## Related Issues + +- [How to add physical disks to an existing Azure Local cluster](./HowTo-Storage-AddPhysicalDisksToS2DPool.md) +- [Troubleshooting Storage With Support Diagnostics Tool](./Troubleshooting-Storage-With-Support-Diagnostics-Tool.md) + +## References + +- [Thin provisioning on Azure Local](https://learn.microsoft.com/previous-versions/azure/azure-local/manage/thin-provisioning) +- [Convert fixed to thin provisioned volumes on Azure Local](https://learn.microsoft.com/previous-versions/azure/azure-local/manage/thin-provisioning-conversion) +- [Plan volumes (capacity and reserve)](https://learn.microsoft.com/windows-server/storage/storage-spaces/plan-volumes) +- [Optimize-Volume](https://learn.microsoft.com/powershell/module/storage/optimize-volume) +- [Optimize-StoragePool](https://learn.microsoft.com/powershell/module/storage/optimize-storagepool) +- [Troubleshoot Storage Spaces Direct health and operational states](https://learn.microsoft.com/windows-server/storage/storage-spaces/storage-spaces-states) + +--- From e25447e246dac3d59d039f862dccc173b3ac6395 Mon Sep 17 00:00:00 2001 From: Alex Burns Date: Thu, 18 Jun 2026 21:38:33 -0400 Subject: [PATCH 2/3] Address review feedback: overview/Path B consistency, A4 alert scope, Get-HealthFault - Overview: describe thin reclamation as SlabConsolidate + ReFS background unmap (Optimize-StoragePool is an optional rebalance, not the freeing step) to match Path B; align the summary table row too. - Option A4: note the Health Service alert toggle is applied at the storage subsystem level, so it suppresses the threshold alert cluster-wide (all pools), not just one pool or volume. - Verify: use the lightweight Get-HealthFault to list active faults instead of the heavier Debug-StorageSubSystem, matching the other Storage TSGs. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- ...ot-Storage-StoragePoolCapacityThreshold.md | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md index 7dc368d..55cac71 100644 --- a/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md +++ b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md @@ -35,9 +35,9 @@ system is mostly empty. The single most important step is to **determine whether the affected volumes are fixed or thin provisioned before taking any action**, because the space -reclamation procedure (`Optimize-Volume -SlabConsolidate` followed by -`Optimize-StoragePool`) **does nothing on a fixed-provisioned volume** and only -applies to thin-provisioned volumes. +reclamation procedure (`Optimize-Volume -SlabConsolidate`, then waiting for the +ReFS background unmap to release the freed slabs) **does nothing on a +fixed-provisioned volume** and only applies to thin-provisioned volumes. ## Symptoms @@ -182,7 +182,12 @@ Get-StorageSubSystem -FriendlyName Clus* | ``` > [!WARNING] -> Suppressing the alert hides a **real** safety signal. The underlying capacity +> This setting is applied at the **storage subsystem level** +> (`Get-StorageSubSystem ... | Set-StorageHealthSetting`), so it suppresses the +> capacity threshold alert **cluster-wide — for every pool in the subsystem**, not +> just the affected pool or volume. +> +> Suppressing the alert also hides a **real** safety signal. The underlying capacity > risk (no reserve for repair jobs after a drive loss) still exists. Only do this > when the customer has explicitly accepted that risk, and document it. @@ -320,7 +325,7 @@ data into fewer slabs and releases the emptied slabs back to the pool. 
| Fixed | Remove unneeded volumes | A3 — shrink/remove (ReFS = evacuate + recreate) | | Fixed | Stop the alert (risk accepted) | A4 — disable the Health Service alert | | Fixed | Move the alert threshold | A5 — raise `ThinProvisioningAlertThresholds` | -| Thin | Return deleted-data capacity to the pool | Path B — SlabConsolidate + Optimize-StoragePool | +| Thin | Return deleted-data capacity to the pool | Path B — SlabConsolidate + ReFS unmap | ## Verify @@ -336,8 +341,8 @@ Get-StoragePool | Where-Object IsPrimordial -eq $false | # Any in-flight storage jobs Get-StorageJob -# Health faults -Get-StorageSubSystem -FriendlyName Clus* | Debug-StorageSubSystem +# Active health faults across the cluster +Get-HealthFault ``` For an upgrade, re-run the solution update readiness check and confirm the From 33110f4a2b46dfdb5b86a71435b5565e65552b0f Mon Sep 17 00:00:00 2001 From: Alex Burns Date: Fri, 19 Jun 2026 10:39:00 -0400 Subject: [PATCH 3/3] Fix unsupported build-floor claim for fixed-to-thin conversion Option A2 stated that in-place fixed-to-thin conversion "requires Azure Local 23H2 (build 2311.2) or later" and to not attempt it on earlier releases. That build floor has no basis in any Microsoft source: - The cited conversion doc ("Convert fixed to thin provisioned volumes") states "Applies to: Azure Stack HCI, version 22H2" and documents the in-place Set-VirtualDisk -ProvisioningType Thin flow with no build gate. - The thin-provisioning concept doc FAQ ("Can existing fixed volumes be converted to thin? Yes ... supported") notes the feature is available since Azure Stack HCI 21H2 -- below the claimed floor. - The current Azure Local known-issues/release notes carry no regression tying conversion to 2311.2 or any 23H2 build. - The /previous-versions/ path reflects the Azure Stack HCI -> Azure Local rename (archival), not feature deprecation. 
Replace the fabricated build floor with the accurate, sourced caveat: no minimum build is published; the procedure is documented for 21H2/22H2 and archived; it is not re-published in current 23H2/24H2 volume docs, so confirm current-build support before recommending, and fall back to new-volume-and-migrate only if support cannot be confirmed. Addresses PR review comment by 1008covingtonlane. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- ...eshoot-Storage-StoragePoolCapacityThreshold.md | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md index 55cac71..7201d9e 100644 --- a/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md +++ b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md @@ -153,11 +153,16 @@ After conversion, run [Path B](#path-b--thin-provisioned-volumes-reclaim-unused- to release the now-unused capacity back to the pool. > [!IMPORTANT] -> In-place conversion of an existing fixed volume to thin requires **Azure Local -> 23H2 (build 2311.2) or later**. On earlier releases, do not attempt the -> conversion — instead create a new thin volume and migrate the data, then remove -> the old fixed volume. Always confirm the conversion path is supported on the -> cluster's current build before recommending it to a customer. +> Microsoft publishes **no minimum build** for in-place fixed-to-thin conversion. +> The linked procedure (`Set-VirtualDisk -ProvisioningType Thin` plus a volume +> remount) is documented for Azure Stack HCI 21H2/22H2 and is now archived under +> `/previous-versions/` because of the Azure Stack HCI to Azure Local rename — not +> a documented removal of the feature. 
However, the current Azure Local 23H2/24H2 +> volume docs do not re-publish an in-place conversion procedure, so confirm it is +> still supported on the cluster's current build (against current guidance or with +> the storage team) before recommending it to a customer. If you cannot confirm +> support, create a new thin volume and migrate the data instead, then remove the +> old fixed volume. ### Option A3 — Shrink or remove volumes — [MEDIUM RISK]