diff --git a/TSG/Storage/README.md b/TSG/Storage/README.md
index b2298a9..016b7e0 100644
--- a/TSG/Storage/README.md
+++ b/TSG/Storage/README.md
@@ -3,3 +3,4 @@
 * [Troubleshooting Storage With Support Diagnostics Tool](./Troubleshooting-Storage-With-Support-Diagnostics-Tool.md)
 * [How To: Add physical disks to an existing Azure Local cluster](./HowTo-Storage-AddPhysicalDisksToS2DPool.md)
 * [Troubleshoot: Physical disks not claimed after insertion (`CanPool=False`)](./Troubleshoot-Storage-PhysicalDiskCanPoolFalse.md)
+* [Troubleshoot: Storage pool capacity threshold warning (fixed vs thin volumes)](./Troubleshoot-Storage-StoragePoolCapacityThreshold.md)
diff --git a/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md
new file mode 100644
index 0000000..7201d9e
--- /dev/null
+++ b/TSG/Storage/Troubleshoot-Storage-StoragePoolCapacityThreshold.md
@@ -0,0 +1,370 @@
+# Troubleshoot the storage pool capacity threshold warning (fixed vs thin volumes)
+
+| Field | Value |
+|---|---|
+| Component | Storage |
+| Severity | Medium |
+| Applicable Scenarios | Day 2 Operations: Capacity management / Update readiness |
+| Affected Versions | All Azure Local releases (Storage Spaces Direct) |
+
+## Overview
+
+This guide explains the Storage Spaces Direct (S2D) **storage pool capacity
+threshold warning** and the supported options for resolving it. The warning
+fires when pool allocation crosses the configured threshold (the thin
+provisioning alert threshold defaults to **70%**).
+
+The warning is **not a false alarm**: S2D needs free pool capacity in reserve so
+that storage repair jobs can rebuild resiliency after a drive or node is lost. It
+is, however, frequently misunderstood on clusters that use **fixed-provisioned**
+volumes, because a fixed volume commits its entire size to the pool the moment it
+is created — so the pool can sit above the threshold even when the volume's file
+system is mostly empty.
+
+The single most important step is to **determine whether the affected volumes are
+fixed or thin provisioned before taking any action**, because the space
+reclamation procedure (`Optimize-Volume -SlabConsolidate`, then waiting for the
+ReFS background unmap to release the freed slabs) **does nothing on a
+fixed-provisioned volume** and only applies to thin-provisioned volumes.
+
+## Symptoms
+
+**Observable behaviors:**
+
+- A storage pool capacity / threshold health fault is raised against the pool
+  (surfaced in Windows Admin Center and by the cluster Health Service).
+- `Get-StoragePool` shows the pool allocated at or above the threshold
+  (commonly 70%+), even though volumes report significant free space inside.
+- Solution update (upgrade) readiness checks flag a storage pool capacity warning.
+  This is a bypassable warning, but addressing it before an update is strongly
+  recommended (see [What and why](#what-and-why)).
+- On Azure Local 23H2+ with Arc VMs, new virtual disk creation may fail with an
+  out-of-capacity error from the underlying virtualization layer once the pool is
+  near full, even though the volume looks fine in Windows Admin Center.
+
+## What and Why
+
+### Why the warning exists
+
+When a capacity drive (or a whole node) is lost, S2D automatically starts repair
+("auto-heal") jobs that re-create the missing copies of your data on the
+remaining drives to restore full resiliency. Those repair jobs need somewhere to
+write — they consume free pool capacity. If the pool has no reserve, repair jobs
+have nowhere to rebuild and remain **suspended** until the failed drive is
+physically replaced, leaving the volume running with reduced (or no) redundancy
+in the meantime.
+
+For this reason Microsoft recommends keeping free pool capacity in reserve. The
+guidance is to reserve **the equivalent of one capacity drive per server, up to a
+maximum of four drives** (reserve grows for parity and multi-tier
+configurations). See
+[Plan volumes — reserve capacity](https://learn.microsoft.com/windows-server/storage/storage-spaces/plan-volumes).
+
+This is especially important on **small clusters (for example, two nodes with a
+two-way mirror)**: during an update, nodes are drained and rebooted one at a
+time, and a full pool leaves no headroom for the storage layer to keep data
+resilient through the drain.
+
+### Why fixed-provisioned volumes hit this so easily
+
+Volumes on Azure Local are either **Thin** or **Fixed** provisioned (thin is the
+default for new volumes; the default can be changed at the pool level):
+
+- **Fixed** — the volume reserves its full size in the pool at creation time. A
+  fixed volume on an N-way mirror commits **N × the volume size** of pool
+  footprint up front, regardless of how much data is actually written into it.
+  Deleting data inside the volume does **not** return capacity to the pool.
+- **Thin** — the volume consumes pool capacity only as data is written, and unused
+  capacity can (with the procedure below) be returned to the pool.
+
+So a large fixed volume can push the pool over the threshold purely by design.
+That is customer-chosen over-provisioning, not a stranded-capacity defect, and it
+is not recoverable by the thin reclamation procedure.
+
+### Two different threshold controls (do not confuse them)
+
+- The **thin provisioning alert threshold** (default 70%) is the percentage the
+  Health Service evaluates the pool against. Change it with
+  `Set-StoragePool -ThinProvisioningAlertThresholds`.
+- The **Health Service pool capacity alert** is a master on/off switch over that
+  evaluation, toggled with `Set-StorageHealthSetting`
+  (`System.Storage.StoragePool.ThresholdAlert.Enabled`).
+
+These are layered, not alternatives: raising the threshold (Option A5) has no
+effect if the Health Service alert has already been disabled (Option A4), and
+disabling the alert silences it regardless of the threshold value. Decide which
+layer you intend to act on before changing anything.
+
+## Step 1 — Determine the provisioning type (required first)
+
+Run this on any cluster node before choosing a remediation:
+
+```powershell
+Get-VirtualDisk | Format-Table FriendlyName, ProvisioningType, Size, FootprintOnPool -AutoSize
+```
+
+- `ProvisioningType = Fixed` → follow [Path A](#path-a--fixed-provisioned-volumes).
+- `ProvisioningType = Thin` → follow [Path B](#path-b--thin-provisioned-volumes-reclaim-unused-capacity).
+
+> [!IMPORTANT]
+> Do **not** run `Optimize-Volume -SlabConsolidate` or `Optimize-StoragePool` to
+> "free space" on a fixed-provisioned volume. There are no unused slabs to
+> consolidate on a fixed volume, so the procedure returns no capacity and can
+> waste a maintenance window.
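
To see why a volume reported as `Fixed` can hold the pool above the threshold on its own, the "N × the volume size" arithmetic from "Why fixed-provisioned volumes hit this so easily" can be worked through with a minimal sketch. The helper names `Get-FixedMirrorFootprintTB` and `Get-PoolUsedPct` and all the numbers are hypothetical, chosen only for illustration; the calculation ignores pool metadata and any other overhead, so treat the result as an approximation rather than what the cluster will report.

```powershell
# Hypothetical helper (not part of the Storage module): pool capacity that a
# fixed-provisioned mirror volume commits at creation time (N x volume size).
# Simplification: ignores pool metadata and any other overhead.
function Get-FixedMirrorFootprintTB {
    param(
        [Parameter(Mandatory)][double]$VolumeSizeTB,
        [Parameter(Mandatory)][ValidateRange(2, 3)][int]$DataCopies
    )
    return $VolumeSizeTB * $DataCopies
}

# Hypothetical helper: pool allocation as a percentage, using the same formula
# as the UsedPct column in this guide's pool fill query.
function Get-PoolUsedPct {
    param(
        [Parameter(Mandatory)][double]$AllocatedTB,
        [Parameter(Mandatory)][double]$PoolSizeTB
    )
    return [math]::Round(100 * $AllocatedTB / $PoolSizeTB, 1)
}

# Illustrative numbers only: one 10 TB fixed volume, three-way mirror, 40 TB pool.
$footprintTB   = Get-FixedMirrorFootprintTB -VolumeSizeTB 10 -DataCopies 3
$usedPct       = Get-PoolUsedPct -AllocatedTB $footprintTB -PoolSizeTB 40
$overThreshold = $usedPct -ge 70   # 70 is the default alert threshold

"Footprint: $footprintTB TB; pool allocated: $usedPct%; over threshold: $overThreshold"
```

With these example inputs the volume commits 30 TB of a 40 TB pool, which is 75% allocated before any data is written, so the warning would fire even on an empty volume.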
+
+Also capture the current pool fill level so you can confirm the result later:
+
+```powershell
+Get-StoragePool | Where-Object IsPrimordial -eq $false |
+    Format-Table FriendlyName, Size, AllocatedSize,
+    @{N='UsedPct';E={[math]::Round(100*$_.AllocatedSize/$_.Size,1)}} -AutoSize
+```
+
+## Path A — Fixed-provisioned volumes
+
+On fixed volumes the pool footprint is committed by design. Choose one or more of
+the following based on the customer's goal.
+
+### Option A1 — Add capacity (recommended when growth is expected) — [LOW RISK]
+
+Add OEM-supported physical disks so total pool capacity grows and the allocation
+percentage drops below the threshold. Follow
+[How to add physical disks to an existing Azure Local cluster](./HowTo-Storage-AddPhysicalDisksToS2DPool.md).
+
+### Option A2 — Convert fixed volumes to thin — [MEDIUM RISK]
+
+Converting to thin lets the pool charge only for data actually written, which
+usually drops allocation well below the threshold and enables the reclamation
+procedure in Path B. Follow the documented procedure:
+[Convert fixed to thin provisioned volumes on Azure Local](https://learn.microsoft.com/previous-versions/azure/azure-local/manage/thin-provisioning-conversion).
+After conversion, run [Path B](#path-b--thin-provisioned-volumes-reclaim-unused-capacity)
+to release the now-unused capacity back to the pool.
+
+> [!IMPORTANT]
+> Microsoft publishes **no minimum build** for in-place fixed-to-thin conversion.
+> The linked procedure (`Set-VirtualDisk -ProvisioningType Thin` plus a volume
+> remount) is documented for Azure Stack HCI 21H2/22H2 and is now archived under
+> `/previous-versions/` because of the Azure Stack HCI to Azure Local rename — not
+> a documented removal of the feature. However, the current Azure Local 23H2/24H2
+> volume docs do not re-publish an in-place conversion procedure, so confirm it is
+> still supported on the cluster's current build (against current guidance or with
+> the storage team) before recommending it to a customer. If you cannot confirm
+> support, create a new thin volume and migrate the data instead, then remove the
+> old fixed volume.
+
+### Option A3 — Shrink or remove volumes — [MEDIUM RISK]
+
+Reduce committed footprint by removing volumes that are no longer needed, or by
+recreating a volume at a smaller size. Note that **ReFS does not support in-place
+volume shrink**, so "shrinking" a fixed ReFS volume means evacuating its data and
+recreating it smaller. Plan for data movement and downtime.
+
+### Option A4 — Suppress the capacity alert — [MEDIUM RISK]
+
+If the customer accepts the capacity posture and wants to stop the alert, the
+Health Service threshold alert can be disabled:
+
+```powershell
+# Inspect current setting
+Get-StorageSubSystem -FriendlyName Clus* |
+    Get-StorageHealthSetting -Name "System.Storage.StoragePool.ThresholdAlert.Enabled"
+
+# Disable the alert
+Get-StorageSubSystem -FriendlyName Clus* |
+    Set-StorageHealthSetting -Name "System.Storage.StoragePool.ThresholdAlert.Enabled" -Value $false
+```
+
+> [!WARNING]
+> This setting is applied at the **storage subsystem level**
+> (`Get-StorageSubSystem ... | Set-StorageHealthSetting`), so it suppresses the
+> capacity threshold alert **cluster-wide — for every pool in the subsystem**, not
+> just the affected pool or volume.
+>
+> Suppressing the alert also hides a **real** safety signal. The underlying capacity
+> risk (no reserve for repair jobs after a drive loss) still exists. Only do this
+> when the customer has explicitly accepted that risk, and document it.
+
+> [!NOTE]
+> Confirm the exact setting name on the live cluster first
+> (`Get-StorageSubSystem -FriendlyName Clus* | Get-StorageHealthSetting`) — the
+> health-setting namespace can vary by build.
+
+### Option A5 — Raise the alert threshold — [MEDIUM RISK]
+
+If the goal is to move the threshold rather than silence the alert entirely:
+
+```powershell
+# Inspect the current threshold(s)
+Get-StoragePool -FriendlyName "" |
+    Select-Object FriendlyName, ThinProvisioningAlertThresholds
+
+# Raise the threshold (value is a percentage integer; the parameter takes an array)
+Set-StoragePool -FriendlyName "" -ThinProvisioningAlertThresholds @(80)
+```
+
+> [!WARNING]
+> Raising the threshold reduces the early-warning margin before the pool runs out
+> of repair headroom. The same capacity risk applies as in Option A4.
+
+## Path B — Thin-provisioned volumes (reclaim unused capacity)
+
+On thin volumes, capacity that was written and later deleted can remain committed
+to the pool in partially used 256 MB "slabs". A slab is only returned to the pool
+once all of its blocks are free. The supported procedure consolidates the live
+data into fewer slabs and releases the emptied slabs back to the pool.
+
+> [!NOTE]
+> This procedure recovers capacity only when the volume genuinely holds far less
+> data than its pool footprint. Confirm there is real interior free space first
+> (`Get-Volume` / volume reports show large free space while `FootprintOnPool` is
+> close to `Size × resiliency`). If footprint matches the data actually written,
+> there is nothing to reclaim.
+
+**Procedure (requires a VM suspend window on the affected volume):** — [MEDIUM RISK]
+
+1. *(Optional, no downtime)* Merge Hyper-V checkpoints that are no longer needed
+   (`Get-VM | Get-VMSnapshot`, then `Remove-VMSnapshot`). Checkpoint files pin
+   extra slabs and reduce what consolidation can recover.
+
+2. **Suspend the VMs running on the affected volume** so their virtual disk file
+   handles are released (required for consolidation). First find where each VM is
+   running, because a VM must be suspended on its owner node:
+
+   ```powershell
+   Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' |
+       Select-Object Name, OwnerNode, State
+   ```
+
+   ```powershell
+   Suspend-VM -Name ""  # run on (or target) the VM's owner node
+   ```
+
+   Each suspended VM writes a saved-state file roughly the size of its assigned
+   memory, so confirm there is enough free space on the volume first. Putting the
+   cluster resource into redirected access is **not** sufficient — the VMs must
+   actually be suspended (or stopped).
+
+   > [!IMPORTANT]
+   > For **Arc-managed VMs** (Azure Local 23H2+), stop the VM from Azure
+   > (portal or CLI) rather than using `Suspend-VM` on the host. Suspending an
+   > Arc VM directly on the host can desynchronize the Arc agent / Arc Resource
+   > Bridge view of the VM state. Once workloads on the volume are stopped or
+   > suspended cluster-wide, proceed with consolidation.
+
+3. **Consolidate slabs** on the volume. For a cluster shared volume (CSV),
+   address it by path and run on the CSV owner node — `-FileSystemLabel` resolves
+   against the local node's volume cache and can miss or mismatch a CSV owned by
+   another node:
+
+   ```powershell
+   # Identify the CSV owner node
+   Get-ClusterSharedVolume | Select-Object Name, OwnerNode
+
+   # On the owner node, consolidate by path
+   Optimize-Volume -Path "C:\ClusterStorage\" -SlabConsolidate -Verbose
+   ```
+
+   > [!IMPORTANT]
+   > Do **not** add `-ReTrim`. On thin-provisioned ReFS, `-ReTrim` does nothing
+   > useful — ReFS does not use the NTFS retrim mechanism; it has its own
+   > background unmap workitem. (Some older published examples show
+   > `-ReTrim -SlabConsolidate` together; for ReFS, use `-SlabConsolidate`
+   > alone.) Slab consolidation is the time-consuming step and can take hours on
+   > multi-terabyte volumes.
+
+4. **Wait about 15 minutes** after consolidation completes. The capacity is
+   returned to the pool by the **ReFS background unmap workitem**, which runs
+   after `Optimize-Volume -SlabConsolidate` finishes — this wait, not the next
+   step, is what releases the emptied slabs.
+
+   > [!NOTE]
+   > VMs only need to stay suspended through the consolidation in Step 3. Once
+   > Step 3 reports complete, you can resume the VMs (Step 6) and run the
+   > remaining steps with workloads online, shortening the maintenance window.
+
+5. **(Optional) Rebalance the pool allocation:**
+
+   ```powershell
+   Optimize-StoragePool -FriendlyName "" -Verbose
+   ```
+
+   `Optimize-StoragePool` rebalances Storage Spaces allocations across the pool;
+   it is primarily used to spread data onto newly added drives and is a finalize
+   step here, not the mechanism that frees the slabs (that already happened in
+   Step 4). Monitor with `Get-StorageJob` and wait until no `Optimize` jobs are
+   running before re-measuring pool fill. If it finishes in seconds with no jobs,
+   that is expected when there is nothing to rebalance — it does **not** mean
+   reclamation failed; confirm the result with the pool fill query in
+   [Verify](#verify).
+
+6. **Resume the VMs:**
+
+   ```powershell
+   Resume-VM -Name ""
+   ```
+
+> [!NOTE]
+> In some cases, even after a correct consolidation pass with workloads suspended,
+> a final batch of slabs may remain committed and the pool does not drop as far as
+> expected. If the pool stays above the threshold after a clean consolidation
+> pass, open a Microsoft support case rather than repeating the procedure.
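
The pre-check described in the note at the start of Path B (is the pool footprint much larger than the data actually written?) can be approximated before booking a suspend window. The sketch below is an assumption-laden estimate, not a documented formula: `Get-ReclaimableEstimate` is a hypothetical helper, it assumes a mirror volume whose written data occupies roughly used bytes × data copies of pool footprint, and it ignores slab granularity and metadata. In practice the inputs would likely come from `FootprintOnPool` and `NumberOfDataCopies` on `Get-VirtualDisk` and from `Size` minus `SizeRemaining` on `Get-Volume`.

```powershell
# Hypothetical helper: rough upper bound, in bytes, on the capacity that slab
# consolidation could return to the pool for one thin mirror volume.
# Assumption: written data occupies about (used bytes x data copies) of footprint.
function Get-ReclaimableEstimate {
    param(
        [Parameter(Mandatory)][double]$FootprintOnPool,  # bytes committed in the pool
        [Parameter(Mandatory)][double]$UsedBytes,        # volume Size - SizeRemaining
        [Parameter(Mandatory)][ValidateRange(1, 3)][int]$DataCopies
    )
    $neededFootprint = $UsedBytes * $DataCopies
    # Never report a negative value when the footprint already matches the data.
    return [math]::Max([double]0, $FootprintOnPool - $neededFootprint)
}

# Illustrative numbers only: 30 TB footprint, 2 TB of live data, three-way mirror.
$estimate = Get-ReclaimableEstimate -FootprintOnPool 30TB -UsedBytes 2TB -DataCopies 3
"Estimated reclaimable: {0:N1} TB" -f ($estimate / 1TB)
```

An estimate near zero suggests the footprint already matches the data written, in which case the procedure above is unlikely to recover anything.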
+
+## Choose the right option
+
+| Volume provisioning | Goal | Use |
+|---|---|---|
+| Fixed | Grow capacity | A1 — add physical disks |
+| Fixed | Reduce committed footprint / enable reclamation | A2 — convert to thin, then Path B |
+| Fixed | Remove unneeded volumes | A3 — shrink/remove (ReFS = evacuate + recreate) |
+| Fixed | Stop the alert (risk accepted) | A4 — disable the Health Service alert |
+| Fixed | Move the alert threshold | A5 — raise `ThinProvisioningAlertThresholds` |
+| Thin | Return deleted-data capacity to the pool | Path B — SlabConsolidate + ReFS unmap |
+
+## Verify
+
+After remediation, confirm the pool dropped below the threshold and the warning
+cleared:
+
+```powershell
+# Pool fill level
+Get-StoragePool | Where-Object IsPrimordial -eq $false |
+    Format-Table FriendlyName, Size, AllocatedSize,
+    @{N='UsedPct';E={[math]::Round(100*$_.AllocatedSize/$_.Size,1)}} -AutoSize
+
+# Any in-flight storage jobs
+Get-StorageJob
+
+# Active health faults across the cluster
+Get-HealthFault
+```
+
+For an upgrade, re-run the solution update readiness check and confirm the
+capacity finding is resolved or accepted.
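
Dropping below the alert threshold does not by itself show that the repair reserve from "Why the warning exists" is in place. As a rough cross-check, the "one capacity drive per server, up to a maximum of four drives" guidance can be turned into a comparison against free pool capacity. This is a minimal sketch: `Get-RecommendedReserve` and `Test-PoolReserve` are hypothetical helpers, the inputs are example values rather than output from a real cluster, and the extra reserve that parity and multi-tier configurations need is deliberately not modeled.

```powershell
# Hypothetical helper: reserve per the guidance quoted in this guide, i.e. one
# capacity drive per server, capped at four drives. Result is in bytes.
# Not modeled: the larger reserve for parity and multi-tier configurations.
function Get-RecommendedReserve {
    param(
        [Parameter(Mandatory)][ValidateRange(1, 16)][int]$ServerCount,
        [Parameter(Mandatory)][double]$CapacityDriveBytes
    )
    return [math]::Min($ServerCount, 4) * $CapacityDriveBytes
}

# Hypothetical helper: $true when unallocated pool capacity covers the reserve.
function Test-PoolReserve {
    param(
        [Parameter(Mandatory)][double]$PoolSize,
        [Parameter(Mandatory)][double]$AllocatedSize,
        [Parameter(Mandatory)][double]$ReserveBytes
    )
    return ($PoolSize - $AllocatedSize) -ge $ReserveBytes
}

# Illustrative numbers only: two nodes, 4 TB capacity drives, 32 TB pool.
$reserve   = Get-RecommendedReserve -ServerCount 2 -CapacityDriveBytes 4TB
$reserveOk = Test-PoolReserve -PoolSize 32TB -AllocatedSize 22TB -ReserveBytes $reserve
"Reserve target: {0:N0} TB; reserve met: {1}" -f ($reserve / 1TB), $reserveOk
```

In this example the pool is about 69% allocated, just under the default threshold, and its 10 TB of free capacity covers the 8 TB reserve target; at 26 TB allocated the same check would fail.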
+
+## Related Issues
+
+- [How to add physical disks to an existing Azure Local cluster](./HowTo-Storage-AddPhysicalDisksToS2DPool.md)
+- [Troubleshooting Storage With Support Diagnostics Tool](./Troubleshooting-Storage-With-Support-Diagnostics-Tool.md)
+
+## References
+
+- [Thin provisioning on Azure Local](https://learn.microsoft.com/previous-versions/azure/azure-local/manage/thin-provisioning)
+- [Convert fixed to thin provisioned volumes on Azure Local](https://learn.microsoft.com/previous-versions/azure/azure-local/manage/thin-provisioning-conversion)
+- [Plan volumes (capacity and reserve)](https://learn.microsoft.com/windows-server/storage/storage-spaces/plan-volumes)
+- [Optimize-Volume](https://learn.microsoft.com/powershell/module/storage/optimize-volume)
+- [Optimize-StoragePool](https://learn.microsoft.com/powershell/module/storage/optimize-storagepool)
+- [Troubleshoot Storage Spaces Direct health and operational states](https://learn.microsoft.com/windows-server/storage/storage-spaces/storage-spaces-states)
+
+---