feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs#1477

Closed
fszontagh wants to merge 100 commits into leejet:master from fszontagh:feature/vram-offloading-v2

@fszontagh
Contributor

feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs

Why

Two problems that come up on small GPUs running large diffusion models:

  1. Cross-stage component placement. Where does the text encoder live while diffusion runs? Where does diffusion go while the VAE decodes? On a 12 GB card running an 11.5 GB diffusion model, we need to move components in and out between stages or VAE decode hits OOM.
  2. Models that don't fit at all. When the diffusion weights themselves exceed VRAM, we need to stream them in per-layer rather than load all at once.

This PR adds a single new flag, --offload-mode, that handles cross-stage placement, plus a per-layer streaming path (--offload-mode layer_streaming) for the doesn't-fit-at-all case.

New CLI flags

| Flag | Description |
| --- | --- |
| `--offload-mode <mode>` | One of `none`, `cond_only`, `cond_diffusion`, `aggressive`, `layer_streaming`. Default `none`. |
| `--offload-cond-stage` / `--no-offload-cond-stage` | Override the cond-stage offload decision. |
| `--offload-diffusion` / `--no-offload-diffusion` | Override the diffusion-model offload decision. |
| `--offload-log` / `--no-offload-log` | Log offload events to stderr. |
| `--vram-estimation <method>` | `dryrun` (probe graph) or `formula` (analytic). |
| `--streaming-prefetch <N>` | Layers to prefetch ahead during streaming. Default 1. |
| `--streaming-min-vram <MB>` | Minimum free VRAM kept during streaming. Default 512. |

What each mode does

| Mode | What it does | Use case |
| --- | --- | --- |
| `none` (default) | No offload. Identical to current master behaviour. | Default; everything fits on GPU. |
| `cond_only` | Move text encoder to CPU after conditioning, keep diffusion on GPU. | Tight VRAM during diffusion. |
| `cond_diffusion` | Move both text encoder and diffusion model out between stages, swap them in for their stage. | VAE decode needs room; diffusion is too big to coexist with VAE compute buffer. |
| `aggressive` | Evict every component as soon as it's not actively used; reload on demand. | Lowest VRAM footprint at any moment; pays reload costs each transition. |
| `layer_streaming` | Diffusion weights live in pinned host RAM; each transformer block uploads to GPU just before it runs and is evicted afterwards. Async prefetch keeps PCIe full. | Models that don't fit at all (Z-Image bf16 11.5 GB on 12 GB card). |
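
To make the mode names concrete, here is a minimal sketch of how the `--offload-mode` value could be parsed and queried. The enum, `parse_offload_mode()` and `offloads_cond_stage()` are hypothetical names for illustration only; the PR's actual helpers (for example `str_to_offload_mode()`) may look different. The cond-stage rule reflects the defaults described in the table, before any `--offload-cond-stage` / `--no-offload-cond-stage` override.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum: only the five mode names are taken from the PR.
enum class OffloadMode { None, CondOnly, CondDiffusion, Aggressive, LayerStreaming };

// Parse the string given to --offload-mode.
// Returns false and leaves *out untouched when the name is unknown.
inline bool parse_offload_mode(const std::string& s, OffloadMode* out) {
    assert(out != nullptr);
    if (s == "none") {
        *out = OffloadMode::None;
    } else if (s == "cond_only") {
        *out = OffloadMode::CondOnly;
    } else if (s == "cond_diffusion") {
        *out = OffloadMode::CondDiffusion;
    } else if (s == "aggressive") {
        *out = OffloadMode::Aggressive;
    } else if (s == "layer_streaming") {
        *out = OffloadMode::LayerStreaming;
    } else {
        return false;
    }
    return true;
}

// Every mode except `none` moves the text encoder off the GPU after
// conditioning (assumed default, before per-component overrides).
inline bool offloads_cond_stage(OffloadMode m) {
    return m != OffloadMode::None;
}
```

Returning a bool and writing through a pointer keeps the sketch C++11-friendly; an unknown mode string is rejected rather than silently mapped to `none`.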

How layer streaming works

Three pieces, each a known-but-effective optimization at a different layer of the stack:

  1. Pinned host buffer for streamed weights, so cudaMemcpyAsync actually goes async (a pageable source falls through to a synchronous bounce-buffer copy in the driver).
  2. Per-layer prefetch overlapped with the previous layer's compute - the next layer's H2D starts on a separate stream while the current kernel is still running.
  3. Chunk graph for the resident block - layers that fit on GPU stay there across sampling steps and run as one combined ggml graph dispatch instead of one mini-graph per layer.
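
The ordering in point 2 can be sketched without any GPU code. The snippet below is a host-only model, not the PR's implementation: it only produces the order in which uploads, computes and evictions would be issued for a given prefetch depth (the value of `--streaming-prefetch`). `Event`, `Op` and `streaming_schedule()` are hypothetical names. On real hardware the uploads would go to a separate stream and overlap with the compute that follows them in this list.

```cpp
#include <cassert>
#include <vector>

enum class Op { Upload, Compute, Evict };

struct Event {
    Op op;
    int layer;
};

// Issue order for streaming `num_layers` blocks while keeping uploads
// issued up to `prefetch` layers ahead of the block about to run.
inline std::vector<Event> streaming_schedule(int num_layers, int prefetch) {
    assert(num_layers >= 0 && prefetch >= 0);
    std::vector<Event> events;
    int uploaded = 0;  // number of layers whose upload has been issued
    for (int i = 0; i < num_layers; ++i) {
        // Top up the prefetch window before running layer i.
        while (uploaded < num_layers && uploaded <= i + prefetch) {
            events.push_back({Op::Upload, uploaded++});
        }
        events.push_back({Op::Compute, i});
        events.push_back({Op::Evict, i});  // streamed layers leave the GPU after use
    }
    return events;
}
```

With three layers and a prefetch depth of 1, the upload of layer 1 is issued before layer 0 computes, and the upload of layer 2 is issued before layer 1 computes. With a depth of 0 the schedule degenerates to upload, compute, evict in strict sequence.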

A unified VRAM heuristic decides automatically which layers stay resident and which stream, based on actual free VRAM. Users don't have to pick a budget manually.
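
As a rough illustration of what such a heuristic could look like, the sketch below picks how many leading layers stay resident. It is an assumption-laden model, not the PR's code: `count_resident_layers()` is a hypothetical name, `reserve` stands in for compute buffers plus the `--streaming-min-vram` floor, and the streaming window is assumed to need room for `prefetch + 1` copies of the largest layer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns how many leading layers can stay on the GPU across sampling steps.
// All sizes are in bytes.
inline std::size_t count_resident_layers(const std::vector<std::uint64_t>& layer_bytes,
                                         std::uint64_t free_vram,
                                         std::uint64_t reserve,
                                         unsigned prefetch) {
    if (free_vram <= reserve) return 0;
    std::uint64_t budget = free_vram - reserve;

    std::uint64_t total = 0;
    std::uint64_t largest = 0;
    for (std::uint64_t b : layer_bytes) {
        total += b;
        largest = std::max(largest, b);
    }
    // Everything fits: no streaming needed at all.
    if (total <= budget) return layer_bytes.size();

    // Otherwise keep room for the layer that is running plus the prefetched ones.
    const std::uint64_t window = largest * (prefetch + 1ull);
    if (budget <= window) return 0;
    budget -= window;

    // Greedily pin layers from the front until the budget runs out.
    std::size_t resident = 0;
    for (std::uint64_t b : layer_bytes) {
        if (b > budget) break;
        budget -= b;
        ++resident;
    }
    assert(resident < layer_bytes.size());  // total exceeded the budget, so something must stream
    return resident;
}
```

For example, with ten layers of 100 units each, 700 units free, a reserve of 100 and a prefetch depth of 1, the window takes 200 units and four layers stay resident; the remaining six stream.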

Benchmarks - RTX 3060 (12 GB), PCIe 3.0 x16

Hardware: RTX 3060 12 GB. The card itself supports PCIe 4.0, but the board is DDR3-era so the slot is capped at PCIe 3.0 x16 (8.0 GT/s). PCIe bandwidth is the dominant cost during streaming, so faster boards (PCIe 4.0 x16, ~24 GB/s practical) should reduce these numbers materially.

All numbers below: batch_count=4, steps=12, resolution=688x1024, LoRA applied at runtime, same prompt/seed across configs.

Z-Image-Turbo bf16 (11.5 GB diffusion model — does NOT fit in 12 GB)

Workload: 4 images per generation, 12 sampling steps each, batch=4. This is where streaming matters most — without offload of some kind, the model can't even load.

| Config | generate_image time | Notes |
| --- | --- | --- |
| `--offload-mode layer_streaming` | 175 s | This PR. GPU utilization steady >90%; effective PCIe TX ~3.5 GB/s during streaming windows. |
| `--offload-to-cpu --max-vram 9` | 335 s | Existing graph-cut path. ~2× slower. |

Z-Image-Turbo Q8 (6.7 GB diffusion model — fits in VRAM, but VAE compute buffer doesn't)

Workload: 4 images per generation, 12 sampling steps each, batch=4. When the model fits, streaming gives up most of its advantage and the simpler existing offload paths are slightly faster. Listed for completeness.

| Config | generate_image time | Notes |
| --- | --- | --- |
| `--offload-to-cpu` | 115 s | Fastest when model fits. |
| `--vae-tiling` | 118 s | Tile VAE compute on GPU. |
| `--offload-mode layer_streaming` | 122 s | Auto-picks coarse-stage; still goes through streaming bookkeeping (~6% overhead). |
| `--offload-to-cpu --max-vram 6` | 152 s | Graph-cut adds dispatch overhead even when params fit. |
| `--vae-on-cpu` | 602 s | Reference; VAE on CPU is brutal. |

So the recommendation in the docs is: pick --offload-mode layer_streaming when the model doesn't fit (where it's ~2× faster than alternatives), and stick with the existing --offload-to-cpu (or no offload) when it does. --offload-mode none (default) keeps current master behaviour.

Architectures

The streaming runtime is shared via tensor_registry.hpp, layer_streaming.hpp, memory_budget.hpp. Verified end-to-end on RTX 3060:

  • Z-Image / Z-Image-Turbo (bf16 + Q8) - primary target
  • Flux schnell
  • Anima
  • Qwen Image

Implemented and built but not personally verified by me - appreciate someone with the hardware/models confirming:

  • MMDiT / SD3
  • UNet (SD1.x / SDXL)
  • WAN

Known issues

  • --lora-apply-mode immediately + --offload-mode layer_streaming crashes - the immediate folder reaches into weight buffers that haven't been uploaded to GPU yet under streaming. Use at_runtime (default auto already picks this in streaming mode). Pre-existing class of issue surfaced by streaming.
  • VRAM estimation isn't perfect; dryrun is more accurate but adds a small startup cost. Switch to dryrun if you hit OOM during the first step.

Backwards compatibility

Default behaviour is unchanged. --offload-mode none matches current master byte-for-byte. All new flags are opt-in.

Bug fixes folded in

While exercising the offload paths I found and fixed a small set of pre-existing bugs. They're independent of the new offload modes and benefit users who never set --offload-mode. Happy to split these into a separate small PR if preferred.

  • GGMLRunner destructor leaked runtime_params_buffer and partial_runtime_params_buffer. free_params_buffer() only released the CPU-side params_buffer. When the runner had been staged onto the runtime backend (any offload mode active, including the segmented offload from #1476, "feat: add max-vram based segmented param offload"), the GPU-side weight buffer(s) leaked on destruction. This is a real leak under LoRA + offload, because many short-lived runners are created during LoRA application. The fix is a two-line addition to the destructor.
  • CFG causing redundant model reloads under streaming.
  • t_emb buffer aliasing in Z-Image's per-layer path.
  • GGMLRunner scratch-buffer reuse.
  • VAE-encode OOM in aggressive mode.
  • Includes the empty-MultiLoraAdapter fix from #1469 ("Skip empty MultiLoraAdapter when no LoRAs target a model"), which is already merged into master; will rebase to drop that commit at PR time.
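
The destructor leak from the first bullet is easy to reproduce in miniature. The sketch below is a stand-in, not the real GGMLRunner: `FakeBuffer`, `Runner`, `load()`, `stage_to_runtime()` and `release()` are hypothetical, and only the three buffer member names are taken from the description above. It shows the shape of the fix, which is to release the runtime-side buffers alongside the CPU-side one.

```cpp
#include <cassert>

// Stand-in for a backend buffer; counts live instances so leaks are visible.
struct FakeBuffer {
    static int live;
    FakeBuffer() { ++live; }
    ~FakeBuffer() { --live; }
};
int FakeBuffer::live = 0;

struct Runner {
    FakeBuffer* params_buffer = nullptr;                  // CPU-side weights
    FakeBuffer* runtime_params_buffer = nullptr;          // GPU-side weights
    FakeBuffer* partial_runtime_params_buffer = nullptr;  // GPU-side partial weights

    Runner() = default;
    Runner(const Runner&) = delete;
    Runner& operator=(const Runner&) = delete;

    void load() {
        assert(params_buffer == nullptr);
        params_buffer = new FakeBuffer();
    }
    void stage_to_runtime() {
        assert(runtime_params_buffer == nullptr);
        runtime_params_buffer = new FakeBuffer();
    }

    ~Runner() {
        release(params_buffer);
        release(runtime_params_buffer);          // missing before the fix
        release(partial_runtime_params_buffer);  // missing before the fix
    }

private:
    static void release(FakeBuffer*& b) {
        delete b;  // deleting nullptr is a no-op
        b = nullptr;
    }
};
```

Dropping the two marked lines leaves `FakeBuffer::live` at 1 after a staged runner is destroyed, which mirrors the GPU-side leak that accumulated across short-lived LoRA runners.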

Documentation

docs/vram_offloading.md covers the modes, decision tree, and example commands.

fszontagh added 30 commits March 4, 2026 07:34
Add runtime tensor offloading to enable running large models (Q8+)
on GPUs with limited VRAM by dynamically moving components between
GPU and CPU memory.

- `cond_only`: Offload cond_stage (LLM/CLIP) after conditioning
- `cond_diffusion`: Offload both cond_stage and diffusion after use
- `aggressive`: Offload each component immediately after use

- Add OffloadConfig struct with mode, flags for cond_stage/diffusion
- Add move_params_to_cpu/gpu methods to GGMLRunner
- Add set_auto_offload() to control automatic offloading behavior
- Implement on-demand reload before conditioning/diffusion steps
- Track VRAM usage for offloaded components

Enables 1024x1024 generation with Z-Image Q8 (~7GB) + Qwen3-4B Q8
(~4GB) + VAE (~320MB) on 12GB GPU by offloading the ~4GB LLM after
conditioning completes, freeing VRAM for diffusion compute buffers.

Without offloading: CUDA OOM during diffusion
With cond_only offload: Successful generation in ~66s

Tested configurations:
- offload_mode=none: OOM at 1024x1024 with Q8 models
- offload_mode=cond_only: Success, ~66s generation time
- offload_mode=cond_only + vae_tiling: Success, ~149s
Expose the dynamic tensor offloading feature through CLI options:
- --offload-mode: Set offload mode (none, cond_only, cond_diffusion, aggressive)
- --offload-log: Enable offload event logging
- --no-offload-log: Disable offload event logging

The cond_only mode is particularly useful for 12GB GPUs running large
Q8 models with LLMs, as it offloads the LLM/CLIP to CPU after
conditioning, freeing VRAM for diffusion compute buffers.

Changes:
- Add sd_offload_mode_name() and str_to_offload_mode() helper functions
- Add sd_offload_config_init() for default configuration
- Add offload_config member to SDContextParams
- Wire offload_config through to_sd_ctx_params_t()
- Add CLI options in get_options()
When dynamic offloading is enabled and the LLM/CLIP model was offloaded
to CPU, attempting to reload it to GPU could fail if there's not enough
VRAM available. Previously, the code logged a misleading warning
"conditioning will run on CPU (slower)" but then crashed (SEGV) because:

1. move_params_to_gpu() failed and returned false
2. Code continued to call get_learned_condition()
3. compute() tried offload_params_to_runtime_backend() which failed again
4. compute() returned false but caller didn't check return value
5. Code tried to use uninitialized data, causing SEGV

Fix:
- Return NULL from generate_image/generate_video when GPU reload fails
- Return false from load() if initial GPU move fails
- This gives callers a proper error to handle instead of crashing

The user will see a clear error message suggesting to reduce resolution,
use smaller models, or disable dynamic offloading.
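
A minimal sketch of the control-flow change described in this commit, assuming a simplified `Stage` type: the reload result is checked and the failure is returned to the caller instead of falling through to conditioning. `Stage`, `can_fit` and `run_conditioning()` are hypothetical; only the method name `move_params_to_gpu()` is taken from the commit text.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a component that can live on CPU or GPU.
struct Stage {
    bool on_gpu = false;
    bool can_fit = true;  // stand-in for "enough free VRAM to reload"

    bool move_params_to_gpu() {
        if (!can_fit) return false;
        on_gpu = true;
        return true;
    }
};

// Returns false with a message instead of continuing after a failed reload.
inline bool run_conditioning(Stage& cond, std::string* error) {
    assert(error != nullptr);
    if (!cond.on_gpu && !cond.move_params_to_gpu()) {
        *error =
            "failed to reload cond_stage to GPU: reduce resolution, "
            "use smaller models, or disable dynamic offloading";
        return false;
    }
    // Conditioning would run here, with weights guaranteed to be on the GPU.
    return true;
}
```

The caller can then map `false` to a NULL result from the generation entry point, so the failure surfaces as an error rather than a crash on uninitialized data.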
When offload_mode is enabled and LoRAs are being applied, the cond_stage
(LLM/CLIP) may still be on GPU from initial model loading. This uses up
VRAM and causes LoRA allocation to fail with OOM.

Fix: Before applying LoRAs in generate_image(), check if:
1. offload_mode is enabled
2. offload_cond_stage is true
3. We have LoRAs to apply
4. cond_stage is currently on GPU

If all conditions are met, offload cond_stage to CPU first to free VRAM
for LoRA allocation. The cond_stage will be reloaded on-demand before
conditioning runs.

This allows using LoRAs with large LLM models (like qwen3-4b) on 12GB GPUs
that would otherwise OOM during LoRA allocation.
When cond_stage reload fails due to LoRA buffers using VRAM:
1. Free LoRA buffers to make room
2. Retry cond_stage reload
3. Reload LoRA weights from disk

Added reload_params() method to LoraModel to support reloading
weights after buffer is freed and reallocated.

This enables using LoRA with cond_only offload mode on GPUs
where cond_stage + LoRA can't both fit alongside diffusion model.
- Add enable_offload parameter to LoraModel constructor
- Enable CPU offload for LoRA when dynamic offloading is active
- Use move_params_to_cpu()/move_params_to_gpu() for fast memory transfers
  instead of free_params_buffer()/reload_params() disk I/O

This makes LoRA offloading ~10-50ms instead of ~500-1000ms from disk.
When offload mode is enabled, GGMLRunner has both:
- params_buffer (CPU)
- runtime_params_buffer (GPU)

The destructor only freed params_buffer, causing GPU memory to
leak when LoRA models were destroyed while on GPU. This caused
OOM errors after multiple generations with LoRAs.
- Add sd_vram_estimation_t enum for estimation method selection
  - SD_VRAM_EST_DRYRUN (default): accurate graph-based estimation
  - SD_VRAM_EST_FORMULA: faster formula-based approximation

- Add estimate_compute_buffer_size() to GGMLRunner for dry-run
  allocation that returns required buffer size without allocating

- Add estimate_vae_decode_vram() to calculate VAE decode requirements
  using either dry-run or formula method

- Add smart_offload_for_vae() that estimates VRAM needed and
  offloads only what's necessary before VAE decode

- Call smart_offload_for_vae() before decode in image and video
  generation paths

This enables smarter offloading - only offload components when
actually needed based on accurate VRAM estimation.
- Add get_free_vram() helper to query actual GPU memory via CUDA
- Add estimate_diffusion_vram() for diffusion sampling memory estimate
- Add should_offload_cond_stage_for_diffusion() smart check
- Add should_offload_diffusion_for_vae() smart check
- Replace unconditional offload with VRAM-aware decisions
- Only offload when free_vram < next_phase_needs + 300MB margin
- Apply to both txt2img and img2img/video generation paths
- Update common.hpp for vram_estimation struct field order

On larger GPUs, components stay on GPU between phases for speed.
On tight VRAM, offloading still occurs as needed.
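
The decision rule in this commit reduces to one comparison. The sketch below is an illustration under assumptions: `should_offload()` and the constants are hypothetical names, and the 300 MB margin is interpreted as 300 MiB.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kSafetyMargin = 300ull * kMiB;  // assumed to mean MiB

// Offload only when the next phase would not fit with the margin left over.
inline bool should_offload(std::uint64_t free_vram, std::uint64_t next_phase_needs) {
    assert(next_phase_needs <= UINT64_MAX - kSafetyMargin);  // guard against overflow
    return free_vram < next_phase_needs + kSafetyMargin;
}
```

For instance, with 1000 MiB free and a next phase that needs 800 MiB, the threshold is 1100 MiB, so the current component is offloaded; with 2000 MiB free it stays on the GPU.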
- Add reload_diffusion field to sd_offload_config_t struct
- Default to true (matches previous always-reload behavior)
- Make post-generation reload of diffusion model respect config
- Update both txt2img and video generation paths
- Allows keeping diffusion offloaded between generations for batch work

Benchmark results on 12GB GPU with Z-Image Q8_0:
- no_reload: 29-30s generation, 1.9GB GPU after
- reload: 32s generation, 8.1GB GPU after
New CLI options:
- --offload-cond-stage / --no-offload-cond-stage
- --offload-diffusion / --no-offload-diffusion
- --reload-cond-stage / --no-reload-cond-stage
- --reload-diffusion / --no-reload-diffusion
- --vram-estimation [dryrun|formula]

Also adds:
- sd_vram_estimation_name() and str_to_vram_estimation() API functions
- Extended toString() output showing all offload config details
This commit adds the foundation for layer-by-layer tensor streaming,
enabling models larger than VRAM to run by loading weights on-demand.

New components:
- TensorRegistry: Tracks individual tensor locations (GPU/CPU) by layer
- MemoryBudgetManager: Manages VRAM budget with eviction policies
- LayerExecutionEngine: Orchestrates per-layer execution with prefetch

Integration:
- FluxRunner gains enable_layer_streaming() for streaming mode
- New SD_OFFLOAD_LAYER_STREAMING offload mode
- CLI: --offload-mode layer_streaming

This is the infrastructure foundation. Per-block execution will be
added in subsequent commits.
GGMLBlock stores tensor names in its internal `params` map hierarchy,
but never calls ggml_set_name() on the actual GGML tensors. This caused
register_from_context() to get empty names for all tensors, mapping
everything to the "_global" layer (resulting in "registered 1 layers").

Fix: Add register_from_map() method that takes the tensor map from
get_param_tensors(), which preserves proper tensor names like
"model.diffusion_model.double_blocks.5.img_attn.qkv.weight".

Result: 58 layers now registered correctly for Flux models (19 double_blocks
+ 38 single_blocks + 1 _global) instead of just 1.
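
To illustrate the name-to-layer mapping this commit relies on, here is a small sketch that derives a layer key from a full tensor name. `flux_layer_key()` is a hypothetical helper, not the registry's actual pattern function; it assumes layer tensors contain a `double_blocks.N.` or `single_blocks.N.` segment and that everything else belongs to `_global`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Map a tensor name to a layer key such as "double_blocks.5".
// Names without a "<group>.<index>." segment fall into "_global".
inline std::string flux_layer_key(const std::string& name) {
    static const char* const groups[] = {"double_blocks.", "single_blocks."};
    for (const char* g : groups) {
        const std::string group(g);
        const std::size_t pos = name.find(group);
        if (pos == std::string::npos) continue;

        // Scan the numeric block index that follows the group prefix.
        const std::size_t start = pos + group.size();
        std::size_t end = start;
        while (end < name.size() &&
               std::isdigit(static_cast<unsigned char>(name[end]))) {
            ++end;
        }
        assert(end <= name.size());
        // Require at least one digit followed by '.', i.e. a real block member.
        if (end > start && end < name.size() && name[end] == '.') {
            return name.substr(pos, end - pos);
        }
    }
    return "_global";
}
```

Applied to every name in the tensor map, the number of distinct keys is the number of registered layers, which matches the 19 + 38 + 1 = 58 count quoted above when the map is keyed this way.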
…cking

1. Skip move_params_to_gpu() for diffusion model in layer_streaming mode
   - Before sampling: don't bulk-load entire diffusion model to GPU
   - After generation: don't reload diffusion in streaming mode

2. Fix tensor name tracking in TensorRegistry::move_layer_to_gpu
   - Use stored tensor names instead of relying on ggml_get_name()
   - GGMLBlock doesn't call ggml_set_name() on original tensors

Known issue: Graph context invalidation in streaming path needs fixing
(alloc_compute_buffer resets compute_ctx after graph is built)
Two critical fixes for layer streaming mode:

1. Flux preprocessing: Add to_backend() calls for input tensors
   - The regular build_graph() converts external tensors to compute_ctx
   - Streaming preprocessing was missing this, causing mul_mat assertions
   - Now properly converts x, context, timesteps, y, guidance to backend

2. UNet streaming: Add skip_param_offload parameter to compute()
   - In streaming mode, weights are managed by the streaming engine
   - The regular compute() was trying to bulk-allocate all weights to GPU
   - This failed with OOM because streaming only loads layers on demand
   - New skip_param_offload=true prevents this bulk allocation

Testing: Successfully generated 512x512 image with SDXL model using
--offload-mode layer_streaming, 4 steps completed in 3.78s
MMDiT has no skip connections, making it ideal for layer streaming:
- Added mmdit_layer_pattern() to parse joint_blocks.N tensor names
- Added streaming infrastructure to MMDiTRunner (enable/disable/compute)
- Added compute_streaming() that loads all joint_blocks before execution
- Wired MMDiTModel to DiffusionModel streaming interface

MMDiT structure:
- 24 joint_blocks (each with context_block + x_block)
- Global tensors: x_embedder, t_embedder, y_embedder, context_embedder, final_layer
WAN has sequential transformer blocks ideal for streaming:
- Added wan_layer_pattern() to parse blocks.N and vace_blocks.N tensor names
- Added streaming infrastructure to WanRunner (enable/disable/compute)
- Added compute_streaming() that loads all blocks before execution
- Wired WanModel to DiffusionModel streaming interface

WAN structure:
- 30-40 blocks.N (main transformer blocks)
- Optional vace_blocks.N (VACE interleaved blocks)
- Global tensors: patch_embedding, text_embedding, time_embedding, head
- Add qwen_image_layer_pattern() for 60 transformer_blocks
- Add zimage_layer_pattern() for context_refiner + noise_refiner + layers
- Add streaming infrastructure to QwenImageRunner and ZImageRunner
- Wire both models to DiffusionModel streaming interface
- Update compute() methods to accept skip_param_offload parameter

All 6 diffusion model architectures now support layer streaming.
- Add ref_latents and increase_ref_index parameters to compute_streaming
- Update FluxModel::compute_streaming to pass ref_latents
- Convert ref_latents to backend in preprocessing graph
- Handle ref_latents patchification and concatenation

Note: Flux streaming still has tensor context issue in preprocessing
that needs investigation.
The per-layer mini-graph approach was architecturally broken because:
1. GGML tensors are bound to their compute context
2. alloc_compute_buffer() resets context internally
3. Intermediate results cannot be passed between separate graphs

Changed to coarse-stage approach:
1. Load all model weights to GPU via streaming engine
2. Execute full compute graph with skip_param_offload=true
3. This matches the working UNet streaming implementation

Also added skip_param_offload parameter to FluxRunner::compute()
In layer_streaming mode, the cond_stage (T5) must be offloaded before
layer streaming begins, otherwise there won't be enough VRAM for the
diffusion model layers.

Changes:
- Set free_params_immediately=false for layer_streaming mode in CLI
  This enables smart offload logic instead of immediate param freeing
- Add explicit layer_streaming check in should_offload_cond_stage_for_diffusion()
  Forces T5 offload regardless of VRAM heuristics

Without this fix, T5 (~9GB) stays on GPU while layer streaming tries to
load Flux layers (~6.5GB), causing OOM on 12GB cards.

Tested with Flux Schnell Q4_K + T5XXL fp16 on RTX 3060 12GB:
- T5 properly offloaded after conditioning
- Layer streaming loads all 58 layers successfully
- Image generation completes without OOM
Implements the same coarse-stage layer streaming approach used by
Flux, MMDiT, UNet, and other models for the new Anima diffusion model.

Changes:
- tensor_registry.hpp: Add anima_layer_pattern() for net.blocks.N extraction
- anima.hpp: Add streaming engine, enable/disable/compute_streaming methods
- diffusion_model.hpp: Add AnimaModel streaming wrapper methods

Anima has 28 transformer blocks by default, similar in structure to
other DiT models, making it a good candidate for VRAM offloading on
memory-constrained systems.
AnimaConditioner:
- Add GPU offloading methods (is_params_on_gpu, move_params_to_cpu,
  move_params_to_gpu, get_params_vram_size, set_auto_offload)
  delegating to underlying LLM
- This enables proper VRAM management for Anima's Qwen3 text encoder

Layer streaming state consistency:
- Skip diffusion model state manipulation in layer_streaming mode
- The TensorRegistry uses direct buffer pointer swapping which leaves
  GGMLRunner's internal state (params_on_runtime_backend) out of sync
- Querying or manipulating diffusion offload state after streaming
  would cause crashes due to this inconsistency
- cond_stage offload still works normally (not managed by streaming)

Tested: Anima model generates identical output with and without
layer_streaming enabled (verified via MD5 hash comparison)
Problem: After layer streaming completes, all diffusion model layers
remain on GPU. For large models like QwenImage (8.6GB), this leaves
insufficient VRAM for VAE decoding.

Solution: Add offload_streaming_layers() method to all streaming-enabled
models that moves all layers back to CPU before VAE decode.

Changes:
- Add offload_streaming_layers() to DiffusionModel base interface
- Implement in all runners: UNet, MMDiT, Flux, Anima, Wan, QwenImage, ZImage
- Add override methods in all Model wrapper classes
- Call offload_streaming_layers() in stable-diffusion.cpp before VAE decode

This enables running models larger than VRAM:
- QwenImage Edit (16GB model) now runs on 12GB GPU via layer_streaming
- Tested: Anima streaming produces identical output with ~1% overhead
- Add staged forward methods to QwenImageModel:
  - forward_input_stage(): patchify + input projections
  - forward_single_block(): execute one transformer block
  - forward_output_stage(): norm + proj + unpatchify

- Implement compute_streaming_true() for QwenImage that:
  - Executes each of the 60 transformer blocks as a separate mini-graph
  - Stores intermediate img/txt tensors in CPU memory between blocks
  - Loads/offloads ~140MB per block during execution
  - Enables running 8.5GB+ models on 12GB VRAM GPUs

- Update all model architectures (Flux, MMDiT, Anima, WAN, ZImage, UNet)
  with improved VRAM checking in compute_streaming()

This is true per-layer streaming where only ONE block's weights plus
activation memory is needed at any time, enabling models larger than
available VRAM to run.

Tested with Qwen-Image-Edit-2509-Q3_K_S.gguf (8.5GB) on RTX 3060 12GB.
…utput read

Bug: When compute() was called with free_compute_buffer_immediately=true,
the buffer holding output tensors was freed before ggml_backend_tensor_get()
could read them, causing "CUDA error: invalid device ordinal".

Fixes:
1. alloc_compute_buffer() now returns graph via out_gf parameter for reuse
2. compute() reuses graph from alloc_compute_buffer to avoid tensor mismatch
3. copy_data_to_backend_tensor() skips tensors without allocated buffers
4. All TRUE per-layer streaming stages now use free_compute_buffer_immediately=false
   and manually call free_compute_buffer() after reading outputs

Affected models: Flux, MMDiT, Anima, UNet, ZImage, QwenImage
- Add estimate_vae_encode_vram() for VRAM estimation before encoding
- Add smart_offload_for_vae_encode() to offload cond_stage and diffusion
  models before VAE encode operations
- Call smart_offload_for_vae_encode() before all encode_first_stage() and
  vae_encode() calls across generate_image and generate_video paths:
  - img2img init image encoding
  - ref image encoding (for edit modes)
  - control net image encoding
  - video frame encoding (WAN, VACE, Anima)

This prevents OOM during VAE encoding of large images by freeing VRAM
from models not needed during the encode phase. With layer_streaming mode,
this allows encoding images that previously caused OOM.
Key changes:
- Add async prefetch methods to LayerExecutionEngine: prefetch_layer(),
  wait_for_prefetch(), wait_for_all_prefetches()
- Add AsyncLoadState struct and async layer load methods to TensorRegistry:
  start_async_layer_load(), complete_async_layer_load()
- Use ggml_backend_tensor_copy_async() to overlap memory transfers with
  GPU computation during TRUE per-layer streaming
- Update qwen_image.hpp to start prefetching next block before computing
  current block, reducing GPU idle time
- Fix sd_offload_config_t initialization with correct field order
- Offload diffusion model layers to CPU at startup when layer_streaming
  mode is enabled, freeing VRAM for LLM/CLIP conditioning

This enables overlapped memory transfers during per-layer streaming,
reducing periodic GPU pauses caused by blocking PCIe transfers.
Adds async prefetching pattern to overlap PCIe memory transfer with GPU
computation during layer streaming. Before computing each block, prefetch
the next block's weights asynchronously.

Models updated:
- Flux: double_blocks and single_blocks loops
- UNet: input_blocks and output_blocks loops
- MMDiT: joint_blocks loop
- ZImage: layers loop
- Anima: blocks loop

Note: WAN model doesn't have true per-layer streaming yet (uses full graph).
When using CFG (multiple model calls per diffusion step), the VRAM check
didn't account for layers already loaded on GPU. This caused the second
CFG call to see full VRAM and switch to slow TRUE per-layer streaming.

Now tracks already_on_gpu and only checks remaining_to_load against
available VRAM. Second+ CFG calls complete in ~0.15s instead of 3+ seconds.

Applied to all 7 architectures: Flux, UNet, MMDiT, ZImage, Anima, WAN, QwenImage
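
The accounting change can be shown with a toy model. In the sketch below, `LayerState`, `fits_naive()` and `fits_accounting_for_resident()` are hypothetical names; the point is only that the second check subtracts layers already on the GPU before comparing against free VRAM.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct LayerState {
    std::uint64_t bytes;
    bool on_gpu;
};

inline std::uint64_t total_bytes(const std::vector<LayerState>& layers) {
    std::uint64_t total = 0;
    for (const LayerState& l : layers) total += l.bytes;
    return total;
}

// Bytes that still have to be uploaded; resident layers cost nothing.
inline std::uint64_t remaining_to_load(const std::vector<LayerState>& layers) {
    std::uint64_t remaining = 0;
    for (const LayerState& l : layers) {
        if (!l.on_gpu) remaining += l.bytes;
    }
    return remaining;
}

// Old behaviour: compares the whole model against free VRAM on every call.
inline bool fits_naive(const std::vector<LayerState>& layers, std::uint64_t free_vram) {
    return total_bytes(layers) <= free_vram;
}

// New behaviour: only the layers that are not yet resident need room.
inline bool fits_accounting_for_resident(const std::vector<LayerState>& layers,
                                         std::uint64_t free_vram) {
    assert(remaining_to_load(layers) <= total_bytes(layers));
    return remaining_to_load(layers) <= free_vram;
}
```

On the second CFG call all layers are already resident and free VRAM is small, so the naive check fails and forces the slow path, while the corrected check sees nothing left to load and keeps the fast path.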
fszontagh added 3 commits May 18, 2026 10:51
Rewrites MMDiTRunner::compute_streaming_true on top of
LayerStreaming::run_streaming using the standard three-lambda pattern
(input_stage / per-layer factory / output_stage), replacing the bespoke
inline _global-load + per-block compute loop.

The previous implementation was already a real per-block streamer (not a
placeholder): Stage 1 ran forward_input_stage to produce x / context /
c_mod and persisted them into pinned host buffers, Stage 2 iterated
joint_blocks.{i} with sync load + wait_for_prefetch + move_layer_to_cpu,
and Stage 3 ran forward_output_stage + unpatchify_and_crop. The new
factory mirrors that behavior verbatim against the shared executor:

- input_stage.post_compute reads back x / c_mod (and context when
  non-null) into persistent_* member buffers; resident_joint_blocks_ is
  decided on first invocation as before for logging parity.
- The per-block factory rebinds x_in / c_mod_in / context_in from host
  buffers each iteration (prev_gpu_output ignored; no chunk-K dispatch
  path for MMDiT today) and reads layer_x_out_ / layer_context_out_
  back via ggml_backend_tensor_get in post_compute.
- skip_layers is honored via a trivial no-op stage (matching Flux's
  pattern) so persistent activations pass through unchanged, mirroring
  the previous `continue` semantics.
- output_stage.build_graph runs forward_output_stage + unpatchify_and_crop;
  the executor writes results into output / output_ctx.

Streaming state (stage1_*_out_, layer_*_out_, x_ne_ / context_ne_ /
c_mod_ne_, persistent_* buffers + fallback vectors) is lifted into
MMDiTRunner members so the captured-by-this lambdas can survive across
stages.

Net: -41 lines.
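
For readers unfamiliar with the three-lambda pattern, the following is a minimal host-only sketch of its shape. It does not reproduce the signature of `LayerStreaming::run_streaming`; `StreamingStages`, `Activations` and `run_streaming_sketch()` are hypothetical, and activations are plain float vectors standing in for tensors persisted in host buffers between stages.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using Activations = std::vector<float>;  // stand-in for host-persisted tensors
using LayerFn = std::function<Activations(const Activations&)>;

struct StreamingStages {
    std::function<Activations()> input_stage;   // produce initial activations
    std::function<LayerFn(int)> layer_factory;  // build the stage for block i
    LayerFn output_stage;                       // final norm / projection
};

// Run input stage, then each block in order, then the output stage.
// Blocks listed in skip_layers pass the activations through unchanged.
inline Activations run_streaming_sketch(const StreamingStages& s,
                                        int num_layers,
                                        const std::vector<int>& skip_layers) {
    assert(s.input_stage && s.layer_factory && s.output_stage);
    Activations act = s.input_stage();
    for (int i = 0; i < num_layers; ++i) {
        const bool skipped =
            std::find(skip_layers.begin(), skip_layers.end(), i) != skip_layers.end();
        if (skipped) continue;
        act = s.layer_factory(i)(act);
    }
    return s.output_stage(act);
}
```

Keeping the per-block work behind a factory means the executor owns the loop (loading, prefetching, eviction), while each runner only describes how to build one block's computation.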
3-phase architecture (input_blocks -> middle_block -> output_blocks)
with skip connections persisted to host across phases. Treats the
diffusion as num_input + 1 + num_output 'layers' for the executor;
the per-block factory dispatches by phase to the existing
forward_input_block / forward_middle_block / forward_output_block
helpers (which already encode the DownSample/UpSample type-dispatch
fixes from commit dbd4a35).

No smoke test in this commit -- memory budget; full smoke matrix
runs after Task 13.
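
The flat layer numbering used for UNet can be sketched as a small index mapping. `Phase`, `BlockRef` and `unet_block_for_layer()` are hypothetical names; the sketch only assumes the layout stated above, with input blocks first, one middle block, then output blocks.

```cpp
#include <cassert>

enum class Phase { Input, Middle, Output };

struct BlockRef {
    Phase phase;
    int index;  // index within the phase
};

// Map a flat executor layer index in [0, num_input + 1 + num_output)
// to the phase and per-phase block index it dispatches to.
inline BlockRef unet_block_for_layer(int layer, int num_input, int num_output) {
    assert(num_input >= 0 && num_output >= 0);
    assert(layer >= 0 && layer < num_input + 1 + num_output);
    if (layer < num_input) return {Phase::Input, layer};
    if (layer == num_input) return {Phase::Middle, 0};
    return {Phase::Output, layer - num_input - 1};
}
```

With 12 input and 12 output blocks, layer 12 is the middle block, layer 13 is output block 0 and layer 24 is output block 11. The skip connections produced in the input phase still have to be persisted separately so the output phase can consume them.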
After migrating all 8 runners to LayerStreaming::run_streaming
(Tasks 5-12), sweep each runner for code orphaned by the migration:
member variables that no longer have a reader, private helpers that
only the old compute_streaming_true called, etc.

- hidream_o1: drop unused persistent_inputs_embeds_fallback.
- qwen_image: drop logging-only resident_transformer_blocks_ and
  the old StreamingState struct + copy_tensor_to_storage /
  create_tensor_from_storage helpers.
- flux: drop logging-only resident_double_blocks_ /
  resident_single_blocks_, plus Flux::StreamingContext and the
  forward_preprocessing / forward_double_block(StreamingContext) /
  forward_single_block(StreamingContext) / forward_postprocessing
  helpers and the FluxRunner::streaming_ctx_ member that used them.
- anima: drop logging-only resident_blocks_.
- mmdit: drop logging-only resident_joint_blocks_.
- unet: drop cfg.keep_layers_behind override (only consulted by the
  unused LayerExecutionEngine::execute_streaming path).

Kept intentionally: z_image's chunk_graph_ / dispatch_resident_chunk /
resident_layer_count_ (chunk-K dispatch lives in z_image's Stage 1
post_compute), and all forward_* inner-model helpers (called by the
migrated lambdas). The two public forward_double_block /
forward_single_block overloads in flux.hpp (the ones returning
ggml_tensor* / std::pair, not bool) stay — those are the ones the
migrated lambdas call.
@AndriiParf

Good evening (or good day), thanks for this awesome PR!! I tried it on my own system (Vulkan, AMD RX 580 8GB, Arch Linux, flux-2-klein-9b-Q8_0.gguf) and got a segmentation fault (core dumped). I don't know much about ML or C++, so I asked an AI, which suggested the following:

Cause of the bug:

For Flux.2 and Flux.2 Klein models, share_modulation is set to true. This means the individual transformer blocks (DoubleStreamBlock and SingleStreamBlock) do NOT instantiate their own local modulation blocks (img_mod, txt_mod, modulation). Instead, they share global modulation blocks initialized in the parent Flux class.

In the original non-streaming path, these global modulations are precalculated and passed to block->forward(...).
However, in the streaming path (compute_streaming_true), ds_img_mods, ds_txt_mods, and ss_mods are passed as empty vectors. This forces DoubleStreamBlock::forward and SingleStreamBlock::forward to fall back to their local img_mod objects, which are nullptr under shared modulation, triggering an instant null-pointer dereference (Segfault).

Suggested Fix:

To resolve this, we can make forward_double_block and forward_single_block inside struct Flux automatically compute these shared modulations on-the-fly when share_modulation is active and the incoming vectors are empty.

In src/flux.hpp, modify forward_double_block:

        std::pair<ggml_tensor*, ggml_tensor*> forward_double_block(GGMLRunnerContext* ctx,
                                                                    int block_idx,
                                                                    ggml_tensor* img,
                                                                    ggml_tensor* txt,
                                                                    ggml_tensor* vec,
                                                                    ggml_tensor* pe,
                                                                    ggml_tensor* txt_img_mask,
                                                                    std::vector<ModulationOut>& ds_img_mods,
                                                                    std::vector<ModulationOut>& ds_txt_mods) {
            if (params.share_modulation && ds_img_mods.empty()) {
                auto double_stream_modulation_img = std::dynamic_pointer_cast<Modulation>(blocks["double_stream_modulation_img"]);
                auto double_stream_modulation_txt = std::dynamic_pointer_cast<Modulation>(blocks["double_stream_modulation_txt"]);
                ds_img_mods = double_stream_modulation_img->forward(ctx, vec);
                ds_txt_mods = double_stream_modulation_txt->forward(ctx, vec);
            }
            auto block = std::dynamic_pointer_cast<DoubleStreamBlock>(blocks["double_blocks." + std::to_string(block_idx)]);
            auto img_txt = block->forward(ctx, img, txt, vec, pe, txt_img_mask, ds_img_mods, ds_txt_mods);
            return img_txt;
        }

And modify forward_single_block:

        ggml_tensor* forward_single_block(GGMLRunnerContext* ctx,
                                           int block_idx,
                                           ggml_tensor* txt_img,
                                           ggml_tensor* vec,
                                           ggml_tensor* pe,
                                           ggml_tensor* txt_img_mask,
                                           std::vector<ModulationOut>& ss_mods) {
            // Same on-demand fallback for the single-stream shared modulation.
            if (params.share_modulation && ss_mods.empty()) {
                auto single_stream_modulation = std::dynamic_pointer_cast<Modulation>(blocks["single_stream_modulation"]);
                ss_mods = single_stream_modulation->forward(ctx, vec);
            }
            auto block = std::dynamic_pointer_cast<SingleStreamBlock>(blocks["single_blocks." + std::to_string(block_idx)]);
            return block->forward(ctx, txt_img, vec, pe, txt_img_mask, ss_mods);
        }

After following its advice, everything ran smoothly and quickly. Sorry for using AI.

@nArn0

nArn0 commented May 19, 2026

I was just hit by the same problem while testing Wan2.2 with the exact same card on the exact same board (PCIe 3).

I can make Wan2.2 work with WAN2GP, but I would really love to make it work with sd.cpp instead, in order to use the GGUF version of the model.

fszontagh added 5 commits May 22, 2026 10:06
25 new upstream commits since the previous sync. Highlights:

  3a8788c refactor: unify extra argument parsing (leejet#1540)
  449165c feat: stream LTX VAE temporal tile decoding (leejet#1539)
  adaa599 Feat: Temporal tile custom size with overlap (leejet#1510)
  2e35146 perf: run LTX audio VAE decode in one ggml graph (leejet#1538)
  47d8198 feat: add taeltx2_3_wide support (leejet#1535)
  ef92a00 feat: add graph cut markers for LTXAV transformer (leejet#1534)
  b3374e6 feat: add LTX spatial latent upscale hires support (leejet#1533)
  bdd937f feat: add taeltx2/taeltx2.3 support (leejet#1531)
  c51ec7c fix: always load runtime lora params on runtime backend (leejet#1532)
  e7eb92f feat: add Gradient Estimation sampler (leejet#1484)
  50134e5 refactor: split guidance composition (leejet#1506)
  e43b24c feat: add ltx2.3 flf2v support (leejet#1505)
  b706d68 fix: restore singleton dims for LLM outputs (leejet#1518)
  b758b7d fix: only enable TAE after successful load (leejet#1517)
  f683c88 feat: make negative max_vram control the amount of spare vram (leejet#1503)
  baf7eda refactor: minify vocab files (leejet#1509)
  22c8c40 sync: update ggml (leejet#1520)
  plus 8 CI / docs / docker fixes.

Conflict resolution:

src/stable-diffusion.cpp had a single conflict in the video-generation
post-sampling block. Our HEAD had the smart-offload-for-VAE-decode
hook (move diffusion model to CPU when free_params_immediately is
false and VRAM is tight). Upstream added the LTX spatial latent
upscale hires path that runs a second sampler invocation. Both pieces
are needed and they're complementary: smart offload is video-agnostic
and runs only on the non-upscale code path; the upscale block manages
its own params lifecycle through its own sampler+free invocation.

Resolution: upstream's `if (latent_upscale_enabled)` block kept as-is,
and our smart-offload + free_params_immediately handling moved into
the matching `else` branch. No semantic change to either feature.

All other touched files (include/stable-diffusion.h, src/llm.hpp,
src/ggml_extend.hpp, src/diffusion_model.hpp, examples/common/...)
auto-merged cleanly. Our additions (friend declaration in ggml_extend
for the streaming executor, forward_layer_block / forward_final_norm
helpers on LLM::TextModel, offload_config field on sd_ctx_params_t)
all interoperate with the upstream changes. The build is clean.

Smoke test: Z-Image-Turbo Q8 generates a valid cat image at 512x512
after the merge. Host CUDA driver currently shows NVML version
mismatch (220s wallclock); requires driver reload to re-validate
expected timings.
Models with share_modulation=true (Flux 2, Flux 2 Klein) do NOT
instantiate local img_mod / txt_mod / modulation blocks inside
DoubleStreamBlock and SingleStreamBlock (flux.hpp:272, 285). Their
modulation is computed once at the parent Flux level and threaded
into each block via ds_img_mods / ds_txt_mods / ss_mods vectors.

The non-streaming path computes these in forward_input_stage and
passes them all the way through forward_orig. The layer-streaming
path, however, has always constructed FRESH empty vectors inside its
per-block factory (preserved across the Task 8 migration). When the
block forward sees an empty mod vector, it falls back to its local
modulation block — which is nullptr under share_modulation, triggering
a null-pointer dereference and an immediate segfault.

Bug surfaced for the first time when flux-2-klein-9b-Q8_0 hit our
streaming path. PR leejet#1477 comment from @AndriiParf with stack-trace
analysis from an AI tool, confirmed by reading the code: empty
ds_img_mods/ds_txt_mods/ss_mods at the per-block call site, share_modulation
guard in the DoubleStreamBlock/SingleStreamBlock constructors that
skips local-modulation instantiation, block->forward unconditional
dereference of the local pointer.

Fix: in Flux::forward_double_block and Flux::forward_single_block,
when share_modulation is active and the incoming mod vectors are
empty, compute the shared modulations from `vec` on demand using the
parent-level Modulation blocks (always _global resident, so always on
GPU during streaming). Adds one Linear forward per block per step
(sub-millisecond aggregate), but avoids the much-more-invasive
alternative of persisting Stage-1 ModulationOut tensors to host
buffers and re-binding them per layer.

Coarse-stage path unaffected: forward_input_stage still precomputes
the mods and the non-empty vectors short-circuit the on-demand guard.

A separate report from @nArn0 on PR leejet#1477 describes a WAN 2.2 segfault
on the same RX 580 / Vulkan / PCIe 3 hardware. WAN's transformer is
structurally different (no share_modulation; modulation is a per-block
weight parameter at params["modulation"]). That report likely involves
either Vulkan-specific streaming hazards already documented in
vulkan_compat.md notes, or a different latent issue in the per-block
streaming path that Task 9's migration newly exercises. Not addressed
here; needs a stack trace to localize.
8 new upstream commits, all auto-merged cleanly (no conflicts).

  1ceb5bd fix: package ROCm BLAS runtime in Windows artifacts (leejet#1562)
  202c615 fix: use flux flow prediction for LTXAV (leejet#1561)
  a397e03 feat: add Longcat-Image / Longcat-Image-Edit support (leejet#1053)
  72e512a fix: make macOS binaries use relocatable rpaths (leejet#1552)
  0baf721 feat: add LTX temporal latent upscaler support (leejet#1551)
  645e6e9 feat: add LTX rational latent upscaler (leejet#1549)
  cbf9219 fix: strip trailing latent channels for preview decode (leejet#1548)
  8cf55a3 fix: load TAESD preview-only model correctly (leejet#1547)

Highlights for our refactor:

- src/flux.hpp: Longcat-Image adds 8 lines around the Flux constructor
  (new model variant flag); did not touch our share_modulation guards
  in forward_double_block / forward_single_block (483cebc), which
  auto-merged cleanly above the new code.
- src/ggml_extend.hpp: Longcat added 12 lines unrelated to our friend
  declaration for LayerStreaming::run_streaming.
- src/stable-diffusion.cpp: LTXAV flow-prediction fix and Longcat
  wiring auto-merged with our smart_offload_for_vae block.
- src/conditioner.hpp, src/anima.hpp: small Longcat additions.
Longcat-Image (leejet#1053) added an is_longcat bool parameter to
Rope::gen_flux_pe and updated the non-streaming Flux::forward call
site (line 1607) but not the two streaming call sites (compute()
helper around line 1702 and compute_streaming_true around line 1955).
The merge auto-resolved cleanly because none of our streaming code
overlapped textually with the upstream changes, but the new signature
broke the two streaming PE-generation sites.

Pass sd_version_is_longcat(version) at both call sites, matching the
non-streaming path.
Z-Image's compute_resident_block_count call was using the default
768 MB compute_buffer_reserve, which only covers the per-layer
streamed mini-graph's compute buffer. After chunk_K resident layers
load, only ~1.2 GB of VRAM remains — not enough to reload the Q8 LLM
cond_stage (~1.5 GB + 500 MB safety = 2 GB) between back-to-back
queued jobs. The reload check at stable-diffusion.cpp:2661 fails,
cond_stage falls back to on-demand load on the next job's first
inference, and the cold-cache cond_stage compute costs ~10-15s per
job — visible as the 1m33s -> 1m46s regression on the production
restapi.

Bump the explicit compute_buffer_reserve to 2 GB. The engine's
budget formula (lines 359-360 of layer_streaming.hpp) folds this in
on top of its own prefetch + safety reserves. Result: ~3 fewer
resident layers in chunk_K (typical config drops 16->13), trading
~1.4s of extra per-step streaming work for ~10-15s saved per
inter-batch cond_stage reload. Net win on the queued-workload
case the restapi exercises; near-neutral on one-shot single-job
runs (chunk-K is still active, just smaller).

Reported via the production restapi: 1m33s pre-merge -> 1m46s
post-merge for the same Z-Image-Turbo 9-step config. The 13s
delta is exactly the cond_stage on-demand load + cold-cache
inference cost.
@Pix13

Pix13 commented May 26, 2026

Do you think you could make this feature work directly from disk instead of RAM <-> VRAM? I know it's niche, but there is a powerful piece of hardware with 16 GB of shared RAM/VRAM (basically a mini Strix Halo) named the BC-250. I already vibe-coded an offload-to-disk feature that works (it could run qwen-image-edit-2511-Q4_0.gguf, 12 GB, on the board, for example). Of course the NVMe is the caveat there, but if your architecture permits it without too much hassle, it would be incredible :) (I guess I'll maintain my branch otherwise, rebasing on your work.)

@GreenShadows

That would quickly destroy the SSD, unless you have something like Optane.

@Pix13

Pix13 commented May 26, 2026

> That would quickly destroy the SSD, unless you have something like Optane.

How would reading destroy the SSD? Unless I'm mistaken, this streams part of the model to VRAM and no generated data goes "back", so it would only be read operations?

(I can live with killing the NVMe even if that's not the case; they're consumables.)

fszontagh added 2 commits May 26, 2026 13:49
Diagnostic from SDCPP_STREAM_PROFILE on the production restapi
running Z-Image Turbo 9 steps with two LoRAs at 688x1024 showed:

- cond_stage CPU→GPU upload: 27.9s every job (8.4 GB at 280 MB/s,
  disk-read speed not PCIe speed)
- Per-step layer prefetch 'advance': 1.27-1.4s for 13-14 streamed
  layers, i.e. ~98ms per layer of CPU work inside what should be
  an async-only cudaMemcpyAsync kick-off

Root cause: with upstream leejet#1414 (mmap weight loading) enabled, every
tensor->data points into an anonymous mmap region. CUDA cannot
cudaMemcpyAsync from non-pinned host memory — it falls back to a
synchronous staged copy through an internal bounce buffer. The bounce
buffer copy is what gets billed as 'advance' time on the streaming
hot path, and as the 28s cond_stage upload on first inference.

Fix lives in alloc_params_buffer: when params will live on a different
backend than the runtime (params_backend != runtime_backend, i.e. any
keep_*_on_cpu flag or any --offload-mode > none), the existing
pre-merge pinned-host allocation path was being short-circuited by the
mmap early-return I added during the previous upstream sync. Restore
the pinned-host alloc and add a one-shot copy from the saved mmap
pointers into the new buffer right after alloc_ctx_tensors_from_buft.

The mmap region stays alive (ModelLoader owns it); we just point each
tensor->data at our pinned-host copy instead. After this:
- One-time copy at alloc_params_buffer time (~5s for 8 GB on a fast
  SSD-cached read), trading ~5s of model-load latency for...
- ~50s saved per generation job (28s cond_stage + ~25s of accumulated
  per-step prefetch overhead)

Coarse-resident (params==runtime backend) path is unchanged: mmap data
stays in place, no copy, no extra memory.

Reported via the production restapi: 1m33s pre-mmap baseline -> 2m14s
post-mmap. After this commit the streaming path should DMA at PCIe
speed again; expected return to ~1m33s plus a small one-time load
overhead.
Previous fix only triggered when every tensor in params_ctx was
mmap-backed (all_have_data == true). In practice, ModelLoader::
mmap_tensors() skips tensors whose on-disk shape or dtype doesn't
match the runner's expected tensor (typical for quantized models
where the runner's params_ctx has a different dtype than the file's
storage form). Those tensors stay nullptr and get filled later by
load_tensors().

Net effect of the previous fix: it never ran on the real workload.
The 28s → 9.5s qwen3 upload improvement on the user's restapi was
purely OS page-cache warming, not the fix.

Replace 'all have data' with a per-tensor snapshot:
  - Record which tensors mmap_tensors() backed (t->data != nullptr).
  - If we'll allocate a fresh buffer (need_offload), null out every
    tensor's buffer + data so alloc_ctx_tensors_from_buft places
    every tensor (mmap-backed or not) in our new buffer.
  - After alloc, copy bytes from saved mmap addresses into the new
    tensor locations for tensors that were mmap-backed.
  - Tensors that weren't mmap-backed stay null and load_tensors()
    fills them in shortly after.
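
The per-tensor snapshot steps above can be sketched roughly as follows. This is an illustration only, not the code from the commit: `FakeTensor`, `snapshot_and_rebind`, and the plain `std::vector<uint8_t>` standing in for the pinned-host backend buffer are all hypothetical names and simplifications.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a ggml tensor: only the fields the sketch needs.
struct FakeTensor {
    void*  data   = nullptr;  // non-null if the loader mmap-backed this tensor
    size_t nbytes = 0;
};

// Places every tensor in `new_buffer` and copies bytes for the tensors that
// were mmap-backed. Returns how many tensors were copied.
// `new_buffer` stands in for the pinned-host allocation (assumption).
size_t snapshot_and_rebind(std::vector<FakeTensor>& tensors,
                           std::vector<uint8_t>& new_buffer) {
    // Snapshot the file-backed addresses, then null them out so that every
    // tensor (mmap-backed or not) gets a slot in the new buffer.
    std::vector<const void*> saved(tensors.size());
    size_t total = 0;
    for (size_t i = 0; i < tensors.size(); ++i) {
        saved[i]        = tensors[i].data;
        tensors[i].data = nullptr;
        total += tensors[i].nbytes;
    }
    new_buffer.assign(total, 0);

    // Assign slots and copy the bytes of tensors that had a saved address.
    size_t offset = 0;
    size_t copied = 0;
    for (size_t i = 0; i < tensors.size(); ++i) {
        tensors[i].data = new_buffer.data() + offset;
        if (saved[i] != nullptr) {
            std::memcpy(tensors[i].data, saved[i], tensors[i].nbytes);
            ++copied;
        }
        // Tensors without a saved address are left zeroed here; in the real
        // flow a later load step would fill them.
        offset += tensors[i].nbytes;
    }
    return copied;
}
```

The return value corresponds to the X in a 'copied X/Y mmap tensors' style log line; the sketch ignores alignment and backend buffer types, which a real allocator would have to respect.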

Coarse case (all_mmap && !need_offload) still early-returns. Pure
non-mmap case (all saved_data null, need_offload) does the regular
pinned-host alloc with no copy.

Log line is now 'copied X/Y mmap tensors (Z MB)' so the journal shows
how many of the runner's tensors were file-backed vs computed/dequant
fill-ins later.
@fszontagh
Contributor Author

> Do you think you could make this feature work directly from disk instead of RAM <-> VRAM? I know it's niche, but there is a powerful piece of hardware with 16 GB of shared RAM/VRAM (basically a mini Strix Halo) named the BC-250. I already vibe-coded an offload-to-disk feature that works (it could run qwen-image-edit-2511-Q4_0.gguf, 12 GB, on the board, for example). Of course the NVMe is the caveat there, but if your architecture permits it without too much hassle, it would be incredible :) (I guess I'll maintain my branch otherwise, rebasing on your work.)

I'm already working on something similar. This feature comes from #1414.
It can be enabled or disabled with the CLI parameters --mmap / --no-mmap. It is enabled by default.
When --mmap is enabled and you select layer streaming, it can slow down the entire process if the model is read from an HDD. I only tested on an HDD (worst case). In theory, placing the model on tmpfs / memfs can speed this up if you have enough memory for it.

In my REST API project it can be enabled or disabled at model load time:
POST /models/load │ options.enable_mmap: true/false. Default: true.

In theory this does not kill the SSD; it is just normal reading.

Owner

@leejet leejet left a comment


The overall direction makes sense, but I don't think this should merge in its current architecture.

The implementation currently introduces:

  • a parallel layer-streaming execution path,
  • additional graph partitioning logic,
  • and many new user-facing policy flags.

I believe this should instead build on top of the existing graph-cut/max-vram execution model and implement streaming purely as a residency/prefetch policy over existing graph segments.

Otherwise we risk maintaining two separate execution/partitioning systems that may diverge over time.

return 1;
};

auto on_offload_mode_arg = [&](int argc, const char** argv, int index) {
Owner


I don't think this feature should introduce a new offload execution mode or expose this many new CLI flags.

We already have the necessary user-facing control surface via:

  • backend
  • params-backend
  • max-vram

The runtime should derive the streaming/offload policy automatically from those constraints.

Most of the newly added flags expose internal scheduling policy rather than user intent, which significantly increases API complexity and creates another execution path that can diverge from existing graph-cut execution semantics.

I would strongly recommend redesigning this as a scheduling/prefetch policy layered on top of the existing graph-cut execution plan instead of introducing a parallel “layer streaming” execution mode.

Comment thread src/flux.hpp
}
}

struct StreamingInputResult {
Owner


This is where I think the design diverges from the existing execution model.

The graph has already been segmented by graph cut marks and the max-vram planner. Rebuilding model-specific “layer subgraphs” here duplicates graph partitioning logic and introduces a second segmentation mechanism that will likely drift over time.

Layer streaming should operate on top of existing graph-cut segments and only add:

  • prefetch scheduling
  • residency decisions
  • eviction/reload policy

It should not manually reinterpret model structure and create a parallel partitioning path.

@AndriiParf

I tried wan 2.2 on my PC (RX 580 8GB, 32GB DDR4, Arch Linux, q4_k_m low and high noise models) and initially got a black screen, but after following the AI's recommendations, the generation process started and everything worked fine. AI report:

Bug 1: high_noise_diffusion_model bypasses layer streaming config

When running a dual-model setup (using both --diffusion-model and --high-noise-diffusion-model), layer-by-layer streaming is only enabled for the primary diffusion_model. The high_noise model falls back to the default non-streaming path. This causes an immediate OOM/GPU hang as soon as the high-noise sampling phase begins.

Fix 1:

In src/stable-diffusion.cpp (around line 725), enable streaming for the high-noise model right after its parameters are bound:

            if (high_noise_diffusion_model) {
                high_noise_diffusion_model->set_max_graph_vram_bytes(max_graph_vram_bytes);
                get_param_tensors(high_noise_diffusion_model, module_can_mmap(SDBackendModule::DIFFUSION));
                
                // Enable streaming for high_noise_diffusion_model
                if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) {
                    if (high_noise_diffusion_model->supports_layer_streaming()) {
                        LOG_INFO("Enabling layer-by-layer streaming for high_noise_diffusion_model");
                        high_noise_diffusion_model->enable_layer_streaming(
                            offload_config.streaming_prefetch_layers,
                            offload_config.streaming_min_free_vram);
                    }
                }
            }

Also, update the initial CPU-offloading block (around line 1058) to offload the high-noise layers immediately too:

        if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) {
            if (diffusion_model && diffusion_model->is_layer_streaming_enabled()) {
                LOG_INFO("Offloading diffusion model layers to CPU for layer streaming");
                diffusion_model->offload_streaming_layers();
            }
            if (high_noise_diffusion_model && high_noise_diffusion_model->is_layer_streaming_enabled()) {
                LOG_INFO("Offloading high_noise_diffusion_model layers to CPU for layer streaming");
                high_noise_diffusion_model->offload_streaming_layers();
            }
        }

Bug 2: Out-of-bounds vector access inside WanRunner

This is what caused the immediate segmentation fault on systems with strict bounds checking.

When running Wan2.2-I2V-A14B (which doesn't have VACE blocks), stage1_c_out_ is nullptr and c_n = 0.
Therefore, the sizes vector passed to ensure_pinned_act_buffers has only 5 elements, and the returned ptrs vector also has a size of 5 (valid indices are 0 to 4).

However, the line:

        persistent_c_ = (c_n > 0) ? ptrs[5] : nullptr;

still evaluates the ptrs[5] index expression. On toolchains with strict assertions (like Arch Linux's default GCC package which has -D_GLIBCXX_ASSERTIONS enabled by default), accessing index 5 of a size 5 vector triggers an immediate abort, resulting in a segmentation fault.

Fix 2:

In src/wan.hpp (around line 765), change:

        persistent_c_ = (c_n > 0) ? ptrs[5] : nullptr;

to:

        persistent_c_ = (c_n > 0 && ptrs.size() > 5) ? ptrs[5] : nullptr;

Tip for other low-VRAM/RAM testers:

Standard VAE decoding for 33 frames of 832x480 WAN video requests a massive 15.5 GB RAM compute buffer (wan_vae compute buffer size: 15501.51 MB(RAM)), which easily triggers the Linux OOM killer on 32GB RAM systems.
Replacing the heavy VAE with TAEHV (--tae taew2_1.safetensors) brings the VAE compute buffer down to < 0.5 GB, allowing the entire pipeline to fit comfortably within RAM while streaming on an 8GB GPU.

fszontagh added a commit to fszontagh/stable-diffusion.cpp that referenced this pull request May 28, 2026
 scope)

Propagating stream_layers to every runner (cond_stage, VAE, clip_vision,
upscaler, ...) made each one claim chunk-K resident state in VRAM,
competing for the same budget. The LLM in particular grabs ~7 GB
permanent resident on a 12 GB card, leaving nothing for the diffusion
model's chunk-K — hence the cross-runner OOM in production.

PR leejet#1477's per-runner streaming code only existed in the diffusion
architectures (z_image, flux, unet, mmdit, qwen_image, wan, anima).
LLM/VAE/clip_vision/upscaler used the upstream per-segment offload
without chunk-K. We now match that scope: only diffusion_model and
high_noise_diffusion_model get set_stream_layers_enabled(true).

All other runners still use --max-vram for upstream's graph-cut
walker, just without our chunk-K + prefetch additions. They stream
their weights per-segment as before; no permanent VRAM claim.

This is independent of (and complementary to) the per-runner free-VRAM
clamp in a0c07ef.
fszontagh added a commit to fszontagh/stable-diffusion.cpp that referenced this pull request May 28, 2026
When --stream-layers is on, the diffusion runner holds several GB of
chunk-K resident params across sampling steps to amortize H2D. Those
GB stay claimed in VRAM until the runner is destroyed. VAE decode
runs once at the end of generation and can need several GB of compute
buffer for full-resolution decode (e.g. 4.47 GB on Z-Image bf16 at
1024x688) — with the diffusion residency still claimed, VAE OOMs.

Add a public release_streaming_residency() trampoline on GGMLRunner
that exposes the existing protected restore_resident_params(). Call
it from decode_first_stage() on diffusion_model and
high_noise_diffusion_model right before the VAE run. The chunk-K
state gets rebuilt on the first sampling step of the next generation
(state_token comparison handles it).

This is what PR leejet#1477's per-runner streaming code did implicitly via
its layer registry lifecycle. Without it, our chunk-K is more
"sticky" than PR leejet#1477's was and starves downstream runners on
larger images.
fszontagh added a commit to fszontagh/stable-diffusion.cpp that referenced this pull request May 28, 2026
A runtime LoRA (lora_apply_mode = at_runtime, or AUTO on quantized
models) attaches a WeightAdapter to the runner. The adapter doesn't
mutate weights directly — it composes W + LoRA_diff at graph build
time via patch_weight() — but the resident GPU buffer captured by
chunk-K can still drift across batch images / sampling steps in
subtler ways (LoRA tensor lifetime, partial offload ordering). The
practical symptom: the FIRST image of a batch is correct, the SECOND
is colored static noise.

Our state_token hashes tensor *pointers*, not data, so it cannot
detect the modification. The safe behaviour is to release any prior
resident state and skip the rebuild when weight_adapter is set.
compute_streaming_segments_prefetch still engages, so the H2D-during-
compute overlap perf win remains (~10-15% of the no-LoRA streaming
win).
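
To illustrate why a pointer-based token cannot see in-place weight changes, here is a minimal sketch. It assumes an XOR-combined hash of tensor addresses; the function name `pointer_state_token` and the mixing constants are hypothetical and not taken from the branch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical token: XOR-combine a mixed form of each tensor's address.
// XOR is commutative, so the token does not depend on tensor order, and it
// never reads the bytes the pointers refer to.
uint64_t pointer_state_token(const std::vector<const void*>& tensors) {
    uint64_t token = 0;
    for (const void* p : tensors) {
        uint64_t x = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(p));
        x ^= x >> 33;  // cheap, invertible bit mixing
        x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;
        token ^= x;
    }
    return token;
}
```

With this kind of token, overwriting a weight buffer in place leaves the token unchanged, while adding or removing a tensor changes it. That matches the failure mode described above: the set of pointers is stable across batch images even when the effective weights are not.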

Verified on Z-Image bf16 1024x688, batch_count=2, lora_apply_mode=
at_runtime + a 170 MB Z-Image LoRA: both batch images come out
correct (vs. image 2 = noise without this fix).

PR leejet#1477 didn't hit this because they cache the compute *graph*
(LayerStreaming::ChunkGraph in chunk_graph.hpp), not the weight data —
the graph re-reads still-resident GPU weights each step. Porting that
chunk-graph cache would let chunk-K coexist with runtime LoRA; deferred
to a follow-up (see unified_streaming_future_optimizations memory note).
fszontagh added a commit to fszontagh/stable-diffusion.cpp that referenced this pull request May 28, 2026
After commit d496ee7 disabled chunk-K residency for runtime LoRA, the
multi-LoRA case still produced colored static noise on batch image 2.
Root cause: the async prefetch path uses graph_compute_async +
per-segment pending offload to overlap H2D with compute on two CUDA
streams. When a runtime weight_adapter (MultiLoraAdapter) is attached,
its patch_weight() inserts extra graph nodes that read the adapter's
own tensors during forward. Those reads race the pending swap during
the host's offload_pending_params call and produce corrupted output.

Single-LoRA didn't trigger this in sd-cli smoke (which is why my
earlier verification missed it); the user's REST API workflow uses
two LoRAs (MultiLoraAdapter chains both diffs), which exposes the race.

Add weight_adapter == nullptr to the prefetch_enabled gate. When
LoRA is active, fall through to the synchronous segment loop, which
serializes H2D and compute on the host thread and matches PR leejet#1477's
no-prefetch behaviour for LoRA workloads.

Verified on Z-Image bf16 1024x688, batch_count=2, lora_apply_mode=
at_runtime + two Z-Image LoRAs (zimage_alexandrabotez_v1@0.7 +
ZITnsfwLoRAv2@0.2): both batch images come out clean and distinct.
Wallclock 152s (vs ~93s with prefetch but broken) — the honest cost
of the safe fallback.
fszontagh added a commit to fszontagh/stable-diffusion.cpp that referenced this pull request May 29, 2026
…egment headroom

Four connected fixes that move all streaming workloads onto a single
validated path:

1. annotate_residency() now resets seg.residency = STREAMED at entry.
   The plan is cached across compute() calls; without the reset, RESIDENT
   marks from earlier (larger-budget) calls persisted when subsequent
   calls used a smaller budget and would have decided K=0.

2. resolve_graph_cut_plan always passes the full effective budget to
   the planner. The previous (effective - reserve) / 4 shrinking was
   there to fit two segments + compute buffer for the async-prefetch
   overlap. But the smaller merged segments produced visibly broken
   output (white frames, colored static) whenever batch_count > 1
   combined with non-default --guidance + --flow-shift OR runtime LoRA
   composed extra graph nodes against them — FP error accumulates
   across the extra boundary-cache roundtrips. Full budget produces
   walker-style large merged segments, which is the validated configuration.

3. The chunk-K hook in compute_streaming_segments<T> now reserves room
   for the largest merged segment's params when sizing the residency
   budget. Without this reservation chunk-K could grow large enough
   that loading the running merged segment's partial-offload buffer
   OOMs. The new chunk_k_budget = max_graph_vram_bytes -
   largest_merged_segment_params; annotate_residency internally
   subtracts its own safety + compute_buffer pad.

4. The async-prefetch engagement gate is hard-disabled. The path
   (compute_streaming_segments_prefetch) stays in the file for future
   use but is currently dead — full-budget plans don't have room to
   hold two segments at once, and the historical perf win is gone
   under this configuration. Restoring prefetch requires solving the
   correctness issues with LoRA + guidance/flow_shift first (most
   likely via porting PR leejet#1477's chunk_graph.hpp).

Net result: streaming + LoRA / streaming + guidance + flow_shift /
batch_count > 1 all produce correct output. Per-step wallclock on
Z-Image bf16 1024x688 + 2 LoRAs is ~6.4 s (vs walker's ~3.5 s)
because we now do walker-shape compute but with no chunk-K savings on
the diffusion model itself (largest_merged_segment swallowed the
chunk-K budget). Smaller models or larger --max-vram caps will still
benefit from chunk-K.
fszontagh added a commit to fszontagh/stable-diffusion.cpp that referenced this pull request May 29, 2026
Builds on the foundation commit (--stream-layers + planner annotation
+ executor scaffolding) with:

## chunk-K residency

A parallel `resident_*` offload track on `GGMLRunner` keeps a fraction
of the diffusion model's params on GPU permanently across sampling
steps, amortising H2D over many invocations.

- Members: `resident_offload_ctx`, `resident_offload_pairs`,
  `resident_runtime_params_buffer`, `resident_param_set`,
  `resident_state_token` (parallel to the existing `partial_offload_*`
  per-segment track).
- `offload_resident_params(tensors)` / `restore_resident_params()` use
  the same dup-copy-swap idiom as `offload_partial_params` but write
  to the resident slot and persist across `compute()` calls.
- `offload_partial_params` filters tensors already in
  `resident_param_set` so per-segment offload skips them.
  `restore_resident_params` is hooked into `~GGMLRunner()` and
  `free_params_buffer()` to keep swap pointers valid through teardown.
- `compute_streaming_segments<T>` reads
  `graph_cut_plan_cache_.graph_cut_plan` (the unmerged base plan),
  annotates it, gathers the union of RESIDENT segments' param tensors,
  and offloads them once. Compute itself proceeds on the merged plan
  for fused-graph efficiency. A commutative pointer-hash state-token
  detects when a different plan is in play and rebuilds the resident
  set.

`annotate_residency` updates:

- "Any param-bearing segment exists" sanity replaces the
  `segments[0].input_param_bytes == 0` early-return (wrong for
  diffusion models whose first segment is a small prelude).
- Greedy cumulative-bytes loop handles heterogeneous segment sizes
  (small prelude + large transformer layers).
- Resets `seg.residency = STREAMED` at entry so cached plans don't
  carry forward stale RESIDENT marks from a previous larger-budget
  call.
- Don't reserve a `prefetch_segments * largest_segment` window;
  async prefetch is no longer used (see below).
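
The annotation rules listed above can be sketched as a small greedy pass. This is a simplified model under stated assumptions: `Segment`, `Residency`, and `annotate_residency_sketch` are hypothetical names, and stopping at the first segment that does not fit is an assumed policy, not something the commit message specifies.

```cpp
#include <cstddef>
#include <vector>

enum class Residency { STREAMED, RESIDENT };

// Hypothetical, simplified view of a graph-cut segment.
struct Segment {
    size_t    param_bytes = 0;
    Residency residency   = Residency::STREAMED;
};

// Marks leading param-bearing segments RESIDENT while their cumulative size
// fits in `budget_bytes`. Returns the number of RESIDENT segments (K).
size_t annotate_residency_sketch(std::vector<Segment>& segments,
                                 size_t budget_bytes) {
    bool any_params = false;
    for (auto& seg : segments) {
        // Reset first, so a cached plan cannot carry stale RESIDENT marks.
        seg.residency = Residency::STREAMED;
        any_params    = any_params || seg.param_bytes > 0;
    }
    if (!any_params) {
        return 0;  // nothing to keep resident
    }

    size_t used = 0;
    size_t k    = 0;
    for (auto& seg : segments) {
        if (seg.param_bytes == 0) {
            continue;  // param-free segment, nothing to pin
        }
        if (used + seg.param_bytes > budget_bytes) {
            break;  // assumed policy: stop at the first segment that does not fit
        }
        used += seg.param_bytes;
        seg.residency = Residency::RESIDENT;
        ++k;
    }
    return k;
}
```

Calling it twice on the same plan, first with a large budget and then with a small one, shows why the reset matters: without it, segments marked RESIDENT by the first call would stay marked even when the second call decides on a smaller K.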

## Multi-runner safety

- Per-runner free-VRAM clamp at compute time in
  `resolve_graph_cut_plan`. Each runner queries
  `ggml_backend_dev_memory(runtime_backend)` and clamps
  `effective_budget = min(max_vram, free - 512 MB)` per call.
  Without this, after the LLM committed ~7 GB chunk-K resident the
  diffusion runner still believed it had the whole budget and OOM'd.
- `--stream-layers` is restricted to diffusion runners only
  (diffusion_model + high_noise_diffusion_model) — matches PR leejet#1477's
  scope and avoids one-shot runners (LLM, VAE, clip_vision, upscaler)
  claiming permanent chunk-K state that starves the diffusion model.
- `GGMLRunner::release_streaming_residency()` (public trampoline to
  `restore_resident_params`) is called from `decode_first_stage()` on
  diffusion_model + high_noise_diffusion_model right before VAE
  decode. Without it the 6.5 GB chunk-K residency from sampling would
  starve VAE's compute buffer (~4.5 GB at full image resolution) and
  OOM at decode.
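
The per-call clamp in the first bullet amounts to a small piece of arithmetic, sketched below. The function name `clamp_effective_budget`, the constant name, and reading "512 MB" as 512 MiB are assumptions; the free-byte value is passed in as a plain number here rather than queried from the backend, and the guard against underflow when less than the reserve is free is an addition of this sketch.

```cpp
#include <algorithm>
#include <cstddef>

// Assumed headroom kept free on the device (512 MiB).
constexpr size_t kReserveBytes = 512ull * 1024 * 1024;

// effective_budget = min(max_vram, free - reserve), saturating at zero so the
// unsigned subtraction cannot wrap when free memory is below the reserve.
static size_t clamp_effective_budget(size_t max_vram_bytes, size_t free_bytes) {
    const size_t usable = free_bytes > kReserveBytes ? free_bytes - kReserveBytes : 0;
    return std::min(max_vram_bytes, usable);
}
```

For example, with an 8 GiB budget and 5 GiB actually free, the sketch yields 4.5 GiB, so a runner that starts after another has already pinned memory plans against what is really available.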

## `LORA_APPLY_AUTO` picks runtime when streaming or CPU-offload is on

Immediate mode bakes LoRA into the weights at load time by running a
forward pass over every weight tensor. That allocates a
full-model-size compute buffer (~11 GB on Z-Image bf16) on the runtime
backend in one shot and OOMs on any VRAM-constrained setup, which is
exactly the case `--stream-layers` / `--offload-to-cpu` exist for.
AUTO previously picked runtime only for quantized models; now
`stream_layers || offload_params_to_cpu` is also a trigger.
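
The AUTO decision described above can be sketched as a small pure function. The enum, struct, and function names here are hypothetical and only mirror the three conditions named in the text (quantized model, `stream_layers`, `offload_params_to_cpu`); the real code may structure this differently.

```cpp
// Hypothetical names; only the decision rule follows the description above.
enum class LoraApplyMode { AUTO, IMMEDIATELY, AT_RUNTIME };

struct LoraContext {
    bool model_is_quantized    = false;
    bool stream_layers         = false;
    bool offload_params_to_cpu = false;
};

// An explicit user choice is respected; AUTO falls back to runtime
// application whenever baking the LoRA into the weights is not viable.
static LoraApplyMode resolve_lora_apply_mode(LoraApplyMode requested,
                                             const LoraContext& ctx) {
    if (requested != LoraApplyMode::AUTO) {
        return requested;
    }
    if (ctx.model_is_quantized || ctx.stream_layers || ctx.offload_params_to_cpu) {
        return LoraApplyMode::AT_RUNTIME;
    }
    return LoraApplyMode::IMMEDIATELY;
}
```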

## Conservative streaming v4 (current shipping configuration)

After observing edge-case failures when prefetch was combined with
multi-LoRA, non-default `--guidance` / `--flow-shift`, and
`batch_count > 1`, the final configuration is:

- chunk-K residency skips itself when `weight_adapter != nullptr`.
  The state-token hashes tensor *pointers*, not data, so it can't
  detect MultiLoraAdapter modifications across batch images / steps;
  the symptom was colored static noise on batch image 2+.
- Async prefetch is hard-disabled (`prefetch_enabled = false`). The
  `compute_streaming_segments_prefetch<T>` implementation stays in
  the file for future reference but is currently dead. Two correctness
  problems forced the disable:
    * Multi-LoRA workloads: graph_compute_async + per-segment pending
      offload races MultiLoraAdapter's per-layer patch_weight reads.
    * batch_count > 1 + non-default --guidance + --flow-shift: the
      smaller merged segments required to fit two prefetched buffers
      in --max-vram accumulate FP error across the extra boundary-
      cache roundtrips → collapses to pure white frames.
- `resolve_graph_cut_plan` always passes the full `effective_budget`
  to the planner (no `/4` shrinking). Produces the upstream walker's
  large merged segments — the validated configuration.
- The chunk-K hook reserves room for the LARGEST merged segment's
  params: `chunk_k_budget = max_graph_vram_bytes -
  largest_merged_segment`. Without this reservation chunk-K could
  grow large enough that the active merged-segment offload OOMs.
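
A rough sketch of the chunk-K budget rule from the first and last bullets. The function name, the `has_weight_adapter` flag (standing in for `weight_adapter != nullptr`), and the saturation at zero are assumptions; the subtraction itself follows `chunk_k_budget = max_graph_vram_bytes - largest_merged_segment`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: bytes that chunk-K residency may pin.
static size_t chunk_k_budget_sketch(size_t max_graph_vram_bytes,
                                    const std::vector<size_t>& merged_segment_param_bytes,
                                    bool has_weight_adapter) {
    if (has_weight_adapter) {
        return 0;  // residency skips itself when a weight adapter is attached
    }
    size_t largest = 0;
    for (size_t bytes : merged_segment_param_bytes) {
        largest = std::max(largest, bytes);
    }
    // Leave room for the largest merged segment's params; saturate at zero
    // so the unsigned subtraction cannot wrap.
    return max_graph_vram_bytes > largest ? max_graph_vram_bytes - largest : 0;
}
```

With a 1000-byte graph budget and merged segments of 100, 400, and 250 bytes, this gives a 600-byte chunk-K budget, so the 400-byte segment can still be offloaded to the device while the resident set is in place.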

## Verification

- Z-Image Q8 with `--stream-layers off`: byte-identical to upstream
  walker.
- Z-Image bf16 1024x688 + 2 LoRAs + `--cfg-scale 1.0 --guidance 3.5
  --flow-shift 3.0` + batch_count=2 (the production REST API failure
  mode that motivated the correctness work): both batch images clean
  and distinct.
- Smoke-matrix-verified across Z-Image Q8/bf16, HiDream, Qwen,
  Flux schnell, SD3.5, SDXL, Anima, WAN.
@fszontagh
Contributor Author

Superseded by #1576, which rebuilds this on top of the graph-cut planner (#1476) with one new flag (--stream-layers) instead of the parallel system here. Closing.

@fszontagh fszontagh closed this May 29, 2026