[Cherry-pick] PRs #1648 #1650 #1594 #1269 #1326 #1652 #1651 #1601 #1653 #1558 #1670 #1662 #1677 #1327 #1673 #1676 #1687 #1678 #1691 #1697 #1702 #1704 #1726 #1729 by kevalmorabia97 · Pull Request #1734 · NVIDIA/Model-Optimizer

kevalmorabia97 · 2026-06-15T18:08:52Z

Cherry-picked PRs

Summary by CodeRabbit

Release Notes

New Features
- Added Alpamayo quantization example with FP8/NVFP4 export support.
- Introduced FastGen DMD2 distillation library for Qwen-Image text-to-image optimization.
- Added lossless MXFP4-to-NVFP4 weight casting for DeepSeek models.
- Expanded PTQ recipes with new NVFP4 variants (MLP-only, experts-only, weight-only).
- Enhanced sparse attention calibration and export capabilities.
Documentation
- Added end-to-end Nemotron-3 optimization tutorial and comprehensive PTQ recipe guide.
- Updated example READMEs and CHANGELOG with latest optimization capabilities.
Bug Fixes
- Fixed sparse attention configuration export schema.
- Improved KV cache reuse settings for context logits generation.

Disable codecov binary validation, which seems to be constantly failing:

```
gpg: Signature made Tue Apr 21 19:28:03 2026 UTC
gpg: using RSA key 27034E7FDB850E0BBC2C62FF806BB28AED779869
gpg: Can't check signature: No public key
==> Could not verify signature. Please contact Codecov if problem continues
Exiting...
```

## Summary by CodeRabbit

* **Chores**
  * Updated CI workflow notes and removed an outdated header comment.
  * Added explanatory comments to the Linux job and adjusted the code coverage upload step to use a relaxed validation mode (no other upload settings changed).

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

### What does this PR do? Type of change: Bug fix `tests/examples/llm_eval/test_llm_eval.py::test_qwen3_eval_fp8` was silently passing while its evals crashed, then began failing as a timeout. This repairs the whole pipeline: - **lm_eval `IndexError` (root cause):** TRT-LLM KV-cache prefix reuse returns truncated `context_logits` for shared-prefix requests (e.g. hellaswag's one-context / many-endings), which breaks `parse_logprobs`. Add an `enable_kv_cache_reuse` flag to `modelopt.deploy.llm.LLM` (default `True`, unchanged) and disable it for the eval deployment so full-length context logits are returned. - **Silent CI green:** `python eval.py | tee result.txt` returns `tee`'s exit code, so a crashing eval was masked. Add `set -o pipefail` to `huggingface_example.sh` so failures fail the test. - **Long-prompt overflows:** with the tiny test model's toy tokenizer, gsm8k/MMLU prompts exceed `max_seq_len`. Bump test `max_position_embeddings` to 8192, skip MMLU prompts that don't fit even at zero-shot, and add an MMLU sample limit (`--mmlu_limit`). - **human-eval build failures:** install with `--no-build-isolation` (`pkg_resources` is absent in pip's isolated build env), patch its malformed `console_scripts` entry point, and pin the clone. - **Cleanups:** gate the post-quant `run_tensorrt_llm.py` smoke test behind the `quant` task (eval tasks deploy on their own; ~45s saved for eval-only runs); replace the SIGPIPE-prone serve-readiness `tail -f | while` with a poll loop (required under `pipefail`). ### Usage N/A — example/test fix. ### Testing All four eval tasks verified end-to-end in the CI container (TRT-LLM 1.3.0rc17, RTX 6000 Ada): lm_eval (hellaswag + gsm8k), MMLU, and simple_eval (humaneval) all complete with exit 0 and no `IndexError`/overflow. Cold full run ≈ 340s on this GPU. 
CI test on 2-gpu: https://github.com/NVIDIA/Model-Optimizer/actions/runs/27154417497/job/80153551154 ### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ (new `enable_kv_cache_reuse` defaults to current behavior; new script flags are optional) - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A (no new dependencies) - Did you write any new necessary tests?: N/A (fixes and strengthens an existing test) - Did you update Changelog?: N/A (bug fix to examples/tests) - Did you get Claude approval on this PR?: ❌ (pending) ### Additional Information The full test runs ~340s on an RTX 6000 Ada; CI runners are historically slower, while `@pytest.mark.timeout` is set to 600 — worth watching the first CI run and bumping if it's close. 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **New Features** * Added an option to limit MMLU evaluation length. * **Bug Fixes** * Disabled KV-cache prefix reuse for evaluations needing per-token context logits to prevent truncated/incorrect logprobs. * Skip examples whose prompts remain too long; warn and report accuracy as NaN if all examples are skipped. * **Chores / Scripts** * Improved example scripts for reproducible installs, patched entry point handling, pipeline failure detection, conditional test invocation, polling-based log wait, and a new CLI flag for MMLU limits. * **Tests** * Increased timeout and prompt headroom; capped MMLU smoke tests for speed.  --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
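
The exit-code masking behind the "silent CI green" item can be reproduced in isolation. The following is a minimal sketch, assuming `bash` is available on the `PATH`; `false` stands in for a crashing `eval.py`, and the helper name `pipeline_exit_code` is made up for illustration rather than taken from the repository.

```python
import subprocess


def pipeline_exit_code(script: str) -> int:
    """Run a shell snippet under bash and return its exit status (hypothetical helper)."""
    return subprocess.run(["bash", "-c", script], check=False).returncode


# Without pipefail the pipeline reports tee's status, so the failure of `false` is hidden.
masked = pipeline_exit_code("false | tee /dev/null")

# With pipefail the pipeline reports the status of the rightmost command that failed.
surfaced = pipeline_exit_code("set -o pipefail; false | tee /dev/null")

print(masked, surfaced)
```

A test harness that only inspects the script's exit status sees success in the first case and failure in the second, which is why the option matters for `huggingface_example.sh`.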

### What does this PR do?

Type of change: New example

Adds an example for Alpamayo-1 quantization with ModelOpt (FP8, NVFP4, AutoQuant).

### Usage

```
python quantize.py --ckpt nvidia/Alpamayo-R1-10B --output-dir ./alpamayo-r1-fp8 --quantize fp8
```

### Testing

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A
- Did you get Claude approval on this PR?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added Alpamayo 1 vision-language-action model quantization example supporting FP8, NVFP4, and mixed-precision optimization modes.
  * Introduced CLI quantization tool with calibration loop and checkpoint export capabilities for both fake-quantized and real-quantized formats.
* **Documentation**
  * Added comprehensive guide documenting the Alpamayo quantization example, model details, and usage instructions.

---------

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

### What does this PR do? Type of change: New Feature  Adds HuggingFace `config.json` export of skip-softmax sparse-attention calibration for diffusion pipelines (e.g. Wan 2.2), on top of the base skip-softmax work. - **`_export_diffusers_checkpoint`** walks every `nn.Module` component of a diffusers pipeline, calls `export_sparse_attention_config`, and writes the result into that component's `config.json` under the `sparse_attention_config` key. The sparse config lives **only** in `config.json` — there is no standalone `sparse.yaml`. - **`export_sparse_attention_config`** emits a `config_groups` schema where each algorithm's parameters are nested inside its own group; only `config_groups` and `producer` are top-level: - skip-softmax group → `algorithm: "skip_softmax"`, `targets`, `ignore` (layers kept dense — e.g. cross-attention + first/last blocks), `initial_disabled_steps` (opt-in, user-set; emitted only when `> 0`), `threshold_scale_factor` (`a * exp(b * target_sparsity)`), and `target_sparsity`. - N:M group → `algorithm: "sparse_softmax"` with `sparsity_n`/`sparsity_m`, `dense_sink_tokens`, `dense_recent_tokens` flattened into the group. - **Deploy reader** (`modelopt/torch/sparsity/attention_sparsity/plugins/sparse_attn_config.py`) reads these per-group params back, keeping the export↔load round-trip consistent. - **Example wiring**: `examples/diffusers/sparsity/wan22_skip_softmax.py` gains `--export-dir`, `--skip-softmax-threshold`, and `--initial-disabled-steps`. `--export-dir` runs `export_hf_checkpoint(pipe, export_dir=...)` after calibration. - Updated `CHANGELOG.rst`. 
### Usage ```bash python examples/diffusers/sparsity/wan22_skip_softmax.py \ --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ --calibrate --target-sparsity 0.5 --calib-size 4 \ --initial-disabled-steps 5 \ --export-dir ./wan22_skip_softmax_ckpt ``` Resulting layout — a `config.json` per component, **no `sparse.yaml`**: ``` wan22_skip_softmax_ckpt/ ├── transformer/config.json # carries sparse_attention_config ├── transformer_2/config.json # carries sparse_attention_config ├── vae/ … text_encoder/ … tokenizer/ … scheduler/ … └── model_index.json ``` A representative `config.json` entry for a diffusion transformer: ```json "sparse_attention_config": { "config_groups": { "group_0": { "algorithm": "skip_softmax", "targets": ["WanAttention"], "ignore": ["blocks.0.attn1", "blocks.0.attn2", "…"], "initial_disabled_steps": 5, "threshold_scale_factor": { "formula": "a * exp(b * target_sparsity)", "prefill": {"a": 1443.49, "b": 4.30} }, "target_sparsity": {"prefill": 0.5} } }, "producer": {"name": "modelopt", "version": "0.45.0..."} } ``` The N:M variant adds a second group: ```json "group_1": { "algorithm": "sparse_softmax", "targets": ["WanAttention"], "sparsity_n": 2, "sparsity_m": 4, "dense_sink_tokens": 0, "dense_recent_tokens": 64 } ``` ### Testing - `tests/examples/diffusers_sparsity/test_sparsity.py`: baseline / triton-baseline / fixed-threshold runs of the Wan 2.2 example, plus a Python-API calibrate → **export** test asserting the nested `sparse_attention_config` (`threshold_scale_factor`, `target_sparsity`, `ignore`, `initial_disabled_steps`) and the absence of any `sparse.yaml`. - `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py` and `test_sparse_attn_config.py`: unit coverage of the per-group export schema and the deploy-reader round-trip (writer nests → reader reads from groups → internal mtsa config unchanged). 
- Validated end-to-end on Wan 2.2 T2V-A14B: full 4-prompt / 40-step / 81-frame calibration; the exported checkpoint carries the nested schema in both `transformer` and `transformer_2` `config.json`, and runtime measurement shows ~47–49% tile sparsity at a 0.5 target. ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ❌ The exported `sparse_attention_config` schema was renamed and nested per-group during 0.45.x development, and the loader reads only the new layout — checkpoints exported by earlier 0.45.x builds must be re-exported. No released version is affected.  - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅  - Did you write any new necessary tests?: ✅  - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅  ### Additional Information  --------- Signed-off-by: Jingyu Xin <jingyux@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
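
To make the nested schema concrete, here is a small sketch that walks `config_groups` and evaluates the stated formula `a * exp(b * target_sparsity)` for the prefill phase. The dictionary is a trimmed copy of the representative entry above; the helper `prefill_scale_factors` is hypothetical and is not the deploy reader in `sparse_attn_config.py`. How the resulting factor is consumed at runtime is not described here, so the sketch stops at the arithmetic.

```python
import math

# Trimmed copy of the representative config.json entry from this description.
sparse_attention_config = {
    "config_groups": {
        "group_0": {
            "algorithm": "skip_softmax",
            "targets": ["WanAttention"],
            "initial_disabled_steps": 5,
            "threshold_scale_factor": {
                "formula": "a * exp(b * target_sparsity)",
                "prefill": {"a": 1443.49, "b": 4.30},
            },
            "target_sparsity": {"prefill": 0.5},
        }
    },
    "producer": {"name": "modelopt"},
}


def prefill_scale_factors(config: dict) -> dict:
    """Evaluate a * exp(b * target_sparsity) per skip_softmax group (hypothetical helper)."""
    results = {}
    for name, group in config["config_groups"].items():
        if group.get("algorithm") != "skip_softmax":
            continue  # e.g. an N:M group carries no scale-factor coefficients
        coeffs = group["threshold_scale_factor"]["prefill"]
        sparsity = group["target_sparsity"]["prefill"]
        results[name] = coeffs["a"] * math.exp(coeffs["b"] * sparsity)
    return results


factors = prefill_scale_factors(sparse_attention_config)
print(factors)
```

Because every algorithm-specific parameter lives inside its own group, a reader only needs the `algorithm` field to decide which keys to expect.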

### What does this PR do? **Type of change:** New example + new `modelopt.torch.fastgen` library module. Adds **DMD2 (Distribution Matching Distillation) for Qwen-Image** — distilling the base model into a few-step (1–4) generator. Includes the framework-agnostic `modelopt.torch.fastgen` loss library (DMD pipeline, EMA, optional GAN discriminator) and a NeMo AutoModel–based training example with a mock-data smoke config, a real-data config, and inference / export scripts. **Noted**: the example script will be migrated to AutoModel repo ### Usage ```bash # Mock-data wiring smoke — runs end-to-end with no dataset to prepare torchrun --nproc-per-node=8 \ examples/diffusers/fastgen/dmd2_finetune.py \ --config examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml ``` See `examples/diffusers/fastgen/README.md` for real-data training and inference. ### Testing Unit tests under `tests/unit/torch/fastgen/`; `pre-commit` / code-quality clean. ### Before your PR is "*Ready for review*" - Backward compatible?: ✅ (new, additive module) - Followed `CONTRIBUTING.md` for any copied code / new deps: ✅ - New tests added?: ✅ - Updated Changelog?: N/A  ## Summary by CodeRabbit * **New Features** * Adds a FastGen-based distillation framework (DMD2) with student/fake-score training, EMA support, GAN discriminator branch, inference pipeline, and export utilities. * Qwen-Image integration with latent packing and feature-capture for plugin-enabled pipelines. * **Documentation** * New README, example configs, and runnable example scripts for Qwen-Image distillation and inference. * **Tests** * Comprehensive unit tests covering math parity, gradient routing, plugins, hooks, EMA, and recipe setup.  --------- Signed-off-by: Jingyu Xin <jingyux@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…only PTQ recipes (#1652) ### What does this PR do? Type of change: new feature (recipes) Several `general/ptq` recipe families shipped a data-driven FP8 KV-cache (`-kv_fp8`) variant but lacked the constant-amax `kv_fp8_cast` companion that `fp8_default` and `nvfp4_default` already have. This PR adds the missing cast variants so every KV-quantizing (and the weight-only) family offers the calibration-free FP8 KV-cache option: - `general/ptq/nvfp4_experts_only-kv_fp8_cast` - `general/ptq/nvfp4_mlp_only-kv_fp8_cast` - `general/ptq/nvfp4_omlp_only-kv_fp8_cast` - `general/ptq/nvfp4_weight_only-kv_fp8_cast` Each new recipe composes the exact same model-quant config as its existing sibling and swaps the `kv_fp8` unit for the shared `kv_fp8_cast` unit (constant-amax FP8 KV cache; no KV calibration forward pass). The docs guide table/tree and the changelog are updated to match. ### Usage ```bash python examples/llm_ptq/hf_ptq.py \ --pyt_ckpt_path <model> \ --recipe general/ptq/nvfp4_mlp_only-kv_fp8_cast ``` ### Testing Extended the built-in PTQ smoke test `tests/unit/recipe/test_loader.py::test_load_recipe_all_builtins` with the four new recipe paths; all four load into a valid `ModelOptPTQRecipe` with a populated `quantize` section. ``` $ python -m pytest tests/unit/recipe/test_loader.py tests/unit/recipe/test_presets.py -q 180 passed ``` `pre-commit` (including the `validate modelopt recipes` hook) passes on all changed files. 
### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ (additive — only new recipe files) - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: ✅ (extended the builtin recipe smoke test) - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ - Did you get Claude approval on this PR?: ❌ (not yet) ### Additional Information The two weight-only families were discussed for scope; `nvfp4_weight_only` is included (it already names a KV mode, `kv_fp16`), while `int4_blockwise_weight_only` is intentionally left untouched since it carries no `-kv_` composition. 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **New Features** * Added four new NVFP4 PTQ (Post-Training Quantization) recipe variants: experts-only, MLP-only, OMLP-only, and weight-only configurations. * All new recipes include FP8 KV-cache cast mode support for improved inference performance. * **Documentation** * Updated built-in recipes guide with new NVFP4 recipe options and repository layout. * **Tests** * Expanded recipe loader test coverage for new recipe configurations.  Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

) ### What does this PR do? Type of change: CI / infrastructure (build-time speedup) ModelOpt's CUDA quantization extensions (`modelopt_cuda_ext`, `_fp8`, `_mx`) JIT-compile via `torch.utils.cpp_extension.load()` on first use — ~110–140s **each** in a fresh container, which is the dominant cost of the `gpu_trtllm` job and the TRT-LLM example jobs. This caches them across runs. The logic lives in a reusable composite action, **`.github/actions/cache-extensions`**, used by both `gpu_tests.yml` and `_example_tests_runner.yml`: - Sets a **literal in-container `TORCH_EXTENSIONS_DIR`** (`/root/.cache/torch_extensions`). `${{ github.workspace }}` can't be used — for `container:` jobs it resolves to the *host* path, which is mounted elsewhere (`/__w`) inside the container, so torch and the cache step would disagree on the location. - Caches that dir with `actions/cache`, keyed on a caller-supplied **env discriminator** (`rtxpro6000` + container image) plus a `hashFiles` of the kernel/loader sources — so the cache busts on any kernel change and is scoped per arch+image. - On an **exact hit**, **backdates the kernel sources** below the cached objects so ninja reuses them. (Touching the *objects* instead desyncs ninja's `.ninja_deps`, which records each output's build-time mtime → `stored deps info out of date` → rebuild.) Also fixes the unused `runner` default in `_example_tests_runner.yml` (`h100` → `rtxpro6000`) so it can't seed a wrong-arch cache. ### Usage N/A — CI only. To reuse from another job: ```yaml - uses: ./.github/actions/cache-extensions with: cache-key: rtxpro6000-${{ matrix.container_image }} # GPU arch + image ``` ### Testing Validated on `gpu_trtllm`: cache hit → `ninja: no work to do` → `test_cuda_ext*` dropped from **113s / 108s / 139s → 2.8s / 0.03s / 0.03s** (~360s saved per run). Jobs that build no extension (e.g. `gpu_vllm`) simply skip the save. 
### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ (CI-only; key busts on source/image change) - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A - Did you update Changelog?: N/A (CI infrastructure) - Did you get Claude approval on this PR?: ❌ (pending) ### Additional Information - Single-arch assumption: callers pass `rtxpro6000` in `cache-key`; if the runner fleet ever mixes GPU archs, update that prefix (the cache path is not arch-specific). - No explicit TTL: the key is content-addressed, and GitHub auto-evicts caches unused for 7 days (+ 10 GB/repo LRU). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
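
The backdating step can be illustrated with plain file timestamps. This is a sketch under stated assumptions, not the composite action's actual script: the function name `backdate_sources`, the 60-second margin, and the file names are all invented for the example. It shows only the idea that sources are moved earlier in time while the cached objects are left untouched.

```python
import os
import tempfile
from pathlib import Path


def backdate_sources(sources: list[Path], objects: list[Path], margin_s: float = 60.0) -> float:
    """Set every source mtime to `margin_s` seconds before the oldest object mtime."""
    oldest_object = min(p.stat().st_mtime for p in objects)
    target = oldest_object - margin_s
    for src in sources:
        os.utime(src, (target, target))  # (atime, mtime); object files are not modified
    return target


with tempfile.TemporaryDirectory() as tmp:
    obj = Path(tmp) / "kernel.o"
    src = Path(tmp) / "kernel.cu"
    obj.write_text("restored from cache")
    src.write_text("freshly checked-out source")  # written after the object
    backdate_sources([src], [obj])
    source_is_older = src.stat().st_mtime < obj.stat().st_mtime
    print(source_is_older)
```

Leaving the object mtimes alone keeps them equal to whatever the build tool recorded at build time, which is the property the description says ninja depends on.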

…r examples/megatron_bridge (#1601) ### What does this PR do? Type of change: documentation (+ minor test fixes) Migrates the Nemotron-3-Nano-30B-A3B-BF16 tutorial quantization step from `examples/llm_ptq/hf_ptq.py` to the Megatron-Bridge quantize + export, and relocates the tutorial next to the scripts it now uses. Now that the whole tutorial is Megatron-Bridge based, it lives under `examples/megatron_bridge/`. - **Quantization migration:** replace the single `hf_ptq.py` call with `examples/megatron_bridge/quantize.py` (calibrate + save a Megatron checkpoint) → `examples/megatron_bridge/export.py` (deployable unified HF checkpoint). The FP8 results table is refreshed with the `quantize.py` numbers (same defaults, slightly better on average). - **Relocation:** moved `examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/` → `examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/`. A **redirect-stub `README.md`** remains at the old path (a directory symlink isn't traversable in the GitHub web UI), and all in-repo references (root README, CHANGELOG, pruning READMEs, megatron_bridge README) plus the tutorial's own relative links are updated. - **Evaluation:** per-format vLLM benchmark commands (BF16 / FP8), FP8 deployment notes documented in `nemo_evaluator.yaml`, reduced LiveCodeBench/AIME `num_repeats` (were too slow), and bumped the `nemo-evaluator-launcher` pin. - **Misc:** drop the `examples/megatron_bridge/requirements.txt` `transformers<5` pin in favor of an inline "downgrade `transformers<5` to save pruned Nemotron checkpoints" note; guard the hybrid Mamba-MoE sharded-state-dict test behind `HAS_MAMBA` (requires `mamba_ssm`); shrink the tiny Gemma3 test fixture's attention heads. > **Note:** the **NVFP4 + QAD** experiments (formerly the focus of this PR) are split out — their accuracy/throughput results are still in progress — and will follow in a separate PR on top of this one. ### Testing Docs-only + test-guard changes. 
Pre-commit hooks (markdownlint, RST checks, ruff, mypy) pass. The tutorial's relative links and the old-path redirect stub were verified to resolve to real files. ### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ (old tutorial path still resolves via a redirect-stub README; `quantize.py`/`export.py` already exist in `examples/megatron_bridge`) - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A (adjusts/guards existing tests only) - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ (existing tutorial entry updated to the new path) - Did you get Claude approval on this PR?: ✅ ### Additional Information Supersedes the previous "Part 3 of 4 (NVFP4 + QAD docs)" scope of this PR; the NVFP4 + QAD tutorial additions will land in a follow-up.  ## Summary by CodeRabbit * **Documentation** * Moved the Nemotron-3-Nano-30B-A3B tutorial into the Megatron-Bridge tutorials and replaced the old file with a pointer to the new location. * Updated vLLM throughput numbers to 2.6× and expanded results/throughput tables. * Reworked the FP8 quantization/export workflow and added a note to use transformers<5 when saving pruned models. * Added a tutorials index and adjusted evaluator launcher pin and repeat counts. * **Tests** * Tests now detect optional Mamba support and skip related tests when unavailable.  --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

#1653) ### What does this PR do? Type of change: new feature Brings the GPT-OSS lossless MXFP4 → NVFP4 cast (#1372) to DeepSeek V4's routed-expert export by adding a `--cast_mxfp4_to_nvfp4` flag to `examples/deepseek/deepseek_v4/quantize_to_nvfp4.py`. To avoid duplicating the closed-form math, the shared numerics — `mxfp4_to_nvfp4_global_amax`, `mxfp4_to_nvfp4_per_block_amax`, and the E2M1/E4M3/E8M0 constants — are **hoisted out of the GPT-OSS example cast into the library** at `modelopt/torch/quantization/utils/numeric_utils.py`. Both the GPT-OSS cast (`examples/llm_ptq/cast_mxfp4_to_nvfp4.py`) and the new DeepSeek path now import them from there. DeepSeek V4's routed experts ship as MXFP4 (E2M1 nibbles + a power-of-two E8M0 scale per 32-element block). By default the export dequantizes them to BF16 and re-quantizes to NVFP4 using the calibrated per-tensor weight amax, which re-derives per-block scales from the data and is therefore lossy. With the flag, the cast pins `scale_2 = 2^(k_max-8)` and each per-block E4M3 scale to `2^(k_j-m)` straight from the source E8M0 scales, so `per_block_scale * scale_2 = 2^k_j` and the NVFP4 nibbles equal the source MXFP4 nibbles bit-for-bit (for every block whose `k_j` lands in E4M3's representable window; rare out-of-range blocks clamp). The one V4-specific addition is that w1/w3 share a single `scale_2` for the fused GEMM1, so `k_max` is taken over both projections. The flag only affects routed-expert **weights** — activation `input_scale` still comes from `--amax_path` calibration. ### Usage ```bash python deepseek_v4/quantize_to_nvfp4.py \ --amax_path ${AMAX} \ --source_ckpt ${DS_V4} \ --output_ckpt ${HF_NVFP4_PATH} \ --cast_mxfp4_to_nvfp4 ``` ### Testing - The hoisted numerics get unit tests in `tests/unit/torch/quantization/test_numeric_utils.py` (10 cases: per-tensor global_amax, per-block amax incl. out-of-range, magnitude-table cache) — 10/10 pass. 
The example test `tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py` keeps the cast-specific cases (quantizer naming, `build_amax_map`, `apply_to_model`). - Validated on real DeepSeek-V4-Flash expert tensors (incl. the on-disk `float8_e8m0fnu` scale dtype): 23.5M blocks, 100% lossless, 0 error. - Generated a full NVFP4 checkpoint for DeepSeek-V4-Flash (43 layers, 256 routed experts) end-to-end: `[cast] lossless MXFP4->NVFP4 blocks: 8,657,043,456/8,657,043,456 (100.0000%)`. Output weights match an independently-produced reference cast byte-for-byte (`weight_scale`, `weight_scale_2`, packed nibbles modulo the harmless sign-of-zero). ### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ (new opt-in flag; default export behavior unchanged; hoist re-exports through the existing example module) - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ N/A (no new deps; shared numerics moved into the library rather than duplicated) - Did you write any new necessary tests?: ✅ (library numerics covered by `tests/unit/torch/quantization/test_numeric_utils.py`; end-to-end validated on a real DeepSeek-V4 checkpoint) - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ - Did you get Claude approval on this PR?: ❌ (will run `/claude review`) ### Additional Information Mirrors and reuses #1372 (GPT-OSS MXFP4 → NVFP4 cast); the closed-form numerics are now shared via `modelopt.torch.quantization.utils.numeric_utils`. 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **New Features** * Added `--cast_mxfp4_to_nvfp4` flag to perform a closed-form, mostly lossless MXFP4→NVFP4 conversion for routed-expert weights with aggregated lossless/block statistics. 
* **Documentation** * Updated DeepSeek V4 export instructions and README to document the new flag and clarify calibration behavior for activation scales. * **Chores** * Exposed shared numeric quantization utilities for MXFP4→NVFP4 casting. * **Tests** * Added and updated tests to validate the new numeric helpers and conversion behavior.  Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
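
The exponent bookkeeping behind the cast can be checked with a few lines of arithmetic. The sketch below assumes that `m` is the exponent of the global scale (`m = k_max - 8`) and that the usable E4M3 window for power-of-two scales runs from `2^-9` to `2^8`; both are assumptions made for illustration. The function `cast_scales` is hypothetical and is not one of the `mxfp4_to_nvfp4_*` helpers in `numeric_utils.py`.

```python
E4M3_MIN_EXP = -9  # smallest E4M3 subnormal is 2**-9
E4M3_MAX_EXP = 8   # largest power of two not exceeding the E4M3 maximum of 448


def cast_scales(block_exponents: list[int]) -> tuple[float, list[float], list[bool]]:
    """Return (scale_2, per-block scales, lossless flags) for E8M0 exponents k_j."""
    k_max = max(block_exponents)
    m = k_max - 8  # assumed exponent of the global scale: scale_2 = 2**(k_max - 8)
    scale_2 = 2.0 ** m
    per_block, lossless = [], []
    for k_j in block_exponents:
        exponent = k_j - m
        clamped = min(max(exponent, E4M3_MIN_EXP), E4M3_MAX_EXP)
        per_block.append(2.0 ** clamped)
        lossless.append(clamped == exponent)  # False when the block had to be clamped
    return scale_2, per_block, lossless


exponents = [-3, 0, -12, -25]  # the last block sits far below k_max
scale_2, per_block, lossless = cast_scales(exponents)
for k_j, scale, ok in zip(exponents, per_block, lossless):
    if ok:
        # per_block_scale * scale_2 reproduces the source MXFP4 block scale exactly
        assert scale * scale_2 == 2.0 ** k_j
print(lossless)
```

For the fused w1/w3 case described above, the same reasoning applies with `k_max` taken over the exponents of both projections.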

### What does this PR do?

Type of change: Bug fix

INT8 entropy calibration of fp16 ONNX models (e.g. ConvNext / EfficientViT / YOLOv8 backbones quantized via `python -m modelopt.onnx.quantization --quantize_mode=int8`) used to fail during histogram collection with:

```
ValueError: Too many bins for data range. Cannot create 128 finite-sized bins.
```

`_collect_value` in `modelopt/onnx/quantization/ort_patching.py` derives `threshold = max(abs(min), abs(max))` from the activation tensor and passes `range=(-threshold, threshold)` to `np.histogram(...)`. When the model is fp16 and a calibrated activation has a small range (≲ 1e-5), both endpoints inherit the fp16 dtype. Under numpy 2.0's NEP-50 strict promotion, the resulting fp16 `linspace` collapses consecutive edges of the 128 bins to the same value, and numpy refuses to build the histogram. numpy 1.x silently used a higher-precision intermediate dtype, masking the issue.

The fix casts the range endpoints to Python `float` so numpy computes bin edges in float64 regardless of input dtype. It is applied at both call sites: `_collect_value` and the single-node variant `_collect_value_histogram_collector_single_node_calibration`.

### Usage

```bash
# The affected workflow: INT8 entropy calibration of any fp16 ONNX model
python -m modelopt.onnx.quantization \
    --quantize_mode=int8 \
    --onnx_path=model.fp16.onnx \
    --calibration_data_path=calib.npy
```

No API change.

### Testing

- Added `test_collect_value_fp16_narrow_range` in `tests/gpu/onnx/test_ort_patching.py`, which calls `_collect_value` with an fp16 tensor (mostly zeros + one ~1e-5 value) and asserts that the histogram is built without raising and that all bin edges are distinct. Fails on the buggy code, passes after the fix.
- Reproduced the original failure on numpy 2.2.6 before the fix.
- Full `tests/gpu/onnx/test_ort_patching.py` suite (31 tests) passes.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

## Summary by CodeRabbit

* **Bug Fixes**
  * Fixed INT8 entropy calibration for fp16 ONNX models failing with NumPy >= 2.0. Histogram range computation now correctly handles fp16 activations with small dynamic ranges.
* **Tests**
  * Added test coverage for INT8 calibration with fp16 activations using narrow value ranges.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
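
The precision problem behind the histogram failure can be seen without touching the patched functions. The sketch below is an illustration under assumptions, not the code in `ort_patching.py`: it deliberately picks the smallest positive fp16 value as the threshold (an extreme case chosen so the outcome is unambiguous) and compares bin edges stored in fp16 with edges computed from Python floats.

```python
import numpy as np

bins = 128
threshold = np.float16(6e-8)  # rounds to the smallest positive fp16 value, 2**-24

# Edges kept in fp16: only -t, 0 and +t are representable inside the range,
# so most of the 129 edges coincide.
edges_fp16 = np.linspace(-threshold, threshold, bins + 1, dtype=np.float16)

# Edges built from Python floats: linspace works in float64 and keeps them distinct.
edges_fp64 = np.linspace(-float(threshold), float(threshold), bins + 1)

print(len(np.unique(edges_fp16)), len(np.unique(edges_fp64)))
```

Duplicate edges mean zero-width bins, which is the condition numpy reports as "Too many bins for data range".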

### What does this PR do?

Type of change: Bug fix

Fixes `apply_chat_template` failures when loading `nemotron-sft-agentic-v2` with the Nemotron 3 Nano tokenizer.

HF agentic datasets store OpenAI-style `tool_calls` with `function.arguments` as JSON **strings**, but Nemotron v3 chat templates iterate `tool_call.arguments|items` in Jinja2, which requires a **mapping**. That mismatch raised:

```
TypeError: Can only get item pairs from a mapping.
```

This PR:

- Adds a shared `prepare_messages_for_chat_template()` in `modelopt.torch.utils.dataset_utils` to normalize string tool-call arguments to dicts (covering both nested `function.arguments` and top-level `arguments`).
- Routes the `get_dataset_samples` / `get_dataset_dataloader` chat-template paths through the helper with `reasoning_content="native"` and `normalize_tool_calls=True`, preserving `reasoning_content` for tokenizers that handle it natively while fixing tool calls.
- Refactors `megatron_preprocess_data._process_messages` to delegate to the same helper (no behavior change: `strip`/`inline` still handle reasoning; `native` still returns messages unchanged without tool-call normalization).
- Consolidates tests: hermetic logic stays in unit tests; one live GPU integration test covers the v3 calibration path.

### Testing

- New e2e tests added to replace previous simpler tests
- Manual verification (Nemotron 3 Nano tokenizer + `nemotron-sft-agentic-v2`):

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: Not yet

### Additional Information

Root cause: Nemotron v3 Jinja chat templates use `tool_call.arguments|items`; OpenAI-format dataset rows store arguments as JSON strings.

Related prior art in-repo: `megatron_preprocess_data` already normalized tool-call arguments inline; this PR deduplicates that logic into `prepare_messages_for_chat_template`.

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added public utilities for preparing OpenAI-style chat messages with reasoning content support, including native reasoning mode handling.
  * Implemented automatic tool call argument normalization for consistent tokenizer operations.
* **Refactor**
  * Consolidated chat template application across registered and auto-detected chat datasets using unified preprocessing.
* **Tests**
  * Added unit and integration tests validating reasoning content preparation and chat template functionality.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
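
For illustration only, the tool-call normalization described in this PR can be sketched as below. This is not the body of `prepare_messages_for_chat_template`; the function name `normalize_tool_call_arguments` and the sample message are hypothetical, and the sketch assumes that arguments which fail to parse as a JSON object are best left untouched.

```python
import copy
import json


def normalize_tool_call_arguments(messages):
    """Return a copy of `messages` with JSON-string tool-call arguments parsed to dicts.

    Hypothetical sketch: handles both nested `function.arguments` and
    top-level `arguments`, as the PR description states the real helper does.
    """
    normalized = copy.deepcopy(messages)  # do not mutate dataset rows in place
    for message in normalized:
        for tool_call in message.get("tool_calls") or []:
            for holder in (tool_call.get("function"), tool_call):
                if not isinstance(holder, dict):
                    continue
                arguments = holder.get("arguments")
                if not isinstance(arguments, str):
                    continue  # already a mapping (or absent)
                try:
                    parsed = json.loads(arguments)
                except json.JSONDecodeError:
                    continue  # assumption: leave unparseable strings as-is
                if isinstance(parsed, dict):
                    holder["arguments"] = parsed
    return normalized


# Hypothetical OpenAI-style row with arguments stored as a JSON string.
row = [
    {
        "role": "assistant",
        "tool_calls": [
            {"type": "function", "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
        ],
    }
]
fixed = normalize_tool_call_arguments(row)
```

After this step a Jinja2 expression such as `tool_call.arguments|items` receives a mapping rather than a string, which is the condition the `TypeError` above complains about.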

### What does this PR do?

Type of change: documentation

Adds two docs under `modelopt_recipes/` (no code or behavior changes):

- **`README.md`**: catalog of the recipe library: its purpose (a recipe is the single, version-controlled source of truth for *how* a model is optimized), the directory layout (`general/`, `huggingface/`, `models/`, `configs/`), how to load/select recipes (`load_recipe`, `--recipe`), and a high-level map of the general PTQ combos, speculative-decoding, and distillation recipes.
- **`recipe.md`**: a focused guide to the PTQ schemes: the general `general/ptq/` body scopes (full-model FP8/NVFP4, scoped experts-only / mlp-only / omlp-only, weight-only), KV-cache modes (`kv_fp8_cast` / `kv_nvfp4_cast` / `kv_fp8`), calibration variants (max / mse / gptq / layerwise), low- vs high-concurrency deployment guidance, and the model-specific recipes under `huggingface/` and `models/`, each compared to its general baseline.

### Usage

```python
# Documentation only. The recipes themselves load as before, e.g.:
from modelopt.recipe import load_recipe

cfg = load_recipe("general/ptq/nvfp4_experts_only-kv_fp8_cast")
```

### Testing

`pre-commit run --files modelopt_recipes/README.md modelopt_recipes/recipe.md` passes (markdownlint, modelopt recipe validation, license/format hooks).

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: ❌

### Additional Information

Documentation for the `modelopt_recipes/` library; content verified against the recipe YAMLs and the `modelopt.recipe` / config-loader source.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Documentation**
  * Added comprehensive ModelOpt recipes guide describing YAML-based, composable optimization workflows, directory/lookup layout, reuse via imports, and how to add or share recipes.
  * Added PTQ quantization guide covering recipe naming/structure, quantization scopes and KV-cache options, calibration variant guidance, model-specific overrides, multimodal considerations, and a checkpoint-mirroring example.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…a 4 MTP (#1677)

Type of change: Bug fix

Fixes the specdec_bench vLLM wrapper's MTP `speculative_config` emission so Gemma 4 MTP no longer hits the wrong code path inside vLLM.

vLLM's `SpeculativeConfig.__post_init__` (`vllm/config/speculative.py:529-602`) auto-detects `method` only when it is unset. When `model` is provided and `method` is `None`, the default branch sets `method = "draft_model"`, which is the generic same-architecture draft path, not MTP. That path enforces equal `num_heads` between target and draft, and on heterogeneous-head models it raises:

```
AssertionError: All layers in one attention group must share num_heads; got {8, 4}
```

Gemma 4 has 8 target heads and 4 draft heads by design.

PR #1663 changed the MTP branch in the wrapper to emit `{model: <assistant>, num_speculative_tokens: N}` without `method` when `draft_model_dir` was provided, based on a misread of vLLM PR #41745's test plan, which only showed the `{model, num_speculative_tokens}` shape. That test plan was the direct `LLM(...)` constructor invocation; vLLM had already defaulted `method` internally. Going through specdec_bench's `AsyncEngineArgs(speculative_config=...)` path, the explicit `method` key is required to avoid the auto-detect → `draft_model` fallback.

vLLM's own test at [`tests/v1/e2e/spec_decode/test_spec_decode.py:818-823`](https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/spec_decode/test_spec_decode.py#L818) does exactly this for the gemma4-e4b parametrization:

```python
speculative_config = {
    "method": method,  # "mtp"
    "num_speculative_tokens": ...,
}
if draft_model is not None:  # Gemma 4 case
    speculative_config["model"] = draft_model
```

Restore `method="mtp"` as the unconditional MTP path. Add `model` only when `draft_model_dir` is set. Backward-compatible for Qwen 3.5 MTP / DeepSeek MTP / other inline-MTP families (they keep the bare `{method: "mtp"}` config).

Field-tested via vLLM PR #41745's correctness test on `gemma-4-E4B-it` + `gemma-4-E4B-it-assistant`: produced 304.7 output TPS at γ=4 vs 171.0 baseline (about 1.78x, i.e. a 78% speedup) on H100. That test uses the same `speculative_config` shape this fix emits.

Failure sequence in the [OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024) pipeline:

- Wrapper emitted `{model: assistant, num_speculative_tokens: 3}`
- vLLM auto-detected `method = "draft_model"`
- Loaded gemma-4-E4B-it-assistant (4 heads) as a generic draft for gemma-4-E4B-it (8 heads)
- Attention-group `num_heads` check tripped → AssertionError, task_0 FAILED, task_1 CANCELLED

Checklist:

- Backward compatible: ✅ (Qwen 3.5 / DeepSeek MTP unchanged; only the MTP+`draft_model_dir` case changes).
- New tests: ❌. The test exercising this codepath would need a GPU + gemma-4 model checkout, which is cluster work, not unit-test scope. JIRA-tracked validation via OMNIML-5024 dispatch after this lands.
- Changelog: ❌

Related:

- vLLM PR #41745 (Gemma4 MTP support)
- Companion: NVIDIA/Model-Optimizer PR #1675 (launcher `GlobalVariables.draft_model` schema fix)

Summary:

* **Bug Fixes**
  * Fixed speculative decoding configuration handling in the benchmark example to ensure consistent method assignment and proper draft model configuration.
* **Documentation**
  * Updated configuration comments to reflect corrected behavior and improved clarity.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
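
As a rough sketch of the rule this fix restores (always emit `method`, add `model` only for a separate draft checkpoint), the helper below builds the dict in plain Python. The name `build_mtp_speculative_config` is hypothetical and is not the wrapper's actual function; only the dict keys come from the description above.

```python
def build_mtp_speculative_config(num_speculative_tokens, draft_model_dir=None):
    """Build an MTP speculative_config dict (hypothetical helper, for illustration).

    `method` is always set explicitly so that downstream auto-detection
    cannot fall back to the generic "draft_model" path.
    """
    config = {
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    if draft_model_dir is not None:
        # Separate assistant checkpoint, e.g. the Gemma 4 case.
        config["model"] = draft_model_dir
    return config


inline_mtp = build_mtp_speculative_config(3)
gemma_mtp = build_mtp_speculative_config(3, draft_model_dir="gemma-4-E4B-it-assistant")
```

Inline-MTP families get the bare `{"method": "mtp", ...}` form, while the Gemma 4 case additionally carries `model`; in neither case is `method` left for vLLM to infer.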

### What does this PR do?

Type of change: New feature (PTQ recipe)

Adds a declarative YAML recipe for post-training quantization of **Nemotron-H** hybrid models (Mamba-2 + MLP + Attention) under the `modelopt_recipes` framework. The recipe is mixed-precision and composed **entirely from existing recipe units**; no core-library or `hf_ptq.py` changes are required. (Quantized `nn.Embedding` support, which the embedding line relies on, already landed in #1495.)

Precision mirrors the **GGUF Q4_K_M** bit allocation of the same model, mapped onto NVFP4/FP8.

`modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`:

| Precision | GGUF source | Modules |
| --- | --- | --- |
| NVFP4 W4A4 | Q4_K / Q5_0 | in_proj, out_proj, up_proj, attn q/k/v/o_proj, down_proj (Q4_K layers 13,15,20,22,27,29,37,39) |
| FP8 W8A8 | Q6_K | MLP down_proj (layers 1,3,5,8,10,18,25,33,41) |
| NVFP4 W4A16 (weight-only) | - | input embedding |
| FP8 W8A16 (weight-only) | - | lm_head |
| bf16 | F32 | Mamba conv1d, all norms, A_log / D / dt_bias |

The Q8_0 attn `v_proj` layers (24, 32) are kept **NVFP4 W4A4** rather than FP8: ModelOpt's export fuses q/k/v (they share the attention input) and requires one format across the group, so `v` can't diverge from `q`/`k`.

Built from the units `base_disable_all`, `w4a4_nvfp4_nvfp4`, `default_disabled_quantizers`, `configs/numerics/fp8`, and `configs/numerics/nvfp4`.

### Usage

```bash
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
    --recipe models/Nemotron-H/Nemotron-3-Nano-4B/nvfp4_w4a16 \
    --trust_remote_code \
    --export_path nemotron-3-nano-4b-nvfp4
```

### Testing

- `pre-commit run --files modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml` passes, including the `validate modelopt recipes` schema hook.
- End-to-end PTQ + unified HF export on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (calibration: `cnn_dailymail`, 512 samples, seq 512). Produced a 2.57 GB unified HF checkpoint; `hf_quant_config.json` was verified per-layer against the table above: 9 FP8 W8A8 `down_proj` + FP8 weight-only `lm_head`, NVFP4 W4A16 embedding, NVFP4 W4A4 everywhere else, with q/k/v/o uniform within each attention layer (required for export fusion).

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors).

- Is this change backward compatible?: ✅ (purely additive; a new opt-in recipe file)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (declarative config; covered by the `validate modelopt recipes` pre-commit hook)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (new recipe config, not a library API change)
- Did you get Claude approval on this PR?: ❌ (pending `/claude review`)

### Additional Information

Depends on #1495 (quantized `nn.Embedding` support) for the embedding line to pack on export.

Possible follow-ups (out of scope here):

- A compressed-tensors conversion pass so the checkpoint is consumable by vLLM (`*.weight → *.weight_packed`, `*.weight_scale_2 → *.weight_global_scale`, and a `format: nvfp4-pack-quantized` / `quant_method: compressed-tensors` quantization config).
- A `--vllm-compat`-style variant that additionally excludes Mamba `in_proj` (output dim `17504 = intermediate + conv_dim + num_heads` is not divisible by 64, violating Marlin repack alignment) and leaves `lm_head` / embedding in bf16, for out-of-the-box vLLM consumption.

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…en (#1673)

### What does this PR do?

Type of change: Bug fix

Fixes the generation **preview** in `examples/llm_ptq/hf_ptq.py` producing garbage output (e.g. repeated `\u200b` zero-width-space tokens) for models whose tokenizer has `pad_token == eos_token`, most visibly GLM-5.1. The garbage appeared *before* quantization, so it was not a quantization issue.

**Root cause:** `pre_quantize` / `post_quantize` take the first (left-padded) calibration sample and call `full_model.generate(preview_input_ids, ...)` **without an `attention_mask`**. HuggingFace only auto-infers the mask when `pad_token_id != eos_token_id` (`generation/utils.py:_prepare_attention_mask_for_generation`); when they are equal it falls back to an all-ones mask, so the model attends to the leading pad/eos tokens, ignores the real prompt, and (for GLM's MoE/DSA/MTP path) collapses to a single repeated token. Calibration itself was always correct because it already passes the mask; only the preview generation was missing it.

**Fix:** thread the calibration batch's `attention_mask` through to both preview `generate()` calls. One file changed (`examples/llm_ptq/hf_ptq.py`, +20/-8).

### Usage

No usage change: the same command now produces a coherent preview instead of `\u200b` repetition.

### Testing

Reproduced the exact mechanism (left padding + `pad_token == eos_token` + missing `attention_mask`) on a small model (GPT2): without the mask the model emits the same HF warning as the bug report and ignores the prompt; with the mask the output is byte-identical to the unpadded baseline.

Verified no behavioral change for models where pad != eos (the explicit mask equals HF's inferred `input_ids.ne(pad_id)`) and for Whisper (its batch carries no `attention_mask`, so the path is unchanged).

Pre-commit: ruff-check, ruff-format, and mypy (no new errors vs. main) all pass.

### Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
- Did you write any new necessary tests?: N/A
- Did you update Changelog?: N/A
- Did you get Claude approval on this PR?: ✅

### Additional Information

Backward compatible across model families:

| Model class | Before (no mask passed) | After (mask passed) | Result |
|---|---|---|---|
| `pad != eos` (most: T5, BART, many LLMs) | HF infers mask = `input_ids.ne(pad_id)` | explicit calib mask = same tensor | **Identical output**, no change |
| `pad == eos` (GLM-5.1, GPT-2-style) | all-ones fallback → attends to pad → garbage | correct mask | **Fixed** |
| Whisper | no mask | batch has no `attention_mask` key → `None` → no mask | **Identical**, no change |
| Nemotron-VL / DeepSeek / NemotronH / `--skip_generate` | `generate()` not called on this path | unchanged | No change |

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
  * Enhanced LLM post-quantization example to properly handle attention masks during preview generation. The quantization preview now correctly threads attention masks through generate() calls, ensuring accurate generation outputs are captured both before and after quantization steps.

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
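
To make the root cause concrete, here is a simplified, pure-Python model of the fallback described above. It is not HuggingFace's implementation: `inferred_attention_mask` is a hypothetical stand-in that works on plain lists, and the token ids are made up.

```python
def inferred_attention_mask(input_ids, pad_token_id, eos_token_id):
    """Simplified model of the mask used when generate() receives no attention_mask.

    Assumption: the mask is inferred from padding only when a pad token is
    present in the inputs and differs from the eos token; otherwise all ones.
    """
    can_infer = (
        pad_token_id is not None
        and pad_token_id in input_ids
        and pad_token_id != eos_token_id
    )
    if can_infer:
        return [int(token != pad_token_id) for token in input_ids]
    return [1] * len(input_ids)


# Left-padded calibration sample (pad id 0) and the mask the calibration batch carries.
preview_input_ids = [0, 0, 0, 11, 12, 13]
calibration_mask = [0, 0, 0, 1, 1, 1]

mask_when_pad_differs = inferred_attention_mask(preview_input_ids, pad_token_id=0, eos_token_id=2)
mask_when_pad_is_eos = inferred_attention_mask(preview_input_ids, pad_token_id=0, eos_token_id=0)
```

With `pad != eos` the inferred mask equals the calibration mask, so passing it explicitly changes nothing; with `pad == eos` the fallback marks the three leading pad tokens as real input, which is why the explicit mask is needed.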

)

### What does this PR do?

Type of change: Bug fix

Fix `--quant_cfg` CLI parsing by typing `quant_cfg` as `str | None` instead of `str | QuantizeConfig | None`.

### Testing

```
accelerate launch --config_file examples/gpt-oss/configs/zero3.yaml examples/gpt-oss/sft.py --config examples/gpt-oss/configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG --output_dir gpt-oss-20b-qa
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **Refactor**
  * Quantization config parameter now accepts string identifiers or none; resolution behavior for named presets remains unchanged.
* **Documentation**
  * Updated argument reference to reflect the parameter type change while preserving the deprecation note and usage guidance.

---------

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…1687)

### What does this PR do?

Type of change: Bug fix

Exclude Qwen visual and vision_tower modules from NVFP4 quantization and keep the Qwen linear attention projection exclusions. These modules can produce matrix dimensions that are incompatible with vLLM 0.22.1's ModelOpt FP4 Marlin fallback path, causing checkpoint load or profiling failures such as `size_n = 4304 is not divisible by tile_n_size = 64`.

### Usage

N/A. This is a recipe configuration fix.

### Testing

- `python -m pytest tests/unit/recipe/test_presets.py tests/unit/recipe/test_loader.py -q`
- `python -m pre_commit run --files modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml tests/unit/recipe/test_loader.py tests/unit/recipe/test_presets.py`
- E2E validation with `vllm/vllm-openai:v0.22.1`: PTQ export validation passed with zero Marlin-incompatible quantized layers, and vLLM `/health`, `/v1/models`, and `/v1/completions` passed.

The final PR broadens the validated visual MLP exclusions to the full `*visual*` subtree and adds the common `*vision_tower*` naming pattern.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: Yes
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: Yes
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

N/A

## Summary by CodeRabbit

* **Tests**
  * Added unit tests that verify the built-in PTQ recipe and preset correctly disable incompatible projection and visual components for certain quantization modes.
  * Ensures quantization settings are validated across recipes and presets.
* **Chores**
  * Updated quantization configuration to disable quantizers for select projection and vision-related model layers.

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
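
The divisibility failure quoted above is simple arithmetic, and a pre-export check for it could look roughly like the sketch below. Only the tile size 64 and the dimension 4304 come from the error message; the helper name, the layer names, and the 4096 entry are hypothetical, and the real Marlin path may enforce further constraints not modeled here.

```python
def marlin_incompatible_layers(output_dims, tile_n_size=64):
    """Return {layer_name: remainder} for output dims not divisible by the tile size."""
    return {
        name: size_n % tile_n_size
        for name, size_n in output_dims.items()
        if size_n % tile_n_size != 0
    }


# Hypothetical layer names; 4304 is the size_n reported in the vLLM error.
candidate_layers = {
    "visual.blocks.0.mlp.up_proj": 4304,
    "language_model.layers.0.mlp.up_proj": 4096,
}
offenders = marlin_incompatible_layers(candidate_layers)
```

Here 4304 leaves a remainder of 16 modulo 64, so only the visual layer is flagged, which is consistent with excluding the `*visual*` subtree from NVFP4 quantization.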

…295242) (#1678)

Type of change: Bug fix

Fixes the GPT-OSS MXFP4 → NVFP4 PTQ path (`examples/llm_ptq/hf_ptq.py` with `--cast_mxfp4_to_nvfp4`), which failed in three independent ways. The documented command now runs end-to-end and produces a bit-exact (100% lossless) NVFP4 checkpoint. Addresses **nvbug 6295279** (OMNIML-5046) and **nvbug 6295242** (OMNIML-5045).

1. **nvbug 6295242: CUDA illegal memory access on load.** GPT-OSS ships native MXFP4 weights that Transformers dequantizes to BF16; the threaded weight loader trips an illegal-memory access when `device_map="auto"` shards the dequant across **multiple GPUs**. The missing optional `kernels` package only *forces* the dequant path; it is not the root cause. `get_model` now detects MXFP4 checkpoints and loads them with `Mxfp4Config(dequantize=True)` on a **sequential** device map so the dequant stays on a single device. `kernels` is no longer required.
2. **nvbug 6295279 #1: `NotImplementedError: Mxfp4GptOssExperts` during unified HF export.** Forcing `dequantize=True` yields plain `GptOssExperts` (even when `kernels` is installed), which ModelOpt wraps and exports normally.
3. **nvbug 6295279 #2: `FileNotFoundError` in the cast step.** `--cast_mxfp4_to_nvfp4` treated `--pyt_ckpt_path` as a local dir; a HF Hub ID now resolves to its cached snapshot dir via `_resolve_model_path`.

Also fixes a **static-block NVFP4 regression** (surfaced by the cast's `force_weight_quantizers_static`, introduced by #1560's now-unconditional `weight_only_quantize`): `_QuantGptOssExperts` / `_QuantLlama4TextExperts` quantize their expert weights transposed in the forward (`_transposed_quantize`), but the inherited `iter_weights_for_calibration` fed the non-transposed weight, locking a mismatched block-quant `_original_shape` and raising `ValueError: Input shape has changed`. The override now calibrates on the transposed view, matching both the forward and the export's `_amax` orientation.

`get_model` never had explicit handling for a *natively pre-quantized MXFP4* checkpoint: GPT-OSS fell through the generic *unquantized-checkpoint* branch and relied on Transformers' **implicit** MXFP4 behavior, which is fragile across three axes. The cast was originally validated (#1372, 2026-05-01) in the "lucky" quadrant of each:

- **GPU count:** `device_map="auto"` on a single GPU never shards, so the dequant stays on one device. On multiple GPUs `auto` balances the model and shards the MXFP4→BF16 dequant across devices → CUDA illegal-memory crash (6295242).
- **`kernels` presence:** without `kernels`, Transformers auto-dequantizes to BF16 `GptOssExperts` (exportable). With `kernels` installed it keeps the packed `Mxfp4GptOssExperts` kernel path → export `NotImplementedError` (6295279 #1).
- **Transformers version:** the kernel-backed experts wrapper and the threaded multi-GPU weight loader are newer-Transformers behavior (env here is 5.5.4). Earlier versions simply dequantized MXFP4 → BF16, which is what the old generic path happened to need.

The QA env sat in the *breaking* quadrant (multi-GPU and/or `kernels` present, newer Transformers), so the implicit path failed. The new branch makes both decisions explicit and deterministic (`dequantize=True` + single-device load), regardless of environment, mirroring the existing `has_pack_quantized_config` branch for compressed-tensors checkpoints.

The fourth issue (static-block `Input shape has changed`) is a separate regression: it was introduced by **#1560 (2026-06-02, "Make sure all weight quantizers have `_amax`")**, a month *after* the cast landed. Previously `weight_only_quantize` ran only when no calibration `forward_loop` was supplied, and the cast always supplies one, so the non-transposed weight-quantizer call simply never happened before. The conflict only appears at the intersection of (a) transposed-quantize experts (GPT-OSS/Llama4), (b) static-block NVFP4, which `--cast_mxfp4_to_nvfp4` forces via `force_weight_quantizers_static`, and (c) #1560. CI's GPT-OSS NVFP4 coverage uses the *dynamic*-block path, which never locks the block shape, so #1560 looked safe.

```bash
python hf_ptq.py \
    --pyt_ckpt_path openai/gpt-oss-20b \
    --qformat nvfp4_mlp_only \
    --cast_mxfp4_to_nvfp4 \
    --export_path ./gpt-oss-20b-nvfp4
```

- Ran the documented command end-to-end on 2xB200 (`openai/gpt-oss-20b`): cast overrode **48/48** expert weight quantizers, **100% lossless** layers/blocks, exported a valid packed-NVFP4 HF checkpoint (uint8 weights + FP8 per-block `weight_scale` + per-tensor `weight_scale_2` + `hf_quant_config.json`).
- Verified plain `--qformat nvfp4_mlp_only` (no cast) still works end-to-end.
- **Independently verified the export is bit-exact:** dequantized the exported NVFP4 weights (ModelOpt's E2M1 LUT + pack layout) and compared against Transformers' canonical MXFP4→BF16 dequant (`Mxfp4Config(dequantize=True)`) over all 24 layers × both expert weights: `max_abs_err = 0`, 100% bitwise-equal in bf16. So `dequant(exported NVFP4) == dequant(original MXFP4)` exactly.
- New unit tests: `test_get_original_hf_quant_method_*` (load detection) and `test_gpt_oss_experts_iter_weights_for_calibration_transposed` (the transpose regression). Existing `test_cast_mxfp4_to_nvfp4.py` (8 tests) still pass. `pre-commit` clean.

**Known limitation:** verified for gpt-oss-20b (fits one GPU). gpt-oss-120b dequantized does not fit a single GPU, so `sequential` would still span GPUs; that case would need a CPU-dequant-then-dispatch path and is left as a follow-up.

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ (0.45 Bug Fixes)
- Did you get Claude approval on this PR?: ❌ (not yet run)

nvbug 6295279, nvbug 6295242 / OMNIML-5046, OMNIML-5045.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* **Bug Fixes**
  * Prevented CUDA illegal-memory access during MXFP4→NVFP4 casting.
  * Fixed expert-weight calibration orientation to avoid shape mismatches.
* **New Features**
  * Support loading native MXFP4 checkpoints with automatic dequantization.
  * Resolve remote model identifiers to local checkpoints when casting MXFP4→NVFP4, improving reliability.
* **Tests**
  * Added unit and GPU regression tests covering quant-method detection, casting, and expert-weight calibration.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…6293731, 6293762) (#1691)

### What does this PR do?

Type of change: Bug fix

Fixes two sglang deployment failures on multimodal Gemma (`gemma-4-31B-it`) caused by general PTQ presets leaking quantization into the SigLIP vision branch via broad wildcards:

- **NVBug 6293731** (`general/ptq/fp8_default-kv_fp8`): the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` / `*input_quantizer`, which also match the vision tower (`model.vision_tower.*`, `model.visual.*`) and the vision embedding projection (`model.embed_vision.*`). The exported checkpoint deploys but emits **garbled text** in sglang.
- **NVBug 6293762** (`general/ptq/nvfp4_mlp_only-kv_fp8`): the `*mlp*` enables also match the vision tower's block MLPs (`model.vision_tower.encoder.layers.*.mlp`), and an image request **crashes** the FP4 kernel at decode: `ValueError: too many values to unpack (expected 2)` in sglang's `modelopt_quant.py` `apply`.

### Fix

Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared `configs/ptq/units/default_disabled_quantizers` unit, alongside the existing `*router*` / `*lm_head*` entries. Both the composed `general/ptq/*` recipes **and** the `configs/ptq/presets/model/*` presets import this unit, so:

- every general recipe (`fp8_default`, `nvfp4_default`, `nvfp4_mlp_only`, `nvfp4_omlp_only`, …) keeps the vision branch in BF16 by default, fixing the whole vision-overreach class, not just the two reported recipes;
- the `test_general_ptq_yaml_matches_config_dicts` YAML↔preset parity test stays satisfied (both sides pick up the new entries from the one shared unit).

The rules are **no-ops on text-only models** (nothing matches). A recipe that intentionally wants to quantize the vision branch can re-enable these after importing the unit.

Files changed:

- `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml` (+14)

### Testing

Re-export of `gemma-4-31B-it` with the affected recipes and re-deploy in sglang (the env from the bug reports: `lmsysorg/sglang:v0.5.12.post1`, GB200) to confirm fp8_default no longer garbles text and nvfp4_mlp_only no longer crashes on image requests. _(Results to be appended.)_

Unit-level: `tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts` (parity) passes for all four general presets.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (text-only checkpoints unaffected; new rules only match vision modules that should never have been quantized by a general recipe)
- If you copied code from any other sources or added a new PIP dependency: N/A
- Did you write any new necessary tests?: N/A (recipe data fix; covered by the existing parity test + verified by real PTQ export + sglang deploy)
- Did you update Changelog?: N/A
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034. Companion to PR #1690 (same vision-overreach class on the gemma-specific `w4a8_awq` recipe, NVBug 6294017).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Chores**
  * Updated quantization configuration to preserve BF16 precision for vision encoder components in multimodal models.

---------

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
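
A small sketch of the wildcard overreach and of the new disable rules, assuming glob-style matching as provided by Python's `fnmatch` (the recipe loader's actual matching rules are not shown in this description). The quantizer names, including the `fc1` and `down_proj` leaf modules, are hypothetical; the patterns are the ones listed in the fix.

```python
from fnmatch import fnmatchcase

# Patterns named in the fix, alongside the pre-existing router / lm_head entries.
DISABLE_PATTERNS = ["*embed_vision*", "*vision_tower*", "*visual*", "*router*", "*lm_head*"]


def is_disabled(quantizer_name, patterns=DISABLE_PATTERNS):
    """True if any disable pattern matches the fully qualified quantizer name."""
    return any(fnmatchcase(quantizer_name, pattern) for pattern in patterns)


# Hypothetical fully qualified quantizer names.
vision_mlp = "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer"
text_mlp = "model.language_model.layers.0.mlp.down_proj.weight_quantizer"

# The broad enable pattern from the mlp-only recipe catches both branches ...
enabled_by_mlp_rule = [name for name in (vision_mlp, text_mlp) if fnmatchcase(name, "*mlp*")]
# ... while the disable rules single out the vision branch only.
kept_in_bf16 = [name for name in (vision_mlp, text_mlp) if is_disabled(name)]
```

On a text-only model no name contains `vision_tower`, `visual`, or `embed_vision`, so `is_disabled` never fires for those rules, matching the no-op claim above.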

…fo in clear_stale_value_info (#1697)

### What does this PR do?

Type of change: Bug fix

INT4 quantization upgrades the model to opset >= 21, at which point ONNX Runtime runs type inference while building the AWQ calibration `InferenceSession`. Custom ops backed by TensorRT plugins (domain `trt.plugins`) have no ORT type-inference function, so their output types are only known from the `value_info` that TensorRT type/shape inference populated earlier in preprocessing. `clear_stale_value_info` cleared `value_info` wholesale, dropping those types, so ORT failed output type inference for the custom op at model load, e.g.:

```
Node (Conv-2) Op (IdentityConv) output arg (X2) type inference failed
```

- `modelopt/onnx/utils.py`: in `clear_stale_value_info`, preserve `value_info` entries for outputs of `trt.plugins`-domain nodes (which ORT cannot re-derive); clear the rest as before.
- `tests/gpu/onnx/quantization/test_plugin.py`: add a regression test quantizing a model with the built-in `CustomSkipLayerNormPluginDynamic` plugin at INT4 + awq_clip (the opset >= 21 path), asserting the quantized model is produced and the custom op survives.

### Usage

```bash
python -m modelopt.onnx.quantization \
    --onnx_path=model.onnx \
    --quantize_mode=int4 \
    --calibration_method=awq_clip \
    --trt_plugins=/path/to/plugin.so
```

### Testing

- `pytest tests/gpu/onnx/quantization/test_plugin.py -k int4_awq` fails before the fix (ORT type-inference error at calibration-session load) and passes after. The full `test_plugin.py` (including the existing INT8 quantization and autocast cases) passes.
- The example [here](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/onnx_ptq/README.md#quantize-an-onnx-model-with-custom-op) also failed before this fix and now passes.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional info

Fixes a regression introduced by #1565.

## Summary by CodeRabbit

* **Bug Fixes**
  * Preserve metadata for TensorRT plugin outputs during cleanup and correctly reconcile output data types so custom plugin ops remain intact after optimization/quantization.
* **Tests**
  * Added a GPU ONNX regression test covering int4 quantization with AWQ calibration to ensure TensorRT plugins are retained.

Signed-off-by: Gwenaelle Cunha Sergio <gcunhasergio@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
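
The selective clearing can be sketched without the `onnx` package by using plain stand-ins for graph nodes and `value_info`. This is an illustration of the idea only, not the code in `modelopt/onnx/utils.py`: the `Node` dataclass, the function name, and the type strings are hypothetical, while the `trt.plugins` domain and the `IdentityConv` / `X2` names come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for an ONNX node (hypothetical, for illustration)."""
    op_type: str
    domain: str = ""
    outputs: list = field(default_factory=list)


def clear_stale_value_info_sketch(nodes, value_info):
    """Drop value_info entries except those for outputs of trt.plugins nodes.

    Plugin outputs are kept because a runtime without a type-inference
    function for the custom op cannot re-derive their types.
    """
    plugin_outputs = {
        output
        for node in nodes
        if node.domain == "trt.plugins"
        for output in node.outputs
    }
    return {name: info for name, info in value_info.items() if name in plugin_outputs}


nodes = [
    Node(op_type="Conv", outputs=["X1"]),
    Node(op_type="IdentityConv", domain="trt.plugins", outputs=["X2"]),
]
value_info = {"X1": "float16[1,3,8,8]", "X2": "float16[1,3,8,8]"}
remaining = clear_stale_value_info_sketch(nodes, value_info)
```

Entries for standard-domain outputs such as `X1` are still cleared, as before the fix; only the plugin output `X2` keeps its recorded type.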

…#1702) ### What does this PR do? Type of change: Bug fix Fixes nvbug **6311147** (OMNIML-5103). `examples/deepseek/deepseek_v3/ptq.py` resolved the cloned DeepSeek-V3 / DeepSeek-V3.2-Exp inference repos relative to its own directory (`deepseek_v3/`) via `Path(__file__).resolve().parent`. But the [README](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/deepseek) clones those repos into the parent `examples/deepseek/` directory and runs the script from there, so the lookup landed one level too deep and raised `ValueError: DeepSeek-V3 or DeepSeek-V3.2-Exp not found` (the error message also printed the wrong directory). The fix resolves from `parent.parent` via a single `DEEPSEEK_DIR` base shared by both repo paths and the error message. ### Usage ```bash # Run from examples/deepseek/ as documented in the README, after cloning # DeepSeek-V3 (or DeepSeek-V3.2-Exp) into that directory: torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py \ --model_path $DS_CKPT \ --config DeepSeek-V3/inference/configs/config_671B.json \ --quant_cfg NVFP4_DEFAULT_CFG \ --output_path $FP4_QUANT_PATH ``` ### Testing - Confirmed against the repro path: with the file at `examples/deepseek/deepseek_v3/ptq.py` and the repos cloned into `examples/deepseek/`, `Path(__file__).resolve().parent.parent` now points at `examples/deepseek/` so `DeepSeek-V3/inference` resolves correctly. - Verified the sibling `examples/deepseek/deepseek_v4/` does not share the bug (it takes an explicit `--dsv4_inference_dir` argument instead). - `pre-commit` clean. 
### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A (one-line path fix in an example script that requires the DeepSeek repos + multi-GPU checkpoint to exercise) - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (bug is in a 0.45-cycle example, not a regression from a released version) - Did you get Claude approval on this PR?: ❌ (not yet run) ### Additional Information nvbug 6311147 / OMNIML-5103. 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Bug Fixes** * Improved path resolution in the example script to more reliably locate the required inference repository.  Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
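The off-by-one-directory lookup can be reproduced with a throwaway directory tree. This is a minimal sketch, assuming the layout described above; `resolve_deepseek_dir` and `find_inference_dir` are hypothetical helper names, and the wording of the error message is illustrative rather than copied from `ptq.py`.

```python
import tempfile
from pathlib import Path


def resolve_deepseek_dir(script_path: Path) -> Path:
    """Base directory holding the cloned repos: the parent of deepseek_v3/."""
    return script_path.resolve().parent.parent


def find_inference_dir(deepseek_dir: Path) -> Path:
    """Return the first existing <repo>/inference directory under the base."""
    for repo in ("DeepSeek-V3", "DeepSeek-V3.2-Exp"):
        candidate = deepseek_dir / repo / "inference"
        if candidate.is_dir():
            return candidate
    raise ValueError(f"DeepSeek-V3 or DeepSeek-V3.2-Exp not found in {deepseek_dir}")


with tempfile.TemporaryDirectory() as tmp:
    # Recreate examples/deepseek/ with the script and one cloned repo.
    root = Path(tmp) / "examples" / "deepseek"
    (root / "deepseek_v3").mkdir(parents=True)
    (root / "DeepSeek-V3" / "inference").mkdir(parents=True)
    script = root / "deepseek_v3" / "ptq.py"
    script.touch()

    deepseek_dir = resolve_deepseek_dir(script)
    found = find_inference_dir(deepseek_dir)
    relative = found.relative_to(deepseek_dir).as_posix()

    # The pre-fix lookup started one level too deep, inside deepseek_v3/.
    old_lookup_exists = (script.resolve().parent / "DeepSeek-V3" / "inference").is_dir()
```

Sharing one base directory between the repo lookups and the error message means the reported path matches the path that was actually searched.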

### What does this PR do? Type of change: Bug fix Extends the calibration/memory-probe `use_cache` guard to Step 3.7-style nested text configs. Step 3.7 remote code reads the language config under `model.config.text_config` directly and raises `AttributeError` when `use_cache` is absent during PTQ calibration with Transformers >5. This keeps the existing Step 3.5 behavior and applies the same temporary set/restore logic to the nested text config. ### Usage No API change. PTQ calibration continues to use the existing forward-loop path. ### Testing - `pre-commit run ruff-format --files modelopt/torch/utils/dataset_utils.py tests/unit/torch/utils/test_dataset_utils.py` - `pre-commit run ruff-check --files modelopt/torch/utils/dataset_utils.py tests/unit/torch/utils/test_dataset_utils.py` - `python -m py_compile modelopt/torch/utils/dataset_utils.py tests/unit/torch/utils/test_dataset_utils.py` - `python -m pytest tests/unit/torch/utils/test_dataset_utils.py -k "disable_use_cache or iter_use_cache_configs or forward_loop_runs_under_disabled" -vv` ### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: ✅ - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A - Did you get Claude approval on this PR?: N/A ### Additional Information This is separate from PR #1693. Step 3.7 needs both fixes if both failure paths are exercised: this PR fixes PTQ calibration-time `use_cache` handling, while PR #1693 fixes exported config `layer_types` metadata for deployment config loading.  ## Summary by CodeRabbit * **Bug Fixes** * Improved handling of cache flags stored in nested model configuration objects: cache is reliably disabled during dataset operations and restored or removed afterward. 
* **Tests** * Added unit tests covering nested-config disabling, restoration/removal of cache flags post-operation, and deduplication when nested configs reference the same object.  Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
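The temporary set/restore behavior described above can be sketched with a context manager. This is an illustration only: the function names echo the test names in the Testing section, but their real signatures in `dataset_utils.py` are not shown here, so the bodies below are assumptions, and `SimpleNamespace` objects stand in for real model configs.

```python
from contextlib import contextmanager
from types import SimpleNamespace

_MISSING = object()  # sentinel: the attribute did not exist before we touched it


def iter_use_cache_configs(model):
    """Yield the top-level config and its nested text_config, each object once."""
    seen = set()
    config = getattr(model, "config", None)
    for cfg in (config, getattr(config, "text_config", None)):
        if cfg is not None and id(cfg) not in seen:
            seen.add(id(cfg))
            yield cfg


@contextmanager
def disable_use_cache(model):
    """Force use_cache=False on every config, then restore or remove it on exit."""
    saved = []
    for cfg in iter_use_cache_configs(model):
        saved.append((cfg, getattr(cfg, "use_cache", _MISSING)))
        cfg.use_cache = False
    try:
        yield model
    finally:
        for cfg, previous in saved:
            if previous is _MISSING:
                del cfg.use_cache  # was absent before, so remove it again
            else:
                cfg.use_cache = previous


text_config = SimpleNamespace()  # nested config with no use_cache attribute
model = SimpleNamespace(config=SimpleNamespace(use_cache=True, text_config=text_config))

with disable_use_cache(model):
    inside = (model.config.use_cache, model.config.text_config.use_cache)
after = (model.config.use_cache, hasattr(text_config, "use_cache"))
```

Tracking configs by `id` is one way to get the deduplication the new unit tests mention, so a config that is referenced twice is saved and restored only once.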

Fixes #1658 Type of change: Bug fix, documentation This PR updates the Puzzletron dataset preparation flow to use the already published prebuilt dataset `nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2` by default, avoiding the need to download the full raw `nvidia/Nemotron-Post-Training-Dataset-v2` dataset (~136 GB) just to filter it down to the same ~2.6 GB result. Changes included: - Add `PREBUILT_KD_DATASET` constant in `prepare_dataset.py` - Short-circuit dataset preparation when `dataset_name` matches the prebuilt dataset, loading it directly and skipping the download + filtering pipeline - Update 8 Puzzletron example configs to use the prebuilt dataset path by default - Update the Puzzletron README to document the default ~3 GB path and clarify that the raw ~136 GB path is still available if users want to reproduce preprocessing Default lightweight path: ```bash python -m modelopt.torch.puzzletron.dataset.prepare_dataset \ --dataset_name nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2 \ --output_dir path/to/Puzzle-KD-Nemotron-Post-Training-Dataset-v2 ``` Raw dataset path (existing behavior, still supported): ```bash python -m modelopt.torch.puzzletron.dataset.prepare_dataset \ --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 \ --output_dir path/to/Nemotron-Post-Training-Dataset-v2 ``` - Ran `pre-commit run --all-files` - Most hooks passed successfully - Local pre-commit `mypy` reported unrelated existing errors in: - `modelopt/torch/opt/config_loader.py` - `modelopt/recipe/loader.py` - Verified this change separately with a local mock-based test: - prebuilt dataset path correctly loads and saves directly - original raw dataset path remains untouched - Is this change backward compatible?: ✅ - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A - Did you update 
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A - Did you get Claude approval on this PR?: N/A This change preserves the original raw-dataset workflow for users who explicitly want to regenerate the filtered dataset from scratch, while making the default example flow much lighter and easier to use.  * **Documentation** * Updated setup instructions to use a prebuilt, optimized dataset by default, simplifying the model compression workflow. * **Chores** * Updated model compression configurations across multiple examples to use the prebuilt dataset. * Enhanced dataset preparation to support prebuilt dataset handling for more efficient setup.  Signed-off-by: Sabari07 <sabursd18@gmail.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
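The short-circuit can be pictured as a single branch on the dataset name. The sketch below is an assumption about the control flow, not the code in `prepare_dataset.py`: the loader, filter, and saver are injected as hypothetical callables so the example runs offline, and the value of `PREBUILT_KD_DATASET` is assumed to be the dataset name given above.

```python
PREBUILT_KD_DATASET = "nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2"
RAW_DATASET = "nvidia/Nemotron-Post-Training-Dataset-v2"


def prepare_dataset(dataset_name, output_dir, load, filter_raw, save):
    """Load a dataset and save it, filtering only when it is not prebuilt."""
    dataset = load(dataset_name)
    if dataset_name == PREBUILT_KD_DATASET:
        # Already filtered upstream: save directly and skip the heavy pipeline.
        save(dataset, output_dir)
        return "prebuilt"
    save(filter_raw(dataset), output_dir)
    return "raw"


# Fake collaborators standing in for the download, filter, and save steps.
calls = []
saved = {}


def fake_load(name):
    calls.append(("load", name))
    return [1, 2, 3, 4]


def fake_filter(dataset):
    calls.append(("filter",))
    return [x for x in dataset if x % 2 == 0]


def fake_save(dataset, output_dir):
    saved[output_dir] = dataset


mode_prebuilt = prepare_dataset(PREBUILT_KD_DATASET, "out/prebuilt", fake_load, fake_filter, fake_save)
mode_raw = prepare_dataset(RAW_DATASET, "out/raw", fake_load, fake_filter, fake_save)
```

Only the raw path invokes the filter step, which mirrors the mock-based check mentioned in the PR description.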

Fixes the OOM (CPU RAM) issue reported in #1681.  ## Summary by CodeRabbit * **Bug Fixes** * Optimized memory management during model validation operations. Explicit resource cleanup procedures are now performed after each solution validation, preventing memory accumulation and eliminating out-of-memory errors during extended validation workflows. * **Configuration** * Updated default validation dataset configuration setting.  Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

coderabbitai · 2026-06-15T18:09:10Z

📝 Walkthrough

This PR adds new Alpamayo and Qwen-Image example workflows, updates LLM quantization and sparse-attention export paths, expands recipe and tutorial documentation, improves dataset and Puzzletron utilities, fixes ONNX handling, and adjusts CI workflows, caching, and evaluation scripts.

Changes

CI and workflow updates

| Layer / File(s) | Summary |
| --- | --- |
| Extension cache action `.github/actions/cache-extensions/action.yml` | Adds a composite action that sets `TORCH_EXTENSIONS_DIR`, restores cached torch extension artifacts, and backdates kernel sources on cache hits. |
| Workflow rollout `.github/workflows/_example_tests_runner.yml`, `.github/workflows/gpu_tests.yml`, `.github/workflows/example_tests.yml`, `.github/workflows/unit_tests.yml` | Switches runner defaults, wires the cache action into workflows, removes one example matrix entry, shortens one GPU timeout, and enables Codecov `skip_validation`. |

Alpamayo quantization example

| Layer / File(s) | Summary |
| --- | --- |
| Quantization script `examples/alpamayo/quantize.py` | Adds Alpamayo message building, teacher-forced loss patching, calibration loops, PTQ and AutoQuantize flows, CLI handling, and checkpoint export. |
| Example docs and tests `examples/alpamayo/README.md`, `tests/examples/alpamayo/test_quantize.py` | Documents the example workflow and adds tests for Parquet clip-id loading and message construction. |

LLM quantization and evaluation fixes

| Layer / File(s) | Summary |
| --- | --- |
| Shared MXFP4/NVFP4 numerics and expert calibration `modelopt/torch/quantization/utils/numeric_utils.py`, `examples/llm_ptq/cast_mxfp4_to_nvfp4.py`, `modelopt/torch/quantization/plugins/huggingface.py`, `modelopt/torch/quantization/plugins/transformers_trainer.py` | Moves MXFP4→NVFP4 math into shared utilities, updates cast code to consume them, adds transposed expert-weight calibration, and narrows `quant_cfg` typing. |
| DeepSeek and HF cast/load flows `examples/deepseek/...`, `examples/llm_ptq/example_utils.py`, `examples/llm_ptq/hf_ptq.py` | Adds lossless DeepSeek V4 cast support, updates DeepSeek path resolution, detects original HF MXFP4 configs, threads preview attention masks, and resolves local source checkpoint paths before casting. |
| Runtime and evaluation wiring `modelopt/deploy/llm/generate.py`, `examples/llm_eval/`, `examples/llm_ptq/run_tensorrt_llm.py`, `examples/llm_ptq/scripts/`, `examples/specdec_bench/specdec_bench/models/vllm.py` | Adds configurable KV-cache reuse, disables it for context-logit example paths, fixes MTP speculative decoding config, and updates evaluation scripts and shell wiring. |
| Coverage and interface docs `tests/examples/llm_ptq/`, `tests/gpu/torch/quantization/`, `tests/unit/torch/quantization/*`, `examples/llm_qat/ARGUMENTS.md`, `tests/_test_utils/torch/transformers_models.py` | Adds cast and calibration regression coverage, tests the new HF quant-method helper, updates quant-arg docs, and adjusts a tiny transformer test helper. |

FastGen DMD2 diffusion stack

| Layer / File(s) | Summary |
| --- | --- |
| FastGen core APIs `modelopt/torch/fastgen/`, `modelopt/torch/fastgen/methods/`, `modelopt/torch/fastgen/plugins/__init__.py` | Adds the FastGen package with DMD configs, EMA, flow-matching utilities, losses, loaders, base pipeline classes, and the DMD training pipeline. |
| Qwen-Image plugin and example flow `modelopt/torch/fastgen/plugins/qwen_image.py`, `modelopt_recipes/general/distillation/dmd2_qwen_image.yaml`, `examples/diffusers/fastgen/*` | Adds the Qwen-Image plugin, a DMD2 recipe, training configs, training recipe, export utility, inference pipeline, requirements, and example documentation. |
| Fixtures and FastGen test coverage `tests/_test_utils/torch/diffusers_models.py`, `tests/examples/diffusers/conftest.py`, `tests/unit/torch/fastgen/*` | Adds tiny Qwen-Image fixtures and unit coverage for FastGen math, plugin behavior, gradient routing, optimizer steps, hook guards, and EMA/timestep conversion behavior. |

Sparse attention export and example updates

| Layer / File(s) | Summary |
| --- | --- |
| Exported sparse-attention schema `modelopt/torch/sparsity/attention_sparsity/*`, `modelopt/torch/export/unified_export_hf.py` | Adds grouped sparse-attention export metadata, records log-space calibration fields and `initial_disabled_steps`, updates checkpoint parsing, and injects sparse config into exported component `config.json` files. |
| Wan skip-softmax example flow `examples/diffusers/README.md`, `examples/diffusers/sparsity/*`, `CHANGELOG.rst` | Extends sparse-attention docs and the Wan example with calibration-frame control, checkpoint export, and exported disabled-step metadata. |
| Schema and export coverage `tests/examples/diffusers/sparsity/test_sparsity.py`, `tests/unit/torch/sparsity/attention_sparsity/*` | Adds export coverage for fixed-threshold and calibrated skip-softmax flows and updates schema assertions to the new grouped metadata shape. |

Chat-template utilities and Puzzletron dataset updates

| Layer / File(s) | Summary |
| --- | --- |
| Chat-template preprocessing `modelopt/torch/utils/dataset_utils.py`, `modelopt/torch/utils/plugins/megatron_preprocess_data.py`, `tests/unit/torch/utils/test_dataset_utils.py`, `tests/gpu/torch/utils/test_dataset_utils.py` | Adds message normalization helpers for reasoning content and tool calls, uses split-specific raw dataset files, and expands nested `use_cache` handling and tests. |
| Puzzletron dataset and validation flow `modelopt/torch/puzzletron/dataset/prepare_dataset.py`, `modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`, `examples/puzzletron/*` | Prefers a prebuilt Puzzle-KD dataset, updates Puzzletron docs and configs to that dataset, and frees validation model references between iterations. |

Recipes and tutorial documentation

| Layer / File(s) | Summary |
| --- | --- |
| Recipe catalog and presets `modelopt_recipes/README.md`, `modelopt_recipes/ptq.md`, `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`, `modelopt_recipes/general/ptq/`, `modelopt_recipes/models/.../nvfp4_w4a16.yaml`, `docs/source/guides/10_recipes.rst`, `tests/unit/recipe/` | Expands recipe docs, adds new PTQ recipe variants and disabled quantizer patterns, adds a model-specific Nemotron recipe, and updates recipe coverage. |
| Tutorial and changelog refresh `README.md`, `CHANGELOG.rst`, `examples/megatron_bridge/`, `examples/pruning/`, `tests/_test_utils/torch/megatron/models.py`, `tests/gpu_megatron/torch/quantization/plugins/test_megatron.py` | Updates top-level release notes and tutorial links, adds a Megatron Bridge end-to-end tutorial, refreshes evaluator templates and pruning docs, and gates a Megatron test on Mamba availability. |

ONNX calibration and metadata fixes

| Layer / File(s) | Summary |
| --- | --- |
| Histogram and value-info handling `modelopt/onnx/quantization/ort_patching.py`, `modelopt/onnx/utils.py` | Casts histogram ranges to Python floats for narrow-value cases and preserves `trt.plugins` outputs during stale `value_info` cleanup. |
| ONNX coverage `tests/gpu/onnx/test_ort_patching.py`, `tests/gpu/onnx/quantization/test_plugin.py` | Adds tests for fp16 narrow-range histogram collection and int4 AWQ quantization with TensorRT plugin nodes preserved. |

Sequence Diagram(s)

sequenceDiagram
  participant Config as DMD2 config
  participant Recipe as DMD2DiffusionRecipe
  participant Pipeline as DMDPipeline
  participant Checkpoint as sidecar checkpoint

  Config->>Recipe: load config and overrides
  Recipe->>Pipeline: build student, teacher, fake_score, discriminator
  Recipe->>Pipeline: run student or fake-score phase
  Pipeline-->>Recipe: return phase losses
  Recipe->>Checkpoint: save student and DMD2 sidecar state

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1653: Also adds the DeepSeek V4 --cast_mxfp4_to_nvfp4 lossless routed-expert cast path and related MXFP4→NVFP4 numeric handling.
NVIDIA/Model-Optimizer#1678: Also updates GPT-OSS MXFP4→NVFP4 PTQ/cast handling, including transposed expert calibration behavior and related example flows.
NVIDIA/Model-Optimizer#1670: Also refactors modelopt/torch/utils/dataset_utils.py and megatron_preprocess_data.py around chat-template message preparation and tool-call normalization.

Suggested labels

cherry-pick-0.45.0

Suggested reviewers

realAsma
Edwardf0t1
meenchen
Fridah-nv
cjluo-nv

github-actions · 2026-06-15T18:15:15Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-15 19:23 UTC

codecov · 2026-06-15T18:18:02Z

Codecov Report

❌ Patch coverage is 69.60396% with 307 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.81%. Comparing base (2c52e7b) to head (e276eef).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/fastgen/ema.py | 38.59% | 70 Missing ⚠️ |
| modelopt/torch/fastgen/methods/dmd.py | 72.64% | 61 Missing ⚠️ |
| modelopt/torch/fastgen/plugins/qwen_image.py | 59.63% | 44 Missing ⚠️ |
| modelopt/torch/fastgen/discriminators.py | 0.00% | 37 Missing ⚠️ |
| modelopt/torch/fastgen/flow_matching.py | 64.36% | 31 Missing ⚠️ |
| modelopt/torch/fastgen/loader.py | 34.88% | 28 Missing ⚠️ |
| modelopt/torch/utils/dataset_utils.py | 86.36% | 12 Missing ⚠️ |
| ...delopt/torch/puzzletron/dataset/prepare_dataset.py | 12.50% | 7 Missing ⚠️ |
| modelopt/torch/fastgen/pipeline.py | 73.91% | 6 Missing ⚠️ |
| modelopt/torch/fastgen/factory.py | 83.33% | 3 Missing ⚠️ |
... and 4 more

Additional details and impacted files

@@                Coverage Diff                 @@
##           release/0.45.0    #1734      +/-   ##
==================================================
- Coverage           77.48%   76.81%   -0.68%     
==================================================
  Files                 489      504      +15     
  Lines               54415    55332     +917     
==================================================
+ Hits                42165    42501     +336     
- Misses              12250    12831     +581

| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | `42.28% <10.59%> (-0.64%)` | ⬇️ |
| gpu | `57.65% <12.67%> (-1.40%)` | ⬇️ |
| regression | `14.64% <1.38%> (-0.16%)` | ⬇️ |
| unit | `54.26% <67.32%> (+0.25%)` | ⬆️ |

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/llm_ptq/example_utils.py (1)

646-659: ⚠️ Potential issue | 🟠 Major

Fix has_pack_quantized_config() to handle object-style quantization configs like get_original_hf_quant_method() does.

The function calls .get() directly on quantization_config without checking whether it's a dict or object. In Transformers 4.56.0, quantization config objects (e.g., Mxfp4Config) do not support dict-style .get() access—only attribute access. This will raise AttributeError at runtime if quantization_config is an object instead of a dict.

The same file already demonstrates the correct pattern in get_original_hf_quant_method() (lines 546–548), which uses isinstance(quant_cfg, dict) to branch between .get() for dicts and getattr() for objects. Apply this same guard to has_pack_quantized_config() on lines 630 and 636.

Suggested fix

        def has_pack_quantized_config(config):
+           def _cfg_get(qcfg, key, default=None):
+               return qcfg.get(key, default) if isinstance(qcfg, dict) else getattr(qcfg, key, default)
+
            # Check top-level quantization_config
            if hasattr(config, "quantization_config"):
-               if config.quantization_config.get("format", None) == "pack-quantized":
+               if _cfg_get(config.quantization_config, "format") == "pack-quantized":
                    return True
            # Check nested text_config.quantization_config (for multi-modal models like kimi k2.5)
            if hasattr(config, "text_config") and hasattr(
                config.text_config, "quantization_config"
            ):
-               if config.text_config.quantization_config.get("format", None) == "pack-quantized":
+               if _cfg_get(config.text_config.quantization_config, "format") == "pack-quantized":
                    return True
            return False
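A self-contained way to exercise the suggested guard is sketched below. It is a condensed variant of the diff above, not the code in `example_utils.py`: the loop over the top-level config and `text_config` is a restructuring made for brevity, and `SimpleNamespace` objects are hypothetical stand-ins for real Transformers configs such as `Mxfp4Config`.

```python
from types import SimpleNamespace


def cfg_get(qcfg, key, default=None):
    """Read a key from either a dict-style or an object-style quantization config."""
    return qcfg.get(key, default) if isinstance(qcfg, dict) else getattr(qcfg, key, default)


def has_pack_quantized_config(config):
    """Check the top-level and nested text_config quantization configs."""
    for holder in (config, getattr(config, "text_config", None)):
        qcfg = getattr(holder, "quantization_config", None)
        if qcfg is not None and cfg_get(qcfg, "format") == "pack-quantized":
            return True
    return False


dict_style = SimpleNamespace(quantization_config={"format": "pack-quantized"})
object_style = SimpleNamespace(quantization_config=SimpleNamespace(format="pack-quantized"))
nested = SimpleNamespace(
    text_config=SimpleNamespace(quantization_config={"format": "pack-quantized"})
)
# Object-style config without a "format" attribute: must not raise AttributeError.
other = SimpleNamespace(quantization_config=SimpleNamespace(quant_method="mxfp4"))

results = [has_pack_quantized_config(c) for c in (dict_style, object_style, nested, other)]
```

The last case is the one the review comment is concerned with: an object-style config now yields `False` instead of failing on `.get()`.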

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/example_utils.py` around lines 646 - 659, The
`has_pack_quantized_config()` function calls `.get()` directly on
`quantization_config` without checking whether it's a dict or an object, which
causes AttributeError at runtime when the config is an object like `Mxfp4Config`
that doesn't support dict-style access. Fix this by adopting the same pattern
already used in `get_original_hf_quant_method()` at lines 546-548: add an
`isinstance(quantization_config, dict)` check to branch between using `.get()`
for dict-style configs and `getattr()` for object-style configs. Apply this
guard to both `.get()` calls in `has_pack_quantized_config()` around lines 630
and 636.

🧹 Nitpick comments (1)

modelopt/torch/fastgen/plugins/__init__.py (1)
24-27: ⚡ Quick win

Define explicit __all__ in package __init__.py before wildcard re-export.

This package re-exports plugin symbols but does not declare its own __all__. Add module-level __all__ and extend it from qwen_image.__all__ when the plugin import succeeds so the public surface stays explicit.

As per coding guidelines, “Define the public API with __all__ at the top of each module and re-export via from .module import * in package __init__.py files.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/fastgen/plugins/__init__.py` around lines 24 - 27, The package
__init__.py file in modelopt/torch/fastgen/plugins/ performs a wildcard import
from qwen_image but does not define its own __all__ to explicitly declare the
public API. Define a module-level __all__ variable (can be initialized as an
empty list or with expected symbols), then within the import_plugin context
block for qwen_image, extend __all__ to include the symbols from
qwen_image.__all__ after the successful import. This ensures the public surface
of the package remains explicit and follows the coding guidelines for defining
public APIs.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/alpamayo/quantize.py`:
- Around line 650-652: The global torch.no_grad() context manager wrapping the
main() function call disables gradient computation globally, which prevents the
--quantize auto path from computing gradients needed by the auto_quantize() API
for gradient-based sensitivity scoring during the search phase. Remove the with
torch.no_grad(): wrapper and call main() directly without the context manager to
allow gradients to flow through the loss function when needed.
- Around line 461-467: The debug logging statements that extract GPU tensor
values to CPU scalars using .item() calls (on v_pred and v_target with
torch.isfinite, min, max, and abs().mean() operations) create unnecessary
CPU-GPU synchronization points in the optimization loop hot path. Either remove
these print statements entirely, or gate them behind a conditional debug flag
(such as if debug_logging:) that defaults to False so they do not execute during
normal operation. This will eliminate the synchronization overhead while
preserving the ability to enable detailed logging when needed for debugging.

In `@examples/deepseek/deepseek_v4/quantize_to_nvfp4.py`:
- Around line 329-335: The per-block amax computation uses
mxfp4_to_nvfp4_per_block_amax which internally recomputes k_max, but this may
differ from the shared k_max passed to the parent function, causing the
in-range/out-of-range classification to mismatch with the weight_scale_2
computed from the shared k_max. To fix this, either pass the shared k_max
parameter into the mxfp4_to_nvfp4_per_block_amax helper function so it uses the
correct shared value for both classification and scaling, or compute the
per_block_scale and in-range logic directly in this location using the shared
k_max instead of relying on the helper's internally recomputed value. This
ensures the block classification and weight_scale_2 derivation use the same
k_max reference.

In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Line 669: The torch.load calls at lines 669, 674, 687, and 696 in
dmd2_recipe.py use weights_only=False when loading checkpoints from the
user-supplied restore_from parameter, creating a Remote Code Execution risk if
checkpoints are untrusted or tampered with. For each of these four locations,
either change weights_only=False to weights_only=True to safely deserialize only
tensor data, or if weights_only=False is absolutely necessary for functionality,
add an inline comment explaining the security justification and request approval
from `@NVIDIA/modelopt-setup-codeowners`. The preferred approach is to switch all
four calls to weights_only=True unless there is a documented reason why the
model architecture requires full pickle deserialization.

In `@examples/diffusers/fastgen/export_diffusers_qwen_image.py`:
- Around line 49-52: The example usage in the export_diffusers_qwen_image.py
script shows `--base_pipeline_path Qwen/Qwen-Image`, but the export_diffusers()
function requires a local directory path and will fail with a non-directory
input like a model identifier. Replace the Qwen/Qwen-Image reference in the
example usage (around line 51) with a local snapshot directory path (e.g.,
/path/to/local/qwen_image_base or similar) to accurately reflect the expected
input format.

In `@examples/diffusers/fastgen/inference_dmd2_qwen_image.py`:
- Line 483: The os.makedirs call at line 483 crashes when output_png is a bare
filename because os.path.dirname returns an empty string. Before calling
os.makedirs on the dirname of output_png, guard against empty parent paths by
checking if the dirname is empty and using "." (current directory) as a fallback
when it is. This ensures the code handles both full paths and bare filenames
gracefully.
- Line 153: The torch.load call loading the EMA checkpoint with
weights_only=False enables unsafe pickle deserialization, creating a
code-execution risk for malicious files. Since the ema_path parameter is
caller-supplied without documented safety justification, and EMA state contains
only model weights which can be safely deserialized, change weights_only=False
to weights_only=True in the torch.load call on line 153 to disable pickle
deserialization and load only tensor data safely.
- Around line 145-146: The directory validation at lines 145-146 using
os.path.isdir rejects HuggingFace model IDs like the documented CLI default
"Qwen/Qwen-Image" at line 505, preventing diffusers from resolving the model.
Either remove the os.path.isdir check and let diffusers handle both local paths
and model IDs, or update the CLI default and help text at line 505 to require a
local snapshot path instead. Additionally, add an inline comment at line 153
where torch.load is called with weights_only=False explaining that it is safe
because the EMA checkpoint is internally-generated and trusted, not
user-supplied, to satisfy security guidelines.
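
The empty-parent guard requested for the `os.makedirs` call at line 483 can be sketched as follows. `ensure_parent_dir` is a hypothetical helper name used only for this illustration; the script itself may inline the check.

```python
import os
import tempfile


def ensure_parent_dir(path: str) -> str:
    """Create the parent directory of `path`, treating a bare filename as '.'."""
    parent = os.path.dirname(path) or "."
    os.makedirs(parent, exist_ok=True)
    return parent


# Unguarded behavior: dirname of a bare filename is "", which makedirs rejects.
try:
    os.makedirs(os.path.dirname("image.png"), exist_ok=True)
    bare_failed = False
except FileNotFoundError:
    bare_failed = True

# Guarded behavior for a bare filename and for a nested path.
bare_parent = ensure_parent_dir("image.png")
with tempfile.TemporaryDirectory() as tmp:
    nested_parent = ensure_parent_dir(os.path.join(tmp, "runs", "a", "image.png"))
    nested_exists = os.path.isdir(nested_parent)
```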

In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 296-305: Variable expansions in the python command invocations are
unquoted, which can cause argument splitting or glob expansion if the variables
contain spaces or special characters. In the mmlu.py command starting at line
296, wrap all variable expansions including $MODEL_ABS_PATH, $SAVE_PATH,
$MMLU_DATA_PATH, and $mmlu_flags in double quotes to ensure they are treated as
single arguments. Apply the same quoting fix to the corresponding command
invocation at lines 320-323 for consistency, wrapping all variable expansions in
that location with double quotes as well.

In `@modelopt/deploy/llm/generate.py`:
- Around line 291-295: Replace the assert statement in the
generate_context_logits() method that validates enable_kv_cache_reuse with an
explicit if statement that raises a ValueError. The current assert can be
stripped when Python runs with optimization flags (like -O), which would
silently allow incorrect behavior in this public API method. Change the
condition to check if self._enable_kv_cache_reuse is True, and if so, raise a
ValueError with the same descriptive error message that currently appears in the
assert.
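
A minimal sketch of the assert-to-`ValueError` change, assuming a stub class in place of the real deployment wrapper. `LLMStub`, the placeholder return value, and the wording of the error message are all illustrative and not taken from `generate.py`.

```python
class LLMStub:
    """Hypothetical stand-in for the deployment wrapper; not the real class."""

    def __init__(self, enable_kv_cache_reuse: bool = True):
        self._enable_kv_cache_reuse = enable_kv_cache_reuse

    def generate_context_logits(self, prompts):
        # An explicit raise still runs under `python -O`, where asserts are stripped.
        if self._enable_kv_cache_reuse:
            raise ValueError(
                "generate_context_logits requires enable_kv_cache_reuse=False "
                "so that full-length context logits are returned."
            )
        return [f"<logits for {p}>" for p in prompts]  # placeholder result


def call_raises(llm) -> bool:
    """Return True if the call is rejected with ValueError."""
    try:
        llm.generate_context_logits(["hello"])
    except ValueError:
        return True
    return False


raises_with_reuse = call_raises(LLMStub(enable_kv_cache_reuse=True))
raises_without_reuse = call_raises(LLMStub(enable_kv_cache_reuse=False))
```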

In `@modelopt/torch/fastgen/__init__.py`:
- Around line 57-68: The package API surface is not explicitly curated via
`__all__` in the two `__init__.py` files, making API drift likely. In
modelopt/torch/fastgen/__init__.py at lines 57-68, add an explicit `__all__`
list that aggregates all exported names from the wildcard imports (config, ema,
factory, loader, methods.dmd, pipeline modules) and explicitly includes the
module-level re-exports (flow_matching, losses, utils, plugins) to define the
curated public API surface. In modelopt/torch/fastgen/methods/__init__.py at
line 18, after the wildcard re-export from .dmd, add a line that imports __all__
from the dmd module and assigns it as __all__ to explicitly pin the exported
surface, ensuring both files follow the coding guideline of declaring public
surfaces with explicit `__all__` declarations.

In `@modelopt/torch/fastgen/config.py`:
- Around line 94-103: The _check_bounds validator method uses assert statements
to validate external input from YAML configuration, which is unsafe because
asserts can be disabled in optimized Python runs (with -O flag), allowing
invalid configuration to pass validation. Replace all four assert statements
with explicit ValueError raises instead, maintaining the same validation logic
and error messages but using the raise ValueError syntax to guarantee validation
always occurs regardless of Python optimization settings.

In `@modelopt/torch/fastgen/discriminators.py`:
- Around line 92-94: The feature_indices filtering at line 92 in the __init__
method only checks the upper bound (i < num_blocks) but allows negative indices
and can result in an empty set, causing torch.cat to fail later at line 136.
Replace the current filter condition with proper validation that enforces 0 <= i
< num_blocks for each index. Additionally, add a check after filtering to raise
an informative error immediately if feature_indices becomes empty, rather than
allowing silent failure downstream. This validates the input once at the
interface boundary as per coding guidelines.
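
The bounds check and empty-set error described for `feature_indices` can be sketched as a standalone function. `select_feature_indices` is a hypothetical name, and the error text is illustrative; the discriminator's `__init__` may perform the same check inline.

```python
def select_feature_indices(feature_indices, num_blocks):
    """Keep indices in [0, num_blocks) and fail fast if none survive."""
    kept = [i for i in feature_indices if 0 <= i < num_blocks]
    if not kept:
        raise ValueError(
            f"No valid feature_indices in {list(feature_indices)} "
            f"for a model with {num_blocks} blocks."
        )
    return kept


valid = select_feature_indices([-1, 0, 3, 7], num_blocks=4)

try:
    select_feature_indices([-2, 9], num_blocks=4)
    empty_rejected = False
except ValueError:
    empty_rejected = True
```

Raising at construction time surfaces the misconfiguration before the later `torch.cat` call receives an empty list.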

In `@modelopt/torch/fastgen/ema.py`:
- Around line 127-129: The EMA shadow initialization and reset paths do not
respect the local_shard mode and unnecessarily call _gather_full(), which
triggers expensive all-gathers and memory spikes. Add a conditional check for
config.mode == "local_shard" before calling _gather_full() in the shadow
initialization and reset logic. When in local_shard mode, use the local shard
directly (the parameter p itself or a detached copy) instead of gathering the
full tensor across all ranks. Apply this fix at all locations where
_gather_full() is called during shadow initialization and reset operations,
including the code block around self._shadow[clean] assignment and any similar
shadow update paths.

In `@modelopt/torch/fastgen/flow_matching.py`:
- Around line 185-195: The _truncated_lognormal function creates all tensors on
CPU during sampling and only transfers the final result to the requested device,
causing unnecessary host/device traffic. Pass the device parameter directly to
all tensor creation calls including torch.tensor calls for log_min_t, log_max_t,
mean, and std tensors, as well as the torch.rand call for sampling u, to ensure
all intermediate computations happen on the target device from the start rather
than requiring a final .to() transfer.

In `@modelopt/torch/fastgen/loader.py`:
- Around line 74-93: The candidate paths in the _candidate_paths function are
being added in the wrong order. Currently, filesystem paths are appended before
built-in recipe paths, but the documented contract states built-in recipes
should be checked first. For both the string and Path branches of the
conditional logic, reverse the order of the candidate appends so that all
_BUILTIN_RECIPES_LIB.joinpath calls are executed before the corresponding Path
or direct filesystem path calls. This applies to all four
candidates.append/extend calls in the string branch and all four calls in the
Path branch.

In `@tests/examples/diffusers/sparsity/test_sparsity.py`:
- Around line 165-170: Move the imports currently at lines 165–170 (from
diffusers import AutoencoderKLWan and WanPipeline, the
modelopt.torch.sparsity.attention_sparsity import, the
modelopt.torch.export.export_hf_checkpoint import, and the SparseAttentionModule
import) from inside the test function to the module scope at the top of the file
with the other imports. If any of these imports require deferred loading due to
optional dependencies or circular import concerns, keep them in the function and
add a brief comment explaining why.

In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 38-59: The test reuses the same messages variable after it has
been processed by the first prepare_messages_for_chat_template call, which may
have modified the input. For the second assertion that tests
normalize_tool_calls=False, create a fresh copy of the messages input with the
identical structure instead of reusing the modified messages variable. This
ensures the non-normalizing code path is tested with unmodified input and can
properly detect regressions in how it preserves content when normalization is
disabled.
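
For the `_check_bounds` and `feature_indices` items above, a minimal sketch of validation that raises explicitly instead of asserting. The function names, parameters, and bounds below are hypothetical stand-ins, not the actual fastgen code. The point is only that `raise ValueError` still fires under `python -O`, whereas `assert` statements are compiled out.

```python
from typing import Iterable, List


def check_bounds(min_t: float, max_t: float) -> None:
    """Hypothetical bounds check for values that arrive from a YAML config."""
    # An explicit raise is not stripped by `python -O`, unlike an assert.
    if not 0.0 <= min_t < max_t <= 1.0:
        raise ValueError(
            f"expected 0 <= min_t < max_t <= 1, got min_t={min_t}, max_t={max_t}"
        )


def filter_feature_indices(feature_indices: Iterable[int], num_blocks: int) -> List[int]:
    """Keep indices in [0, num_blocks) and fail fast if nothing survives."""
    kept = sorted({i for i in feature_indices if 0 <= i < num_blocks})
    if not kept:
        raise ValueError(
            f"no feature index falls in [0, {num_blocks}); got {list(feature_indices)}"
        )
    return kept


# Usage: negative and too-large indices are dropped, duplicates collapse.
valid = filter_feature_indices([-1, 0, 2, 9, 2], num_blocks=4)
```

Raising at the interface boundary means a bad config fails where it is read, not later inside a `torch.cat` call.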

---

Outside diff comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 646-659: The `has_pack_quantized_config()` function calls `.get()`
directly on `quantization_config` without checking whether it's a dict or an
object, which causes AttributeError at runtime when the config is an object like
`Mxfp4Config` that doesn't support dict-style access. Fix this by adopting the
same pattern already used in `get_original_hf_quant_method()` at lines 546-548:
add an `isinstance(quantization_config, dict)` check to branch between using
`.get()` for dict-style configs and `getattr()` for object-style configs. Apply
this guard to both `.get()` calls in `has_pack_quantized_config()` around lines
630 and 636.
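
The dict-versus-object guard described above might look like the following. `read_quant_field` is a hypothetical helper name, and the sample configs are illustrative stand-ins (a `SimpleNamespace` plays the role of an object-style config such as `Mxfp4Config`). The actual fix would more likely apply the same branch inline in `has_pack_quantized_config()`.

```python
from types import SimpleNamespace
from typing import Any


def read_quant_field(quantization_config: Any, key: str, default: Any = None) -> Any:
    """Read `key` from either a dict-style or an attribute-style config."""
    if quantization_config is None:
        return default
    if isinstance(quantization_config, dict):
        # Dict-style configs support .get().
        return quantization_config.get(key, default)
    # Object-style configs expose fields as attributes instead.
    return getattr(quantization_config, key, default)


# Illustrative inputs only; the real configs come from the HF model config.
dict_cfg = {"quant_method": "compressed-tensors", "format": "pack-quantized"}
obj_cfg = SimpleNamespace(quant_method="mxfp4")

dict_format = read_quant_field(dict_cfg, "format")
obj_format = read_quant_field(obj_cfg, "format")  # attribute absent, falls back to default
```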

---

Nitpick comments:
In `@modelopt/torch/fastgen/plugins/__init__.py`:
- Around line 24-27: The package __init__.py file in
modelopt/torch/fastgen/plugins/ performs a wildcard import from qwen_image but
does not define its own __all__ to explicitly declare the public API. Define a
module-level __all__ variable (can be initialized as an empty list or with
expected symbols), then within the import_plugin context block for qwen_image,
extend __all__ to include the symbols from qwen_image.__all__ after the
successful import. This ensures the public surface of the package remains
explicit and follows the coding guidelines for defining public APIs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82f25edc-4ce3-4c7d-9a1c-2b2541d741f9

📥 Commits

Reviewing files that changed from the base of the PR and between 2c52e7b and e276eef.

⛔ Files ignored due to path filters (2)

examples/alpamayo/0417_16rows_train_set_for_calibration_25.10.parquet is excluded by !**/*.parquet
examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/figures/learning_curves.png is excluded by !**/*.png

📒 Files selected for processing (122)

.github/actions/cache-extensions/action.yml
.github/workflows/_example_tests_runner.yml
.github/workflows/example_tests.yml
.github/workflows/gpu_tests.yml
.github/workflows/unit_tests.yml
CHANGELOG.rst
README.md
docs/source/guides/10_recipes.rst
examples/alpamayo/README.md
examples/alpamayo/quantize.py
examples/deepseek/README.md
examples/deepseek/deepseek_v3/ptq.py
examples/deepseek/deepseek_v4/quantize_to_nvfp4.py
examples/diffusers/README.md
examples/diffusers/fastgen/README.md
examples/diffusers/fastgen/configs/dmd2_qwen_image.yaml
examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
examples/diffusers/fastgen/dmd2_finetune.py
examples/diffusers/fastgen/dmd2_recipe.py
examples/diffusers/fastgen/export_diffusers_qwen_image.py
examples/diffusers/fastgen/inference_dmd2_qwen_image.py
examples/diffusers/fastgen/requirements.txt
examples/diffusers/sparsity/README.md
examples/diffusers/sparsity/wan22_skip_softmax.py
examples/llm_eval/lm_eval_tensorrt_llm.py
examples/llm_eval/mmlu.py
examples/llm_eval/run_simple_eval.sh
examples/llm_ptq/cast_mxfp4_to_nvfp4.py
examples/llm_ptq/example_utils.py
examples/llm_ptq/hf_ptq.py
examples/llm_ptq/run_tensorrt_llm.py
examples/llm_ptq/scripts/huggingface_example.sh
examples/llm_ptq/scripts/parser.sh
examples/llm_qat/ARGUMENTS.md
examples/megatron_bridge/README.md
examples/megatron_bridge/requirements.txt
examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md
examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/nemo_evaluator.yaml
examples/megatron_bridge/tutorials/README.md
examples/pruning/README.md
examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml
examples/pruning/minitron_vs_puzzletron/README.md
examples/puzzletron/README.md
examples/puzzletron/configs/gptoss-20b_remove_experts_memory/gptoss-20b_remove_experts_memory.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml
examples/puzzletron/configs/llama-3_2-3B_pruneffn_memory/llama-3_2-3B_pruneffn_memory.yaml
examples/puzzletron/configs/mistral-small-24b-instruct-2501_pruneffn_memory/mistral-small-24b-instruct-2501_pruneffn_memory.yaml
examples/puzzletron/configs/nemotron-nano-12b-v2/nemotron_nano_12b_v2_pruneffn_memory.yaml
examples/puzzletron/configs/qwen2_5_7b_instruct_pruneffn_memory/qwen2_5_7b_instruct_pruneffn_memory.yaml
examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml
examples/specdec_bench/specdec_bench/models/vllm.py
modelopt/deploy/llm/generate.py
modelopt/onnx/quantization/ort_patching.py
modelopt/onnx/utils.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/fastgen/__init__.py
modelopt/torch/fastgen/config.py
modelopt/torch/fastgen/discriminators.py
modelopt/torch/fastgen/ema.py
modelopt/torch/fastgen/factory.py
modelopt/torch/fastgen/flow_matching.py
modelopt/torch/fastgen/loader.py
modelopt/torch/fastgen/losses.py
modelopt/torch/fastgen/methods/__init__.py
modelopt/torch/fastgen/methods/dmd.py
modelopt/torch/fastgen/pipeline.py
modelopt/torch/fastgen/plugins/__init__.py
modelopt/torch/fastgen/plugins/qwen_image.py
modelopt/torch/fastgen/utils.py
modelopt/torch/puzzletron/dataset/prepare_dataset.py
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
modelopt/torch/quantization/plugins/huggingface.py
modelopt/torch/quantization/plugins/transformers_trainer.py
modelopt/torch/quantization/utils/numeric_utils.py
modelopt/torch/sparsity/attention_sparsity/calibration/calibrate.py
modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py
modelopt/torch/sparsity/attention_sparsity/config.py
modelopt/torch/sparsity/attention_sparsity/conversion.py
modelopt/torch/sparsity/attention_sparsity/plugins/sparse_attn_config.py
modelopt/torch/utils/dataset_utils.py
modelopt/torch/utils/plugins/megatron_preprocess_data.py
modelopt_recipes/README.md
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
modelopt_recipes/general/distillation/dmd2_qwen_image.yaml
modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_cast.yaml
modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
modelopt_recipes/general/ptq/nvfp4_omlp_only-kv_fp8_cast.yaml
modelopt_recipes/general/ptq/nvfp4_weight_only-kv_fp8_cast.yaml
modelopt_recipes/models/Nemotron-H/Nemotron-3-Nano-4B/nvfp4_w4a16.yaml
modelopt_recipes/ptq.md
tests/_test_utils/torch/diffusers_models.py
tests/_test_utils/torch/megatron/models.py
tests/_test_utils/torch/transformers_models.py
tests/examples/alpamayo/test_quantize.py
tests/examples/diffusers/conftest.py
tests/examples/diffusers/sparsity/test_sparsity.py
tests/examples/diffusers_sparsity/test_sparsity.py
tests/examples/llm_eval/test_llm_eval.py
tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py
tests/examples/llm_ptq/test_example_utils.py
tests/gpu/onnx/quantization/test_plugin.py
tests/gpu/onnx/test_ort_patching.py
tests/gpu/torch/quantization/test_gpt_oss_mxfp4_nvfp4_cast_cuda.py
tests/gpu/torch/utils/test_dataset_utils.py
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
tests/unit/recipe/test_loader.py
tests/unit/recipe/test_presets.py
tests/unit/torch/fastgen/conftest.py
tests/unit/torch/fastgen/test_dmd_gradient_routing.py
tests/unit/torch/fastgen/test_dmd_math.py
tests/unit/torch/fastgen/test_dmd_pipeline_step.py
tests/unit/torch/fastgen/test_hook_requirements.py
tests/unit/torch/fastgen/test_pred_type_conversion.py
tests/unit/torch/fastgen/test_qwen_image_plugin.py
tests/unit/torch/quantization/plugins/test_huggingface.py
tests/unit/torch/quantization/test_numeric_utils.py
tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py
tests/unit/torch/sparsity/attention_sparsity/test_sparse_attn_config.py
tests/unit/torch/utils/test_dataset_utils.py

💤 Files with no reviewable changes (2)

examples/megatron_bridge/requirements.txt
tests/examples/diffusers_sparsity/test_sparsity.py

coderabbitai · 2026-06-15T18:26:54Z

+        print(
+            f"[autoquant-fwd] v_pred: finite={torch.isfinite(v_pred).all().item()} "
+            f"min={v_pred.min().item():.4g} max={v_pred.max().item():.4g} "
+            f"abs_mean={v_pred.abs().mean().item():.4g} | "
+            f"v_target: finite={torch.isfinite(v_target).all().item()} "
+            f"min={v_target.min().item():.4g} max={v_target.max().item():.4g}"
+        )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and check its size
fd -a "quantize.py" examples/alpamayo/
# Then read the relevant section around lines 461-473
cat -n examples/alpamayo/quantize.py | sed -n '450,480p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1827

🏁 Script executed:

# Check context around forward_step to see if there's any debug flag or conditional logging
cat -n examples/alpamayo/quantize.py | sed -n '440,480p'

Repository: NVIDIA/Model-Optimizer

Length of output: 2247

Remove debug logging or gate behind debug flag to avoid CPU-GPU sync in AutoQuantize hot path.

Lines 461–467 and 470–473 extract GPU tensor values with .item() for logging on every iteration of the optimization loop. This introduces synchronization points that can materially slow the search. Either remove the prints or gate them behind a conditional debug flag (e.g., if debug_logging:).

Per coding guidelines, avoid tensor-to-Python scalar extraction in GPU hot paths unless the CPU needs the value. Here these extractions are only for logging and are unnecessary for computation.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/alpamayo/quantize.py` around lines 461 - 467, The debug logging statements that extract GPU tensor values to CPU scalars using .item() calls (on v_pred and v_target with torch.isfinite, min, max, and abs().mean() operations) create unnecessary CPU-GPU synchronization points in the optimization loop hot path. Either remove these print statements entirely, or gate them behind a conditional debug flag (such as if debug_logging:) that defaults to False so they do not execute during normal operation. This will eliminate the synchronization overhead while preserving the ability to enable detailed logging when needed for debugging.

Source: Coding guidelines

coderabbitai · 2026-06-15T18:26:54Z

+if __name__ == "__main__":
+    with torch.no_grad():
+        main()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the root cause in this PR file and inspect auto-quantize internals for grad usage.
rg -n -C3 'with torch\.no_grad:|auto_quantize\(|def forward_step|def loss_func|mse_loss' examples/alpamayo/quantize.py
rg -n -C3 'def auto_quantize|backward\(|autograd|grad' modelopt/torch/quantization

Repository: NVIDIA/Model-Optimizer

Length of output: 50379

Remove the global torch.no_grad() wrapper around main().

The global autograd disable at line 651 prevents gradient computation in the --quantize auto path (line 485), which relies on loss-driven search with explicit .backward() calls. The auto_quantize() API uses gradient-based sensitivity scoring by default and expects gradients to flow through the loss function during the search phase.

✅ Proposed fix

  if __name__ == "__main__":
-     with torch.no_grad():
-         main()
+     main()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/alpamayo/quantize.py` around lines 650 - 652, The global torch.no_grad() context manager wrapping the main() function call disables gradient computation globally, which prevents the --quantize auto path from computing gradients needed by the auto_quantize() API for gradient-based sensitivity scoring during the search phase. Remove the with torch.no_grad(): wrapper and call main() directly without the context manager to allow gradients to flow through the loss function when needed.

coderabbitai · 2026-06-15T18:26:54Z

+    per_block_amax = mxfp4_to_nvfp4_per_block_amax(blocks, e8m0)  # (out, nblk16) fp32
+
+    m = k_max - E4M3_KMAX
+    weight_scale_2 = torch.tensor(2.0**m, dtype=torch.float32, device=bf16.device).reshape(())
+    per_block_scale = (
+        (per_block_amax / (E2M1_MAX * weight_scale_2))
+        .clamp(min=2**-9, max=E4M3_MAX)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the shared k_max when deriving the out-of-range (OOR) per-block amax.

mxfp4_to_nvfp4_per_block_amax(blocks, e8m0) recomputes k_max from this tensor, but w1/w3 cast uses the shared fused-GEMM1 k_max passed into _quantize_weight_nvfp4_lossless. A w1/w3 tensor with a lower local max can classify blocks as in-range here while weight_scale_2 is based on the larger shared max, so the OOR fallback no longer matches the scale actually emitted. Thread the shared k_max into the per-block helper or compute in_range/data-derived fallback locally from the same k_max.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/deepseek/deepseek_v4/quantize_to_nvfp4.py` around lines 329 - 335, The per-block amax computation uses mxfp4_to_nvfp4_per_block_amax which internally recomputes k_max, but this may differ from the shared k_max passed to the parent function, causing the in-range/out-of-range classification to mismatch with the weight_scale_2 computed from the shared k_max. To fix this, either pass the shared k_max parameter into the mxfp4_to_nvfp4_per_block_amax helper function so it uses the correct shared value for both classification and scaling, or compute the per_block_scale and in-range logic directly in this location using the shared k_max instead of relying on the helper's internally recomputed value. This ensures the block classification and weight_scale_2 derivation use the same k_max reference.

coderabbitai · 2026-06-15T18:26:54Z

+            )
+
+        if os.path.isfile(ema_path) and self._dmd_pipeline.ema is not None:
+            ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the file
find . -name "dmd2_recipe.py" -type f

Repository: NVIDIA/Model-Optimizer

Length of output: 110

🏁 Script executed:

# Check the specific lines mentioned
head -n 700 examples/diffusers/fastgen/dmd2_recipe.py | tail -n 50

Repository: NVIDIA/Model-Optimizer

Length of output: 2721

🏁 Script executed:

# Get broader context around the torch.load calls
sed -n '650,710p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3251

🏁 Script executed:

# Check if there are inline comments justifying weights_only=False
grep -n "weights_only" examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 440

🏁 Script executed:

# Verify file location and type
ls -la examples/diffusers/fastgen/dmd2_recipe.py && wc -l examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 205

🏁 Script executed:

# Check where these paths originate - find the function/method these lines are in
sed -n '600,670p' examples/diffusers/fastgen/dmd2_recipe.py | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 3510

🏁 Script executed:

# Look for where ema_path, state_path, disc_path, disc_opt_path are defined
grep -n "ema_path\|state_path\|disc_path\|disc_opt_path" examples/diffusers/fastgen/dmd2_recipe.py | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1505

🏁 Script executed:

# Find the function definition and parameter origins
sed -n '600,670p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3510

🏁 Script executed:

# Check where ckpt_dir comes from and what the function signature is
sed -n '550,650p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5259

🏁 Script executed:

# Check how restore_from is set and where it comes from
grep -n "restore_from" examples/diffusers/fastgen/dmd2_recipe.py | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 1433

🏁 Script executed:

# Look at the load_checkpoint method to understand the flow
sed -n '700,800p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4337

🏁 Script executed:

# Check where load_checkpoint is called and what calls _restore_dmd_extras
grep -B5 -A5 "_restore_dmd_extras" examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1660

Replace unsafe torch.load(..., weights_only=False) with weights_only=True in checkpoint restore path.

These four torch.load calls at lines 669, 674, 687, 696 deserialize checkpoint files with pickle enabled. The checkpoint paths originate from user-supplied restore_from input, creating an RCE risk if a checkpoint is untrusted or tampered with. None of the calls include inline comments justifying why weights_only=False is necessary and safe. Switch to weights_only=True, or if False is genuinely unavoidable, add an inline trust rationale and request security exception approval from @NVIDIA/modelopt-setup-codeowners.
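
To illustrate why pickle-enabled loading is flagged: `torch.load(..., weights_only=False)` deserializes with Python's `pickle`, and a pickle stream can name any importable callable to invoke at load time. The sketch below uses only the standard library and a deliberately harmless callable (`len`); the `Payload` class is hypothetical and stands in for a tampered checkpoint.

```python
import pickle


class Payload:
    """Illustrative object whose pickle stream names an arbitrary callable."""

    def __reduce__(self):
        # On load, pickle calls len("not a tensor") instead of rebuilding Payload.
        # A hostile file could name a far less harmless callable here.
        return (len, ("not a tensor",))


blob = pickle.dumps(Payload())

# The loader never sees a Payload instance, only the callable's return value.
restored = pickle.loads(blob)
```

Here `restored` is the integer returned by `len`, which shows that the file's author, not the loading code, chose what ran during deserialization.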

Suggested fix

- ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+ ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)

- state = torch.load(state_path, map_location="cpu", weights_only=False)
+ state = torch.load(state_path, map_location="cpu", weights_only=True)

- disc_state = torch.load(disc_path, map_location="cpu", weights_only=False)
+ disc_state = torch.load(disc_path, map_location="cpu", weights_only=True)

- disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False)
+ disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/diffusers/fastgen/dmd2_recipe.py` at line 669, The torch.load calls at lines 669, 674, 687, and 696 in dmd2_recipe.py use weights_only=False when loading checkpoints from the user-supplied restore_from parameter, creating a Remote Code Execution risk if checkpoints are untrusted or tampered with. For each of these four locations, either change weights_only=False to weights_only=True to safely deserialize only tensor data, or if weights_only=False is absolutely necessary for functionality, add an inline comment explaining the security justification and request approval from `@NVIDIA/modelopt-setup-codeowners`. The preferred approach is to switch all four calls to weights_only=True unless there is a documented reason why the model architecture requires full pickle deserialization.

Source: Coding guidelines

coderabbitai · 2026-06-15T18:26:54Z

+    python export_diffusers_qwen_image.py \\
+        --student_path     /path/to/checkpoint/epoch_0_step_500/model/consolidated \\
+        --base_pipeline_path Qwen/Qwen-Image \\
+        --output_dir       /path/to/output/qwen_image_dmd2 \\


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the --base_pipeline_path usage example.

The example on Line 51 uses Qwen/Qwen-Image, but export_diffusers() requires a local directory and fails on non-directory input. Please update the example/help text to a local snapshot path (or broaden input handling).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/diffusers/fastgen/export_diffusers_qwen_image.py` around lines 49 - 52, The example usage in the export_diffusers_qwen_image.py script shows `--base_pipeline_path Qwen/Qwen-Image`, but the export_diffusers() function requires a local directory path and will fail with a non-directory input like a model identifier. Replace the Qwen/Qwen-Image reference in the example usage (around line 51) with a local snapshot directory path (e.g., /path/to/local/qwen_image_base or similar) to accurately reflect the expected input format.

coderabbitai · 2026-06-15T18:26:55Z

+                full = _gather_full(p.detach(), fsdp2=config.fsdp2)
+                target_dtype = _resolve_dtype(config.dtype, full.dtype)
+                self._shadow[clean] = copy.deepcopy(full).to(dtype=target_dtype)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

local_shard mode is bypassed during EMA shadow init/reset.

When config.mode == "local_shard", these paths still call _gather_full(...), which can trigger unnecessary all-gathers and memory spikes.

Suggested fix

- full = _gather_full(p.detach(), fsdp2=config.fsdp2)
+ full = (
+     _gather_full(p.detach(), fsdp2=config.fsdp2)
+     if config.mode == "full_tensor"
+     else (p.detach().to_local() if _is_distributed_tensor(p) else p.detach())
+ )
  target_dtype = _resolve_dtype(config.dtype, full.dtype)
  self._shadow[clean] = copy.deepcopy(full).to(dtype=target_dtype)
  ...
- live = _gather_full(p.detach(), fsdp2=self.config.fsdp2)
+ live = (
+     _gather_full(p.detach(), fsdp2=self.config.fsdp2)
+     if self.config.mode == "full_tensor"
+     else (p.detach().to_local() if _is_distributed_tensor(p) else p.detach())
+ )
  shadow.copy_(live.to(device=shadow.device, dtype=shadow.dtype))

Also applies to: 252-253

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/fastgen/ema.py` around lines 127 - 129, The EMA shadow initialization and reset paths do not respect the local_shard mode and unnecessarily call _gather_full(), which triggers expensive all-gathers and memory spikes. Add a conditional check for config.mode == "local_shard" before calling _gather_full() in the shadow initialization and reset logic. When in local_shard mode, use the local shard directly (the parameter p itself or a detached copy) instead of gathering the full tensor across all ranks. Apply this fix at all locations where _gather_full() is called during shadow initialization and reset operations, including the code block around self._shadow[clean] assignment and any similar shadow update paths.

coderabbitai · 2026-06-15T18:26:55Z

+    log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64)
+    log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64)
+    normal = Normal(
+        torch.tensor(mean, dtype=torch.float64),
+        torch.tensor(std, dtype=torch.float64),
+    )
+    cdf_min = normal.cdf(log_min_t)
+    cdf_max = normal.cdf(log_max_t)
+    u = torch.rand(n, dtype=torch.float64) * (cdf_max - cdf_min) + cdf_min
+    t = normal.icdf(u).exp()
+    return t.to(device=device, dtype=dtype)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

_truncated_lognormal ignores the requested device during sampling.

This path samples on CPU and transfers later, which adds avoidable host/device traffic in training loops.

Suggested fix

-     log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64)
-     log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64)
+     log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64, device=device)
+     log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64, device=device)
      normal = Normal(
-         torch.tensor(mean, dtype=torch.float64),
-         torch.tensor(std, dtype=torch.float64),
+         torch.tensor(mean, dtype=torch.float64, device=device),
+         torch.tensor(std, dtype=torch.float64, device=device),
      )
      cdf_min = normal.cdf(log_min_t)
      cdf_max = normal.cdf(log_max_t)
-     u = torch.rand(n, dtype=torch.float64) * (cdf_max - cdf_min) + cdf_min
+     u = torch.rand(n, dtype=torch.float64, device=device) * (cdf_max - cdf_min) + cdf_min
      t = normal.icdf(u).exp()
      return t.to(device=device, dtype=dtype)

As per coding guidelines, keep tensor work on the GPU and avoid unnecessary CPU-GPU syncs.
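
For reference, the sampler in question is inverse-CDF sampling of a log-normal truncated to `[min_t, max_t]`. A pure-Python sketch of the same math using `statistics.NormalDist` follows. It is not the torch implementation and says nothing about device placement; the function name and the `rng` parameter are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist
from typing import List, Optional


def truncated_lognormal(
    n: int,
    mean: float,
    std: float,
    min_t: float,
    max_t: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw n samples t with log(t) ~ Normal(mean, std), restricted to [min_t, max_t]."""
    rng = rng or random.Random()
    normal = NormalDist(mu=mean, sigma=std)
    # CDF values of the truncation bounds in log space.
    cdf_min = normal.cdf(math.log(min_t))
    cdf_max = normal.cdf(math.log(max_t))
    samples = []
    for _ in range(n):
        # Uniform draw restricted to the CDF interval, then mapped back through the inverse CDF.
        u = cdf_min + rng.random() * (cdf_max - cdf_min)
        samples.append(math.exp(normal.inv_cdf(u)))
    return samples


# Usage with illustrative parameters and a seeded generator.
ts = truncated_lognormal(1000, mean=0.0, std=1.0, min_t=0.001, max_t=0.999, rng=random.Random(0))
```

Because every uniform draw is confined to the CDF interval of the bounds, no rejection loop is needed.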

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/fastgen/flow_matching.py` around lines 185 - 195, The _truncated_lognormal function creates all tensors on CPU during sampling and only transfers the final result to the requested device, causing unnecessary host/device traffic. Pass the device parameter directly to all tensor creation calls including torch.tensor calls for log_min_t, log_max_t, mean, and std tensors, as well as the torch.rand call for sampling u, to ensure all intermediate computations happen on the target device from the start rather than requiring a final .to() transfer.

Source: Coding guidelines

coderabbitai · 2026-06-15T18:26:55Z

+    # Normalize to string for suffix probing; keep Path/Traversable behavior otherwise.
+    if isinstance(config_file, str):
+        base = config_file
+        if base.endswith(_SUFFIXES):
+            candidates.append(Path(base))
+            candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
+        else:
+            candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
+            candidates.extend(_BUILTIN_RECIPES_LIB.joinpath(base + suffix) for suffix in _SUFFIXES)
+    elif isinstance(config_file, Path):
+        if config_file.suffix in _SUFFIXES:
+            candidates.append(config_file)
+            if not config_file.is_absolute():
+                candidates.append(_BUILTIN_RECIPES_LIB.joinpath(str(config_file)))
+        else:
+            candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
+            if not config_file.is_absolute():
+                candidates.extend(
+                    _BUILTIN_RECIPES_LIB.joinpath(str(config_file) + suffix) for suffix in _SUFFIXES
+                )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Candidate resolution order is inverted vs the documented contract.

The module docs say built-in recipes are checked first, but _candidate_paths currently prioritizes filesystem paths.

Suggested fix

      if isinstance(config_file, str):
          base = config_file
          if base.endswith(_SUFFIXES):
-             candidates.append(Path(base))
-             candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
+             candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
+             candidates.append(Path(base))
          else:
-             candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
              candidates.extend(_BUILTIN_RECIPES_LIB.joinpath(base + suffix) for suffix in _SUFFIXES)
+             candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
      elif isinstance(config_file, Path):
          if config_file.suffix in _SUFFIXES:
-             candidates.append(config_file)
              if not config_file.is_absolute():
                  candidates.append(_BUILTIN_RECIPES_LIB.joinpath(str(config_file)))
+             candidates.append(config_file)
          else:
-             candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
              if not config_file.is_absolute():
                  candidates.extend(
                      _BUILTIN_RECIPES_LIB.joinpath(str(config_file) + suffix) for suffix in _SUFFIXES
                  )
+             candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/fastgen/loader.py` around lines 74 - 93, The candidate paths in the _candidate_paths function are being added in the wrong order. Currently, filesystem paths are appended before built-in recipe paths, but the documented contract states built-in recipes should be checked first. For both the string and Path branches of the conditional logic, reverse the order of the candidate appends so that all _BUILTIN_RECIPES_LIB.joinpath calls are executed before the corresponding Path or direct filesystem path calls. This applies to all four candidates.append/extend calls in the string branch and all four calls in the Path branch.

coderabbitai · 2026-06-15T18:26:55Z

+    from diffusers import AutoencoderKLWan, WanPipeline
+
+    import modelopt.torch.sparsity.attention_sparsity as mtsa
+    from modelopt.torch.export import export_hf_checkpoint
+    from modelopt.torch.sparsity.attention_sparsity.sparse_attention import SparseAttentionModule
+


⚠️ Potential issue | 🟡 Minor

Move imports to module scope.

At lines 165–170, imports are inside the test function without justification. Per guidelines, imports belong at file top so errors surface at collection time. Move them to module scope unless they require deferred loading (optional dependencies or circular imports) — in which case add a brief comment explaining why.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/examples/diffusers/sparsity/test_sparsity.py` around lines 165 - 170, Move the imports currently at lines 165–170 (from diffusers import AutoencoderKLWan and WanPipeline, the modelopt.torch.sparsity.attention_sparsity import, the modelopt.torch.export.export_hf_checkpoint import, and the SparseAttentionModule import) from inside the test function to the module scope at the top of the file with the other imports. If any of these imports require deferred loading due to optional dependencies or circular import concerns, keep them in the function and add a brief comment explaining why.
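If the deferred form turns out to be needed, one possible shape is sketched below. It assumes `diffusers` is an optional dependency for this test file, which the comment above leaves open; `HAS_DIFFUSERS` and `load_wan_classes` are hypothetical names used only for illustration.

```python
import importlib.util

# Cheap availability check that does not import the package itself.
HAS_DIFFUSERS = importlib.util.find_spec("diffusers") is not None


def load_wan_classes():
    # Deferred import: diffusers is treated here as an optional dependency,
    # so importing it at module scope would break collection when it is absent.
    from diffusers import AutoencoderKLWan, WanPipeline

    return AutoencoderKLWan, WanPipeline
```

If the dependency is always installed in the test environment, the simpler fix is the one the comment recommends: plain imports at the top of the file.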

coderabbitai · 2026-06-15T18:26:55Z

+def test_prepare_messages_for_chat_template():
+    messages = [
+        {
+            "role": "assistant",
+            "content": "answer",
+            "reasoning_content": "think",
+            "tool_calls": [
+                {"function": {"name": "search", "arguments": '{"q": "x"}'}},
+            ],
+        },
+    ]
+    prepared = prepare_messages_for_chat_template(
+        messages, reasoning_content="native", normalize_tool_calls=True
+    )
+    assert prepared[0]["reasoning_content"] == "think"
+    assert prepared[0]["tool_calls"][0]["function"]["arguments"] == {"q": "x"}
+    assert (
+        prepare_messages_for_chat_template(
+            messages, reasoning_content="native", normalize_tool_calls=False
+        )
+        is messages
+    )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use fresh input for the non-normalizing branch assertion.

The second assertion reuses messages after the normalizing call, so it can miss regressions in the normalize_tool_calls=False path’s content preservation.

Suggested test adjustment

 def test_prepare_messages_for_chat_template():
-    messages = [
+    messages = [
         {
             "role": "assistant",
             "content": "answer",
             "reasoning_content": "think",
             "tool_calls": [
                 {"function": {"name": "search", "arguments": '{"q": "x"}'}},
             ],
         },
     ]
     prepared = prepare_messages_for_chat_template(
         messages, reasoning_content="native", normalize_tool_calls=True
     )
     assert prepared[0]["reasoning_content"] == "think"
     assert prepared[0]["tool_calls"][0]["function"]["arguments"] == {"q": "x"}
+
+    raw_messages = [
+        {
+            "role": "assistant",
+            "content": "answer",
+            "reasoning_content": "think",
+            "tool_calls": [
+                {"function": {"name": "search", "arguments": '{"q": "x"}'}},
+            ],
+        },
+    ]
     assert (
         prepare_messages_for_chat_template(
-            messages, reasoning_content="native", normalize_tool_calls=False
+            raw_messages, reasoning_content="native", normalize_tool_calls=False
         )
-        is messages
+        is raw_messages
     )

As per coding guidelines, checked-in tests should protect expected behavior and regressions.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 38 - 59, The test reuses the same messages variable after it has been processed by the first prepare_messages_for_chat_template call, which may have modified the input. For the second assertion that tests normalize_tool_calls=False, create a fresh copy of the messages input with the identical structure instead of reusing the modified messages variable. This ensures the non-normalizing code path is tested with unmodified input and can properly detect regressions in how it preserves content when normalization is disabled.
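The concern can be illustrated with a small standalone sketch. Whether `prepare_messages_for_chat_template` actually mutates its input is not established here; `normalize_in_place` below is a hypothetical stand-in that does, used only to show what a reused fixture would look like afterwards and how a deep copy keeps the second input untouched.

```python
import copy
import json


def normalize_in_place(messages):
    """Hypothetical stand-in that parses tool-call arguments in place."""
    for message in messages:
        for call in message.get("tool_calls", []):
            args = call["function"]["arguments"]
            if isinstance(args, str):
                call["function"]["arguments"] = json.loads(args)
    return messages


original = [
    {
        "role": "assistant",
        "content": "answer",
        "tool_calls": [{"function": {"name": "search", "arguments": '{"q": "x"}'}}],
    },
]
# Take the fresh copy before any call that might modify the fixture.
fresh = copy.deepcopy(original)

normalize_in_place(original)

original_args = original[0]["tool_calls"][0]["function"]["arguments"]
fresh_args = fresh[0]["tool_calls"][0]["function"]["arguments"]
```

With a mutating implementation, `original_args` is already a parsed dict by the time a second assertion would run, while `fresh_args` is still the raw JSON string, so only the fresh copy exercises the non-normalizing path on unmodified input.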

Source: Coding guidelines

kevalmorabia97 and others added 24 commits June 15, 2026 11:03

kevalmorabia97 requested review from a team as code owners June 15, 2026 18:08

kevalmorabia97 requested review from Edwardf0t1, cjluo-nv and realAsma and removed request for a team June 15, 2026 18:08

kevalmorabia97 requested review from AAnoosheh and jenchen13 and removed request for a team, Edwardf0t1, cjluo-nv and realAsma June 15, 2026 18:10

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

kevalmorabia97 merged commit a6c7243 into release/0.45.0 Jun 15, 2026
85 of 87 checks passed

kevalmorabia97 deleted the cherry-picks/release-0.45.0 branch June 15, 2026 19:23

	ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
	ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)


13 participants
