
[Cherry-pick] PRs #1648 #1650 #1594 #1269 #1326 #1652 #1651 #1601 #1653 #1558 #1670 #1662 #1677 #1327 #1673 #1676 #1687 #1678 #1691 #1697 #1702 #1704 #1726 #1729 #1734

Merged
kevalmorabia97 merged 24 commits into release/0.45.0 from cherry-picks/release-0.45.0
Jun 15, 2026


@kevalmorabia97 kevalmorabia97 commented Jun 15, 2026

Collaborator

Cherry-picked PRs

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Alpamayo quantization example with FP8/NVFP4 export support.
    • Introduced FastGen DMD2 distillation library for Qwen-Image text-to-image optimization.
    • Added lossless MXFP4-to-NVFP4 weight casting for DeepSeek models.
    • Expanded PTQ recipes with new NVFP4 variants (MLP-only, experts-only, weight-only).
    • Enhanced sparse attention calibration and export capabilities.
  • Documentation

    • Added end-to-end Nemotron-3 optimization tutorial and comprehensive PTQ recipe guide.
    • Updated example READMEs and CHANGELOG with latest optimization capabilities.
  • Bug Fixes

    • Fixed sparse attention configuration export schema.
    • Improved KV cache reuse settings for context logits generation.

kevalmorabia97 and others added 24 commits June 15, 2026 11:03
Disable Codecov binary validation, which has been failing consistently:

```
gpg: Signature made Tue Apr 21 19:28:03 2026 UTC
gpg:                using RSA key 27034E7FDB850E0BBC2C62FF806BB28AED779869
gpg: Can't check signature: No public key
==> Could not verify signature. Please contact Codecov if problem continues
    Exiting...
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
  * Updated CI workflow notes and removed an outdated header comment.
* Added explanatory comments to the Linux job and adjusted the code
coverage upload step to use a relaxed validation mode (no other upload
settings changed).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

`tests/examples/llm_eval/test_llm_eval.py::test_qwen3_eval_fp8` was
silently passing while its evals crashed, then began failing as a
timeout. This repairs the whole pipeline:

- **lm_eval `IndexError` (root cause):** TRT-LLM KV-cache prefix reuse
returns truncated `context_logits` for shared-prefix requests (e.g.
hellaswag's one-context / many-endings), which breaks `parse_logprobs`.
Add an `enable_kv_cache_reuse` flag to `modelopt.deploy.llm.LLM`
(default `True`, unchanged) and disable it for the eval deployment so
full-length context logits are returned.
- **Silent CI green:** `python eval.py | tee result.txt` returns `tee`'s
exit code, so a crashing eval was masked. Add `set -o pipefail` to
`huggingface_example.sh` so failures fail the test.
- **Long-prompt overflows:** with the tiny test model's toy tokenizer,
gsm8k/MMLU prompts exceed `max_seq_len`. Bump test
`max_position_embeddings` to 8192, skip MMLU prompts that don't fit even
at zero-shot, and add an MMLU sample limit (`--mmlu_limit`).
- **human-eval build failures:** install with `--no-build-isolation`
(`pkg_resources` is absent in pip's isolated build env), patch its
malformed `console_scripts` entry point, and pin the clone.
- **Cleanups:** gate the post-quant `run_tensorrt_llm.py` smoke test
behind the `quant` task (eval tasks deploy on their own; ~45s saved for
eval-only runs); replace the SIGPIPE-prone serve-readiness `tail -f |
while` with a poll loop (required under `pipefail`).
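
The exit-code masking described under "Silent CI green" can be reproduced in isolation. The following is a minimal sketch, assuming `bash` is available on `PATH`; `false` stands in for a crashing `eval.py`, and the helper name `pipeline_exit_code` is illustrative only, not part of the repository.

```python
import subprocess


def pipeline_exit_code(use_pipefail: bool) -> int:
    """Run `false | tee /dev/null` in bash and return the pipeline's exit code."""
    prefix = "set -o pipefail; " if use_pipefail else ""
    # `false` stands in for a crashing eval; `tee` itself always succeeds.
    result = subprocess.run(
        ["bash", "-c", prefix + "false | tee /dev/null"],
        check=False,
    )
    return result.returncode


masked = pipeline_exit_code(use_pipefail=False)   # status of `tee` is reported
surfaced = pipeline_exit_code(use_pipefail=True)  # status of the failing stage is reported
print(masked, surfaced)
```

Without `pipefail` the shell reports the status of the last command in the pipeline, so the failure of the first stage is lost; with it, a non-zero status from any stage propagates.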

### Usage

N/A — example/test fix.

### Testing

All four eval tasks verified end-to-end in the CI container (TRT-LLM
1.3.0rc17, RTX 6000 Ada): lm_eval (hellaswag + gsm8k), MMLU, and
simple_eval (humaneval) all complete with exit 0 and no
`IndexError`/overflow. Cold full run ≈ 340s on this GPU.

CI test on 2-gpu:
https://github.com/NVIDIA/Model-Optimizer/actions/runs/27154417497/job/80153551154

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (new `enable_kv_cache_reuse`
defaults to current behavior; new script flags are optional)
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A (no new
dependencies)
- Did you write any new necessary tests?: N/A (fixes and strengthens an
existing test)
- Did you update Changelog?: N/A (bug fix to examples/tests)
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

The full test runs ~340s on an RTX 6000 Ada; CI runners are historically
slower, while `@pytest.mark.timeout` is set to 600 — worth watching the
first CI run and bumping if it's close.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added an option to limit MMLU evaluation length.

* **Bug Fixes**
* Disabled KV-cache prefix reuse for evaluations needing per-token
context logits to prevent truncated/incorrect logprobs.
* Skip examples whose prompts remain too long; warn and report accuracy
as NaN if all examples are skipped.

* **Chores / Scripts**
* Improved example scripts for reproducible installs, patched entry
point handling, pipeline failure detection, conditional test invocation,
polling-based log wait, and a new CLI flag for MMLU limits.

* **Tests**
* Increased timeout and prompt headroom; capped MMLU smoke tests for
speed.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: New example

<!-- Details about the change. -->

Adds an example for Alpamayo-1 quantization with ModelOpt (FP8, NVFP4,
AutoQuant).

### Usage

```
python quantize.py --ckpt nvidia/Alpamayo-R1-10B --output-dir ./alpamayo-r1-fp8 --quantize fp8
```

### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->
- Did you get Claude approval on this PR?: ✅ / ❌ / N/A <!--- Run
`/claude review`. NVIDIA org members can self-trigger for complex
changes; orthogonal to CodeRabbit. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added Alpamayo 1 vision-language-action model quantization example
supporting FP8, NVFP4, and mixed-precision optimization modes
* Introduced CLI quantization tool with calibration loop and checkpoint
export capabilities for both fake-quantized and real-quantized formats

* **Documentation**
* Added comprehensive guide documenting the Alpamayo quantization
example, model details, and usage instructions
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: New Feature <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

Adds HuggingFace `config.json` export of skip-softmax sparse-attention
calibration for diffusion pipelines (e.g. Wan 2.2), on top of the base
skip-softmax work.

- **`_export_diffusers_checkpoint`** walks every `nn.Module` component
of a diffusers pipeline, calls `export_sparse_attention_config`, and
writes the result into that component's `config.json` under the
`sparse_attention_config` key. The sparse config lives **only** in
`config.json` — there is no standalone `sparse.yaml`.
- **`export_sparse_attention_config`** emits a `config_groups` schema
where each algorithm's parameters are nested inside its own group; only
`config_groups` and `producer` are top-level:
- skip-softmax group → `algorithm: "skip_softmax"`, `targets`, `ignore`
(layers kept dense — e.g. cross-attention + first/last blocks),
`initial_disabled_steps` (opt-in, user-set; emitted only when `> 0`),
`threshold_scale_factor` (`a * exp(b * target_sparsity)`), and
`target_sparsity`.
- N:M group → `algorithm: "sparse_softmax"` with
`sparsity_n`/`sparsity_m`, `dense_sink_tokens`, `dense_recent_tokens`
flattened into the group.
- **Deploy reader**
(`modelopt/torch/sparsity/attention_sparsity/plugins/sparse_attn_config.py`)
reads these per-group params back, keeping the export↔load round-trip
consistent.
- **Example wiring**:
`examples/diffusers/sparsity/wan22_skip_softmax.py` gains
`--export-dir`, `--skip-softmax-threshold`, and
`--initial-disabled-steps`. `--export-dir` runs
`export_hf_checkpoint(pipe, export_dir=...)` after calibration.
- Updated `CHANGELOG.rst`.

### Usage

```bash
python examples/diffusers/sparsity/wan22_skip_softmax.py \
    --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --calibrate --target-sparsity 0.5 --calib-size 4 \
    --initial-disabled-steps 5 \
    --export-dir ./wan22_skip_softmax_ckpt
```

Resulting layout — a `config.json` per component, **no `sparse.yaml`**:

```
wan22_skip_softmax_ckpt/
├── transformer/config.json        # carries sparse_attention_config
├── transformer_2/config.json      # carries sparse_attention_config
├── vae/ …  text_encoder/ …  tokenizer/ …  scheduler/ …
└── model_index.json
```
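
One way to check an exported directory against this layout is sketched below. It assumes only what is stated above (a `config.json` per component, the `sparse_attention_config` key, no `sparse.yaml`); the helper names are hypothetical and are not ModelOpt APIs.

```python
import json
from pathlib import Path


def find_sparse_configs(export_dir: str) -> dict:
    """Map component name -> its `sparse_attention_config`, for components that carry one."""
    found = {}
    for config_path in sorted(Path(export_dir).glob("*/config.json")):
        config = json.loads(config_path.read_text())
        if "sparse_attention_config" in config:
            found[config_path.parent.name] = config["sparse_attention_config"]
    return found


def has_standalone_sparse_yaml(export_dir: str) -> bool:
    """True if any `sparse.yaml` exists anywhere under the export directory."""
    return any(Path(export_dir).rglob("sparse.yaml"))
```

For the layout above, `find_sparse_configs("./wan22_skip_softmax_ckpt")` would be expected to return entries for `transformer` and `transformer_2`, and `has_standalone_sparse_yaml` to return `False`.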

A representative `config.json` entry for a diffusion transformer:

```json
"sparse_attention_config": {
  "config_groups": {
    "group_0": {
      "algorithm": "skip_softmax",
      "targets": ["WanAttention"],
      "ignore": ["blocks.0.attn1", "blocks.0.attn2", "…"],
      "initial_disabled_steps": 5,
      "threshold_scale_factor": {
        "formula": "a * exp(b * target_sparsity)",
        "prefill": {"a": 1443.49, "b": 4.30}
      },
      "target_sparsity": {"prefill": 0.5}
    }
  },
  "producer": {"name": "modelopt", "version": "0.45.0..."}
}
```
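
The `formula` string in the entry above can be evaluated directly from the exported coefficients. This is a minimal sketch that only computes `a * exp(b * target_sparsity)` for the values shown; how the runtime turns the resulting factor into a per-tile skip threshold is not described here, and the function name is an assumption for illustration.

```python
import math


def threshold_scale_factor(a: float, b: float, target_sparsity: float) -> float:
    """Evaluate the exported formula a * exp(b * target_sparsity)."""
    return a * math.exp(b * target_sparsity)


# Values copied from the representative `group_0` entry above.
group = {
    "threshold_scale_factor": {
        "formula": "a * exp(b * target_sparsity)",
        "prefill": {"a": 1443.49, "b": 4.30},
    },
    "target_sparsity": {"prefill": 0.5},
}

phase = "prefill"
coeffs = group["threshold_scale_factor"][phase]
value = threshold_scale_factor(coeffs["a"], coeffs["b"], group["target_sparsity"][phase])
print(f"{phase} scale factor: {value:.2f}")
```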

The N:M variant adds a second group:

```json
"group_1": {
  "algorithm": "sparse_softmax",
  "targets": ["WanAttention"],
  "sparsity_n": 2, "sparsity_m": 4,
  "dense_sink_tokens": 0, "dense_recent_tokens": 64
}
```

### Testing

- `tests/examples/diffusers_sparsity/test_sparsity.py`: baseline /
triton-baseline / fixed-threshold runs of the Wan 2.2 example, plus a
Python-API calibrate → **export** test asserting the nested
`sparse_attention_config` (`threshold_scale_factor`, `target_sparsity`,
`ignore`, `initial_disabled_steps`) and the absence of any
`sparse.yaml`.
-
`tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py`
and `test_sparse_attn_config.py`: unit coverage of the per-group export
schema and the deploy-reader round-trip (writer nests → reader reads
from groups → internal mtsa config unchanged).
- Validated end-to-end on Wan 2.2 T2V-A14B: full 4-prompt / 40-step /
81-frame calibration; the exported checkpoint carries the nested schema
in both `transformer` and `transformer_2` `config.json`, and runtime
measurement shows ~47–49% tile sparsity at a 0.5 target.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ❌ The exported
`sparse_attention_config` schema was renamed and nested per-group during
0.45.x development, and the loader reads only the new layout —
checkpoints exported by earlier 0.45.x builds must be re-exported. No
released version is affected. <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!---
Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

---------

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

**Type of change:** New example + new `modelopt.torch.fastgen` library
module.

Adds **DMD2 (Distribution Matching Distillation) for Qwen-Image** —
distilling the base model into a few-step (1–4) generator. Includes the
framework-agnostic `modelopt.torch.fastgen` loss library (DMD pipeline,
EMA, optional GAN discriminator) and a NeMo AutoModel–based training
example with a mock-data smoke config, a real-data config, and inference
/ export scripts.

**Note:** the example script will be migrated to the AutoModel repo.

### Usage

```bash
# Mock-data wiring smoke — runs end-to-end with no dataset to prepare
torchrun --nproc-per-node=8 \
    examples/diffusers/fastgen/dmd2_finetune.py \
    --config examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
```

See `examples/diffusers/fastgen/README.md` for real-data training and
inference.

### Testing

Unit tests under `tests/unit/torch/fastgen/`; `pre-commit` /
code-quality clean.

### Before your PR is "*Ready for review*"

- Backward compatible?: ✅ (new, additive module)
- Followed `CONTRIBUTING.md` for any copied code / new deps: ✅
- New tests added?: ✅
- Updated Changelog?: N/A

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Adds a FastGen-based distillation framework (DMD2) with
student/fake-score training, EMA support, GAN discriminator branch,
inference pipeline, and export utilities.
* Qwen-Image integration with latent packing and feature-capture for
plugin-enabled pipelines.

* **Documentation**
* New README, example configs, and runnable example scripts for
Qwen-Image distillation and inference.

* **Tests**
* Comprehensive unit tests covering math parity, gradient routing,
plugins, hooks, EMA, and recipe setup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…only PTQ recipes (#1652)

### What does this PR do?

Type of change: new feature (recipes)

Several `general/ptq` recipe families shipped a data-driven FP8 KV-cache
(`-kv_fp8`) variant but lacked the constant-amax `kv_fp8_cast` companion
that `fp8_default` and `nvfp4_default` already have. This PR adds the
missing cast variants so every KV-quantizing (and the weight-only)
family offers the calibration-free FP8 KV-cache option:

- `general/ptq/nvfp4_experts_only-kv_fp8_cast`
- `general/ptq/nvfp4_mlp_only-kv_fp8_cast`
- `general/ptq/nvfp4_omlp_only-kv_fp8_cast`
- `general/ptq/nvfp4_weight_only-kv_fp8_cast`

Each new recipe composes the exact same model-quant config as its
existing sibling and swaps the `kv_fp8` unit for the shared
`kv_fp8_cast` unit (constant-amax FP8 KV cache; no KV calibration
forward pass). The docs guide table/tree and the changelog are updated
to match.
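
The difference between the data-driven and constant-amax KV-cache variants can be sketched as follows. This is an illustration under stated assumptions, not the recipe implementation: it assumes the common per-tensor convention `scale = amax / 448` for FP8 E4M3, and `ASSUMED_CONSTANT_AMAX` is a hypothetical placeholder, since the constant used by the `kv_fp8_cast` unit is not stated here.

```python
E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 (fn) format

# Hypothetical placeholder: the actual constant used by `kv_fp8_cast` is not given here.
ASSUMED_CONSTANT_AMAX = 448.0


def fp8_scale(amax: float) -> float:
    """Per-tensor scale that maps |x| <= amax onto the E4M3 range (assumed convention)."""
    return amax / E4M3_MAX


def calibrated_amax(batches) -> float:
    """Data-driven amax: requires a pass over calibration activations."""
    return max(abs(v) for batch in batches for v in batch)


# `kv_fp8`-style: amax comes from observed data.
data_driven_scale = fp8_scale(calibrated_amax([[0.5, -3.0], [2.0, 1.5]]))

# `kv_fp8_cast`-style: amax is fixed up front, so no KV calibration forward pass is needed.
constant_scale = fp8_scale(ASSUMED_CONSTANT_AMAX)
print(data_driven_scale, constant_scale)
```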

### Usage

```bash
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --recipe general/ptq/nvfp4_mlp_only-kv_fp8_cast
```

### Testing

Extended the built-in PTQ smoke test
`tests/unit/recipe/test_loader.py::test_load_recipe_all_builtins` with
the four new recipe paths; all four load into a valid
`ModelOptPTQRecipe` with a populated `quantize` section.

```
$ python -m pytest tests/unit/recipe/test_loader.py tests/unit/recipe/test_presets.py -q
180 passed
```

`pre-commit` (including the `validate modelopt recipes` hook) passes on
all changed files.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (additive — only new recipe
files)
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅ (extended the builtin recipe
smoke test)
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
- Did you get Claude approval on this PR?: ❌ (not yet)

### Additional Information

The two weight-only families were discussed for scope;
`nvfp4_weight_only` is included (it already names a KV mode, `kv_fp16`),
while `int4_blockwise_weight_only` is intentionally left untouched since
it carries no `-kv_` composition.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added four new NVFP4 PTQ (Post-Training Quantization) recipe variants:
experts-only, MLP-only, OMLP-only, and weight-only configurations.
* All new recipes include FP8 KV-cache cast mode support for improved
inference performance.

* **Documentation**
* Updated built-in recipes guide with new NVFP4 recipe options and
repository layout.

* **Tests**
  * Expanded recipe loader test coverage for new recipe configurations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
)

### What does this PR do?

Type of change: CI / infrastructure (build-time speedup)

ModelOpt's CUDA quantization extensions (`modelopt_cuda_ext`, `_fp8`,
`_mx`) JIT-compile via `torch.utils.cpp_extension.load()` on first use —
~110–140s **each** in a fresh container, which is the dominant cost of
the `gpu_trtllm` job and the TRT-LLM example jobs. This caches them
across runs.

The logic lives in a reusable composite action,
**`.github/actions/cache-extensions`**, used by both `gpu_tests.yml` and
`_example_tests_runner.yml`:

- Sets a **literal in-container `TORCH_EXTENSIONS_DIR`**
(`/root/.cache/torch_extensions`). `${{ github.workspace }}` can't be
used — for `container:` jobs it resolves to the *host* path, which is
mounted elsewhere (`/__w`) inside the container, so torch and the cache
step would disagree on the location.
- Caches that dir with `actions/cache`, keyed on a caller-supplied **env
discriminator** (`rtxpro6000` + container image) plus a `hashFiles` of
the kernel/loader sources — so the cache busts on any kernel change and
is scoped per arch+image.
- On an **exact hit**, **backdates the kernel sources** below the cached
objects so ninja reuses them. (Touching the *objects* instead desyncs
ninja's `.ninja_deps`, which records each output's build-time mtime →
`stored deps info out of date` → rebuild.)
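
The backdating step can be sketched in a few lines. This is an illustration only: the composite action may implement it as a shell step, and the function name and the 60-second margin are assumptions made for the example.

```python
import os
from pathlib import Path


def backdate_sources(source_files, cached_objects, margin_s: float = 60.0) -> float:
    """Set every source mtime below the oldest cached object's mtime.

    Only the sources are touched; the cached objects keep the mtimes
    recorded at build time. Returns the mtime applied to the sources.
    """
    oldest_object = min(Path(p).stat().st_mtime for p in cached_objects)
    new_mtime = oldest_object - margin_s
    for src in source_files:
        os.utime(src, (new_mtime, new_mtime))  # (atime, mtime)
    return new_mtime
```

Because the object files are left untouched, their mtimes still match what ninja recorded in `.ninja_deps`, and the sources now look older than every output.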

Also fixes the unused `runner` default in `_example_tests_runner.yml`
(`h100` → `rtxpro6000`) so it can't seed a wrong-arch cache.

### Usage

N/A — CI only. To reuse from another job:
```yaml
- uses: ./.github/actions/cache-extensions
  with:
    cache-key: rtxpro6000-${{ matrix.container_image }}  # GPU arch + image
```

### Testing

Validated on `gpu_trtllm`: cache hit → `ninja: no work to do` →
`test_cuda_ext*` dropped from **113s / 108s / 139s → 2.8s / 0.03s /
0.03s** (~360s saved per run). Jobs that build no extension (e.g.
`gpu_vllm`) simply skip the save.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (CI-only; key busts on
source/image change)
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update Changelog?: N/A (CI infrastructure)
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

- Single-arch assumption: callers pass `rtxpro6000` in `cache-key`; if
the runner fleet ever mixes GPU archs, update that prefix (the cache
path is not arch-specific).
- No explicit TTL: the key is content-addressed, and GitHub auto-evicts
caches unused for 7 days (+ 10 GB/repo LRU).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…r examples/megatron_bridge (#1601)

### What does this PR do?

Type of change: documentation (+ minor test fixes)

Migrates the Nemotron-3-Nano-30B-A3B-BF16 tutorial quantization step
from `examples/llm_ptq/hf_ptq.py` to the Megatron-Bridge quantize +
export, and relocates the tutorial next to the scripts it now uses. Now
that the whole tutorial is Megatron-Bridge based, it lives under
`examples/megatron_bridge/`.

- **Quantization migration:** replace the single `hf_ptq.py` call with
`examples/megatron_bridge/quantize.py` (calibrate + save a Megatron
checkpoint) → `examples/megatron_bridge/export.py` (deployable unified
HF checkpoint). The FP8 results table is refreshed with the
`quantize.py` numbers (same defaults, slightly better on average).
- **Relocation:** moved
`examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/` →
`examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/`.
A **redirect-stub `README.md`** remains at the old path (a directory
symlink isn't traversable in the GitHub web UI), and all in-repo
references (root README, CHANGELOG, pruning READMEs, megatron_bridge
README) plus the tutorial's own relative links are updated.
- **Evaluation:** per-format vLLM benchmark commands (BF16 / FP8), FP8
deployment notes documented in `nemo_evaluator.yaml`, reduced
LiveCodeBench/AIME `num_repeats` (were too slow), and bumped the
`nemo-evaluator-launcher` pin.
- **Misc:** drop the `examples/megatron_bridge/requirements.txt`
`transformers<5` pin in favor of an inline "downgrade `transformers<5`
to save pruned Nemotron checkpoints" note; guard the hybrid Mamba-MoE
sharded-state-dict test behind `HAS_MAMBA` (requires `mamba_ssm`);
shrink the tiny Gemma3 test fixture's attention heads.
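
The `HAS_MAMBA` guard follows the usual optional-dependency pattern. The sketch below uses the standard-library `unittest` for self-containment; the repository's tests may use a different skip mechanism, and the class and test names are hypothetical.

```python
import importlib.util
import unittest

# True only when the optional `mamba_ssm` package can be imported.
HAS_MAMBA = importlib.util.find_spec("mamba_ssm") is not None


class HybridMambaMoeShardedStateDictTest(unittest.TestCase):  # hypothetical name
    @unittest.skipUnless(HAS_MAMBA, "requires mamba_ssm")
    def test_sharded_state_dict(self):
        # Import inside the test so collection succeeds without the package.
        import mamba_ssm  # noqa: F401

        self.assertTrue(HAS_MAMBA)
```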

> **Note:** the **NVFP4 + QAD** experiments (formerly the focus of this
PR) are split out — their accuracy/throughput results are still in
progress — and will follow in a separate PR on top of this one.

### Testing

Docs-only + test-guard changes. Pre-commit hooks (markdownlint, RST
checks, ruff, mypy) pass. The tutorial's relative links and the old-path
redirect stub were verified to resolve to real files.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (old tutorial path still
resolves via a redirect-stub README; `quantize.py`/`export.py` already
exist in `examples/megatron_bridge`)
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (adjusts/guards existing
tests only)
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ (existing tutorial entry updated to the new path)
- Did you get Claude approval on this PR?: ✅

### Additional Information

Supersedes the previous "Part 3 of 4 (NVFP4 + QAD docs)" scope of this
PR; the NVFP4 + QAD tutorial additions will land in a follow-up.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Moved the Nemotron-3-Nano-30B-A3B tutorial into the Megatron-Bridge
tutorials and replaced the old file with a pointer to the new location.
* Updated vLLM throughput numbers to 2.6× and expanded
results/throughput tables.
* Reworked the FP8 quantization/export workflow and added a note to use
transformers<5 when saving pruned models.
* Added a tutorials index and adjusted evaluator launcher pin and repeat
counts.

* **Tests**
* Tests now detect optional Mamba support and skip related tests when
unavailable.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
#1653)

### What does this PR do?

Type of change: new feature

Brings the GPT-OSS lossless MXFP4 → NVFP4 cast (#1372) to DeepSeek V4's
routed-expert export by adding a `--cast_mxfp4_to_nvfp4` flag to
`examples/deepseek/deepseek_v4/quantize_to_nvfp4.py`.

To avoid duplicating the closed-form math, the shared numerics —
`mxfp4_to_nvfp4_global_amax`, `mxfp4_to_nvfp4_per_block_amax`, and the
E2M1/E4M3/E8M0 constants — are **hoisted out of the GPT-OSS example cast
into the library** at
`modelopt/torch/quantization/utils/numeric_utils.py`. Both the GPT-OSS
cast (`examples/llm_ptq/cast_mxfp4_to_nvfp4.py`) and the new DeepSeek
path now import them from there.

DeepSeek V4's routed experts ship as MXFP4 (E2M1 nibbles + a
power-of-two E8M0 scale per 32-element block). By default the export
dequantizes them to BF16 and re-quantizes to NVFP4 using the calibrated
per-tensor weight amax, which re-derives per-block scales from the data
and is therefore lossy. With the flag, the cast pins `scale_2 =
2^(k_max-8)` and each per-block E4M3 scale to `2^(k_j-m)` straight from
the source E8M0 scales, so `per_block_scale * scale_2 = 2^k_j` and the
NVFP4 nibbles equal the source MXFP4 nibbles bit-for-bit (for every
block whose `k_j` lands in E4M3's representable window; rare
out-of-range blocks clamp). The one V4-specific addition is that w1/w3
share a single `scale_2` for the fused GEMM1, so `k_max` is taken over
both projections. The flag only affects routed-expert **weights** —
activation `input_scale` still comes from `--amax_path` calibration.
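
The scale arithmetic above can be checked with a short sketch. It is not the library implementation: `m = k_max - 8` is inferred from the stated identity `per_block_scale * scale_2 = 2^k_j`, and treating E4M3 powers of two from `2^-9` (smallest subnormal) to `2^8` as the representable window is an assumption; the function name is illustrative.

```python
import math

E4M3_MIN_POW2_EXP = -9  # smallest E4M3 subnormal is 2**-9 (assumed lower bound of the window)
E4M3_MAX_POW2_EXP = 8   # largest power of two not exceeding 448 is 2**8


def cast_scales(block_exponents):
    """Given unbiased E8M0 exponents k_j, return (scale_2, per_block_scales, in_range)."""
    k_max = max(block_exponents)
    m = k_max - 8  # inferred from per_block_scale * scale_2 == 2**k_j
    scale_2 = math.ldexp(1.0, m)  # 2**(k_max - 8), exact
    per_block, in_range = [], []
    for k_j in block_exponents:
        e = k_j - m
        clamped = min(max(e, E4M3_MIN_POW2_EXP), E4M3_MAX_POW2_EXP)
        per_block.append(math.ldexp(1.0, clamped))  # 2**(k_j - m) when in range
        in_range.append(e == clamped)  # False marks a clamped (lossy) block
    return scale_2, per_block, in_range


scale_2, per_block, in_range = cast_scales([-3, 0, 2])
products = [p * scale_2 for p in per_block]
print(scale_2, per_block, products, in_range)
```

For the shared `scale_2` of the fused GEMM1, the same function would be called with the exponents of both w1 and w3 concatenated, so that `k_max` spans both projections.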

### Usage

```bash
python deepseek_v4/quantize_to_nvfp4.py \
    --amax_path ${AMAX} \
    --source_ckpt ${DS_V4} \
    --output_ckpt ${HF_NVFP4_PATH} \
    --cast_mxfp4_to_nvfp4
```

### Testing

- The hoisted numerics get unit tests in
`tests/unit/torch/quantization/test_numeric_utils.py` (10 cases:
per-tensor global_amax, per-block amax incl. out-of-range,
magnitude-table cache) — 10/10 pass. The example test
`tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py` keeps the
cast-specific cases (quantizer naming, `build_amax_map`,
`apply_to_model`).
- Validated on real DeepSeek-V4-Flash expert tensors (incl. the on-disk
`float8_e8m0fnu` scale dtype): 23.5M blocks, 100% lossless, 0 error.
- Generated a full NVFP4 checkpoint for DeepSeek-V4-Flash (43 layers,
256 routed experts) end-to-end: `[cast] lossless MXFP4->NVFP4 blocks:
8,657,043,456/8,657,043,456 (100.0000%)`. Output weights match an
independently-produced reference cast byte-for-byte (`weight_scale`,
`weight_scale_2`, packed nibbles modulo the harmless sign-of-zero).

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (new opt-in flag; default
export behavior unchanged; hoist re-exports through the existing example
module)
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ N/A (no new
deps; shared numerics moved into the library rather than duplicated)
- Did you write any new necessary tests?: ✅ (library numerics covered by
`tests/unit/torch/quantization/test_numeric_utils.py`; end-to-end
validated on a real DeepSeek-V4 checkpoint)
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
- Did you get Claude approval on this PR?: ❌ (will run `/claude review`)

### Additional Information

Mirrors and reuses #1372 (GPT-OSS MXFP4 → NVFP4 cast); the closed-form
numerics are now shared via
`modelopt.torch.quantization.utils.numeric_utils`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added `--cast_mxfp4_to_nvfp4` flag to perform a closed-form, mostly
lossless MXFP4→NVFP4 conversion for routed-expert weights with
aggregated lossless/block statistics.

* **Documentation**
* Updated DeepSeek V4 export instructions and README to document the new
flag and clarify calibration behavior for activation scales.

* **Chores**
* Exposed shared numeric quantization utilities for MXFP4→NVFP4 casting.

* **Tests**
* Added and updated tests to validate the new numeric helpers and
conversion behavior.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

INT8 entropy calibration of fp16 ONNX models (e.g. ConvNext /
EfficientViT / YOLOv8 backbones quantized via `python -m
modelopt.onnx.quantization --quantize_mode=int8`) used to fail during
histogram collection with:

```
ValueError: Too many bins for data range. Cannot create 128 finite-sized bins.
```

`_collect_value` in `modelopt/onnx/quantization/ort_patching.py` derives
`threshold = max(abs(min), abs(max))` from the activation tensor and
passes `range=(-threshold, threshold)` to `np.histogram(...)`. When the
model is fp16 and a calibrated activation has a small range (≲ 1e-5),
both endpoints inherit the fp16 dtype. Under numpy 2.0's NEP-50 strict
promotion, the resulting fp16 `linspace` collapses consecutive edges of
the 128-bin histogram to the same value, and numpy refuses to build the
histogram. numpy 1.x silently used a higher-precision intermediate
dtype, masking the issue.

The fix casts the range endpoints to Python `float` so numpy computes
bin edges in float64 regardless of input dtype. Applied at both call
sites: `_collect_value` and the single-node variant
`_collect_value_histogram_collector_single_node_calibration`.
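
The shape of the fix can be sketched outside the calibrator. This is a simplified stand-in, assuming NumPy is installed; `collect_histogram` is a hypothetical helper, not the patched `_collect_value`, and the exact promotion behavior depends on the NumPy version.

```python
import numpy as np


def collect_histogram(data_arr: np.ndarray, num_bins: int = 128):
    """Symmetric histogram whose range endpoints are passed as Python floats."""
    min_value = np.nanmin(data_arr)
    max_value = np.nanmax(data_arr)
    threshold = max(abs(min_value), abs(max_value))  # np.float16 for fp16 input
    t = float(threshold)  # the cast described above: endpoints no longer carry fp16
    return np.histogram(data_arr, bins=num_bins, range=(-t, t))


# fp16 activations with a narrow range: mostly zeros plus one ~1e-5 value.
data = np.zeros(1024, dtype=np.float16)
data[0] = 1e-5
hist, edges = collect_histogram(data)
print(hist.sum(), len(edges))
```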

### Usage

```bash
# The affected workflow — INT8 entropy calibration of any fp16 ONNX model:
python -m modelopt.onnx.quantization \
    --quantize_mode=int8 \
    --onnx_path=model.fp16.onnx \
    --calibration_data_path=calib.npy
```

No API change.

### Testing

- Added `test_collect_value_fp16_narrow_range` in
`tests/gpu/onnx/test_ort_patching.py` that calls `_collect_value` with a
fp16 tensor (mostly zeros + one ~1e-5 value) and asserts the histogram
is built without raising and all bin edges are distinct. Fails on the
buggy code, passes after the fix.
- Reproduced the original failure on numpy 2.2.6 before the fix.
- Full `tests/gpu/onnx/test_ort_patching.py` suite (31 tests) passes.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Fixed INT8 entropy calibration for fp16 ONNX models failing with NumPy
>= 2.0. Histogram range computation now correctly handles fp16
activations with small dynamic ranges.

* **Tests**
* Added test coverage for INT8 calibration with fp16 activations using
narrow value ranges.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

Fixes `apply_chat_template` failures when loading
`nemotron-sft-agentic-v2` with Nemotron3 Nano tokenizer.

HF agentic datasets store OpenAI-style `tool_calls` with
`function.arguments` as JSON **strings**, but Nemotron v3 chat templates
iterate `tool_call.arguments|items` in Jinja2, which requires a
**mapping**. That mismatch raised:

```
TypeError: Can only get item pairs from a mapping.
```

This PR:

- Adds shared `prepare_messages_for_chat_template()` in
`modelopt.torch.utils.dataset_utils` to normalize string tool-call
arguments to dicts (including both nested `function.arguments` and
top-level `arguments`).
- Routes `get_dataset_samples` / `get_dataset_dataloader` chat-template
paths through the helper with `reasoning_content="native"` and
`normalize_tool_calls=True`, preserving `reasoning_content` for
tokenizers that handle it natively while fixing tool calls.
- Refactors `megatron_preprocess_data._process_messages` to delegate to
the same helper (no behavior change: `strip`/`inline` still handle
reasoning; `native` still returns messages unchanged without tool-call
normalization).
- Consolidates tests: hermetic logic stays in unit tests; one live GPU
integration test covers the v3 calibration path.
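
A minimal sketch of the tool-call normalization step described above, using only the standard library. `normalize_tool_call_arguments` is a hypothetical name for illustration; it is not the signature of `prepare_messages_for_chat_template`, and the sample messages are made up.

```python
import copy
import json


def normalize_tool_call_arguments(messages):
    """Return a copy of `messages` with JSON-string tool-call arguments parsed into dicts."""
    normalized = copy.deepcopy(messages)
    for message in normalized:
        for tool_call in message.get("tool_calls") or []:
            # Cover both the nested `function.arguments` and top-level `arguments`.
            for holder in (tool_call.get("function"), tool_call):
                if not isinstance(holder, dict):
                    continue
                arguments = holder.get("arguments")
                if not isinstance(arguments, str):
                    continue
                try:
                    parsed = json.loads(arguments)
                except json.JSONDecodeError:
                    continue  # leave unparseable strings untouched
                if isinstance(parsed, dict):
                    holder["arguments"] = parsed
    return normalized


messages = [
    {"role": "user", "content": "Weather in Paris?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
            }
        ],
    },
]
fixed = normalize_tool_call_arguments(messages)
```

After normalization, a Jinja2 expression such as `tool_call.arguments|items` receives a mapping rather than a string.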

### Testing

- New e2e tests added to replace previous simpler tests
- Manual verification with the Nemotron 3 Nano tokenizer and
`nemotron-sft-agentic-v2`.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A <!---
Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->
- Did you get Claude approval on this PR?: Not yet <!--- Run `/claude
review`. NVIDIA org members can self-trigger for complex changes;
orthogonal to CodeRabbit. -->

### Additional Information

Root cause: Nemotron v3 Jinja chat templates use
`tool_call.arguments|items`; OpenAI-format dataset rows store arguments
as JSON strings.

Related prior art in-repo: `megatron_preprocess_data` already normalized
tool-call arguments inline; this PR deduplicates that logic into
`prepare_messages_for_chat_template`.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added public utilities for preparing OpenAI-style chat messages with
reasoning content support, including native reasoning mode handling.
* Implemented automatic tool call argument normalization for consistent
tokenizer operations.

* **Refactor**
* Consolidated chat template application across registered and
auto-detected chat datasets using unified preprocessing.

* **Tests**
* Added unit and integration tests validating reasoning content
preparation and chat template functionality.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: documentation

Adds two docs under `modelopt_recipes/` (no code or behavior changes):

- **`README.md`**: a catalog of the recipe library, covering its purpose (a
  recipe is the single, version-controlled source of truth for *how* a model is
  optimized), the directory layout (`general/`, `huggingface/`, `models/`,
  `configs/`), how to load/select recipes (`load_recipe`, `--recipe`), and a
  high-level map of the general PTQ combos, speculative-decoding, and
  distillation recipes.
- **`recipe.md`**: a focused guide to the PTQ schemes, covering the general
  `general/ptq/` body scopes (full-model FP8/NVFP4, scoped experts-only /
  mlp-only / omlp-only, weight-only), KV-cache modes (`kv_fp8_cast` /
  `kv_nvfp4_cast` / `kv_fp8`), calibration variants (max / mse / gptq /
  layerwise), low- vs high-concurrency deployment guidance, and the
  model-specific recipes under `huggingface/` and `models/`, each compared to
  its general baseline.

### Usage

```python
# Documentation only. The recipes themselves load as before, e.g.:
from modelopt.recipe import load_recipe
cfg = load_recipe("general/ptq/nvfp4_experts_only-kv_fp8_cast")
```

### Testing

`pre-commit run --files modelopt_recipes/README.md
modelopt_recipes/recipe.md`
passes (markdownlint, modelopt recipe validation, license/format hooks).

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: N/A <!-- docs only -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A <!-- docs only -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!-- docs only -->
- Did you get Claude approval on this PR?: ❌ <!-- not yet -->

### Additional Information

Documentation for the `modelopt_recipes/` library; content verified
against the
recipe YAMLs and the `modelopt.recipe` / config-loader source.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Added comprehensive ModelOpt recipes guide describing YAML-based,
composable optimization workflows, directory/lookup layout, reuse via
imports, and how to add or share recipes.
* Added PTQ quantization guide covering recipe naming/structure,
quantization scopes and KV-cache options, calibration variant guidance,
model-specific overrides, multimodal considerations, and a
checkpoint-mirroring example.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…a 4 MTP (#1677)

Type of change: Bug fix

Fixes the specdec_bench vLLM wrapper's MTP `speculative_config` emission
so Gemma 4 MTP no longer hits the wrong code path inside vLLM.

vLLM's `SpeculativeConfig.__post_init__`
(`vllm/config/speculative.py:529-602`) auto-detects `method` ONLY when
it's unset. When `model` is provided and `method` is `None`, the default
branch sets `method = "draft_model"` — the generic same-architecture
draft path, NOT MTP. That path enforces equal num_heads between target
and draft and raises:

```
AssertionError: All layers in one attention group must share num_heads; got {8, 4}
```

on heterogeneous-head models. Gemma 4 has 8 target heads and 4 draft
heads by design.

PR #1663 changed the MTP branch in the wrapper to emit `{model:
<assistant>, num_speculative_tokens: N}` WITHOUT `method` when
`draft_model_dir` was provided, based on a misread of vLLM PR #41745's
test plan that only showed the `{model, num_speculative_tokens}` shape.
That test plan was the direct `LLM(...)` constructor invocation; vLLM
had already defaulted method internally. Going through specdec_bench's
`AsyncEngineArgs(speculative_config=...)` path, the explicit `method`
key is required to avoid the auto-detect → draft_model fallback.

vLLM's own test at
[`tests/v1/e2e/spec_decode/test_spec_decode.py:818-823`](https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/spec_decode/test_spec_decode.py#L818)
does exactly this for the gemma4-e4b parametrization:

```python
speculative_config = {
    "method": method,                # "mtp"
    "num_speculative_tokens": ...,
}
if draft_model is not None:           # Gemma 4 case
    speculative_config["model"] = draft_model
```

Restore `method="mtp"` as the unconditional MTP path. ADD `model` only
when `draft_model_dir` is set. Backward-compatible for Qwen 3.5 MTP /
DeepSeek MTP / other inline-MTP families (they keep the bare `{method:
"mtp"}` config).

Field-tested via vLLM PR #41745's correctness test on `gemma-4-E4B-it` +
`gemma-4-E4B-it-assistant`, which uses the same `speculative_config` shape
this fix emits: 304.7 output TPS at γ=4 vs 171.0 baseline (a 1.78× speedup)
on H100.
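
The restored behavior can be sketched as a small function. `build_mtp_speculative_config` is a hypothetical helper name, not the wrapper's actual code; only the dictionary keys (`method`, `num_speculative_tokens`, `model`) come from the description above.

```python
def build_mtp_speculative_config(num_speculative_tokens, draft_model_dir=None):
    """Build an MTP speculative_config dict (hypothetical helper for illustration)."""
    # `method` is always set explicitly so vLLM does not auto-detect
    # the generic "draft_model" path.
    config = {
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    # Only models with a separate assistant checkpoint add `model`.
    if draft_model_dir is not None:
        config["model"] = draft_model_dir
    return config


inline_mtp = build_mtp_speculative_config(3)
separate_draft = build_mtp_speculative_config(4, draft_model_dir="gemma-4-E4B-it-assistant")
```

Inline-MTP families keep the bare `method` config, while the `draft_model_dir` case gains a `model` key on top of it.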

[OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024) pipeline failure before this fix:
- Wrapper emitted `{model: assistant, num_speculative_tokens: 3}`
- vLLM auto-detected `method = "draft_model"`
- Loaded gemma-4-E4B-it-assistant (4 heads) as a generic draft for
gemma-4-E4B-it (8 heads)
- Attention-group num_heads check tripped → AssertionError, task_0
FAILED, task_1 CANCELLED

- Backward compatible: ✅ (Qwen 3.5 / DeepSeek MTP unchanged; only the
MTP+`draft_model_dir` case changes).
- New tests: ❌ — the test exercising this codepath would need a GPU +
gemma-4 model checkout, which is cluster work, not unit-test scope.
JIRA-tracked validation via OMNIML-5024 dispatch after this lands.
- Changelog: ❌

- vLLM PR #41745 (Gemma4 MTP support)
- Companion: NVIDIA/Model-Optimizer PR #1675 (launcher
`GlobalVariables.draft_model` schema fix)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Bug Fixes**
* Fixed speculative decoding configuration handling in the benchmark
example to ensure consistent method assignment and proper draft model
configuration.

* **Documentation**
* Updated configuration comments to reflect corrected behavior and
improved clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: New feature (PTQ recipe)

Adds a declarative YAML recipe for post-training quantization of
**Nemotron-H** hybrid models (Mamba-2 + MLP + Attention) under the
`modelopt_recipes` framework. The recipe is mixed-precision and composed
**entirely from existing recipe units** — no core-library or `hf_ptq.py`
changes are required. (Quantized `nn.Embedding` support, which the
embedding line relies on, already landed in #1495.)

Precision mirrors the **GGUF Q4_K_M** bit allocation of the same model,
mapped onto NVFP4/FP8.

`modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`:

| Precision | GGUF source | Modules |
| --- | --- | --- |
| NVFP4 W4A4 | Q4_K / Q5_0 | in_proj, out_proj, up_proj, attn q/k/v/o_proj, down_proj (Q4_K layers 13,15,20,22,27,29,37,39) |
| FP8 W8A8 | Q6_K | MLP down_proj (layers 1,3,5,8,10,18,25,33,41) |
| NVFP4 W4A16 (weight-only) | — | input embedding |
| FP8 W8A16 (weight-only) | — | lm_head |
| bf16 | F32 | Mamba conv1d, all norms, A_log / D / dt_bias |

The Q8_0 attn `v_proj` layers (24, 32) are kept **NVFP4 W4A4** rather
than FP8: ModelOpt's export fuses q/k/v (they share the attention input)
and requires one format across the group, so `v` can't diverge from
`q`/`k`.

Built from the units `base_disable_all`, `w4a4_nvfp4_nvfp4`,
`default_disabled_quantizers`, `configs/numerics/fp8`, and
`configs/numerics/nvfp4`.

### Usage

```bash
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
    --recipe models/Nemotron-H/Nemotron-3-Nano-4B/nvfp4_w4a16 \
    --trust_remote_code \
    --export_path nemotron-3-nano-4b-nvfp4
```

### Testing

- `pre-commit run --files
modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml` passes, including
the `validate modelopt recipes` schema hook.
- End-to-end PTQ + unified HF export on
`nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (calibration: `cnn_dailymail`,
512 samples, seq 512). Produced a 2.57 GB unified HF checkpoint;
`hf_quant_config.json` was verified per-layer against the table above —
9 FP8 W8A8 `down_proj` + FP8 weight-only `lm_head`, NVFP4 W4A16
embedding, NVFP4 W4A4 everywhere else, with q/k/v/o uniform within each
attention layer (required for export fusion).

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors).

- Is this change backward compatible?: ✅ — purely additive; a new opt-in
recipe file.
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A — declarative config;
covered by the `validate modelopt recipes` pre-commit hook.
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A — new recipe config, not a library API change.
- Did you get Claude approval on this PR?: ❌ — pending `/claude review`.

### Additional Information

Depends on #1495 (quantized `nn.Embedding` support) for the embedding
line to pack on export.

Possible follow-ups (out of scope here):
- A compressed-tensors conversion pass so the checkpoint is consumable
by vLLM (`*.weight → *.weight_packed`, `*.weight_scale_2 →
*.weight_global_scale`, and a `format: nvfp4-pack-quantized` /
`quant_method: compressed-tensors` quantization config).
- A `--vllm-compat`-style variant that additionally excludes Mamba
`in_proj` (output dim `17504 = intermediate + conv_dim + num_heads` is
not divisible by 64, violating Marlin repack alignment) and leaves
`lm_head` / embedding in bf16, for out-of-the-box vLLM consumption.

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…en (#1673)

### What does this PR do?

Type of change: Bug fix <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

Fixes the generation **preview** in `examples/llm_ptq/hf_ptq.py`
producing garbage
output (e.g. repeated `\u200b` zero-width-space tokens) for models whose
tokenizer
has `pad_token == eos_token` — most visibly GLM-5.1. The garbage
appeared *before*
quantization, so it was not a quantization issue.

**Root cause:** `pre_quantize` / `post_quantize` take the first
(left-padded)
calibration sample and call `full_model.generate(preview_input_ids,
...)` **without
an `attention_mask`**. HuggingFace only auto-infers the mask when
`pad_token_id != eos_token_id`
(`generation/utils.py:_prepare_attention_mask_for_generation`);
when they are equal it falls back to an all-ones mask, so the model
attends to the
leading pad/eos tokens, ignores the real prompt, and (for GLM's
MoE/DSA/MTP path)
collapses to a single repeated token. Calibration itself was always
correct — it
already passes the mask; only the preview generation was missing it.

**Fix:** thread the calibration batch's `attention_mask` through to both
preview
`generate()` calls. One file changed (`examples/llm_ptq/hf_ptq.py`,
+20/-8).
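
The root cause can be illustrated without loading a model. The function below is a simplified model of the mask inference as described above, not Hugging Face's actual implementation, and the token IDs are made up.

```python
def inferred_attention_mask(input_ids, pad_token_id, eos_token_id):
    """Simplified model of mask inference when no attention_mask is passed."""
    if pad_token_id is not None and pad_token_id != eos_token_id:
        # Padding is distinguishable, so mask out pad positions.
        return [[int(tok != pad_token_id) for tok in row] for row in input_ids]
    # pad == eos: padding is ambiguous, so fall back to attending everywhere.
    return [[1] * len(row) for row in input_ids]


# Left-padded calibration sample; the mask the calibration batch already carries.
explicit_mask = [[0, 0, 1, 1, 1]]

# Case 1: pad == eos (the GLM-5.1 situation). Inference cannot hide the padding.
mask_pad_eq_eos = inferred_attention_mask([[0, 0, 11, 12, 13]], pad_token_id=0, eos_token_id=0)

# Case 2: pad != eos. Inference reproduces the explicit mask.
mask_pad_ne_eos = inferred_attention_mask([[99, 99, 11, 12, 13]], pad_token_id=99, eos_token_id=0)
```

Passing the explicit mask to `generate()` removes the dependence on this inference in both cases.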

### Usage

No usage change: the same command now produces a coherent preview
instead of `\u200b` repetition.

### Testing

Reproduced the exact mechanism (left padding + pad_token == eos_token +
missing
attention_mask) on a small model (GPT-2): without the mask the model emits
the same
HF warning as the bug report and ignores the prompt; with the mask the
output is
byte-identical to the unpadded baseline. Verified no behavioral change
for models
where pad != eos (the explicit mask equals HF's inferred
input_ids.ne(pad_id))
and for Whisper (its batch carries no attention_mask, so the path is
unchanged).
Pre-commit: ruff-check, ruff-format, and mypy (no new errors vs. main)
all pass.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits
are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding
hardcoded trust_remote_code=True, torch.load(..., weights_only=False),
pickle, etc.).

- Is this change backward compatible?: ✅ <!-- Only changes internal
helper signatures within the example script; no public API affected. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in CONTRIBUTING.md: N/A <!-- No
copied code, no new dependency. -->
- Did you write any new necessary tests?: N/A <!-- Preview path requires
model loading; no existing unit-test harness covers it. Verified via a
standalone repro of the root-cause mechanism. -->
- Did you update Changelog?: N/A <!-- Bug fix confined to an
example-script preview; not a library/API change. Happy to add a 0.46
bug-fix entry if preferred. -->
- Did you get Claude approval on this PR?: ✅ <!-- Will run `/claude
review` before requesting review. -->

### Additional Information
Backward compatible across model families:

| Model class | Before (no mask passed) | After (mask passed) | Result |
|---|---|---|---|
| `pad != eos` (most: T5, BART, many LLMs) | HF infers mask = `input_ids.ne(pad_id)` | explicit calib mask = same tensor | **Identical output** (no change) |
| `pad == eos` (GLM-5.1, GPT-2-style) | all-ones fallback → attends to pad → garbage | correct mask | **Fixed** |
| Whisper | no mask | batch has no `attention_mask` key → `None` → no mask | **Identical** (no change) |
| Nemotron-VL / DeepSeek / NemotronH / `--skip_generate` | `generate()` not called on this path | unchanged | No change |

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Enhanced LLM post-quantization example to properly handle attention
masks during preview generation. The quantization preview now correctly
threads attention masks through generate() calls, ensuring accurate
generation outputs are captured both before and after quantization
steps.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
)

### What does this PR do?

Type of change:  Bug fix

Fix `--quant_cfg` CLI parsing by typing `quant_cfg` as `str | None`
instead of `str | QuantizeConfig | None`.

### Testing
```
accelerate launch   --config_file examples/gpt-oss/configs/zero3.yaml   examples/gpt-oss/sft.py   --config examples/gpt-oss/configs/sft_full.yaml   --model_name_or_path openai/gpt-oss-20b   --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG   --output_dir gpt-oss-20b-qa
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: ✅ / ❌ / N/A <!--- Run
`/claude review`. NVIDIA org members can self-trigger for complex
changes; orthogonal to CodeRabbit. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Quantization config parameter now accepts string identifiers or none;
resolution behavior for named presets remains unchanged.
* **Documentation**
* Updated argument reference to reflect the parameter type change while
preserving the deprecation note and usage guidance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1687)

### What does this PR do?

Type of change: Bug fix

Exclude Qwen visual and vision_tower modules from NVFP4 quantization and
keep the Qwen linear attention projection exclusions. These modules can
produce matrix dimensions that are incompatible with vLLM 0.22.1's
ModelOpt FP4 Marlin fallback path, causing checkpoint load or profiling
failures such as `size_n = 4304 is not divisible by tile_n_size = 64`.
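
A rough pre-flight check for this class of failure could look like the sketch below. It tests only the divisibility condition quoted in the error message; the real Marlin path may impose further constraints. The helper and the layer names are hypothetical.

```python
def misaligned_layers(output_dims, tile_n_size=64):
    """Return the layers whose output dimension is not a multiple of tile_n_size."""
    return {name: n for name, n in output_dims.items() if n % tile_n_size != 0}


# Hypothetical layer names; 4304 is the size_n from the error message above.
output_dims = {
    "visual.blocks.0.mlp.fc1": 4304,
    "language_model.layers.0.mlp.up_proj": 4096,
}
bad = misaligned_layers(output_dims)
```

Here 4304 leaves a remainder of 16 when divided by 64, so only the visual layer is flagged.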

### Usage

N/A. This is a recipe configuration fix.

### Testing

- `python -m pytest tests/unit/recipe/test_presets.py
tests/unit/recipe/test_loader.py -q`
- `python -m pre_commit run --files
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
tests/unit/recipe/test_loader.py tests/unit/recipe/test_presets.py`
- E2E validation with `vllm/vllm-openai:v0.22.1`: PTQ export validation
passed with zero Marlin-incompatible quantized layers, and vLLM
`/health`, `/v1/models`, and `/v1/completions` passed. The final PR
broadens the validated visual MLP exclusions to the full `*visual*`
subtree and adds the common `*vision_tower*` naming pattern.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: Yes
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: Yes
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

N/A

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Added unit tests that verify the built-in PTQ recipe and preset
correctly disable incompatible projection and visual components for
certain quantization modes.
* Ensures quantization settings are validated across recipes and
presets.

* **Chores**
* Updated quantization configuration to disable quantizers for select
projection and vision-related model layers.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…295242) (#1678)

Type of change: Bug fix

Fixes the GPT-OSS MXFP4 → NVFP4 PTQ path (`examples/llm_ptq/hf_ptq.py`
with `--cast_mxfp4_to_nvfp4`), which failed in three independent ways.
The documented command now runs end-to-end and produces a bit-exact
(100% lossless) NVFP4 checkpoint. Addresses **nvbug 6295279**
(OMNIML-5046) and **nvbug 6295242** (OMNIML-5045).

1. **nvbug 6295242 — CUDA illegal memory access on load.** GPT-OSS ships
native MXFP4 weights that Transformers dequantizes to BF16; the threaded
weight loader trips an illegal-memory access when `device_map="auto"`
shards the dequant across **multiple GPUs**. The missing optional
`kernels` package only *forces* the dequant path — it is not the root
cause. `get_model` now detects MXFP4 checkpoints and loads them with
`Mxfp4Config(dequantize=True)` on a **sequential** device map so the
dequant stays on a single device. `kernels` is no longer required.
2. **nvbug 6295279 #1 — `NotImplementedError: Mxfp4GptOssExperts` during
unified HF export.** Forcing `dequantize=True` yields plain
`GptOssExperts` (even when `kernels` is installed), which ModelOpt wraps
and exports normally.
3. **nvbug 6295279 #2 — `FileNotFoundError` in the cast step.**
`--cast_mxfp4_to_nvfp4` treated `--pyt_ckpt_path` as a local dir; a HF
Hub ID now resolves to its cached snapshot dir via
`_resolve_model_path`.

Also fixes a **static-block NVFP4 regression** (surfaced by the cast's
`force_weight_quantizers_static`, introduced by #1560's
now-unconditional `weight_only_quantize`): `_QuantGptOssExperts` /
`_QuantLlama4TextExperts` quantize their expert weights transposed in
the forward (`_transposed_quantize`), but the inherited
`iter_weights_for_calibration` fed the non-transposed weight, locking a
mismatched block-quant `_original_shape` and raising `ValueError: Input
shape has changed`. The override now calibrates on the transposed view,
matching both the forward and the export's `_amax` orientation.
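
Why the orientation matters for static block quantization can be seen with a toy example. The sketch assumes blocks of 16 along the last dimension and uses NumPy in place of the real quantizer; `block_amax` is a hypothetical helper, not ModelOpt code.

```python
import numpy as np


def block_amax(weight: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Per-block absolute maximum along the last dimension (illustrative only)."""
    rows, cols = weight.shape
    if cols % block_size != 0:
        raise ValueError("last dimension must be a multiple of block_size")
    blocks = np.abs(weight).reshape(rows, cols // block_size, block_size)
    return blocks.max(axis=-1)


weight = np.arange(32 * 48, dtype=np.float32).reshape(32, 48)

amax_plain = block_amax(weight)         # blocks run along the 48-wide axis
amax_transposed = block_amax(weight.T)  # blocks run along the 32-wide axis
```

The two calls produce block statistics of different shapes, so a shape locked during calibration on one orientation does not fit a forward pass that quantizes the other.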

`get_model` never had explicit handling for a *natively pre-quantized
MXFP4* checkpoint — GPT-OSS fell through the generic
*unquantized-checkpoint* branch and relied on Transformers' **implicit**
MXFP4 behavior, which is fragile across three axes. The cast was
originally validated (#1372, 2026-05-01) in the "lucky" quadrant of
each:

- **GPU count:** `device_map="auto"` on a single GPU never shards, so
the dequant stays on one device. On multiple GPUs `auto` balances the
model and shards the MXFP4→BF16 dequant across devices → CUDA
illegal-memory crash (6295242).
- **`kernels` presence:** without `kernels`, Transformers
auto-dequantizes to BF16 `GptOssExperts` (exportable). With `kernels`
installed it keeps the packed `Mxfp4GptOssExperts` kernel path → export
`NotImplementedError` (6295279 #1).
- **Transformers version:** the kernel-backed experts wrapper and the
threaded multi-GPU weight loader are newer-Transformers behavior (env
here is 5.5.4). Earlier versions simply dequantized MXFP4 → BF16, which
is what the old generic path happened to need.

The QA env sat in the *breaking* quadrant (multi-GPU and/or `kernels`
present, newer Transformers), so the implicit path failed. The new
branch makes both decisions explicit and deterministic
(`dequantize=True` + single-device load), regardless of environment —
mirroring the existing `has_pack_quantized_config` branch for
compressed-tensors checkpoints.

The fourth issue (static-block `Input shape has changed`) is a separate
regression: it was introduced by **#1560 (2026-06-02, "Make sure all
weight quantizers have `_amax`")**, a month *after* the cast landed.
Previously `weight_only_quantize` ran only when no calibration `forward_loop` was supplied,
and the cast always supplies one — so the non-transposed
weight-quantizer call simply never happened before. The conflict only
appears at the intersection of (a) transposed-quantize experts
(GPT-OSS/Llama4), (b) static-block NVFP4 — which `--cast_mxfp4_to_nvfp4`
forces via `force_weight_quantizers_static` — and (c) #1560. CI's
GPT-OSS NVFP4 coverage uses the *dynamic*-block path, which never locks
the block shape, so #1560 looked safe.

```bash
python hf_ptq.py \
  --pyt_ckpt_path openai/gpt-oss-20b \
  --qformat nvfp4_mlp_only \
  --cast_mxfp4_to_nvfp4 \
  --export_path ./gpt-oss-20b-nvfp4
```

- Ran the documented command end-to-end on 2xB200
(`openai/gpt-oss-20b`): cast overrode **48/48** expert weight
quantizers, **100% lossless** layers/blocks, exported a valid
packed-NVFP4 HF checkpoint (uint8 weights + FP8 per-block `weight_scale`
+ per-tensor `weight_scale_2` + `hf_quant_config.json`).
- Verified plain `--qformat nvfp4_mlp_only` (no cast) still works
end-to-end.
- **Independently verified the export is bit-exact:** dequantized the
exported NVFP4 weights (ModelOpt's E2M1 LUT + pack layout) and compared
against Transformers' canonical MXFP4→BF16 dequant
(`Mxfp4Config(dequantize=True)`) over all 24 layers × both expert
weights — `max_abs_err = 0`, 100% bitwise-equal in bf16. So
`dequant(exported NVFP4) == dequant(original MXFP4)` exactly.
- New unit tests: `test_get_original_hf_quant_method_*` (load detection)
and `test_gpt_oss_experts_iter_weights_for_calibration_transposed` (the
transpose regression). Existing `test_cast_mxfp4_to_nvfp4.py` (8 tests)
still pass. `pre-commit` clean.

**Known limitation:** verified for gpt-oss-20b (fits one GPU).
gpt-oss-120b dequantized does not fit a single GPU, so `sequential`
would still span GPUs — that case would need a CPU-dequant-then-dispatch
path and is left as a follow-up.

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ (0.45 Bug Fixes)
- Did you get Claude approval on this PR?: ❌ (not yet run)

nvbug 6295279, nvbug 6295242 / OMNIML-5046, OMNIML-5045.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Bug Fixes**
  * Prevented CUDA illegal-memory access during MXFP4→NVFP4 casting.
  * Fixed expert-weight calibration orientation to avoid shape mismatches.

* **New Features**
  * Support loading native MXFP4 checkpoints with automatic
    dequantization.
  * Resolve remote model identifiers to local checkpoints when casting
    MXFP4→NVFP4, improving reliability.

* **Tests**
  * Added unit and GPU regression tests covering quant-method detection,
    casting, and expert-weight calibration.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…6293731, 6293762) (#1691)

### What does this PR do?

Type of change: Bug fix

Fixes two sglang deployment failures on multimodal Gemma
(`gemma-4-31B-it`) caused by general PTQ presets leaking quantization
into the SigLIP vision branch via broad wildcards:

- **NVBug 6293731** — `general/ptq/fp8_default-kv_fp8`: the
`w8a8_fp8_fp8` unit enables bare `*weight_quantizer` /
`*input_quantizer`, which also match the vision tower
(`model.vision_tower.*`, `model.visual.*`) and the vision embedding
projection (`model.embed_vision.*`). The exported checkpoint deploys but
emits **garbled text** in sglang.
- **NVBug 6293762** — `general/ptq/nvfp4_mlp_only-kv_fp8`: the `*mlp*`
enables also match the vision tower's block MLPs
(`model.vision_tower.encoder.layers.*.mlp`), and an image request
**crashes** the FP4 kernel at decode: `ValueError: too many values to
unpack (expected 2)` in sglang's `modelopt_quant.py` `apply`.

### Fix

Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to
the shared `configs/ptq/units/default_disabled_quantizers` unit,
alongside the existing `*router*` / `*lm_head*` entries.

Both the composed `general/ptq/*` recipes **and** the
`configs/ptq/presets/model/*` presets import this unit, so:
- every general recipe (`fp8_default`, `nvfp4_default`,
`nvfp4_mlp_only`, `nvfp4_omlp_only`, …) keeps the vision branch in BF16
by default — fixing the whole vision-overreach class, not just the two
reported recipes;
- the `test_general_ptq_yaml_matches_config_dicts` YAML↔preset parity
test stays satisfied (both sides pick up the new entries from the one
shared unit).

The rules are **no-ops on text-only models** (nothing matches). A recipe
that intentionally wants to quantize the vision branch can re-enable
these after importing the unit.
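
The overreach and the fix can be sketched with plain glob matching. This is a simplified model, not the recipe loader: the quantizer names below are hypothetical, and the sketch assumes rules are applied in order with the last matching pattern winning.

```python
from fnmatch import fnmatch

# Hypothetical quantizer names; real module paths depend on the checkpoint.
QUANTIZERS = [
    "model.language_model.layers.0.mlp.down_proj.weight_quantizer",
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer",
    "model.embed_vision.embedding_projection.input_quantizer",
    "lm_head.weight_quantizer",
]

# (pattern, enabled) pairs. Assumption: later rules override earlier ones.
RULES = [
    ("*mlp*", True),            # broad enable from an MLP-only recipe
    ("*lm_head*", False),       # existing default disable
    ("*embed_vision*", False),  # new vision disables from this fix
    ("*vision_tower*", False),
    ("*visual*", False),
]


def resolve(name: str, rules: list[tuple[str, bool]], default: bool = False) -> bool:
    """Return whether a quantizer ends up enabled after applying every rule in order."""
    enabled = default
    for pattern, flag in rules:
        if fnmatch(name, pattern):
            enabled = flag
    return enabled


# Before the fix only the first two rules exist, so the vision MLP is quantized.
enabled_before_fix = {name: resolve(name, RULES[:2]) for name in QUANTIZERS}
enabled_after_fix = {name: resolve(name, RULES) for name in QUANTIZERS}
```

On a text-only model none of the three vision patterns match any name, so the two dictionaries would be identical, which is the no-op behavior described above.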

Files changed:
- `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`
(+14)

### Testing

Re-export of `gemma-4-31B-it` with the affected recipes and re-deploy in
sglang (the env from the bug reports: `lmsysorg/sglang:v0.5.12.post1`,
GB200) to confirm fp8_default no longer garbles text and nvfp4_mlp_only
no longer crashes on image requests. _(Results to be appended.)_
Unit-level:
`tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts`
(parity) passes for all four general presets.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (text-only checkpoints
unaffected; new rules only match vision modules that should never have
been quantized by a general recipe)
- If you copied code from any other sources or added a new PIP
dependency: N/A
- Did you write any new necessary tests?: N/A (recipe data fix; covered
by the existing parity test + verified by real PTQ export + sglang
deploy)
- Did you update Changelog?: N/A
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200,
gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034.
Companion to PR #1690 (same vision-overreach class on the gemma-specific
`w4a8_awq` recipe, NVBug 6294017).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Updated quantization configuration to preserve BF16 precision for
    vision encoder components in multimodal models.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…fo in clear_stale_value_info (#1697)

### What does this PR do?

Type of change: Bug fix

INT4 quantization upgrades the model to opset >= 21, at which point ONNX
Runtime runs type inference while building the AWQ calibration
`InferenceSession`. Custom ops backed by TensorRT plugins (domain
`trt.plugins`) have no ORT type-inference function, so their output
types are only known from the `value_info` that TensorRT type/shape
inference populated earlier in preprocessing. `clear_stale_value_info`
cleared `value_info` wholesale, dropping those types, so ORT failed
output type inference for the custom op at model load, e.g.:

```
Node (Conv-2) Op (IdentityConv) output arg (X2) type inference failed
```

- `modelopt/onnx/utils.py`: in `clear_stale_value_info`, preserve
  `value_info` entries for outputs of `trt.plugins`-domain nodes (which
  ORT cannot re-derive); clear the rest as before.
- `tests/gpu/onnx/quantization/test_plugin.py`: add a regression test
  quantizing a model with the built-in
  `CustomSkipLayerNormPluginDynamic` plugin at INT4 + awq_clip (the
  opset >= 21 path), asserting the quantized model is produced and the
  custom op survives.
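
The selection rule can be sketched as below. This is not the code in `modelopt/onnx/utils.py`: the function name is hypothetical, and `SimpleNamespace` objects stand in for ONNX `NodeProto` / `ValueInfoProto` messages so the sketch runs without `onnx` installed.

```python
from types import SimpleNamespace

PLUGIN_DOMAIN = "trt.plugins"


def keep_plugin_value_info(nodes, value_info):
    """Return only the value_info entries produced by TensorRT plugin nodes.

    Everything else is treated as stale and dropped, because the runtime can
    re-infer those types on its own.
    """
    plugin_outputs = {
        name for node in nodes if node.domain == PLUGIN_DOMAIN for name in node.output
    }
    return [info for info in value_info if info.name in plugin_outputs]


# Stand-ins that expose the same attribute names as the ONNX protobuf messages.
nodes = [
    SimpleNamespace(op_type="Conv", domain="", output=["X1"]),
    SimpleNamespace(op_type="IdentityConv", domain=PLUGIN_DOMAIN, output=["X2"]),
]
value_info = [SimpleNamespace(name="X1"), SimpleNamespace(name="X2")]

kept = keep_plugin_value_info(nodes, value_info)
```

With a loaded ONNX model, `model.graph.node` and `model.graph.value_info` expose the same `domain`, `output`, and `name` attributes used here.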

### Usage

```bash
python -m modelopt.onnx.quantization \
    --onnx_path=model.onnx \
    --quantize_mode=int4 \
    --calibration_method=awq_clip \
    --trt_plugins=/path/to/plugin.so
```

### Testing

- `pytest tests/gpu/onnx/quantization/test_plugin.py -k int4_awq` fails
  before the fix (ORT type-inference error at calibration-session load)
  and passes after. The full `test_plugin.py` (including the existing
  INT8 quantization and autocast cases) passes.
- The example
  [here](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/onnx_ptq/README.md#quantize-an-onnx-model-with-custom-op)
  also failed before this fix and now passes.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional info
Fixes a regression introduced by #1565.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
  * Preserve metadata for TensorRT plugin outputs during cleanup and
    correctly reconcile output data types so custom plugin ops remain
    intact after optimization/quantization.
* **Tests**
  * Added a GPU ONNX regression test covering int4 quantization with AWQ
    calibration to ensure TensorRT plugins are retained.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Gwenaelle Cunha Sergio <gcunhasergio@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…#1702)

### What does this PR do?

Type of change: Bug fix

Fixes nvbug **6311147** (OMNIML-5103).
`examples/deepseek/deepseek_v3/ptq.py` resolved the cloned DeepSeek-V3 /
DeepSeek-V3.2-Exp inference repos relative to its own directory
(`deepseek_v3/`) via `Path(__file__).resolve().parent`. But the
[README](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/deepseek)
clones those repos into the parent `examples/deepseek/` directory and
runs the script from there, so the lookup landed one level too deep and
raised `ValueError: DeepSeek-V3 or DeepSeek-V3.2-Exp not found` (the
error message also printed the wrong directory).

The fix resolves from `parent.parent` via a single `DEEPSEEK_DIR` base
shared by both repo paths and the error message.
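
A minimal sketch of the path arithmetic, using a hypothetical checkout under `/repo` and `PurePosixPath` in place of `Path(__file__).resolve()` so it runs anywhere:

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; the real script starts from Path(__file__).resolve().
script = PurePosixPath("/repo/examples/deepseek/deepseek_v3/ptq.py")

# Old lookup base: the script's own directory, one level too deep.
old_base = script.parent

# New lookup base: the directory the README clones the inference repos into.
DEEPSEEK_DIR = script.parent.parent

old_lookup = old_base / "DeepSeek-V3" / "inference"
new_lookup = DEEPSEEK_DIR / "DeepSeek-V3" / "inference"
```

Here `old_lookup` points inside `deepseek_v3/`, where nothing was cloned, while `new_lookup` lands next to the `deepseek_v3/` directory as the README expects.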

### Usage

```bash
# Run from examples/deepseek/ as documented in the README, after cloning
# DeepSeek-V3 (or DeepSeek-V3.2-Exp) into that directory:
torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py \
  --model_path $DS_CKPT \
  --config DeepSeek-V3/inference/configs/config_671B.json \
  --quant_cfg NVFP4_DEFAULT_CFG \
  --output_path $FP4_QUANT_PATH
```

### Testing

- Confirmed against the repro path: with the file at
`examples/deepseek/deepseek_v3/ptq.py` and the repos cloned into
`examples/deepseek/`, `Path(__file__).resolve().parent.parent` now
points at `examples/deepseek/` so `DeepSeek-V3/inference` resolves
correctly.
- Verified the sibling `examples/deepseek/deepseek_v4/` does not share
the bug (it takes an explicit `--dsv4_inference_dir` argument instead).
- `pre-commit` clean.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (one-line path fix in an
example script that requires the DeepSeek repos + multi-GPU checkpoint
to exercise)
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A (bug is in a 0.45-cycle example, not a regression from a released
version)
- Did you get Claude approval on this PR?: ❌ (not yet run)

### Additional Information

nvbug 6311147 / OMNIML-5103.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved path resolution in the example script to more reliably
    locate the required inference repository.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

Extends the calibration/memory-probe `use_cache` guard to Step 3.7-style
nested text configs. Step 3.7 remote code reads the language config
under `model.config.text_config` directly and raises `AttributeError`
when `use_cache` is absent during PTQ calibration with Transformers >5.

This keeps the existing Step 3.5 behavior and applies the same temporary
set/restore logic to the nested text config.
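
The temporary set/restore logic can be sketched as a context manager. The helper names below are hypothetical and are not the functions in `modelopt/torch/utils/dataset_utils.py`; `SimpleNamespace` stands in for a Transformers config object.

```python
from contextlib import contextmanager
from types import SimpleNamespace

_MISSING = object()  # sentinel: the attribute did not exist before we touched it


def iter_cache_configs(config):
    """Yield the top-level config and its nested text_config, each object once."""
    seen = set()
    for cfg in (config, getattr(config, "text_config", None)):
        if cfg is not None and id(cfg) not in seen:
            seen.add(id(cfg))
            yield cfg


@contextmanager
def use_cache_disabled(config):
    """Force use_cache=False on every config, then restore or remove it on exit."""
    saved = [(cfg, getattr(cfg, "use_cache", _MISSING)) for cfg in iter_cache_configs(config)]
    for cfg, _ in saved:
        cfg.use_cache = False
    try:
        yield
    finally:
        for cfg, previous in saved:
            if previous is _MISSING:
                delattr(cfg, "use_cache")  # leave no trace on configs that lacked it
            else:
                cfg.use_cache = previous


# Nested text config without use_cache, mirroring the failing case described above.
text_cfg = SimpleNamespace()
config = SimpleNamespace(use_cache=True, text_config=text_cfg)

with use_cache_disabled(config):
    inside = (config.use_cache, text_cfg.use_cache)
```

The identity-based de-duplication matters when `text_config` is the same object as the top-level config: saving it twice would record `False` as the "previous" value on the second pass and restore the wrong state.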

### Usage

No API change. PTQ calibration continues to use the existing
forward-loop path.

### Testing

- `pre-commit run ruff-format --files
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py`
- `pre-commit run ruff-check --files
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py`
- `python -m py_compile modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py`
- `python -m pytest tests/unit/torch/utils/test_dataset_utils.py -k
"disable_use_cache or iter_use_cache_configs or
forward_loop_runs_under_disabled" -vv`

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

This is separate from PR #1693. Step 3.7 needs both fixes if both
failure paths are exercised: this PR fixes PTQ calibration-time
`use_cache` handling, while PR #1693 fixes exported config `layer_types`
metadata for deployment config loading.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
  * Improved handling of cache flags stored in nested model
    configuration objects: cache is reliably disabled during dataset
    operations and restored or removed afterward.

* **Tests**
  * Added unit tests covering nested-config disabling,
    restoration/removal of cache flags post-operation, and deduplication
    when nested configs reference the same object.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Fixes #1658

Type of change: Bug fix, documentation

This PR updates the Puzzletron dataset preparation flow to use the
already published prebuilt dataset
`nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2` by default,
avoiding the need to download the full raw
`nvidia/Nemotron-Post-Training-Dataset-v2` dataset (~136 GB) just to
filter it down to the same ~2.6 GB result.

Changes included:
- Add `PREBUILT_KD_DATASET` constant in `prepare_dataset.py`
- Short-circuit dataset preparation when `dataset_name` matches the
  prebuilt dataset, loading it directly and skipping the download +
  filtering pipeline
- Update 8 Puzzletron example configs to use the prebuilt dataset path
  by default
- Update the Puzzletron README to document the default ~3 GB path and
  clarify that the raw ~136 GB path is still available if users want to
  reproduce preprocessing
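
The short-circuit can be sketched as follows. Only the `PREBUILT_KD_DATASET` name comes from this PR; the function signature and the injected `load`, `filter_raw`, and `save` callables are hypothetical stand-ins for the real download, filtering, and save steps.

```python
PREBUILT_KD_DATASET = "nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2"


def prepare_dataset(dataset_name, output_dir, load, filter_raw, save):
    """Load a dataset and save it, filtering only when it is not the prebuilt one.

    `load`, `filter_raw`, and `save` are injected so the control flow can be
    exercised without network access (hypothetical signature).
    """
    data = load(dataset_name)
    if dataset_name != PREBUILT_KD_DATASET:
        # Raw path: run the filtering pipeline to derive the KD subset.
        data = filter_raw(data)
    save(data, output_dir)
    return data


# Tiny in-memory fakes that record which steps ran.
calls = []


def fake_load(name):
    calls.append(("load", name))
    return [1, 2, 3, 4]


def fake_filter(rows):
    calls.append(("filter", len(rows)))
    return [row for row in rows if row % 2 == 0]


def fake_save(rows, path):
    calls.append(("save", path))


prebuilt_rows = prepare_dataset(PREBUILT_KD_DATASET, "out/prebuilt", fake_load, fake_filter, fake_save)
raw_rows = prepare_dataset("nvidia/Nemotron-Post-Training-Dataset-v2", "out/raw", fake_load, fake_filter, fake_save)
```

Injecting the I/O steps is a choice made for this sketch; it mirrors the mock-based local check mentioned in this PR's testing notes without claiming to reproduce it.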

Default lightweight path:

```bash
python -m modelopt.torch.puzzletron.dataset.prepare_dataset \
  --dataset_name nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2 \
  --output_dir path/to/Puzzle-KD-Nemotron-Post-Training-Dataset-v2
```

Raw dataset path (existing behavior, still supported):

```bash
python -m modelopt.torch.puzzletron.dataset.prepare_dataset \
  --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 \
  --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

- Ran `pre-commit run --all-files`
- Most hooks passed successfully
- Local pre-commit `mypy` reported unrelated existing errors in:
  - `modelopt/torch/opt/config_loader.py`
  - `modelopt/recipe/loader.py`
- Verified this change separately with a local mock-based test:
  - prebuilt dataset path correctly loads and saves directly
  - original raw dataset path remains untouched

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: N/A

This change preserves the original raw-dataset workflow for users who
explicitly want to regenerate the filtered dataset from scratch, while
making the default example flow much lighter and easier to use.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Documentation**
  * Updated setup instructions to use a prebuilt, optimized dataset by
    default, simplifying the model compression workflow.

* **Chores**
  * Updated model compression configurations across multiple examples to
    use the prebuilt dataset.
  * Enhanced dataset preparation to support prebuilt dataset handling
    for more efficient setup.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Sabari07 <sabursd18@gmail.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Fixes the CPU RAM out-of-memory (OOM) issue reported in #1681.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
  * Optimized memory management during model validation operations.
    Explicit resource cleanup procedures are now performed after each
    solution validation, preventing memory accumulation and eliminating
    out-of-memory errors during extended validation workflows.

* **Configuration**
  * Updated default validation dataset configuration setting.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners June 15, 2026 18:08
@kevalmorabia97 kevalmorabia97 requested review from Edwardf0t1, cjluo-nv and realAsma and removed request for a team June 15, 2026 18:08
@coderabbitai

coderabbitai Bot commented Jun 15, 2026


📝 Walkthrough

This PR adds new Alpamayo and Qwen-Image example workflows, updates LLM quantization and sparse-attention export paths, expands recipe and tutorial documentation, improves dataset and Puzzletron utilities, fixes ONNX handling, and adjusts CI workflows, caching, and evaluation scripts.

Changes

CI and workflow updates

| Layer | File(s) | Summary |
|---|---|---|
| Extension cache action | `.github/actions/cache-extensions/action.yml` | Adds a composite action that sets TORCH_EXTENSIONS_DIR, restores cached torch extension artifacts, and backdates kernel sources on cache hits. |
| Workflow rollout | `.github/workflows/_example_tests_runner.yml`, `.github/workflows/gpu_tests.yml`, `.github/workflows/example_tests.yml`, `.github/workflows/unit_tests.yml` | Switches runner defaults, wires the cache action into workflows, removes one example matrix entry, shortens one GPU timeout, and enables Codecov skip_validation. |

Alpamayo quantization example

| Layer | File(s) | Summary |
|---|---|---|
| Quantization script | `examples/alpamayo/quantize.py` | Adds Alpamayo message building, teacher-forced loss patching, calibration loops, PTQ and AutoQuantize flows, CLI handling, and checkpoint export. |
| Example docs and tests | `examples/alpamayo/README.md`, `tests/examples/alpamayo/test_quantize.py` | Documents the example workflow and adds tests for Parquet clip-id loading and message construction. |

LLM quantization and evaluation fixes

| Layer | File(s) | Summary |
|---|---|---|
| Shared MXFP4/NVFP4 numerics and expert calibration | `modelopt/torch/quantization/utils/numeric_utils.py`, `examples/llm_ptq/cast_mxfp4_to_nvfp4.py`, `modelopt/torch/quantization/plugins/huggingface.py`, `modelopt/torch/quantization/plugins/transformers_trainer.py` | Moves MXFP4→NVFP4 math into shared utilities, updates cast code to consume them, adds transposed expert-weight calibration, and narrows quant_cfg typing. |
| DeepSeek and HF cast/load flows | `examples/deepseek/...`, `examples/llm_ptq/example_utils.py`, `examples/llm_ptq/hf_ptq.py` | Adds lossless DeepSeek V4 cast support, updates DeepSeek path resolution, detects original HF MXFP4 configs, threads preview attention masks, and resolves local source checkpoint paths before casting. |
| Runtime and evaluation wiring | `modelopt/deploy/llm/generate.py`, `examples/llm_eval/*`, `examples/llm_ptq/run_tensorrt_llm.py`, `examples/llm_ptq/scripts/*`, `examples/specdec_bench/specdec_bench/models/vllm.py` | Adds configurable KV-cache reuse, disables it for context-logit example paths, fixes MTP speculative decoding config, and updates evaluation scripts and shell wiring. |
| Coverage and interface docs | `tests/examples/llm_ptq/*`, `tests/gpu/torch/quantization/*`, `tests/unit/torch/quantization/*`, `examples/llm_qat/ARGUMENTS.md`, `tests/_test_utils/torch/transformers_models.py` | Adds cast and calibration regression coverage, tests the new HF quant-method helper, updates quant-arg docs, and adjusts a tiny transformer test helper. |

FastGen DMD2 diffusion stack

| Layer | File(s) | Summary |
|---|---|---|
| FastGen core APIs | `modelopt/torch/fastgen/*`, `modelopt/torch/fastgen/methods/*`, `modelopt/torch/fastgen/plugins/__init__.py` | Adds the FastGen package with DMD configs, EMA, flow-matching utilities, losses, loaders, base pipeline classes, and the DMD training pipeline. |
| Qwen-Image plugin and example flow | `modelopt/torch/fastgen/plugins/qwen_image.py`, `modelopt_recipes/general/distillation/dmd2_qwen_image.yaml`, `examples/diffusers/fastgen/*` | Adds the Qwen-Image plugin, a DMD2 recipe, training configs, training recipe, export utility, inference pipeline, requirements, and example documentation. |
| Fixtures and FastGen test coverage | `tests/_test_utils/torch/diffusers_models.py`, `tests/examples/diffusers/conftest.py`, `tests/unit/torch/fastgen/*` | Adds tiny Qwen-Image fixtures and unit coverage for FastGen math, plugin behavior, gradient routing, optimizer steps, hook guards, and EMA/timestep conversion behavior. |

Sparse attention export and example updates

| Layer | File(s) | Summary |
|---|---|---|
| Exported sparse-attention schema | `modelopt/torch/sparsity/attention_sparsity/*`, `modelopt/torch/export/unified_export_hf.py` | Adds grouped sparse-attention export metadata, records log-space calibration fields and initial_disabled_steps, updates checkpoint parsing, and injects sparse config into exported component config.json files. |
| Wan skip-softmax example flow | `examples/diffusers/README.md`, `examples/diffusers/sparsity/*`, `CHANGELOG.rst` | Extends sparse-attention docs and the Wan example with calibration-frame control, checkpoint export, and exported disabled-step metadata. |
| Schema and export coverage | `tests/examples/diffusers/sparsity/test_sparsity.py`, `tests/unit/torch/sparsity/attention_sparsity/*` | Adds export coverage for fixed-threshold and calibrated skip-softmax flows and updates schema assertions to the new grouped metadata shape. |

Chat-template utilities and Puzzletron dataset updates

| Layer | File(s) | Summary |
|---|---|---|
| Chat-template preprocessing | `modelopt/torch/utils/dataset_utils.py`, `modelopt/torch/utils/plugins/megatron_preprocess_data.py`, `tests/unit/torch/utils/test_dataset_utils.py`, `tests/gpu/torch/utils/test_dataset_utils.py` | Adds message normalization helpers for reasoning content and tool calls, uses split-specific raw dataset files, and expands nested use_cache handling and tests. |
| Puzzletron dataset and validation flow | `modelopt/torch/puzzletron/dataset/prepare_dataset.py`, `modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`, `examples/puzzletron/*` | Prefers a prebuilt Puzzle-KD dataset, updates Puzzletron docs and configs to that dataset, and frees validation model references between iterations. |

Recipes and tutorial documentation

| Layer | File(s) | Summary |
|---|---|---|
| Recipe catalog and presets | `modelopt_recipes/README.md`, `modelopt_recipes/ptq.md`, `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`, `modelopt_recipes/general/ptq/*`, `modelopt_recipes/models/.../nvfp4_w4a16.yaml`, `docs/source/guides/10_recipes.rst`, `tests/unit/recipe/*` | Expands recipe docs, adds new PTQ recipe variants and disabled quantizer patterns, adds a model-specific Nemotron recipe, and updates recipe coverage. |
| Tutorial and changelog refresh | `README.md`, `CHANGELOG.rst`, `examples/megatron_bridge/*`, `examples/pruning/*`, `tests/_test_utils/torch/megatron/models.py`, `tests/gpu_megatron/torch/quantization/plugins/test_megatron.py` | Updates top-level release notes and tutorial links, adds a Megatron Bridge end-to-end tutorial, refreshes evaluator templates and pruning docs, and gates a Megatron test on Mamba availability. |

ONNX calibration and metadata fixes

| Layer | File(s) | Summary |
|---|---|---|
| Histogram and value-info handling | `modelopt/onnx/quantization/ort_patching.py`, `modelopt/onnx/utils.py` | Casts histogram ranges to Python floats for narrow-value cases and preserves trt.plugins outputs during stale value_info cleanup. |
| ONNX coverage | `tests/gpu/onnx/test_ort_patching.py`, `tests/gpu/onnx/quantization/test_plugin.py` | Adds tests for fp16 narrow-range histogram collection and int4 AWQ quantization with TensorRT plugin nodes preserved. |

Sequence Diagram(s)

sequenceDiagram
  participant Config as DMD2 config
  participant Recipe as DMD2DiffusionRecipe
  participant Pipeline as DMDPipeline
  participant Checkpoint as sidecar checkpoint

  Config->>Recipe: load config and overrides
  Recipe->>Pipeline: build student, teacher, fake_score, discriminator
  Recipe->>Pipeline: run student or fake-score phase
  Pipeline-->>Recipe: return phase losses
  Recipe->>Checkpoint: save student and DMD2 sidecar state

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1653: Also adds the DeepSeek V4 --cast_mxfp4_to_nvfp4 lossless routed-expert cast path and related MXFP4→NVFP4 numeric handling.
  • NVIDIA/Model-Optimizer#1678: Also updates GPT-OSS MXFP4→NVFP4 PTQ/cast handling, including transposed expert calibration behavior and related example flows.
  • NVIDIA/Model-Optimizer#1670: Also refactors modelopt/torch/utils/dataset_utils.py and megatron_preprocess_data.py around chat-template message preparation and tool-call normalization.

Suggested labels

cherry-pick-0.45.0

Suggested reviewers

  • realAsma
  • Edwardf0t1
  • meenchen
  • Fridah-nv
  • cjluo-nv

@kevalmorabia97 kevalmorabia97 requested review from AAnoosheh and jenchen13 and removed request for a team, Edwardf0t1, cjluo-nv and realAsma June 15, 2026 18:10
@github-actions

github-actions Bot commented Jun 15, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-15 19:23 UTC

@codecov

codecov Bot commented Jun 15, 2026


Codecov Report

❌ Patch coverage is 69.60396% with 307 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.81%. Comparing base (2c52e7b) to head (e276eef).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| `modelopt/torch/fastgen/ema.py` | 38.59% | 70 Missing ⚠️ |
| `modelopt/torch/fastgen/methods/dmd.py` | 72.64% | 61 Missing ⚠️ |
| `modelopt/torch/fastgen/plugins/qwen_image.py` | 59.63% | 44 Missing ⚠️ |
| `modelopt/torch/fastgen/discriminators.py` | 0.00% | 37 Missing ⚠️ |
| `modelopt/torch/fastgen/flow_matching.py` | 64.36% | 31 Missing ⚠️ |
| `modelopt/torch/fastgen/loader.py` | 34.88% | 28 Missing ⚠️ |
| `modelopt/torch/utils/dataset_utils.py` | 86.36% | 12 Missing ⚠️ |
| `...delopt/torch/puzzletron/dataset/prepare_dataset.py` | 12.50% | 7 Missing ⚠️ |
| `modelopt/torch/fastgen/pipeline.py` | 73.91% | 6 Missing ⚠️ |
| `modelopt/torch/fastgen/factory.py` | 83.33% | 3 Missing ⚠️ |

... and 4 more
Additional details and impacted files
@@                Coverage Diff                 @@
##           release/0.45.0    #1734      +/-   ##
==================================================
- Coverage           77.48%   76.81%   -0.68%     
==================================================
  Files                 489      504      +15     
  Lines               54415    55332     +917     
==================================================
+ Hits                42165    42501     +336     
- Misses              12250    12831     +581     
| Flag | Coverage Δ |
|---|---|
| examples | 42.28% <10.59%> (-0.64%) ⬇️ |
| gpu | 57.65% <12.67%> (-1.40%) ⬇️ |
| regression | 14.64% <1.38%> (-0.16%) ⬇️ |
| unit | 54.26% <67.32%> (+0.25%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.


@coderabbitai coderabbitai Bot left a comment


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/example_utils.py (1)

646-659: ⚠️ Potential issue | 🟠 Major

Fix has_pack_quantized_config() to handle object-style quantization configs like get_original_hf_quant_method() does.

The function calls .get() directly on quantization_config without checking whether it's a dict or object. In Transformers 4.56.0, quantization config objects (e.g., Mxfp4Config) do not support dict-style .get() access—only attribute access. This will raise AttributeError at runtime if quantization_config is an object instead of a dict.

The same file already demonstrates the correct pattern in get_original_hf_quant_method() (lines 546–548), which uses isinstance(quant_cfg, dict) to branch between .get() for dicts and getattr() for objects. Apply this same guard to has_pack_quantized_config() on lines 630 and 636.

Suggested fix
        def has_pack_quantized_config(config):
+           def _cfg_get(qcfg, key, default=None):
+               return qcfg.get(key, default) if isinstance(qcfg, dict) else getattr(qcfg, key, default)
+
            # Check top-level quantization_config
            if hasattr(config, "quantization_config"):
-               if config.quantization_config.get("format", None) == "pack-quantized":
+               if _cfg_get(config.quantization_config, "format") == "pack-quantized":
                    return True
            # Check nested text_config.quantization_config (for multi-modal models like kimi k2.5)
            if hasattr(config, "text_config") and hasattr(
                config.text_config, "quantization_config"
            ):
-               if config.text_config.quantization_config.get("format", None) == "pack-quantized":
+               if _cfg_get(config.text_config.quantization_config, "format") == "pack-quantized":
                    return True
            return False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/example_utils.py` around lines 646 - 659, The
`has_pack_quantized_config()` function calls `.get()` directly on
`quantization_config` without checking whether it's a dict or an object, which
causes AttributeError at runtime when the config is an object like `Mxfp4Config`
that doesn't support dict-style access. Fix this by adopting the same pattern
already used in `get_original_hf_quant_method()` at lines 546-548: add an
`isinstance(quantization_config, dict)` check to branch between using `.get()`
for dict-style configs and `getattr()` for object-style configs. Apply this
guard to both `.get()` calls in `has_pack_quantized_config()` around lines 630
and 636.
🧹 Nitpick comments (1)
modelopt/torch/fastgen/plugins/__init__.py (1)

24-27: ⚡ Quick win

Define explicit __all__ in package __init__.py before wildcard re-export.

This package re-exports plugin symbols but does not declare its own __all__. Add module-level __all__ and extend it from qwen_image.__all__ when the plugin import succeeds so the public surface stays explicit.

As per coding guidelines, “Define the public API with __all__ at the top of each module and re-export via from .module import * in package __init__.py files.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/fastgen/plugins/__init__.py` around lines 24 - 27, The package
__init__.py file in modelopt/torch/fastgen/plugins/ performs a wildcard import
from qwen_image but does not define its own __all__ to explicitly declare the
public API. Define a module-level __all__ variable (can be initialized as an
empty list or with expected symbols), then within the import_plugin context
block for qwen_image, extend __all__ to include the symbols from
qwen_image.__all__ after the successful import. This ensures the public surface
of the package remains explicit and follows the coding guidelines for defining
public APIs.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/alpamayo/quantize.py`:
- Around line 650-652: The global torch.no_grad() context manager wrapping the
main() function call disables gradient computation globally, which prevents the
--quantize auto path from computing gradients needed by the auto_quantize() API
for gradient-based sensitivity scoring during the search phase. Remove the with
torch.no_grad(): wrapper and call main() directly without the context manager to
allow gradients to flow through the loss function when needed.
- Around line 461-467: The debug logging statements that extract GPU tensor
values to CPU scalars using .item() calls (on v_pred and v_target with
torch.isfinite, min, max, and abs().mean() operations) create unnecessary
CPU-GPU synchronization points in the optimization loop hot path. Either remove
these print statements entirely, or gate them behind a conditional debug flag
(such as if debug_logging:) that defaults to False so they do not execute during
normal operation. This will eliminate the synchronization overhead while
preserving the ability to enable detailed logging when needed for debugging.

In `@examples/deepseek/deepseek_v4/quantize_to_nvfp4.py`:
- Around line 329-335: The per-block amax computation uses
mxfp4_to_nvfp4_per_block_amax which internally recomputes k_max, but this may
differ from the shared k_max passed to the parent function, causing the
in-range/out-of-range classification to mismatch with the weight_scale_2
computed from the shared k_max. To fix this, either pass the shared k_max
parameter into the mxfp4_to_nvfp4_per_block_amax helper function so it uses the
correct shared value for both classification and scaling, or compute the
per_block_scale and in-range logic directly in this location using the shared
k_max instead of relying on the helper's internally recomputed value. This
ensures the block classification and weight_scale_2 derivation use the same
k_max reference.

In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Line 669: The torch.load calls at lines 669, 674, 687, and 696 in
dmd2_recipe.py use weights_only=False when loading checkpoints from the
user-supplied restore_from parameter, creating a Remote Code Execution risk if
checkpoints are untrusted or tampered with. For each of these four locations,
either change weights_only=False to weights_only=True to safely deserialize only
tensor data, or if weights_only=False is absolutely necessary for functionality,
add an inline comment explaining the security justification and request approval
from `@NVIDIA/modelopt-setup-codeowners`. The preferred approach is to switch all
four calls to weights_only=True unless there is a documented reason why the
model architecture requires full pickle deserialization.

In `@examples/diffusers/fastgen/export_diffusers_qwen_image.py`:
- Around line 49-52: The example usage in the export_diffusers_qwen_image.py
script shows `--base_pipeline_path Qwen/Qwen-Image`, but the export_diffusers()
function requires a local directory path and will fail with a non-directory
input like a model identifier. Replace the Qwen/Qwen-Image reference in the
example usage (around line 51) with a local snapshot directory path (e.g.,
/path/to/local/qwen_image_base or similar) to accurately reflect the expected
input format.

In `@examples/diffusers/fastgen/inference_dmd2_qwen_image.py`:
- Line 483: The os.makedirs call at line 483 crashes when output_png is a bare
filename because os.path.dirname returns an empty string. Before calling
os.makedirs on the dirname of output_png, guard against empty parent paths by
checking if the dirname is empty and using "." (current directory) as a fallback
when it is. This ensures the code handles both full paths and bare filenames
gracefully.
- Line 153: The torch.load call loading the EMA checkpoint with
weights_only=False enables unsafe pickle deserialization, creating a
code-execution risk for malicious files. Since the ema_path parameter is
caller-supplied without documented safety justification, and EMA state contains
only model weights which can be safely deserialized, change weights_only=False
to weights_only=True in the torch.load call on line 153 to disable pickle
deserialization and load only tensor data safely.
- Around line 145-146: The directory validation at lines 145-146 using
os.path.isdir rejects HuggingFace model IDs like the documented CLI default
"Qwen/Qwen-Image" at line 505, preventing diffusers from resolving the model.
Either remove the os.path.isdir check and let diffusers handle both local paths
and model IDs, or update the CLI default and help text at line 505 to require a
local snapshot path instead. Additionally, add an inline comment at line 153
where torch.load is called with weights_only=False explaining that it is safe
because the EMA checkpoint is internally-generated and trusted, not
user-supplied, to satisfy security guidelines.
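
The bare-filename guard described for line 483 can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper named `ensure_parent_dir` that is not part of the PR: `os.path.dirname` returns an empty string for a bare filename, and `os.makedirs("")` raises `FileNotFoundError`, so the sketch falls back to the current directory.

```python
import os


def ensure_parent_dir(output_path: str) -> str:
    """Create the parent directory of output_path and return it.

    Hypothetical helper for illustration; the script in the PR may be structured differently.
    """
    # os.path.dirname("image.png") is "", which os.makedirs rejects,
    # so use "." (the current directory) for bare filenames.
    parent = os.path.dirname(output_path) or "."
    os.makedirs(parent, exist_ok=True)
    return parent
```

The `or "."` form keeps the call site to a single extra expression and behaves the same for full paths.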

In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 296-305: Variable expansions in the python command invocations are
unquoted, which can cause argument splitting or glob expansion if the variables
contain spaces or special characters. In the mmlu.py command starting at line
296, wrap all variable expansions including $MODEL_ABS_PATH, $SAVE_PATH,
$MMLU_DATA_PATH, and $mmlu_flags in double quotes to ensure they are treated as
single arguments. Apply the same quoting fix to the corresponding command
invocation at lines 320-323 for consistency, wrapping all variable expansions in
that location with double quotes as well.

In `@modelopt/deploy/llm/generate.py`:
- Around line 291-295: Replace the assert statement in the
generate_context_logits() method that validates enable_kv_cache_reuse with an
explicit if statement that raises a ValueError. The current assert can be
stripped when Python runs with optimization flags (like -O), which would
silently allow incorrect behavior in this public API method. Change the
condition to check if self._enable_kv_cache_reuse is True, and if so, raise a
ValueError with the same descriptive error message that currently appears in the
assert.
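
A minimal sketch of the assert-to-exception change, assuming a hypothetical stand-in class `ContextLogitsRunner` and a placeholder error message (the real class name and message live in `modelopt/deploy/llm/generate.py` and are not reproduced here). Only the attribute name `_enable_kv_cache_reuse` and the method name come from the review comment.

```python
class ContextLogitsRunner:
    """Hypothetical stand-in for the class that owns generate_context_logits()."""

    def __init__(self, enable_kv_cache_reuse: bool = False) -> None:
        self._enable_kv_cache_reuse = enable_kv_cache_reuse

    def generate_context_logits(self, prompts: list[str]) -> list[str]:
        # An explicit raise still runs under `python -O`, which strips assert statements.
        if self._enable_kv_cache_reuse:
            # Placeholder message; the real code should reuse the existing assert text.
            raise ValueError("generate_context_logits requires KV cache reuse to be disabled.")
        return prompts  # placeholder for the real logits computation
```

The same pattern applies wherever an `assert` guards caller-supplied or configuration-supplied input.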

In `@modelopt/torch/fastgen/__init__.py`:
- Around line 57-68: The package API surface is not explicitly curated via
`__all__` in the two `__init__.py` files, making API drift likely. In
modelopt/torch/fastgen/__init__.py at lines 57-68, add an explicit `__all__`
list that aggregates all exported names from the wildcard imports (config, ema,
factory, loader, methods.dmd, pipeline modules) and explicitly includes the
module-level re-exports (flow_matching, losses, utils, plugins) to define the
curated public API surface. In modelopt/torch/fastgen/methods/__init__.py at
line 18, after the wildcard re-export from .dmd, add a line that imports __all__
from the dmd module and assigns it as __all__ to explicitly pin the exported
surface, ensuring both files follow the coding guideline of declaring public
surfaces with explicit `__all__` declarations.

In `@modelopt/torch/fastgen/config.py`:
- Around line 94-103: The _check_bounds validator method uses assert statements
to validate external input from YAML configuration, which is unsafe because
asserts can be disabled in optimized Python runs (with -O flag), allowing
invalid configuration to pass validation. Replace all four assert statements
with explicit ValueError raises instead, maintaining the same validation logic
and error messages but using the raise ValueError syntax to guarantee validation
always occurs regardless of Python optimization settings.

In `@modelopt/torch/fastgen/discriminators.py`:
- Around line 92-94: The feature_indices filtering at line 92 in the __init__
method only checks the upper bound (i < num_blocks) but allows negative indices
and can result in an empty set, causing torch.cat to fail later at line 136.
Replace the current filter condition with proper validation that enforces 0 <= i
< num_blocks for each index. Additionally, add a check after filtering to raise
an informative error immediately if feature_indices becomes empty, rather than
allowing silent failure downstream. This validates the input once at the
interface boundary as per coding guidelines.
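
The bounds check and empty-set check above can be sketched as below. This is an illustration only, assuming a hypothetical free function `validate_feature_indices`; in the PR the logic sits inside the discriminator's `__init__`. The sketch keeps the existing filtering behavior (out-of-range indices are dropped) and fails fast only when nothing remains; rejecting any out-of-range index outright would be an equally reasonable reading of the comment.

```python
from collections.abc import Iterable


def validate_feature_indices(feature_indices: Iterable[int], num_blocks: int) -> list[int]:
    """Keep indices in [0, num_blocks) and fail early if none remain.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    # Enforce both bounds so negative indices are no longer accepted.
    valid = sorted({i for i in feature_indices if 0 <= i < num_blocks})
    if not valid:
        raise ValueError(
            f"feature_indices has no entries in [0, {num_blocks}); "
            "at least one valid block index is required."
        )
    return valid
```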

In `@modelopt/torch/fastgen/ema.py`:
- Around line 127-129: The EMA shadow initialization and reset paths do not
respect the local_shard mode and unnecessarily call _gather_full(), which
triggers expensive all-gathers and memory spikes. Add a conditional check for
config.mode == "local_shard" before calling _gather_full() in the shadow
initialization and reset logic. When in local_shard mode, use the local shard
directly (the parameter p itself or a detached copy) instead of gathering the
full tensor across all ranks. Apply this fix at all locations where
_gather_full() is called during shadow initialization and reset operations,
including the code block around self._shadow[clean] assignment and any similar
shadow update paths.

In `@modelopt/torch/fastgen/flow_matching.py`:
- Around line 185-195: The _truncated_lognormal function creates all tensors on
CPU during sampling and only transfers the final result to the requested device,
causing unnecessary host/device traffic. Pass the device parameter directly to
all tensor creation calls including torch.tensor calls for log_min_t, log_max_t,
mean, and std tensors, as well as the torch.rand call for sampling u, to ensure
all intermediate computations happen on the target device from the start rather
than requiring a final .to() transfer.

In `@modelopt/torch/fastgen/loader.py`:
- Around line 74-93: The candidate paths in the _candidate_paths function are
being added in the wrong order. Currently, filesystem paths are appended before
built-in recipe paths, but the documented contract states built-in recipes
should be checked first. For both the string and Path branches of the
conditional logic, reverse the order of the candidate appends so that all
_BUILTIN_RECIPES_LIB.joinpath calls are executed before the corresponding Path
or direct filesystem path calls. This applies to all four
candidate.append/extend calls in the string branch and all four calls in the
Path branch.

In `@tests/examples/diffusers/sparsity/test_sparsity.py`:
- Around line 165-170: Move the imports currently at lines 165–170 (from
diffusers import AutoencoderKLWan and WanPipeline, the
modelopt.torch.sparsity.attention_sparsity import, the
modelopt.torch.export.export_hf_checkpoint import, and the SparseAttentionModule
import) from inside the test function to the module scope at the top of the file
with the other imports. If any of these imports require deferred loading due to
optional dependencies or circular import concerns, keep them in the function and
add a brief comment explaining why.

In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 38-59: The test reuses the same messages variable after it has
been processed by the first prepare_messages_for_chat_template call, which may
have modified the input. For the second assertion that tests
normalize_tool_calls=False, create a fresh copy of the messages input with the
identical structure instead of reusing the modified messages variable. This
ensures the non-normalizing code path is tested with unmodified input and can
properly detect regressions in how it preserves content when normalization is
disabled.
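
A small sketch of the fresh-copy pattern, assuming a hypothetical function `mutate_tool_calls` as a stand-in for a preprocessing call that may modify its input in place (it is not the real `prepare_messages_for_chat_template`). Each call receives its own deep copy, so the second code path sees unmodified input.

```python
import copy


def mutate_tool_calls(messages: list[dict]) -> list[dict]:
    """Hypothetical stand-in for a call that rewrites its input in place."""
    for message in messages:
        message["tool_calls"] = [str(call) for call in message.get("tool_calls", [])]
    return messages


ORIGINAL = [{"role": "assistant", "content": "hi", "tool_calls": [{"name": "f"}]}]

# Give every call its own copy instead of sharing one list between assertions.
first_input = copy.deepcopy(ORIGINAL)
second_input = copy.deepcopy(ORIGINAL)
mutate_tool_calls(first_input)
```

After the first call, `second_input` still matches `ORIGINAL`, which is what the non-normalizing assertion needs.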

---

Outside diff comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 646-659: The `has_pack_quantized_config()` function calls `.get()`
directly on `quantization_config` without checking whether it's a dict or an
object, which causes AttributeError at runtime when the config is an object like
`Mxfp4Config` that doesn't support dict-style access. Fix this by adopting the
same pattern already used in `get_original_hf_quant_method()` at lines 546-548:
add an `isinstance(quantization_config, dict)` check to branch between using
`.get()` for dict-style configs and `getattr()` for object-style configs. Apply
this guard to both `.get()` calls in `has_pack_quantized_config()` around lines
630 and 636.
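
The dict-versus-object guard can be sketched as a single accessor. This is a minimal illustration, assuming a hypothetical helper `read_quant_field`; the field name `quant_method` in the usage note is only an example and may not be the field that `has_pack_quantized_config()` actually reads.

```python
from typing import Any


def read_quant_field(quantization_config: Any, key: str, default: Any = None) -> Any:
    """Read a field from a dict-style or an object-style quantization config.

    Hypothetical helper for illustration; not part of the PR.
    """
    if isinstance(quantization_config, dict):
        return quantization_config.get(key, default)
    # Config objects expose fields as attributes and have no .get() method.
    return getattr(quantization_config, key, default)
```

With this shape, `read_quant_field(cfg, "quant_method")` works the same way for a plain dict and for a config object.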

---

Nitpick comments:
In `@modelopt/torch/fastgen/plugins/__init__.py`:
- Around line 24-27: The package __init__.py file in
modelopt/torch/fastgen/plugins/ performs a wildcard import from qwen_image but
does not define its own __all__ to explicitly declare the public API. Define a
module-level __all__ variable (can be initialized as an empty list or with
expected symbols), then within the import_plugin context block for qwen_image,
extend __all__ to include the symbols from qwen_image.__all__ after the
successful import. This ensures the public surface of the package remains
explicit and follows the coding guidelines for defining public APIs.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82f25edc-4ce3-4c7d-9a1c-2b2541d741f9

📥 Commits

Reviewing files that changed from the base of the PR and between 2c52e7b and e276eef.

⛔ Files ignored due to path filters (2)
  • examples/alpamayo/0417_16rows_train_set_for_calibration_25.10.parquet is excluded by !**/*.parquet
  • examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/figures/learning_curves.png is excluded by !**/*.png
📒 Files selected for processing (122)
  • .github/actions/cache-extensions/action.yml
  • .github/workflows/_example_tests_runner.yml
  • .github/workflows/example_tests.yml
  • .github/workflows/gpu_tests.yml
  • .github/workflows/unit_tests.yml
  • CHANGELOG.rst
  • README.md
  • docs/source/guides/10_recipes.rst
  • examples/alpamayo/README.md
  • examples/alpamayo/quantize.py
  • examples/deepseek/README.md
  • examples/deepseek/deepseek_v3/ptq.py
  • examples/deepseek/deepseek_v4/quantize_to_nvfp4.py
  • examples/diffusers/README.md
  • examples/diffusers/fastgen/README.md
  • examples/diffusers/fastgen/configs/dmd2_qwen_image.yaml
  • examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
  • examples/diffusers/fastgen/dmd2_finetune.py
  • examples/diffusers/fastgen/dmd2_recipe.py
  • examples/diffusers/fastgen/export_diffusers_qwen_image.py
  • examples/diffusers/fastgen/inference_dmd2_qwen_image.py
  • examples/diffusers/fastgen/requirements.txt
  • examples/diffusers/sparsity/README.md
  • examples/diffusers/sparsity/wan22_skip_softmax.py
  • examples/llm_eval/lm_eval_tensorrt_llm.py
  • examples/llm_eval/mmlu.py
  • examples/llm_eval/run_simple_eval.sh
  • examples/llm_ptq/cast_mxfp4_to_nvfp4.py
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/run_tensorrt_llm.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • examples/llm_ptq/scripts/parser.sh
  • examples/llm_qat/ARGUMENTS.md
  • examples/megatron_bridge/README.md
  • examples/megatron_bridge/requirements.txt
  • examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md
  • examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
  • examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/nemo_evaluator.yaml
  • examples/megatron_bridge/tutorials/README.md
  • examples/pruning/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml
  • examples/pruning/minitron_vs_puzzletron/README.md
  • examples/puzzletron/README.md
  • examples/puzzletron/configs/gptoss-20b_remove_experts_memory/gptoss-20b_remove_experts_memory.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml
  • examples/puzzletron/configs/llama-3_2-3B_pruneffn_memory/llama-3_2-3B_pruneffn_memory.yaml
  • examples/puzzletron/configs/mistral-small-24b-instruct-2501_pruneffn_memory/mistral-small-24b-instruct-2501_pruneffn_memory.yaml
  • examples/puzzletron/configs/nemotron-nano-12b-v2/nemotron_nano_12b_v2_pruneffn_memory.yaml
  • examples/puzzletron/configs/qwen2_5_7b_instruct_pruneffn_memory/qwen2_5_7b_instruct_pruneffn_memory.yaml
  • examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • modelopt/deploy/llm/generate.py
  • modelopt/onnx/quantization/ort_patching.py
  • modelopt/onnx/utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/fastgen/__init__.py
  • modelopt/torch/fastgen/config.py
  • modelopt/torch/fastgen/discriminators.py
  • modelopt/torch/fastgen/ema.py
  • modelopt/torch/fastgen/factory.py
  • modelopt/torch/fastgen/flow_matching.py
  • modelopt/torch/fastgen/loader.py
  • modelopt/torch/fastgen/losses.py
  • modelopt/torch/fastgen/methods/__init__.py
  • modelopt/torch/fastgen/methods/dmd.py
  • modelopt/torch/fastgen/pipeline.py
  • modelopt/torch/fastgen/plugins/__init__.py
  • modelopt/torch/fastgen/plugins/qwen_image.py
  • modelopt/torch/fastgen/utils.py
  • modelopt/torch/puzzletron/dataset/prepare_dataset.py
  • modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • modelopt/torch/quantization/plugins/transformers_trainer.py
  • modelopt/torch/quantization/utils/numeric_utils.py
  • modelopt/torch/sparsity/attention_sparsity/calibration/calibrate.py
  • modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py
  • modelopt/torch/sparsity/attention_sparsity/config.py
  • modelopt/torch/sparsity/attention_sparsity/conversion.py
  • modelopt/torch/sparsity/attention_sparsity/plugins/sparse_attn_config.py
  • modelopt/torch/utils/dataset_utils.py
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py
  • modelopt_recipes/README.md
  • modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
  • modelopt_recipes/general/distillation/dmd2_qwen_image.yaml
  • modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_omlp_only-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_weight_only-kv_fp8_cast.yaml
  • modelopt_recipes/models/Nemotron-H/Nemotron-3-Nano-4B/nvfp4_w4a16.yaml
  • modelopt_recipes/ptq.md
  • tests/_test_utils/torch/diffusers_models.py
  • tests/_test_utils/torch/megatron/models.py
  • tests/_test_utils/torch/transformers_models.py
  • tests/examples/alpamayo/test_quantize.py
  • tests/examples/diffusers/conftest.py
  • tests/examples/diffusers/sparsity/test_sparsity.py
  • tests/examples/diffusers_sparsity/test_sparsity.py
  • tests/examples/llm_eval/test_llm_eval.py
  • tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py
  • tests/examples/llm_ptq/test_example_utils.py
  • tests/gpu/onnx/quantization/test_plugin.py
  • tests/gpu/onnx/test_ort_patching.py
  • tests/gpu/torch/quantization/test_gpt_oss_mxfp4_nvfp4_cast_cuda.py
  • tests/gpu/torch/utils/test_dataset_utils.py
  • tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
  • tests/unit/recipe/test_loader.py
  • tests/unit/recipe/test_presets.py
  • tests/unit/torch/fastgen/conftest.py
  • tests/unit/torch/fastgen/test_dmd_gradient_routing.py
  • tests/unit/torch/fastgen/test_dmd_math.py
  • tests/unit/torch/fastgen/test_dmd_pipeline_step.py
  • tests/unit/torch/fastgen/test_hook_requirements.py
  • tests/unit/torch/fastgen/test_pred_type_conversion.py
  • tests/unit/torch/fastgen/test_qwen_image_plugin.py
  • tests/unit/torch/quantization/plugins/test_huggingface.py
  • tests/unit/torch/quantization/test_numeric_utils.py
  • tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py
  • tests/unit/torch/sparsity/attention_sparsity/test_sparse_attn_config.py
  • tests/unit/torch/utils/test_dataset_utils.py
💤 Files with no reviewable changes (2)
  • examples/megatron_bridge/requirements.txt
  • tests/examples/diffusers_sparsity/test_sparsity.py

Comment on lines +461 to +467
print(
    f"[autoquant-fwd] v_pred: finite={torch.isfinite(v_pred).all().item()} "
    f"min={v_pred.min().item():.4g} max={v_pred.max().item():.4g} "
    f"abs_mean={v_pred.abs().mean().item():.4g} | "
    f"v_target: finite={torch.isfinite(v_target).all().item()} "
    f"min={v_target.min().item():.4g} max={v_target.max().item():.4g}"
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and check its size
fd -a "quantize.py" examples/alpamayo/

# Then read the relevant section around lines 461-473
cat -n examples/alpamayo/quantize.py | sed -n '450,480p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1827


🏁 Script executed:

# Check context around forward_step to see if there's any debug flag or conditional logging
cat -n examples/alpamayo/quantize.py | sed -n '440,480p'

Repository: NVIDIA/Model-Optimizer

Length of output: 2247


Remove debug logging or gate behind debug flag to avoid CPU-GPU sync in AutoQuantize hot path.

Lines 461–467 and 470–473 extract GPU tensor values with .item() for logging on every iteration of the optimization loop. This introduces synchronization points that can materially slow the search. Either remove the prints or gate them behind a conditional debug flag (e.g., if debug_logging:).

Per coding guidelines, avoid tensor-to-Python scalar extraction in GPU hot paths unless the CPU needs the value. Here these extractions are only for logging and are unnecessary for computation.

Source: Coding guidelines
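
One way to gate the logging, sketched with the standard `logging` module instead of a bare `debug_logging` flag. This is an assumption-laden illustration: the logger name, `summarize`, and `forward_step` are hypothetical, and plain floats stand in for GPU tensors. The point is that the statistics are computed only when debug output is enabled, so the hot path pays nothing otherwise.

```python
import logging

logger = logging.getLogger("autoquant_debug_sketch")  # hypothetical logger name


def summarize(values: list[float]) -> str:
    """Stand-in for the costly statistics; on GPU tensors each .item() forces a sync."""
    return f"min={min(values):.4g} max={max(values):.4g}"


def forward_step(values: list[float]) -> float:
    # Compute the statistics only when debug logging is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("[autoquant-fwd] v_pred: %s", summarize(values))
    return sum(values)  # placeholder for the real loss computation
```

Calling `logger.setLevel(logging.DEBUG)` turns the diagnostics back on without touching the loop body.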

Comment on lines +650 to +652
if __name__ == "__main__":
    with torch.no_grad():
        main()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the root cause in this PR file and inspect auto-quantize internals for grad usage.
rg -n -C3 'with torch\.no_grad\(\):|auto_quantize\(|def forward_step|def loss_func|mse_loss' examples/alpamayo/quantize.py
rg -n -C3 'def auto_quantize|backward\(|autograd|grad' modelopt/torch/quantization

Repository: NVIDIA/Model-Optimizer

Length of output: 50379


Remove the global torch.no_grad() wrapper around main().

The global autograd disable at line 651 prevents gradient computation in the --quantize auto path (line 485), which relies on loss-driven search with explicit .backward() calls. The auto_quantize() API uses gradient-based sensitivity scoring by default and expects gradients to flow through the loss function during the search phase.

✅ Proposed fix
 if __name__ == "__main__":
-    with torch.no_grad():
-        main()
+    main()

Comment on lines +329 to +335
per_block_amax = mxfp4_to_nvfp4_per_block_amax(blocks, e8m0) # (out, nblk16) fp32

m = k_max - E4M3_KMAX
weight_scale_2 = torch.tensor(2.0**m, dtype=torch.float32, device=bf16.device).reshape(())
per_block_scale = (
    (per_block_amax / (E2M1_MAX * weight_scale_2))
    .clamp(min=2**-9, max=E4M3_MAX)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the shared k_max when deriving OOR per-block amax.

mxfp4_to_nvfp4_per_block_amax(blocks, e8m0) recomputes k_max from this tensor, but w1/w3 cast uses the shared fused-GEMM1 k_max passed into _quantize_weight_nvfp4_lossless. A w1/w3 tensor with a lower local max can classify blocks as in-range here while weight_scale_2 is based on the larger shared max, so the OOR fallback no longer matches the scale actually emitted. Thread the shared k_max into the per-block helper or compute in_range/data-derived fallback locally from the same k_max.

)

if os.path.isfile(ema_path) and self._dmd_pipeline.ema is not None:
    ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the file
find . -name "dmd2_recipe.py" -type f

Repository: NVIDIA/Model-Optimizer

Length of output: 110


🏁 Script executed:

# Check the specific lines mentioned
head -n 700 examples/diffusers/fastgen/dmd2_recipe.py | tail -n 50

Repository: NVIDIA/Model-Optimizer

Length of output: 2721


🏁 Script executed:

# Get broader context around the torch.load calls
sed -n '650,710p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3251


🏁 Script executed:

# Check if there are inline comments justifying weights_only=False
grep -n "weights_only" examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 440


🏁 Script executed:

# Verify file location and type
ls -la examples/diffusers/fastgen/dmd2_recipe.py && wc -l examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 205


🏁 Script executed:

# Check where these paths originate - find the function/method these lines are in
sed -n '600,670p' examples/diffusers/fastgen/dmd2_recipe.py | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 3510


🏁 Script executed:

# Look for where ema_path, state_path, disc_path, disc_opt_path are defined
grep -n "ema_path\|state_path\|disc_path\|disc_opt_path" examples/diffusers/fastgen/dmd2_recipe.py | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1505


🏁 Script executed:

# Find the function definition and parameter origins
sed -n '600,670p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3510


🏁 Script executed:

# Check where ckpt_dir comes from and what the function signature is
sed -n '550,650p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5259


🏁 Script executed:

# Check how restore_from is set and where it comes from
grep -n "restore_from" examples/diffusers/fastgen/dmd2_recipe.py | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 1433


🏁 Script executed:

# Look at the load_checkpoint method to understand the flow
sed -n '700,800p' examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4337


🏁 Script executed:

# Check where load_checkpoint is called and what calls _restore_dmd_extras
grep -B5 -A5 "_restore_dmd_extras" examples/diffusers/fastgen/dmd2_recipe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1660


Replace unsafe torch.load(..., weights_only=False) with weights_only=True in checkpoint restore path.

These four torch.load calls at lines 669, 674, 687, 696 deserialize checkpoint files with pickle enabled. The checkpoint paths originate from user-supplied restore_from input, creating an RCE risk if a checkpoint is untrusted or tampered with. None of the calls include inline comments justifying why weights_only=False is necessary and safe. Switch to weights_only=True, or if False is genuinely unavoidable, add an inline trust rationale and request security exception approval from @NVIDIA/modelopt-setup-codeowners.

Suggested fix
-            ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+            ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)

-            state = torch.load(state_path, map_location="cpu", weights_only=False)
+            state = torch.load(state_path, map_location="cpu", weights_only=True)

-                disc_state = torch.load(disc_path, map_location="cpu", weights_only=False)
+                disc_state = torch.load(disc_path, map_location="cpu", weights_only=True)

-                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False)
+                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)

Source: Coding guidelines

Comment on lines +49 to +52
python export_diffusers_qwen_image.py \\
--student_path /path/to/checkpoint/epoch_0_step_500/model/consolidated \\
--base_pipeline_path Qwen/Qwen-Image \\
--output_dir /path/to/output/qwen_image_dmd2 \\

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the --base_pipeline_path usage example.

The example on Line 51 uses Qwen/Qwen-Image, but export_diffusers() requires a local directory and fails on non-directory input. Please update the example/help text to a local snapshot path (or broaden input handling).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/export_diffusers_qwen_image.py` around lines 49 -
52, The example usage in the export_diffusers_qwen_image.py script shows
`--base_pipeline_path Qwen/Qwen-Image`, but the export_diffusers() function
requires a local directory path and will fail with a non-directory input like a
model identifier. Replace the Qwen/Qwen-Image reference in the example usage
(around line 51) with a local snapshot directory path (e.g.,
/path/to/local/qwen_image_base or similar) to accurately reflect the expected
input format.
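
If the script keeps the local-directory requirement, one option is to fail fast with a message that names the flag. The following is a minimal sketch under that assumption; `require_local_pipeline_dir` is a hypothetical helper, not a function in the reviewed script.

```python
from pathlib import Path


def require_local_pipeline_dir(base_pipeline_path: str) -> Path:
    """Return the path if it is an existing directory, otherwise raise with a hint."""
    path = Path(base_pipeline_path).expanduser()
    if not path.is_dir():
        raise NotADirectoryError(
            f"--base_pipeline_path must be a local directory, got {base_pipeline_path!r}. "
            "Download a snapshot of the base pipeline first and pass its path."
        )
    return path
```

A model identifier passed by mistake then produces an error that points at the flag instead of failing later inside the export.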

Comment on lines +127 to +129
full = _gather_full(p.detach(), fsdp2=config.fsdp2)
target_dtype = _resolve_dtype(config.dtype, full.dtype)
self._shadow[clean] = copy.deepcopy(full).to(dtype=target_dtype)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

local_shard mode is bypassed during EMA shadow init/reset.

When config.mode == "local_shard", these paths still call _gather_full(...), which can trigger unnecessary all-gathers and memory spikes.

Suggested fix
-                full = _gather_full(p.detach(), fsdp2=config.fsdp2)
+                full = (
+                    _gather_full(p.detach(), fsdp2=config.fsdp2)
+                    if config.mode == "full_tensor"
+                    else (p.detach().to_local() if _is_distributed_tensor(p) else p.detach())
+                )
                 target_dtype = _resolve_dtype(config.dtype, full.dtype)
                 self._shadow[clean] = copy.deepcopy(full).to(dtype=target_dtype)
...
-            live = _gather_full(p.detach(), fsdp2=self.config.fsdp2)
+            live = (
+                _gather_full(p.detach(), fsdp2=self.config.fsdp2)
+                if self.config.mode == "full_tensor"
+                else (p.detach().to_local() if _is_distributed_tensor(p) else p.detach())
+            )
             shadow.copy_(live.to(device=shadow.device, dtype=shadow.dtype))

Also applies to: 252-253

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/fastgen/ema.py` around lines 127 - 129, The EMA shadow
initialization and reset paths do not respect the local_shard mode and
unnecessarily call _gather_full(), which triggers expensive all-gathers and
memory spikes. Add a conditional check for config.mode == "local_shard" before
calling _gather_full() in the shadow initialization and reset logic. When in
local_shard mode, use the local shard directly (the parameter p itself or a
detached copy) instead of gathering the full tensor across all ranks. Apply this
fix at all locations where _gather_full() is called during shadow initialization
and reset operations, including the code block around self._shadow[clean]
assignment and any similar shadow update paths.

Comment on lines +185 to +195
    log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64)
    log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64)
    normal = Normal(
        torch.tensor(mean, dtype=torch.float64),
        torch.tensor(std, dtype=torch.float64),
    )
    cdf_min = normal.cdf(log_min_t)
    cdf_max = normal.cdf(log_max_t)
    u = torch.rand(n, dtype=torch.float64) * (cdf_max - cdf_min) + cdf_min
    t = normal.icdf(u).exp()
    return t.to(device=device, dtype=dtype)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

_truncated_lognormal ignores the requested device during sampling.

This path samples on CPU and transfers later, which adds avoidable host/device traffic in training loops.

Suggested fix
-    log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64)
-    log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64)
+    log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64, device=device)
+    log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64, device=device)
     normal = Normal(
-        torch.tensor(mean, dtype=torch.float64),
-        torch.tensor(std, dtype=torch.float64),
+        torch.tensor(mean, dtype=torch.float64, device=device),
+        torch.tensor(std, dtype=torch.float64, device=device),
     )
     cdf_min = normal.cdf(log_min_t)
     cdf_max = normal.cdf(log_max_t)
-    u = torch.rand(n, dtype=torch.float64) * (cdf_max - cdf_min) + cdf_min
+    u = torch.rand(n, dtype=torch.float64, device=device) * (cdf_max - cdf_min) + cdf_min
     t = normal.icdf(u).exp()
     return t.to(device=device, dtype=dtype)

As per coding guidelines, keep tensor work on the GPU and avoid unnecessary CPU-GPU syncs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/fastgen/flow_matching.py` around lines 185 - 195, The
_truncated_lognormal function creates all tensors on CPU during sampling and
only transfers the final result to the requested device, causing unnecessary
host/device traffic. Pass the device parameter directly to all tensor creation
calls including torch.tensor calls for log_min_t, log_max_t, mean, and std
tensors, as well as the torch.rand call for sampling u, to ensure all
intermediate computations happen on the target device from the start rather than
requiring a final .to() transfer.

Source: Coding guidelines
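
For readers unfamiliar with the sampling scheme in the snippet under review, it is inverse-CDF sampling of a truncated lognormal: draw a uniform value between the CDF values of the two log-bounds, invert the normal CDF, then exponentiate. Below is a pure-Python sketch of the same idea using `statistics.NormalDist`. It is not the modelopt implementation, it involves no devices or tensors, and the clamping steps are additions of this sketch.

```python
import math
import random
from statistics import NormalDist


def truncated_lognormal(n, mean, std, min_t, max_t, rng=None):
    """Draw n samples of t with log(t) ~ N(mean, std), restricted to [min_t, max_t]."""
    if rng is None:
        rng = random.Random()
    normal = NormalDist(mu=mean, sigma=std)
    cdf_min = normal.cdf(math.log(min_t))
    cdf_max = normal.cdf(math.log(max_t))
    samples = []
    for _ in range(n):
        # Uniform draw restricted to the CDF mass between the two bounds.
        u = cdf_min + rng.random() * (cdf_max - cdf_min)
        # inv_cdf requires 0 < u < 1; this guard is an addition of this sketch.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        t = math.exp(normal.inv_cdf(u))
        # Clamp to absorb floating-point round-off at the bounds.
        samples.append(min(max(t, min_t), max_t))
    return samples
```

Because the uniform draw is confined to `[cdf_min, cdf_max)`, no samples are rejected, which is why the approach suits a training loop.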

Comment on lines +74 to +93
    # Normalize to string for suffix probing; keep Path/Traversable behavior otherwise.
    if isinstance(config_file, str):
        base = config_file
        if base.endswith(_SUFFIXES):
            candidates.append(Path(base))
            candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
        else:
            candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
            candidates.extend(_BUILTIN_RECIPES_LIB.joinpath(base + suffix) for suffix in _SUFFIXES)
    elif isinstance(config_file, Path):
        if config_file.suffix in _SUFFIXES:
            candidates.append(config_file)
            if not config_file.is_absolute():
                candidates.append(_BUILTIN_RECIPES_LIB.joinpath(str(config_file)))
        else:
            candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
            if not config_file.is_absolute():
                candidates.extend(
                    _BUILTIN_RECIPES_LIB.joinpath(str(config_file) + suffix) for suffix in _SUFFIXES
                )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Candidate resolution order is inverted vs the documented contract.

The module docs say built-in recipes are checked first, but _candidate_paths currently prioritizes filesystem paths.

Suggested fix
     if isinstance(config_file, str):
         base = config_file
         if base.endswith(_SUFFIXES):
-            candidates.append(Path(base))
-            candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
+            candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
+            candidates.append(Path(base))
         else:
-            candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
             candidates.extend(_BUILTIN_RECIPES_LIB.joinpath(base + suffix) for suffix in _SUFFIXES)
+            candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
     elif isinstance(config_file, Path):
         if config_file.suffix in _SUFFIXES:
-            candidates.append(config_file)
             if not config_file.is_absolute():
                 candidates.append(_BUILTIN_RECIPES_LIB.joinpath(str(config_file)))
+            candidates.append(config_file)
         else:
-            candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
             if not config_file.is_absolute():
                 candidates.extend(
                     _BUILTIN_RECIPES_LIB.joinpath(str(config_file) + suffix) for suffix in _SUFFIXES
                 )
+            candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/fastgen/loader.py` around lines 74 - 93, The candidate paths
in the _candidate_paths function are being added in the wrong order. Currently,
filesystem paths are appended before built-in recipe paths, but the documented
contract states built-in recipes should be checked first. For both the string
and Path branches of the conditional logic, reverse the order of the candidate
appends so that all _BUILTIN_RECIPES_LIB.joinpath calls are executed before the
corresponding Path or direct filesystem path calls. This applies to all four
candidates.append/extend calls in the string branch and all four calls in the
Path branch.
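
The intended ordering can be sketched in isolation. This is a simplified illustration for string inputs only, assuming the documented contract (built-in recipes first) is the desired behavior; `SUFFIXES`, `BUILTIN_DIR`, and `candidate_paths` are hypothetical stand-ins for `_SUFFIXES`, `_BUILTIN_RECIPES_LIB`, and `_candidate_paths`, whose real values are not shown in this review.

```python
from pathlib import Path

SUFFIXES = (".yml", ".yaml")  # hypothetical suffix tuple
BUILTIN_DIR = Path("/opt/builtin_recipes")  # hypothetical built-in recipe root


def candidate_paths(config_file: str) -> list[Path]:
    """List lookup candidates with every built-in path ahead of filesystem paths."""
    if config_file.endswith(SUFFIXES):
        names = [config_file]
    else:
        names = [config_file + suffix for suffix in SUFFIXES]
    builtin = [BUILTIN_DIR / name for name in names]
    filesystem = [Path(name) for name in names]
    return builtin + filesystem
```

Building the two groups separately and concatenating them keeps the priority visible in a single line, so the order cannot drift between branches.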

Comment on lines +165 to +170
from diffusers import AutoencoderKLWan, WanPipeline

import modelopt.torch.sparsity.attention_sparsity as mtsa
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.sparsity.attention_sparsity.sparse_attention import SparseAttentionModule

⚠️ Potential issue | 🟡 Minor

Move imports to module scope.

At lines 165–170, imports are inside the test function without justification. Per guidelines, imports belong at file top so errors surface at collection time. Move them to module scope unless they require deferred loading (optional dependencies or circular imports) — in which case add a brief comment explaining why.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/examples/diffusers/sparsity/test_sparsity.py` around lines 165 - 170,
Move the imports currently at lines 165–170 (from diffusers import
AutoencoderKLWan and WanPipeline, the modelopt.torch.sparsity.attention_sparsity
import, the modelopt.torch.export.export_hf_checkpoint import, and the
SparseAttentionModule import) from inside the test function to the module scope
at the top of the file with the other imports. If any of these imports require
deferred loading due to optional dependencies or circular import concerns, keep
them in the function and add a brief comment explaining why.

Comment on lines +38 to +59
def test_prepare_messages_for_chat_template():
    messages = [
        {
            "role": "assistant",
            "content": "answer",
            "reasoning_content": "think",
            "tool_calls": [
                {"function": {"name": "search", "arguments": '{"q": "x"}'}},
            ],
        },
    ]
    prepared = prepare_messages_for_chat_template(
        messages, reasoning_content="native", normalize_tool_calls=True
    )
    assert prepared[0]["reasoning_content"] == "think"
    assert prepared[0]["tool_calls"][0]["function"]["arguments"] == {"q": "x"}
    assert (
        prepare_messages_for_chat_template(
            messages, reasoning_content="native", normalize_tool_calls=False
        )
        is messages
    )

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use fresh input for the non-normalizing branch assertion.

The second assertion reuses messages after the normalizing call, so it can miss regressions in the normalize_tool_calls=False path’s content preservation.

Suggested test adjustment
 def test_prepare_messages_for_chat_template():
     messages = [
         {
             "role": "assistant",
             "content": "answer",
             "reasoning_content": "think",
             "tool_calls": [
                 {"function": {"name": "search", "arguments": '{"q": "x"}'}},
             ],
         },
     ]
     prepared = prepare_messages_for_chat_template(
         messages, reasoning_content="native", normalize_tool_calls=True
     )
     assert prepared[0]["reasoning_content"] == "think"
     assert prepared[0]["tool_calls"][0]["function"]["arguments"] == {"q": "x"}
+
+    raw_messages = [
+        {
+            "role": "assistant",
+            "content": "answer",
+            "reasoning_content": "think",
+            "tool_calls": [
+                {"function": {"name": "search", "arguments": '{"q": "x"}'}},
+            ],
+        },
+    ]
     assert (
         prepare_messages_for_chat_template(
-            messages, reasoning_content="native", normalize_tool_calls=False
+            raw_messages, reasoning_content="native", normalize_tool_calls=False
         )
-        is messages
+        is raw_messages
     )

As per coding guidelines, checked-in tests should protect expected behavior and regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 38 - 59, The test
reuses the same messages variable after it has been processed by the first
prepare_messages_for_chat_template call, which may have modified the input. For
the second assertion that tests normalize_tool_calls=False, create a fresh copy
of the messages input with the identical structure instead of reusing the
modified messages variable. This ensures the non-normalizing code path is tested
with unmodified input and can properly detect regressions in how it preserves
content when normalization is disabled.

Source: Coding guidelines
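
The risk of a shared fixture can be shown with a small standalone sketch. Whether the real `prepare_messages_for_chat_template` mutates its input is not stated in this review; `normalize_in_place` below is a hypothetical normalizer that does, and `make_messages` is a hypothetical fixture factory.

```python
import json


def make_messages():
    """Build a fresh, un-normalized message list on every call."""
    return [
        {
            "role": "assistant",
            "content": "answer",
            "tool_calls": [
                {"function": {"name": "search", "arguments": '{"q": "x"}'}},
            ],
        },
    ]


def normalize_in_place(messages):
    """Hypothetical normalizer that parses JSON arguments by mutating its input."""
    for message in messages:
        for call in message.get("tool_calls", []):
            arguments = call["function"]["arguments"]
            if isinstance(arguments, str):
                call["function"]["arguments"] = json.loads(arguments)
    return messages


shared = make_messages()
normalize_in_place(shared)
# The shared fixture no longer holds the raw JSON string.
shared_arguments = shared[0]["tool_calls"][0]["function"]["arguments"]

# A freshly built fixture still does, so a later assertion sees raw input.
fresh_arguments = make_messages()[0]["tool_calls"][0]["function"]["arguments"]
```

A factory function gives each assertion its own input without duplicating the literal in the test body.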

@kevalmorabia97 kevalmorabia97 merged commit a6c7243 into release/0.45.0 Jun 15, 2026
85 of 87 checks passed
@kevalmorabia97 kevalmorabia97 deleted the cherry-picks/release-0.45.0 branch June 15, 2026 19:23