fastgen DMD2: make the Qwen-Image example self-contained on stock nemo_automodel by jingyu-ml · Pull Request #1688 · NVIDIA/Model-Optimizer

jingyu-ml · 2026-06-11T20:35:02Z

What does this PR do?

Type of change: new example / refactor (example self-containment)

The published examples/diffusers/fastgen DMD2 Qwen-Image distillation example previously relied on
local, unpublished modifications to the sibling nemo_automodel package (data path, collate, a
partial-load checkpointer, and Qwen-Image preprocessing). An external user running the published
Model-Optimizer example against official stock nemo_automodel would hit import/attribute errors.

This PR makes the example self-contained on stock nemo_automodel>=0.4.0 — with the changes kept
as small as possible: the team's actual AutoModel delta is only ~430 lines, so wherever the
upstream module is importable, the delta is expressed as a thin subclass/wrapper rather than a
full-file copy.

fastgen_data/ — the DMD2 data path:
- collate_fns.py — thin wrappers over stock: import the unpatched upstream collate +
  bucket sampler and re-add only the DMD2 delta (text_embeddings_mask, stacked from the
  per-item mask the stock collate omits; and an optional broadcast negative_text_embeddings
  for CFG). The builder loads an optional negative_prompt_embedding_path.
- text_to_image_dataset.py — a faithful vendored copy of the upstream reader (its
  prompt_embeds_mask emission is interleaved with cache loading, so wrapping it would force a
  redundant per-item torch.load; carried verbatim instead).
fastgen_checkpoint.py — PartialLoadCheckpointer(Checkpointer) that overrides only
load_optimizer (FSDP2 DefaultLoadPlanner(allow_partial_load=True)) so optimizer resume works
without patching upstream. Injected via an in-place re-bless of self.checkpointer in the
recipe's load_checkpoint (model-state load stays strict).
preprocess/ — Qwen-Image preprocessing (preprocessing_multiprocess.py + processors/),
trimmed to the image path (drops the flux/wan/hunyuan processors and the video base class).
It lives in AutoModel's top-level tools/ tree, which is not shipped in the pip package, so
it cannot be wrapped and is vendored; MultiTierBucketCalculator is imported from stock upstream.
make_negative_prompt_embedding.py — generates the optional CFG negative-prompt embedding.
All configs/*.yaml target fastgen_data.build_* (a test enumerates every config).
Licensing: the AutoModel-copied files are NVIDIA-authored Apache-2.0, so they carry only the
standard NVIDIA SPDX header (managed by the insert-license hook) — no per-file provenance note,
no duplicated license, no pre-commit exclusion, and no separate LICENSE note.
nemo_automodel[diffusion] version bound in requirements.txt.

The DMD2 math in modelopt/torch/fastgen/ is unchanged — only example/training-time glue moved.

Usage

# Install example deps (stock nemo_automodel) from a source checkout
pip install -r examples/diffusers/fastgen/requirements.txt

# Build the training cache from raw images (Qwen-Image VAE latents + text embeddings)
python examples/diffusers/fastgen/preprocess_qwen_image.py image \
    --image_dir <raw images> --output_dir <cache dir> --processor qwen_image \
    --caption_format meta_json

# Generate the CFG negative-prompt embedding once
python examples/diffusers/fastgen/make_negative_prompt_embedding.py \
    --output <cache dir>/negative_prompt_embedding.pt

# Point the config's data.dataloader.cache_dir + negative_prompt_embedding_path at the cache, then train.

See examples/diffusers/fastgen/README.md → "Requirements & self-contained data path".

Testing

New tests/examples/diffusers/fastgen/test_vendored_migration.py: environment-independent
invariants (every config targets a vendored builder; no tools.* imports; each former AutoModel
patch is vendored / wrapped / a documented exclusion; the former-vendored files carry the standard NVIDIA SPDX header, no provenance note or duplicate license) plus
dependency-guarded structural tests (the collate wrapper emits the batch contract + broadcasts the
negative embedding; the builder accepts negative_prompt_embedding_path; the checkpointer
overrides only load_optimizer; the Qwen-Image processor self-registers).
Validated 9/9 against a pure stock nemo_automodel 0.4.0 worktree (none of the local patches
present) via SLURM — re-run after this slim-down.
The migrated code path is exercised by a live multi-GPU DMD2 run that resumed from a checkpoint
through the vendored PartialLoadCheckpointer.
ruff check + ruff format --check clean on all changed files.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ (additive; bundled configs target the vendored builders)
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ (NeMo-AutoModel @ e42584e3, Apache-2.0; per review these NVIDIA-authored files carry the standard NVIDIA SPDX header, no separate provenance / LICENSE note; nemo_automodel was already a dependency)
Did you write any new necessary tests?: ✅
Did you update Changelog?: N/A (example-only change)
Did you get Claude approval on this PR?: ❌ (pending — opened as draft)

Additional Information

Opened as a draft pending: OSRB review of the vendored NeMo-AutoModel (Apache-2.0) code, CI green,
and (optional) a multi-GPU smoke-train + resume on stock upstream. Vendored from
NVIDIA-NeMo/Automodel at commit e42584e3.

Update (post-review)

Mid-run resume data-correctness fix (6ffbc52c9): on resume the StatefulDataLoader's restored state did not advance past the resume point, so each window re-served the same data slice and multi-window (SLURM-windowed) runs under-covered the dataset. Fixed by rebuilding a fresh loader and skipping the deterministic sampler to the position implied by global_step; added a SLURM-free, GPU-free CPU regression test (tests/examples/diffusers/fastgen/test_resume_dataloader.py).
Licensing review (be832ae95): the AutoModel-copied files are NVIDIA-authored Apache-2.0, so they now carry only the standard NVIDIA SPDX header — dropped the per-file provenance note, the duplicated original-license block, the insert-license pre-commit exclusion, and the LICENSE note.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added vendored Qwen‑Image preprocessing (multi-process) with processor registry support.
- Updated DMD2 data loading with a dedicated dataset/collation pipeline, including negative-prompt embedding/mask handling.
- Improved training resume behavior by rebuilding dataloader state and making checkpoint optimizer restore tolerant of partial FSDP2 optimizer shards.
Documentation
- Refreshed the fastgen README and config notes for real-data training; removed the prior mock-data smoke workflow.
Tests
- Added regression and migration tests covering vendored wiring, collate/dataloader contracts, processor registration, and resume/checkpoint behavior.
Chores
- Updated licenses/attribution, vendoring/tooling guards, requirements pinning, linting configuration, and repository ownership rules.

…n-time self-contained) Make the published DMD2 Qwen-Image example run against STOCK upstream nemo_automodel by removing its dependence on local AutoModel modifications (staged patches #1-5): - Vendor the patched data files into examples/diffusers/fastgen/fastgen_data/ (collate_fns, text_to_image_dataset, mock_dataloader), importing the UNPATCHED upstream helpers (sampler, base_dataset, text_to_video_dataset) from the stock nemo_automodel package. Provenance headers per CONTRIBUTING (source + commit e42584e3, Apache-2.0). - Repoint the published configs' dataloader _target_ to fastgen_data.build_* and add an explicit sys.path seam + startup resolution logging in dmd2_finetune.py (source-checkout). - Add fastgen_checkpoint.PartialLoadCheckpointer (overrides only load_optimizer with allow_partial_load=True) and inject it in dmd2_recipe.load_checkpoint, so both the student and fake-score optimizer restores tolerate FSDP2 partial shards without a patched AutoModel. No change to modelopt/torch/fastgen/ (DMD2 math). Preprocessing vendoring + tests + the removability audit follow in subsequent commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

Bound nemo_automodel[diffusion] to the tested range (>=0.4.0,<1.0; 0.4.0 == the validated commit e42584e3) and add a runtime soft-guard in fastgen_data/__init__.py that converts a missing unpatched-upstream-helper ImportError into an actionable message naming the supported range. Supports AC-7 (bounded version) and AC-1 (clear failure on drift). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

…ined, no tools/ tree) Vendor the AutoModel preprocessing stack needed to build the VAE+text-embed cache from raw images, trimmed to the Qwen-Image path: - preprocess/processors/{base,base_video,registry,caption_loaders,qwen_image}.py + a trimmed processors/__init__.py that drops flux/wan/hunyuan (and their deps, e.g. nemo_automodel.shared.transformers_patches). The Qwen-Image processor self-registers. - preprocess/preprocessing_multiprocess.py (driver): the `from tools.diffusion.processors` import is rewritten to the vendored `.processors` package; MultiTierBucketCalculator is imported from the stock nemo_automodel package (it IS packaged). Image subcommand --processor defaults to qwen_image. - preprocess_qwen_image.py launcher mirroring dmd2_finetune.py's sys.path seam. Removes the example's dependence on AutoModel staged patches #7-8 and on the un-packaged AutoModel tools/ tree. Provenance headers per CONTRIBUTING. No tools.* imports remain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

…per CONTRIBUTING) The 10 vendored files (fastgen_data/{collate_fns,text_to_image_dataset,mock_dataloader}.py and preprocess/{preprocessing_multiprocess,processors/*}.py) carry the original NeMo-AutoModel Apache-2.0 header plus a "Vendored from ... @ e42584e3" provenance note. Add them to the insert-license-py exclude so the hook does not prepend a second NVIDIA header on top. LICENSE Third-Party Software Notices is intentionally NOT modified: every vendored file is NVIDIA-copyright Apache-2.0 (same entity and license as this repo, from the sibling NVIDIA-NeMo/Automodel repo), so there is no external third-party copyright holder to list; the per-file provenance headers are the applicable record. OSRB sign-off remains a maintainer release gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

Add examples/diffusers/fastgen/tests/test_vendored_migration.py with two tiers: - Environment-independent (run anywhere): published configs repointed to fastgen_data.*, no tools.* imports in vendored code, the AC-8 removability coverage audit (all nine staged AutoModel files vendored-or-excluded), and provenance headers on the 10 vendored files. - Dependency-guarded (importorskip nemo_automodel/torch, run in the container): data builders accept negative_prompt_embedding_path, collate emits the batch contract + broadcasts the negative embed, PartialLoadCheckpointer overrides only load_optimizer (model load stays strict) + the recipe injects it, and the qwen_image processor self-registers. The four environment-independent tests pass on CPU; the GPU/container end-to-end positive checks (smoke train, FSDP2 resume, preprocessing on real images) run via SLURM. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

Add a "Requirements & self-contained data path" section: runs on stock nemo_automodel (>=0.4.0,<1.0) from a source checkout; data loading (fastgen_data/) and preprocessing (preprocess/) are vendored so no nemo_automodel modifications are needed; how to build the cache via preprocess_qwen_image.py. Completes the train-time + preprocessing migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

…style Round 1 closes the login-node-actionable AC-7 gaps Codex flagged: - LICENSE: record the NeMo-AutoModel vendoring under Third-Party Software Notices (Apache-2.0). - Vendored files: add the NVIDIA SPDX header after the original AutoModel license, per CONTRIBUTING "Copying code from other sources" (matches modelopt/.../calib_utils.py) — all 10. - Runtime guard: fastgen_data/__init__.py soft-warns when installed nemo_automodel is outside the tested >=0.4.0,<1.0 range; preprocess/__init__.py guards the multi_tier_bucketing import with an actionable message. - Tests: drop plan workflow markers (AC-/Phase/Step) from comments per the plan's code-style note. Not changed (deliberate): the collate prompt_embeds_mask -> text_embeddings_mask mapping is byte-identical to the patched AutoModel (faithful vendor; changing it would alter current training behavior). GPU/SLURM end-to-end validation (smoke/resume/preprocessing/revert-all-nine) remains deferred to the container environment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

… tests Address the Round 1 review's actionable gaps: - Add make_negative_prompt_embedding.py: generates the CFG negative_prompt_embedding.pt the canonical config requires, reusing the vendored QwenImageProcessor's text encoder (default negative prompt ""), saved in the dataloader's payload format {"embed", "mask"}. Closes the AC-5/AC-8 self-containment gap (preprocessing previously emitted only per-image caches, so a from-scratch user had no negative-prompt embedding to point the config at). Documented in README. - Move migration tests to tests/examples/diffusers/fastgen/ (the repo's example CI lane) and fix the example-dir resolution from the new location. - Strengthen the provenance test to assert the NVIDIA SPDX header presence + ordering (after the original license), not just the "Vendored from" note. - Update stale "python -m tools.diffusion..." invocation strings in the vendored driver docs to the preprocess_qwen_image.py launcher. GPU/SLURM end-to-end execution remains environment-deferred. No modelopt/torch/fastgen changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

…ig test Address the Round 2 review: - Every real-data config on disk now targets fastgen_data.build_text_to_image_multiresolution_dataloader. The 2 committed configs (dmd2_qwen_image.yaml, smoke) were already repointed; the 6 local-only experiment configs (1M_shift3, df0.7, 320k, 320k_shift3, 8steps, 8steps_320k — git-excluded, with hardcoded /lustre data paths) are repointed on disk so the user's own runs survive deleting the AutoModel patches. (They remain excluded from the published diff.) - The config test now enumerates configs/*.yaml and fails on ANY upstream dataloader target, so a new config cannot silently reintroduce the upstream dependence. Robust to whatever set ships (2 in a fresh clone, 8 on the user's disk). - Assert the builder's negative_prompt_embedding_path defaults to None (CFG-less is optional). GPU/SLURM end-to-end execution remains environment-deferred. No modelopt/torch/fastgen changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

…ntract The stock-upstream SLURM validation (8/9 pass) surfaced a test-fixture bug, not a migration bug: test_collate_emits_contract_keys_and_broadcasts_negative_prompt built a fake sample with only {latent, prompt_embeds, prompt_embeds_mask}, but collate_fn_production also requires crop_resolution / original_resolution / crop_offset / prompt / image_path / bucket_id / aspect_ratio (the keys TextToImageDataset emits). Add them so the test exercises the real collate path. Vendored code unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

Apply the repo ruff config to the vendored data path and preprocessing files: ruff-format + autofixes (pyupgrade py310, isort) plus manual fixes for E402 (logger placement), E501 (provenance comment wrap), N806 (lowercase dims in the video-only 5D shape validator), SIM103, RUF022 (sorted __all__), C401, PERF401. No behavioral change: correctness-critical imports (recipe checkpointer injection, re-exports, # noqa: F401 side-effect imports, MultiTierBucketCalculator, DefaultLoadPlanner) and all provenance/SPDX headers preserved. ruff check + format --check clean; env-independent migration tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

copy-pr-bot · 2026-06-11T20:35:06Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-11T20:35:12Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Vendored NeMo-AutoModel preprocessing and data loaders into the fastgen example, rewired configs/scripts to use local builders, added FSDP2-tolerant optimizer restore and training resume via dataloader state reconstruction, implemented preprocessing and negative-prompt tooling, and added tests validating the complete migration.

Changes

FastGen DMD2 NeMo-AutoModel Vendoring & Integration

Layer / File(s)	Summary
License, configuration, and documentation updates `.pre-commit-config.yaml`, `LICENSE`, `examples/diffusers/fastgen/README.md`, `examples/diffusers/fastgen/requirements.txt`, `pyproject.toml`, `.github/CODEOWNERS`	Updated Apache-2 attribution for vendored NVIDIA code, excluded fastgen files from automatic license insertion, extended README with vendored preprocessing and data-cache build instructions, pinned `nemo_automodel[diffusion]` to `>=0.4.0,<1.0`, added MyPy overrides for vendored preprocess code, and added CODEOWNERS entry.
Processor contracts, registry, and Qwen-Image processor `examples/diffusers/fastgen/preprocess/__init__.py`, `examples/diffusers/fastgen/preprocess/processors/{base,caption_loaders,qwen_image,registry}.py`, `examples/diffusers/fastgen/preprocess/processors/__init__.py`	Adds import guards for vendor dependencies, `BaseModelProcessor` abstraction defining encode/verify/cache contracts, caption loader implementations (sidecar/meta-JSON/JSONL), `ProcessorRegistry` for runtime processor discovery, and `QwenImageProcessor` implementing Qwen-Image model loading, image/text encoding, latent verification, and torch.save-ready cache payload construction.
Multiprocess preprocessing orchestration and launcher `examples/diffusers/fastgen/preprocess/preprocessing_multiprocess.py`, `examples/diffusers/fastgen/preprocess_qwen_image.py`	Implements multiprocess image/video preprocessing with per-GPU worker state, media discovery, bucketing, sharded metadata output, optional verification, CLI subcommands for `image`/`video` modes, and a dedicated Qwen-image launcher entrypoint.
Vendored data package, dataset, collate, and CLI tools `examples/diffusers/fastgen/fastgen_data/{__init__,collate_fns,text_to_image_dataset}.py`, `examples/diffusers/fastgen/make_negative_prompt_embedding.py`, `examples/diffusers/fastgen/configs/dmd2_qwen_image.yaml`, `examples/diffusers/fastgen/dmd2_finetune.py`	Adds `fastgen_data` package with import guards and upstream-version warnings, `TextToImageDataset` for cached .pt samples, `collate_fn_text_to_image` with negative-prompt broadcasting and mask reconstruction, multiresolution dataloader builder, negative-prompt embedding generator script, and rewires config/finetune defaults to use vendored builders.
FSDP2-tolerant checkpoint restore and training resume `examples/diffusers/fastgen/fastgen_checkpoint.py`, `examples/diffusers/fastgen/dmd2_recipe.py`	Introduces `PartialLoadCheckpointer` subclass that loads optimizer state via torch.distributed.checkpoint with `DefaultLoadPlanner(allow_partial_load=True)` to handle FSDP2 partial-shard states. Adds `make_optimizer_partial_load_tolerant()` helper to upgrade checkpointer instances in-place (idempotently) and injects the upgrade call into recipe `load_checkpoint`. Implements `_rebuild_dataloader_for_resume(global_step)` to reset sampler/dataloader state on training resume, reconstructing epoch/skip bookkeeping from restored global_step, enabling correct sample ordering across interruptions.
Validation tests `tests/examples/diffusers/fastgen/test_vendored_migration.py`, `tests/examples/diffusers/fastgen/test_resume_dataloader.py`	Adds invariant tests for YAML vendored-builder targets, staged AutoModel file dispositions, and provenance headers. Includes dependency-guarded structural tests for dataloader/collate negative-prompt broadcasting, checkpointer subclassing/idempotent upgrade, recipe injection, and processor registration. Adds regression tests for training resume dataloader rebuilding, verifying correct sample serving across epoch boundaries and no-op behavior on fresh starts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1326: This PR's training resume and partial-load checkpointer changes extend the DMD2DiffusionRecipe framework and checkpoint handling introduced in that PR.

Suggested labels

cherry-pick-done

Suggested reviewers

shengliangxu
kevalmorabia97

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main objective: making the fastgen DMD2 Qwen-Image example self-contained on stock nemo_automodel by vendoring/wrapping upstream code.
Docstring Coverage	✅ Passed	Docstring coverage is 90.48% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	All NEW files in this PR pass security anti-pattern review: torch.load calls use weights_only=True; no numpy.load, hardcoded trust_remote_code, eval/exec on external input, or # nosec comments; all...
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch jingyux/fastgen-vendor-automodel-patches

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-11T20:44:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.22%. Comparing base (bcd8dd4) to head (3208921).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1688       +/-   ##
===========================================
+ Coverage   58.45%   76.22%   +17.76%     
===========================================
  Files         510      511        +1     
  Lines       56271    56339       +68     
===========================================
+ Hits        32896    42942    +10046     
+ Misses      23375    13397     -9978

Flag	Coverage Δ
unit	`54.34% <ø> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…collate) Reduce the migration's footprint per review: (1) delete the vendored mock_dataloader.py and the mock-data smoke config + its README/recipe references (the published quick-start was broken on stock anyway; real training does not use it); (2) delete the vendored base_video.py (video code, dead in an image-only example) and make the --list_processors display image-only; (3) rewrite collate_fns.py as thin wrappers over stock nemo_automodel (import the unpatched collate + sampler; re-add only the DMD2 delta: text_embeddings_mask + broadcast negative-prompt embedding) instead of a full ~440-line vendored copy, also dropping the dead video collate/builder. text_to_image_dataset.py stays a faithful vendored copy (its prompt_embeds_mask change is interleaved with cache loading). Default config + tests/licensing updated; ruff + env-independent checks pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

…quality) The preprocess/ tree is vendored verbatim from NVIDIA-NeMo/Automodel (Apache-2.0) and uses the upstream worker-global pattern (module-level 'X | None' set lazily in _init_worker), which trips mypy's strict union-attr/arg-type under examples.*. Add a scoped mypy override (ignore_errors, like tests.*) for examples.diffusers.fastgen.preprocess.* so the faithful copy is not annotated/diverged from upstream. The authored fastgen_data/ wrapper is unaffected (already mypy-clean). 36 errors in 2 files -> 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

Assign modelopt/torch/fastgen (DMD2 library) and /examples/diffusers/fastgen (this example) to @NVIDIA/modelopt-torch-fastgen-codeowners. Both are more-specific entries placed after their parent rules (modelopt/torch and /examples/diffusers), so GitHub's last-match-wins routes fastgen review to the fastgen team, matching the repo convention (e.g. modelopt/torch/quantization carving out of modelopt/torch). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/diffusers/fastgen/dmd2_recipe.py (1)

671-671: ⚠️ Potential issue | 🔴 Critical

Fix unsafe checkpoint deserialization (weights_only=False) in diffusers fastgen restore paths

examples/diffusers/fastgen/dmd2_recipe.py restores sidecar checkpoint payloads using torch.load(..., weights_only=False) at lines 671, 676, 689, and 698 with no inline justification. Update these loads to weights_only=True (or add an inline comment explaining why weights_only=False is safe).

Suggested patch

-            ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+            ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)
...
-            state = torch.load(state_path, map_location="cpu", weights_only=False)
+            state = torch.load(state_path, map_location="cpu", weights_only=True)
...
-                disc_state = torch.load(disc_path, map_location="cpu", weights_only=False)
+                disc_state = torch.load(disc_path, map_location="cpu", weights_only=True)
...
-                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False)
+                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/dmd2_recipe.py` at line 671, Summary: Unsafe
checkpoint deserialization uses torch.load(..., weights_only=False); change to
weights_only=True or justify why full deserialization is required. Locate the
torch.load calls that load sidecar checkpoints (e.g., the ema_state load
referenced by the symbol ema_state and variable ema_path in dmd2_recipe.py) and
update the argument weights_only=False to weights_only=True for all similar
restore paths (the other torch.load calls flagged in the review). If full object
deserialization truly is required, instead leave weights_only=False but add an
inline comment explaining why it is safe and unavoidable; otherwise prefer
weights_only=True to avoid executing arbitrary pickle payloads.

Source: Coding guidelines

🧹 Nitpick comments (1)

examples/diffusers/fastgen/preprocess/processors/base.py (1)
187-190: ⚡ Quick win

Move the NumPy import to module scope (or document why it must be deferred).

Line 187 imports numpy inside preprocess_image without an explicit circular/optional/heavy-import justification, so import errors surface at runtime instead of module load.

As per coding guidelines, imports should stay at file top unless there is a documented exception.
Proposed change
 from abc import ABC, abstractmethod
 from typing import Any
 
+import numpy as np
 import torch
 from PIL import Image
@@
     def preprocess_image(self, image: Image.Image) -> torch.Tensor:
@@
-        import numpy as np
-
         image_tensor = torch.from_numpy(np.array(image)).float() / 255.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/preprocess/processors/base.py` around lines 187 -
190, The numpy import is done inside the preprocessing code (where image_tensor
is created in preprocess_image), which causes import errors to appear at
runtime; move "import numpy as np" to module scope near the other imports at the
top of examples/diffusers/fastgen/preprocess/processors/base.py (or add a short
comment explaining a deliberate deferred import) so that preprocess_image (and
the image_tensor creation) uses a top-level np import; update any references in
preprocess_image and related functions to use the module-scoped np.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/diffusers/fastgen/fastgen_data/collate_fns.py`:
- Line 103: The code uses torch.load(path, map_location="cpu",
weights_only=False) to load a user-configurable negative_prompt_embedding_path,
which allows unsafe pickle loading; change the load to a safe mode and validate
the input: ensure the loader uses weights_only=True (i.e., torch.load(path,
map_location="cpu", weights_only=True)) or otherwise deserialize with a safe
format (e.g., torch.jit.load or a JSON/npz reader) and validate
negative_prompt_embedding_path exists and is from a trusted source before
loading; update the line creating payload (payload = torch.load(...)) and any
code that assumes pickled objects accordingly.

In `@examples/diffusers/fastgen/preprocess/processors/base.py`:
- Around line 208-210: The code currently dereferences
models["vae"].config.scaling_factor without ensuring the attribute exists;
update the getter so that after confirming "vae" is in models and models["vae"]
has a config, you also check that the config has a scaling_factor (e.g.,
hasattr(models["vae"].config, "scaling_factor") or getattr with default) and
only return it when present, otherwise return the fallback 0.18215; reference
the models dict, the "vae" entry, its config attribute, and the scaling_factor
property when implementing this guard.

In `@examples/diffusers/fastgen/preprocess/processors/caption_loaders.py`:
- Line 425: The import "from collections import defaultdict" is placed inside a
function in caption_loaders.py—move this import to the module top-level and
remove the in-function import; search for the in-function occurrence of "from
collections import defaultdict" (and any use of defaultdict) and add a single
top-of-file import statement instead so import errors surface at module load and
the function body no longer contains the import.

In `@examples/diffusers/fastgen/preprocess/processors/qwen_image.py`:
- Around line 184-191: The second value returned by pipeline.encode_prompt (the
prompt_embeds_mask) is being ignored in encode_prompt calls and therefore not
serialized; update the encode_prompt usage in qwen_image.py to capture both
outputs (e.g., prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(...))
and include prompt_embeds_mask in the returned/persisted dict alongside
prompt_embeds (apply the same .detach().cpu().to(torch.bfloat16) or appropriate
dtype/shape handling as done for prompt_embeds) so downstream TextToImageDataset
receives the real mask instead of falling back to an all-ones mask.
- Line 217: The autocast call inside verify_latent is using an invalid dtype for
CUDA (torch.float32); update the autocast usage in verify_latent (qwen_image.py)
to use a valid reduced-precision dtype such as torch.float16 or torch.bfloat16,
or remove the dtype argument so autocast picks the default, ensuring the with
torch.no_grad(), autocast(...) context no longer raises and causes false
verification failures.

In `@LICENSE`:
- Around line 226-228: Add a top-of-file vendored header to each Python file
listed (dmd2_finetune.py, dmd2_recipe.py, export_diffusers_qwen_image.py,
fastgen_checkpoint.py, fastgen_data/__init__.py, fastgen_data/collate_fns.py,
inference_dmd2_qwen_image.py, make_negative_prompt_embedding.py,
preprocess_qwen_image.py) that matches the LICENSE expectation: a single-line or
block comment beginning with "Vendored from" and including the original
NVIDIA-NeMo/Automodel source path and commit SHA (e.g., "Vendored from
NVIDIA-NeMo/Automodel: <original/path> @ <commit>"); place it at the very top of
each file before imports and include any short attribution or license snippet
required by the original file so each file contains the "Vendored from" string
referenced in LICENSE.

In `@tests/examples/diffusers/fastgen/test_vendored_migration.py`:
- Around line 157-163: Move the local import of inspect out of the test function
to a top-level import; update the module imports to include "import inspect" at
the module scope and remove the "import inspect" line from inside
test_data_builders_importable_and_accept_negative_prompt_path so the test uses
the module-level inspect symbol.

---

Outside diff comments:
In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Line 671: Summary: Unsafe checkpoint deserialization uses torch.load(...,
weights_only=False); change to weights_only=True or justify why full
deserialization is required. Locate the torch.load calls that load sidecar
checkpoints (e.g., the ema_state load referenced by the symbol ema_state and
variable ema_path in dmd2_recipe.py) and update the argument weights_only=False
to weights_only=True for all similar restore paths (the other torch.load calls
flagged in the review). If full object deserialization truly is required,
instead leave weights_only=False but add an inline comment explaining why it is
safe and unavoidable; otherwise prefer weights_only=True to avoid executing
arbitrary pickle payloads.

---

Nitpick comments:
In `@examples/diffusers/fastgen/preprocess/processors/base.py`:
- Around line 187-190: The numpy import is done inside the preprocessing code
(where image_tensor is created in preprocess_image), which causes import errors
to appear at runtime; move "import numpy as np" to module scope near the other
imports at the top of examples/diffusers/fastgen/preprocess/processors/base.py
(or add a short comment explaining a deliberate deferred import) so that
preprocess_image (and the image_tensor creation) uses a top-level np import;
update any references in preprocess_image and related functions to use the
module-scoped np.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a5a2507-bada-45a1-9617-9f3d3b0b7710

📥 Commits

Reviewing files that changed from the base of the PR and between dd49a46 and c7fc9bf.

📒 Files selected for processing (23)

.pre-commit-config.yaml
LICENSE
examples/diffusers/fastgen/README.md
examples/diffusers/fastgen/configs/dmd2_qwen_image.yaml
examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
examples/diffusers/fastgen/dmd2_finetune.py
examples/diffusers/fastgen/dmd2_recipe.py
examples/diffusers/fastgen/fastgen_checkpoint.py
examples/diffusers/fastgen/fastgen_data/__init__.py
examples/diffusers/fastgen/fastgen_data/collate_fns.py
examples/diffusers/fastgen/fastgen_data/text_to_image_dataset.py
examples/diffusers/fastgen/make_negative_prompt_embedding.py
examples/diffusers/fastgen/preprocess/__init__.py
examples/diffusers/fastgen/preprocess/preprocessing_multiprocess.py
examples/diffusers/fastgen/preprocess/processors/__init__.py
examples/diffusers/fastgen/preprocess/processors/base.py
examples/diffusers/fastgen/preprocess/processors/caption_loaders.py
examples/diffusers/fastgen/preprocess/processors/qwen_image.py
examples/diffusers/fastgen/preprocess/processors/registry.py
examples/diffusers/fastgen/preprocess_qwen_image.py
examples/diffusers/fastgen/requirements.txt
pyproject.toml
tests/examples/diffusers/fastgen/test_vendored_migration.py

💤 Files with no reviewable changes (1)

examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml

coderabbitai · 2026-06-11T23:06:35Z

+        if "vae" in models and hasattr(models["vae"], "config"):
+            return models["vae"].config.scaling_factor
+        return 0.18215  # Default for most models


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard scaling_factor existence before dereference.

Line 209 assumes models["vae"].config.scaling_factor always exists once config exists. If scaling_factor is missing, this raises instead of using the default fallback.

Proposed change

def get_vae_scaling_factor(self, models: dict[str, Any]) -> float: @@ - if "vae" in models and hasattr(models["vae"], "config"): - return models["vae"].config.scaling_factor + if "vae" in models and hasattr(models["vae"], "config"): + scaling_factor = getattr(models["vae"].config, "scaling_factor", None) + if scaling_factor is not None: + return scaling_factor return 0.18215 # Default for most models

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/diffusers/fastgen/preprocess/processors/base.py` around lines 208 - 210, The code currently dereferences models["vae"].config.scaling_factor without ensuring the attribute exists; update the getter so that after confirming "vae" is in models and models["vae"] has a config, you also check that the config has a scaling_factor (e.g., hasattr(models["vae"].config, "scaling_factor") or getattr with default) and only return it when present, otherwise return the fallback 0.18215; reference the models dict, the "vae" entry, its config attribute, and the scaling_factor property when implementing this guard.

coderabbitai · 2026-06-11T23:06:35Z

+            prompt_embeds, _ = pipeline.encode_prompt(
+                prompt=prompt,
+                device=device,
+            )
+
+        return {
+            "prompt_embeds": prompt_embeds.detach().cpu().to(torch.bfloat16),
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

prompt_embeds_mask is dropped before cache serialization.

Line 184 receives a second output from encode_prompt but discards it, and Lines 261-266 persist only prompt_embeds. Downstream, TextToImageDataset falls back to an all-ones mask when the mask key is absent, so the true sequence mask contract is lost.

Proposed change

with torch.no_grad(): - prompt_embeds, _ = pipeline.encode_prompt( + prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt( prompt=prompt, device=device, ) - return { + result = { "prompt_embeds": prompt_embeds.detach().cpu().to(torch.bfloat16), } + if prompt_embeds_mask is not None: + result["prompt_embeds_mask"] = prompt_embeds_mask.detach().cpu().to(torch.long) + return result @@ return { # Image latent "latent": latent, # Text embeddings "prompt_embeds": text_encodings["prompt_embeds"], + "prompt_embeds_mask": text_encodings.get("prompt_embeds_mask"), # Metadata "original_resolution": metadata["original_resolution"],

Also applies to: 261-266

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/diffusers/fastgen/preprocess/processors/qwen_image.py` around lines 184 - 191, The second value returned by pipeline.encode_prompt (the prompt_embeds_mask) is being ignored in encode_prompt calls and therefore not serialized; update the encode_prompt usage in qwen_image.py to capture both outputs (e.g., prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(...)) and include prompt_embeds_mask in the returned/persisted dict alongside prompt_embeds (apply the same .detach().cpu().to(torch.bfloat16) or appropriate dtype/shape handling as done for prompt_embeds) so downstream TextToImageDataset receives the real mask instead of falling back to an all-ones mask.

coderabbitai · 2026-06-11T23:06:35Z

+  Copyright (c) 2025 NVIDIA CORPORATION — portions of examples/diffusers/fastgen/ are
+    vendored from NVIDIA-NeMo/Automodel (https://github.com/NVIDIA-NeMo/Automodel); see the
+    per-file "Vendored from" headers for the exact source paths and commit.


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check that vendored files have "Vendored from" headers. # Expected: Each file starts with a comment containing "Vendored from" + source URL. vendored_files=( "examples/diffusers/fastgen/fastgen_data/text_to_image_dataset.py" "examples/diffusers/fastgen/preprocess/preprocessing_multiprocess.py" "examples/diffusers/fastgen/preprocess/processors/__init__.py" "examples/diffusers/fastgen/preprocess/processors/base.py" "examples/diffusers/fastgen/preprocess/processors/registry.py" "examples/diffusers/fastgen/preprocess/processors/caption_loaders.py" "examples/diffusers/fastgen/preprocess/processors/qwen_image.py" ) echo "Checking for 'Vendored from' headers in vendored files..." for file in "${vendored_files[@]}"; do if [ -f "$file" ]; then if head -20 "$file" | grep -q "Vendored from"; then echo "✓ $file: has Vendored from header" else echo "✗ $file: MISSING Vendored from header" fi else echo "⚠ $file: file not found" fi done

Repository: NVIDIA/Model-Optimizer

Length of output: 763

🏁 Script executed:

#!/bin/bash set -euo pipefail dir="examples/diffusers/fastgen" echo "Scanning for Python files under: $dir" mapfile -t files < <(find "$dir" -type f -name "*.py" | sort) echo "Found ${`#files`[@]} .py files." missing=() for f in "${files[@]}"; do if head -30 "$f" | rg -q "Vendored from"; then echo "✓ $f" else echo "✗ $f (missing 'Vendored from' in first 30 lines)" missing+=("$f") fi done if [ ${`#missing`[@]} -gt 0 ]; then echo echo "Missing count: ${`#missing`[@]}" printf '%s\n' "${missing[@]}" exit 1 else echo echo "All scanned Python files have 'Vendored from' headers." fi

Repository: NVIDIA/Model-Optimizer

Length of output: 201

🏁 Script executed:

#!/bin/bash set -euo pipefail dir="examples/diffusers/fastgen" echo "Scanning for Python files under: $dir" tmp="$(mktemp)" find "$dir" -type f -name "*.py" | sort > "$tmp" mapfile -t files < "$tmp" rm -f "$tmp" echo "Found ${`#files`[@]} .py files." missing=() for f in "${files[@]}"; do if head -40 "$f" | rg -q "Vendored from"; then echo "✓ $f" else echo "✗ $f (missing 'Vendored from' in first 40 lines)" missing+=("$f") fi done if [ ${`#missing`[@]} -gt 0 ]; then echo echo "Missing count: ${`#missing`[@]}" printf '%s\n' "${missing[@]}" exit 1 else echo echo "All scanned Python files have 'Vendored from' headers." fi

Repository: NVIDIA/Model-Optimizer

Length of output: 2025

🏁 Script executed:

#!/bin/bash set -euo pipefail dir="examples/diffusers/fastgen" tmp="$(mktemp)" find "$dir" -type f -name "*.py" | sort > "$tmp" mapfile -t files < "$tmp" rm -f "$tmp" echo "Checking where 'Vendored from' appears (first 200 lines + anywhere in file):" for f in "${files[@]}"; do first200_has=0 any_has=0 if head -200 "$f" | rg -q "Vendored from"; then first200_has=1 fi if rg -q "Vendored from" "$f"; then any_has=1 fi if [ "$any_has" -eq 1 ]; then if [ "$first200_has" -eq 1 ]; then echo "✓ $f (Vendored from in first 200 lines)" else echo "⚠ $f (Vendored from exists but after first 200 lines)" fi else echo "✗ $f (no 'Vendored from' anywhere)" fi done

Repository: NVIDIA/Model-Optimizer

Length of output: 1689

Add missing # Vendored from ... headers to vendored fastgen Python files

LICENSE points to per-file “Vendored from” headers, but these vendored files under examples/diffusers/fastgen/ contain no “Vendored from” string anywhere:

examples/diffusers/fastgen/dmd2_finetune.py

examples/diffusers/fastgen/dmd2_recipe.py

examples/diffusers/fastgen/export_diffusers_qwen_image.py

examples/diffusers/fastgen/fastgen_checkpoint.py

examples/diffusers/fastgen/fastgen_data/__init__.py

examples/diffusers/fastgen/fastgen_data/collate_fns.py

examples/diffusers/fastgen/inference_dmd2_qwen_image.py

examples/diffusers/fastgen/make_negative_prompt_embedding.py

examples/diffusers/fastgen/preprocess_qwen_image.py

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@LICENSE` around lines 226 - 228, Add a top-of-file vendored header to each Python file listed (dmd2_finetune.py, dmd2_recipe.py, export_diffusers_qwen_image.py, fastgen_checkpoint.py, fastgen_data/__init__.py, fastgen_data/collate_fns.py, inference_dmd2_qwen_image.py, make_negative_prompt_embedding.py, preprocess_qwen_image.py) that matches the LICENSE expectation: a single-line or block comment beginning with "Vendored from" and including the original NVIDIA-NeMo/Automodel source path and commit SHA (e.g., "Vendored from NVIDIA-NeMo/Automodel: <original/path> @ <commit>"); place it at the very top of each file before imports and include any short attribution or license snippet required by the original file so each file contains the "Vendored from" string referenced in LICENSE.

…e diffusers team Drop the /examples/diffusers/fastgen carve-out; @NVIDIA/modelopt-torch-fastgen-codeowners owns modelopt/torch/fastgen (the DMD2 library). The example stays under /examples/diffusers (@NVIDIA/modelopt-examples-diffusers-codeowners). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

Apply the safe, sensible CodeRabbit findings: (1) collate_fns.py loads the negative-prompt embedding with weights_only=True (it is a dict of plain tensors saved by make_negative_prompt_embedding.py, so it loads cleanly) — removes the pickle/RCE vector per SECURITY.md; (2) move 'from collections import defaultdict' (caption_loaders) and 'import inspect' (test) to module top; (3) drop the no-op autocast(dtype=torch.float32) in QwenImageProcessor.verify_latent — the latent is already .float(), so this is behavior-identical and removes the invalid-CUDA-autocast-dtype warning. Preprocessing/data only; no change to DMD2 training math. Intentionally not changed (would alter validated behavior or are false positives): not persisting prompt_embeds_mask in the cache (would switch the text-attention mask from the all-ones fallback the existing caches + running jobs use to a real mask — changes training math); the speculative scaling_factor guard (Qwen-Image VAE always defines it; vendored verbatim); 'Vendored from' headers on authored files (those correctly carry the standard NVIDIA SPDX header — LICENSE states only 'portions' are vendored). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

jingyu-ml · 2026-06-12T04:25:35Z

/ok to test 85ae40c

kevalmorabia97 · 2026-06-12T05:52:36Z

These are also nvidia authored and apache 2 license files so wr dont need to exclude in pre commit or have duplicate nvidia license. Let license hook auto add correct license everywhere

Good call — addressed in be832ae. The files copied from NVIDIA-NeMo/Automodel now carry only the standard NVIDIA SPDX header (managed by the insert-license hook): I dropped their pre-commit insert-license exclusion, removed the per-file Vendored from provenance note + the duplicated original-license block, and removed the NVIDIA-NeMo/Automodel note from LICENSE (the HuggingFace / MIT / BSD third-party notices are unchanged). The migration test now asserts the standard header + no duplicate license instead of requiring the provenance note.

Happy to keep a one-line adapted from note below the header on any of these if you'd prefer.

On resume the StatefulDataLoader's restored state does not advance past the resume point: re-checkpointing after a resume fails to capture progress, so every window re-served the same data slice and multi-window (e.g. SLURM time-limited) runs under-covered the dataset. Verified on production checkpoints (steps 999 / 12999 / 24999 all resume to the same sample) and a CPU probe. Only the data position is affected -- the step counter is restored correctly -- but that is the substantive part. The reliably-restored counter is global_step, so the new DMD2DiffusionRecipe._rebuild_dataloader_for_resume discards the stuck loader state on resume: it rebuilds a fresh StatefulDataLoader and skips the deterministic sampler to the position implied by global_step (epoch = global_step // epoch_len, skip = (global_step % epoch_len) * grad_acc). This is CPU-verified bit-identical to a no-resume run on the real dataloader, including across epoch boundaries, and also corrects the resumed epoch label and progress-bar position. It is intentionally not wrapped in try/except: the inputs are always present on this path, and silently falling back to the stuck loader would reinstate the bug. Add a SLURM-free, GPU-free CPU regression test (tests/examples/diffusers/fastgen/test_resume_dataloader.py) that drives the real SequentialBucketSampler + StatefulDataLoader through the recipe method and asserts the first sample served after a resume equals a clean no-resume run's sample at that global step (mid-epoch, epoch boundary, and cross-epoch), plus a fresh-start no-op check; it fails on the previous loader-state resume. DMD2 training math is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

…iew) Per review: the files copied from NVIDIA-NeMo/Automodel are NVIDIA-authored Apache-2.0, so they do not need the third-party "copying code" treatment. Drop the per-file "Vendored from" provenance note and the duplicated original-license block (each file now carries only the standard NVIDIA SPDX header that the insert-license hook manages), remove their insert-license pre-commit exclusion, and remove the NVIDIA-NeMo/Automodel note from LICENSE's third-party section (the genuine HuggingFace / MIT / BSD notices are unchanged). Update the migration test to assert the standard header + no provenance / no duplicate license instead of requiring the provenance note. Headers and config only -- no code or behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/diffusers/fastgen/dmd2_recipe.py (1)

720-727: ⚠️ Potential issue | 🔴 Critical

Use safe deserialization for sidecar checkpoint restores.

The recipe saves and restores checkpoint sidecars (ema_shadow.pt, dmd_state.pt, discriminator.pt, discriminator_optimizer.pt) containing only state dicts and scalars. These should use weights_only=True since they contain no arbitrary Python objects.

-            ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+            ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)
@@ Line 726
-            state = torch.load(state_path, map_location="cpu", weights_only=False)
+            state = torch.load(state_path, map_location="cpu", weights_only=True)
@@ Line 739
-                disc_state = torch.load(disc_path, map_location="cpu", weights_only=False)
+                disc_state = torch.load(disc_path, map_location="cpu", weights_only=True)
@@ Line 748
-                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False)
+                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/dmd2_recipe.py` around lines 720 - 727, The
torch.load() calls for loading EMA state (ema_path) and iteration state
(state_path) are using weights_only=False, but these sidecar checkpoint files
only contain state dicts and scalars with no arbitrary Python objects. Change
weights_only=False to weights_only=True in both torch.load() calls to enable
safe deserialization for these sidecar checkpoint restores.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/examples/diffusers/fastgen/test_resume_dataloader.py`:
- Around line 95-101: Move the optional dependency guards and imports from
inside the test functions to module scope at the top of the file. The imports
for torch, nemo_automodel, torchdata, SequentialBucketSampler,
StatefulDataLoader, and dmd2_recipe (wrapped in pytest.importorskip calls)
should be relocated to module-level after any sys.path setup, rather than being
duplicated inside test bodies. This ensures import errors surface at
collection/skip time rather than mid-test execution, following the guideline
that imports belong at the top of the file.

---

Outside diff comments:
In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Around line 720-727: The torch.load() calls for loading EMA state (ema_path)
and iteration state (state_path) are using weights_only=False, but these sidecar
checkpoint files only contain state dicts and scalars with no arbitrary Python
objects. Change weights_only=False to weights_only=True in both torch.load()
calls to enable safe deserialization for these sidecar checkpoint restores.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38a101c0-2324-42d2-a207-98a20b45964d

📥 Commits

Reviewing files that changed from the base of the PR and between 85ae40c and 6ffbc52.

📒 Files selected for processing (2)

examples/diffusers/fastgen/dmd2_recipe.py
tests/examples/diffusers/fastgen/test_resume_dataloader.py

coderabbitai · 2026-06-15T23:24:32Z

+    pytest.importorskip("torch")
+    pytest.importorskip("nemo_automodel")
+    pytest.importorskip("torchdata")
+    from nemo_automodel.components.datasets.diffusion.sampler import SequentialBucketSampler
+    from torchdata.stateful_dataloader import StatefulDataLoader
+
+    dmd2_recipe = pytest.importorskip("dmd2_recipe")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move optional dependency guards/imports to module scope.

These imports run inside test bodies and are duplicated, so import failures surface mid-test instead of at collection/skip time. Keep the pytest.importorskip guards, but hoist them after the sys.path setup. As per coding guidelines, “Imports belong at the top of the file so import errors surface at collection time, not mid-test.”

🧪 Proposed fix

if str(_FASTGEN_DIR) not in sys.path: sys.path.insert(0, str(_FASTGEN_DIR)) +pytest.importorskip("torch") +_sampler_mod = pytest.importorskip("nemo_automodel.components.datasets.diffusion.sampler") +_stateful_dataloader_mod = pytest.importorskip("torchdata.stateful_dataloader") +dmd2_recipe = pytest.importorskip("dmd2_recipe") + +SequentialBucketSampler = _sampler_mod.SequentialBucketSampler +StatefulDataLoader = _stateful_dataloader_mod.StatefulDataLoader + _N = 20 # samples (== batches, batch_size 1) per epoch in the synthetic dataset @@ - pytest.importorskip("torch") - pytest.importorskip("nemo_automodel") - pytest.importorskip("torchdata") - from nemo_automodel.components.datasets.diffusion.sampler import SequentialBucketSampler - from torchdata.stateful_dataloader import StatefulDataLoader - - dmd2_recipe = pytest.importorskip("dmd2_recipe") # The reset logs only on the main process; force True off the distributed path. monkeypatch.setattr(dmd2_recipe, "is_main_process", lambda: True, raising=False) @@ - pytest.importorskip("torch") - pytest.importorskip("nemo_automodel") - pytest.importorskip("torchdata") - from nemo_automodel.components.datasets.diffusion.sampler import SequentialBucketSampler - from torchdata.stateful_dataloader import StatefulDataLoader - - dmd2_recipe = pytest.importorskip("dmd2_recipe") monkeypatch.setattr(dmd2_recipe, "is_main_process", lambda: True, raising=False)

Also applies to: 140-146

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/examples/diffusers/fastgen/test_resume_dataloader.py` around lines 95 - 101, Move the optional dependency guards and imports from inside the test functions to module scope at the top of the file. The imports for torch, nemo_automodel, torchdata, SequentialBucketSampler, StatefulDataLoader, and dmd2_recipe (wrapped in pytest.importorskip calls) should be relocated to module-level after any sys.path setup, rather than being duplicated inside test bodies. This ensures import errors surface at collection/skip time rather than mid-test execution, following the guideline that imports belong at the top of the file.

Source: Coding guidelines

kevalmorabia97 · 2026-06-16T03:19:30Z

/ok to test 76bd04a

The resume fix re-assigned `self.dataloader` to a fresh StatefulDataLoader, but `dataloader` is already a tracked state key, so BaseRecipe.__setattr__ raised `RuntimeError: State key 'dataloader' is already tracked` and crashed every rank right after restoring the checkpoint (observed on a 128-GPU resume from epoch_0_step_199). Assign via `self.__dict__["dataloader"] = ...` instead: it updates the attribute while leaving it tracked (the existing `__state_tracked` entry is unchanged, so the rebuilt loader is still checkpointed), matching the pattern the recipe already uses for its other extras. Modelopt-side only -- no Automodel change. Harden the regression test to drive `_rebuild_dataloader_for_resume` through a real `DMD2DiffusionRecipe` instance (`object.__new__`) so BaseRecipe.__setattr__ state-tracking is exercised. The previous SimpleNamespace stub had no `__setattr__`, so it never hit the "already tracked" guard and missed this crash. Verified the old line raises the exact error and the `__dict__` assignment does not (and keeps `dataloader` tracked). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

jingyu-ml and others added 11 commits June 10, 2026 17:41

jingyu-ml added the cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Jun 11, 2026

jingyu-ml self-assigned this Jun 11, 2026

jingyu-ml marked this pull request as ready for review June 11, 2026 22:48

jingyu-ml requested review from a team as code owners June 11, 2026 22:48

jingyu-ml requested a review from kevalmorabia97 June 11, 2026 22:48

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

jingyu-ml and others added 2 commits June 11, 2026 16:09

kevalmorabia97 reviewed Jun 12, 2026

View reviewed changes

kevalmorabia97 requested changes Jun 12, 2026

View reviewed changes

jingyu-ml requested a review from a team as a code owner June 15, 2026 23:15

jingyu-ml requested review from ChenhanYu and chochowski June 15, 2026 23:15

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

jingyu-ml requested a review from kevalmorabia97 June 15, 2026 23:25

kevalmorabia97 removed the request for review from chochowski June 16, 2026 03:16

Merge branch 'main' into jingyux/fastgen-vendor-automodel-patches

76bd04a

kevalmorabia97 requested a review from a team June 16, 2026 03:19

Conversation

jingyu-ml commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Update (post-review)

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 11, 2026

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jingyu-ml commented Jun 12, 2026

Uh oh!

kevalmorabia97 Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

jingyu-ml Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

kevalmorabia97 commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jingyu-ml commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 11, 2026 •

edited

Loading