Skip to content

fastgen DMD2: make the Qwen-Image example self-contained on stock nemo_automodel#1688

Open
jingyu-ml wants to merge 20 commits into
mainfrom
jingyux/fastgen-vendor-automodel-patches
Open

fastgen DMD2: make the Qwen-Image example self-contained on stock nemo_automodel#1688
jingyu-ml wants to merge 20 commits into
mainfrom
jingyux/fastgen-vendor-automodel-patches

Conversation

@jingyu-ml

@jingyu-ml jingyu-ml commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: new example / refactor (example self-containment)

The published examples/diffusers/fastgen DMD2 Qwen-Image distillation example previously relied on
local, unpublished modifications to the sibling nemo_automodel package (data path, collate, a
partial-load checkpointer, and Qwen-Image preprocessing). An external user running the published
Model-Optimizer example against official stock nemo_automodel would hit import/attribute errors.

This PR makes the example self-contained on stock nemo_automodel>=0.4.0 — with the changes kept
as small as possible: the team's actual AutoModel delta is only ~430 lines, so wherever the
upstream module is importable, the delta is expressed as a thin subclass/wrapper rather than a
full-file copy.

  • fastgen_data/ — the DMD2 data path:
    • collate_fns.pythin wrappers over stock: import the unpatched upstream collate +
      bucket sampler and re-add only the DMD2 delta (text_embeddings_mask, stacked from the
      per-item mask the stock collate omits; and an optional broadcast negative_text_embeddings
      for CFG). The builder loads an optional negative_prompt_embedding_path.
    • text_to_image_dataset.py — a faithful vendored copy of the upstream reader (its
      prompt_embeds_mask emission is interleaved with cache loading, so wrapping it would force a
      redundant per-item torch.load; carried verbatim instead).
  • fastgen_checkpoint.pyPartialLoadCheckpointer(Checkpointer) that overrides only
    load_optimizer (FSDP2 DefaultLoadPlanner(allow_partial_load=True)) so optimizer resume works
    without patching upstream. Injected via an in-place re-bless of self.checkpointer in the
    recipe's load_checkpoint (model-state load stays strict).
  • preprocess/ — Qwen-Image preprocessing (preprocessing_multiprocess.py + processors/),
    trimmed to the image path (drops the flux/wan/hunyuan processors and the video base class).
    It lives in AutoModel's top-level tools/ tree, which is not shipped in the pip package, so
    it cannot be wrapped and is vendored; MultiTierBucketCalculator is imported from stock upstream.
  • make_negative_prompt_embedding.py — generates the optional CFG negative-prompt embedding.
  • All configs/*.yaml target fastgen_data.build_* (a test enumerates every config).
  • Licensing: the AutoModel-copied files are NVIDIA-authored Apache-2.0, so they carry only the
    standard NVIDIA SPDX header (managed by the insert-license hook) — no per-file provenance note,
    no duplicated license, no pre-commit exclusion, and no separate LICENSE note.
    nemo_automodel[diffusion] version bound in requirements.txt.

The DMD2 math in modelopt/torch/fastgen/ is unchanged — only example/training-time glue moved.

Usage

# Install example deps (stock nemo_automodel) from a source checkout
pip install -r examples/diffusers/fastgen/requirements.txt

# Build the training cache from raw images (Qwen-Image VAE latents + text embeddings)
python examples/diffusers/fastgen/preprocess_qwen_image.py image \
    --image_dir <raw images> --output_dir <cache dir> --processor qwen_image \
    --caption_format meta_json

# Generate the CFG negative-prompt embedding once
python examples/diffusers/fastgen/make_negative_prompt_embedding.py \
    --output <cache dir>/negative_prompt_embedding.pt

# Point the config's data.dataloader.cache_dir + negative_prompt_embedding_path at the cache, then train.

See examples/diffusers/fastgen/README.md → "Requirements & self-contained data path".

Testing

  • New tests/examples/diffusers/fastgen/test_vendored_migration.py: environment-independent
    invariants (every config targets a vendored builder; no tools.* imports; each former AutoModel
    patch is vendored / wrapped / a documented exclusion; the former-vendored files carry the standard NVIDIA SPDX header, no provenance note or duplicate license) plus
    dependency-guarded structural tests (the collate wrapper emits the batch contract + broadcasts the
    negative embedding; the builder accepts negative_prompt_embedding_path; the checkpointer
    overrides only load_optimizer; the Qwen-Image processor self-registers).
  • Validated 9/9 against a pure stock nemo_automodel 0.4.0 worktree (none of the local patches
    present) via SLURM — re-run after this slim-down.
  • The migrated code path is exercised by a live multi-GPU DMD2 run that resumed from a checkpoint
    through the vendored PartialLoadCheckpointer.
  • ruff check + ruff format --check clean on all changed files.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (additive; bundled configs target the vendored builders)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ (NeMo-AutoModel @ e42584e3, Apache-2.0; per review these NVIDIA-authored files carry the standard NVIDIA SPDX header, no separate provenance / LICENSE note; nemo_automodel was already a dependency)
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A (example-only change)
  • Did you get Claude approval on this PR?: ❌ (pending — opened as draft)

Additional Information

Opened as a draft pending: OSRB review of the vendored NeMo-AutoModel (Apache-2.0) code, CI green,
and (optional) a multi-GPU smoke-train + resume on stock upstream. Vendored from
NVIDIA-NeMo/Automodel at commit e42584e3.

Update (post-review)

  • Mid-run resume data-correctness fix (6ffbc52c9): on resume the StatefulDataLoader's restored state did not advance past the resume point, so each window re-served the same data slice and multi-window (SLURM-windowed) runs under-covered the dataset. Fixed by rebuilding a fresh loader and skipping the deterministic sampler to the position implied by global_step; added a SLURM-free, GPU-free CPU regression test (tests/examples/diffusers/fastgen/test_resume_dataloader.py).
  • Licensing review (be832ae95): the AutoModel-copied files are NVIDIA-authored Apache-2.0, so they now carry only the standard NVIDIA SPDX header — dropped the per-file provenance note, the duplicated original-license block, the insert-license pre-commit exclusion, and the LICENSE note.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added vendored Qwen‑Image preprocessing (multi-process) with processor registry support.
    • Updated DMD2 data loading with a dedicated dataset/collation pipeline, including negative-prompt embedding/mask handling.
    • Improved training resume behavior by rebuilding dataloader state and making checkpoint optimizer restore tolerant of partial FSDP2 optimizer shards.
  • Documentation
    • Refreshed the fastgen README and config notes for real-data training; removed the prior mock-data smoke workflow.
  • Tests
    • Added regression and migration tests covering vendored wiring, collate/dataloader contracts, processor registration, and resume/checkpoint behavior.
  • Chores
    • Updated licenses/attribution, vendoring/tooling guards, requirements pinning, linting configuration, and repository ownership rules.

jingyu-ml and others added 11 commits June 10, 2026 17:41
…n-time self-contained)

Make the published DMD2 Qwen-Image example run against STOCK upstream nemo_automodel by
removing its dependence on local AutoModel modifications (staged patches #1-5):

- Vendor the patched data files into examples/diffusers/fastgen/fastgen_data/
  (collate_fns, text_to_image_dataset, mock_dataloader), importing the UNPATCHED upstream
  helpers (sampler, base_dataset, text_to_video_dataset) from the stock nemo_automodel
  package. Provenance headers per CONTRIBUTING (source + commit e42584e3, Apache-2.0).
- Repoint the published configs' dataloader _target_ to fastgen_data.build_* and add an
  explicit sys.path seam + startup resolution logging in dmd2_finetune.py (source-checkout).
- Add fastgen_checkpoint.PartialLoadCheckpointer (overrides only load_optimizer with
  allow_partial_load=True) and inject it in dmd2_recipe.load_checkpoint, so both the student
  and fake-score optimizer restores tolerate FSDP2 partial shards without a patched AutoModel.

No change to modelopt/torch/fastgen/ (DMD2 math). Preprocessing vendoring + tests + the
removability audit follow in subsequent commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Bound nemo_automodel[diffusion] to the tested range (>=0.4.0,<1.0; 0.4.0 == the validated
commit e42584e3) and add a runtime soft-guard in fastgen_data/__init__.py that converts a
missing unpatched-upstream-helper ImportError into an actionable message naming the
supported range. Supports AC-7 (bounded version) and AC-1 (clear failure on drift).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…ined, no tools/ tree)

Vendor the AutoModel preprocessing stack needed to build the VAE+text-embed cache from raw
images, trimmed to the Qwen-Image path:
- preprocess/processors/{base,base_video,registry,caption_loaders,qwen_image}.py + a trimmed
  processors/__init__.py that drops flux/wan/hunyuan (and their deps, e.g.
  nemo_automodel.shared.transformers_patches). The Qwen-Image processor self-registers.
- preprocess/preprocessing_multiprocess.py (driver): the `from tools.diffusion.processors`
  import is rewritten to the vendored `.processors` package; MultiTierBucketCalculator is
  imported from the stock nemo_automodel package (it IS packaged). Image subcommand
  --processor defaults to qwen_image.
- preprocess_qwen_image.py launcher mirroring dmd2_finetune.py's sys.path seam.

Removes the example's dependence on AutoModel staged patches #7-8 and on the un-packaged
AutoModel tools/ tree. Provenance headers per CONTRIBUTING. No tools.* imports remain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…per CONTRIBUTING)

The 10 vendored files (fastgen_data/{collate_fns,text_to_image_dataset,mock_dataloader}.py and
preprocess/{preprocessing_multiprocess,processors/*}.py) carry the original NeMo-AutoModel
Apache-2.0 header plus a "Vendored from ... @ e42584e3" provenance note. Add them to the
insert-license-py exclude so the hook does not prepend a second NVIDIA header on top.

LICENSE Third-Party Software Notices is intentionally NOT modified: every vendored file is
NVIDIA-copyright Apache-2.0 (same entity and license as this repo, from the sibling
NVIDIA-NeMo/Automodel repo), so there is no external third-party copyright holder to list;
the per-file provenance headers are the applicable record. OSRB sign-off remains a maintainer
release gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Add examples/diffusers/fastgen/tests/test_vendored_migration.py with two tiers:
- Environment-independent (run anywhere): published configs repointed to fastgen_data.*,
  no tools.* imports in vendored code, the AC-8 removability coverage audit (all nine staged
  AutoModel files vendored-or-excluded), and provenance headers on the 10 vendored files.
- Dependency-guarded (importorskip nemo_automodel/torch, run in the container): data builders
  accept negative_prompt_embedding_path, collate emits the batch contract + broadcasts the
  negative embed, PartialLoadCheckpointer overrides only load_optimizer (model load stays
  strict) + the recipe injects it, and the qwen_image processor self-registers.

The four environment-independent tests pass on CPU; the GPU/container end-to-end positive
checks (smoke train, FSDP2 resume, preprocessing on real images) run via SLURM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Add a "Requirements & self-contained data path" section: runs on stock nemo_automodel
(>=0.4.0,<1.0) from a source checkout; data loading (fastgen_data/) and preprocessing
(preprocess/) are vendored so no nemo_automodel modifications are needed; how to build the
cache via preprocess_qwen_image.py. Completes the train-time + preprocessing migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…style

Round 1 closes the login-node-actionable AC-7 gaps Codex flagged:
- LICENSE: record the NeMo-AutoModel vendoring under Third-Party Software Notices (Apache-2.0).
- Vendored files: add the NVIDIA SPDX header after the original AutoModel license, per
  CONTRIBUTING "Copying code from other sources" (matches modelopt/.../calib_utils.py) — all 10.
- Runtime guard: fastgen_data/__init__.py soft-warns when installed nemo_automodel is outside the
  tested >=0.4.0,<1.0 range; preprocess/__init__.py guards the multi_tier_bucketing import with an
  actionable message.
- Tests: drop plan workflow markers (AC-/Phase/Step) from comments per the plan's code-style note.

Not changed (deliberate): the collate prompt_embeds_mask -> text_embeddings_mask mapping is
byte-identical to the patched AutoModel (faithful vendor; changing it would alter current training
behavior). GPU/SLURM end-to-end validation (smoke/resume/preprocessing/revert-all-nine) remains
deferred to the container environment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
… tests

Address the Round 1 review's actionable gaps:
- Add make_negative_prompt_embedding.py: generates the CFG negative_prompt_embedding.pt the
  canonical config requires, reusing the vendored QwenImageProcessor's text encoder (default
  negative prompt ""), saved in the dataloader's payload format {"embed", "mask"}. Closes the
  AC-5/AC-8 self-containment gap (preprocessing previously emitted only per-image caches, so a
  from-scratch user had no negative-prompt embedding to point the config at). Documented in README.
- Move migration tests to tests/examples/diffusers/fastgen/ (the repo's example CI lane) and fix
  the example-dir resolution from the new location.
- Strengthen the provenance test to assert the NVIDIA SPDX header presence + ordering (after the
  original license), not just the "Vendored from" note.
- Update stale "python -m tools.diffusion..." invocation strings in the vendored driver docs to the
  preprocess_qwen_image.py launcher.

GPU/SLURM end-to-end execution remains environment-deferred. No modelopt/torch/fastgen changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…ig test

Address the Round 2 review:
- Every real-data config on disk now targets fastgen_data.build_text_to_image_multiresolution_dataloader.
  The 2 committed configs (dmd2_qwen_image.yaml, smoke) were already repointed; the 6 local-only
  experiment configs (1M_shift3, df0.7, 320k, 320k_shift3, 8steps, 8steps_320k — git-excluded, with
  hardcoded /lustre data paths) are repointed on disk so the user's own runs survive deleting the
  AutoModel patches. (They remain excluded from the published diff.)
- The config test now enumerates configs/*.yaml and fails on ANY upstream dataloader target, so a
  new config cannot silently reintroduce the upstream dependence. Robust to whatever set ships
  (2 in a fresh clone, 8 on the user's disk).
- Assert the builder's negative_prompt_embedding_path defaults to None (CFG-less is optional).

GPU/SLURM end-to-end execution remains environment-deferred. No modelopt/torch/fastgen changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
…ntract

The stock-upstream SLURM validation (8/9 pass) surfaced a test-fixture bug, not a migration bug:
test_collate_emits_contract_keys_and_broadcasts_negative_prompt built a fake sample with only
{latent, prompt_embeds, prompt_embeds_mask}, but collate_fn_production also requires
crop_resolution / original_resolution / crop_offset / prompt / image_path / bucket_id /
aspect_ratio (the keys TextToImageDataset emits). Add them so the test exercises the real
collate path. Vendored code unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Apply the repo ruff config to the vendored data path and preprocessing files: ruff-format + autofixes (pyupgrade py310, isort) plus manual fixes for E402 (logger placement), E501 (provenance comment wrap), N806 (lowercase dims in the video-only 5D shape validator), SIM103, RUF022 (sorted __all__), C401, PERF401. No behavioral change: correctness-critical imports (recipe checkpointer injection, re-exports, # noqa: F401 side-effect imports, MultiTierBucketCalculator, DefaultLoadPlanner) and all provenance/SPDX headers preserved. ruff check + format --check clean; env-independent migration tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Vendored NeMo-AutoModel preprocessing and data loaders into the fastgen example, rewired configs/scripts to use local builders, added FSDP2-tolerant optimizer restore and training resume via dataloader state reconstruction, implemented preprocessing and negative-prompt tooling, and added tests validating the complete migration.

Changes

FastGen DMD2 NeMo-AutoModel Vendoring & Integration

Layer / File(s) Summary
License, configuration, and documentation updates
.pre-commit-config.yaml, LICENSE, examples/diffusers/fastgen/README.md, examples/diffusers/fastgen/requirements.txt, pyproject.toml, .github/CODEOWNERS
Updated Apache-2 attribution for vendored NVIDIA code, excluded fastgen files from automatic license insertion, extended README with vendored preprocessing and data-cache build instructions, pinned nemo_automodel[diffusion] to >=0.4.0,<1.0, added MyPy overrides for vendored preprocess code, and added CODEOWNERS entry.
Processor contracts, registry, and Qwen-Image processor
examples/diffusers/fastgen/preprocess/__init__.py, examples/diffusers/fastgen/preprocess/processors/{base,caption_loaders,qwen_image,registry}.py, examples/diffusers/fastgen/preprocess/processors/__init__.py
Adds import guards for vendor dependencies, BaseModelProcessor abstraction defining encode/verify/cache contracts, caption loader implementations (sidecar/meta-JSON/JSONL), ProcessorRegistry for runtime processor discovery, and QwenImageProcessor implementing Qwen-Image model loading, image/text encoding, latent verification, and torch.save-ready cache payload construction.
Multiprocess preprocessing orchestration and launcher
examples/diffusers/fastgen/preprocess/preprocessing_multiprocess.py, examples/diffusers/fastgen/preprocess_qwen_image.py
Implements multiprocess image/video preprocessing with per-GPU worker state, media discovery, bucketing, sharded metadata output, optional verification, CLI subcommands for image/video modes, and a dedicated Qwen-image launcher entrypoint.
Vendored data package, dataset, collate, and CLI tools
examples/diffusers/fastgen/fastgen_data/{__init__,collate_fns,text_to_image_dataset}.py, examples/diffusers/fastgen/make_negative_prompt_embedding.py, examples/diffusers/fastgen/configs/dmd2_qwen_image.yaml, examples/diffusers/fastgen/dmd2_finetune.py
Adds fastgen_data package with import guards and upstream-version warnings, TextToImageDataset for cached .pt samples, collate_fn_text_to_image with negative-prompt broadcasting and mask reconstruction, multiresolution dataloader builder, negative-prompt embedding generator script, and rewires config/finetune defaults to use vendored builders.
FSDP2-tolerant checkpoint restore and training resume
examples/diffusers/fastgen/fastgen_checkpoint.py, examples/diffusers/fastgen/dmd2_recipe.py
Introduces PartialLoadCheckpointer subclass that loads optimizer state via torch.distributed.checkpoint with DefaultLoadPlanner(allow_partial_load=True) to handle FSDP2 partial-shard states. Adds make_optimizer_partial_load_tolerant() helper to upgrade checkpointer instances in-place (idempotently) and injects the upgrade call into recipe load_checkpoint. Implements _rebuild_dataloader_for_resume(global_step) to reset sampler/dataloader state on training resume, reconstructing epoch/skip bookkeeping from restored global_step, enabling correct sample ordering across interruptions.
Validation tests
tests/examples/diffusers/fastgen/test_vendored_migration.py, tests/examples/diffusers/fastgen/test_resume_dataloader.py
Adds invariant tests for YAML vendored-builder targets, staged AutoModel file dispositions, and provenance headers. Includes dependency-guarded structural tests for dataloader/collate negative-prompt broadcasting, checkpointer subclassing/idempotent upgrade, recipe injection, and processor registration. Adds regression tests for training resume dataloader rebuilding, verifying correct sample serving across epoch boundaries and no-op behavior on fresh starts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1326: This PR's training resume and partial-load checkpointer changes extend the DMD2DiffusionRecipe framework and checkpoint handling introduced in that PR.

Suggested labels

cherry-pick-done

Suggested reviewers

  • shengliangxu
  • kevalmorabia97
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main objective: making the fastgen DMD2 Qwen-Image example self-contained on stock nemo_automodel by vendoring/wrapping upstream code.
Docstring Coverage ✅ Passed Docstring coverage is 90.48% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed All NEW files in this PR pass security anti-pattern review: torch.load calls use weights_only=True; no numpy.load, hardcoded trust_remote_code, eval/exec on external input, or # nosec comments; all...
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jingyux/fastgen-vendor-automodel-patches

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.22%. Comparing base (bcd8dd4) to head (3208921).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1688       +/-   ##
===========================================
+ Coverage   58.45%   76.22%   +17.76%     
===========================================
  Files         510      511        +1     
  Lines       56271    56339       +68     
===========================================
+ Hits        32896    42942    +10046     
+ Misses      23375    13397     -9978     
Flag Coverage Δ
unit 54.34% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…collate)

Reduce the migration's footprint per review: (1) delete the vendored mock_dataloader.py and the mock-data smoke config + its README/recipe references (the published quick-start was broken on stock anyway; real training does not use it); (2) delete the vendored base_video.py (video code, dead in an image-only example) and make the --list_processors display image-only; (3) rewrite collate_fns.py as thin wrappers over stock nemo_automodel (import the unpatched collate + sampler; re-add only the DMD2 delta: text_embeddings_mask + broadcast negative-prompt embedding) instead of a full ~440-line vendored copy, also dropping the dead video collate/builder. text_to_image_dataset.py stays a faithful vendored copy (its prompt_embeds_mask change is interleaved with cache loading). Default config + tests/licensing updated; ruff + env-independent checks pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@jingyu-ml jingyu-ml added the cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Jun 11, 2026
@jingyu-ml jingyu-ml self-assigned this Jun 11, 2026
…quality)

The preprocess/ tree is vendored verbatim from NVIDIA-NeMo/Automodel (Apache-2.0) and uses the upstream worker-global pattern (module-level 'X | None' set lazily in _init_worker), which trips mypy's strict union-attr/arg-type under examples.*. Add a scoped mypy override (ignore_errors, like tests.*) for examples.diffusers.fastgen.preprocess.* so the faithful copy is not annotated/diverged from upstream. The authored fastgen_data/ wrapper is unaffected (already mypy-clean). 36 errors in 2 files -> 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@jingyu-ml jingyu-ml marked this pull request as ready for review June 11, 2026 22:48
@jingyu-ml jingyu-ml requested review from a team as code owners June 11, 2026 22:48
@jingyu-ml jingyu-ml requested a review from kevalmorabia97 June 11, 2026 22:48
Assign modelopt/torch/fastgen (DMD2 library) and /examples/diffusers/fastgen (this example) to @NVIDIA/modelopt-torch-fastgen-codeowners. Both are more-specific entries placed after their parent rules (modelopt/torch and /examples/diffusers), so GitHub's last-match-wins routes fastgen review to the fastgen team, matching the repo convention (e.g. modelopt/torch/quantization carving out of modelopt/torch).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/diffusers/fastgen/dmd2_recipe.py (1)

671-671: ⚠️ Potential issue | 🔴 Critical

Fix unsafe checkpoint deserialization (weights_only=False) in diffusers fastgen restore paths

examples/diffusers/fastgen/dmd2_recipe.py restores sidecar checkpoint payloads using torch.load(..., weights_only=False) at lines 671, 676, 689, and 698 with no inline justification. Update these loads to weights_only=True (or add an inline comment explaining why weights_only=False is safe).

Suggested patch
-            ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+            ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)
...
-            state = torch.load(state_path, map_location="cpu", weights_only=False)
+            state = torch.load(state_path, map_location="cpu", weights_only=True)
...
-                disc_state = torch.load(disc_path, map_location="cpu", weights_only=False)
+                disc_state = torch.load(disc_path, map_location="cpu", weights_only=True)
...
-                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False)
+                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/dmd2_recipe.py` at line 671, Summary: Unsafe
checkpoint deserialization uses torch.load(..., weights_only=False); change to
weights_only=True or justify why full deserialization is required. Locate the
torch.load calls that load sidecar checkpoints (e.g., the ema_state load
referenced by the symbol ema_state and variable ema_path in dmd2_recipe.py) and
update the argument weights_only=False to weights_only=True for all similar
restore paths (the other torch.load calls flagged in the review). If full object
deserialization truly is required, instead leave weights_only=False but add an
inline comment explaining why it is safe and unavoidable; otherwise prefer
weights_only=True to avoid executing arbitrary pickle payloads.

Source: Coding guidelines

🧹 Nitpick comments (1)
examples/diffusers/fastgen/preprocess/processors/base.py (1)

187-190: ⚡ Quick win

Move the NumPy import to module scope (or document why it must be deferred).

Line 187 imports numpy inside preprocess_image without an explicit circular/optional/heavy-import justification, so import errors surface at runtime instead of module load.

As per coding guidelines, imports should stay at file top unless there is a documented exception.

Proposed change
 from abc import ABC, abstractmethod
 from typing import Any
 
+import numpy as np
 import torch
 from PIL import Image
@@
     def preprocess_image(self, image: Image.Image) -> torch.Tensor:
@@
-        import numpy as np
-
         image_tensor = torch.from_numpy(np.array(image)).float() / 255.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/preprocess/processors/base.py` around lines 187 -
190, The numpy import is done inside the preprocessing code (where image_tensor
is created in preprocess_image), which causes import errors to appear at
runtime; move "import numpy as np" to module scope near the other imports at the
top of examples/diffusers/fastgen/preprocess/processors/base.py (or add a short
comment explaining a deliberate deferred import) so that preprocess_image (and
the image_tensor creation) uses a top-level np import; update any references in
preprocess_image and related functions to use the module-scoped np.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/diffusers/fastgen/fastgen_data/collate_fns.py`:
- Line 103: The code uses torch.load(path, map_location="cpu",
weights_only=False) to load a user-configurable negative_prompt_embedding_path,
which allows unsafe pickle loading; change the load to a safe mode and validate
the input: ensure the loader uses weights_only=True (i.e., torch.load(path,
map_location="cpu", weights_only=True)) or otherwise deserialize with a safe
format (e.g., torch.jit.load or a JSON/npz reader) and validate
negative_prompt_embedding_path exists and is from a trusted source before
loading; update the line creating payload (payload = torch.load(...)) and any
code that assumes pickled objects accordingly.

In `@examples/diffusers/fastgen/preprocess/processors/base.py`:
- Around line 208-210: The code currently dereferences
models["vae"].config.scaling_factor without ensuring the attribute exists;
update the getter so that after confirming "vae" is in models and models["vae"]
has a config, you also check that the config has a scaling_factor (e.g.,
hasattr(models["vae"].config, "scaling_factor") or getattr with default) and
only return it when present, otherwise return the fallback 0.18215; reference
the models dict, the "vae" entry, its config attribute, and the scaling_factor
property when implementing this guard.

In `@examples/diffusers/fastgen/preprocess/processors/caption_loaders.py`:
- Line 425: The import "from collections import defaultdict" is placed inside a
function in caption_loaders.py—move this import to the module top-level and
remove the in-function import; search for the in-function occurrence of "from
collections import defaultdict" (and any use of defaultdict) and add a single
top-of-file import statement instead so import errors surface at module load and
the function body no longer contains the import.

In `@examples/diffusers/fastgen/preprocess/processors/qwen_image.py`:
- Around line 184-191: The second value returned by pipeline.encode_prompt (the
prompt_embeds_mask) is being ignored in encode_prompt calls and therefore not
serialized; update the encode_prompt usage in qwen_image.py to capture both
outputs (e.g., prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(...))
and include prompt_embeds_mask in the returned/persisted dict alongside
prompt_embeds (apply the same .detach().cpu().to(torch.bfloat16) or appropriate
dtype/shape handling as done for prompt_embeds) so downstream TextToImageDataset
receives the real mask instead of falling back to an all-ones mask.
- Line 217: The autocast call inside verify_latent is using an invalid dtype for
CUDA (torch.float32); update the autocast usage in verify_latent (qwen_image.py)
to use a valid reduced-precision dtype such as torch.float16 or torch.bfloat16,
or remove the dtype argument so autocast picks the default, ensuring the with
torch.no_grad(), autocast(...) context no longer raises and causes false
verification failures.

In `@LICENSE`:
- Around line 226-228: Add a top-of-file vendored header to each Python file
listed (dmd2_finetune.py, dmd2_recipe.py, export_diffusers_qwen_image.py,
fastgen_checkpoint.py, fastgen_data/__init__.py, fastgen_data/collate_fns.py,
inference_dmd2_qwen_image.py, make_negative_prompt_embedding.py,
preprocess_qwen_image.py) that matches the LICENSE expectation: a single-line or
block comment beginning with "Vendored from" and including the original
NVIDIA-NeMo/Automodel source path and commit SHA (e.g., "Vendored from
NVIDIA-NeMo/Automodel: <original/path> @ <commit>"); place it at the very top of
each file before imports and include any short attribution or license snippet
required by the original file so each file contains the "Vendored from" string
referenced in LICENSE.

In `@tests/examples/diffusers/fastgen/test_vendored_migration.py`:
- Around line 157-163: Move the local import of inspect out of the test function
to a top-level import; update the module imports to include "import inspect" at
the module scope and remove the "import inspect" line from inside
test_data_builders_importable_and_accept_negative_prompt_path so the test uses
the module-level inspect symbol.

---

Outside diff comments:
In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Line 671: Summary: Unsafe checkpoint deserialization uses torch.load(...,
weights_only=False); change to weights_only=True or justify why full
deserialization is required. Locate the torch.load calls that load sidecar
checkpoints (e.g., the ema_state load referenced by the symbol ema_state and
variable ema_path in dmd2_recipe.py) and update the argument weights_only=False
to weights_only=True for all similar restore paths (the other torch.load calls
flagged in the review). If full object deserialization truly is required,
instead leave weights_only=False but add an inline comment explaining why it is
safe and unavoidable; otherwise prefer weights_only=True to avoid executing
arbitrary pickle payloads.

---

Nitpick comments:
In `@examples/diffusers/fastgen/preprocess/processors/base.py`:
- Around line 187-190: The numpy import is done inside the preprocessing code
(where image_tensor is created in preprocess_image), which causes import errors
to appear at runtime; move "import numpy as np" to module scope near the other
imports at the top of examples/diffusers/fastgen/preprocess/processors/base.py
(or add a short comment explaining a deliberate deferred import) so that
preprocess_image (and the image_tensor creation) uses a top-level np import;
update any references in preprocess_image and related functions to use the
module-scoped np.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a5a2507-bada-45a1-9617-9f3d3b0b7710

📥 Commits

Reviewing files that changed from the base of the PR and between dd49a46 and c7fc9bf.

📒 Files selected for processing (23)
  • .pre-commit-config.yaml
  • LICENSE
  • examples/diffusers/fastgen/README.md
  • examples/diffusers/fastgen/configs/dmd2_qwen_image.yaml
  • examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
  • examples/diffusers/fastgen/dmd2_finetune.py
  • examples/diffusers/fastgen/dmd2_recipe.py
  • examples/diffusers/fastgen/fastgen_checkpoint.py
  • examples/diffusers/fastgen/fastgen_data/__init__.py
  • examples/diffusers/fastgen/fastgen_data/collate_fns.py
  • examples/diffusers/fastgen/fastgen_data/text_to_image_dataset.py
  • examples/diffusers/fastgen/make_negative_prompt_embedding.py
  • examples/diffusers/fastgen/preprocess/__init__.py
  • examples/diffusers/fastgen/preprocess/preprocessing_multiprocess.py
  • examples/diffusers/fastgen/preprocess/processors/__init__.py
  • examples/diffusers/fastgen/preprocess/processors/base.py
  • examples/diffusers/fastgen/preprocess/processors/caption_loaders.py
  • examples/diffusers/fastgen/preprocess/processors/qwen_image.py
  • examples/diffusers/fastgen/preprocess/processors/registry.py
  • examples/diffusers/fastgen/preprocess_qwen_image.py
  • examples/diffusers/fastgen/requirements.txt
  • pyproject.toml
  • tests/examples/diffusers/fastgen/test_vendored_migration.py
💤 Files with no reviewable changes (1)
  • examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml

Comment thread examples/diffusers/fastgen/fastgen_data/collate_fns.py Outdated
Comment on lines +208 to +210
if "vae" in models and hasattr(models["vae"], "config"):
return models["vae"].config.scaling_factor
return 0.18215 # Default for most models

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard scaling_factor existence before dereference.

Line 209 assumes models["vae"].config.scaling_factor always exists once config exists. If scaling_factor is missing, this raises instead of using the default fallback.

Proposed change
     def get_vae_scaling_factor(self, models: dict[str, Any]) -> float:
@@
-        if "vae" in models and hasattr(models["vae"], "config"):
-            return models["vae"].config.scaling_factor
+        if "vae" in models and hasattr(models["vae"], "config"):
+            scaling_factor = getattr(models["vae"].config, "scaling_factor", None)
+            if scaling_factor is not None:
+                return scaling_factor
         return 0.18215  # Default for most models
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/preprocess/processors/base.py` around lines 208 -
210, The code currently dereferences models["vae"].config.scaling_factor without
ensuring the attribute exists; update the getter so that after confirming "vae"
is in models and models["vae"] has a config, you also check that the config has
a scaling_factor (e.g., hasattr(models["vae"].config, "scaling_factor") or
getattr with default) and only return it when present, otherwise return the
fallback 0.18215; reference the models dict, the "vae" entry, its config
attribute, and the scaling_factor property when implementing this guard.

Comment thread examples/diffusers/fastgen/preprocess/processors/caption_loaders.py Outdated
Comment on lines +184 to +191
prompt_embeds, _ = pipeline.encode_prompt(
prompt=prompt,
device=device,
)

return {
"prompt_embeds": prompt_embeds.detach().cpu().to(torch.bfloat16),
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

prompt_embeds_mask is dropped before cache serialization.

Line 184 receives a second output from encode_prompt but discards it, and Lines 261-266 persist only prompt_embeds. Downstream, TextToImageDataset falls back to an all-ones mask when the mask key is absent, so the true sequence mask contract is lost.

Proposed change
         with torch.no_grad():
-            prompt_embeds, _ = pipeline.encode_prompt(
+            prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(
                 prompt=prompt,
                 device=device,
             )
 
-        return {
+        result = {
             "prompt_embeds": prompt_embeds.detach().cpu().to(torch.bfloat16),
         }
+        if prompt_embeds_mask is not None:
+            result["prompt_embeds_mask"] = prompt_embeds_mask.detach().cpu().to(torch.long)
+        return result
@@
         return {
             # Image latent
             "latent": latent,
             # Text embeddings
             "prompt_embeds": text_encodings["prompt_embeds"],
+            "prompt_embeds_mask": text_encodings.get("prompt_embeds_mask"),
             # Metadata
             "original_resolution": metadata["original_resolution"],

Also applies to: 261-266

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/preprocess/processors/qwen_image.py` around lines
184 - 191, The second value returned by pipeline.encode_prompt (the
prompt_embeds_mask) is being ignored in encode_prompt calls and therefore not
serialized; update the encode_prompt usage in qwen_image.py to capture both
outputs (e.g., prompt_embeds, prompt_embeds_mask = pipeline.encode_prompt(...))
and include prompt_embeds_mask in the returned/persisted dict alongside
prompt_embeds (apply the same .detach().cpu().to(torch.bfloat16) or appropriate
dtype/shape handling as done for prompt_embeds) so downstream TextToImageDataset
receives the real mask instead of falling back to an all-ones mask.

Comment thread examples/diffusers/fastgen/preprocess/processors/qwen_image.py Outdated
Comment thread LICENSE Outdated
Comment on lines +226 to +228
Copyright (c) 2025 NVIDIA CORPORATION — portions of examples/diffusers/fastgen/ are
vendored from NVIDIA-NeMo/Automodel (https://github.com/NVIDIA-NeMo/Automodel); see the
per-file "Vendored from" headers for the exact source paths and commit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check that vendored files have "Vendored from" headers.
# Expected: Each file starts with a comment containing "Vendored from" + source URL.

vendored_files=(
  "examples/diffusers/fastgen/fastgen_data/text_to_image_dataset.py"
  "examples/diffusers/fastgen/preprocess/preprocessing_multiprocess.py"
  "examples/diffusers/fastgen/preprocess/processors/__init__.py"
  "examples/diffusers/fastgen/preprocess/processors/base.py"
  "examples/diffusers/fastgen/preprocess/processors/registry.py"
  "examples/diffusers/fastgen/preprocess/processors/caption_loaders.py"
  "examples/diffusers/fastgen/preprocess/processors/qwen_image.py"
)

echo "Checking for 'Vendored from' headers in vendored files..."
for file in "${vendored_files[@]}"; do
  if [ -f "$file" ]; then
    if head -20 "$file" | grep -q "Vendored from"; then
      echo "$file: has Vendored from header"
    else
      echo "$file: MISSING Vendored from header"
    fi
  else
    echo "$file: file not found"
  fi
done

Repository: NVIDIA/Model-Optimizer

Length of output: 763


🏁 Script executed:

#!/bin/bash
set -euo pipefail

dir="examples/diffusers/fastgen"
echo "Scanning for Python files under: $dir"
mapfile -t files < <(find "$dir" -type f -name "*.py" | sort)

echo "Found ${`#files`[@]} .py files."
missing=()

for f in "${files[@]}"; do
  if head -30 "$f" | rg -q "Vendored from"; then
    echo "$f"
  else
    echo "$f (missing 'Vendored from' in first 30 lines)"
    missing+=("$f")
  fi
done

if [ ${`#missing`[@]} -gt 0 ]; then
  echo
  echo "Missing count: ${`#missing`[@]}"
  printf '%s\n' "${missing[@]}"
  exit 1
else
  echo
  echo "All scanned Python files have 'Vendored from' headers."
fi

Repository: NVIDIA/Model-Optimizer

Length of output: 201


🏁 Script executed:

#!/bin/bash
set -euo pipefail

dir="examples/diffusers/fastgen"
echo "Scanning for Python files under: $dir"

tmp="$(mktemp)"
find "$dir" -type f -name "*.py" | sort > "$tmp"

mapfile -t files < "$tmp"
rm -f "$tmp"

echo "Found ${`#files`[@]} .py files."

missing=()
for f in "${files[@]}"; do
  if head -40 "$f" | rg -q "Vendored from"; then
    echo "$f"
  else
    echo "$f (missing 'Vendored from' in first 40 lines)"
    missing+=("$f")
  fi
done

if [ ${`#missing`[@]} -gt 0 ]; then
  echo
  echo "Missing count: ${`#missing`[@]}"
  printf '%s\n' "${missing[@]}"
  exit 1
else
  echo
  echo "All scanned Python files have 'Vendored from' headers."
fi

Repository: NVIDIA/Model-Optimizer

Length of output: 2025


🏁 Script executed:

#!/bin/bash
set -euo pipefail

dir="examples/diffusers/fastgen"
tmp="$(mktemp)"
find "$dir" -type f -name "*.py" | sort > "$tmp"
mapfile -t files < "$tmp"
rm -f "$tmp"

echo "Checking where 'Vendored from' appears (first 200 lines + anywhere in file):"

for f in "${files[@]}"; do
  first200_has=0
  any_has=0
  if head -200 "$f" | rg -q "Vendored from"; then
    first200_has=1
  fi
  if rg -q "Vendored from" "$f"; then
    any_has=1
  fi

  if [ "$any_has" -eq 1 ]; then
    if [ "$first200_has" -eq 1 ]; then
      echo "$f (Vendored from in first 200 lines)"
    else
      echo "$f (Vendored from exists but after first 200 lines)"
    fi
  else
    echo "$f (no 'Vendored from' anywhere)"
  fi
done

Repository: NVIDIA/Model-Optimizer

Length of output: 1689


Add missing # Vendored from ... headers to vendored fastgen Python files

LICENSE points to per-file “Vendored from” headers, but these vendored files under examples/diffusers/fastgen/ contain no “Vendored from” string anywhere:

  • examples/diffusers/fastgen/dmd2_finetune.py
  • examples/diffusers/fastgen/dmd2_recipe.py
  • examples/diffusers/fastgen/export_diffusers_qwen_image.py
  • examples/diffusers/fastgen/fastgen_checkpoint.py
  • examples/diffusers/fastgen/fastgen_data/__init__.py
  • examples/diffusers/fastgen/fastgen_data/collate_fns.py
  • examples/diffusers/fastgen/inference_dmd2_qwen_image.py
  • examples/diffusers/fastgen/make_negative_prompt_embedding.py
  • examples/diffusers/fastgen/preprocess_qwen_image.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@LICENSE` around lines 226 - 228, Add a top-of-file vendored header to each
Python file listed (dmd2_finetune.py, dmd2_recipe.py,
export_diffusers_qwen_image.py, fastgen_checkpoint.py, fastgen_data/__init__.py,
fastgen_data/collate_fns.py, inference_dmd2_qwen_image.py,
make_negative_prompt_embedding.py, preprocess_qwen_image.py) that matches the
LICENSE expectation: a single-line or block comment beginning with "Vendored
from" and including the original NVIDIA-NeMo/Automodel source path and commit
SHA (e.g., "Vendored from NVIDIA-NeMo/Automodel: <original/path> @ <commit>");
place it at the very top of each file before imports and include any short
attribution or license snippet required by the original file so each file
contains the "Vendored from" string referenced in LICENSE.

Comment thread tests/examples/diffusers/fastgen/test_vendored_migration.py
jingyu-ml and others added 2 commits June 11, 2026 16:09
…e diffusers team

Drop the /examples/diffusers/fastgen carve-out; @NVIDIA/modelopt-torch-fastgen-codeowners owns modelopt/torch/fastgen (the DMD2 library). The example stays under /examples/diffusers (@NVIDIA/modelopt-examples-diffusers-codeowners).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Apply the safe, sensible CodeRabbit findings: (1) collate_fns.py loads the negative-prompt embedding with weights_only=True (it is a dict of plain tensors saved by make_negative_prompt_embedding.py, so it loads cleanly) — removes the pickle/RCE vector per SECURITY.md; (2) move 'from collections import defaultdict' (caption_loaders) and 'import inspect' (test) to module top; (3) drop the no-op autocast(dtype=torch.float32) in QwenImageProcessor.verify_latent — the latent is already .float(), so this is behavior-identical and removes the invalid-CUDA-autocast-dtype warning. Preprocessing/data only; no change to DMD2 training math.

Intentionally not changed (would alter validated behavior or are false positives): not persisting prompt_embeds_mask in the cache (would switch the text-attention mask from the all-ones fallback the existing caches + running jobs use to a real mask — changes training math); the speculative scaling_factor guard (Qwen-Image VAE always defines it; vendored verbatim); 'Vendored from' headers on authored files (those correctly carry the standard NVIDIA SPDX header — LICENSE states only 'portions' are vendored).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@jingyu-ml

Copy link
Copy Markdown
Contributor Author

/ok to test 85ae40c

Comment thread LICENSE

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are also nvidia authored and apache 2 license files so wr dont need to exclude in pre commit or have duplicate nvidia license. Let license hook auto add correct license everywhere

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — addressed in be832ae. The files copied from NVIDIA-NeMo/Automodel now carry only the standard NVIDIA SPDX header (managed by the insert-license hook): I dropped their pre-commit insert-license exclusion, removed the per-file Vendored from provenance note + the duplicated original-license block, and removed the NVIDIA-NeMo/Automodel note from LICENSE (the HuggingFace / MIT / BSD third-party notices are unchanged). The migration test now asserts the standard header + no duplicate license instead of requiring the provenance note.

Happy to keep a one-line adapted from note below the header on any of these if you'd prefer.

On resume the StatefulDataLoader's restored state does not advance past the
resume point: re-checkpointing after a resume fails to capture progress, so
every window re-served the same data slice and multi-window (e.g. SLURM
time-limited) runs under-covered the dataset. Verified on production
checkpoints (steps 999 / 12999 / 24999 all resume to the same sample) and a CPU
probe. Only the data position is affected -- the step counter is restored
correctly -- but that is the substantive part.

The reliably-restored counter is global_step, so the new
DMD2DiffusionRecipe._rebuild_dataloader_for_resume discards the stuck loader
state on resume: it rebuilds a fresh StatefulDataLoader and skips the
deterministic sampler to the position implied by global_step (epoch =
global_step // epoch_len, skip = (global_step % epoch_len) * grad_acc). This is
CPU-verified bit-identical to a no-resume run on the real dataloader, including
across epoch boundaries, and also corrects the resumed epoch label and
progress-bar position. It is intentionally not wrapped in try/except: the inputs
are always present on this path, and silently falling back to the stuck loader
would reinstate the bug.

Add a SLURM-free, GPU-free CPU regression test
(tests/examples/diffusers/fastgen/test_resume_dataloader.py) that drives the real
SequentialBucketSampler + StatefulDataLoader through the recipe method and
asserts the first sample served after a resume equals a clean no-resume run's
sample at that global step (mid-epoch, epoch boundary, and cross-epoch), plus a
fresh-start no-op check; it fails on the previous loader-state resume. DMD2
training math is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@jingyu-ml jingyu-ml requested a review from a team as a code owner June 15, 2026 23:15
…iew)

Per review: the files copied from NVIDIA-NeMo/Automodel are NVIDIA-authored
Apache-2.0, so they do not need the third-party "copying code" treatment. Drop
the per-file "Vendored from" provenance note and the duplicated original-license
block (each file now carries only the standard NVIDIA SPDX header that the
insert-license hook manages), remove their insert-license pre-commit exclusion,
and remove the NVIDIA-NeMo/Automodel note from LICENSE's third-party section (the
genuine HuggingFace / MIT / BSD notices are unchanged). Update the migration test
to assert the standard header + no provenance / no duplicate license instead of
requiring the provenance note.

Headers and config only -- no code or behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/diffusers/fastgen/dmd2_recipe.py (1)

720-727: ⚠️ Potential issue | 🔴 Critical

Use safe deserialization for sidecar checkpoint restores.

The recipe saves and restores checkpoint sidecars (ema_shadow.pt, dmd_state.pt, discriminator.pt, discriminator_optimizer.pt) containing only state dicts and scalars. These should use weights_only=True since they contain no arbitrary Python objects.

-            ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+            ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)
@@ Line 726
-            state = torch.load(state_path, map_location="cpu", weights_only=False)
+            state = torch.load(state_path, map_location="cpu", weights_only=True)
@@ Line 739
-                disc_state = torch.load(disc_path, map_location="cpu", weights_only=False)
+                disc_state = torch.load(disc_path, map_location="cpu", weights_only=True)
@@ Line 748
-                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False)
+                disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/diffusers/fastgen/dmd2_recipe.py` around lines 720 - 727, The
torch.load() calls for loading EMA state (ema_path) and iteration state
(state_path) are using weights_only=False, but these sidecar checkpoint files
only contain state dicts and scalars with no arbitrary Python objects. Change
weights_only=False to weights_only=True in both torch.load() calls to enable
safe deserialization for these sidecar checkpoint restores.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/examples/diffusers/fastgen/test_resume_dataloader.py`:
- Around line 95-101: Move the optional dependency guards and imports from
inside the test functions to module scope at the top of the file. The imports
for torch, nemo_automodel, torchdata, SequentialBucketSampler,
StatefulDataLoader, and dmd2_recipe (wrapped in pytest.importorskip calls)
should be relocated to module-level after any sys.path setup, rather than being
duplicated inside test bodies. This ensures import errors surface at
collection/skip time rather than mid-test execution, following the guideline
that imports belong at the top of the file.

---

Outside diff comments:
In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Around line 720-727: The torch.load() calls for loading EMA state (ema_path)
and iteration state (state_path) are using weights_only=False, but these sidecar
checkpoint files only contain state dicts and scalars with no arbitrary Python
objects. Change weights_only=False to weights_only=True in both torch.load()
calls to enable safe deserialization for these sidecar checkpoint restores.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38a101c0-2324-42d2-a207-98a20b45964d

📥 Commits

Reviewing files that changed from the base of the PR and between 85ae40c and 6ffbc52.

📒 Files selected for processing (2)
  • examples/diffusers/fastgen/dmd2_recipe.py
  • tests/examples/diffusers/fastgen/test_resume_dataloader.py

Comment on lines +95 to +101
pytest.importorskip("torch")
pytest.importorskip("nemo_automodel")
pytest.importorskip("torchdata")
from nemo_automodel.components.datasets.diffusion.sampler import SequentialBucketSampler
from torchdata.stateful_dataloader import StatefulDataLoader

dmd2_recipe = pytest.importorskip("dmd2_recipe")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move optional dependency guards/imports to module scope.

These imports run inside test bodies and are duplicated, so import failures surface mid-test instead of at collection/skip time. Keep the pytest.importorskip guards, but hoist them after the sys.path setup. As per coding guidelines, “Imports belong at the top of the file so import errors surface at collection time, not mid-test.”

🧪 Proposed fix
 if str(_FASTGEN_DIR) not in sys.path:
     sys.path.insert(0, str(_FASTGEN_DIR))
 
+pytest.importorskip("torch")
+_sampler_mod = pytest.importorskip("nemo_automodel.components.datasets.diffusion.sampler")
+_stateful_dataloader_mod = pytest.importorskip("torchdata.stateful_dataloader")
+dmd2_recipe = pytest.importorskip("dmd2_recipe")
+
+SequentialBucketSampler = _sampler_mod.SequentialBucketSampler
+StatefulDataLoader = _stateful_dataloader_mod.StatefulDataLoader
+
 _N = 20  # samples (== batches, batch_size 1) per epoch in the synthetic dataset
@@
-    pytest.importorskip("torch")
-    pytest.importorskip("nemo_automodel")
-    pytest.importorskip("torchdata")
-    from nemo_automodel.components.datasets.diffusion.sampler import SequentialBucketSampler
-    from torchdata.stateful_dataloader import StatefulDataLoader
-
-    dmd2_recipe = pytest.importorskip("dmd2_recipe")
     # The reset logs only on the main process; force True off the distributed path.
     monkeypatch.setattr(dmd2_recipe, "is_main_process", lambda: True, raising=False)
@@
-    pytest.importorskip("torch")
-    pytest.importorskip("nemo_automodel")
-    pytest.importorskip("torchdata")
-    from nemo_automodel.components.datasets.diffusion.sampler import SequentialBucketSampler
-    from torchdata.stateful_dataloader import StatefulDataLoader
-
-    dmd2_recipe = pytest.importorskip("dmd2_recipe")
     monkeypatch.setattr(dmd2_recipe, "is_main_process", lambda: True, raising=False)

Also applies to: 140-146

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/examples/diffusers/fastgen/test_resume_dataloader.py` around lines 95 -
101, Move the optional dependency guards and imports from inside the test
functions to module scope at the top of the file. The imports for torch,
nemo_automodel, torchdata, SequentialBucketSampler, StatefulDataLoader, and
dmd2_recipe (wrapped in pytest.importorskip calls) should be relocated to
module-level after any sys.path setup, rather than being duplicated inside test
bodies. This ensures import errors surface at collection/skip time rather than
mid-test execution, following the guideline that imports belong at the top of
the file.

Source: Coding guidelines

@jingyu-ml jingyu-ml requested a review from kevalmorabia97 June 15, 2026 23:25
@kevalmorabia97 kevalmorabia97 removed the request for review from chochowski June 16, 2026 03:16
@kevalmorabia97 kevalmorabia97 requested a review from a team June 16, 2026 03:19
@kevalmorabia97

Copy link
Copy Markdown
Collaborator

/ok to test 76bd04a

The resume fix re-assigned `self.dataloader` to a fresh StatefulDataLoader, but
`dataloader` is already a tracked state key, so BaseRecipe.__setattr__ raised
`RuntimeError: State key 'dataloader' is already tracked` and crashed every rank
right after restoring the checkpoint (observed on a 128-GPU resume from
epoch_0_step_199). Assign via `self.__dict__["dataloader"] = ...` instead: it
updates the attribute while leaving it tracked (the existing `__state_tracked`
entry is unchanged, so the rebuilt loader is still checkpointed), matching the
pattern the recipe already uses for its other extras. Modelopt-side only -- no
Automodel change.

Harden the regression test to drive `_rebuild_dataloader_for_resume` through a
real `DMD2DiffusionRecipe` instance (`object.__new__`) so BaseRecipe.__setattr__
state-tracking is exercised. The previous SimpleNamespace stub had no
`__setattr__`, so it never hit the "already tracked" guard and missed this crash.
Verified the old line raises the exact error and the `__dict__` assignment does
not (and keeps `dataloader` tracked).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants