[NVBug 6287315] Fix unified HF export for Llama4 MoE models #1744
shengliangxu wants to merge 3 commits into
GptOss and Llama4 MoE are the two models that get special handling across the file. They always appear together in the special-handling code paths, but this problematic export path did not include Llama4 MoE. Adding it fixes the export failure.
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
📝 Walkthrough
The uncalibrated-experts input-quantizer
Changes
Llama4 MoE HF Export Fix
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 6 passed
🧹 Nitpick comments (1)
modelopt/torch/export/unified_export_hf.py (1)
870-870: 💤 Low value. Consider renaming for clarity (optional).
The variable `gpt_oss_linear_names` now applies to both `QuantGptOssExperts` and `QuantLlama4TextExperts`. Consider renaming to `fused_expert_linear_names` for clarity.
♻️ Optional refactor
- gpt_oss_linear_names = ["gate_up_proj", "down_proj"]
- for linear_name in gpt_oss_linear_names:
+ fused_expert_linear_names = ["gate_up_proj", "down_proj"]
+ for linear_name in fused_expert_linear_names:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/unified_export_hf.py` at line 870, The variable `gpt_oss_linear_names` at line 870 is misleading because it is now used for both `QuantGptOssExperts` and `QuantLlama4TextExperts` model types, not exclusively for GPT-OSS models. Rename the variable from `gpt_oss_linear_names` to `fused_expert_linear_names` throughout the code to better reflect its broader applicability to fused expert layer types, ensuring all references to this variable are updated consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ebe35685-23b4-4b2c-8553-ef1f0bf471f4
📒 Files selected for processing (2)
CHANGELOG.rst
modelopt/torch/export/unified_export_hf.py
cjluo-nv left a comment
Bot review — DM the bot to share feedback.
Small, correct bug fix (+9/-3, 2 files): the uncalibrated-experts input-quantizer amax fallback in _export_transformers_checkpoint previously matched only QuantGptOssExperts, so Llama4 MoE export failed. The new branch also matches QuantLlama4TextExperts.
Verified against the codebase:
- `_QuantLlama4TextExperts` (registered for `Llama4TextExperts`) defines exactly the same singular `gate_up_proj_input_quantizer` / `down_proj_input_quantizer` / `gate_up_proj` + `down_proj` fused layout as `_QuantGptOssExperts`, so routing it through the same branch is correct.
- `_process_quantized_modules` already handles both `Llama4TextExperts` and `GptOssExperts` together (amax fallback + weight export), matching the new branch's comment.
- Branch ordering is safe: `_QuantLlama4TextExperts` uses the singular `gate_up_proj_weight_quantizer`, so it does not get caught by the earlier `_QuantFusedExperts` `gate_up_proj_weight_quantizers` (plural) elif and correctly falls through.
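The branch-ordering argument above can be illustrated with a small sketch. It is not the ModelOpt code: the classes are plain-Python stand-ins, and `select_branch`, `SINGULAR_FUSED_TYPES`, and the returned labels are hypothetical names chosen for this example. The only point it shows is that an attribute check on the plural name does not match a wrapper that exposes only the singular name.

```python
# Hypothetical stand-ins for illustration only; the real ModelOpt wrappers
# hold torch quantizer modules rather than plain objects.
class QuantFusedExperts:
    """Mimics a wrapper with one weight quantizer per expert (plural attribute)."""

    def __init__(self, num_experts: int) -> None:
        self.gate_up_proj_weight_quantizers = [object() for _ in range(num_experts)]


class QuantGptOssExperts:
    """Mimics a wrapper with a single fused weight quantizer (singular attribute)."""

    def __init__(self) -> None:
        self.gate_up_proj_weight_quantizer = object()


class QuantLlama4TextExperts:
    """Same singular layout as the GPT-OSS stand-in."""

    def __init__(self) -> None:
        self.gate_up_proj_weight_quantizer = object()


# Assumed list of wrapper names that share the singular fused layout.
SINGULAR_FUSED_TYPES = ("QuantGptOssExperts", "QuantLlama4TextExperts")


def select_branch(module: object) -> str:
    """Return a label for the branch a module would take in this sketch."""
    # Earlier branch: matches only the plural attribute name.
    if hasattr(module, "gate_up_proj_weight_quantizers"):
        return "per_expert_quantizers"
    # Later branch: matches by wrapper type name, as the review describes.
    elif any(name in type(module).__name__ for name in SINGULAR_FUSED_TYPES):
        return "fused_singular_quantizers"
    return "default"
```

Because `hasattr` compares the full attribute name, the singular `gate_up_proj_weight_quantizer` never satisfies the plural check, so both singular wrappers reach the second branch.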
Licensing clean (standard NVIDIA header on existing file, CHANGELOG entry only). No design-review concerns (additive bug fix). No prompt-injection issues in the PR content.
Flagging for a human look only because there is no automated test for the fixed path — the author states a full Llama4 MoE checkpoint is needed and the branch parallels the already-covered GPT-OSS path, which is reasonable but worth an owner's sign-off.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1744      +/-   ##
==========================================
- Coverage   77.12%   76.55%   -0.58%
==========================================
  Files         511      511
  Lines       56273    56273
==========================================
- Hits        43399    43077     -322
- Misses      12874    13196     +322
Flags with carried forward coverage won't be shown.
What does this PR do?
Type of change: Bug fix
Fixes unified HuggingFace checkpoint export for Llama4 MoE models (NVBug 6287315).
`GptOssExperts` and `Llama4TextExperts` are the two fused-expert model families that get special handling throughout `modelopt/torch/export/unified_export_hf.py`, and they appear together in every other special-cased path (e.g. the BMM-style weight transposition at L626-629 and the uncalibrated-experts handling in `_process_quantized_modules` at L796-798). The uncalibrated-experts input-quantizer `amax` fallback inside `_export_transformers_checkpoint`, however, special-cased only `QuantGptOssExperts`, so Llama4 MoE fell through and export failed.
Since both wrappers use the same fused `gate_up_proj`/`down_proj` layout with singular input quantizers, `QuantLlama4TextExperts` is now handled by the same branch, restoring Llama4 MoE export.
Usage
No API change. Quantizing and exporting a Llama4 MoE model now succeeds:
Testing
Verified that unified HF export of a quantized Llama4 MoE checkpoint, which previously failed per NVBug 6287315, now completes successfully. The change extends an already-special-cased branch that mirrors the GPT-OSS handling (same `gate_up_proj`/`down_proj` fused layout), so behavior for all other model types is unchanged.
Before your PR is "Ready for review"
CONTRIBUTING.md: N/A
Additional Information
`cherry-pick-0.45.0` for backport to `release/0.45.0`.