Skip to content

Fix Step3.5 HF export RoPE metadata#1745

Closed
meenchen wants to merge 1 commit into
mainfrom
fix-step35-rope-theta-config
Closed

Fix Step3.5 HF export RoPE metadata#1745
meenchen wants to merge 1 commit into
mainfrom
fix-step35-rope-theta-config

Conversation

@meenchen

@meenchen meenchen commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: Bug fix

Fixes exported Step-3.5 Hugging Face checkpoints for Transformers 5.x/vLLM by adding rope_theta to llama3 RoPE metadata when save_pretrained() emits rope_parameters or rope_scaling without that required key.

Transformers 5.10.2 validates rope_parameters during config construction and raises:

KeyError: "Missing required keys in `rope_parameters` for 'rope_type'='llama3': {'rope_theta'}"

The sanitizer uses config.json["rope_theta"] when present, otherwise falls back to model.config.rope_theta.

Usage

Export Step-3.5 HF checkpoints as usual with hf_ptq.py, then serve the exported checkpoint with vLLM/Transformers 5.x.

Testing

  • pre-commit run ruff-format --files modelopt/torch/export/plugins/hf_checkpoint_utils.py modelopt/torch/export/plugins/__init__.py modelopt/torch/export/unified_export_hf.py tests/unit/torch/export/test_hf_checkpoint_utils.py
  • pre-commit run ruff-check --files modelopt/torch/export/plugins/hf_checkpoint_utils.py modelopt/torch/export/plugins/__init__.py modelopt/torch/export/unified_export_hf.py tests/unit/torch/export/test_hf_checkpoint_utils.py
  • pre-commit run mypy --files modelopt/torch/export/plugins/hf_checkpoint_utils.py modelopt/torch/export/plugins/__init__.py modelopt/torch/export/unified_export_hf.py tests/unit/torch/export/test_hf_checkpoint_utils.py
  • pre-commit run mypy --all-files
  • PYTHONPATH=/private/tmp/modelopt-step35-rope-fix python -m pytest tests/unit/torch/export/test_hf_checkpoint_utils.py -vv
  • Config-load E2E with Transformers 5.10.2 and a minimal vLLM Step3p5Config reproduction:
    • unsanitized config reproduced the reported Missing required keys ... {'rope_theta'} error
    • sanitized config loaded successfully with rope_parameters["rope_theta"] == 500000

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: N/A

Additional Information

This is separate from PR #1693. PR #1693 handles layer_types length sanitization; this PR handles the earlier Step-3.5 RoPE config validation failure.

Summary by CodeRabbit

  • New Features

    • Added configuration sanitization for Hugging Face model exports to properly adjust LLaMA3 RoPE parameters and rope_theta values during the export process.
  • Tests

    • Added comprehensive unit tests covering configuration sanitization scenarios, including rope_theta injection, parameter metadata updates, and non-LLaMA3 compatibility.

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen requested review from a team as code owners June 16, 2026 03:36
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds LLaMA3 RoPE configuration sanitization for Hugging Face checkpoint export. New helpers in hf_checkpoint_utils.py resolve and inject rope_theta into rope_parameters/rope_scaling entries when absent. The plugins/__init__.py guards the import with a conditional no-op fallback. export_hf_checkpoint calls the sanitizer after loading config.json.

Changes

LLaMA3 RoPE Config Sanitization on HF Export

Layer / File(s) Summary
RoPE sanitization helpers
modelopt/torch/export/plugins/hf_checkpoint_utils.py
Adds _get_rope_theta, _sanitize_llama3_rope_config, and the public sanitize_hf_config_for_deployment function that injects rope_theta into LLaMA3 rope_parameters and rope_scaling entries in-place.
Conditional plugin fallback
modelopt/torch/export/plugins/__init__.py
Wraps hf_checkpoint_utils in import_plugin, adds a guarded no-op sanitize_hf_config_for_deployment fallback when the symbol is absent, and reorders vllm_fakequant_hf/vllm_fakequant_megatron imports to follow the fallback definition.
Export call site and tests
modelopt/torch/export/unified_export_hf.py, tests/unit/torch/export/test_hf_checkpoint_utils.py
Imports and calls sanitize_hf_config_for_deployment on config_data inside export_hf_checkpoint after loading config.json. Unit tests cover rope_theta propagation, model config fallback, non-overwrite, and non-LLaMA3 no-op behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix Step3.5 HF export RoPE metadata' directly describes the main change: fixing Hugging Face checkpoint exports for Step-3.5 models by addressing RoPE metadata issues with Transformers 5.x compatibility.
Docstring Coverage ✅ Passed Docstring coverage is 84.62% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed All 6 critical security anti-patterns were checked and are absent: no torch.load(weights_only=False), numpy.load(allow_pickle=True), trust_remote_code=True, eval/exec, # nosec comments, or non-perm...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-step35-rope-theta-config

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-16 04:50 UTC

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/export/plugins/__init__.py`:
- Around line 27-34: The fallback definition of
sanitize_hf_config_for_deployment is a complete no-op that returns None without
performing any sanitization, which silently disables the RoPE compatibility fix
when the hf_checkpoint_utils plugin import fails. Replace the no-op fallback
function with an implementation that performs the essential sanitization logic,
specifically the rope_theta injection to the config_data dictionary. This
ensures the RoPE compatibility fix is always applied regardless of whether the
optional plugin successfully imports, preventing silent runtime failures on
environments where the plugin is unavailable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5bc364af-5c34-441c-af04-b8e23600acbe

📥 Commits

Reviewing files that changed from the base of the PR and between a21197c and 0427643.

📒 Files selected for processing (4)
  • modelopt/torch/export/plugins/__init__.py
  • modelopt/torch/export/plugins/hf_checkpoint_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • tests/unit/torch/export/test_hf_checkpoint_utils.py

Comment on lines +27 to +34
with import_plugin("hf_checkpoint_utils"):
from .hf_checkpoint_utils import *

if "sanitize_hf_config_for_deployment" not in globals():

def sanitize_hf_config_for_deployment(config_data: dict[str, Any], model: Any) -> None:
"""No-op fallback when Hugging Face checkpoint utilities are unavailable."""
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Silent no-op fallback can disable the RoPE compatibility fix at runtime.

When hf_checkpoint_utils fails to import, this fallback makes sanitize_hf_config_for_deployment a no-op, so export continues without rope_theta injection. That silently reintroduces the config incompatibility on environments where the optional plugin import fails.

Suggested fix (keep sanitization always available)
# modelopt/torch/export/plugins/__init__.py
 with import_plugin("hf_checkpoint_utils"):
     from .hf_checkpoint_utils import *
 
-if "sanitize_hf_config_for_deployment" not in globals():
-
-    def sanitize_hf_config_for_deployment(config_data: dict[str, Any], model: Any) -> None:
-        """No-op fallback when Hugging Face checkpoint utilities are unavailable."""
-        return None
+if "sanitize_hf_config_for_deployment" not in globals():
+    raise ImportError(
+        "sanitize_hf_config_for_deployment is required for HF export compatibility but "
+        "could not be imported from hf_checkpoint_utils."
+    )
# modelopt/torch/export/plugins/hf_checkpoint_utils.py
-from huggingface_hub import snapshot_download
-from huggingface_hub.errors import LocalEntryNotFoundError
 ...
 def copy_hf_ckpt_remote_code(
     pretrained_model_path: str | os.PathLike, save_directory: str | os.PathLike
 ):
+    # Local import keeps unrelated helpers (e.g., config sanitization) available
+    # even when huggingface_hub is not installed.
+    from huggingface_hub import snapshot_download
+    from huggingface_hub.errors import LocalEntryNotFoundError
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/plugins/__init__.py` around lines 27 - 34, The fallback
definition of sanitize_hf_config_for_deployment is a complete no-op that returns
None without performing any sanitization, which silently disables the RoPE
compatibility fix when the hf_checkpoint_utils plugin import fails. Replace the
no-op fallback function with an implementation that performs the essential
sanitization logic, specifically the rope_theta injection to the config_data
dictionary. This ensures the RoPE compatibility fix is always applied regardless
of whether the optional plugin successfully imports, preventing silent runtime
failures on environments where the plugin is unavailable.

@meenchen meenchen self-assigned this Jun 16, 2026
@meenchen meenchen added the cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Jun 16, 2026
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.53%. Comparing base (bcd8dd4) to head (0427643).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/plugins/__init__.py 66.66% 2 Missing ⚠️
...delopt/torch/export/plugins/hf_checkpoint_utils.py 95.45% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1745       +/-   ##
===========================================
+ Coverage   58.45%   76.53%   +18.07%     
===========================================
  Files         510      511        +1     
  Lines       56271    56366       +95     
===========================================
+ Hits        32896    43138    +10242     
+ Misses      23375    13228    -10147     
Flag Coverage Δ
examples 41.81% <63.33%> (+19.37%) ⬆️
gpu 57.77% <63.33%> (+37.17%) ⬆️
regression 14.70% <30.00%> (+0.07%) ⬆️
unit 54.36% <83.33%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@meenchen

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant