[Feat]: Domino support #1710

Open
h-guo18 wants to merge 5 commits into main from haoguo/domino

@h-guo18 h-guo18 commented Jun 13, 2026

Contributor

What does this PR do?

Type of change: New feature

Adds Domino speculative decoding: the parallel DFlash draft backbone plus a lightweight GRU causal correction head. The backbone produces base logits for a full draft block in one forward pass. A GRU over the block's teacher-forced tokens produces a causal state that is fused with the backbone hidden state and projected to a vocab-sized logit correction on the block suffix, which injects the intra-block causal dependency the parallel backbone lacks. Training uses a dual loss, (1 - λ_base) * final + λ_base * base, where λ_base decays linearly from 1 to 0 (curriculum: learn the parallel backbone first, then the correction).
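
To make the curriculum concrete, here is a minimal sketch of a linear λ_base schedule and the dual-loss combination described above. The function names, the `start` and `decay_ratio` parameters, and the rounding of the decay window are assumptions for illustration; the PR's own `compute_lambda_base` may differ in signature and edge-case handling.

```python
def lambda_base_schedule(step, total_steps, start=1.0, decay_ratio=1.0):
    """Decay linearly from `start` to 0 over the first decay_ratio * total_steps steps.

    Hypothetical helper: parameter names and rounding are assumptions.
    """
    decay_steps = max(1, int(total_steps * decay_ratio))
    progress = min(max(step, 0), decay_steps) / decay_steps
    return start * (1.0 - progress)


def dual_loss(final_loss, base_loss, lambda_base):
    """Blend the corrected (final) loss with the backbone-only (base) loss."""
    return (1.0 - lambda_base) * final_loss + lambda_base * base_loss


# Early steps weight the base loss, late steps weight the final loss.
for step in (0, 250, 500, 1000):
    lam = lambda_base_schedule(step, total_steps=1000)
    print(step, lam, dual_loss(final_loss=2.0, base_loss=3.0, lambda_base=lam))
```

With `decay_ratio=0.5` in this sketch, λ_base would reach 0 halfway through training, after which only the corrected logits contribute to the loss.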

Reuses the DFlash mode/config/recipe; selected via dflash_architecture_config.projector_type=domino and routed to its own registry so HFDominoModel does not shadow HFDFlashModel. Exports in the z-lab/SpecForge drafter format (prefix_gru.* / embed_proj.*).
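
The opt-in routing can be pictured as a lookup keyed on `projector_type`. The sketch below uses plain dicts as stand-ins for the real registry objects; `select_registry`, the dict names, and the treatment of a missing value as plain DFlash are assumptions. It also rejects unknown values, which is a choice of this sketch and not necessarily what the PR does. Whether DFlash accepts other `projector_type` values is not stated here.

```python
# Toy registries: plain dicts standing in for the real registry objects.
DFLASH_REGISTRY = {"model_cls": "HFDFlashModel"}
DOMINO_REGISTRY = {"model_cls": "HFDominoModel"}

# Assumption: a missing projector_type means "plain DFlash".
REGISTRY_BY_PROJECTOR = {None: DFLASH_REGISTRY, "domino": DOMINO_REGISTRY}


def select_registry(arch_config):
    """Pick the conversion registry for a draft architecture config (a dict here)."""
    projector_type = arch_config.get("projector_type")
    if projector_type not in REGISTRY_BY_PROJECTOR:
        supported = sorted(k for k in REGISTRY_BY_PROJECTOR if k is not None)
        raise ValueError(
            f"Unsupported projector_type {projector_type!r}; "
            f"expected one of {supported} or no value for plain DFlash."
        )
    return REGISTRY_BY_PROJECTOR[projector_type]


print(select_registry({"projector_type": "domino"})["model_cls"])
print(select_registry({})["model_cls"])
```

Keeping two separate registries means a lookup for the DFlash model class can never return the Domino subclass by accident.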

Note: the inference side (vLLM / AR evaluation) is intentionally not wired up yet — the correction head is not applied in serving. To be added once the inference path lands.

Usage

# Online training (recipe: projector_type=domino)
uv run launch.py --yaml examples/Qwen/Qwen3-8B/hf_online_domino.yaml --yes

Testing

CPU unit tests in tests/unit/torch/speculative/plugins/test_hf_domino.py cover conversion routing, the training forward (dual loss + grads), the λ schedule, and the export format. Online Qwen3-8B training validated end-to-end (loss curve below).

[Image: loss curve from the online Qwen3-8B training run]

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (opt-in via projector_type=domino; DFlash path unchanged)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A (no new dependency)
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅
  • Did you get Claude approval on this PR?: ❌

Additional Information

Reference: SpecForge PR #571 (z-lab); drafter format huggingface.co/Huang2020/Qwen3-8B-Domino-b16.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Domino speculative-decoding training with a curriculum-based loss scaling schedule.
    • Extended speculative model export to include Domino-specific draft head configuration.
  • Documentation & Configuration

    • Added a Domino speculative-decoding training recipe.
    • Added an HF Online Domino launcher configuration for Qwen3-8B.
  • Tests

    • Added CPU unit tests for Domino conversion, training loss behavior, lambda scheduling, and Domino export outputs.

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 13, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0aecc1ee-ad66-4101-9890-443e63e322f7

📥 Commits

Reviewing files that changed from the base of the PR and between ba737bc and 35437c5.

📒 Files selected for processing (3)
  • modelopt/torch/export/plugins/hf_spec_export.py
  • modelopt/torch/speculative/plugins/hf_domino.py
  • modelopt_recipes/general/speculative_decoding/domino.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • modelopt_recipes/general/speculative_decoding/domino.yaml
  • modelopt/torch/export/plugins/hf_spec_export.py
  • modelopt/torch/speculative/plugins/hf_domino.py

📝 Walkthrough


Adds a "Domino" speculative decoding training variant on top of DFlash. New components include a DominoModule with a GRU causal correction head, an HFDominoModel plugin with dual cross-entropy loss and lambda curriculum scheduling, a DominoDMRegistry for routing conversion, a DominoExporter for config export, and supporting recipe/launcher YAML files. Inference is not yet wired.

Changes

Domino Speculative Decoding Training

| Layer / File(s) | Summary |
| --- | --- |
| Config fields and conversion registry routing: `modelopt/torch/speculative/config.py`, `modelopt/torch/speculative/dflash/conversion.py`, `modelopt/torch/speculative/plugins/__init__.py` | Adds dflash_lambda_base_start and dflash_lambda_base_decay_ratio Pydantic fields to DFlashConfig. Introduces DominoDMRegistry and updates convert_to_dflash_model to select it when projector_type == "domino". Re-exports hf_domino symbols from the plugins package. |
| HFDFlashModel extensibility hook: `modelopt/torch/speculative/plugins/hf_dflash.py` | Refactors HFDFlashModel.modify() to use an overridable _build_draft_module factory hook instead of directly constructing DFlashModule, enabling subclasses to build augmented draft modules while reusing setup. |
| DominoModule definition: `modelopt/torch/speculative/plugins/modeling_domino.py` | Introduces DominoModule(DFlashModule) with a bias-free prefix_gru GRU and an embed_proj MLP head for vocab-sized logit corrections, initialized via _init_head_weights by sampling from a normal distribution. |
| HFDominoModel training plugin and DominoLambdaCallback: `modelopt/torch/speculative/plugins/hf_domino.py` | Adds compute_lambda_base for linear decay scheduling. Implements HFDominoModel with _apply_domino_head (per-block GRU + suffix logit correction), _compute_domino_loss (dual weighted cross-entropy with exponential decay and accuracy metrics), and forward (anchor sampling, backbone+head execution). Adds DominoLambdaCallback to update _lambda_base each trainer step. |
| DominoExporter and training entry-point callback: `modelopt/torch/export/plugins/hf_spec_export.py`, `examples/speculative_decoding/main.py` | Adds DominoExporter(DFlashExporter) overriding _export_config to inject emb_dim and GRU/projector fields into config.json. Registers DominoLambdaCallback in the training script when projector_type == "domino". |
| Unit tests for conversion, forward, schedule, and export: `tests/unit/torch/speculative/plugins/test_hf_domino.py` | Covers conversion routing to HFDominoModel/DominoModule, GRU structure and head dimensions, forward-pass dual loss with gradient flow, lambda_base = 0 behavior, compute_lambda_base linear decay, and safetensors/config.json export layout. |
| Domino recipe config, launcher pipeline, and changelog: `modelopt_recipes/general/speculative_decoding/domino.yaml`, `tools/launcher/examples/Qwen/Qwen3-8B/hf_online_domino.yaml`, `CHANGELOG.rst` | Adds domino.yaml training recipe with projector_type: domino, GRU/draft dimensions, and lambda curriculum fields. Adds a two-task Qwen3-8B SLURM launcher pipeline. Adds a CHANGELOG entry noting training support and pending inference wiring. |

Sequence Diagram(s)

sequenceDiagram
  participant Trainer
  participant DominoLambdaCallback
  participant HFDominoModel
  participant DraftBackbone
  participant DominoModule

  Trainer->>DominoLambdaCallback: on_step_begin(global_step)
  DominoLambdaCallback->>HFDominoModel: _lambda_base = compute_lambda_base(global_step, total_steps, ...)
  Trainer->>HFDominoModel: forward(input_ids, labels, loss_mask)
  HFDominoModel->>HFDominoModel: anchor sampling + loss_mask build
  HFDominoModel->>DraftBackbone: run backbone (no-grad base hidden states)
  DraftBackbone-->>HFDominoModel: hidden_states, base_logits
  HFDominoModel->>DominoModule: _apply_domino_head(hidden_states, base_logits, anchors)
  DominoModule-->>HFDominoModel: corrected final_logits
  HFDominoModel->>HFDominoModel: _compute_domino_loss(final_logits, base_logits, _lambda_base)
  HFDominoModel-->>Trainer: ModelOutput(loss, base_loss, final_loss, base_accuracy, lambda_base)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • yeyu-nvidia
  • benchislett
  • ChenhanYu
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title '[Feat]: Domino support' is partially related to the changeset: it refers to a real part of the change (Domino feature addition) but is overly broad and vague, lacking specificity about what Domino is or what aspect is being added. | Consider a more specific title like '[Feat]: Add Domino speculative decoding with GRU causal correction head' or '[Feat]: Implement Domino variant with dual-loss training curriculum' to better convey the primary change. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 82.14% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | All Python files pass security audit against SECURITY.md guidelines. No unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, nosec comments, or new dependencies detected. |


@github-actions

github-actions Bot commented Jun 13, 2026

Contributor
PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1710/

Built to branch gh-pages at 2026-06-15 21:56 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov Bot commented Jun 13, 2026


Codecov Report

❌ Patch coverage is 84.31373% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.00%. Comparing base (9f37fe1) to head (35437c5).
⚠️ Report is 12 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/speculative/plugins/hf_domino.py | 79.73% | 31 Missing ⚠️ |
| ...elopt/torch/speculative/plugins/modeling_domino.py | 95.65% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1710      +/-   ##
==========================================
- Coverage   77.12%   77.00%   -0.13%     
==========================================
  Files         511      513       +2     
  Lines       56236    56614     +378     
==========================================
+ Hits        43374    43596     +222     
- Misses      12862    13018     +156     
| Flag | Coverage Δ |
| --- | --- |
| unit | 54.66% <84.31%> (+0.27%) ⬆️ |


@h-guo18 h-guo18 changed the title from "add domino support" to "[Feat]: Domino support" Jun 13, 2026
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
@h-guo18 h-guo18 marked this pull request as ready for review June 15, 2026 18:52
@h-guo18 h-guo18 requested review from a team as code owners June 15, 2026 18:52
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
@h-guo18 h-guo18 requested a review from ChenhanYu June 15, 2026 18:53
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@h-guo18

h-guo18 commented Jun 15, 2026

Contributor Author

/claude review

@coderabbitai coderabbitai Bot left a comment

Contributor

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/speculative_decoding/main.py`:
- Line 285: The import of DominoLambdaCallback from
modelopt.torch.speculative.plugins.hf_domino at line 285 is currently inside a
function, which violates the repository's import-placement guidelines. Move this
import to the module scope at the top of the file with other imports, unless
there is a specific reason for the in-function placement (such as circular
dependency, optional dependency, or performance concerns). If such a reason
exists, keep the in-function import but add a brief comment above it explaining
the concrete justification for the non-standard placement.

In `@modelopt/torch/speculative/config.py`:
- Around line 135-150: Add schema bounds validation to the
dflash_lambda_base_start and dflash_lambda_base_decay_ratio ModeloptField
definitions to enforce that these normalized weight/fraction fields accept only
values in the valid range (0 to 1). This will cause invalid configuration values
to be rejected at config load time rather than being silently masked downstream,
following the coding guideline to validate external input at the interface
boundary.

In `@modelopt/torch/speculative/dflash/conversion.py`:
- Around line 44-49: The registry selection logic currently silently defaults to
DFlashDMRegistry for any unknown projector_type value, which can hide typos and
route users incorrectly. Replace the conditional expression with explicit
validation that checks if the projector_type is one of the supported values
("domino" or the default). Raise an appropriate error (e.g., ValueError) if the
projector_type is unsupported, ensuring invalid input is rejected at the
interface boundary rather than silently falling back to a default registry.

In `@modelopt/torch/speculative/plugins/modeling_domino.py`:
- Around line 58-59: Add validation for the pure_draft_prefix_len attribute
immediately after it is read from config in the initialization block. The
validation should check that pure_draft_prefix_len is non-negative and strictly
less than the block_size to ensure suffix correction works properly. If the
validation fails, raise a clear ValueError with a descriptive message indicating
the valid range requirement. This validation should occur at the config
interface boundary during module initialization, right after line 58 where
pure_draft_prefix_len is assigned from the config getattr call.

In `@tests/unit/torch/speculative/plugins/test_hf_domino.py`:
- Line 99: Move the in-function import of DFlashModule from line 99 and the
corresponding import at line 184 to the top-level of the test module (at the
beginning of the file with other imports). These imports do not have explicit
circular dependency, optional dependency, or heavy-import justifications that
would warrant keeping them in-function, so they should follow test conventions
by being at module-level to catch import errors during test collection rather
than execution.

In `@tools/launcher/examples/Qwen/Qwen3-8B/hf_online_domino.yaml`:
- Around line 72-75: The environment section in the YAML configuration file for
this new Qwen3-8B model config is missing two required environment variables. In
the environment list at lines 72-75 (which currently contains only
MAX_FINAL_LOSS and MIN_FINAL_ACC), add the two required launcher environment
variables MLM_MODEL_CFG and QUANT_CFG as additional list items, following the
same format as the existing environment variables. These must be explicitly set
according to the launcher coding guidelines for new model configurations.
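
The range checks requested above for the λ fields and for pure_draft_prefix_len could look roughly like the following. This is a plain-dataclass sketch, not the repository's ModeloptField/Pydantic schema; the class name, the shortened field names, and the default values (including block_size=16) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DominoConfigSketch:
    """Hypothetical config holder that rejects out-of-range values on construction."""

    lambda_base_start: float = 1.0        # assumed default
    lambda_base_decay_ratio: float = 1.0  # assumed default
    block_size: int = 16                  # assumed default
    pure_draft_prefix_len: int = 0        # assumed default

    def __post_init__(self):
        # Normalized weights and fractions must stay inside [0, 1].
        for name in ("lambda_base_start", "lambda_base_decay_ratio"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        # The corrected suffix must be non-empty, so the prefix is < block_size.
        if not 0 <= self.pure_draft_prefix_len < self.block_size:
            raise ValueError(
                "pure_draft_prefix_len must satisfy 0 <= value < block_size "
                f"({self.block_size}), got {self.pure_draft_prefix_len}"
            )


print(DominoConfigSketch(lambda_base_decay_ratio=0.5))
for bad in ({"lambda_base_start": 1.5}, {"pure_draft_prefix_len": 16}):
    try:
        DominoConfigSketch(**bad)
    except ValueError as err:
        print("rejected:", err)
```

Failing in the constructor surfaces a bad value when the config is loaded instead of partway through a training run.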

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: afea7d4d-38ed-4205-b10b-8d8ee062ad26

📥 Commits

Reviewing files that changed from the base of the PR and between 9f37fe1 and 9d904c3.

📒 Files selected for processing (12)
  • CHANGELOG.rst
  • examples/speculative_decoding/main.py
  • modelopt/torch/export/plugins/hf_spec_export.py
  • modelopt/torch/speculative/config.py
  • modelopt/torch/speculative/dflash/conversion.py
  • modelopt/torch/speculative/plugins/__init__.py
  • modelopt/torch/speculative/plugins/hf_dflash.py
  • modelopt/torch/speculative/plugins/hf_domino.py
  • modelopt/torch/speculative/plugins/modeling_domino.py
  • modelopt_recipes/general/speculative_decoding/domino.yaml
  • tests/unit/torch/speculative/plugins/test_hf_domino.py
  • tools/launcher/examples/Qwen/Qwen3-8B/hf_online_domino.yaml

Comment thread examples/speculative_decoding/main.py Outdated
Comment thread modelopt/torch/speculative/config.py
Comment thread modelopt/torch/speculative/dflash/conversion.py Outdated
Comment thread modelopt/torch/speculative/plugins/modeling_domino.py
Comment thread tests/unit/torch/speculative/plugins/test_hf_domino.py Outdated
Comment thread tools/launcher/examples/Qwen/Qwen3-8B/hf_online_domino.yaml

@claude claude Bot left a comment

Claude review

Reviewed the Domino training-only addition end-to-end. The algorithm is correct: GRU teacher-forcing on input_ids[anchor..anchor+bs-1] with shift_label=True predicts anchor+k+1 from anchor..anchor+k (no leakage), the suffix slice + base-logit add matches the SpecForge formulation, and the dual loss / λ-curriculum are wired correctly. The DominoDMRegistry split keeps HFDominoModel from shadowing HFDFlashModel, and config / state-dict round-tripping looks safe (new fields have defaults; old saved configs without projector_type keep routing to the DFlash registry).
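
The no-leakage claim can be checked by index bookkeeping alone. The sketch below is a hypothetical helper, not code from the PR: for each step k in a block it lists which token positions a causal state has consumed (anchor..anchor+k) and which position is the prediction target (anchor+k+1).

```python
def teacher_forcing_pairs(anchor, block_size, seq_len):
    """Return (seen_positions, target_position) for each step of one draft block.

    Hypothetical illustration of the alignment described in the review:
    step k consumes positions anchor..anchor+k and predicts anchor+k+1.
    """
    pairs = []
    for k in range(block_size):
        target = anchor + k + 1
        if target >= seq_len:
            break  # no label exists past the end of the sequence
        seen = list(range(anchor, anchor + k + 1))
        pairs.append((seen, target))
    return pairs


pairs = teacher_forcing_pairs(anchor=4, block_size=4, seq_len=16)
for seen, target in pairs:
    assert target not in seen        # the label is never part of the input
    assert target == seen[-1] + 1    # it is always the next position
    print(f"seen={seen} -> predicts position {target}")
```

Because the target always lies one position past the last consumed token, a causal recurrence over the teacher-forced block cannot observe its own label.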

Findings

  • CRITICAL: 0
  • IMPORTANT: 1
    • Silent eval bypass: in non-training mode forward delegates to HFDFlashModel.forward, so pseudo_speculative_generate / AR validation never applies the trained Domino head. Acknowledged in the PR description, but an estimate_ar/ar_validate_steps user would silently get backbone-only acceptance numbers. Suggest a logger.warning_once here and a short note in domino.yaml.
  • SUGGESTION: 3
    • DominoExporter._export_config uses getattr(draft_config, "emb_dim") (no default) — fails at export time after a long train if the user's dflash_architecture_config omits a head field. Prefer validating in HFDominoModel.modify.
    • DominoLambdaCallback falls back to total_steps=1 when state.max_steps is unset, which silently flips λ_base to 0 from step 1. Add a one-shot warning when the fallback is taken.
    • With lambda_base == 1.0 the head params drop out of the autograd graph; the recipe correctly sets ddp_find_unused_parameters: true but the dependency is invisible — worth a one-line note next to that flag in domino.yaml.
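
The one-shot warning suggested for the total_steps fallback might be sketched as follows. This is a stand-alone class, not the PR's DominoLambdaCallback and not a real trainer callback: the class name, the `on_step_begin(global_step, max_steps)` signature, the `fallback_warnings` counter, and the linear schedule are all assumptions.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("domino_sketch")


class LambdaBaseCallbackSketch:
    """Hypothetical per-step lambda_base updater with a fail-loud fallback."""

    def __init__(self, model, start=1.0, decay_ratio=1.0):
        self.model = model
        self.start = start
        self.decay_ratio = decay_ratio
        self.fallback_warnings = 0  # number of times the fallback warning fired

    def _resolve_total_steps(self, max_steps):
        if max_steps is not None and max_steps > 0:
            return max_steps
        if self.fallback_warnings == 0:  # warn only once
            logger.warning(
                "max_steps is unset; falling back to total_steps=1, "
                "so lambda_base drops to 0 after the first step."
            )
            self.fallback_warnings += 1
        return 1

    def on_step_begin(self, global_step, max_steps):
        total_steps = self._resolve_total_steps(max_steps)
        decay_steps = max(1, int(total_steps * self.decay_ratio))
        progress = min(max(global_step, 0), decay_steps) / decay_steps
        self.model._lambda_base = self.start * (1.0 - progress)


model = SimpleNamespace(_lambda_base=1.0)
callback = LambdaBaseCallbackSketch(model)
for step in range(3):
    callback.on_step_begin(global_step=step, max_steps=0)  # 0 stands for "unset"
    print(step, model._lambda_base)
```

In this sketch the fallback still collapses λ_base to 0 at step 1, but the run log now records why.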

Overall risk

Low. Training-only path, opt-in via projector_type=domino, no behavior change for existing DFlash users. The findings are about ergonomics / fail-loud-vs-fail-silent rather than algorithm correctness.
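
As one illustration of the fail-loud direction, required head fields can be checked when the model is converted, before any training time is spent. The helper below is hypothetical: only `emb_dim` is named in this review, so `gru_hidden_size` is a made-up placeholder for the other GRU/projector fields, and `validate_head_config` is not an existing function.

```python
from types import SimpleNamespace

# Only emb_dim appears in the review; gru_hidden_size is a hypothetical placeholder.
REQUIRED_HEAD_FIELDS = ("emb_dim", "gru_hidden_size")


def validate_head_config(draft_config, required=REQUIRED_HEAD_FIELDS):
    """Raise early if any field the exporter will need is missing or None."""
    missing = [name for name in required if getattr(draft_config, name, None) is None]
    if missing:
        raise ValueError(
            f"Draft config is missing required head field(s): {', '.join(missing)}"
        )
    return {name: getattr(draft_config, name) for name in required}


good = SimpleNamespace(emb_dim=1024, gru_hidden_size=1024)
print(validate_head_config(good))

try:
    validate_head_config(SimpleNamespace(gru_hidden_size=1024))
except ValueError as err:
    print("rejected:", err)
```

Running a check like this at conversion time turns a late export failure into an immediate configuration error.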

Comment thread modelopt/torch/speculative/plugins/hf_domino.py
Comment thread modelopt/torch/export/plugins/hf_spec_export.py Outdated
Comment thread modelopt/torch/speculative/plugins/hf_domino.py Outdated
Comment thread modelopt/torch/speculative/plugins/hf_domino.py
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
@h-guo18 h-guo18 requested a review from a team as a code owner June 15, 2026 21:40
@h-guo18 h-guo18 self-assigned this Jun 15, 2026
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
@ChenhanYu

Collaborator

The results look good. Is the final checkpoint the same as DFlash, or does it require additional vLLM support?

@h-guo18

h-guo18 commented Jun 15, 2026

Contributor Author

The results look good. Is the final checkpoint the same as DFlash, or does it require additional vLLM support?

The final checkpoint contains new components beyond those of a typical DFlash checkpoint. Support for them is not yet available in the vLLM/TRTLLM main branches. We can add serve/specdecbench support once it is.
