Enable NVFP4 grouped MLP GLU RHT amax path by sraman-rgb · Pull Request #3073 · NVIDIA/TransformerEngine

sraman-rgb · 2026-06-01T21:41:33Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-06-01T21:47:45Z

Greptile Summary

This PR adds the NVFP4 grouped MLP GLU + Random Hadamard Transform (RHT) amax path, enabling a new fused kernel (grouped_gemm_glu_hadamard_wrapper_sm100) that jointly runs the FC1 GEMM+GLU with NVFP4 RHT amax collection, then reuses those pre-computed amaxes for FC2 input quantization rather than re-scanning the activation output.

C++ layer: Refactors group_quantize_nvfp4_impl to accept a compute_amax flag (skipping the Hadamard amax step when pre-computed values are provided), adds allreduce_nvfp4_amax_tensors helper, and introduces nvfp4_group_quantize_with_amax for the multi-group NVFP4 path. quantize gains a compute_amax=true default parameter for the single-group path.
Python layer: Extracts _wrap_single_nvfp4_as_grouped as a reusable helper, adds _group_quantize_with_amax_for_grouped_mlp, and teaches ForwardGroupedMLP_CuTeGEMMGLU to use grouped_gemm_glu_hadamard_kernel when all conditions are met (NVFP4 + swiglu + with_rht + with_post_rht_amax). The TMEM env-var flag is cached via lru_cache to avoid hot-path lookups.
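The core idea in the summary above (collect the amax while producing the FC1 output, then hand it to the FC2 input quantizer instead of scanning again) can be sketched in plain Python. This is a toy model only: it omits the Random Hadamard Transform and NVFP4 block scaling, and `swiglu_with_amax` and `pick_amax` are hypothetical names, not functions from this PR.

```python
import math
from typing import List, Optional, Sequence, Tuple


def swiglu_with_amax(gate: Sequence[float], up: Sequence[float]) -> Tuple[List[float], float]:
    """Compute SwiGLU and track the absolute maximum in the same pass.

    Hypothetical stand-in for a fused producer that emits both the
    activation output and its amax.
    """
    out: List[float] = []
    amax = 0.0
    for g, u in zip(gate, up):
        y = (g / (1.0 + math.exp(-g))) * u  # silu(g) * u
        out.append(y)
        amax = max(amax, abs(y))
    return out, amax


def pick_amax(values: Sequence[float], precomputed_amax: Optional[float] = None) -> float:
    """Return the amax to quantize with, re-scanning only when none is supplied."""
    if precomputed_amax is not None:
        return precomputed_amax  # reuse: no second pass over `values`
    return max((abs(v) for v in values), default=0.0)
```

With a precomputed value, the consumer skips the extra read of the activation tensor. That saved pass is the benefit the summary attributes to the fused path.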

Confidence Score: 5/5

Safe to merge; the allreduce ordering is correct in both the multi-group and single-group paths, and the new GLU+Hadamard kernel selection is properly gated with graceful fallback.

The distributed allreduce logic is the highest-risk area. In the multi-group path, allreduce_nvfp4_amax_tensors is called before group_quantize_nvfp4_impl, so the cast kernel always sees globally-reduced amaxes. In the single-group path, reduce_amaxes() fires at line 2473 of quantize_impl, before the nvte_quantize_with_hadamard_transform call, so the same ordering guarantee holds. The empty-input early-return correctly participates in the collective when compute_amax=false. The only concern is that quantize_impl is now public, which is a minor API surface issue with no present defect.
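The ordering argument can be illustrated with a small simulation, assuming a MAX all-reduce across ranks. Everything here is a hypothetical stand-in: `allreduce_max` models the collective with a plain list, and the single scale `amax / 6.0` is a simplification rather than the scaling scheme used by the real cast kernel.

```python
from typing import List, Sequence


def local_amax(shard: Sequence[float]) -> float:
    """Per-rank amax; an empty shard contributes 0.0."""
    return max((abs(v) for v in shard), default=0.0)


def allreduce_max(per_rank: Sequence[float]) -> List[float]:
    """Stand-in for a MAX all-reduce: every rank receives the global maximum."""
    global_max = max(per_rank)
    return [global_max for _ in per_rank]


def scales_reduce_then_cast(shards: Sequence[Sequence[float]]) -> List[float]:
    """Reduce amaxes first, then derive the scale each rank would cast with."""
    amaxes = [local_amax(s) for s in shards]  # every rank contributes, even if empty
    global_amaxes = allreduce_max(amaxes)     # reduction happens before any cast
    return [a / 6.0 if a > 0.0 else 1.0 for a in global_amaxes]
```

Because the reduction runs before the cast, all ranks derive the same scale. The rank with an empty shard still takes part in the reduction, which mirrors why the empty-input early return has to join the collective rather than skip it.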

transformer_engine/pytorch/csrc/common.h — the promotion of quantize_impl to public is worth a second look before wider adoption of this API.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/csrc/common.h	Moves `quantize_impl` from `private` to `public` in `NVFP4Quantizer` to allow C++ extension code to call it directly with `compute_amax=false`; widens API surface of the class.
transformer_engine/pytorch/csrc/extensions/cast.cpp	Adds `allreduce_nvfp4_amax_tensors` helper, refactors `group_quantize_nvfp4_impl` to require RHT+post-RHT-amax and accept `compute_amax` flag, and introduces `nvfp4_group_quantize_with_amax` that allreduces externally-provided amaxes before the cast kernel.
transformer_engine/pytorch/csrc/quantizer.cpp	Restructures `quantize_impl` so that `reduce_amaxes()` runs before the cast kernel for both `compute_amax=true` and `compute_amax=false`; the empty-input early-return correctly calls `reduce_amaxes()` in the `compute_amax=false` branch to keep distributed collectives synchronized.
transformer_engine/pytorch/csrc/extensions/pybind.cpp	Adds `compute_amax=true` default to the `quantize` Python binding and registers `nvfp4_group_quantize_with_amax` via a helper; clean, backward-compatible.
transformer_engine/pytorch/ops/_common.py	Extracts `_wrap_single_nvfp4_as_grouped`, adds `_group_quantize_with_amax_for_grouped_mlp` for both multi-group (C++ allreduce) and single-group (allreduce inside `quantize_impl`) NVFP4 paths.
transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py	Adds `_use_tmem_post_rht_amax` (properly `lru_cache`d), detects and activates the GLU+Hadamard path when conditions are met, routes amax tensors from the kernel output into `_group_quantize_with_amax_for_grouped_mlp`.
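Caching an environment-variable flag with `lru_cache`, as described for `_use_tmem_post_rht_amax`, might look roughly like the sketch below. The variable name `EXAMPLE_TMEM_FLAG` and the function `use_tmem_flag` are invented for illustration; the PR's actual flag name is not shown on this page.

```python
import os
from functools import lru_cache

# Hypothetical environment variable name, used only for this example.
_FLAG = "EXAMPLE_TMEM_FLAG"


@lru_cache(maxsize=None)
def use_tmem_flag() -> bool:
    """Read the flag once; later calls are served from the cache."""
    return os.environ.get(_FLAG, "0") == "1"
```

The trade-off is that changes to the environment variable after the first call are ignored until `use_tmem_flag.cache_clear()` is called.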

Sequence Diagram

sequenceDiagram
    participant FW as ForwardGroupedMLP (forward)
    participant KG as grouped_gemm_glu_hadamard_kernel
    participant PY as _group_quantize_with_amax_for_grouped_mlp
    participant CPP as nvfp4_group_quantize_with_amax (C++)
    participant QI as quantize_impl (C++)
    participant AR as allreduce / reduce_amaxes
    participant CK as nvte cast kernel

    FW->>KG: FC1 GEMM + GLU + RHT amax
    KG-->>FW: "fc1_kernel_out {d_tensor, amax_tensor, post_rht_amax_tensor}"
    FW->>PY: _group_quantize_with_amax_for_grouped_mlp(...)
    alt "num_groups != 1"
        PY->>CPP: tex.nvfp4_group_quantize_with_amax
        CPP->>AR: allreduce_nvfp4_amax_tensors (before kernel)
        AR-->>CPP: global amaxes
        CPP->>CK: "group_quantize_nvfp4_impl(compute_amax=false)"
        CK-->>PY: GroupedTensor
    else "num_groups == 1"
        PY->>QI: "tex.quantize(compute_amax=False)"
        QI->>AR: reduce_amaxes() at line 2473 (before cast kernel)
        AR-->>QI: global amaxes
        QI->>CK: nvte_quantize_with_hadamard_transform
        CK-->>PY: FP4 tensor wrapped as GroupedTensor
    end
    FW->>FW: FC2 GEMM using grouped_fc2_x
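
The two branches of the diagram can be traced with a short sketch that only records the order of steps. The step names are copied from the diagram; the function `trace_precomputed_amax_path` is hypothetical and calls no real kernels.

```python
from typing import List


def trace_precomputed_amax_path(num_groups: int) -> List[str]:
    """Return the ordered steps the diagram shows for a given group count."""
    steps: List[str] = []
    if num_groups != 1:
        # Multi-group path: amaxes are all-reduced in C++ before the cast.
        steps.append("allreduce_nvfp4_amax_tensors")
        steps.append("group_quantize_nvfp4_impl(compute_amax=false)")
    else:
        # Single-group path: quantize_impl reduces amaxes, then casts.
        steps.append("reduce_amaxes")
        steps.append("nvte_quantize_with_hadamard_transform")
    return steps
```

In both branches the reduction step comes first, which is the ordering property the review summary relies on.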

Reviews (7): Last reviewed commit: "Address NVFP4 precomputed amax review co..."

vthumbe1503

Mostly LGTM. Left a few comments on code duplication and other minor issues.

timmoon10 · 2026-06-02T20:41:21Z


 py::object quantize(const at::Tensor &tensor, py::handle quantizer, const py::object &output,
-                    std::optional<at::Tensor> noop_flag) {
+                    std::optional<at::Tensor> noop_flag, bool compute_amax) {


This is leaking NVFP4-specific details into a generic API.

Enable NVFP4 grouped MLP GLU RHT amax path

8842a9a

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

sraman-rgb requested review from ksivaman and timmoon10 as code owners June 1, 2026 21:41

github-actions Bot added the community-contribution label (PRs from external contributors outside the core maintainers, representing community-driven work) Jun 1, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

3c1c4c9

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 1, 2026

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp

Comment thread transformer_engine/pytorch/ops/_common.py Outdated

Comment thread transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py Outdated

vthumbe1503 reviewed Jun 1, 2026

sraman-rgb and others added 3 commits June 1, 2026 16:37

Address NVFP4 GLU RHT amax review comments

98343dc

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

Deduplicate grouped NVFP4 quantize helper

0e9f54f

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

7cdd8f4

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 2, 2026

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated

Route precomputed amax through NVFP4 quantize

ff2bfd7

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

greptile-apps Bot reviewed Jun 2, 2026

Comment thread transformer_engine/pytorch/ops/_common.py

sraman-rgb added 2 commits June 2, 2026 10:00

Reduce single-group NVFP4 precomputed amaxes

ec3f106

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

Handle empty NVFP4 precomputed amax reduction

2614527

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

timmoon10 requested changes Jun 2, 2026

Address NVFP4 precomputed amax review comments

22b9b0e

Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>

sraman-rgb wants to merge 9 commits into NVIDIA:main from sraman-rgb:nvfp4-grouped-mlp-glu-rht-amax

3 participants
