[DeepSeek-V4] Implement model integration, decoders, and configuration stack #4153

Open
parambole wants to merge 1 commit into main from dsv4_model_integrate

@parambole parambole commented Jun 11, 2026

Collaborator

Description

This PR introduces native architectural and routing support for the DeepSeek V4 model in MaxText.

Why & What: DeepSeek V4 introduces non-uniform architectural features that require explicit configuration unrolling. This PR handles them by implementing:

  • Compressed Attention (CSA/HCA): Bypasses standard MLA instantiation and natively integrates DeepSeek V4's alternating CSA and HCA attention blocks.
  • Hybrid Routing: Implements DeepSeek's transition from fixed Hash Routing (early layers) to learned Token Routing (later layers) natively within the MoE framework.
  • Architectural Scanning: Unrolls the 44-layer configuration to properly handle the [0, 0] prefix compression ratios, the perfectly alternating [4, 128] scanned middle layers, and the [4, 0] suffix layers.
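The layer layout described above can be sketched as a per-layer list of compression ratios. This is an illustrative reading of the description, not the code in this PR: it assumes each bracketed pair lists one ratio per layer (a two-layer prefix, a two-layer suffix, and a two-layer pattern repeated across the middle), and `build_compress_ratios` is a hypothetical helper name.

```python
from itertools import cycle, islice


def build_compress_ratios(num_layers=44, prefix=(0, 0), pattern=(4, 128), suffix=(4, 0)):
    """Return one compression ratio per decoder layer (hypothetical helper).

    Assumption: the prefix and suffix are unrolled layer by layer, and the
    middle layers repeat `pattern` an integer number of times so they can be scanned.
    """
    num_middle = num_layers - len(prefix) - len(suffix)
    if num_middle < 0 or num_middle % len(pattern) != 0:
        raise ValueError("middle layers must be a whole number of pattern repeats")
    middle = list(islice(cycle(pattern), num_middle))
    return list(prefix) + middle + list(suffix)


ratios = build_compress_ratios()
# Number of times the two-layer pattern repeats in the scanned middle section.
num_scan_repeats = (len(ratios) - 4) // 2
print(len(ratios), num_scan_repeats)
print(ratios[:6], ratios[-4:])
```

Under these assumptions the 44 layers split into 2 prefix layers, 40 scanned middle layers, and 2 suffix layers, which is why the middle section can be scanned as a repeated two-layer block while the ends are unrolled.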

Tests

  • Unit Tests: Verified mathematical parity against reference implementations using tests/unit/deepseek_v4_vs_reference_test.py.
  • E2E Compilation: Successfully compiled the full DeepSeek V4 model on a simulated v5p-512 mesh to verify HLO generation and check the memory footprint.

Compile Command to Reproduce:

python3 -m maxtext.trainers.pre_train.train_compile src/maxtext/configs/base.yml \
  base_output_directory=/tmp/maxtext_logs \
  run_name=dsv4_v5p512_compile \
  per_device_batch_size=1 \
  enable_checkpointing=false \
  model_name=deepseek4 \
  compile_topology=v5p-512 \
  compile_topology_num_slices=1 \
  ici_fsdp_parallelism=-1 \
  steps=1 \
  max_target_length=4096 \
  async_checkpointing=false \
  tokenizer_type=huggingface \
  tokenizer_path=deepseek-ai/DeepSeek-V3 \
  attention=dot_product \
  dtype=bfloat16 \
  weight_dtype=bfloat16 \
  megablox=False \
  sparse_matmul=False \
  dataset_type=synthetic \
  scan_layers=true

Proof of Compilation:

Memory analysis: CompiledMemoryStats(generated_code_size_in_bytes=260855808, argument_size_in_bytes=18401962496, output_size_in_bytes=18401889280, alias_size_in_bytes=18401880576, temp_size_in_bytes=94892786400, host_generated_code_size_in_bytes=0, host_argument_size_in_bytes=0, host_output_size_in_bytes=0, host_alias_size_in_bytes=0, host_temp_size_in_bytes=0)
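For readability, the byte counts in the memory analysis can be converted to GiB. The snippet below is only a convenience sketch that parses the line quoted above with a regular expression; `parse_stats` and `to_gib` are hypothetical helper names, not part of MaxText or JAX.

```python
import re

# The stats line exactly as reported in this PR description.
STATS = (
    "CompiledMemoryStats(generated_code_size_in_bytes=260855808, "
    "argument_size_in_bytes=18401962496, output_size_in_bytes=18401889280, "
    "alias_size_in_bytes=18401880576, temp_size_in_bytes=94892786400, "
    "host_generated_code_size_in_bytes=0, host_argument_size_in_bytes=0, "
    "host_output_size_in_bytes=0, host_alias_size_in_bytes=0, "
    "host_temp_size_in_bytes=0)"
)


def parse_stats(text):
    """Extract every `name=integer` pair from the printed stats line."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", text)}


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


stats = parse_stats(STATS)
for name, value in stats.items():
    if value:  # skip the host-side fields, which are all zero here
        print(f"{name}: {to_gib(value):.2f} GiB")
```

The temporary buffers dominate the reported figures, with the argument, output, and alias sizes nearly equal to one another.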

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov Bot commented Jun 11, 2026


@parambole parambole force-pushed the dsv4_model_integrate branch 2 times, most recently from 2a19018 to 23adce0 June 12, 2026 20:00
@parambole parambole marked this pull request as ready for review June 12, 2026 20:09
@parambole parambole changed the title from "Add DeepSeek V4 architecture support" to "[DeepSeek-V4] Implement model integration, decoders, and configuration stack" Jun 12, 2026
@github-actions


🤖 Hi @parambole, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions


🤖 I'm sorry @parambole, but I was unable to process your request. Please see the logs for more details.

Comment thread src/maxtext/configs/models/deepseek4.yml Outdated
This commit introduces full support for DeepSeek V4 by integrating its
compressed attention mechanisms, MoE routing, and architectural layers.

Key changes:
- Add `deepseek4.yml` configuration and `DeepSeek4DecoderLayer` implementation.
- Implement hybrid Hash Routing and Token Routing for MoE layers.
- Add prefix/suffix layer unrolling for non-uniform compression blocks.
- Fix Pydantic validation for base MLP dimensions.
- Bypass MLA instantiation in favor of native CompressedAttention (CSA/HCA).
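The hybrid routing idea in the change list can be illustrated with a small pure-Python sketch: early layers pick experts from a fixed function of the token id, later layers pick the top-k experts by learned router score. Everything here is an assumption made for illustration. The modulo "hash", the `num_hash_layers` cutoff, and the function names are hypothetical, and the PR text does not describe the actual hash function or layer threshold.

```python
def hash_route(token_id, num_experts, top_k):
    """Fixed routing: experts depend only on the token id.

    The modulo-based choice is a stand-in hash, not the one used by the model.
    """
    start = token_id % num_experts
    return [(start + i) % num_experts for i in range(top_k)]


def token_route(router_logits, top_k):
    """Learned routing: pick the top-k experts by router score."""
    order = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    return order[:top_k]


def route(layer_idx, token_id, router_logits, num_hash_layers, top_k):
    """Dispatch to hash routing for early layers and token routing afterwards."""
    if layer_idx < num_hash_layers:
        return hash_route(token_id, len(router_logits), top_k)
    return token_route(router_logits, top_k)


logits = [0.1, 0.2, 0.3, 0.9]  # made-up router scores for 4 experts
early = route(layer_idx=0, token_id=5, router_logits=logits, num_hash_layers=3, top_k=2)
late = route(layer_idx=3, token_id=5, router_logits=logits, num_hash_layers=3, top_k=2)
print(early, late)
```

In this sketch the early-layer assignment ignores the router scores entirely, so it stays the same for a given token id regardless of what the gate has learned.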
@parambole parambole force-pushed the dsv4_model_integrate branch from 23adce0 to 6deaacc June 12, 2026 21:17
Comment thread src/maxtext/configs/models/deepseek4.yml

@entrpn entrpn left a comment

Collaborator

just one comment, everything else looks good.

@RissyRan

Collaborator

Are you able to do a real run and check the profile to see if the scan blocks are ordered as expected? A compile test won't be able to catch a runtime error.

4 participants