Update multimodal docs with new models #4148

Open
hengtaoguo wants to merge 1 commit into main from hengtaoguo-doc

@hengtaoguo hengtaoguo commented Jun 11, 2026

Collaborator

Description

  • Update multimodal.md to include new models supported over the past few months.
  • Add a Qwen3-Omni announcement to the qwen_moe doc, including E2E test scripts.

Tests

Build docs for readthedocs:

# Build:
cd docs && sphinx-build -b html . _build/html

# Serve:
python -m http.server -d _build/html

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@Rohan-Bierneni Rohan-Bierneni left a comment

Collaborator

LGTM! Can we also add a link to this multimodal doc, for the qwen3.5 section, in the passage below?

Qwen3 is a family of open-source large language models from the Qwen team at Alibaba. This documentation covers the integration of the following Qwen Mixture-of-Experts (MoE) models into MaxText:

codecov Bot commented Jun 11, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

Comment thread tests/end_to_end/tpu/qwen/moe/run_qwen_moe.md
Comment thread tests/end_to_end/tpu/qwen/moe/qwen3-omni-30b-a3b/1_test_qwen3_omni_30b_a3b.sh Outdated

@Rohan-Bierneni Rohan-Bierneni left a comment

Collaborator

Thank you for the change. I have left some comments.

@Rohan-Bierneni Rohan-Bierneni left a comment

Collaborator

Overall, the scripts are well formatted and cleanly split between parts 1 and 2. The main changes needed are in script 2, to make it compatible with our XLML tests.

Once this is merged, we will also need another PR in XLML to add the tests to our DAG.

Thank you for making these changes!

fi

# ---
# Step 1: Checkpoint Conversion

Collaborator

Can you also add the command to convert to a scanned checkpoint below?

If a scanned checkpoint will not be used, or is not supported in the checkpoint util, then no worries, we can skip it.

Collaborator Author

Because of the deepstack features in the vision encoder, Omni doesn't have full scanned-checkpoint support. We can skip it and use unscanned for now.

export TOKENIZER_PATH="${TOKENIZER_PATH:-Qwen/Qwen3-Omni-30B-A3B-Instruct}"

# Base output path where the MaxText checkpoint from Step 1 was written.
export BASE_OUTPUT_PATH="${BASE_OUTPUT_PATH:-gs://your-gcs-bucket/qwen3-omni-30b-a3b_maxtext_ckpt}"

Collaborator

Since this will be used by xlml for tests, the default value should output to gs://runner-maxtext-logs/$(date +%Y-%m-%d-%H-%M).

Similar to how it is set here:

Collaborator Author

Good call, done.
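
A minimal sketch of the suggested default, assuming the script keeps the `${VAR:-default}` pattern it already uses for TOKENIZER_PATH. The exact line committed to the PR is not shown in this thread and may differ.

```shell
# Sketch only: a caller-provided BASE_OUTPUT_PATH wins; otherwise fall back to
# a timestamped run directory under the bucket named in the review comment.
export BASE_OUTPUT_PATH="${BASE_OUTPUT_PATH:-gs://runner-maxtext-logs/$(date +%Y-%m-%d-%H-%M)}"
echo "Using BASE_OUTPUT_PATH = ${BASE_OUTPUT_PATH}"
```

Because the timestamp has minute resolution, two runs started within the same minute would share a directory unless the caller overrides the variable.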

BASE_OUTPUT_PATH=${BASE_OUTPUT_PATH%/}
echo "Using BASE_OUTPUT_PATH = ${BASE_OUTPUT_PATH}"

UNSCANNED_CKPT_PATH=${BASE_OUTPUT_PATH}/unscanned/0/items

Collaborator

XLML will pull an already-converted checkpoint from gs://maxtext-model-checkpoints.

Can you upload a converted checkpoint to this GCS bucket and set the value here to point to that checkpoint?

Collaborator Author

Uploaded a new checkpoint to gs://maxtext-model-checkpoints/qwen3-omni-30b-a3b/unscanned/0/items.

exit 1
fi

# Strip trailing slash from base path to avoid malformed URIs

Collaborator

We should be able to remove this logic if we use the hardcoded GCS bucket gs://runner-maxtext-logs.

Collaborator Author

Removed this redundant block.
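
For context, the logic under discussion relied on the shell's `%/` suffix-removal expansion, as in `BASE_OUTPUT_PATH=${BASE_OUTPUT_PATH%/}`. A small sketch of what that expansion does, using a hypothetical bucket name (`gs://example-bucket` is an assumption for illustration only):

```shell
# ${var%/} removes a single trailing "/" if one is present and leaves
# every other value untouched, so joined paths never contain "//".
for path in "gs://example-bucket/run/" "gs://example-bucket/run"; do
  stripped="${path%/}"
  echo "${stripped}/unscanned/0/items"
done
```

Both iterations print the same joined path, which is why the step becomes unnecessary once the default is a fixed bucket path with no trailing slash.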

UNSCANNED_CKPT_PATH=${BASE_OUTPUT_PATH}/unscanned/0/items

# ---
# Step 2a: Multimodal Decode — text + image

Collaborator

Before the decode tests, do you think we should have a forward_pass_logit_check test as well? It would compare against the HF golden logits, which would need to be stored in a specific GCS bucket.

Ex:

GOLDEN_LOGITS_DISK_LOCATION="/deps/tests/assets/golden_logits/golden_data_${MODEL_NAME}.jsonl"

Collaborator Author

That would be great, but the vision branch may run into precision issues: we translated the image-processing steps from Torch to JAX, and some interpolation operations cannot be matched closely. So for now we mostly check whether the outputs make sense. I will do that as a follow-up.

# Uses a test image from the repo assets.
# max_prefill_predict_length accounts for image tokens (~256) + text prompt tokens.
# ---
python3 -m maxtext.inference.decode src/maxtext/configs/base.yml \

Collaborator

The way we currently point to the base.yml file in other scripts is like this:

${MAXTEXT_CONFIGS_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/maxtext/configs}/base.yml

Can we change the other instances in this script to match, for consistency?

Collaborator Author

Thanks for the catch, all base.yml paths have been updated.
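
To make the fallback order of that nested expansion concrete, here is a small sketch. The helper name `resolve_base_yml` and the example directories `/opt/maxtext` and `/etc/maxtext/configs` are hypothetical and used only for illustration; the expansion itself is the one quoted in the review comment.

```shell
# Hypothetical helper: prints the base.yml path the nested default yields.
resolve_base_yml() {
  echo "${MAXTEXT_CONFIGS_DIR:-${MAXTEXT_REPO_ROOT:-$PWD}/src/maxtext/configs}/base.yml"
}

# 1) Neither variable set: falls back to the current working directory.
( unset MAXTEXT_CONFIGS_DIR MAXTEXT_REPO_ROOT; resolve_base_yml )

# 2) Only the repo root set: configs are resolved underneath it.
( unset MAXTEXT_CONFIGS_DIR; MAXTEXT_REPO_ROOT=/opt/maxtext; resolve_base_yml )

# 3) Configs dir set: it takes precedence and the repo root is ignored.
( MAXTEXT_CONFIGS_DIR=/etc/maxtext/configs; MAXTEXT_REPO_ROOT=/opt/maxtext; resolve_base_yml )
```

Each case runs in a subshell so the assignments do not leak into the calling environment.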
