extract_answer: prefer boxed{N} extraction, fall back to legacy tags by py4 · Pull Request #4150 · AI-Hypercomputer/maxtext

py4 · 2026-06-11T22:35:34Z

Summary

extract_answer (used by the built-in check_numbers reward) returned the raw <answer> content. Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit the final answer as \boxed{N} (the maxtext GSM8K chat template itself asks for {solution_start_token}\boxed{}{solution_end_token}). So the extractor returned the literal string \boxed{42}, which math_verify cannot match against a bare numeric gold like 42. Result: ~0% accuracy on Qwen3/GSM8K even when the model's numeric answer is correct, a silent scoring failure rather than a model failure.

This makes the extractor consistent with maxtext's own chat template.

Strategy (priority order)

1. If <answer>...</answer> is present, scope to the last block's content; otherwise use the full response.
2. Inside the scope, extract the last \boxed{N} via a brace-balanced scan (handles nested LaTeX like \boxed{\frac{1}{2}}), with a permissive regex fallback.
3. If no \boxed is found, fall back to the legacy {solution_start_token}...{solution_end_token} regex, so recipes that emit plain-text answers are unaffected.

Step 3 is what keeps this backward compatible.
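The three tiers can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in utils_rl.py: the constants `SOLUTION_START`, `SOLUTION_END`, and the value of `FALLBACK_ANSWER`, as well as the helper name `last_boxed`, are hypothetical stand-ins, and the permissive regex fallback mentioned in step 2 is omitted because its exact form is not given here.

```python
import re

# Hypothetical stand-ins for the configured solution tokens and fallback value.
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"
FALLBACK_ANSWER = ""


def last_boxed(text):
  # Return the content of the last balanced \boxed{...} in text, or None.
  marker = r"\boxed{"
  result = None
  pos = 0
  while True:
    start = text.find(marker, pos)
    if start == -1:
      return result
    begin = start + len(marker)
    depth = 1
    i = begin
    # Walk forward, tracking brace depth so nested LaTeX stays intact.
    while i < len(text) and depth > 0:
      if text[i] == "{":
        depth += 1
      elif text[i] == "}":
        depth -= 1
      i += 1
    if depth == 0:
      result = text[begin:i - 1].strip()
      pos = i
    else:
      # Unbalanced braces: skip this marker and keep looking further on.
      pos = begin


def extract_answer(response):
  pattern = re.escape(SOLUTION_START) + r"(.*?)" + re.escape(SOLUTION_END)
  blocks = re.findall(pattern, response, flags=re.DOTALL)
  # Tier 1: scope to the last answer block, else the whole response.
  scope = blocks[-1] if blocks else response
  # Tier 2: prefer the last \boxed{...} inside the scope.
  boxed = last_boxed(scope)
  if boxed is not None:
    return boxed
  # Tier 3: legacy behaviour, plain-text content of the last answer block.
  if blocks:
    return blocks[-1].strip()
  return FALLBACK_ANSWER
```

With this sketch, a response ending in `<answer>\boxed{42}</answer>` yields `42`, while a plain `<answer>18</answer>` still yields `18` through the tier-3 path.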

Tests

tests/post_training/unit/extract_answer_test.py (10 cases, cpu_only + post_training):

Boxed extraction: inside <answer> tags, without tags, nested LaTeX, multiple boxed (last wins), whitespace stripping, negatives, and <answer>-tag scoping over a \boxed that appears in <reasoning>.

Legacy fallback (no \boxed): plain-text answer inside <answer> tags still extracts; last-answer-wins preserved; no-answer returns FALLBACK_ANSWER.

All 10 executed and passed against the real function (Ran 10 tests ... OK).

Files

File	Change
`src/maxtext/trainers/post_train/rl/utils_rl.py`	`extract_answer`: boxed extraction + legacy fallback (+40/-4)
`tests/post_training/unit/extract_answer_test.py`	new, 10 cases

Checklist

Pyink-clean (--pyink-indentation=2 --line-length=122)
Backward compatible: legacy plain-text answers still extract via the tier-3 fallback
No effect on non-RL paths
Unit test covering boxed extraction + legacy fallback, verified passing

Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit `\boxed{N}` inside `<answer>...</answer>` (or with no answer tags at all). The legacy regex returned the raw `<answer>` content (e.g. `\boxed{42}` as a string), which math_verify cannot match against a bare numeric gold like "42". Result: ~0% accuracy on Qwen3/GSM8K even when the model's numeric answer is correct.

New strategy (priority order):

1. If `<answer>...</answer>` is present, use the last block's content as the search scope; otherwise use the full response.
2. Inside the scope, extract the last `\boxed{N}` via brace-balanced scan + permissive regex fallback.
3. If no `\boxed` is found, fall back to the legacy `{solution_start_token}...{solution_end_token}` regex (backward-compat for recipes that emit plain-text answers).

codecov · 2026-06-11T22:43:07Z

Codecov Report

❌ Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/trainers/post_train/rl/utils_rl.py	80.95%	2 Missing and 2 partials ⚠️

py4 wants to merge 1 commit into main from pr/extract-answer-boxed
