extract_answer: prefer boxed{N} extraction, fall back to legacy tags #4150
Open
py4 wants to merge 1 commit into
Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit `\boxed{N}`
inside `<answer>...</answer>` (or with no answer tags at all). The
legacy regex returned the raw `<answer>` content — e.g. `\boxed{42}`
as a string — which math_verify cannot match against a bare numeric
gold like "42". Result: ~0% accuracy on Qwen3/GSM8K even when the
model's numeric answer is correct.
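The failure mode can be reproduced in a few lines. This is only a sketch: the token values `<answer>` / `</answer>`, the pattern, and the helper name `legacy_extract` are assumptions for illustration, not the actual legacy code.

```python
import re
from typing import Optional

# Assumed token values; the commit message only names the placeholders
# {solution_start_token} and {solution_end_token}.
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"

# Hypothetical stand-in for the legacy pattern: capture whatever sits between the tags.
LEGACY_RE = re.compile(
    re.escape(SOLUTION_START) + r"(.*?)" + re.escape(SOLUTION_END), re.DOTALL
)


def legacy_extract(response: str) -> Optional[str]:
  """Return the stripped content of the last tagged block, or None if there is none."""
  matches = LEGACY_RE.findall(response)
  return matches[-1].strip() if matches else None


response = "<reasoning>6 * 7 = 42</reasoning><answer>\\boxed{42}</answer>"
extracted = legacy_extract(response)
print(extracted)          # \boxed{42}
print(extracted == "42")  # False: the wrapper is still attached
```

The numeric answer is right, but the extracted string still carries the `\boxed{...}` wrapper, so a comparison against the bare gold `42` has nothing to match.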
New strategy (priority order):
1. If `<answer>...</answer>` is present, use the last block's
content as the search scope; otherwise use the full response.
2. Inside the scope, extract the last `\boxed{N}` via brace-balanced
scan + permissive regex fallback.
3. If no `\boxed` is found, fall back to the legacy
`{solution_start_token}...{solution_end_token}` regex
(backward-compat for recipes that emit plain-text answers).
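The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the diff: the token values, the `FALLBACK_ANSWER` value, the regexes, and the names `last_boxed` and `extract_answer_sketch` are all hypothetical.

```python
import re
from typing import Optional

# Assumed values; the PR refers to these only by name.
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"
FALLBACK_ANSWER = ""  # hypothetical sentinel; the real value is not shown in the PR

_TAG_RE = re.compile(
    re.escape(SOLUTION_START) + r"(.*?)" + re.escape(SOLUTION_END), re.DOTALL
)
# Permissive fallback for unbalanced input: flat content only, no nested braces.
_BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def last_boxed(text: str) -> Optional[str]:
  """Return the content of the last boxed expression in text, or None."""
  opener = "\\boxed{"
  start = text.rfind(opener)
  if start == -1:
    return None
  begin = start + len(opener)
  depth = 1
  # Brace-balanced scan: track nesting depth until the opening brace is closed.
  for pos in range(begin, len(text)):
    if text[pos] == "{":
      depth += 1
    elif text[pos] == "}":
      depth -= 1
      if depth == 0:
        return text[begin:pos].strip()
  # The last opener never closed; try the permissive regex instead.
  matches = _BOXED_RE.findall(text)
  return matches[-1].strip() if matches else None


def extract_answer_sketch(response: str) -> str:
  # Step 1: scope to the last tagged block if any, else the full response.
  blocks = _TAG_RE.findall(response)
  scope = blocks[-1] if blocks else response
  # Step 2: prefer the last boxed expression inside the scope.
  boxed = last_boxed(scope)
  if boxed is not None:
    return boxed
  # Step 3: legacy behaviour, returning the plain content of the last tagged block.
  if blocks:
    return blocks[-1].strip()
  return FALLBACK_ANSWER


print(extract_answer_sketch("<answer>\\boxed{42}</answer>"))            # 42
print(extract_answer_sketch("so the result is \\boxed{\\frac{1}{2}}"))  # \frac{1}{2}
print(extract_answer_sketch("<answer>8</answer>"))                      # 8
```

Scoping before searching means a `\boxed` that appears only in the reasoning text is ignored whenever an answer block exists, and the plain-text path is reached only when no boxed expression is found in scope.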
Summary
`extract_answer` (used by the built-in `check_numbers` reward) returned the raw `<answer>` content. Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit the final answer as `\boxed{N}` (the maxtext GSM8K chat template itself asks for `{solution_start_token}\boxed{}{solution_end_token}`). So the extractor returned the literal string `\boxed{42}`, which `math_verify` cannot match against a bare numeric gold like `42`. Result: ~0% accuracy on Qwen3/GSM8K even when the model's numeric answer is correct, a silent scoring failure rather than a model failure.

This makes the extractor consistent with maxtext's own chat template.
Strategy (priority order)
1. If `<answer>...</answer>` is present, scope to the last block's content; otherwise use the full response.
2. Inside the scope, extract the last `\boxed{N}` via a brace-balanced scan (handles nested LaTeX like `\boxed{\frac{1}{2}}`), with a permissive regex fallback.
3. If no `\boxed` is found, fall back to the legacy `{solution_start_token}...{solution_end_token}` regex, so recipes that emit plain-text answers are unaffected.

Step 3 is what keeps this backward compatible.
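To illustrate why a brace-balanced scan matters for nested LaTeX, the snippet below runs two plain-regex attempts against nested input. The patterns are illustrative assumptions, not the ones used in the diff.

```python
import re

nested = r"\boxed{\frac{1}{2}}"

# A non-greedy capture stops at the first closing brace, cutting the fraction short.
lazy = re.search(r"\\boxed\{(.*?)\}", nested).group(1)
print(lazy)    # \frac{1

# A greedy capture runs to the last closing brace in the text and overshoots
# as soon as anything with braces follows the boxed expression.
trailing = r"\boxed{\frac{1}{2}} and \textbf{done}"
greedy = re.search(r"\\boxed\{(.*)\}", trailing).group(1)
print(greedy)  # \frac{1}{2}} and \textbf{done
```

Neither pattern can tell which closing brace pairs with the opening one, which is the job of the depth counter in a balanced scan.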
Tests
`tests/post_training/unit/extract_answer_test.py` (10 cases, `cpu_only` + `post_training`):

- Boxed extraction: `<answer>` tags, without tags, nested LaTeX, multiple boxed (last wins), whitespace stripping, negatives, and `<answer>`-tag scoping over a `\boxed` that appears in `<reasoning>`.
- Legacy fallback (no `\boxed`): `<answer>` tags still extract; last-answer-wins preserved; no-answer returns `FALLBACK_ANSWER`.

All 10 executed and passed against the real function (`Ran 10 tests ... OK`).

Files
- `src/maxtext/trainers/post_train/rl/utils_rl.py`: `extract_answer`: boxed extraction + legacy fallback (+40/-4)
- `tests/post_training/unit/extract_answer_test.py`

Checklist
--pyink-indentation=2 --line-length=122)