On Qwen3 base models, verifier choice alone can swing reported MATH-500 accuracy by ~80 percentage points end-to-end — without touching the model. The dominant axis, once a \boxed{...} answer-extraction fallback is in place (covered in our earlier post), is symbolic equivalence: responses such as \dfrac{1}{2} vs. \frac{1}{2}, 0.5 vs. \frac{1}{2}, or 2\sqrt{2} vs. \sqrt{8} are all mathematically equivalent but fail under string-comparison scorers and succeed under sympy-based ones — alone worth ~18 pp on Qwen3-8B base MATH-500. We walk through a three-policy decomposition (A → B → C) that pins down where each percentage point comes from. We noticed the gap while building CRISP, a reasoning-compression method; we discuss a CRISP-specific complication in the final section.
While evaluating Qwen3 base models on MATH-500 we cross-checked accuracy under two scorer implementations: an internal extension of math_dapo (the scoring scripts that ship with the DAPO-Math training corpus), and HuggingFace math_verify (a sympy-backed reference implementation widely used in modern reasoning evaluations such as lm-eval-harness and Open-R1). On the same generated responses, the two scorers disagreed by ~20 percentage points on the Qwen3-{8,14}B MATH-500 baselines.
The puzzle. If both numbers come from the same generated MATH-500 responses, where do the 20 percentage points come from?
The rest of this post walks through what we found, which turns out to be more interesting than “the two libraries disagree.” The gap decomposes cleanly into two distinct mechanisms (answer extraction and symbolic equivalence), and the second — which we believe is under-discussed in the literature — dominates on Qwen3 outputs.
We originally hit this while iterating on CRISP, an on-policy self-distillation method for compressing reasoning-model outputs (arXiv:2603.05433). The base-model story stands on its own and is the focus of this post; a separate complication that arises only after training is in §7.
To make the discussion concrete, every accuracy number in this post comes from sampling at temperature=0.6, top_p=0.95, max_tokens=30000 (Qwen3 thinking-mode's recommended decoding settings), with this prompt template:
Solve the following math problem step by step. The last line
of your response should be of the form Answer: $Answer (without
quotes) where $Answer is the answer to the problem.
{problem}
This is the prompt template that ships with the DAPO-Math corpus, and the explicit Answer: $Answer instruction is exactly what math_dapo's extractor is built to expect.
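For readers who want to reproduce the setup, here is a minimal sketch of how the prompt and decoding settings might be assembled. The template text and the three decoding values come from the description above; the names `PROMPT_TEMPLATE`, `DECODING`, and `build_prompt`, and the blank line before `{problem}`, are assumptions made for illustration.

```python
# Hypothetical helper names; only the template wording and the three
# decoding values are taken from the post.
PROMPT_TEMPLATE = (
    "Solve the following math problem step by step. The last line "
    "of your response should be of the form Answer: $Answer (without "
    "quotes) where $Answer is the answer to the problem.\n\n"
    "{problem}"
)

# Sampling settings quoted in the text (Qwen3 thinking-mode recommendations).
DECODING = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 30000}


def build_prompt(problem: str) -> str:
    """Fill the template with one problem statement.

    str.replace is used instead of str.format so that any other literal
    braces in the template would never be interpreted as format fields.
    """
    return PROMPT_TEMPLATE.replace("{problem}", problem)


if __name__ == "__main__":
    print(build_prompt(r"What is $\frac{1}{2} + \frac{1}{2}$?"))
```

The `DECODING` dictionary is deliberately framework-neutral; how it is passed to a sampler depends on the inference stack in use.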
There is a tension that lives one layer underneath. Qwen3 thinking-mode was trained with \boxed{...} as the final-answer convention — the post-training data wires the model to emit <think>...</think> followed by a polished solution ending in \boxed{X}. At eval time we ask it to follow a different instructional contract (Answer: $Answer). The base model's response to this conflict is to produce both: a polished solution with a \boxed{X} at the bottom, and sometimes a final Answer: X line. Often only the \boxed{X} survives — the model honours its training before it honours the prompt's last-line instruction.
This eval-prompt-vs-training-format mismatch is the underlying reason the three scorer policies in the next section read such different numbers from the same generations. SCORER A trusts the eval prompt's contract; SCORER B and C add fallbacks that catch the model's trained habit.
The same (response, gold_answer) pair can be turned into a score in dramatically different ways. We isolate three policies that span the design space encountered in current base-model evaluation pipelines.
SCORER A: math_dapo (vanilla). Extract: regex on last 300 chars for Answer: X. Compare: string equality after a fixed normalization rule list (\dfrac→\frac, strip \left/\right, strip $ delimiters, etc.).
Misses ~60–98% of Qwen3 base-model responses because thinking-mode Qwen3 writes \boxed{...} after </think>, not Answer:.
SCORER B: math_dapo + \boxed{} fallback. Extract: SCORER A regex; if no match, also try \boxed{X}. Compare: same string equality after normalization.
A natural fix once you notice Qwen3 emits boxed answers. The 70–80% baselines commonly reported on Qwen3 MATH-500 use this policy.
SCORER C: math_verify. Extract: HuggingFace math_verify's default extraction (LatexExtractionConfig for \boxed{...}, ExprExtractionConfig for anchored expressions). Compare: sympy symbolic equivalence.
A single library call on the full response. Closest to what most modern reasoning-eval pipelines (lm-eval, Open-R1) use out of the box, and what we recommend as the default for base-model evaluation.
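The first two policies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not math_dapo's actual code: the `Answer:` pattern, the function names, and the three-rule normalization (a small subset of the rule list described above) are all hypothetical.

```python
import re

# Hypothetical pattern; the real extractor's regex may differ.
ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def normalize(ans: str) -> str:
    """Apply a tiny, illustrative subset of a fixed normalization rule list."""
    ans = ans.strip().replace("\\dfrac", "\\frac")
    ans = ans.replace("\\left", "").replace("\\right", "")
    ans = ans.replace("$", "")
    return ans.replace(" ", "")


def extract_answer_line(response: str):
    """SCORER A extraction: last 'Answer: X' in the final 300 characters."""
    matches = ANSWER_RE.findall(response[-300:])
    return matches[-1].strip() if matches else None


def extract_boxed(response: str):
    """Return the payload of the last \\boxed{...}, using brace matching."""
    start = response.rfind("\\boxed{")
    if start == -1:
        return None
    begin = start + len("\\boxed{")
    depth = 1
    for i in range(begin, len(response)):
        if response[i] == "{":
            depth += 1
        elif response[i] == "}":
            depth -= 1
            if depth == 0:
                return response[begin:i]
    return None  # unbalanced braces


def scorer_a(response: str, gold: str) -> int:
    pred = extract_answer_line(response)
    return int(pred is not None and normalize(pred) == normalize(gold))


def scorer_b(response: str, gold: str) -> int:
    # Same as SCORER A, but fall back to \boxed{...} when no Answer: line exists.
    pred = extract_answer_line(response)
    if pred is None:
        pred = extract_boxed(response)
    return int(pred is not None and normalize(pred) == normalize(gold))
```

On a response that ends in `\boxed{42}` with no `Answer:` line, `scorer_a` returns 0 and `scorer_b` returns 1. Both return 0 when the boxed payload is `\left(\dfrac{1}{2}\right)` and gold is `\frac{1}{2}`, because the parentheses survive normalization.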
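The defining design choices are along two axes:
- Extraction: does the scorer read only the Answer: X line, or does it also fall back to \boxed{X}?
- Comparison: is equivalence decided by string equality after normalization, or by sympy?

Our earlier post focused on the first axis (Answer Format Sensitivity in Qwen3 Math Reasoning Evaluation); this post focuses on the second. A subtle fourth policy, useful only after a model has been trained in a way that shifts its output format, is introduced in §7.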
We re-scored the same per-sample generations (500 problems × 8 samples on MATH-500, 30 problems × 8 samples on each AIME year) for Qwen3-{8,14}B base under all three policies.
| Model | Benchmark | A | B | **C** |
|---|---|---|---|---|
| Qwen3-8B | MATH-500 | 18.55 | 78.15 | **96.35** |
| | AIME 24 | 5.00 | 73.33 | **75.83** |
| | AIME 25 | 2.92 | 65.42 | **66.67** |
| Qwen3-14B | MATH-500 | 35.05 | 71.73 | **94.53** |
| | AIME 24 | 2.08 | 76.67 | **80.00** |
| | AIME 25 | 2.92 | 68.75 | **70.42** |

Accuracy (%) under each scorer policy on identical generations. Bold = the column we recommend as the default for base-model reporting.
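Re-scoring fixed generations under several policies is mechanically simple. The sketch below assumes that each reported accuracy is the mean over all problem × sample pairs (for MATH-500, 500 × 8 = 4000 records); the function name `rescore` and the toy scorers are hypothetical stand-ins, not the actual pipeline.

```python
from statistics import mean


def rescore(records, scorers):
    """Score identical generations under several policies.

    records: iterable of (response, gold) pairs, one per sample.
    scorers: dict mapping a policy name to a callable returning 0 or 1.
    Returns accuracy in percent for each policy.
    """
    records = list(records)  # materialize once so every scorer sees the same data
    return {
        name: 100.0 * mean(score(resp, gold) for resp, gold in records)
        for name, score in scorers.items()
    }


if __name__ == "__main__":
    # Toy data: two responses use an Answer: line, two use \boxed{}.
    toy = [
        ("Answer: 4", "4"),
        ("\\boxed{4}", "4"),
        ("\\boxed{5}", "4"),
        ("Answer: 3", "4"),
    ]
    toy_scorers = {
        "answer_line_only": lambda r, g: int(r == f"Answer: {g}"),
        "with_boxed_fallback": lambda r, g: int(r in (f"Answer: {g}", f"\\boxed{{{g}}}")),
    }
    print(rescore(toy, toy_scorers))
```

Because the generations are held fixed, any difference between the returned numbers is attributable to the scorer alone.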
The table is information-dense; the most striking observations:
- A → B (\boxed{} fallback) recovers most of the gap on MATH-500: +59.6 pp on 8B, +36.7 pp on 14B. This is the extraction story documented in our earlier post.
- B → C (sympy symbolic equivalence) recovers another +18.2 pp on 8B MATH-500 and +22.8 pp on 14B MATH-500. This is the gap that lifts the 70–80% baseline to roughly 94–96%, and it is the focus of this post.

Stepping across the three policies on a single cell pins down where each percentage point originates. Taking Qwen3-8B base MATH-500 as the running example:
| Transition | Δ on 8B-base MATH-500 | Mechanism |
|---|---|---|
| A → B | +59.60 pp | Add \boxed{} fallback to the extractor |
| B → C | +18.20 pp | Replace string-equality with sympy symbolic equivalence |
The two contributors are the \boxed{} extraction fallback (A → B, +59.60 pp) and the sympy symbolic-equivalence backend (B → C, +18.20 pp). The first arises because Qwen3 emits a boxed answer after </think> and the Answer: regex misses it; this is the subject of our earlier post. The second — the focus of this post — arises because once the answer is extracted, the model's surface form often differs from gold in ways that no fixed string-normalisation table can canonicalise.
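The decomposition is plain subtraction over the MATH-500 rows of the results table. A short sketch (the dictionary layout and the function name `decompose` are illustrative choices):

```python
# MATH-500 accuracies (%) copied from the results table.
ACC = {
    "Qwen3-8B": {"A": 18.55, "B": 78.15, "C": 96.35},
    "Qwen3-14B": {"A": 35.05, "B": 71.73, "C": 94.53},
}


def decompose(acc):
    """Split the end-to-end A -> C gap into its two transitions (in pp)."""
    return {
        "A->B": round(acc["B"] - acc["A"], 2),  # extraction fallback
        "B->C": round(acc["C"] - acc["B"], 2),  # symbolic equivalence
        "A->C": round(acc["C"] - acc["A"], 2),  # end-to-end
    }


if __name__ == "__main__":
    for model, acc in ACC.items():
        print(model, decompose(acc))
```

For Qwen3-8B this gives 59.6, 18.2, and 77.8 pp. The last figure is the end-to-end swing between SCORER A and SCORER C.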
What does a symbolic-equivalence-recoverable failure look like? Five archetypal cases from the actual MATH-500 samples:
| Gold | Model response (tail) | B | C |
|---|---|---|---|
| \frac{1}{2} | \boxed{\left(\dfrac{1}{2}\right)} | 0 | 1 |
| 1/2 | \boxed{0.5} | 0 | 1 |
| \sqrt{8} | \boxed{2\sqrt{2}} | 0 | 1 |
| (x-1)^2 | \boxed{x^2 - 2x + 1} | 0 | 1 |
| \frac{p-q}{2} | \boxed{\frac{1}{2}(p - q)} | 0 | 1 |
Under SCORER B, each of these is treated as a wrong answer because the post-normalisation string differs from gold. Under SCORER C, math_verify's sympy backend recognises every one as the correct answer. None of the model behaviour changed; only the scorer's verdict did.
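To build intuition for why these pairs differ as strings yet agree as mathematics, here is a standard-library stand-in. It is explicitly not what math_verify does: math_verify relies on sympy, whereas this sketch swaps in numeric probing (evaluating both sides at random points). The expressions are hand-translated from LaTeX into Python syntax, and `numerically_equal` and `PAIRS` are hypothetical names.

```python
import math
import random


def numerically_equal(expr_a, expr_b, variables=(), trials=20, tol=1e-9):
    """Compare two Python-syntax expressions by value at random points.

    This is numeric probing, a stand-in for symbolic equivalence. It uses
    eval and must only be run on trusted input.
    """
    rng = random.Random(0)  # fixed seed for reproducibility
    base_env = {"sqrt": math.sqrt, "__builtins__": {}}
    for _ in range(trials):
        env = dict(base_env, **{v: rng.uniform(1.0, 5.0) for v in variables})
        a = eval(expr_a, env)
        b = eval(expr_b, env)
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            return False
    return True


# (gold, model answer, free variables), hand-translated from the table above.
PAIRS = [
    ("1/2", "0.5", ()),
    ("sqrt(8)", "2*sqrt(2)", ()),
    ("(x-1)**2", "x**2 - 2*x + 1", ("x",)),
    ("(p-q)/2", "(1/2)*(p - q)", ("p", "q")),
]

if __name__ == "__main__":
    for gold, pred, free in PAIRS:
        print(gold, "|", pred, "| string:", gold == pred,
              "| value:", numerically_equal(gold, pred, free))
```

Every pair fails the string comparison and passes the value comparison. A symbolic backend reaches the same verdict without sampling, which is why it is the more robust choice in a real scorer.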
The headline is general. Verifier choice is a first-order methodological decision in reasoning-model evaluation, not a downstream detail. An ~80-pp accuracy gap on identical generated responses can hinge entirely on the scorer's extraction policy (Answer:-regex vs. \boxed{...} extraction vs. both) and on whether equivalence is checked by string match or symbolic computation. This is true on plain Qwen3 base models with no training pipeline involved.
Two specific recommendations for future reasoning-evaluation reports:
- Apply a single fixed scorer policy to every row being compared, so that a difference between rows reflects the model rather than the scorer.
- State the exact scorer configuration. Writing only “math_verify” is ambiguous: default ExtractionConfig or custom? Single-call or dual-pass? A small implementation footnote eliminates ambiguity that can swing the headline by 20 pp.

Headline accuracy claims for reasoning models can otherwise reflect scorer rebasing rather than model behaviour and, as the next section shows, the effect can grow when training itself reshapes the format of the model's output.
The three-policy taxonomy above is sufficient for evaluating base models. For models that have been trained in a way that changes their output format, a subtle gap opens between SCORER C and an extended dual-pass policy that we call SCORER D. We hit this while building CRISP and want to share the mechanism, because it can plausibly arise from any training method whose objective indirectly reshapes the model's output format.
CRISP (Compressed Reasoning via Iterative Self-Policy Distillation; arXiv:2603.05433) is a teacher–student framework for compressing reasoning-model outputs. The teacher is the base model conditioned on a concise system prompt; because the prompt nudges the model toward brevity, the teacher's per-token distribution favours shorter outputs by construction. The student is the same base model, unconditioned, trained via reverse-KL to match the teacher's distribution. This transfers the teacher's brevity into the unconditioned student without needing ground-truth labels or a token-budget penalty. A typical Qwen3-8B response on MATH-500 drops from ~4,900 tokens at base to ~2,050 tokens at step-100 (a 59% reduction) at comparable accuracy.
CRISP's teacher is the base model under a concise system prompt. When Qwen3 is asked to be concise, it strips LaTeX decoration before it strips content:
- It shortens the post-</think> answer block (where \dfrac, \left/\right, $...$ typically live).
- It ends with a bare Answer: X line because that's the shortest form satisfying the prompt's instruction.
- It writes plain payloads (0.5 in place of \dfrac{1}{2}; 17 in place of \boxed{17}).

The student inherits this style via reverse-KL. Re-scoring the trained model under the same three policies shows the effect:
| Model | Benchmark | State | A | B | C |
|---|---|---|---|---|---|
| Qwen3-8B | MATH-500 | base | 18.55 | 78.15 | 96.35 |
| | MATH-500 | step-100 | 84.97 | 85.70 | 93.00 |
| Qwen3-14B | MATH-500 | base | 35.05 | 71.73 | 94.53 |
| | MATH-500 | step-100 | 84.17 | 84.55 | 91.00 |
Accuracy (%) on MATH-500, base vs CRISP step-100, under the same three scorers as §4.
SCORER A's extractor now finds an Answer: X line in most responses, and its normalisation rule list succeeds on payloads it previously failed on (a decorated \dfrac{1}{2} that didn't string-match gold \frac{1}{2} becomes the simpler 1/2, which does match after normalisation): A jumps from 18.55 to 84.97 on 8B, and B follows. SCORER C, by contrast, drops from 96.35 to 93.00. Same model behaviour at the level of correctness; different surface form; different scorer reading.
This is a generic concern for any training method whose objective indirectly reshapes output format: cross-row comparisons under different scorers can read large gains or large regressions depending on which scorer the reader applies. The conservative practice (as already recommended in §6) is to apply a single fixed policy to both rows.
The C-column drop on step-100 deserves a closer look. A material fraction of step-100 responses take the form Answer: X with no accompanying \boxed{X} — the trained model treats the Answer: line as its sole final commitment, exactly as the prompt instructed.
SCORER C (single-call math_verify) cannot extract these unless X is a digit-led number or arithmetic expression — the only forms ExprExtractionConfig's anchored regex (in math_verify/parser.py) matches:
number_re = r"-?\d+(\.\d+)?"
operators = ['+','-','*','/','^','(',')']
expr_re = r"-?\(?-?\d[\d\.\s+\-*/^()]*[+\-*/^()][\d\.\s+\-*/^()]+\)?"
expr_or_number = (expr_re | number_re)
For MATH-500 text answers (Evelyn, north), multi-choice letters (A, B), and fraction or set expressions written without LaTeX delimiters, SCORER C returns 0 despite the model having followed the prompt's instructional contract verbatim.
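The effect of those patterns can be checked directly. The sketch below reuses the two regexes quoted above and applies them with `re.fullmatch` to a bare payload; treating "anchored" as a full match on the payload is a simplifying assumption, and `looks_extractable` is a hypothetical helper, not part of math_verify.

```python
import re

# Patterns as quoted in the text above.
number_re = r"-?\d+(\.\d+)?"
expr_re = r"-?\(?-?\d[\d\.\s+\-*/^()]*[+\-*/^()][\d\.\s+\-*/^()]+\)?"
expr_or_number = re.compile(rf"(?:{expr_re}|{number_re})")


def looks_extractable(payload: str) -> bool:
    """True if a bare payload is a digit-led number or arithmetic expression.

    Simplification: the whole payload must match, which approximates the
    anchored behaviour described in the text.
    """
    return expr_or_number.fullmatch(payload.strip()) is not None


if __name__ == "__main__":
    for payload in ["17", "0.5", "1/2", "Evelyn", "north", "A", r"\frac{1}{2}", "{1, 2}"]:
        print(f"{payload!r:>16} -> {looks_extractable(payload)}")
```

Digit-led payloads such as `17`, `0.5`, and `1/2` match. Names, choice letters, and undelimited LaTeX or set notation do not, which is the class of answers that loses credit under single-call extraction.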
The fix is a dual-pass extension on top of SCORER C:
SCORER D: math_verify dual-pass. Extract: SCORER C, plus a second pass that regex-extracts Answer: X, wraps as \boxed{X}, and runs math_verify on that constructed string. Compare: OR of both passes.
Gives credit for both Qwen3's native \boxed{} format and the explicitly requested Answer: X format.
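The dual-pass logic can be sketched independently of the verifier behind it. In the code below the verifier is injected as `verify_fn(text, gold) -> bool`; in practice that would wrap a math_verify call, but here a toy stand-in is used so the sketch stays self-contained. The names `dual_pass`, `toy_verify`, and the `Answer:` pattern are assumptions, not the CRISP implementation.

```python
import re
from typing import Callable

# Hypothetical pattern for the requested final line.
ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def dual_pass(response: str, gold: str, verify_fn: Callable[[str, str], bool]) -> bool:
    """Credit a response if either pass verifies (logical OR)."""
    # Pass 1: run the verifier on the full response, as in SCORER C.
    if verify_fn(response, gold):
        return True
    # Pass 2: take the last "Answer: X" line, wrap X as \boxed{X}, verify again.
    matches = ANSWER_RE.findall(response)
    if not matches:
        return False
    boxed = "\\boxed{" + matches[-1].strip() + "}"
    return verify_fn(boxed, gold)


def toy_verify(text: str, gold: str) -> bool:
    """Stand-in verifier: only understands a flat \\boxed{...} payload."""
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    return m is not None and m.group(1).strip() == gold


if __name__ == "__main__":
    resp = "The tallest student is Evelyn.\nAnswer: Evelyn"
    print(toy_verify(resp, "Evelyn"))             # single pass misses the bare answer
    print(dual_pass(resp, "Evelyn", toy_verify))  # second pass recovers it
```

Injecting the verifier keeps the OR-of-two-passes structure separate from the equivalence check, so the same wrapper can be reused if the underlying library changes.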
Quantitatively, SCORER D recovers 129 / 4000 samples (+3.23 pp) on 8B step-100 MATH-500 and 219 / 4000 (+5.47 pp) on 14B step-100 MATH-500. On base, where the model still writes \boxed{X} almost everywhere, SCORER D adds at most +0.73 pp on top of SCORER C — which is why we omit it from the main taxonomy in §3. SCORER D is the policy we use in CRISP's in-training validation and rollout filter; it is reasonable to consider for any reasoning-model evaluation where training has shifted the model toward bare-text answers.
The corresponding implementations are in verl/verl/utils/reward_score/math_verify.py (in-training validation) and workspace/src/self_distill_hybrid/sd_verifier.py (training-time rollout filter).