May 26, 2026

Scorer Choice in Math Reasoning Evaluation

TL;DR

On Qwen3 base models, verifier choice alone can swing reported MATH-500 accuracy by nearly 80 percentage points end-to-end (77.8 pp on Qwen3-8B) without touching the model. The dominant axis, once a \boxed{...} answer-extraction fallback is in place (covered in our earlier post), is symbolic equivalence: pairs such as \dfrac{1}{2} vs. \frac{1}{2}, 0.5 vs. \frac{1}{2}, or 2\sqrt{2} vs. \sqrt{8} are mathematically equivalent but fail under string-comparison scorers and succeed under sympy-based ones. This axis alone is worth ~18 pp on Qwen3-8B base MATH-500. We walk through a three-policy decomposition (A → B → C) that pins down where each percentage point comes from. We noticed the gap while building CRISP, a reasoning-compression method; we discuss a CRISP-specific complication in the final section.

1. Background: how we got to this question

While evaluating Qwen3 base models on MATH-500 we cross-checked accuracy under two scorer implementations: an internal extension of math_dapo (the scoring scripts that ship with the DAPO-Math training corpus), and HuggingFace math_verify (a sympy-backed reference implementation widely used in modern reasoning evaluations such as lm-eval-harness and Open-R1). On the same generated responses, the two scorers disagreed by ~20 percentage points on the Qwen3-{8,14}B MATH-500 baselines.

The puzzle. If both numbers come from the same generated MATH-500 responses, where do 20 percentage points come from?

The rest of this post walks through what we found, which turns out to be more interesting than “the two libraries disagree.” The gap decomposes cleanly into two distinct mechanisms (answer extraction and symbolic equivalence), and the second — which we believe is under-discussed in the literature — dominates on Qwen3 outputs.

We originally hit this while iterating on CRISP, an on-policy self-distillation method for compressing reasoning-model outputs (arXiv:2603.05433). The base-model story stands on its own and is the focus of this post; a separate complication that arises only after training is in §7.

2. The eval setup, and an inherent format mismatch

To make the discussion concrete, every accuracy number in this post comes from sampling at temperature=0.6, top_p=0.95, max_tokens=30000 (Qwen3 thinking-mode's recommended decoding settings), with this prompt template:

Solve the following math problem step by step. The last line
of your response should be of the form Answer: $Answer (without
quotes) where $Answer is the answer to the problem.

{problem}

This is the prompt template that ships with the DAPO-Math corpus, and the explicit Answer: $Answer instruction is exactly what math_dapo's extractor is built to expect.

There is a tension that lives one layer underneath. Qwen3 thinking-mode was trained with \boxed{...} as the final-answer convention — the post-training data wires the model to emit <think>...</think> followed by a polished solution ending in \boxed{X}. At eval time we ask it to follow a different instructional contract (Answer: $Answer). The base model's response to this conflict is to produce both: a polished solution with a \boxed{X} at the bottom, and sometimes a final Answer: X line. Often only the \boxed{X} survives — the model honours its training before it honours the prompt's last-line instruction.

This eval-prompt-vs-training-format mismatch is the underlying reason the three scorer policies in the next section read such different numbers from the same generations. SCORER A trusts the eval prompt's contract; SCORER B and C add fallbacks that catch the model's trained habit.

3. Three scorer policies

The same (response, gold_answer) pair can be turned into a score in dramatically different ways. We isolate three policies that span the design space encountered in current base-model evaluation pipelines.

SCORER A — math_dapo (vanilla)

Extract: regex on last 300 chars for Answer: X. Compare: string equality after a fixed normalization rule list (\dfrac → \frac, strip \left/\right, strip $ delimiters, etc.).

Misses ~60–98% of Qwen3 base-model responses because thinking-mode Qwen3 writes \boxed{...} after </think>, not Answer:.
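The extract-then-compare flow of SCORER A can be sketched as follows. This is a minimal illustration and not the math_dapo source: the function names, the exact regex, and the three normalisation rules shown are assumptions chosen to mirror the description above.

```python
import re

# Hypothetical stand-in for math_dapo's rule list; only three rules are shown.
def normalize(ans: str) -> str:
    ans = ans.strip().strip("$")                      # strip $ delimiters
    ans = ans.replace("\\dfrac", "\\frac")            # \dfrac -> \frac
    ans = ans.replace("\\left", "").replace("\\right", "")
    return ans.replace(" ", "")

ANSWER_RE = re.compile(r"Answer:\s*(.+)")

def extract_answer_line(response: str, tail_chars: int = 300):
    """Return the payload of the last 'Answer: X' in the final tail_chars, or None."""
    matches = ANSWER_RE.findall(response[-tail_chars:])
    return matches[-1].strip() if matches else None

def score_a(response: str, gold: str) -> int:
    pred = extract_answer_line(response)
    if pred is None:
        return 0  # nothing extracted counts as a wrong answer
    return int(normalize(pred) == normalize(gold))
```

Under this sketch, a response that ends in \boxed{\frac{1}{2}} with no Answer: line scores 0 regardless of whether the boxed value is right, which is the failure mode described above.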

SCORER B — math_dapo + \boxed{} fallback

Extract: SCORER A regex; if no match, also try \boxed{X}. Compare: same string equality after normalization.

A natural fix once you notice Qwen3 emits boxed answers. The 70–80% baselines commonly reported on Qwen3 MATH-500 use this policy.
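A rough sketch of the fallback extractor, assuming the same Answer: regex as SCORER A and a simple brace counter for \boxed{...}. The helper names are hypothetical, and the real implementation may differ in how it handles multiple or malformed boxes.

```python
import re

def last_boxed(response: str):
    """Return the contents of the last \\boxed{...}, honouring nested braces."""
    start = response.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, out = 1, []
    while i < len(response):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer

def extract_b(response: str, tail_chars: int = 300):
    """Answer-line regex first; fall back to the last \\boxed{...}."""
    matches = re.findall(r"Answer:\s*(.+)", response[-tail_chars:])
    if matches:
        return matches[-1].strip()
    return last_boxed(response)
```

The brace counter matters because payloads such as \frac{1}{2} contain braces of their own; a naive non-greedy regex would cut the answer short. Comparison is unchanged from SCORER A, so an extracted 0.5 still fails against a gold \frac{1}{2}.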

SCORER C — math_verify

Extract: HuggingFace math_verify's default extraction (LatexExtractionConfig for \boxed{...}, ExprExtractionConfig for anchored expressions). Compare: sympy symbolic equivalence.

A single library call on the full response. Closest to what most modern reasoning-eval pipelines (lm-eval, Open-R1) use out of the box, and what we recommend as the default for base-model evaluation.

The defining design choices are along two axes:

  1. Extraction — which substring of the response is treated as the model's committed answer?
  2. Comparison — once a substring is extracted, how is it compared to gold? String equality after canonicalisation, or symbolic equivalence via sympy?

Our earlier post focused on the first axis (Answer Format Sensitivity in Qwen3 Math Reasoning Evaluation); this post focuses on the second. A subtle fourth policy, useful only after a model has been trained in a way that shifts its output format, is introduced in §7.

4. The base-model table

We re-scored the same per-sample generations (500 problems × 8 samples on MATH-500, 30 problems × 8 samples on each AIME year) for Qwen3-{8,14}B base under all three policies.

Model       Benchmark   A       B       C
Qwen3-8B    MATH-500    18.55   78.15   96.35
Qwen3-8B    AIME 24      5.00   73.33   75.83
Qwen3-8B    AIME 25      2.92   65.42   66.67
Qwen3-14B   MATH-500    35.05   71.73   94.53
Qwen3-14B   AIME 24      2.08   76.67   80.00
Qwen3-14B   AIME 25      2.92   68.75   70.42

Accuracy (%) under each scorer policy on identical generations. Column C is the one we recommend as the default for base-model reporting.

The table is information-dense; the most striking observations:

  1. Reading across: SCORER A reads 2–35% on these base models, while SCORER C reads 67–96%. Same generations, same model, same problems — a swing of up to ~78 pp from scorer choice alone.
  2. A → B (still string-equality, only adding the \boxed{} fallback) recovers most of the gap on MATH-500: +59.6 pp on 8B, +36.7 pp on 14B. This is the extraction story documented in our earlier post.
  3. B → C (replacing string-equality with sympy symbolic equivalence) recovers another +18.2 pp on 8B MATH-500 and +22.8 pp on 14B MATH-500. This is the gap that lifts the 70–80% baselines to 94–96%, and it is the focus of this post.

5. Decomposing the gap

Stepping across the three policies on a single cell pins down where each percentage point originates. Taking Qwen3-8B base MATH-500 as the running example:

Transition   Δ on 8B-base MATH-500   Mechanism
A → B        +59.60 pp               Add \boxed{} fallback to the extractor
B → C        +18.20 pp               Replace string-equality with sympy symbolic equivalence
[Figure: Gap decomposition, Qwen3-8B base, MATH-500 (4,000 samples). A stacked bar on a 0–100% axis shows how each scorer-policy change contributes to the 77.8 pp swing: SCORER A = 18.55 (Answer-regex + string equality); A → B adds +59.60 (\boxed{} extraction fallback), giving SCORER B = 78.15; B → C adds +18.20 (sympy symbolic equivalence), giving 96.35.]
Figure 1. Each coloured segment shows how much one scorer-policy change adds to the cumulative reading. The green segment (B → C) is the symbolic-equivalence contribution — the main subject of this post.

The two contributors are the \boxed{} extraction fallback (A → B, +59.60 pp) and the sympy symbolic-equivalence backend (B → C, +18.20 pp). The first arises because Qwen3 emits a boxed answer after </think> and the Answer: regex misses it; this is the subject of our earlier post. The second — the focus of this post — arises because once the answer is extracted, the model's surface form often differs from gold in ways that no fixed string-normalisation table can canonicalise.
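The decomposition is plain subtraction over the table in §4; the short sketch below reproduces it for both MATH-500 rows. The dictionary layout and function name are illustrative only.

```python
# Accuracy (%) from the table in §4, keyed by (model, benchmark).
TABLE = {
    ("Qwen3-8B", "MATH-500"): (18.55, 78.15, 96.35),
    ("Qwen3-14B", "MATH-500"): (35.05, 71.73, 94.53),
}

def decompose(a: float, b: float, c: float) -> dict:
    """Split the A -> C swing into an extraction step and an equivalence step."""
    return {
        "A->B": round(b - a, 2),   # \boxed{} extraction fallback
        "B->C": round(c - b, 2),   # symbolic equivalence
        "total": round(c - a, 2),
    }

for key, scores in TABLE.items():
    print(key, decompose(*scores))
```

Because the three policies are nested (B extends A's extractor, C replaces B's comparison), the two steps sum exactly to the end-to-end swing.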

Concrete examples of B→C recoveries

What does a symbolic-equivalence-recoverable failure look like? Five archetypal cases from the actual MATH-500 samples:

Gold            Model response (tail)                B   C
\frac{1}{2}     \boxed{\left(\dfrac{1}{2}\right)}    0   1
1/2             \boxed{0.5}                          0   1
\sqrt{8}        \boxed{2\sqrt{2}}                    0   1
(x-1)^2         \boxed{x^2 - 2x + 1}                 0   1
\frac{p-q}{2}   \boxed{\frac{1}{2}(p - q)}           0   1

Under SCORER B, each of these is treated as a wrong answer because the post-normalisation string differs from gold. Under SCORER C, math_verify's sympy backend recognises every one as the correct answer. None of the model behaviour changed; only the scorer's verdict did.
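To make the distinction concrete without depending on sympy, the sketch below contrasts string equality with value-based comparison using only the Python standard library. It is a stand-in and not what math_verify does: exact rationals and numeric probing at random points replace true symbolic equivalence, and all helper names are hypothetical.

```python
import math
import random
from fractions import Fraction

def same_value(a: float, b: float) -> bool:
    return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)

# 1/2 vs 0.5: different strings, identical rational value.
assert "1/2" != "0.5"
assert Fraction("1/2") == Fraction("0.5")

# \sqrt{8} vs 2\sqrt{2}: compared numerically here.
assert same_value(math.sqrt(8), 2 * math.sqrt(2))

def agree_on_probes(f, g, n_probes: int = 20, seed: int = 0) -> bool:
    """Numeric probing: True if f and g agree at n_probes random points."""
    rng = random.Random(seed)
    points = [rng.uniform(-5, 5) for _ in range(n_probes)]
    return all(same_value(f(x), g(x)) for x in points)

# (x-1)^2 vs x^2 - 2x + 1: no string normalisation maps one onto the other.
expanded_matches = agree_on_probes(
    lambda x: (x - 1) ** 2,
    lambda x: x * x - 2 * x + 1,
)
```

Numeric probing can in principle accept two expressions that merely coincide at the sampled points, which is one reason a symbolic backend is preferable for real scoring.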

6. Implications for evaluating reasoning models

The headline is general. Verifier choice is a first-order methodological decision in reasoning-model evaluation, not a downstream detail. An ~80-pp accuracy gap on identical generated responses can hinge entirely on the scorer's extraction policy (Answer:-regex vs. \boxed{...} extraction vs. both) and on whether equivalence is checked by string match or symbolic computation. This is true on plain Qwen3 base models with no training pipeline involved.

Two specific recommendations for future reasoning-evaluation reports:

  1. Explicitly disclose the verifier: prompt template, extraction policy, equivalence-checking backend, and library version. “We used math_verify” is ambiguous: default ExtractionConfig or custom? Single-call or dual-pass? A small implementation footnote eliminates ambiguity that can swing the headline by 20 pp.
  2. Apply the verifier identically to all rows of the comparison. A reported “+X pp from training” must use the same scorer policy on the baseline checkpoint and the trained checkpoint. Within-column comparisons (fixed scorer, varying training state) are the only ones invariant to the scorer-rebasing effect.

Headline accuracy claims for reasoning models can otherwise reflect scorer rebasing rather than model behaviour — and, as the next section shows, the effect can grow when training itself reshapes the format of the model's output.

7. A wrinkle from CRISP: when training reshapes the output format

The three-policy taxonomy above is sufficient for evaluating base models. For models that have been trained in a way that changes their output format, a subtle gap opens between SCORER C and an extended dual-pass policy that we call SCORER D. We hit this while building CRISP and want to share the mechanism, because it can plausibly arise from any training method whose objective indirectly reshapes the model's output format.

What CRISP is

CRISP (Compressed Reasoning via Iterative Self-Policy Distillation; arXiv:2603.05433) is a teacher–student framework for compressing reasoning-model outputs. The teacher is the base model conditioned on a concise system prompt; because the prompt nudges the model toward brevity, the teacher's per-token distribution favours shorter responses by construction. The student is the same base model, unconditioned, trained via reverse-KL to match the teacher's distribution, which transfers the teacher's brevity into the unconditioned student without needing ground-truth labels or a token-budget penalty. A typical Qwen3-8B response on MATH-500 drops from ~4,900 tokens at base to ~2,050 tokens at step-100 (a 59% reduction) at comparable accuracy.

The format shift

CRISP's teacher is the base model under a concise system prompt. When Qwen3 is asked to be concise, it strips LaTeX decoration before it strips content: a decorated \dfrac{1}{2}, for example, becomes the plain 1/2.

The student inherits this style via reverse-KL. Re-scoring the trained model under the same three policies shows the effect:

Model       Benchmark   State      A       B       C
Qwen3-8B    MATH-500    base       18.55   78.15   96.35
Qwen3-8B    MATH-500    step-100   84.97   85.70   93.00
Qwen3-14B   MATH-500    base       35.05   71.73   94.53
Qwen3-14B   MATH-500    step-100   84.17   84.55   91.00

Accuracy (%) on MATH-500, base vs CRISP step-100, under the same three scorers as §4.

SCORER A's normalisation rule list now succeeds on payloads it previously failed on (a decorated \dfrac{1}{2} that didn't string-match gold \frac{1}{2} becomes the simpler 1/2, which does match after normalisation): A jumps from 18.55 to 84.97 on 8B, and B follows. SCORER C, by contrast, drops from 96.35 to 93.00. Same model behaviour at the level of correctness; different surface form; different scorer reading.

This is a generic concern for any training method whose objective indirectly reshapes output format: cross-row comparisons under different scorers can read large gains or large regressions depending on which scorer the reader applies. The conservative practice (as already recommended in §6) is to apply a single fixed policy to both rows.
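The size of the effect is easy to see by computing the apparent training gain for Qwen3-8B under every pairing of scorers, using the MATH-500 numbers from the table above. The helper below is an illustrative sketch, not part of any evaluation library.

```python
# MATH-500 accuracy (%) for Qwen3-8B, from the table above.
ACC = {
    "base":     {"A": 18.55, "B": 78.15, "C": 96.35},
    "step-100": {"A": 84.97, "B": 85.70, "C": 93.00},
}

def delta(base_scorer: str, trained_scorer: str) -> float:
    """Apparent training gain (pp) when the two rows use the given scorers."""
    return round(ACC["step-100"][trained_scorer] - ACC["base"][base_scorer], 2)

# Fixed scorer on both rows: the within-column comparisons.
within = {s: delta(s, s) for s in "ABC"}

# Mixed scorers: same two checkpoints, very different stories.
inflated = delta("A", "C")   # base scored under A, trained scored under C
deflated = delta("C", "A")   # base scored under C, trained scored under A
```

Even the within-column deltas disagree in sign here (positive under A and B, negative under C), so reporting which column was used matters as much as holding it fixed.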

SCORER D: a fourth policy for trained-model evaluation

The C-column drop on step-100 deserves a closer look. A material fraction of step-100 responses take the form Answer: X with no accompanying \boxed{X} — the trained model treats the Answer: line as its sole final commitment, exactly as the prompt instructed.

SCORER C (single-call math_verify) cannot extract these unless X is a digit-led number or arithmetic expression — the only forms ExprExtractionConfig's anchored regex (in math_verify/parser.py) matches:

number_re = r"-?\d+(\.\d+)?"
operators = ['+','-','*','/','^','(',')']
expr_re   = r"-?\(?-?\d[\d\.\s+\-*/^()]*[+\-*/^()][\d\.\s+\-*/^()]+\)?"
expr_or_number = (expr_re | number_re)

For MATH-500 text answers (Evelyn, north), multi-choice letters (A, B), and fraction or set expressions written without LaTeX delimiters, SCORER C returns 0 despite the model having followed the prompt's instructional contract verbatim.
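The behaviour can be checked directly against the patterns in the excerpt above. In this sketch the two patterns are copied as shown there, combined with a plain alternation, and applied to the bare Answer: payload with fullmatch. Both of those choices are assumptions made for illustration; the library's own anchoring and surrounding logic are more involved.

```python
import re

# Patterns as given in the excerpt above.
number_re = r"-?\d+(\.\d+)?"
expr_re = r"-?\(?-?\d[\d\.\s+\-*/^()]*[+\-*/^()][\d\.\s+\-*/^()]+\)?"
expr_or_number = re.compile(rf"(?:{expr_re})|(?:{number_re})")

def expr_config_matches(payload: str) -> bool:
    """True if the whole payload is a digit-led number or arithmetic expression."""
    return expr_or_number.fullmatch(payload.strip()) is not None

payloads = ["42", "-3.5", "3 + 4*2", "Evelyn", "north", "A", "\\frac{1}{2}", "{1, 2}"]
for payload in payloads:
    print(f"{payload!r:>16} -> {expr_config_matches(payload)}")
```

Only the first three payloads are digit-led, so only they match; the text, letter, and undelimited LaTeX answers fall through, which is the gap SCORER D is designed to close.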

The fix is a dual-pass extension on top of SCORER C:

SCORER D — math_verify dual-pass

Extract: SCORER C, plus a second pass that regex-extracts Answer: X, wraps as \boxed{X}, and runs math_verify on that constructed string. Compare: OR of both passes.

Gives credit for both Qwen3's native \boxed{} format and the explicitly requested Answer: X format.
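The dual-pass logic can be sketched independently of the verifier itself. In the code below, verify_fn is a hypothetical callable standing in for a single-call verifier such as SCORER C; it is not part of math_verify's API, and toy_verify exists only so the example runs without external dependencies.

```python
import re
from typing import Callable

ANSWER_RE = re.compile(r"Answer:\s*(.+)")

def score_d(response: str, gold: str, verify_fn: Callable[[str, str], bool]) -> int:
    """Dual-pass policy: credit the sample if either pass verifies."""
    # Pass 1: the verifier sees the raw response.
    if verify_fn(response, gold):
        return 1
    # Pass 2: lift the last 'Answer: X' payload into \boxed{X} and retry.
    matches = ANSWER_RE.findall(response)
    if not matches:
        return 0
    boxed = "\\boxed{" + matches[-1].strip() + "}"
    return int(verify_fn(boxed, gold))

def toy_verify(text: str, gold: str) -> bool:
    """Toy stand-in: only understands a flat \\boxed{...} and compares strings."""
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    return m is not None and m.group(1) == gold

print(score_d("The tallest is Evelyn.\nAnswer: Evelyn", "Evelyn", toy_verify))
```

Taking the OR of the two passes means the second pass can only add credit, never remove it, so SCORER D is always at least as high as SCORER C on the same samples.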

Quantitatively, SCORER D recovers 129 / 4000 samples (+3.23 pp) on 8B step-100 MATH-500 and 219 / 4000 (+5.47 pp) on 14B step-100 MATH-500. On base, where the model still writes \boxed{X} almost everywhere, SCORER D adds at most +0.73 pp on top of SCORER C — which is why we omit it from the main taxonomy in §3. SCORER D is the policy we use in CRISP's in-training validation and rollout filter; it is reasonable to consider for any reasoning-model evaluation where training has shifted the model toward bare-text answers.


Paper references and reproducibility