Scorer Choice in Math Reasoning Evaluation

TL;DR

On Qwen3 base models, verifier choice alone can swing reported MATH-500 accuracy by ~80 percentage points end-to-end, without touching the model. Once a \boxed{...} answer-extraction fallback is in place (covered in our earlier post), the dominant axis is symbolic equivalence: answer pairs such as \left(\dfrac{1}{2}\right) vs. \frac{1}{2}, 0.5 vs. \frac{1}{2}, or 2\sqrt{2} vs. \sqrt{8} are mathematically equivalent but fail under string-comparison scorers and succeed under sympy-based ones. This axis alone is worth ~18 pp on Qwen3-8B base MATH-500. We walk through a three-policy decomposition (A → B → C) that pins down where each percentage point comes from. We noticed the gap while building CRISP, a reasoning-compression method; we discuss a CRISP-specific complication in the final section.

1. Background: how we got to this question

While evaluating Qwen3 base models on MATH-500 we cross-checked accuracy under two scorer implementations: an internal extension of math_dapo (the scoring scripts that ship with the DAPO-Math training corpus), and HuggingFace math_verify (a sympy-backed reference implementation widely used in modern reasoning evaluations such as lm-eval-harness and Open-R1). On the same generated responses, the two scorers disagreed by ~20 percentage points on the Qwen3-{8,14}B MATH-500 baselines.

The rest of this post walks through what we found, which turns out to be more interesting than “the two libraries disagree.” The gap decomposes cleanly into two distinct mechanisms (answer extraction and symbolic equivalence), and the second — which we believe is under-discussed in the literature — dominates on Qwen3 outputs.

We originally hit this while iterating on CRISP, an on-policy self-distillation method for compressing reasoning-model outputs (arXiv:2603.05433). The base-model story stands on its own and is the focus of this post; a separate complication that arises only after training is in §7.

2. The eval setup, and an inherent format mismatch

To make the discussion concrete, every accuracy number in this post comes from sampling at temperature=0.6, top_p=0.95, max_tokens=30000 (Qwen3 thinking-mode's recommended decoding settings), with this prompt template:

This is the prompt template that ships with the DAPO-Math corpus, and the explicit Answer: $Answer instruction is exactly what math_dapo's extractor is built to expect.

There is a tension that lives one layer underneath. Qwen3 thinking-mode was trained with \boxed{...} as the final-answer convention: the post-training data wires the model to emit <think>...</think> followed by a polished solution ending in \boxed{X}. At eval time we ask it to follow a different instructional contract (Answer: $Answer). The base model's response to this conflict is to produce a polished solution with a \boxed{X} at the bottom and only sometimes a final Answer: X line as well. Often only the \boxed{X} survives, because the model honours its training before it honours the prompt's last-line instruction.

This eval-prompt-vs-training-format mismatch is the underlying reason the three scorer policies in the next section read such different numbers from the same generations. SCORER A trusts the eval prompt's contract; SCORER B and C add fallbacks that catch the model's trained habit.

3. Three scorer policies

The same (response, gold_answer) pair can be turned into a score in dramatically different ways. We isolate three policies that span the design space encountered in current base-model evaluation pipelines.

SCORER A — `math_dapo` (vanilla)

Extract: regex on last 300 chars for Answer: X. Compare: string equality after a fixed normalization rule list (\dfrac→\frac, strip \left/\right, strip $ delimiters, etc.).

Misses ~60–98% of Qwen3 base-model responses because thinking-mode Qwen3 writes \boxed{...} after </think>, not Answer:.
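The policy above can be sketched in a few lines. This is a minimal illustration, not the actual math_dapo code: the rule list is a hypothetical subset of the normalisation table, and the function names are ours.

```python
import re
from typing import Optional

# Hypothetical subset of the normalisation rules; the real list is longer.
# Naive str.replace is used for brevity (it would also mangle \leftarrow).
_RULES = [(r"\dfrac", r"\frac"), (r"\left", ""), (r"\right", ""), ("$", ""), (" ", "")]

_ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def normalize(ans: str) -> str:
    # Apply the fixed substitution list in order, then trim.
    for old, new in _RULES:
        ans = ans.replace(old, new)
    return ans.strip()


def extract_answer_line(response: str, tail_chars: int = 300) -> Optional[str]:
    # Only the last `tail_chars` characters are searched, as in SCORER A.
    matches = _ANSWER_RE.findall(response[-tail_chars:])
    return matches[-1].strip() if matches else None


def score_a(response: str, gold: str) -> int:
    pred = extract_answer_line(response)
    if pred is None:
        return 0  # no 'Answer:' line in the tail: scored as wrong
    return int(normalize(pred) == normalize(gold))
```

A response that ends only in \boxed{\frac{1}{2}} scores 0 here even when gold is \frac{1}{2}, because the extractor never sees an Answer: line.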

SCORER B — `math_dapo` + `\boxed{}` fallback

Extract: SCORER A regex; if no match, also try \boxed{X}. Compare: same string equality after normalization.

A natural fix once you notice Qwen3 emits boxed answers. The 70–80% baselines commonly reported on Qwen3 MATH-500 use this policy.
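A sketch of the fallback extractor, under two assumptions of ours that the policy description does not pin down: the fallback searches the whole response rather than just the tail, and it takes the last \boxed{...} it finds. Brace matching is done by hand because boxed payloads such as \frac{1}{2} contain nested braces that a simple regex would truncate.

```python
import re
from typing import Optional

_BOXED = r"\boxed{"


def last_boxed(response: str) -> Optional[str]:
    # Return the contents of the last \boxed{...}, honouring nested braces.
    start = response.rfind(_BOXED)
    if start == -1:
        return None
    begin = start + len(_BOXED)
    depth = 1
    for i in range(begin, len(response)):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return response[begin:i]
    return None  # unbalanced braces: treat as no answer


def extract_b(response: str, tail_chars: int = 300) -> Optional[str]:
    # First try the 'Answer: X' regex on the tail, then fall back to \boxed{X}.
    matches = re.findall(r"Answer:\s*(.+)", response[-tail_chars:])
    if matches:
        return matches[-1].strip()
    return last_boxed(response)
```

The extracted string is then compared exactly as in SCORER A, so only the extraction step differs between the two policies.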

SCORER C — `math_verify`

Extract: HuggingFace math_verify's default extraction (LatexExtractionConfig for \boxed{...}, ExprExtractionConfig for anchored expressions). Compare: sympy symbolic equivalence.

A single library call on the full response. Closest to what most modern reasoning-eval pipelines (lm-eval, Open-R1) use out of the box, and what we recommend as the default for base-model evaluation.

Our earlier post focused on the first axis (Answer Format Sensitivity in Qwen3 Math Reasoning Evaluation); this post focuses on the second. A subtle fourth policy, useful only after a model has been trained in a way that shifts its output format, is introduced in §7.

4. The base-model table

We re-scored the same per-sample generations (500 problems × 8 samples on MATH-500, 30 problems × 8 samples on each AIME year) for Qwen3-{8,14}B base under all three policies.

Accuracy (%) under each scorer policy on identical generations. Column C is the one we recommend as the default for base-model reporting.

Model	Benchmark	A	B	C
`Qwen3-8B`	MATH-500	18.55	78.15	96.35
`Qwen3-8B`	AIME 24	5.00	73.33	75.83
`Qwen3-8B`	AIME 25	2.92	65.42	66.67
`Qwen3-14B`	MATH-500	35.05	71.73	94.53
`Qwen3-14B`	AIME 24	2.08	76.67	80.00
`Qwen3-14B`	AIME 25	2.92	68.75	70.42

5. Decomposing the gap

Stepping across the three policies on a single cell pins down where each percentage point originates. Taking Qwen3-8B base MATH-500 as the running example:

Transition	Δ on 8B-base MATH-500	Mechanism
A → B	+59.60 pp	Add `\boxed{}` fallback to the extractor
B → C	+18.20 pp	Replace string-equality with `sympy` symbolic equivalence

The two contributors are the \boxed{} extraction fallback (A → B, +59.60 pp) and the sympy symbolic-equivalence backend (B → C, +18.20 pp). The first arises because Qwen3 emits a boxed answer after </think> and the Answer: regex misses it; this is the subject of our earlier post. The second — the focus of this post — arises because once the answer is extracted, the model's surface form often differs from gold in ways that no fixed string-normalisation table can canonicalise.
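The same arithmetic can be repeated for any cell of the table. The snippet below is only a convenience for checking the subtraction; the dictionary layout and function name are ours, and the values are copied from the MATH-500 rows of the base-model table.

```python
# Accuracy (%) copied from the base-model table: scorer -> value.
ACC = {
    ("Qwen3-8B", "MATH-500"): {"A": 18.55, "B": 78.15, "C": 96.35},
    ("Qwen3-14B", "MATH-500"): {"A": 35.05, "B": 71.73, "C": 94.53},
}


def decompose(cell):
    # Split the end-to-end A -> C gap into its two contributors.
    extraction = round(cell["B"] - cell["A"], 2)   # A -> B: \boxed{} fallback
    equivalence = round(cell["C"] - cell["B"], 2)  # B -> C: symbolic equivalence
    total = round(cell["C"] - cell["A"], 2)
    return extraction, equivalence, total


for key, cell in ACC.items():
    print(key, decompose(cell))
```

For the 8B cell this gives (59.6, 18.2, 77.8); for the 14B cell it gives (36.68, 22.8, 59.48), so the equivalence step is worth more on 14B even though the end-to-end gap is smaller.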

Concrete examples of B→C recoveries

What does a symbolic-equivalence-recoverable failure look like? Five archetypal cases from the actual MATH-500 samples:

Gold	Model response (tail)	C
`\frac{1}{2}`	`\boxed{\left(\dfrac{1}{2}\right)}`	1
`1/2`	`\boxed{0.5}`	1
`\sqrt{8}`	`\boxed{2\sqrt{2}}`	1
`(x-1)^2`	`\boxed{x^2 - 2x + 1}`	1
`\frac{p-q}{2}`	`\boxed{\frac{1}{2}(p - q)}`	1

Under SCORER B, each of these is treated as a wrong answer because the post-normalisation string differs from gold. Under SCORER C, math_verify's sympy backend recognises every one as the correct answer. None of the model behaviour changed; only the scorer's verdict did.
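To see why string equality cannot settle these cases, here is a standard-library stand-in. It is deliberately not what math_verify does: math_verify parses LaTeX and checks equivalence with sympy, whereas this sketch takes expressions we hand-translated into Python syntax and compares them numerically at random points. A numeric spot check is weaker than symbolic equivalence, but it is enough to show that all five pairs denote the same value while differing as strings.

```python
import math
import random


def numerically_equal(expr_a, expr_b, variables=(), trials=20):
    # Numeric spot check, NOT symbolic equivalence: two expressions that
    # happen to agree on every sampled point would be accepted.
    rng = random.Random(0)  # fixed seed so the check is reproducible
    allowed = {"sqrt": math.sqrt, "__builtins__": {}}
    for _ in range(trials):
        env = dict(allowed, **{v: rng.uniform(1.0, 5.0) for v in variables})
        a = eval(expr_a, env)  # eval is acceptable only for trusted toy inputs
        b = eval(expr_b, env)
        if not math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True


# (gold, model answer, free variables), hand-translated from the table above.
PAIRS = [
    ("1/2", "(1/2)", ()),
    ("1/2", "0.5", ()),
    ("sqrt(8)", "2*sqrt(2)", ()),
    ("(x-1)**2", "x**2 - 2*x + 1", ("x",)),
    ("(p-q)/2", "(1/2)*(p - q)", ("p", "q")),
]

for gold, pred, free in PAIRS:
    print(gold, "|", pred, "| string:", gold == pred,
          "| value:", numerically_equal(gold, pred, free))
```

Every row prints a string match of False and a value match of True, which is the B versus C disagreement in miniature.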

6. Implications for evaluating reasoning models

The headline is general. Verifier choice is a first-order methodological decision in reasoning-model evaluation, not a downstream detail. An ~80-pp accuracy gap on identical generated responses can hinge entirely on the scorer's extraction policy (Answer:-regex vs. \boxed{...} extraction vs. both) and on whether equivalence is checked by string match or symbolic computation. This is true on plain Qwen3 base models with no training pipeline involved.

The conservative practice is to apply a single fixed scorer policy to every row being compared. Otherwise, headline accuracy claims for reasoning models can reflect scorer rebasing rather than model behaviour and, as the next section shows, the effect can grow when training itself reshapes the format of the model's output.

7. A wrinkle from CRISP: when training reshapes the output format

The three-policy taxonomy above is sufficient for evaluating base models. For models that have been trained in a way that changes their output format, a subtle gap opens between SCORER C and an extended dual-pass policy that we call SCORER D. We hit this while building CRISP and want to share the mechanism, because it can plausibly arise from any training method whose objective indirectly reshapes the model's output format.

What CRISP is

CRISP (Compressed Reasoning via Iterative Self-Policy Distillation; arXiv:2603.05433) is a teacher–student framework for compressing reasoning-model outputs. The teacher is the base model conditioned on a concise system prompt; because the prompt nudges the model toward brevity, the teacher's per-token distribution favours shorter responses by construction. The student is the same base model, unconditioned, trained via reverse-KL to match the teacher's distribution. This transfers the teacher's brevity into the unconditioned student without needing ground-truth labels or a token-budget penalty. A typical Qwen3-8B response on MATH-500 drops from ~4,900 tokens at base to ~2,050 tokens at step-100 (a 59% reduction) at comparable accuracy.
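For readers unfamiliar with the term, reverse KL is the divergence KL(student || teacher), with the expectation taken under the student's own distribution. The toy below computes it for a single next-token distribution. The numbers are made up, and how CRISP aggregates this quantity over tokens and sampled responses is not covered here; see the paper for the actual objective.

```python
import math


def kl(p, q):
    # KL(p || q) for two discrete distributions over the same tokens.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
teacher = [0.80, 0.15, 0.05]  # base model conditioned on the concise prompt
student = [0.50, 0.30, 0.20]  # same model without the system prompt

reverse_kl = kl(student, teacher)  # expectation under the student
forward_kl = kl(teacher, student)  # expectation under the teacher, for contrast

print(f"reverse KL = {reverse_kl:.4f}, forward KL = {forward_kl:.4f}")
```

The two directions give different values on the same pair of distributions, which is why the direction matters when describing a distillation objective.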

The format shift

CRISP's teacher is the base model under a concise system prompt. When Qwen3 is asked to be concise, it strips LaTeX decoration before it strips content:

The student inherits this style via reverse-KL. Re-scoring the trained model under the same three policies shows the effect:

Accuracy (%) on MATH-500, base vs CRISP step-100, under the same three scorers as §4.

Model	Benchmark	State	A	B	C
`Qwen3-8B`	MATH-500	base	18.55	78.15	96.35
`Qwen3-8B`	MATH-500	step-100	84.97	85.70	93.00
`Qwen3-14B`	MATH-500	base	35.05	71.73	94.53
`Qwen3-14B`	MATH-500	step-100	84.17	84.55	91.00

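Both string-comparison scorers now read much higher: A jumps from 18.55 to 84.97 on 8B, and B rises from 78.15 to 85.70. Two things changed in the output. First, A and B differ by under 1 pp at step-100, so the \boxed{} fallback now adds almost nothing: the Answer: regex finds an answer in nearly every response that B can score. Second, SCORER A's normalisation rule list now succeeds on payloads it previously failed on (a decorated \left(\dfrac{1}{2}\right) that didn't string-match gold \frac{1}{2} becomes the simpler 1/2, which does match after normalisation). SCORER C, by contrast, drops from 96.35 to 93.00. Same model behaviour at the level of correctness; different surface form; different scorer reading.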

This is a generic concern for any training method whose objective indirectly reshapes output format: cross-row comparisons under different scorers can read large gains or large regressions depending on which scorer the reader applies. The conservative practice (as already recommended in §6) is to apply a single fixed policy to both rows.
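The size of the problem is easy to check from the table. In the sketch below (names are ours; values are copied from the MATH-500 rows above), the same pair of checkpoints reads as anything from a large gain to a clear regression depending on which scorer is applied to which row.

```python
# Accuracy (%) copied from the table above: (model, state) -> scorer -> value.
ACC = {
    ("Qwen3-8B", "base"): {"A": 18.55, "B": 78.15, "C": 96.35},
    ("Qwen3-8B", "step-100"): {"A": 84.97, "B": 85.70, "C": 93.00},
    ("Qwen3-14B", "base"): {"A": 35.05, "B": 71.73, "C": 94.53},
    ("Qwen3-14B", "step-100"): {"A": 84.17, "B": 84.55, "C": 91.00},
}


def delta(model, base_scorer, trained_scorer):
    # Step-100 accuracy minus base accuracy, each read under its own scorer.
    trained = ACC[(model, "step-100")][trained_scorer]
    base = ACC[(model, "base")][base_scorer]
    return round(trained - base, 2)


# Same scorer on both rows: the comparison we recommend.
for s in "ABC":
    print("8B, scorer", s, "on both rows:", delta("Qwen3-8B", s, s))

# Mixed scorers: the comparison to avoid.
print("8B, base under B, step-100 under C:", delta("Qwen3-8B", "B", "C"))
print("8B, base under C, step-100 under B:", delta("Qwen3-8B", "C", "B"))
```

On 8B the fixed-scorer deltas are +66.42 (A), +7.55 (B) and -3.35 (C), while the two mixed readings are +14.85 and -10.65.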

SCORER D: a fourth policy for trained-model evaluation

The C-column drop on step-100 deserves a closer look. A material fraction of step-100 responses take the form Answer: X with no accompanying \boxed{X} — the trained model treats the Answer: line as its sole final commitment, exactly as the prompt instructed.

SCORER C (single-call math_verify) cannot extract these unless X is a digit-led number or arithmetic expression — the only forms ExprExtractionConfig's anchored regex (in math_verify/parser.py) matches:

For MATH-500 text answers (Evelyn, north), multi-choice letters (A, B), and fraction or set expressions written without LaTeX delimiters, SCORER C returns 0 despite the model having followed the prompt's instructional contract verbatim.

SCORER D — `math_verify` dual-pass

Extract: SCORER C, plus a second pass that regex-extracts Answer: X, wraps as \boxed{X}, and runs math_verify on that constructed string. Compare: OR of both passes.

Gives credit for both Qwen3's native \boxed{} format and the explicitly requested Answer: X format.
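The control flow of the dual-pass policy can be sketched as follows. The verifier is passed in as a callable so that the sketch stays self-contained; in practice it would wrap a math_verify call, which we do not reproduce here. The `toy_verify` function is a hypothetical stand-in that only recognises boxed answers by exact string match, and all names are ours.

```python
import re
from typing import Callable

# Hypothetical verifier signature: (response_text, gold) -> bool.
Verifier = Callable[[str, str], bool]

_ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def score_d(response: str, gold: str, verify: Verifier) -> int:
    # Pass 1: the single-call policy (SCORER C) on the full response.
    if verify(response, gold):
        return 1
    # Pass 2: pull out 'Answer: X', wrap it as \boxed{X}, and verify that string.
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return 0
    boxed = "\\boxed{" + matches[-1].strip() + "}"
    return int(verify(boxed, gold))


def toy_verify(text: str, gold: str) -> bool:
    # Stand-in for a real verifier: credits only an exact boxed match.
    return ("\\boxed{" + gold + "}") in text


print(score_d("Answer: Evelyn", "Evelyn", toy_verify))       # second pass
print(score_d("so \\boxed{Evelyn}", "Evelyn", toy_verify))   # first pass
```

Because the result is the OR of the two passes, a sample can only gain credit relative to SCORER C, never lose it.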

Quantitatively, SCORER D recovers 129 / 4000 samples (+3.23 pp) on 8B step-100 MATH-500 and 219 / 4000 (+5.47 pp) on 14B step-100 MATH-500. On base, where the model still writes \boxed{X} almost everywhere, SCORER D adds at most +0.73 pp on top of SCORER C — which is why we omit it from the main taxonomy in §3. SCORER D is the policy we use in CRISP's in-training validation and rollout filter; it is reasonable to consider for any reasoning-model evaluation where training has shifted the model toward bare-text answers.
