arXiv cs.LG

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

Signal: 78
Hype: 25
In three lines: ARBITER corrects majority vote failures in test-time sampling. Reasoning trajectories cluster into stable basins, and a stable basin is not necessarily an accurate one. ARBITER uses hidden states and model-derived evidence to add conservative signals to the consensus, recovering ~22% of the oracle gap on MMLU-HS-Math with Llama-3.1-8B (78%→82%).
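
To make the terms in the summary concrete, here is a minimal Python sketch of majority voting over sampled answers, the oracle ceiling (a question counts as solved if any sample is correct), and one common way to define the fraction of the oracle gap recovered. The function names and toy data are hypothetical, and the gap definition is an assumption: the summary does not say how the paper computes it. The sketch does not implement ARBITER itself.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent sampled answer (ties go to the first seen)."""
    return Counter(answers).most_common(1)[0][0]


def oracle_correct(answers, gold):
    """Oracle selection: correct if any sampled answer matches the gold label."""
    return gold in answers


def gap_recovered(baseline_acc, method_acc, oracle_acc):
    """Assumed definition: share of the (oracle - baseline) gap a method closes."""
    gap = oracle_acc - baseline_acc
    if gap <= 0:
        raise ValueError("oracle accuracy must exceed baseline accuracy")
    return (method_acc - baseline_acc) / gap


# Made-up data: (sampled answers, gold answer) per question.
samples = [
    (["A", "A", "B"], "A"),  # majority is right
    (["C", "C", "D"], "D"),  # stable but wrong basin; oracle could still pick D
    (["B", "B", "B"], "B"),  # unanimous and right
    (["A", "D", "D"], "C"),  # no sample is right; even the oracle fails
]

maj_acc = sum(majority_vote(a) == g for a, g in samples) / len(samples)
orc_acc = sum(oracle_correct(a, g) for a, g in samples) / len(samples)
print(f"majority vote: {maj_acc:.2f}, oracle: {orc_acc:.2f}")
```

The second question shows the failure mode the summary describes: the samples agree on a wrong answer while a correct one is present in the pool, so any gain over majority vote has to come from a signal other than agreement.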
Reasoning, Evals, Benchmarks

Summary generated by Claude — human-verified