ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling
Signal: 78
Hype: 25
In three lines
ARBITER corrects majority-vote failures in test-time sampling.
Reasoning trajectories cluster into stable basins that aren't necessarily accurate.
ARBITER uses hidden states and model-derived evidence to add conservative signals to consensus, recovering ~22% of the oracle gap on Llama-3.1-8B MMLU-HS-Math (78%→82%).
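To make the terms concrete, here is a minimal sketch of a majority vote over sampled answers and of an "oracle gap recovered" ratio. It is not ARBITER itself. The helper names and the toy samples are hypothetical, and the gap definition (gain over majority vote divided by the distance from majority vote to an oracle that picks any correct sample) is an assumption, since the summary does not define it.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled completions."""
    return Counter(answers).most_common(1)[0][0]


def oracle_hit(answers, gold):
    """An oracle selector succeeds if any sampled answer matches the gold answer."""
    return gold in answers


def gap_recovered(baseline_acc, method_acc, oracle_acc):
    """Fraction of the baseline-to-oracle gap closed by a method (assumed definition)."""
    return (method_acc - baseline_acc) / (oracle_acc - baseline_acc)


# Toy case: the samples settle on a stable but wrong answer ("12"),
# while the correct answer ("15") is present in the minority.
samples = ["12", "12", "15", "12", "15"]
gold = "15"

vote = majority_vote(samples)
print("majority vote:", vote, "| correct:", vote == gold)
print("oracle could succeed:", oracle_hit(samples, gold))

# Back-of-envelope under the assumed definition: a 78% -> 82% gain that
# closes about 22% of the gap implies an oracle accuracy near this value.
implied_oracle = 0.78 + (0.82 - 0.78) / 0.22
print("implied oracle accuracy:", round(implied_oracle, 3))
```

The toy case is the failure mode the summary describes: the vote lands on the dominant basin even though a correct trajectory was sampled. The implied oracle accuracy is derived from the summary's numbers under the stated assumption, not reported by the source.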
Summary generated by Claude — human-verified