arXiv cs.CL·19 May 2026

Evaluating Language Models' Evaluations of Games

In three lines: An arXiv paper evaluating how language and reasoning models assess board games. In tests on 100+ games with 450 human judgments, reasoning models align better with humans than standard LLMs when evaluating game fairness and fun. Paradox: as models approach game-theoretic optimality, their fit to human judgments weakens.

Reasoning Evals Benchmarks Papers

Summary generated by Claude — human-verified
