arXiv cs.AI

Evaluating Language Models' Evaluations of Games

Signal: 72
Hype: 15
In three lines: An arXiv study compares how language models and reasoning models evaluate games against human judgments. The dataset covers 100+ board games and 450+ human evaluations. Reasoning models align better with human judgments, but the relationship is non-monotonic: as models approach game-theoretic optimality, their fit to human data weakens.
Reasoning, Evals, Benchmarks

Summary generated by Claude — human-verified