arXiv cs.AI · 19 May 2026

Evaluating Language Models' Evaluations of Games

In three lines: An arXiv study compares game evaluations produced by language models and reasoning models against human judgments. The dataset covers 100+ board games and 450+ human evaluations. Reasoning models align better with humans overall, but the relationship is non-monotonic: as models approach game-theoretic optimality, their fit to human data weakens.

Reasoning Evals Benchmarks

Summary generated by Claude — human-verified