GIM: Evaluating models via tasks that integrate multiple cognitive domains
Signal
82
Hype
15
In three lines
GIM is a benchmark of 820 original problems that evaluates LLMs on tasks integrating multiple cognitive domains (constraint satisfaction, state tracking, epistemic vigilance) rather than on memorization or pure abstract reasoning. It reports IRT calibration over more than 200k prompt-response pairs and 28 models, plus an extensive study of the compute vs. capability trade-off across 11 models and 35 configurations.
Summary generated by Claude — human-verified