arXiv cs.CL

GIM: Evaluating models via tasks that integrate multiple cognitive domains

Signal: 82 · Hype: 15
In three lines: GIM is a benchmark of 820 original problems that evaluates LLMs on tasks integrating multiple cognitive domains (constraint satisfaction, state tracking, epistemic vigilance) rather than on memorization or pure abstract reasoning. Its item response theory (IRT) calibration draws on more than 200k prompt-response pairs and 28 models. It also includes an extensive study of the compute vs. capability trade-off across 11 models and 35 configurations.
Benchmarks · Evals · Reasoning

Summary generated by Claude — human-verified