arXiv cs.CL · 19 May 2026

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models


In three lines: Researchers train KinGPT, a 25M-parameter model, on chess data and show that the high benchmark scores of chess-trained LLMs stem primarily from pattern matching rather than genuine rule understanding. LLM-Modulo, a verifier-in-the-loop framework, improves RedPajama 3B from 1.2% to 21.2% best-move accuracy. Training code, datasets, and model checkpoints are open-sourced.
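This summary does not describe how LLM-Modulo is implemented, so the following is only a minimal sketch of the general verifier-in-the-loop idea: a generator proposes a move, an external checker accepts or rejects it, and rejected moves are fed back as context for the next attempt. Every name here (`propose_with_verifier`, `toy_propose`, `toy_is_legal`, `LEGAL_MOVES`) is hypothetical, and the toy proposer and legal-move set stand in for a real language model and a real chess rules engine.

```python
from typing import Callable, List, Optional

# Hypothetical sketch of a verifier-in-the-loop wrapper; not the paper's code.
def propose_with_verifier(
    propose: Callable[[str, List[str]], str],
    is_legal: Callable[[str], bool],
    position: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first proposed move the verifier accepts, or None."""
    rejected: List[str] = []
    for _ in range(max_rounds):
        # The generator sees the position plus every move rejected so far.
        move = propose(position, rejected)
        if is_legal(move):
            return move
        rejected.append(move)
    return None


# Toy stand-ins (assumptions for illustration only).
LEGAL_MOVES = {"e2e4", "d2d4", "g1f3"}

def toy_propose(position: str, rejected: List[str]) -> str:
    # Pretend model: tries a fixed list and skips moves already rejected.
    candidates = ["e2e5", "a1a8", "g1f3"]
    for move in candidates:
        if move not in rejected:
            return move
    return candidates[-1]

def toy_is_legal(move: str) -> bool:
    # Pretend verifier: a real one would consult a chess rules engine.
    return move in LEGAL_MOVES


result = propose_with_verifier(toy_propose, toy_is_legal, "startpos")
print(result)
```

In this toy run the first two proposals are rejected and the third is accepted. The wrapper only guarantees that a returned move passes the verifier; it says nothing about whether that move is the best one.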


Benchmarks Evals Fine-tuning Papers Open source

Summary generated by Claude — human-verified
