I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size
Signal: 72
Hype: 35
In three lines
KeyLM, a 75M-parameter model trained on 18B tokens, edges out SmolLM-135M-Instruct on IFEval (17.85 vs 17.15) despite having a little over half the parameters and roughly 30x less training data. The architecture is standard: GQA, RoPE, SwiGLU, 24 layers. Training data: FineWeb-Edu, Wikipedia, Reddit and public datasets.
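For a quick sanity check on the headline numbers, here is a minimal sketch of the arithmetic behind them. It uses only the figures quoted in the summary; the variable names are illustrative, and the baseline token count is merely what the "30x" claim implies, not an independently confirmed figure.

```python
# Figures as quoted in the summary; nothing here is measured independently.
keylm_params = 75e6        # KeyLM parameter count
smollm_params = 135e6      # SmolLM-135M-Instruct parameter count
keylm_tokens = 18e9        # KeyLM training tokens
keylm_ifeval = 17.85       # reported IFEval score, KeyLM
smollm_ifeval = 17.15      # reported IFEval score, SmolLM-135M-Instruct

# How many times larger the baseline is ("almost double").
size_ratio = smollm_params / keylm_params

# Training tokens seen per parameter for KeyLM.
tokens_per_param = keylm_tokens / keylm_params

# Score difference in IFEval points.
ifeval_gap = keylm_ifeval - smollm_ifeval

# Assumption: baseline token count implied by the "30x less data" claim.
implied_baseline_tokens = 30 * keylm_tokens

print(f"size ratio: {size_ratio:.2f}x")
print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"IFEval gap: {ifeval_gap:.2f} points")
print(f"implied baseline tokens: {implied_baseline_tokens / 1e9:.0f}B")
```

The gap works out to under one IFEval point, so the result is better read as "matches at a smaller size" than as a decisive win.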
Summary generated by Claude — human-verified