Reddit r/LocalLLaMA

I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size

Signal: 72, Hype: 35
In three lines: KeyLM, a 75M-parameter model trained on 18B tokens, narrowly outperforms SmolLM-135M-Instruct on IFEval (17.85 vs 17.15) despite being roughly half the size and using 30x less training data. The architecture is standard: GQA, RoPE, SwiGLU, 24 layers. It was trained on FineWeb-Edu, Wikipedia, Reddit, and other public datasets.
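
To put the headline numbers in proportion, here is a minimal arithmetic sketch using only the figures quoted in the summary. The variable names are illustrative, and the baseline token count is only what the "30x" claim would imply, not a figure stated in the post.

```python
# Figures quoted in the summary above.
keylm_params = 75e6        # KeyLM parameter count
smollm_params = 135e6      # SmolLM-135M-Instruct parameter count
keylm_tokens = 18e9        # KeyLM training tokens
keylm_ifeval = 17.85
smollm_ifeval = 17.15

# KeyLM's size relative to the baseline ("roughly half").
size_ratio = keylm_params / smollm_params

# Training tokens seen per parameter.
tokens_per_param = keylm_tokens / keylm_params

# IFEval gap in points.
ifeval_margin = keylm_ifeval - smollm_ifeval

# Assumption: baseline token count implied by the "30x less data" claim,
# not a number stated in the summary.
implied_baseline_tokens = keylm_tokens * 30

print(f"size ratio: {size_ratio:.2f}")
print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"IFEval margin: {ifeval_margin:.2f} points")
print(f"implied baseline tokens: {implied_baseline_tokens:.2e}")
```

The margin is well under one point, so the result reads best as "comparable at roughly half the size" rather than a decisive win.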
Open source, Benchmarks, Code generation

Summary generated by Claude — human-verified