Reddit r/LocalLLaMA · 2 June 2026

I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size


In three lines: KeyLM, a 75M parameter model trained on 18B tokens, outperforms SmolLM-135M-Instruct on IFEval (17.85 vs 17.15) at a little over half the size (75M vs 135M parameters) and with about 30x less training data. The architecture is standard (GQA, RoPE, SwiGLU, 24 layers), and the training data is FineWeb-Edu, Wikipedia, Reddit and public datasets.
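To put those figures in proportion, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in the summary; the variable names are made up for illustration, and the baseline token count is simply what the "30x" claim implies, not a figure verified against the SmolLM release.

```python
# Figures as quoted in the summary (not independently verified).
keylm_params = 75e6      # KeyLM parameter count
smollm_params = 135e6    # SmolLM-135M-Instruct parameter count
keylm_tokens = 18e9      # KeyLM training tokens
keylm_ifeval = 17.85     # IFEval score reported for KeyLM
smollm_ifeval = 17.15    # IFEval score reported for SmolLM-135M-Instruct
claimed_data_ratio = 30  # "30x less training data"

# Relative model size: KeyLM as a fraction of the baseline.
size_ratio = keylm_params / smollm_params

# Training tokens seen per parameter for KeyLM.
tokens_per_param = keylm_tokens / keylm_params

# Absolute IFEval gap in points.
ifeval_gap = round(keylm_ifeval - smollm_ifeval, 2)

# Baseline training tokens implied by the 30x claim (assumption:
# taken from the summary's ratio, not from the baseline's own docs).
implied_baseline_tokens = keylm_tokens * claimed_data_ratio

print(f"size ratio: {size_ratio:.3f}")
print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"IFEval gap: {ifeval_gap:.2f} points")
print(f"implied baseline tokens: {implied_baseline_tokens:.2e}")
```

The gap on IFEval is under one point, so the headline result is better read as "matches a larger model with far less data" than as a decisive win.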


Open source · Benchmarks · Code generation

Summary generated by Claude — human-verified
