arXiv cs.CL·19 May 2026

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

In three lines: SynPro is a synthetic data generation framework that helps LLMs learn more thoroughly from limited organic corpora by rephrasing and reformatting them. Optimized with RL, it unlocks 3.7-5.2x more effective tokens than simple repetition on 400M and 1.1B models, and even surpasses the non-data-bound oracle at the 1.1B scale. The code is open-sourced.

Reinforcement learning · Benchmarks · Open source

Summary generated by Claude — human-verified
