Hugging Face Blog

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Signal: 75
Hype: 25
In three lines: Hugging Face describes how it built Cosmopedia, a large-scale synthetic dataset for LLM pre-training. The dataset contains about 25 billion tokens across more than 30 million files generated with Mixtral 8x7B, covering mathematics, science, and programming. Models trained on this data achieve performance comparable to models pre-trained on natural data.
Tags: Fine-tuning, Open source, Benchmarks, Papers

Summary generated by Claude — human-verified