Hugging Face Blog · 20 March 2024

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

In three lines: Hugging Face introduces Cosmopedia, a method for generating large-scale synthetic data for LLM pre-training. The dataset contains over 30 million files and 25 billion tokens generated with Mixtral 8x7B, covering mathematics, science, and programming. Models trained on this data achieve performance comparable to models pre-trained on natural data.

Fine-tuning Open source Benchmarks Papers

Summary generated by Claude — human-verified
