arXiv cs.CL

EmbGen: Teaching with Reassembled Corpora

Signal: 72
Hype: 18
In three lines: EmbGen is a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them via embedding similarity, and then generates QA pairs with proximity and cluster-specialized sampling. Evaluated on three datasets, EmbGen improves Binary Accuracy over baselines by 12.5% (at 5M tokens) to 88.9% (at 20M tokens) on the most heterogeneous one.
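To make the "reassemble via embedding similarity" step concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy bag-of-words embedding and cosine similarity as stand-ins for whatever embedding model and similarity measure EmbGen actually uses. The names `embed`, `cosine`, and `reassemble`, the parameter `k`, and the example pairs are all hypothetical. The QA generation and sampling stages are omitted.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reassemble(pairs, k=1):
    """Attach to each (entity, description) pair its k most similar other pairs."""
    vectors = [embed(f"{entity} {description}") for entity, description in pairs]
    groups = []
    for i, pair in enumerate(pairs):
        # Score every other pair; break ties by original index for determinism.
        scored = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(len(pairs)) if j != i),
            key=lambda s: (-s[0], s[1]),
        )
        neighbours = [pairs[j] for _, j in scored[:k]]
        groups.append((pair, neighbours))
    return groups


# Hypothetical entity-description pairs, for illustration only.
pairs = [
    ("LoRA", "low-rank adapter method for fine-tuning language models"),
    ("QLoRA", "quantized low-rank adapter method for fine-tuning language models"),
    ("BM25", "sparse lexical ranking function for document retrieval"),
    ("DPR", "dense passage retrieval with learned document encoders"),
]

groups = reassemble(pairs, k=1)
for (entity, _), neighbours in groups:
    print(entity, "->", [name for name, _ in neighbours])
```

Each resulting group (a pair plus its nearest neighbours) could then serve as the context from which a QA pair is generated. How EmbGen forms and samples those groups is not specified in this summary.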
Fine-tuning · RAG · Embeddings · Benchmarks

Summary generated by Claude — human-verified