SEDD: Scalable and Efficient Dataset Deduplication with GPUs
Signal: 82
Hype: 15
In three lines
SEDD is a GPU-accelerated dataset deduplication framework built on MinHash LSH. It outperforms SlimPajama's CPU tool by 158× and NVIDIA NeMo Curator's GPU tool by 7.8× on 30M documents. MinHash signature generation is 375× faster. It deduplicates 1.2T tokens in 3 hours on a 32-GPU V100 cluster.
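For readers unfamiliar with MinHash LSH, the sketch below shows the general technique on the CPU in plain Python. It is not SEDD's GPU implementation: the shingle size, the signature length `NUM_PERM`, the band count `BANDS`, the BLAKE2b-based hash family, and every function name are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64              # signature length (assumed, not from the source)
BANDS = 16                 # number of LSH bands (assumed)
ROWS = NUM_PERM // BANDS   # signature rows per band


def shingles(text, k=5):
    """Return the set of character k-grams of lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """One member of a seeded 64-bit hash family, built from BLAKE2b with a salt."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def candidate_pairs(docs):
    """Bucket documents by band; any two sharing a bucket become a candidate pair."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "The quick brown fox jumps over the lazy dog.",
        "b": "The quick brown fox jumps over the lazy dog.",
        "c": "Deduplication at scale needs sub-quadratic candidate search.",
    }
    print(candidate_pairs(docs))
```

Banding is what avoids comparing every pair of documents: only documents that agree on all rows of at least one band are compared further. More rows per band makes the filter stricter, and more bands makes it more permissive.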
Summary generated by Claude — human-verified