SEDD: Scalable and Efficient Dataset Deduplication with GPUs
SEDD is a GPU-accelerated deduplication framework based on MinHash LSH. It outperforms SlimPajama's CPU tool by 158× and NVIDIA NeMo Curator's GPU tool by 7.8× on 30M documents. Its MinHash signature generation is 375× faster. SEDD deduplicates 1.2T tokens in 3 hours on a 32-GPU V100 cluster.
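For readers unfamiliar with MinHash LSH, the following is a minimal CPU-only Python sketch of the general technique, not SEDD's GPU implementation. The signature length, band count, shingle size, hash function, and all function names are illustrative assumptions that do not come from the summary above.

```python
import hashlib
import random
from collections import defaultdict

# All constants below are assumed for illustration; SEDD's settings are not stated.
NUM_PERM = 128          # number of hash functions per signature
BANDS = 32              # LSH bands; rows per band = NUM_PERM // BANDS
PRIME = (1 << 61) - 1   # Mersenne prime used for the universal hash family

def shingles(text, k=5):
    """Return the set of character k-grams of the whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def _hash64(s):
    """Stable 64-bit hash of a string."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

# Deterministic (a, b) coefficients for h(x) = (a * x + b) mod PRIME.
_rng = random.Random(0)
PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME))
          for _ in range(NUM_PERM)]

def minhash_signature(text):
    """For each hash function, keep the minimum hash value over all shingles."""
    hashes = [_hash64(s) % PRIME for s in shingles(text)]
    return tuple(min((a * h + b) % PRIME for h in hashes) for a, b in PARAMS)

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def candidate_pairs(signatures):
    """Documents that share an identical band land in the same bucket."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[sig[band * rows:(band + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    pairs.add((ids[i], ids[j]))
    return pairs

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog",
    "1234567890 9876543210 5555 4444",
]
signatures = [minhash_signature(d) for d in docs]
pairs = candidate_pairs(signatures)
```

In this sketch the two identical documents always share every band and are flagged as a candidate pair, while the third document shares no shingles with them. A real pipeline would typically verify candidate pairs against a similarity threshold and then cluster them before removing duplicates. Signature generation is the per-document step that lends itself to parallel execution.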