arXiv cs.AI·19 May 2026

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

In three lines: SomaliWeb v1 is a Somali web corpus of 819,322 documents (~303M tokens), released with a BPE-16K tokenizer and a language-identification benchmark. The paper reports major defects in existing distributions (HPLT v2: 17.3% duplicates, 56.1% mojibake). The tokenizer is 40.2% more efficient than GPT-4's cl100k_base on FLORES-200.

Embeddings, Open source, Benchmarks, Papers

Summary generated by Claude — human-verified
