arXiv cs.AI

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Signal: 78, Hype: 15
In three lines
SomaliWeb v1 is a Somali corpus of 819,322 documents (~303M tokens) with a BPE-16K tokenizer and a language-identification benchmark.
It reveals major defects in existing distributions (HPLT v2: 17.3% duplicates, 56.1% mojibake).
The tokenizer is 40.2% more efficient than GPT-4's cl100k_base on FLORES-200.
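
The summary does not say how the duplicate, mojibake, or efficiency figures were measured. As a rough illustration only, the sketch below shows one plausible way such numbers could be computed. All three functions (`duplicate_rate`, `looks_like_mojibake`, `token_reduction`) are hypothetical helpers, not the paper's method: duplicates are counted by exact match after whitespace normalization, mojibake is flagged with a crude marker heuristic, and "efficiency" is assumed to mean relative reduction in token count.

```python
import hashlib


def duplicate_rate(docs):
    """Fraction of documents whose whitespace-normalized text was already seen.

    Assumption: exact-match deduplication; the paper may use a different criterion.
    """
    seen = set()
    dupes = 0
    for text in docs:
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(docs) if docs else 0.0


def looks_like_mojibake(text):
    """Crude heuristic: UTF-8 bytes mis-decoded as Latin-1/cp1252 tend to
    leave marker characters such as 'Ã' or 'â€' in the text."""
    markers = ("Ã", "â€", "Â")
    return any(marker in text for marker in markers)


def token_reduction(baseline_tokens, candidate_tokens):
    """Relative reduction in token count versus a baseline tokenizer.

    Assumption: this is what 'more efficient' means in the summary.
    """
    return 1.0 - candidate_tokens / baseline_tokens
```

Under these assumptions, a tokenizer that needs 598 tokens where the baseline needs 1,000 would have a reduction of about 0.402, which is the shape of claim the summary makes.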
Embeddings, Open source, Benchmarks, Papers

Summary generated by Claude — human-verified