arXiv cs.CL

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Signal
78
Hype
15
In three lines
SomaliWeb v1 is a Somali corpus of 819,322 documents (~303M tokens), released with a BPE-16K tokenizer and a language-identification benchmark.
It reports critical defects in HPLT v2: 17.3% exact duplicates, 10.7% near-duplicates, and 56.1% mojibake.
The tokenizer is 40.2% more efficient than cl100k_base on FLORES-200.
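To make the headline percentages concrete, here is a minimal Python sketch of how an exact-duplicate rate and a tokenizer efficiency gain could be computed. The summary does not state the paper's definitions, so both are assumptions: duplicates are counted as documents whose full text hashes identically to an earlier one, and "more efficient" is read as the relative reduction in token count on the same text. The function names `exact_duplicate_rate` and `token_savings` are hypothetical, not taken from the paper.

```python
import hashlib


def exact_duplicate_rate(docs):
    """Fraction of documents whose text exactly matches an earlier document.

    Assumption: "exact duplicate" means byte-identical UTF-8 text.
    """
    seen = set()
    duplicates = 0
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(docs) if docs else 0.0


def token_savings(baseline_tokens, new_tokens):
    """Relative reduction in token count versus a baseline tokenizer.

    Assumption: efficiency is measured as fewer tokens for the same text.
    """
    return 1 - new_tokens / baseline_tokens
```

Under this reading, a tokenizer that encodes a text in 598 tokens where the baseline needs 1,000 would score `token_savings(1000, 598)`, roughly 0.402.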
Embeddings · Benchmarks · Open source · Papers

Summary generated by Claude — human-verified