arXiv cs.CL

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

Signal: 78
Hype: 15
In three lines
The study reveals a visibility asymmetry in multilingual datasets: 118 languages (59% of the 200 most-spoken) have zero catalogued datasets in the LRE Map and LDC. Using LLM-assisted citation mining on Semantic Scholar, the authors identify 609 unique datasets across 53 low-visibility languages, 356 of which are openly accessible. Data scarcity is therefore a documentation and discoverability problem, not only a production problem.
Benchmarks · Open source

Summary generated by Claude — human-verified