Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP
Signal: 78
Hype: 15
In three lines
The study documents a dataset visibility asymmetry in multilingual NLP: 118 of the 200 most-spoken languages (59%) have no datasets catalogued in the LRE Map or LDC. Using LLM-assisted citation mining on Semantic Scholar, the authors identify 609 unique datasets across 53 low-visibility languages, 356 of which are publicly accessible. Data scarcity is therefore also a problem of documentation and discoverability, not only of data production.
Summary generated by Claude — human-verified