Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]
Signal: 72
Hype: 15
In three lines
Free multilingual corpus of 9.8M documents across 11 languages: ten Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu) plus English. 8.4B tokens, CC0 license, available on HuggingFace.
Summary generated by Claude — human-verified