Reddit r/MachineLearning

Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

Signal: 72, Hype: 15
In three lines: A free multilingual corpus of 9.8M documents across 11 languages: ten Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu) plus English. It totals 8.4B tokens, is released under a CC0 license, and is available on HuggingFace.
Open source, Embeddings

Summary generated by Claude — human-verified