I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.
Signal: 78
Hype: 25
In three lines
Complete Usenet corpus (1980–2013) released for local fine-tuning: 103.1B tokens, 408M posts, zero AI contamination. Pre-SEO, pre-algorithm internet writing across 33 years. Organized by domain hierarchies (comp.*, sci.*, rec.*). Free samples available, full corpus under license.
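To make the hierarchy organization concrete, here is a minimal Python sketch that buckets newsgroup names by their top-level hierarchy (the text before the first dot). The function name `group_by_hierarchy` and the sample group names are illustrative assumptions; the post does not describe the corpus's actual file layout or manifest format.

```python
from collections import defaultdict


def group_by_hierarchy(newsgroups):
    """Bucket newsgroup names by top-level hierarchy (text before the first dot)."""
    buckets = defaultdict(list)
    for name in newsgroups:
        top = name.split(".", 1)[0]  # "comp.lang.c" -> "comp"
        buckets[top].append(name)
    return dict(buckets)


# Illustrative group names only; not read from the corpus itself.
sample = ["comp.lang.c", "sci.physics", "rec.arts.books", "comp.os.linux.misc"]
grouped = group_by_hierarchy(sample)

for hierarchy, groups in sorted(grouped.items()):
    print(hierarchy, groups)
```

A split like this would let you fine-tune on a single domain (say, only `comp.*`) rather than the full corpus, assuming each post carries its newsgroup name.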
Summary generated by Claude — human-verified