Reddit r/LocalLLaMA · 27 May 2026

I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.

In three lines

Complete Usenet corpus (1980–2013) released for local fine-tuning: 103.1B tokens, 408M posts, zero AI contamination. Pre-SEO, pre-algorithm internet writing across 33 years. Organized by domain hierarchies (comp.*, sci.*, rec.*). Free samples available, full corpus under license.

Fine-tuning Open source Benchmarks

Summary generated by Claude — human-verified
