Pretraining Language Models on Historical Text
Signal: 78
Hype: 15
In three lines
TypewriterLM is a 7.24B-parameter language model trained exclusively on English text predating 1913. The authors construct TypewriterCorpus (54B tokens), a cleaned historical corpus with leakage mitigation, and introduce lexically grounded instruction tuning, which anchors responses in historical documents. They also release three datasets and the History-Event benchmark.
Summary generated by Claude — human-verified