arXiv cs.CL

Pretraining Language Models on Historical Text

In three lines: TypewriterLM is a 7.24B-parameter language model trained exclusively on English text predating 1913. The authors construct TypewriterCorpus (54B tokens), a cleaned historical corpus with leakage mitigation, and introduce lexically grounded instruction tuning, which grounds responses in historical documents. Three datasets and the History-Event benchmark are released.
Papers · Fine-tuning · Benchmarks · Evals

Summary generated by Claude — human-verified