Training a GPT-like model on non-language series [R]
A researcher trains Transformer-decoder models (100M–500M parameters) on 750M tokens of non-language series data. Setup: AdamW, lr=1e-3, batch size of 4M tokens, 16 layers. The model fails to learn basic autoregressive behavior and repeatedly generates a single token.
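To put the reported setup in perspective, here is a rough back-of-the-envelope sketch of the training budget it implies. It assumes that "4M" means 4,000,000 tokens per optimizer step and that the data is seen in a single pass; the function and constant names are illustrative and do not come from the post.

```python
# Back-of-the-envelope training budget from the figures quoted in the post.
# Assumption: "batch=4M tokens" means 4,000,000 tokens per optimizer step.
# All names below are hypothetical helpers, not the researcher's code.

DATASET_TOKENS = 750_000_000
BATCH_TOKENS = 4_000_000
PARAM_COUNTS = (100_000_000, 500_000_000)  # smallest and largest model


def steps_per_epoch(dataset_tokens: int, batch_tokens: int) -> float:
    """Optimizer steps needed for one full pass over the data."""
    return dataset_tokens / batch_tokens


def tokens_per_param(dataset_tokens: int, n_params: int) -> float:
    """Unique training tokens available per model parameter."""
    return dataset_tokens / n_params


if __name__ == "__main__":
    steps = steps_per_epoch(DATASET_TOKENS, BATCH_TOKENS)
    print(f"optimizer steps per epoch: {steps}")
    for n in PARAM_COUNTS:
        ratio = tokens_per_param(DATASET_TOKENS, n)
        print(f"{n:,} params: {ratio} tokens per parameter")
```

Under these assumptions, one pass over the data is fewer than 200 optimizer steps, and the largest model sees under two tokens per parameter. Whether either figure explains the single-token collapse is not established by the post, but both are worth checking alongside the learning rate.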