arXiv cs.AI · 19 May 2026

Training Infinitely Deep and Wide Transformers


In three lines: A theoretical paper on transformer training in the mean-field regime (infinite depth and width). The authors model training as controlling a neural PDE (versus an ODE for ResNets), establish well-posedness of the forward pass, derive explicit formulas for the Wasserstein gradients, and prove that gradient flow converges to global minima under NTK injectivity conditions.



Summary generated by Claude — human-verified
