Training Infinitely Deep and Wide Transformers
Signal: 75
Hype: 15
In three lines: A theoretical paper on training transformers in the mean-field regime (infinite depth and width). The authors model training as the control of a neural PDE (versus a neural ODE for ResNets), establish well-posedness of the forward pass, derive explicit formulas for the Wasserstein gradients, and prove that gradient flow converges to global minima under NTK injectivity conditions.
Summary generated by Claude — human-verified