Balancing Learning Rates Across Layers: Exact Two-Step Dynamics and Optimal Scaling in Linear Neural Networks
A theoretical study of optimal learning rates in two- and three-layer linear neural networks. It derives exact closed-form expressions for the gradients and the test loss after one and two gradient descent steps. Key finding: unequal layer-wise learning rates minimize the loss early in training, while equal rates become optimal later. Code is released.
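To make the setting concrete, here is a minimal NumPy sketch of gradient descent on a two-layer linear network with a separate learning rate per layer. It is not the authors' released code and does not reproduce their closed-form expressions. The noiseless teacher-student data, the initialization scale, the dimensions, and every name (`losses_after_steps`, `eta1`, `eta2`, and so on) are assumptions chosen only for illustration.

```python
import numpy as np


def make_data(n, d, w_star, rng):
    """Gaussian inputs with noiseless linear targets (assumed toy setup)."""
    X = rng.standard_normal((n, d))
    return X, X @ w_star.T


def mse(W1, W2, X, Y):
    """Half mean squared error of the linear network x -> W2 @ W1 @ x."""
    E = X @ W1.T @ W2.T - Y
    return 0.5 * np.mean(np.sum(E ** 2, axis=1))


def losses_after_steps(eta1, eta2, steps=2, d=5, h=4, n=200, seed=0):
    """Run `steps` full-batch GD steps with per-layer learning rates.

    eta1 scales the update of the first layer W1, eta2 that of the
    second layer W2. Returns (train_loss, test_loss).
    """
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal((1, d)) / np.sqrt(d)   # hypothetical teacher
    X_tr, Y_tr = make_data(n, d, w_star, rng)
    X_te, Y_te = make_data(1000, d, w_star, rng)
    W1 = rng.standard_normal((h, d)) / np.sqrt(d)
    W2 = rng.standard_normal((1, h)) / np.sqrt(h)

    for _ in range(steps):
        E = X_tr @ W1.T @ W2.T - Y_tr        # residuals, shape (n, 1)
        g2 = E.T @ (X_tr @ W1.T) / n         # dL/dW2, shape (1, h)
        g1 = W2.T @ E.T @ X_tr / n           # dL/dW1, shape (h, d)
        # Both gradients use the pre-update weights; each layer then
        # takes a step scaled by its own learning rate.
        W1 = W1 - eta1 * g1
        W2 = W2 - eta2 * g2

    return mse(W1, W2, X_tr, Y_tr), mse(W1, W2, X_te, Y_te)


if __name__ == "__main__":
    # Grid search over (eta1, eta2) for the lowest test loss after 1 and 2 steps.
    grid = np.logspace(-2, 0, 9)
    for steps in (1, 2):
        loss, a, b = min(
            (losses_after_steps(a, b, steps)[1], a, b) for a in grid for b in grid
        )
        print(f"steps={steps}: best eta1={a:.3g}, eta2={b:.3g}, test loss={loss:.4g}")
```

Comparing the best `(eta1, eta2)` pair on the grid with the best pair on the diagonal `eta1 == eta2` gives a rough numerical way to check, on one random instance, whether unequal rates help after a small number of steps. Results from this toy depend on the seed, dimensions, and initialization, so they are no substitute for the exact expressions derived in the paper.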