Gradient Flow Dynamics of Teacher-Student Distillation with the Squared Loss
Presented at the Summer@Simons poster session, 2024
Recommended citation: Berkan Ottlik. (2024). "Gradient Flow Dynamics of Teacher-Student Distillation with the Squared Loss". https://berkan.xyz/files/underparameterizedDynamics.pdf
We study a teacher-student learning setup, where a “student” one-layer neural network tries to approximate a fixed “teacher” one-layer neural network. We analyze the population gradient flow dynamics in the previously unstudied setting of exact- and under-parameterization, even Hermite polynomial activation functions, and the squared loss. In the toy model with 2 teacher neurons and 2 student neurons, we fully characterize all critical points. We identify “tight-balance” critical points, which are frequently encountered in simulation and greatly slow down training. We prove that, under favorable initialization, gradient flow avoids tight-balance critical points and converges to the global optimum. We extend the notions of tight-balance critical points and favorable initializations to the multi-neuron exact- and under-parameterized regimes. Additionally, we compare dynamics under the squared loss to the simpler correlation loss, and we describe the loss landscape in the multi-neuron exact- and under-parameterized regimes. Finally, we discuss potential implications of our work for training neural networks with even activation functions.
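The toy model from the abstract is easy to simulate. Below is a minimal sketch (not the paper's code) of the 2-teacher/2-student setup with the even Hermite activation He₂(z) = z² − 1 and the squared loss; population gradient flow is approximated by small-step gradient descent on a large fixed Gaussian sample. The dimension, step size, and all variable names are illustrative assumptions.

```python
# Minimal sketch, assuming a one-layer network f_W(x) = sum_i He_2(<w_i, x>)
# with trainable first-layer weights only; hyperparameters are illustrative.
import jax
import jax.numpy as jnp

d = 10                                  # input dimension (assumption)
key = jax.random.PRNGKey(0)
k_teacher, k_student, k_data = jax.random.split(key, 3)

def he2(z):
    # Probabilists' Hermite polynomial He_2: an even activation function.
    return z ** 2 - 1.0

def net(W, x):
    # One-layer network: sum of He_2 pre-activations over the neurons (rows of W).
    return he2(x @ W.T).sum(axis=-1)

# Fixed 2-neuron teacher; 2-neuron student (the exactly parameterized regime).
W_teacher = jax.random.normal(k_teacher, (2, d)) / jnp.sqrt(d)
W_student = jax.random.normal(k_student, (2, d)) / jnp.sqrt(d)

# Large Gaussian sample standing in for the population expectation.
X = jax.random.normal(k_data, (100_000, d))
y = net(W_teacher, X)

def loss(W):
    # Squared loss between student and teacher outputs.
    return 0.5 * jnp.mean((net(W, X) - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))
dt = 1e-2                               # small step: Euler-discretized gradient flow
for t in range(20_000):
    W_student = W_student - dt * grad_loss(W_student)
    if t % 2_000 == 0:
        print(f"t={t:6d}  loss={loss(W_student):.6f}")
```

If the printed loss trace plateaus for long stretches before dropping, that mirrors the training slowdown near tight-balance critical points described above; a favorable initialization in the paper's sense avoids such plateaus.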