Paper: Deep Neural Networks as Gaussian Processes
Set-up:
It was already known that in the infinite-width limit, a single-hidden-layer neural network (NN) with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP).
The point of this paper is to extend the result to deep neural networks (DNNs). They do this by taking the hidden layer widths to infinity in succession (why does it matter that it’s in succession?). Recursively, we have

$$
K^l(x, x') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}_{z \sim \mathcal{GP}(0, K^{l-1})}\big[\phi(z(x))\, \phi(z(x'))\big].
$$

But of course, we only care about $z$ at $x$ and $x'$, so we can integrate against the joint at only those two points. We are left with a bivariate Gaussian distribution with covariance matrix entries $K^{l-1}(x, x)$, $K^{l-1}(x, x')$, and $K^{l-1}(x', x')$. Thus, we can write

$$
K^l(x, x') = \sigma_b^2 + \sigma_w^2 \, F_\phi\big(K^{l-1}(x, x'),\, K^{l-1}(x, x),\, K^{l-1}(x', x')\big),
$$

where $F_\phi$ is a deterministic function whose form only depends on the nonlinearity $\phi$. Assuming Gaussian initialization, the base case is the linear kernel (with bias) corresponding to the first layer:

$$
K^0(x, x') = \sigma_b^2 + \sigma_w^2 \, \frac{x \cdot x'}{d_{\text{in}}}.
$$
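As a sanity check on the recursion, here is a minimal sketch (my own code, not the paper's; they use a more efficient lookup-table scheme) that estimates $F_\phi$ by Monte Carlo, sampling the bivariate Gaussian at each layer. The hyperparameters `sigma_w2`, `sigma_b2`, and the sample count are arbitrary choices.

```python
import numpy as np

def nngp_kernel_entry(x, xp, phi, depth, sigma_w2=1.0, sigma_b2=0.1,
                      n_samples=200_000, seed=0):
    """Monte Carlo estimate of (K^L(x,x), K^L(x,x'), K^L(x',x')) via the recursion
    K^l = sigma_b^2 + sigma_w^2 * E[phi(z(x)) phi(z(x'))], z ~ GP(0, K^{l-1})."""
    rng = np.random.default_rng(seed)
    d_in = x.shape[0]
    # Base case: linear kernel with bias (first layer, Gaussian init).
    kxx = sigma_b2 + sigma_w2 * (x @ x) / d_in
    kxxp = sigma_b2 + sigma_w2 * (x @ xp) / d_in
    kxpxp = sigma_b2 + sigma_w2 * (xp @ xp) / d_in
    for _ in range(depth):
        # Only the bivariate marginal at (x, x') matters, so sample just that.
        cov = np.array([[kxx, kxxp], [kxxp, kxpxp]])
        z = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
        a = phi(z)  # apply the nonlinearity elementwise, shape (n_samples, 2)
        kxx = sigma_b2 + sigma_w2 * np.mean(a[:, 0] ** 2)
        kxxp = sigma_b2 + sigma_w2 * np.mean(a[:, 0] * a[:, 1])
        kxpxp = sigma_b2 + sigma_w2 * np.mean(a[:, 1] ** 2)
    return kxx, kxxp, kxpxp

# Example: depth-3 ReLU NNGP kernel between two random inputs.
x, xp = np.random.randn(10), np.random.randn(10)
print(nngp_kernel_entry(x, xp, phi=lambda t: np.maximum(t, 0.0), depth=3))
```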
See my GPs notes for how to do Bayesian prediction with GPs. Most notably, you can just do Gaussian process regression or kernelized ridge regression (KRR),

$$
\hat{f}(x_*) = K(x_*, X)\big(K(X, X) + \sigma_\varepsilon^2 I\big)^{-1} y,
$$

where $\sigma_\varepsilon^2$ is your ridge penalty / noise.
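For concreteness, a minimal sketch of the standard GP regression equations (posterior mean and variance, not specific to this paper); `noise` plays the role of $\sigma_\varepsilon^2$:

```python
import numpy as np

def gp_predict(K_train, K_test_train, K_test_diag, y_train, noise=1e-2):
    """GP regression posterior mean (== KRR prediction) and per-point variance."""
    n = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_test_train @ alpha                 # K(X_*, X) (K + noise I)^{-1} y
    v = np.linalg.solve(L, K_test_train.T)
    var = K_test_diag - np.sum(v ** 2, axis=0)  # predictive variance per test point
    return mean, var
```

The variance term is what gives the "free" uncertainty estimates mentioned further down.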
If there are no hidden layers, our kernel is just the linear kernel (with a bias), and our NNGP is just ridge regression. With weight decay (l2 regularization, with penalty matched to the GP noise), training the linear model with GD converges to the same solution (without l2 it converges to the least-squares solution).
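A quick numerical check of that equivalence (my own sketch, with arbitrary $\sigma_w^2$, $\sigma_b^2$, and noise): GP regression with the depth-0 kernel matches ridge regression on the corresponding bias-augmented features when the ridge penalty equals the GP noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 50, 5, 0.1
sigma_w2, sigma_b2 = 1.0, 0.1
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
X_test = rng.standard_normal((10, d))

# Feature map whose inner products give K^0: phi(x) = [sigma_b, sigma_w * x / sqrt(d)].
def features(A):
    bias = np.full((A.shape[0], 1), np.sqrt(sigma_b2))
    return np.hstack([bias, np.sqrt(sigma_w2 / d) * A])

Phi, Phi_test = features(X), features(X_test)

# Ridge regression in feature space, penalty = noise.
w = np.linalg.solve(Phi.T @ Phi + noise * np.eye(Phi.shape[1]), Phi.T @ y)
pred_ridge = Phi_test @ w

# Depth-0 NNGP posterior mean with the same kernel and noise.
K, K_star = Phi @ Phi.T, Phi_test @ Phi.T
pred_gp = K_star @ np.linalg.solve(K + noise * np.eye(n), y)

print(np.allclose(pred_ridge, pred_gp))  # True: same predictor
```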
Ok now if we have one hidden layer and our activation function is ReLU, what happens? Our kernel is

$$
K^1(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi}\,\sqrt{K^0(x, x)\, K^0(x', x')}\,\big(\sin\theta + (\pi - \theta)\cos\theta\big),
$$

where

$$
\theta = \arccos\!\left(\frac{K^0(x, x')}{\sqrt{K^0(x, x)\, K^0(x', x')}}\right).
$$
This is kinda ugly and IDK what to do with it. The known limitations-of-kernels results should still hold. I ran a few inductive bias experiments comparing the NNGP (via KRR) to NNs trained with AdamW, but they are not that interesting and I think they were a waste of time (see the dropdown below).
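Still, here is a short sketch of that ReLU closed form (my code; the hyperparameter defaults are arbitrary), applying one arccosine step on top of the base kernel:

```python
import numpy as np

def relu_kernel_step(K, sigma_w2=1.0, sigma_b2=0.1):
    """One layer of the ReLU NNGP recursion via the arccosine closed form."""
    diag = np.sqrt(np.diag(K))
    norm = np.outer(diag, diag)               # sqrt(K(x,x) K(x',x'))
    cos_t = np.clip(K / norm, -1.0, 1.0)      # clip to guard against rounding error
    theta = np.arccos(cos_t)
    return sigma_b2 + sigma_w2 / (2 * np.pi) * norm * (np.sin(theta) + (np.pi - theta) * cos_t)

def base_kernel(X, sigma_w2=1.0, sigma_b2=0.1):
    """K^0: linear kernel with bias (first layer)."""
    return sigma_b2 + sigma_w2 * (X @ X.T) / X.shape[1]

# One-hidden-layer ReLU NNGP kernel on some random inputs.
X = np.random.randn(4, 10)
K1 = relu_kernel_step(base_kernel(X))
```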
Deep signal propagation studies the statistics of hidden representations in deep NNs. The paper draws some cool links to that line of work, most cleanly for tanh and also for ReLU.
For tanh, the deep signal prop papers identified an ordered and a chaotic phase, depending on $\sigma_w^2$ and $\sigma_b^2$. In the ordered phase, similar inputs to the NN yield similar outputs. This occurs when $\sigma_b^2$ dominates $\sigma_w^2$. In the NNGP, this manifests as $K^l(x, x')$ approaching a constant function. In the chaotic phase, similar inputs to the NN yield vastly different outputs. This occurs when $\sigma_w^2$ dominates $\sigma_b^2$. In the NNGP, this manifests as $K^l(x, x)$ approaching a constant function and $K^l(x, x')$ (for $x \neq x'$) approaching a smaller constant function. In other words, in the chaotic phase, the diagonal of the kernel matrix is some value and the off-diagonals are all some other, smaller, value.
Interestingly, the NNGP performs best near the threshold between the chaotic and ordered phases. As depth increases, the kernel converges towards its constant fixed point, and we only perform well closer and closer to the threshold. We do well at the threshold because, there, convergence to the fixed point is much slower (this is bc of some deep signal prop stuff I don’t understand).
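To see the two phases numerically, here is a rough sketch (mine, not the paper's) that iterates the tanh recursion on a 2x2 kernel and tracks the correlation between the two inputs across depth; the specific $(\sigma_w^2, \sigma_b^2)$ values are my guesses for a point in each phase.

```python
import numpy as np

def tanh_correlation_profile(sigma_w2, sigma_b2, depth=30, q0=1.0, c0=0.5,
                             n_samples=200_000, seed=0):
    """Iterate the tanh NNGP recursion on a 2x2 kernel (variance q, correlation c),
    estimating the expectation by Monte Carlo; returns c after each layer."""
    rng = np.random.default_rng(seed)
    q, c, cs = q0, c0, []
    for _ in range(depth):
        cov = q * np.array([[1.0, c], [c, 1.0]])
        z = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
        a = np.tanh(z)
        q_new = sigma_b2 + sigma_w2 * np.mean(a[:, 0] ** 2)
        k_new = sigma_b2 + sigma_w2 * np.mean(a[:, 0] * a[:, 1])
        q, c = q_new, min(k_new / q_new, 1.0)  # clip: MC noise can push c past 1
        cs.append(c)
    return cs

# Ordered-ish point (bias dominates): correlation is driven towards 1 with depth.
print(tanh_correlation_profile(sigma_w2=1.0, sigma_b2=1.0)[-1])
# Chaotic-ish point (weights dominate): correlation settles at a fixed point below 1.
print(tanh_correlation_profile(sigma_w2=4.0, sigma_b2=0.1)[-1])
```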
They ran experiments (Figure 1) showing that on MNIST and CIFAR-10, finite NNs and the NNGP do essentially equally well. This indicates that feature learning is not important for doing well on MNIST and CIFAR-10! (TODO: Find similar experiments on ImageNet and other datasets.)
Additionally, they ran experiments (Figure 2) showing that increasing width improves generalization for fully connected MLPs on CIFAR-10. TODO: why should I expect this?
They also show that the NNGP’s predictive uncertainty is well correlated with empirical error on MNIST and CIFAR-10. It’s nice that you get uncertainty estimates for free.
TODO: How computationally expensive is the NNGP?
TODO: How does the NNGP compare to the NTK, RBF, and other kernels?