A Gaussian process (GP) is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution (copied from Wikipedia).
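In symbols (my restatement of the same definition): for any finite set of inputs $x_1, \ldots, x_n$,

$$\big(f(x_1), \ldots, f(x_n)\big) \sim \mathcal{N}(m, K), \qquad m_i = m(x_i), \quad K_{ij} = k(x_i, x_j),$$

where $m(\cdot)$ is the mean function and $k(\cdot, \cdot)$ is the covariance (kernel) function.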

Some GPs (specifically stationary GPs, meaning they have a constant mean and a covariance that depends only on the relative positions of data points) have an explicit representation, where you can write them as

$$f(x) = g(R, x),$$

for some random variables $R$ and some fixed deterministic function $g$ (I think this is true).
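One concrete instance (my example, not from the original note): for a zero-mean stationary kernel $k$ with $k(0) = 1$, Bochner's theorem says $k$ is the Fourier transform of a spectral density $p(\omega)$, and a draw from the GP can be approximated with random Fourier features,

$$f(x) \approx \sqrt{\tfrac{2}{D}} \sum_{i=1}^{D} w_i \cos(\omega_i^\top x + b_i), \qquad w_i \sim \mathcal{N}(0, 1), \quad \omega_i \sim p(\omega), \quad b_i \sim \mathrm{Unif}[0, 2\pi],$$

so $R = (w, \omega, b)$ plays the role of the random variables and the cosine sum is the fixed deterministic $g$.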

Bayesian inference and prediction with a GP

We have a training set $\mathcal{D} = \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \}$, and we wish to make a Bayesian prediction (see my notes on Bayesian stuff for the difference between Bayesian inference and prediction) on a test sample $x$ using our GP, which is a distribution over functions.

We consider the following noise model: $y^{(i)} = f(x^{(i)}) + \epsilon^{(i)}$, where $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma_n^2)$ and $f$ is the latent function drawn from the GP.

Our joint prior on the training labels $Y$ and the test value $f = f(x)$ is

$$\begin{bmatrix} Y \\ f \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} m(X) \\ m(x) \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I_n & K(X, x) \\ K(x, X) & K(x, x) \end{bmatrix} \right).$$

From here, you just do the standard conditioning of a joint Gaussian (see the Wikipedia page on conditional multivariate normal distributions). The result can itself be written as a new GP; this is your posterior predictive.
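Writing it out (standard Gaussian conditioning, nothing specific to this note), the posterior predictive at the test point $x$ is

$$f \mid Y \sim \mathcal{N}\Big( m(x) + K(x, X) \big(K(X, X) + \sigma_n^2 I_n\big)^{-1} \big(Y - m(X)\big), \; K(x, x) - K(x, X) \big(K(X, X) + \sigma_n^2 I_n\big)^{-1} K(X, x) \Big).$$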

You can place a GP prior over functions and then compute the posterior after seeing data, giving you a probability distribution over what the rest of the function looks like. This is what they call Gaussian process regression. They get cool visualizations from it, like this one from scikit-learn.

[Figure: Cool GP Regression]
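A minimal sketch of this with scikit-learn (my example; the RBF kernel, the toy data, and the noise level are all assumptions, not from the original note):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Toy 1-D training data following the y = f(x) + eps noise model from above.
X_train = rng.uniform(0, 5, size=(20, 1))
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=20)

# alpha is the noise variance sigma_n^2 added to the diagonal of K(X, X).
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1**2)
gpr.fit(X_train, y_train)

# Posterior predictive mean and standard deviation on a test grid.
X_test = np.linspace(0, 5, 100).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)
```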

But also, the posterior predictive mean (equivalently the MAP estimate, since the posterior is Gaussian) corresponds to the kernel ridge regression solution,

$$\hat{f}(x) = K(x, X) \big(K(X, X) + \sigma_n^2 I_n \big)^{-1} Y,$$

where $\sigma_n^2$ is your ridge penalty / noise variance, and the prior mean $m$ is taken to be zero. This is just the mean curve in the plot above, so if you don't care about the uncertainty quantification just do this ^.
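A direct numpy sketch of that mean formula (my example, assuming an RBF kernel and zero prior mean; the function names are mine):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between the rows of A and the rows of B.
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior_mean(X_train, Y_train, X_test, noise_var=0.01):
    # \hat{f}(x) = K(x, X) (K(X, X) + sigma_n^2 I_n)^{-1} Y
    K_train = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_cross = rbf_kernel(X_test, X_train)
    return K_cross @ np.linalg.solve(K_train, Y_train)
```

Solving the linear system with `np.linalg.solve` (or a Cholesky factorization) is the usual move instead of forming the explicit inverse.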

TODO: Maybe an example with a linear kernel?