263-5210-00: Probabilistic Artificial Intelligence
Section 3
Gaussian Processes
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/12/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results arising from the use of this file.
This personal note is adapted from Professor Andreas Krause. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
We obtain Gaussian processes by extending the function-space view of Bayesian linear regression to infinite domains. We are still concerned with the problem of estimating the value of a function $f : \mathcal{X} \rightarrow \mathbb{R}$ at arbitrary points $x^* \in \mathcal{X}$ given training data $\{ (x_i, y_i) \}_{i = 1}^n$, where the labels are assumed to be corrupted by homoscedastic Gaussian noise,
$$y_i = f(x_i) + \epsilon_i$$
where $\epsilon_i \sim N(0, \sigma_n^2)$. We will denote $\textbf{X}$ as the design matrix (collection of training inputs) and $\textbf{y}$ as the vector of training labels.
Definition of Gaussian Process:
A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.
We use a set $\mathcal{X}$ to index the collection of random variables. A Gaussian process is characterized by a mean function $\mu : \mathcal{X} \rightarrow \mathbb{R}$ and a covariance function (kernel function) $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ such that for any $A = \{ x_1, ..., x_m \} \subseteq \mathcal{X}$ we have
$$f_A = \left[ f_{x_1} \cdots f_{x_m} \right]^T \sim N(\mu_A, \textbf{K}_{AA})$$
where $\mu_A = \begin{bmatrix} \mu(x_1) \\ \vdots \\ \mu(x_m) \end{bmatrix}$ and $\textbf{K}_{AA} = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix}$.
We denote it as $f \sim \mathcal{GP}(\mu, k)$. In particular, given a mean function, covariance function, and using the homoscedastic noise assumption
$$y^* \mid x^*, \mu, k \sim N(\mu(x^*), k(x^*, x^*) + \sigma_n^2).$$
The random variable $f_x$ represents the value of the function $f$ at location $x \in \mathcal{X}$; we thus write $f(x) = f_{x}$. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions, and it is therefore often called an infinite-dimensional Gaussian.
Commonly, for notational simplicity, the mean function is taken to be zero. For a fixed mean this is not a restriction, as we can simply model the difference between the observations and the mean with a zero-mean Gaussian process.
Learning and Inference¶
First, let us look at learning and inference in the context of Gaussian processes. Given a prior $f \sim \mathcal{GP}(\mu, k)$ and (noisy) observations $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma_n^2)$ at locations $A = \{ x_1, \cdots, x_n \}$, we can write the joint distribution of the observations $y_{1:n}$ and the noise-free prediction $f^*$ at a test point $x^*$ as
$$\begin{bmatrix} \textbf{y} \\ f^* \end{bmatrix} \mid x^*, x_{1:n} \sim N(\tilde{\mu}, \tilde{\textbf{K}})$$
where $\tilde{\mu} = \begin{bmatrix} \mu_A \\ \mu(x^*) \end{bmatrix}$, $\tilde{\textbf{K}} = \begin{bmatrix} \textbf{K}_{AA} + \sigma_n^2 \textbf{I} & \textbf{k}_{x^*,A} \\ \textbf{k}_{x^*, A}^T & k(x^*, x^*) \end{bmatrix}$, $\textbf{k}_{x, A} = \begin{bmatrix} k(x, x_1) \\ \vdots \\ k(x, x_n) \end{bmatrix}$.
Deriving the conditional distribution, we obtain that the Gaussian process posterior is given by
$$f \mid x_{1:n}, y_{1:n} \sim \mathcal{GP}(\mu', k')$$
where
$\mu'(x) = \mu(x) + \textbf{k}_{x, A}^T (\textbf{K}_{AA} + \sigma_n^2 \textbf{I})^{-1} (\textbf{y}_A - \mu_A)$
$k'(x, x') = k(x, x') - \textbf{k}_{x, A}^T (\textbf{K}_{AA} + \sigma_n^2 \textbf{I})^{-1} \textbf{k}_{x', A}$.
Observe that, analogously to Bayesian linear regression, the posterior covariance can only decrease when conditioning on additional data, and is independent of the observations $y_i$.
The predictive posterior at the test point $x^*$ is
$$y^* \mid x^*, x_{1:n}, y_{1:n} \sim N(\mu^*, k^* + \sigma_n^2)$$
where $\mu^* = \mu'(x^*)$ and $k^* = k'(x^*, x^*)$.
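For concreteness, the following is a minimal NumPy sketch of these posterior formulas for a zero prior mean; the Gaussian (RBF) kernel, the noise level, and all function names are illustrative assumptions rather than part of the lecture material.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between row-wise point sets A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * lengthscale**2))

def gp_posterior(X_train, y_train, X_test, sigma_n=0.1, lengthscale=1.0):
    """Posterior mean vector and covariance matrix of a zero-mean GP at X_test."""
    K_AA = rbf_kernel(X_train, X_train, lengthscale)   # K_AA
    K_sA = rbf_kernel(X_test, X_train, lengthscale)    # rows are k_{x*,A}^T
    K_ss = rbf_kernel(X_test, X_test, lengthscale)     # k(x*, x*)
    # Factor (K_AA + sigma_n^2 I) once via Cholesky instead of forming an explicit inverse.
    L = np.linalg.cholesky(K_AA + sigma_n**2 * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu_post = K_sA @ alpha                             # mu'(x*) with zero prior mean
    V = np.linalg.solve(L, K_sA.T)
    k_post = K_ss - V.T @ V                            # k'(x*, x*)
    return mu_post, k_post

# Toy usage: five noisy observations of sin(x), predictions on a grid.
X_train = np.linspace(0, 5, 5).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.1 * np.random.randn(5)
X_test = np.linspace(0, 5, 50).reshape(-1, 1)
mu_star, k_star = gp_posterior(X_train, y_train, X_test)
```

Adding $\sigma_n^2$ to the diagonal of `k_post` then gives the predictive variance of $y^*$ from the formula above.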
Sampling¶
Often, we are not interested in the full predictive posterior distribution, but merely want to obtain samples of our Gaussian process model. We now briefly examine two approaches.
Approach 1
Consider a discretized subset of points
$$f = [f_1, \cdots, f_n]$$
that we want to sample. Note that $f \sim N(\mu, \textbf{K})$. We have already seen
$$f = \textbf{K}^{1/2}\epsilon + \mu$$
where $\textbf{K}^{1/2}$ is a square root of $\textbf{K}$ (for instance, its Cholesky factor) and $\epsilon \sim N(0, \textbf{I})$ is standard Gaussian noise. However, computing such a square root of $\textbf{K}$ takes $O(n^3)$ time.
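A minimal sketch of this approach, using the Cholesky factor as the matrix square root and a Gaussian kernel over a one-dimensional grid (both illustrative choices):

```python
import numpy as np

# Discretized inputs and a Gaussian kernel matrix over the grid (illustrative choices).
x = np.linspace(0, 5, 200).reshape(-1, 1)
K = np.exp(-(x - x.T)**2 / (2 * 1.0**2))
mu = np.zeros(len(x))

# f = K^{1/2} eps + mu, with the Cholesky factor as the square root (O(n^3)).
# A small jitter on the diagonal keeps the factorization numerically stable.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
eps = np.random.randn(len(x))
f_sample = L @ eps + mu
```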
Approach 2
Recall the product rule
$$p(f_1, \cdots, f_n) = \prod_{i = 1}^n p(f_i \mid f_{1:i - 1})$$
That is, the joint distribution factorizes neatly into a product where each factor only depends on the "outcomes" of the preceding factors. We can therefore obtain samples one-by-one, each time conditioning on one more observation:
$$f_1 \sim p(f_1)$$
$$f_2 \sim p(f_2 \mid f_1)$$
$$f_3 \sim p(f_3 \mid f_1, f_2)$$
This general approach is known as forward sampling. Due to the matrix inverse in the formula of the GP posterior, this approach also takes $O(n^3)$ time.
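A minimal sketch of forward sampling on a discretized grid; the incremental conditioning below is written for clarity rather than efficiency, since it re-solves a growing linear system at every step.

```python
import numpy as np

def forward_sample(mu, K, rng=np.random.default_rng()):
    """Sample f ~ N(mu, K) one coordinate at a time via p(f_i | f_{1:i-1})."""
    n = len(mu)
    f = np.zeros(n)
    for i in range(n):
        if i == 0:
            cond_mean, cond_var = mu[0], K[0, 0]
        else:
            K_prev = K[:i, :i] + 1e-8 * np.eye(i)          # jitter for numerical stability
            w = np.linalg.solve(K_prev, K[i, :i])
            cond_mean = mu[i] + w @ (f[:i] - mu[:i])       # Gaussian conditional mean
            cond_var = K[i, i] - w @ K[i, :i]              # Gaussian conditional variance
        f[i] = rng.normal(cond_mean, np.sqrt(max(cond_var, 0.0)))
    return f
```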
Kernel Functions¶
We have already seen that kernel functions define the "shape" of the functions realized from a Gaussian distribution over functions, namely a Gaussian process. Here, we generalize the notion of kernels to encompass function classes beyond linear and polynomial functions.
Definition of Kernel Function
A kernel function $k$ is defined as
$$k(x, x') = Cov[f(x), f(x')]$$
for locations $x$ and $x'$ such that
$k(x, x') = k(x', x)$ for any $x, x' \in \mathcal{X}$ (symmetry), and
$\textbf{K}_{AA}$ is positive semi-definite for any $A \subseteq \mathcal{X}$.
We say that a kernel function is positive definite if $\textbf{K}_{AA}$ is positive definite for any $A \subseteq \mathcal{X}$.
The two defining conditions ensure that for any $A \subseteq \mathcal{X}$, $\textbf{K}_{AA}$ is a valid covariance matrix. Indeed, it can be shown that a function $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ satisfying the above two conditions corresponds to a (not necessarily finite dimensional) feature representation in some feature space. The corresponding feature space is a reproducing kernel Hilbert space (RKHS).
Intuitively, the kernel function evaluated at locations $x$ and $x'$ describes how $f(x)$ and $f(x')$ are related. If $x$ and $x'$ are "close", then $f(x)$ and $f(x')$ are usually taken to be positively correlated, encoding a "smooth" function.
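As a quick sanity check, the two defining conditions can be verified numerically on a finite set of points. The sketch below does this for a Gaussian kernel (an illustrative choice) by testing symmetry of $\textbf{K}_{AA}$ and the sign of its eigenvalues.

```python
import numpy as np

def gaussian_kernel(x, x_prime, lengthscale=1.0):
    return np.exp(-np.sum((x - x_prime)**2) / (2 * lengthscale**2))

# Build K_AA for a random finite set A and check the two kernel conditions.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
K_AA = np.array([[gaussian_kernel(a, b) for b in A] for a in A])

assert np.allclose(K_AA, K_AA.T)                    # symmetry
assert np.all(np.linalg.eigvalsh(K_AA) >= -1e-10)   # positive semi-definite up to rounding
```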
Common Kernels¶
We will now define some commonly used kernels. Often an additional factor $\sigma^2$ (output scale) is added, which we assume to be 1 for simplicity.
Linear kernel¶
The linear kernel is defined as
$$k(x, x'; \Phi) = \Phi(x)^T \Phi(x')$$
where $\Phi$ is a (possibly nonlinear) feature transformation; for $\Phi(x) = x$, this reduces to the standard linear kernel $k(x, x') = x^T x'$.
Remark
GPs with linear kernel and BLR
A Gaussian process with a linear kernel is equivalent to Bayesian linear regression. This follows directly from the function-space view of Bayesian linear regression and comparing the derived kernel function with the definition of the linear kernel.
Gaussian kernel¶
The Gaussian kernel (also known as squared exponential kernel or radial basis function (RBF) kernel) is defined as
$$k(x, x'; l) = \exp \left( -\frac{||x - x'||_2^2}{2l^2} \right)$$
where $l$ is the length scale. The larger the length scale $l$, the smoother the resulting functions. Furthermore, it turns out that the feature space corresponding to the Gaussian kernel is infinite-dimensional.
The figure above shows the functions sampled according to a Gaussian process with a Gaussian kernel and length scales $l = 5$ on the left and $l = 1$ on the right.
Laplace kernel¶
The Laplace kernel (also known as the exponential kernel) is defined as
$$k(x, x'; l) = \exp \left( - \frac{||x - x'||_1}{l} \right)$$
As can be seen from the figure below, samples from a GP with Laplace kernel are non-smooth as opposed to the samples from a GP with Gaussian kernel. The left side has length scale $l = 10000$ and the right side has length scale $l = 10$.
Matern kernel¶
The Matern kernel trades off the smoothness of the Gaussian and the Laplace kernels. It is defined as
$$k(x, x'; \nu, l) = \frac{2^{1 - \nu}}{\Gamma (\nu)} \left( \frac{\sqrt{2\nu} ||x - x'||_2}{l} \right)^\nu K_\nu \left( \frac{\sqrt{2\nu} ||x - x'||_2}{l} \right)$$
where $\Gamma$ is the Gamma function, $K_\nu$ is the modified Bessel function of the second kind, and $l$ is a length scale parameter. The resulting functions are $\lceil \nu \rceil - 1$ times mean square differentiable. For $\nu = \frac{1}{2}$, the Matern kernel is equivalent to the Laplace kernel. For $\nu \rightarrow \infty$, the Matern kernel is equivalent to the Gaussian kernel. In particular, GPs with a Gaussian kernel are infinitely many times mean square differentiable, whereas GPs with a Laplace kernel are mean square continuous but not mean square differentiable.
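The kernels above can be implemented in a few lines; the sketch below writes each one as a function of the relevant distance $r$ (the 1-norm for the Laplace kernel, the 2-norm otherwise), with SciPy supplying the Gamma and Bessel functions for the Matern kernel. Parameter names and defaults are illustrative.

```python
import numpy as np
from scipy.special import gamma, kv   # Gamma function and modified Bessel function of the 2nd kind

def gaussian_kernel(r, l=1.0):
    return np.exp(-r**2 / (2 * l**2))

def laplace_kernel(r, l=1.0):
    return np.exp(-r / l)

def matern_kernel(r, nu=1.5, l=1.0):
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2 * nu) * r / l
    out = (2**(1 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    # kv diverges at 0, so replace the resulting NaN with the limiting value k(x, x) = 1.
    return np.where(r == 0.0, 1.0, out)
```

For $\nu = 0.5$ and $\nu \rightarrow \infty$, `matern_kernel` approaches `laplace_kernel` and `gaussian_kernel`, respectively, matching the limits stated above.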
Kernel Composition¶
Given kernels $k_1 : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ and $k_2 : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$, they can be composed to obtain a new kernel $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ in the following ways:
$k (x, x') = k_1(x, x') + k_2(x, x')$
$k (x, x') = k_1(x, x') \cdot k_2(x, x')$
$k (x, x') = c \cdot k_1(x, x')$ for any $c > 0$
$k (x, x') = f(k_1(x, x'))$ for any polynomial $f$ with positive coefficients or $f = \exp$.
For example, the additive structure of a function $f(x) = f_1(x) + f_2(x)$ can be easily encoded in GP models. Suppose that $f_1 \sim \mathcal{GP}(\mu_1, k_1)$ and $f_2 \sim \mathcal{GP}(\mu_2, k_2)$; then the sum of those two functions is again a GP, $f = f_1 + f_2 \sim \mathcal{GP}(\mu_1 + \mu_2, k_1 + k_2)$.
Whereas the addition of two kernels $k_1$ and $k_2$ can be thought of as an OR operation (i.e., the kernel has high value if either $k_1$ or $k_2$ has high value), the multiplication of $k_1$ and $k_2$ can be thought of as an AND operation (i.e., the kernel has high value if both $k_1$ and $k_2$ have high value). For example, the product of two linear kernels results in functions which are quadratic.
As mentioned previously, the constant $c$ of a scaled kernel function $k'(x, x') = c \times k(x, x')$ is generally called the output scale of a kernel, and it scales the variance $Var[f(x)] = c \times k(x, x)$ of the predictions $f(x)$ from $\mathcal{GP}(\mu, k')$.
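A minimal sketch of these composition rules as higher-order functions; the specific kernels being combined are illustrative.

```python
import numpy as np

# Compose kernels point-wise; each result is again a valid kernel.
def sum_kernel(k1, k2):
    return lambda x, xp: k1(x, xp) + k2(x, xp)

def product_kernel(k1, k2):
    return lambda x, xp: k1(x, xp) * k2(x, xp)

def scaled_kernel(k, c):
    assert c > 0                       # the output scale must be positive
    return lambda x, xp: c * k(x, xp)

# Example: AND-combining a linear kernel with itself yields quadratic functions.
linear = lambda x, xp: float(np.dot(x, xp))
quadratic = product_kernel(linear, linear)
```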
Stationary and Isotropy¶
Kernel functions are commonly classified according to two properties.
Definition of stationary and isotropy:
A kernel $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is called
stationary (or shift-invariant) if there exists a function $\tilde{k}$ such that $k(x, x') = \tilde{k}(x - x')$,
isotropic if there exists a function $\tilde{k}$ such that $k(x, x') = \tilde{k}(||x - x'||_2)$.
Note that stationarity is a necessary condition for isotropy. In other words, isotropy implies stationarity.
| Kernel | Stationary | Isotropic |
|---|---|---|
| Linear kernel | no | no |
| Gaussian kernel | yes | yes |
| $k(x, x') = \exp(-||x - x'||_M^2)$ where $M$ is positive semi-definite | yes | no |
For $x' = x$, stationarity implies that the kernel value $k(x, x) = \tilde{k}(0)$ is constant; more generally, a stationary kernel must depend on relative locations only. This is clearly not the case for the linear kernel, which depends on the absolute locations of $x$ and $x'$. Since isotropy implies stationarity, the linear kernel cannot be isotropic either.
For the Gaussian kernel, isotropy follows immediately from its definition.
The last kernel is clearly stationary by definition, but not isotropic for general matrices $M$. Note that for $M = I$ it is indeed isotropic.
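The classification in the table can also be checked numerically: shifting both inputs by the same vector should leave a stationary kernel unchanged, and rotating both inputs should leave an isotropic kernel unchanged. A small sketch of such a check (the test points, shift, and rotation angle are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x, xp, shift = rng.normal(size=(3, 2))   # two test points and a shift vector in R^2

linear = lambda a, b: a @ b
gaussian = lambda a, b: np.exp(-np.sum((a - b)**2) / 2)

# Stationarity check: k(x + s, x' + s) should equal k(x, x').
print(np.isclose(gaussian(x + shift, xp + shift), gaussian(x, xp)))   # True
print(np.isclose(linear(x + shift, xp + shift), linear(x, xp)))       # generally False

# Isotropy check: k(Rx, Rx') should equal k(x, x') for a rotation R.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.isclose(gaussian(R @ x, R @ xp), gaussian(x, xp)))           # True
```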
Making Predictions with GPs¶
Suppose $f \sim \mathcal{GP}(\mu, k)$ and that for a set $A = \{ \textbf{x}_1, \cdots, \textbf{x}_m \}$ we observe $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
Then $p(f \mid \textbf{x}_1, \cdots, \textbf{x}_m, y_1, \cdots, y_m) = \mathcal{GP}(f; \mu', k')$ where
$\mu'(x) = \mu(x) + \textbf{k}_{x, A}^T (\textbf{K}_{AA} + \sigma^{2} \textbf{I})^{-1} (\textbf{y}_A - \mu_A)$
$k'(x, x') = k(x, x') - \textbf{k}_{x, A}^T (\textbf{K}_{AA} + \sigma^2 \textbf{I})^{-1} \textbf{k}_{x', A}$
$\textbf{k}_{x, A} = \left[ k(x, x_1), \cdots, k(x, x_m) \right]^T$
Equivalently, we can have
$$p(f^* \mid x^*, x_{1:m}, y_{1:m}) = N(f^* \mid \mu'(x^*), k'(x^*, x^*))$$
Note that the posterior covariance $k'$ does not depend on $\textbf{y}_A$.
Applications¶
Spatial prediction¶
Protein Engineering¶
Outlook: GPs on $X \neq \mathbb{R}^n$¶
For example, consider $X$ to be the node set of a road network graph.
Solution #1
We can embed $ X $ into $ \mathbb{R}^d $ with
$$ \xi(x) = (\text{latitude}(x), \text{longitude}(x)) $$
and use $ k(x, x') = \tilde{k}(||\xi(x) - \xi(x')||)$ with $\tilde{k}: \mathbb{R} \rightarrow \mathbb{R}$.
Problem:
The travel distance can be large even if the nodes are close to each other in terms of $||\xi(x) - \xi(x')||$. For example, two nodes representing opposite driving directions at the same location are close in this distance yet far apart on the road network; a traffic jam can occur on only one side.
Solution #2
Use $k(x, x') = \tilde{k}(d(x, x'))$ with $\tilde{k} : \mathbb{R} \rightarrow \mathbb{R}$ and $d(x, x')$ being the shortest path distance on the graph.
Problem:
$k$ will typically fail to be positive semi-definite. For example, $\tilde{k}(d(x, x')) = e^{-\frac{d(x, x')^2}{h^2}}$ is positive semi-definite for all $h > 0$ if and only if $X$ can be isometrically embedded into $\mathbb{R}^n$ for high enough $n$, or into an infinite-dimensional Hilbert space.
Solution #3
Laplacian-based generalizations of the squared exponential and Matern kernels.
Properties:
Always positive semi-definite.
Automatically respect the intrinsic geometry of $X$.
For $|X| > 10^5$, they require non-trivial tricks / compute.
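As an illustration of this idea (not necessarily the exact construction from the lecture), the sketch below builds a diffusion (heat) kernel $\exp(-tL)$ from the graph Laplacian of a small graph; this kernel is often viewed as the graph analogue of the squared exponential kernel and is symmetric positive definite by construction. The 4-node graph and the value of $t$ are illustrative.

```python
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a small undirected graph (an illustrative 4-node cycle).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

L = np.diag(A.sum(axis=1)) - A   # graph Laplacian L = D - A
t = 0.5                          # diffusion time, playing the role of a length scale
K = expm(-t * L)                 # diffusion (heat) kernel on the graph nodes

# L is symmetric positive semi-definite, hence K = exp(-tL) is symmetric positive definite.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > 0)
```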
Common convention: Prior mean 0¶
Suppose $f \sim \mathcal{GP}(\mu, k)$ and define $g(x) = f(x) - \mu(x)$. Then
$$g \sim \mathcal{GP}(0, k), \quad f(x) = g(x) + \mu(x).$$