263-5210-00: Probabilistic Artificial Intelligence
Section 2
Bayesian Linear Regression
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/07/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from the course taught by Professor Andreas Krause. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Basic Supervised Learning Pipeline¶
In supervised learning, the goal is to train a model using labeled data so it can make predictions on unseen data. This process involves several key steps, as outlined below:
Training Data (Labeled Examples)
The training data is crucial because it provides the model with examples to learn from. Each example consists of an input (e.g., an email) and its corresponding label (e.g., "spam" or "ham"). It typically comes from real-world applications; for instance, spam classification relies on a dataset of previously classified emails. The quality and size of the training data are key factors: more diverse and well-labeled data generally lead to better models that can generalize to unseen data.
Learning Method (Model Training)
The learning method is the algorithm used to train the model. This stage involves understanding the relationship between inputs and their labels. Various algorithms can be used, such as decision trees, neural networks, or support vector machines, depending on the data's complexity and nature. During training, the model attempts to minimize error, often through optimization techniques like minimizing loss functions in neural networks.
Classifier (The Model)
The classifier is the result of the training process: the trained model capable of making predictions. It captures the relationship learned from the training data and can classify new, unseen inputs based on this knowledge. Its quality depends on how well it learned from the training data without overfitting (learning too specifically) or underfitting (failing to learn the data's patterns).
Test Data (Unseen Data for Predictions)
Once trained, the model is tested on new, unlabeled data called test data. This simulates real-world scenarios where labels are unknown and the model must predict them. It evaluates how well the model generalizes to new, unseen data, ensuring it isn't just memorizing the training data but learning applicable patterns.
Prediction/Generalization
After training, the model uses the learned function to predict labels for new inputs, a step referred to as prediction. The ability to generalize well to unseen data is the true measure of the model's success: a good model performs accurately not just on training data but also on new data. The key challenge is balancing underfitting (a model that is too simple) against overfitting (a model that is too complex and too specific to the training data).
Linear Regression (with L2 Regularization) Review¶
In general, we can write the linear model as
$$\textbf{y} = \mathbf{X} \mathbf{w}$$
where
$\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1}$ is the response vector of size $n \times 1$, where each $y_i$ represents the predicted value for the $i$-th data point.
$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}_{n \times d}$ is the design matrix (or feature matrix) of size $n \times d$, where each row corresponds to an individual observation, and each column corresponds to a feature. Each $x_{ij}$ represents the value of the $j$-th feature for the $i$-th data point.
$\mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}_{d \times 1}$ is the weight vector (also called the coefficient vector) of size $d \times 1$, where each $w_j$ represents the weight associated with the $j$-th feature in the linear model.
Let us recall the L2 (ridge) regularization on linear regression.
To obtain the model weights $\hat{\textbf{w}}$, we minimize the objective function:
$$ \begin{align} &\text{argmin}_{\textbf{w}} \sum_{i = 1}^n (y_i - \textbf{w}^T x_i)^2 + \lambda ||\textbf{w}||_2^2 \\ &\Rightarrow \text{argmin}_{\textbf{w}} (\textbf{y} - \textbf{Xw})^T (\textbf{y} - \textbf{Xw}) + \lambda \textbf{w}^T \textbf{w} \\ &\Rightarrow \frac{\partial}{\partial \textbf{w}} \left( \textbf{y}^T \textbf{y} - 2\textbf{y}^T \textbf{Xw} + \textbf{w}^T \textbf{X}^T \textbf{Xw} + \lambda \textbf{w}^T \textbf{w} \right) = - 2\textbf{X}^T \textbf{y} + 2 \textbf{X}^T \textbf{Xw} + 2\lambda \textbf{w} = 0 \\ &\Rightarrow \textbf{X}^T \textbf{Xw} + \lambda \textbf{w} = \textbf{X}^T \textbf{y} \\ &\Rightarrow (\textbf{X}^T \textbf{X} + \lambda \textbf{I}) \textbf{w} = \textbf{X}^T \textbf{y} \\ \hat{\textbf{w}} &= (\textbf{X}^T \textbf{X} + \lambda \textbf{I})^{-1} \textbf{X}^T \textbf{y} \end{align} $$
This is the analytical solution. In practice, however, we often use gradient descent to compute $\hat{\textbf{w}}$, especially when $d$ is large.
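As a quick illustration, here is a minimal NumPy sketch of the closed-form solution above (the helper name, toy data, and choice of $\lambda$ are assumptions for illustration only):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data (assumed for illustration): y = 2*x1 - x2 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

print(ridge_closed_form(X, y, lam=1.0))  # close to [2, -1]
```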
Maximum Likelihood Estimation¶
Since our function class comprises linear functions of the form $\textbf{w}^T\textbf{x}$, the observation model can be simplified to
$$y_i = \textbf{w}^{*T}x_i + \epsilon_i$$
for some true weight vector $\textbf{w}^*$. We also assume that $\epsilon_i \sim N(0, \sigma_n^2)$ is homoscedastic Gaussian noise. This observation model is equivalently characterized by the Gaussian likelihood,
$$y_i \mid x_i, \textbf{w} \sim N(\textbf{w}^T x_i, \sigma_n^2).$$
Based on this likelihood we can compute the MLE of the weights
$$\hat{\textbf{w}}_{MLE} = \text{argmax}_{\textbf{w} \in \mathbb{R}^d} \sum_{i = 1}^n \log p(y_i \mid x_i, \textbf{w}) = \text{argmin}_{\textbf{w}\in \mathbb{R}^d} \sum_{i = 1}^n (y_i - \textbf{w}^T x_i)^2.$$
In practice, the noise variance $\sigma_n^2$ is typically unknown and also has to be determined, for example, through maximum likelihood estimation.
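Since the MLE of the weights coincides with ordinary least squares, it can be computed with a standard least-squares solver. The sketch below uses synthetic data (all values are assumptions) and also estimates the noise variance from the residuals:

```python
import numpy as np

# The MLE of the weights is ordinary least squares; np.linalg.lstsq solves it
# directly. The noise variance can then be estimated from the residuals.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, 0.5, -2.0])           # assumed ground truth
y = X @ w_true + 0.3 * rng.normal(size=100)

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma_n2_hat = np.mean((y - X @ w_mle) ** 2)  # MLE of the noise variance
print(w_mle, sigma_n2_hat)
```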
Weight-space View¶
The most immediate and natural probabilistic interpretation of linear regression is to quantify uncertainty about the weights $\textbf{w}$. Recall that probabilistic inference requires the specification of a generative model comprised of a prior and a likelihood. We will use a Gaussian prior,
$$\textbf{w} \sim N(0, \sigma_p^2 \textbf{I})$$
and the Gaussian likelihood.
Remark:
Why a Gaussian prior?
The choice of a Gaussian prior may seem somewhat arbitrary at first sight, except perhaps for the nice analytical properties of Gaussians that we have seen before. The maximum entropy principle provides a more fundamental justification: it turns out that the Gaussian has maximum entropy among all distributions on $\mathbb{R}^d$ with given mean and variance.
Next, let us derive the posterior distribution over the weights.
$$ \begin{align} &\log p(\textbf{w} \mid x_{1:n}, y_{1:n}) \\ = & \log p(\textbf{w}) + \log p(y_{1:n} \mid x_{1:n}, \textbf{w}) + \text{const}~~~~~\text{(Bayes' rule)}\\ = & \log p(\textbf{w}) + \sum_{i = 1}^n \log p(y_i \mid x_i, \textbf{w}) + \text{const}~~~~~\text{(independence)} \\ = & -\frac{1}{2} \left[ \sigma_p^{-2} ||\textbf{w}||_2^2 + \sigma_n^{-2} \sum_{i = 1}^n (y_i - \textbf{w}^T x_i)^2 \right] + \text{const}~~~~~\text{(Gaussian prior and likelihood)}\\ = & -\frac{1}{2} \left[ \sigma_p^{-2} ||\textbf{w}||_2^2 + \sigma_n^{-2} ||\textbf{y} - \textbf{Xw}||_2^2 \right] + \text{const}\\ = & -\frac{1}{2} \left[ \sigma_p^{-2} \textbf{w}^T \textbf{w} + \sigma_n^{-2} \left( \textbf{w}^T \textbf{X}^T \textbf{X} \textbf{w} - 2 \textbf{y}^T \textbf{X} \textbf{w} + \textbf{y}^T \textbf{y} \right) \right] + \text{const}\\ = & -\frac{1}{2} \left[ \textbf{w}^T (\sigma_n^{-2} \textbf{X}^T \textbf{X} + \sigma_p^{-2} \textbf{I}) \textbf{w} - 2 \sigma_n^{-2} \textbf{y}^T \textbf{Xw} \right] + \text{const} \end{align} $$
Observe that the log-posterior is a quadratic form in $\textbf{w}$, so the posterior distribution must be Gaussian
$$\textbf{w} \mid x_{1:n}, y_{1:n} \sim N(\mu, \Sigma)$$
where we can read off the covariance and mean to be
$$\Sigma = (\sigma_n^{-2} \textbf{X}^T \textbf{X} + \sigma_p^{-2} \textbf{I})^{-1}$$
$$\mu = \sigma_n^{-2} \Sigma \textbf{X}^T \textbf{y}$$
This also shows that Gaussians with known variance and a linear likelihood are self-conjugate. It can be shown more generally that Gaussians with known variance are self-conjugate with respect to any Gaussian likelihood. For other generative models, the posterior can typically not be expressed in closed form; this is a very special property of Gaussians.
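The posterior parameters $\Sigma$ and $\mu$ can be computed directly from the formulas above. A minimal sketch (the function name and toy data are my own):

```python
import numpy as np

def blr_posterior(X, y, sigma_n, sigma_p):
    """Posterior N(mu, Sigma) over the weights of Bayesian linear regression."""
    d = X.shape[1]
    precision = X.T @ X / sigma_n**2 + np.eye(d) / sigma_p**2  # Sigma^{-1}
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ X.T @ y / sigma_n**2
    return mu, Sigma

# Toy usage with assumed data and hyperparameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.5, -0.5]) + 0.2 * rng.normal(size=40)
mu, Sigma = blr_posterior(X, y, sigma_n=0.2, sigma_p=1.0)
print(mu)      # posterior mean, close to [1.5, -0.5]
print(Sigma)   # posterior covariance (uncertainty about w)
```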
Maximum a Posteriori Estimation¶
Computing the MAP estimate for the weights
$$ \begin{align} \hat{\textbf{w}}_{MAP} &= \text{argmax}_{\textbf{w}} \log p(y_{1:n} \mid x_{1:n}, \textbf{w}) + \log p(\textbf{w}) \\ &= \text{argmin}_{\textbf{w}} ||\textbf{y} - \textbf{Xw}||_2^2 + \frac{\sigma_n^2}{\sigma_p^2} ||\textbf{w}||_2^2 \end{align} $$
we observe that this is identical to ridge regression with weight decay $\lambda = \sigma_n^2 / \sigma_p^2$: $\hat{\textbf{w}}_{MAP} = \hat{\textbf{w}}_{ridge}$. The equation is simply the MLE loss with an additional $l_2$-regularization (originating from the prior) that encourages keeping weights small. Recall that the MAP estimate corresponds to the mode of the posterior distribution, which in the case of a Gaussian is simply its mean $\mu$. As to be expected, $\mu$ coincides with the analytical solution to ridge regression.
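To make the correspondence concrete, the short check below (with assumed toy data and noise levels) verifies numerically that the posterior mean $\mu$ equals the ridge solution with $\lambda = \sigma_n^2 / \sigma_p^2$:

```python
import numpy as np

# Numerical check that the posterior mean equals the ridge solution with
# lambda = sigma_n^2 / sigma_p^2 (toy data and noise levels are assumptions).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -1.0]) + 0.2 * rng.normal(size=30)
sigma_n, sigma_p = 0.2, 1.0

lam = sigma_n**2 / sigma_p**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

Sigma = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(2) / sigma_p**2)
mu = Sigma @ X.T @ y / sigma_n**2

print(np.allclose(w_ridge, mu))  # True
```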
Another good example is Lasso as the MAP estimate with a Laplace prior.
Level sets of Gaussian and Laplace regularization. It can be seen that Laplace regularization is more effective in encouraging sparse solutions, that is, solutions where many components are set to exactly 0.
A commonly used alternative to ridge regression is the least absolute shrinkage and selection operator (or lasso), which regularizes with the $l_1$-norm:
$$\hat{\textbf{w}}_{lasso} = \text{argmin}_{\textbf{w} \in \mathbb{R}^d} ||\textbf{y} - \textbf{Xw}||_2^2 + \lambda ||\textbf{w}||_1$$
It turns out that lasso can also be viewed as probabilistic inference, using a Laplace prior $\textbf{w} \sim \text{Laplace}(0, l)$ with length scale $l$ instead of a Gaussian prior.
Computing the MAP estimate for the weights yields
$$ \begin{align} \hat{\textbf{w}}_{MAP} &= \text{argmax}_{\textbf{w}} \log p(y_{1:n} \mid x_{1:n}, \textbf{w}) + \log p(\textbf{w}) \\ &= \text{argmin}_{\textbf{w}} \sum_{i = 1}^n (y_i - \textbf{w}^T x_i)^2 + \frac{\sigma_n^2}{l} ||w||_1 \end{align} $$
which coincides with the lasso with weight decay $\lambda = \sigma_n^2 / l$.
To make predictions at a test point $\textbf{x}^*$, we define the model prediction $f^* = \hat{\textbf{w}}_{MAP}^T \textbf{x}^*$ and obtain the label prediction
$$y^* \mid \textbf{x}^*, x_{1:n}, y_{1:n} \sim N(f^*, \sigma_n^2)$$
Here we observe that using point estimates such as the MAP estimate does not quantify uncertainty in the weights. The MAP estimate simply collapses all mass of the posterior around its mode. This can be harmful when we are highly unsure about the best model, for example because we have observed insufficient data.
Probabilistic Inference¶
Rather than selecting a single weight vector $\hat{\textbf{w}}$ to make predictions, we can use the full posterior distribution. This is known as Bayesian linear regression (BLR) and illustrated with an example in the following figure.
Comparison of linear regression (MLE), ridge regression (MAP estimate), and Bayesian linear regression when the data is generated according to $y \mid \textbf{w}, \textbf{x} \sim N(\textbf{w}^T \textbf{x}, \sigma_n^2)$. The true mean is shown in black, the MLE in blue, and the MAP estimate in red. The dark gray area denotes the epistemic uncertainty of Bayesian linear regression and the light gray area is the additional homoscedastic noise. On the left, $\sigma_n = 0.15$, while on the right $\sigma_n = 0.7$.
To make predictions at a test point $x^*$, we let $f^* = \textbf{w}^T \textbf{x}^*$ which has the distribution
$$f^* \mid \textbf{x}^*, \textbf{x}_{1:n}, y_{1:n} \sim N(\mu^T \textbf{x}^*, \textbf{x}^{*T} \Sigma\textbf{x}^*).$$
Note that this does not take into account the noise in the labels $\sigma_n^2$. For the label prediction $y^*$, we obtain
$$y^* \mid \textbf{x}^*, \textbf{x}_{1:n}, y_{1:n} \sim N(\mu^T \textbf{x}^*, \textbf{x}^{*T}\Sigma \textbf{x}^* + \sigma_n^2).$$
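Given the posterior parameters $\mu$ and $\Sigma$, the predictive distribution at a test point can be evaluated as follows; this is a sketch (the helper name is my own) that can be used together with the posterior computation from earlier:

```python
import numpy as np

def blr_predict(x_star, mu, Sigma, sigma_n):
    """Predictive distribution at a test point x_star.

    Returns the mean of f* = w^T x*, the variance of f* (epistemic only),
    and the variance of the label y* (epistemic plus noise).
    """
    f_mean = mu @ x_star
    f_var = x_star @ Sigma @ x_star   # x*^T Sigma x*
    y_var = f_var + sigma_n**2        # add the label noise sigma_n^2
    return f_mean, f_var, y_var
```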
Recursive Probabilistic Inference¶
For Bayesian linear regression with a Gaussian prior and likelihood, the recursive-update principle described below can be used to derive an efficient algorithm, since the posterior is again Gaussian,
$$p^{(t)}(\textbf{w}) = N(\textbf{w}; \mu^{(t)}, \Sigma^{(t)}),$$
which can be stored efficiently using only $O(d^2)$ parameters. This leads to an efficient algorithm for Bayesian linear regression whose memory complexity is independent of the number of observations, $O(d^2)$, and whose per-round complexity is $O(d^2)$. The interpretation of Bayesian linear regression as an online algorithm also highlights similarities to other sequential models such as Kalman filters.
"Today's posterior is tomorrow's prior."
Suppose $P(\textbf{w})$ is the prior and we observe $y_{1:n}$ with likelihood $P(y_{1:n} \mid \textbf{w}) = \prod_{i = 1}^n P(y_i \mid \textbf{w}, x_i)$.
Define $p^{(j)}(\textbf{w}) = P(\textbf{w} \mid y_{1:j})$ as the posterior given the first $j$ observations, with $p^{(0)}(\textbf{w}) = P(\textbf{w})$.
Suppose we have computed $p^{(j - 1)}(\textbf{w})$ and observe $y_j$
$$ \begin{align} p^{(j)}(\textbf{w}) = P(\textbf{w} \mid y_{1:j}) &= \frac{1}{Z} P(\textbf{w}) \cdot P(y_1 \mid \textbf{w}, x_1) \cdots P(y_{j - 1} \mid \textbf{w}, x_{j-1}) \cdot P(y_j \mid \textbf{w}, x_j) \\ &\propto P(\textbf{w} \mid y_{1:j-1}) \cdot P(y_j \mid \textbf{w}, x_j) \\ &= p^{(j - 1)}(\textbf{w}) \cdot P(y_j \mid \textbf{w}, x_j) \\ &= f\left(p^{(j-1)}(\textbf{w}), y_j\right) \end{align} $$
As we can see, the posterior can be updated recursively: each new posterior depends only on the previous posterior and the latest observation.
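A sketch of the recursive update in code (function name and parameterization are my own choices): tracking the precision matrix and the precision-weighted mean turns each new observation into a simple rank-one update, and the final result matches the batch posterior.

```python
import numpy as np

def blr_recursive(X, y, sigma_n, sigma_p):
    """Update the Gaussian posterior over w one observation at a time.

    We track the natural parameters (precision matrix and precision-weighted
    mean), so that each observation amounts to a rank-one update.
    """
    d = X.shape[1]
    Lam = np.eye(d) / sigma_p**2   # prior precision
    nu = np.zeros(d)               # precision-weighted mean (prior mean = 0)
    for x_j, y_j in zip(X, y):
        # "Today's posterior is tomorrow's prior."
        Lam = Lam + np.outer(x_j, x_j) / sigma_n**2
        nu = nu + x_j * y_j / sigma_n**2
    Sigma = np.linalg.inv(Lam)
    mu = Sigma @ nu
    return mu, Sigma  # matches the batch posterior computed earlier
```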
Aleatoric and Epistemic Uncertainty¶
Recall the following equation
$$y^* \mid \textbf{x}^*, \textbf{x}_{1:n}, y_{1:n} \sim N(\mu^T \textbf{x}^*, \textbf{x}^{*T}\Sigma \textbf{x}^* + \sigma_n^2).$$
The predictive posterior distribution from the equation above highlights a decomposition of uncertainty wherein $\textbf{x}^{*T} \Sigma \textbf{x}^*$ corresponds to the uncertainty about our model due to the lack of data (commonly referred to as the epistemic uncertainty) and $\sigma_n^2$ corresponds to the uncertainty about the labels that cannot be explained by the inputs and any model from the model class (commonly referred to as the aleatoric uncertainty, "irreducible noise", or simply "label noise").
It is a practical modeling choice how much inaccuracy to attribute to epistemic or aleatoric uncertainty. Generally, when a poor model is used to explain a process, more inaccuracy has to be attributed to irreducible noise. For example, when a linear model is used to "explain" a non-linear process, most uncertainty is aleatoric as the model cannot explain the data well. As we use more expressive models, a larger portion of the uncertainty can be explained by the inputs.
Epistemic and aleatoric uncertainty can be formally defined in terms of the law of total variance
$$ Var[y^* \mid x^*] = \underbrace{ \mathbb{E}_{\theta}\left[ Var_{y^*} [y^* \mid x^*, \theta] \right] }_{\text{aleatoric uncertainty}} + \underbrace{ Var_{\theta} \left[ \mathbb{E}_{y^*} [y^* \mid x^*, \theta] \right] }_{\text{epistemic uncertainty}} $$
Here, the variability of the predictions $y^*$ averaged across all models $\theta$ is the aleatoric uncertainty. In contrast, the variability of the mean prediction of $y^*$ across models $\theta$ is the epistemic uncertainty.
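This decomposition can be checked numerically for Bayesian linear regression by sampling weights from the posterior: the variance of the resulting mean predictions recovers the epistemic term $\textbf{x}^{*T} \Sigma \textbf{x}^*$, while the label noise $\sigma_n^2$ is the aleatoric term. The posterior parameters below are placeholders, not values from the course:

```python
import numpy as np

# Monte Carlo check of the decomposition for Bayesian linear regression.
# The posterior parameters, noise level, and test point are placeholders;
# in practice they would come from the posterior computed earlier.
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Sigma = np.array([[0.05, 0.01], [0.01, 0.08]])
sigma_n = 0.3
x_star = np.array([0.7, 1.2])

w_samples = rng.multivariate_normal(mu, Sigma, size=100_000)
mean_preds = w_samples @ x_star          # E[y* | x*, theta] for each sampled theta

epistemic_mc = mean_preds.var()          # Var_theta[ E[y* | x*, theta] ]
aleatoric = sigma_n**2                   # E_theta[ Var[y* | x*, theta] ] (constant here)
epistemic_exact = x_star @ Sigma @ x_star

print(epistemic_mc, epistemic_exact, aleatoric)
```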
Ridge Regression as Bayesian Inference¶
Let us assume the Gaussian prior
$$p(\textbf{w}) = N(\textbf{w}; 0, \sigma_p^2 \textbf{I})$$
i.e., $w_i \sim N(0, \sigma_p^2)$ independently, and $\textbf{w}$ is independent of $x_{1:n}$.
With the assumption that the $y_{1:n}$ are conditionally independent given $\textbf{w}$ and $x_{1:n}$, the conditional likelihood factorizes as
$$p(y_{1:n} \mid \textbf{w}, x_{1:n}) = \prod_{i = 1}^n p(y_i \mid \textbf{w}, x_i)$$
In particular, $p(y_i \mid \textbf{w}, x_i) = N(y_i; \textbf{w}^T x_i, \sigma_n^2)$, which assumes $y_i = \textbf{w}^T x_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma_n^2)$.
Bayesian Decision Theory¶
Given
Conditional distribution over labels $P(y \mid \textbf{x})$ with label space $Y \subseteq \mathbb{R}$
Set of actions (decisions): $A \subseteq \mathbb{R}$
Cost function: $C: Y \times A \rightarrow \mathbb{R}$
Bayesian decision theory recommends to pick the action that minimizes the expected cost
$$a^* = \text{argmin}_{a \in A} \mathbb{E}_y [C(y, a) \mid \textbf{x}]$$
Optimal decisions for squared loss¶
Given
Estimated conditional distribution $P(y \mid x) = N(y; \mu_{Y|X = x}, \sigma_{Y|X}^2)$
Set of actions $A = \mathbb{R}$
Cost function: $C(y, a) = (y - a)^2$
Then the action that minimizes the expected cost
$$a^* = \text{argmin}_{a \in A} \mathbb{E}_y [C(y, a) \mid \textbf{x}]$$
is the conditional mean:
$$a^* = \mathbb{E} [Y \mid X = x] = \int y \, p(y \mid x) \, dy = \mu_{Y \mid X = x}$$
Asymmetric cost for regression¶
Given
Estimated conditional distribution $P(y \mid x) = N(y; \mu_{Y|X = x}, \sigma_{Y \mid X}^2)$
Set of actions $A = \mathbb{R}$
Cost function $C(y, a) = c_1 \text{max}(y - a, 0) + c_2 \text{max}(a - y, 0)$, where $\text{max}(y - a, 0)$ is the underestimation error and $\text{max}(a - y, 0)$ is the overestimation error.
Then the action that minimizes the expected cost
$$a^* = \text{argmin}_{a \in A} \mathbb{E}_y [C(y, a) \mid \textbf{x}]$$
is
$$a^* = \mu_{Y \mid X = x} + \sigma_{Y \mid X} \Phi^{-1} \left( \frac{c_1}{c_1 + c_2} \right).$$
Here $\Phi$ is the CDF of the standard normal distribution, whose shape resembles a sigmoid. If underestimation is more costly ($c_1 > c_2$), the optimal action is shifted above the predictive mean, and vice versa.
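A minimal numeric sketch of this decision rule (the predictive parameters and costs are assumed values):

```python
import numpy as np
from scipy.stats import norm

# Optimal action under the asymmetric cost: a* = mu + sigma * Phi^{-1}(c1 / (c1 + c2)).
# The predictive parameters and costs below are assumed values.
mu_pred, sigma_pred = 2.0, 0.5
c1, c2 = 5.0, 1.0   # underestimation is five times as costly as overestimation

a_star = mu_pred + sigma_pred * norm.ppf(c1 / (c1 + c2))
print(a_star)       # larger than mu_pred: we hedge against costly underestimation
```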
Choosing hyperparameters¶
In Bayesian linear regression, we need to specify the prior variance $\sigma_p^2$ and the noise variance $\sigma_n^2$. These are hyperparameters of the model (governing the distribution of the parameters $\textbf{w}$).
How should we choose them? Currently, we can:
Choose $\hat{\lambda} = \frac{\hat{\sigma}_n^2}{\hat{\sigma}_p^2}$ through cross validation.
Then estimate $\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i = 1}^n (y_i - \hat{\textbf{w}}^T \textbf{x}_i)^2$ as the empirical variance of the residuals, and solve for $\hat{\sigma}_p^2 = \frac{\hat{\sigma}_n^2}{\hat{\lambda}}$.
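One way to follow this recipe is sketched below using scikit-learn's cross-validated ridge regression; the data, the candidate grid for $\lambda$, and the use of `RidgeCV` are my own choices for illustration:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Pick lambda by cross-validation, estimate sigma_n^2 from the residuals,
# then back out sigma_p^2. The data and the candidate grid are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.5 * rng.normal(size=200)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), fit_intercept=False).fit(X, y)
lam_hat = ridge.alpha_                                 # cross-validated lambda
sigma_n2_hat = np.mean((y - ridge.predict(X)) ** 2)    # empirical residual variance
sigma_p2_hat = sigma_n2_hat / lam_hat

print(lam_hat, sigma_n2_hat, sigma_p2_hat)
```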
Graphical models¶
We know that arbitrary joint distributions can be represented as a product of conditionals through the chain rule
$$P(X_{1:n}) = P(X_1) \cdot P(X_2 \mid X_1) \cdots P(X_n \mid X_{1:n-1})$$
Often, factors only depend on subsets of variables. So the equation above can be represented by
$$\prod_{i = 1}^n P(X_i \mid X_{s_i})$$
where $s_i \subseteq \{ 1, \cdots, i - 1\}$. We can represent the resulting product as a directed acyclic graph.
Graphical model for BLR¶
Prior $P(\textbf{w}) = N(0, \sigma_{p}^2 \textbf{I})$ and likelihood $P(y_{1:n} \mid x_{1:n}, \textbf{w}) = \prod_{i = 1}^n P(y_i \mid \textbf{w}, x_i)$.
Joint distribution $P(\textbf{w}, y_{1:n} \mid x_{1:n}) = P(\textbf{w}) \cdot \prod_{i = 1}^n P(y_i \mid x_i, \textbf{w})$.
In the corresponding directed graphical model, the prior variance $\sigma_p^2$ is a parent of the weight node $\textbf{w}$, and $\textbf{w}$ is in turn a parent of every label node $y_1, \dots, y_n$; each $y_i$ additionally depends on its input $x_i$ and on the noise variance $\sigma_n^2$.