263-5210-00: Probabilistic Artificial Intelligence
Section 1
Fundamentals of Inference
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 09/29/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content may be incorrect. You should use this file properly and legally; we are not responsible for any results of using it.
This personal note is adapted from the course taught by Professor Andreas Krause. Please contact us to delete this file if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Probability¶
There are two ways to interpret probability. In the frequentist interpretation, one interprets the probability of an event (say, a coin coming up "heads" when flipping it) as the limit of relative frequencies in repeated independent experiments. That is,
$$\text{Probability} = \lim_{N \rightarrow \infty} \frac{\text{Number of events happening in } N \text{ trials}}{N}$$
This interpretation is natural, but it has a few issues. It is not difficult to conceive of settings where repeated experiments do not make sense. Consider the outcome "Person X will live for at least 80 years." There is no way we could conduct multiple independent experiments in this case. If we are not able to conduct repeated experiments, our notion of probability is simply a subjective measure of uncertainty about outcomes.
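The frequentist limit above can be illustrated with a short simulation; the following sketch (the function name and seed are our own, not part of the course material) estimates the probability of heads as a relative frequency:

```python
import random

def relative_frequency(n_trials: int, p_heads: float = 0.5, seed: int = 0) -> float:
    """Estimate P(heads) as the fraction of heads observed in n_trials flips."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return heads / n_trials
```

As $N$ grows, the estimate concentrates around the true probability; for example, `relative_frequency(100_000)` should land close to 0.5.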
Probability Spaces¶
A probability space is a mathematical model of a random experiment. The set of all possible outcomes of the experiment $\Omega$ is called the sample space. An event $A \subseteq \Omega$ of interest may be any combination of possible outcomes. The set of all events $\mathcal{A} \subseteq P(\Omega)$ that we are interested in is often called the event space of the experiment. This set of events is required to be a $\sigma$-algebra over the sample space.
The definition of $\sigma$-algebra is as follows:
Given the set $\Omega$, the set $\mathcal{A} \subseteq P(\Omega)$ is a $\sigma$-algebra over $\Omega$ if the following properties are satisfied:
$\Omega \in \mathcal{A}$;
if $A \in \mathcal{A}$, then $\bar{A} \in \mathcal{A}$ (closedness under complements); and
if we have $A_i \in \mathcal{A}$ for all $i$, then $\bigcup_{i = 1}^{\infty} A_i \in \mathcal{A}$ (closedness under countable unions).
Note that the three properties of $\sigma$-algebras correspond to characteristics we universally expect when working with random experiments.
Example:
The event space $\mathcal{A}$ can also be thought of as describing "how much information is available about the experiment." For example, if the experiment is a throw of a die and $\Omega$ is the set of possible values on the die, $\Omega = \{1, ..., 6\}$, then the following $\mathcal{A}$ implies that the observer cannot distinguish between 1 and 3:
$$\mathcal{A} = \{\emptyset, \Omega, \{1,3,5\}, \{2,4,6\}\}$$
Intuitively, the observer only understands the parity of the face of the die.
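For a finite sample space, the $\sigma$-algebra axioms can be checked directly (closure under pairwise unions suffices when $\mathcal{A}$ is finite). A minimal sketch, with hypothetical helper names:

```python
def is_sigma_algebra(omega: frozenset, events: set) -> bool:
    """Check the sigma-algebra axioms over a finite sample space omega."""
    if omega not in events:                # must contain the sample space
        return False
    for a in events:
        if omega - a not in events:        # closure under complements
            return False
    for a in events:
        for b in events:
            if a | b not in events:        # closure under (finite) unions
                return False
    return True

# The "parity" event space from the die example.
omega = frozenset({1, 2, 3, 4, 5, 6})
parity = {frozenset(), omega, frozenset({1, 3, 5}), frozenset({2, 4, 6})}
```

Running `is_sigma_algebra(omega, parity)` confirms that the parity event space satisfies all three properties.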
The definition of a probability measure, introduced by the Russian mathematician Andrey Kolmogorov, is as follows:
Given the set $\Omega$ and the $\sigma$-algebra $\mathcal{A}$ over $\Omega$, the function
$$P: \mathcal{A} \rightarrow \mathbb{R}$$
is a probability measure on $\mathcal{A}$ if the Kolmogorov axioms are satisfied:
$0 \leq P(A) \leq 1$ for any $A \in \mathcal{A}$;
$P(\Omega) = 1$; and
$P\left(\bigcup_{i = 1}^{\infty} A_i\right) = \sum_{i = 1}^{\infty} P(A_i)$ for any countable set of mutually disjoint events $\{A_i \in \mathcal{A}\}_i$.
Remarkably, all further statements about probability follow from these three natural axioms. For an event $A \in \mathcal{A}$, we call $P(A)$ the probability of $A$. We are now ready to define a probability space.
The definition of probability space is as follows:
A probability space is a triple ($\Omega$, $\mathcal{A}$, P) where
$\Omega$ is a sample space
$\mathcal{A}$ is a $\sigma$-algebra over $\Omega$, and
$P$ is a probability measure on $\mathcal{A}$
Example:
In our context, we often have that $\Omega$ is the set of real numbers $R$ or a compact subset of it. In this case, a natural event space is the $\sigma$-algebra generated by the set of events:
$$A_x = \{ x' \in \Omega : x' \leq x \}$$
The smallest $\sigma$-algebra $\mathcal{A}$ containing all sets $A_x$ is called the Borel $\sigma$-algebra. $\mathcal{A}$ contains all "reasonable" subsets of $\Omega$ (except for some pathological examples). For example, $\mathcal{A}$ includes all singleton sets $\{ x \}$, as well as all countable unions of intervals.
In the case of discrete $\Omega$, we in fact have $\mathcal{A} = P(\Omega)$, i.e., the Borel $\sigma$-algebra contains all subsets of $\Omega$.
Random Variables¶
Intuitively, a random variable maps outcomes to quantities of interest; for example, a function that maps a graph to its number of edges is a random variable. The definition of a random variable is as follows:
A random variable $X$ is a function
$$X: \Omega \rightarrow T$$
where $T$ is called the target space of the random variable, and where $X$ respects the information available in the $\sigma$-algebra $\mathcal{A}$. That is,
$$\forall S \subseteq T: \{ \omega \in \Omega : X(\omega) \in S \} \in \mathcal{A}.$$
Concrete values $x$ of a random variable $X$ are often referred to as states or realizations of $X$. The probability that $X$ takes on a value in $S \subseteq T$ is
$$P(X \in S) = P(\{ \omega \in \Omega : X(\omega) \in S \}).$$
Distributions¶
Consider a random variable $X$ on a probability space ($\Omega$, $\mathcal{A}$, $P$), where $\Omega$ is a compact subset of $R$, and $\mathcal{A}$ the Borel $\sigma$-algebra.
In this case, we can refer to the probability that $X$ assumes a particular state or set of states by writing (in the discrete case)
$$p_X(x) = P(X = x)$$
$$P_X(x) = P(X \leq x)$$
Here $p_X$ is referred to as the probability mass function (PMF) and $P_X$ is referred to as the cumulative distribution function (CDF).
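For a concrete discrete example, consider a fair six-sided die; the PMF and CDF can be written down directly (a sketch with our own function names, not course code):

```python
def pmf(x: int) -> float:
    """p_X(x) = P(X = x) for a fair six-sided die."""
    return 1 / 6 if 1 <= x <= 6 else 0.0

def cdf(x: float) -> float:
    """P_X(x) = P(X <= x), obtained by summing the PMF."""
    return sum(pmf(k) for k in range(1, 7) if k <= x)
```

Note that the CDF is a cumulative sum of the PMF, so `cdf(3)` gives $3/6 = 0.5$ and `cdf(6)` gives 1.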
Continuous Distributions¶
A continuous distribution is a probability distribution that describes the likelihood of a random variable taking on any value in a range, where the range is infinite and uncountable.
Example:
The following example is about the Gaussian Distribution.
The probability density function is:
$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)}$$
where $\sigma$ is the standard deviation and $\mu$ is the mean. If $\mu = 0$ and $\sigma^2 = 1$, this distribution is called the standard normal distribution. Note that the Gaussian CDF cannot be expressed in closed form.
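The density above is straightforward to evaluate; the CDF, lacking a closed form, is usually computed numerically, e.g., via the error function in Python's standard library (function names below are our own):

```python
import math

def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """CDF of N(mu, sigma^2), computed numerically via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
```

For the standard normal, `gaussian_cdf(0.0)` returns exactly 0.5, and `gaussian_cdf(1.96)` is close to the familiar 0.975.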
Joint Distributions¶
A joint probability (as opposed to a marginal probability) is the probability of two or more events occurring simultaneously:
$$P(A, B) = P(A \cap B)$$
In terms of random variables, this concept extends to joint distributions. Instead of characterizing a single random variable, a joint distribution is a function $p_X : \mathbb{R}^n \rightarrow \mathbb{R}$ characterizing a random vector $\textbf{X} = [X_1, ..., X_n]^T$. For example, if the $X_i$ are discrete, the joint distribution characterizes joint probabilities of the form:
$$P(\textbf{X} = [x_1, ..., x_n]^T) = P(X_1 = x_1, ..., X_n = x_n)$$
and hence describes the relationship among all variables $X_i$. For this reason, a joint distribution is also called a generative model. We use $X_{i:j}$ to denote the random vector $[X_i, ..., X_j]^T$.
We can "sum out" (respectively "integrate out") variables from a joint distribution in a process called "marginalization" (sum rule):
$$p(x_{1:i-1}, x_{i + 1:n}) = \int_{X_i(\Omega)} p(x_{1:i - 1}, x_i, x_{i + 1:n}) dx_i$$
Example:
Suppose we have a table like the following:
|           | Toothache, Catch | Toothache, No Catch | No Toothache, Catch | No Toothache, No Catch |
|-----------|------------------|---------------------|---------------------|------------------------|
| Cavity    | 0.108            | 0.012               | 0.072               | 0.008                  |
| No Cavity | 0.016            | 0.064               | 0.144               | 0.576                  |
So we can get
$$P(\text{toothache}) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2$$
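This marginalization amounts to a sum over the hidden variables; a sketch (the variable order (cavity, toothache, catch) is our own encoding of the table):

```python
# Joint distribution from the table; keys are (cavity, toothache, catch).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def p_toothache(joint: dict) -> float:
    """Sum out cavity and catch (the sum rule)."""
    return sum(p for (_, toothache, _), p in joint.items() if toothache)
```

The entries of the joint sum to 1, and `p_toothache(joint)` reproduces the 0.2 computed above.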
Conditional Probability¶
Conditional probability updates the probability of an event $A$ given some new information, for example, after observing the event $B$.
The definition of conditional probability is as follows:
Given two events $A$ and $B$ such that $P(B) > 0$, the probability of $A$ conditioned on $B$ is given as
$$P(A | B) = \frac{P(A, B)}{P(B)}$$
Simply rearranging the terms yields
$$P(A, B) = P(A | B) \cdot P(B) = P(B|A) \cdot P(A)$$
Thus, the probability that both $A$ and $B$ occur can be calculated by multiplying the probability of event $A$ and the probability of $B$ conditional on $A$ occurring.
Let us say $\textbf{Z} \sim \textbf{X} | \textbf{Y} = y$ (or simply $\textbf{Z} \sim \textbf{X} | y$) if $\textbf{Z}$ follows the conditional distribution
$$p_{\textbf{X}|\textbf{Y}}(\textbf{x} | \textbf{y}) = \frac{p_{\textbf{X},\textbf{Y}}(\textbf{x}, \textbf{y})}{p_{\textbf{Y}}(\textbf{y})}$$
If $\textbf{X}$ and $\textbf{Y}$ are discrete, we have that $p_{\textbf{X}|\textbf{Y}}(\textbf{x} | \textbf{y}) = P(\textbf{X} = \textbf{x} | \textbf{Y} = \textbf{y})$ as one would naturally expect.
Now we can extend $$P(A, B) = P(A | B) \cdot P(B) = P(B|A) \cdot P(A)$$ to a general product rule:
Given random variables $X_{1:n}$, we have:
$$ \begin{align} p(x_{1:n}) &= p(x_1) \cdot \prod_{i = 2}^n p(x_i | x_{1:i-1}) \\ &= p(x_1) \cdot p(x_2 | x_1) \cdot p(x_3 | x_{1:2}) \cdot ... \cdot p(x_n | x_{1:n-1}) \end{align} $$
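The product rule can be verified numerically on any small joint distribution; here a hypothetical joint over two binary variables (the numbers are arbitrary illustrative choices):

```python
# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def p_x(x: int) -> float:
    """Marginal p(x), by summing out y (sum rule)."""
    return sum(p for (xi, _), p in p_xy.items() if xi == x)

def p_y_given_x(y: int, x: int) -> float:
    """Conditional p(y | x) = p(x, y) / p(x)."""
    return p_xy[(x, y)] / p_x(x)
```

The product rule then states $p(x, y) = p(x)\,p(y \mid x)$, which holds for every pair of values.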
Posterior Inference¶
Given
Prior $P(X)$
Likelihood $P(Y | X)$
We can compute the posterior as follows:
$$ \begin{align} P(X | Y) &= \frac{P(X, Y)}{P(Y)} \\ &= \frac{P(X)P(Y|X)}{\sum_{X = x} P(Y, X = x)} \\ &= \frac{P(X)P(Y|X)}{\sum_{X=x} P(X = x)P(Y | X = x)} \end{align} $$
This is called Bayes' Rule.
Example
Suppose we know:
- Prior probability $P(C) = 0.1$
| Cavity | No Cavity |
|--------|-----------|
| 0.1    | 0.9       |
- Likelihood $P(T | C) = 0.9$
|           | Toothache | No Toothache |
|-----------|-----------|--------------|
| Cavity    | 0.9       | 0.1          |
| No Cavity | 0.01      | 0.99         |
We obtain the posterior as follows:
$$ \begin{align} P(C | T) &= \frac{P(C) \cdot P(T | C)}{P(C) \cdot P(T | C) + P(C^c) \cdot P(T | C^c)} \\ &= \frac{0.1 \times 0.9}{0.1 \times 0.9 + 0.9 \times 0.01} \\ &= \frac{0.09}{0.09 + 0.009} \\ &\approx 0.909 \end{align} $$
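The computation above generalizes to any binary hypothesis; a sketch (parameter names are our own):

```python
def posterior(prior: float, lik_pos: float, lik_neg: float) -> float:
    """Bayes' rule for a binary hypothesis C given evidence T:
    returns P(C | T) from P(C), P(T | C), and P(T | not C)."""
    evidence = prior * lik_pos + (1 - prior) * lik_neg  # P(T), by the sum rule
    return prior * lik_pos / evidence
```

With the numbers from the example, `posterior(0.1, 0.9, 0.01)` reproduces $0.09 / 0.099 \approx 0.909$.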
Independent Random Variables¶
Random variables $X_1, ..., X_n$ are independent if
$$P(X_1 = x_1, ..., X_n = x_n) = P(X_1 = x_1) \cdot P(X_2 = x_2) \cdot ... \cdot P(X_n = x_n)$$
While independence is a very strong requirement, a weaker notion is conditional independence.
Random variables $X$ and $Y$ are conditionally independent given $Z$ if and only if for all $x$, $y$, $z$:
$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z) \cdot P(Y = y \mid Z = z)$$
We can write such relation as follows:
$$\textbf{X} \perp \textbf{Y} \mid \textbf{Z}$$
Challenges with high-dimensional distributions¶
Representation (parametrization)
Learning (estimation)
Inference (prediction)
Suppose we have $n$ binary variables. How many parameters do we need to specify $P(X_1 = x_1, ..., X_n = x_n)$?
| $X_1$ | $X_2$ | ... | $X_{n-1}$ | $X_n$ | $P(X)$ |
|-------|-------|-----|-----------|-------|--------|
| 0     | 0     | ... | 0         | 0     | 0.01   |
| 0     | 0     | ... | 1         | 0     | 0.001  |
| 0     | 0     | ... | 1         | 1     | 0.213  |
| ...   | ...   | ... | ...       | ...   | ...    |
| 1     | 1     | ... | 1         | 1     | 0.0003 |
We need $2^n - 1$ parameters to specify $P(X_1 = x_1, ..., X_n = x_n)$.
We need to subtract 1 because the sum of all possible probabilities must equal 1, and hence the last probability can be determined as $ 1 - \sum_{(x_1, \dots, x_n) \neq (1, \dots, 1)} P(X_1 = x_1, \dots, X_n = x_n) $.
It is also hard to compute marginals. Suppose we have a joint distribution $P(X_1, ..., X_n)$; then
$$P(X_i = x_i) = \sum_{x_1, ..., x_{i - 1}, x_{i + 1}, ..., x_n} P(x_1, ..., x_n).$$
If all $X_i$ are binary, there are $2^{n-1}$ terms in this sum.
Last but not least, it is hard to answer conditional queries. Suppose we have a joint distribution $P(X_1, ..., X_n)$ and want to compute the distribution of some variables given values for others:
$$P(X_1 = \cdot \mid X_7 = x_7) = \frac{P(X_1 = \cdot , X_7 = x_7)}{P(X_7 = x_7)} = \frac{1}{Z} P(X_1 = \cdot , X_7 = x_7),$$
where $Z = P(X_7 = x_7)$ and $P(X_1 = \cdot, X_7 = x_7) = \sum_{x_{2:6}} \sum_{x_{8:n}} P(X_1 = \cdot, x_2, ..., x_n)$. In this case, there are $2^{n - 2}$ terms in the sum.
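The exponential cost of these operations is easy to see in code; the sketch below builds a hypothetical uniform joint over $n = 3$ binary variables and marginalizes by brute force:

```python
from itertools import product

def num_parameters(n: int) -> int:
    """A full joint over n binary variables has 2**n - 1 free parameters."""
    return 2 ** n - 1

def marginal(joint: dict, i: int, value: int) -> float:
    """Brute-force P(X_i = value): touches half of the 2**n joint entries."""
    return sum(p for assignment, p in joint.items() if assignment[i] == value)

# Hypothetical uniform joint over n = 3 binary variables.
n = 3
uniform_joint = {a: 1 / 2 ** n for a in product((0, 1), repeat=n)}
```

Already at $n = 10$ the table needs 1023 parameters; at $n = 100$ it is hopeless, which motivates the structured representations studied later.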
Example: Multivariate Gaussian¶
For two variables, the pdf of the multivariate Gaussian is
$$p(x) = \frac{1}{2 \pi \sqrt{|\Sigma|}} \exp \left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),$$
where $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$ and $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$.
The multivariate Gaussian distribution can be generalized and denoted as
$$N(x; \mu, \Sigma) = \frac{1}{(2 \pi)^{n/2} \sqrt{|\Sigma|}} \exp \left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
where
$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{n}^2 \end{pmatrix}$
$\mu = \begin{bmatrix} \mu_1, \cdots, \mu_n\end{bmatrix}^T$
$\mu_i = \mathbb{E} [X_i]$
$\sigma_i^2 = Var(X_i) = \mathbb{E} [(X_i - \mu_i)^2]$
$\sigma_{ij} = Cov(X_i, X_j) = \mathbb{E} [(X_i - \mu_i) (X_j - \mu_j)]$.
Thus a joint distribution over $n$ variables requires only $O(n^2)$ parameters. In fact, jointly Gaussian random variables are independent if and only if they are uncorrelated.
Bayesian inference in Gaussian Distributions¶
Suppose we have a Gaussian random vector
$$\textbf{X} = \textbf{X}_V = [X_1, \cdots, X_d] \sim N(\mu_V, \Sigma_{VV})$$
Here $V = \{ 1, \cdots, d \}$ is an index set.
Suppose we consider a subset of the variables $A = \{ i_1, \cdots, i_k \} \subseteq V$. The marginal distribution of variables indexed by $A$ is:
$$\textbf{X}_A = [X_{i_1}, \cdots, X_{i_k}] \sim N(\mu_A, \Sigma_{AA}),$$
where
$\mu_A = [\mu_{i_1}, \cdots, \mu_{i_k}]^T$
$\Sigma_{AA} = \begin{pmatrix} \sigma_{i_1}^2 & \sigma_{i_1 i_2} & \cdots & \sigma_{i_1 i_k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{i_k i_1} & \sigma_{i_k i_2} & \cdots & \sigma_{i_k}^2 \end{pmatrix}$
Conditional distribution¶
Suppose we have a Gaussian random vector
$$\textbf{X} = \textbf{X}_V = [X_1, \cdots, X_d] \sim N(\mu_V, \Sigma_{VV})$$
Further, suppose we take two disjoint subsets of $V$
$$A = \{i_1, \cdots, i_k \}, B = \{ j_1, \cdots, j_m \}$$
The conditional distribution
$$p(\textbf{X}_A \mid \textbf{X}_B = \textbf{x}_B) = N(\mu_{A \mid B}, \Sigma_{A \mid B})$$
is still Gaussian, where
$\mu_{A \mid B} = \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1}(\textbf{x}_B - \mu_{B})$
$\Sigma_{A \mid B} = \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA}$
Note that $\Sigma_{AB} \in \mathbb{R}^{k \times m}$.
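For the scalar special case $|A| = |B| = 1$, the conditioning formulas reduce to simple arithmetic; a sketch with our own parameter names:

```python
def condition_gaussian(mu_a: float, mu_b: float, var_a: float, var_b: float,
                       cov_ab: float, x_b: float) -> tuple:
    """Scalar case of Gaussian conditioning: returns (mu_{A|B}, var_{A|B})."""
    mu_cond = mu_a + cov_ab / var_b * (x_b - mu_b)   # mean shifts with the observation
    var_cond = var_a - cov_ab ** 2 / var_b           # variance shrinks, independent of x_b
    return mu_cond, var_cond
```

Note that the posterior variance depends only on the covariances, not on the observed value $\textbf{x}_B$.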
Multiples of Gaussians are Gaussian¶
Suppose we have a Gaussian random vector
$$\textbf{X} = \textbf{X}_V = [X_1, \cdots, X_d] \sim N(\mu_V, \Sigma_{VV})$$
Take a matrix $\textbf{M} \in \mathbb{R}^{m \times d}$.
Then the random vector $\textbf{Y} = \textbf{MX}$ is still Gaussian:
$$\textbf{Y} \sim N(\textbf{M} \mu_V, \textbf{M} \Sigma_{VV} \textbf{M}^T)$$
Sums of Gaussians are Gaussian¶
Suppose we have two independent Gaussian random vectors
$$\textbf{X} = \textbf{X}_V = [X_1, \cdots, X_d] \sim N(\mu_V, \Sigma_{VV})$$ $$\textbf{X}' = \textbf{X}_V' = [X_1', \cdots, X_d'] \sim N(\mu_V', \Sigma_{VV}')$$
Then the random vector $\textbf{Y} = \textbf{X} + \textbf{X}'$ is Gaussian:
$$\textbf{Y} \sim N(\mu_V + \mu_V', \Sigma_{VV} + \Sigma_{VV}')$$
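This closure property can be checked with a quick Monte Carlo estimate (the sample size, parameters, and seed below are arbitrary choices for illustration):

```python
import random
import statistics

def sample_sum(n: int, seed: int = 0) -> list:
    """Draw n samples of Y = X + X' with independent X ~ N(1, 2^2), X' ~ N(-1, 1^2)."""
    rng = random.Random(seed)
    return [rng.gauss(1.0, 2.0) + rng.gauss(-1.0, 1.0) for _ in range(n)]

# Theory predicts Y ~ N(1 + (-1), 4 + 1) = N(0, 5).
samples = sample_sum(200_000)
mean_est = statistics.fmean(samples)
var_est = statistics.variance(samples)
```

The empirical mean and variance match the predicted $N(0, 5)$ up to Monte Carlo error.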
Conditional Linear Gaussians¶
If $X$, $Y$ are jointly Gaussian, then $p(X = x \mid Y = y)$ is Gaussian, with mean linearly dependent on $y$:
$$p(X = x \mid Y = y) = N(x; \mu_{X \mid Y = y}, \sigma_{X \mid Y}^2)$$
where
$\mu_{X \mid Y = y} = \mu_X + \sigma_{XY} \sigma_{Y}^{-2} (y - \mu_Y)$
$\sigma_{X \mid Y}^2 = \sigma_X^2 - \sigma_{XY}^2 \sigma_{Y}^{-2}$
Thus random variable $X$ can be viewed as a linear function of $Y$ with independent Gaussian noise added
$$X = aY + b + \epsilon$$
where
$\epsilon \sim N(0, \sigma_{X \mid Y}^2)$
$a = \sigma_{XY}\sigma_{Y}^{-2}$
$b = \mu_X - \sigma_{XY}\sigma_{Y}^{-2} \mu_Y$
The converse also holds: if $X$ is a linear function of a Gaussian $Y$ plus independent Gaussian noise, then $X$ and $Y$ are jointly Gaussian.
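The coefficients $a$, $b$, and the noise variance follow directly from the joint moments; a sketch with hypothetical names:

```python
def linear_coefficients(mu_x: float, mu_y: float, var_x: float,
                        var_y: float, cov_xy: float) -> tuple:
    """Coefficients of X = a*Y + b + eps implied by jointly Gaussian (X, Y):
    returns (a, b, noise variance of eps)."""
    a = cov_xy / var_y
    b = mu_x - a * mu_y
    noise_var = var_x - cov_xy ** 2 / var_y  # equals sigma_{X|Y}^2
    return a, b, noise_var
```

For example, with $\mu_X = 1$, $\mu_Y = 2$, $\sigma_X^2 = 4$, $\sigma_Y^2 = 1$, and $\sigma_{XY} = 1$, this gives $a = 1$, $b = -1$, and noise variance 3.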