252-0535-00: Advanced Machine Learning
Section 1
Statistical Learning Overview
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 09/27/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Joachim Buhmann and Doctor Carlos Jimenez. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Statistical Learning¶
There are four paradigms in science:
| | Ontological: the world as it should be (necessary) | Epistemic: the world as it is (contingent) |
|---|---|---|
| Thinking: with brains (natural) | Mathematics (theoretical) | Physics (empirical) |
| Computing: with computers (artificial) | Computer science (computational) | Data science (data-driven) |
Classifying iris flowers is an example of data science: the combination of the computing and epistemic paradigms.
In the field of data science, we also have four paradigms:
Frequentism
Bayesianism
Statistical learning
Non-parametric statistics
Let us say we want to predict a target random variable $Y$ with range in $\mathcal{Y}$ given only a random vector $X$ with range in $\mathcal{X}$. Formally, we want to find a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that minimizes the expected risk:
$$R(f) = \mathbb{E}_{X,Y} \left[ \mathbb{1} \{ f(X) \neq Y \} \right]$$
This function cannot be computed because:
We do not have access to the joint distribution of $X$ and $Y$.
It is intractable to find this function $f$ without any assumptions on its structure.
It is unclear how to minimize the expected value of $\mathbb{1} \{f(X) \neq Y\}$.
We can deal with this as follows:
We restrict the space of possible choices of $f$ to a usually parameterizable set $\mathcal{H}$.
We collect a sample $Z = \{(x_i, y_i)\}_{i \le n}$.
We use a differentiable loss function $L: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ to approximate $\mathbb{1} \{f(X) \neq Y\}$.
With these choices, we approximate $R(f)$ with the empirical risk:
$$\hat{R}(f) = \hat{L}(Z, f) = \frac{1}{n} \sum_i L(y_i, f(x_i))$$
We then approximate our desired $f$ with:
$$\hat{f} = \mathrm{argmin}_{f \in \mathcal{H}} \hat{R}(f)$$
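To make this concrete, here is a minimal sketch of empirical risk minimization in Python; the linear hypothesis class, squared loss, and synthetic data are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical sample Z = {(x_i, y_i)}: in practice it is drawn from the unknown joint distribution.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)

# Hypothesis class H = { f_theta(x) = theta * x : theta in Theta }, with Theta a finite grid here.
thetas = np.linspace(-5.0, 5.0, 1001)

# Differentiable surrogate loss L(y, y_hat) = (y - y_hat)^2 in place of the 0/1 loss.
def empirical_risk(theta):
    return np.mean((y - theta * x) ** 2)

# Empirical risk minimization: pick the hypothesis with the smallest average loss on Z.
theta_hat = thetas[np.argmin([empirical_risk(t) for t in thetas])]
print(theta_hat)   # close to the true slope 2.0
```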
Statistical learning in data science¶
Objects, data, and representations¶
We like to represent objects of interest and characterize them according to their typical patterns for detection, classification, abstraction (compression), and so on.
We represent objects using measurements in a sample space. For example, the objects may be flowers, and the measurements may be pixel intensities or petal and sepal dimensions.
Feature space¶
Sample space: the mathematical space in which the data are represented.
Numerical data is represented in $\mathbb{R}^d$.
Boolean data is represented in $\mathbb{B} = \{ 0, 1 \}$.
Categorical data is represented in $\{1, 2, ..., K\}$.
Features are derived quantities or indirect observations which often significantly compress the information content of measurements.
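For illustration, a small sketch of such representations and derived features; the iris-style columns and the "petal area" feature are hypothetical choices.

```python
import numpy as np

# Hypothetical raw measurements (sepal length, sepal width, petal length, petal width), in cm.
measurements = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 2.9, 5.6, 1.8],
])

# Categorical data represented in {1, ..., K}: species labels as integers.
species = np.array([1, 3])

# A derived feature compressing the two petal measurements into one number.
petal_area = measurements[:, 2] * measurements[:, 3]

# Boolean data represented in B = {0, 1}: an indicator feature.
has_large_petal = (petal_area > 1.0).astype(int)

print(petal_area, has_large_petal, species)
```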
Hypothesis class¶
We make assumptions on the nature of $f: \mathcal{X} \rightarrow \mathcal{Y}$, usually in the form of a parameterizable hypothesis class $\mathcal{H} = \{ f_{\theta}: \theta \in \Theta \}$, with $\Theta$ a subset of a Euclidean space.
For example, we may choose to assume that if $(X, Y)$ is a pair of random variables with range in $\mathcal{X} \times \mathcal{Y}$, then:
$Y \sim \text{Bern}(0.5)$
$X \mid Y = y \sim N(\mu_y, \Sigma)$, for some $\{ \mu_y : y \in \mathcal{Y} \}, \Sigma$.
Using Bayes' Rule, we can naturally derive an estimator $f(x) = \mathrm{argmax}_{y \in \mathcal{Y}} g_{y,w}(x)$, where $g_{y,w}(x) = \text{Pr}(Y = y \mid X = x) = \sigma (w^T x)^y (1 - \sigma (w^T x))^{1 - y}$, for some $w \in \mathbb{R}^d$. Here $\sigma$ is the sigmoid function.
Therefore, we define $\mathcal{H}$ as $\{ g_{1, w} : w \in \mathbb{R}^d \}$.
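A brief sketch of why this sigmoid form arises (assuming the equal priors and shared covariance $\Sigma$ stated above): by Bayes' rule,

$$\Pr(Y = 1 \mid X = x) = \frac{p(x \mid y = 1)}{p(x \mid y = 1) + p(x \mid y = 0)} = \sigma\left( \log \frac{p(x \mid y = 1)}{p(x \mid y = 0)} \right),$$

and for Gaussians with shared covariance $\Sigma$ the quadratic terms cancel, so

$$\log \frac{p(x \mid y = 1)}{p(x \mid y = 0)} = (\mu_1 - \mu_0)^T \Sigma^{-1} x - \tfrac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right),$$

which is affine in $x$; appending a constant $1$ to $x$ absorbs the offset, giving the $\sigma(w^T x)$ form above.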
Loss function¶
Let $g: \mathcal{X} \rightarrow [0, 1]$, where $g(x)$, for $x \in \mathcal{X}$, denotes the degree of belief that $x$ is of type $y = 1$. The negative log-likelihood, or cross-entropy, is
$$-\sum_{i \le n} \left[ y_i \log g(x_i) + (1 - y_i) \log (1 - g(x_i)) \right]$$
Training¶
We use gradient descent with learning rate $\eta$.
Randomly pick $\hat{w}$.
For $e = 1, 2, ...$, update $\hat{w} \leftarrow \hat{w} - \eta \nabla L(\{ (x_i, y_i) : i \le n \}, \hat{w})$.
Return $\hat{w}$.
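A minimal Python sketch of this training loop for the logistic model above; the toy data, learning rate, and number of epochs are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training sample: labels y ~ Bern(0.5), features X | Y = y ~ N(mu_y, I).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
X = np.hstack([np.ones((n, 1)), X])     # prepend a constant 1 so w^T x includes a bias term

eta, epochs = 0.1, 500
w_hat = rng.normal(size=3)              # randomly pick w_hat

for e in range(epochs):
    p = sigmoid(X @ w_hat)              # g_{1,w}(x_i) for every sample
    grad = X.T @ (p - y) / n            # gradient of the mean cross-entropy loss w.r.t. w
    w_hat -= eta * grad                 # gradient descent update

accuracy = np.mean((sigmoid(X @ w_hat) > 0.5) == y)
print(w_hat, accuracy)
```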
Calculus Review¶
Derivatives¶
A scalar function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ is differentiable at $a \in \mathbb{R}^d$ if there exists a vector $\nabla \in \mathbb{R}^d$ such that
$$\lim_{h \rightarrow 0} \frac{f(a + h) - f(a) - \nabla^T h}{\| h \|} = 0.$$
Let $x = (x_1, ..., x_d)^T$ be a vector of variables in $\mathbb{R}^d$. One can demonstrate that
$$\nabla = \nabla f(a) := \frac{\partial{f}}{\partial{x}}\Bigg|_{x = a} := \left( \frac{\partial{f}}{\partial{x_i}} \right)_{i \le d} \Bigg|_{x = a}$$
The function $\frac{\partial{f}}{\partial{x}} : \mathbb{R}^d \rightarrow \mathbb{R}^d$ is called the derivative or the gradient of $f$.
Consider a vector field $F: \mathbb{R}^d \rightarrow \mathbb{R}^m$ and suppose that $F(x) = (F_1(x), ..., F_m(x))$. We say that $F$ is differentiable at $a \in \mathbb{R}^d$ if each $F_i$ has continuous partial derivatives at $a$.
The derivative of $F$ is the function defined as
$$\frac{\partial{F}}{\partial{x}} = \left( \frac{\partial{F_i}}{\partial{x_j}} \right)_{i \le m, j \le d}.$$
For convenience, $\frac{\partial{F}}{\partial{x}}$ is usually represented as a matrix with $m$ rows and $d$ columns.
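As a sanity check, a small Python sketch that approximates this matrix by finite differences; the example field $F$ is an arbitrary illustrative choice.

```python
import numpy as np

# Example vector field F: R^2 -> R^2, F(x) = (x1 * x2, sin(x1)).
def F(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def numerical_jacobian(F, a, h=1e-6):
    """Central-difference approximation of dF/dx at a; entry (i, j) is dF_i/dx_j (m rows, d columns)."""
    m, d = len(F(a)), len(a)
    J = np.zeros((m, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        J[:, j] = (F(a + e) - F(a - e)) / (2 * h)
    return J

a = np.array([1.0, 2.0])
print(numerical_jacobian(F, a))
# Analytically, dF/dx = [[x2, x1], [cos(x1), 0]] = [[2, 1], [cos(1), 0]] at a.
```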
Useful properties¶
If $x$ is a variable vector in $\mathbb{R}^d$ and $a \in \mathbb{R}^d$, then
$$\frac{\partial}{\partial{x}} x^T a = \frac{\partial}{\partial{x}} a^T x = a.$$
If $A \in \mathbb{R}^{d \times d}$, then
$$\frac{\partial}{\partial{x}} x^T A x = (A + A^T) x.$$
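A small numerical check of these two identities via finite-difference gradients; the particular $a$, $A$, and $x$ are random illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
a, A, x = rng.normal(size=d), rng.normal(size=(d, d)), rng.normal(size=d)

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# d/dx (a^T x) = a
print(np.allclose(numerical_gradient(lambda v: a @ v, x), a, atol=1e-6))
# d/dx (x^T A x) = (A + A^T) x
print(np.allclose(numerical_gradient(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-6))
```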
Probability Review¶
Probability space¶
A probability space is a triple $(\Omega, \Sigma, P)$ where
$\Omega$ is a non-empty sample space.
$\Sigma$ is a sigma-algebra of subsets of $\Omega$, describing the measurable events of $\Omega$. It is a collection of subsets of $\Omega$ that fulfills the following:
$\emptyset \in \Sigma$
If $A \in \Sigma$ then $A^c \in \Sigma$
If $A_1, A_2, ... \in \Sigma$ then $A_1 \cup A_2 \cup ... \in \Sigma$.
$P: \Sigma \rightarrow [0, 1]$ is a probability measure satisfying the following:
$P(A) \ge 0$, for all $A \in \Sigma$.
$P(\Omega) = 1$.
If $A_1$, $A_2$, ... are disjoint sets in $\Sigma$ then $P(A_1 \cup A_2 \cup ...) = \sum_{i \ge 1}P(A_i)$.
Random variables¶
A measurable space $(\Omega', \Sigma')$ is a pair consisting of just a sample space and a sigma-algebra.
Consider two measurable spaces $(\Omega, \Sigma)$, $(\Omega', \Sigma')$. A function $f: \Omega \rightarrow \Omega'$ is measurable if for any $A \in \Sigma'$, $f^{-1}(A) \in \Sigma$.
Let $(\Omega, \Sigma, P)$ be a probability space and $(\Omega', \Sigma')$ a measurable space. A random variable $X: \Omega \rightarrow \Omega'$ is a measurable function mapping the sample space of a probability space to another measurable space $(\Omega', \Sigma')$. Usually $\Omega'$ is $\mathbb{R}^d$ and $\Sigma'$ is the Borel sigma-algebra $\mathbb{B}(\mathbb{R}^d)$ over $\mathbb{R}^d$, which is the smallest sigma-algebra containing all products of open intervals in $\mathbb{R}$.
For $A \in \Sigma'$, we define $P(X \in A) = P(X^{-1}(A))$.
When $(\Omega', \Sigma') = (\mathbb{R}^d, \mathbb{B}(\mathbb{R}^d))$, we define the probability density function (pdf) of $X$ as the function $p: \mathbb{R}^d \rightarrow \mathbb{R}$ such that $P(X \in A) = \int_{x \in A} p(x)dx$.
For two random variables, $X, Y: (\Omega, \Sigma, P) \rightarrow (\Omega', \Sigma')$, we define $P(X\in A | Y\in B) = \frac{P(X\in A, Y \in B)}{P(Y \in B)}$.
Expectations¶
For a random variable $X : (\Omega, \Sigma, P) \rightarrow (\mathbb{R}^d, \mathbb{B}(\mathbb{R}^d))$, and a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$, we define the expectation of $f(X)$ as
$$\mathbb{E}_{X \sim P} [f(X)] = \int f(x) p(x) dx.$$
We can drop the subscript in $\mathbb{E}$ when it is clear from the context.
For two random variables $X, Y : (\Omega, \Sigma, P) \rightarrow (\mathbb{R}^d, \mathbb{B}(\mathbb{R}^d))$, the conditional expectation $\mathbb{E}[X\mid Y]$ is the function given by $y \mapsto \mathbb{E}[X \mid Y = y]$.
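In practice, expectations are typically approximated by sample averages; a small Monte Carlo sketch (the choice of distribution and $f$ is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ N(0, 1) and f(x) = x^2, so E[f(X)] = 1.
samples = rng.normal(size=100_000)
estimate = np.mean(samples ** 2)   # the sample average approximates the integral of f(x) p(x) dx
print(estimate)                    # close to 1
```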
Information Theory Review¶
Entropy¶
Entropy is a measure of the randomness / information / uncertainty in a random variable.
It is derived from the following axioms:
For a random variable $X$, the entropy $H(X)$ must be non-negative.
For any two independent random variables $X$ and $Y$, the entropy $H(X, Y)$ of the random variable $(X, Y)$ must satisfy $H(X, Y) = H(X) + H(Y)$.
For a random variable $X$, the entropy $H(X)$ is maximum when $X$ is uniform.
There exists only one function $H$ that fulfills these conditions: $H(X) = - \sum_{x \in \mathcal{X}} \mathbb{P}(X = x) \log \mathbb{P}(X = x)$ when $X$ is a discrete random variable with range in $\mathcal{X}$, and $H(X) = - \int p(x) \log p(x) dx$ when $X$ is a continuous random variable with range in $\mathcal{X}$.
When $X$ is discrete and its pmf takes only values that are powers of $2$ (i.e., of the form $2^{-k}$), then $H(X)$ is the expected code length of an optimal code.
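A small numerical illustration with a dyadic pmf (an arbitrary example):

```python
import numpy as np

# Dyadic pmf: every probability is a (negative) power of 2.
p = np.array([0.5, 0.25, 0.125, 0.125])

entropy = -np.sum(p * np.log2(p))      # H(X) in bits
code_lengths = -np.log2(p)             # ideal prefix-code word lengths: 1, 2, 3, 3 bits
expected_length = np.sum(p * code_lengths)

print(entropy, expected_length)        # both equal 1.75 bits
```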
Conditional entropy¶
We define the conditional entropy as the amount of uncertainty left in $X$ once $Y$ is revealed:
$$H(X \mid Y) = \sum_{y} P(Y = y) H(X \mid Y = y)$$
Mutual information¶
If $H(X \mid Y)$ indicates the amount of information left in $X$ after $Y$ is revealed, then we can define the mutual information shared between $X$ and $Y$ as
$$I(X; Y) = H(X) - H(X | Y).$$
We can also demonstrate that $I(X; Y) = I(Y; X)$.
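A small sketch that computes $H(X)$, $H(X \mid Y)$, and $I(X; Y)$ from a joint pmf table and checks the symmetry; the joint distribution is an arbitrary illustrative example.

```python
import numpy as np

# Joint pmf P(X = x, Y = y): rows indexed by x, columns by y.
P = np.array([[0.25, 0.25],
              [0.00, 0.50]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x, p_y = P.sum(axis=1), P.sum(axis=0)

# H(X | Y) = sum_y P(Y = y) H(X | Y = y)
H_X_given_Y = sum(p_y[j] * H(P[:, j] / p_y[j]) for j in range(len(p_y)))
I_xy = H(p_x) - H_X_given_Y

# Computed the other way around: I(Y; X) = H(Y) - H(Y | X)
H_Y_given_X = sum(p_x[i] * H(P[i, :] / p_x[i]) for i in range(len(p_x)))
I_yx = H(p_y) - H_Y_given_X

print(I_xy, I_yx)   # the two values agree, illustrating I(X; Y) = I(Y; X)
```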
Kullback-Leibler divergence¶
Every encoding function $f$ mapping a pair (object, location) to a bitstring induces a probability distribution $p_f$ over such pairs. The distribution is simply given by $p_f(x, y) = 2^{-\mid f(x, y) \mid}$, where $\mid f(x, y) \mid$, is the length of $f(x, y)$.
Conversely, for every distribution $p$ over pairs (object, location), there is an (ideal) encoding function $f_p$ where $\mid f_p(x, y) \mid \approx - \log_2 p(x, y)$.
Suppose that we have a distribution $p$ and an encoding function $f$. How inefficient is $f$ with respect to $f_p$?
$$\sum_{x, y} p(x, y) \left( - \log p_f(x, y) + \log p(x, y) \right) = \sum_{x, y} p(x, y) \left( \log \frac{p(x, y)}{p_f(x, y)} \right).$$
We can generalize this to any pair of distributions $p$ and $q$, leading to the Kullback-Leibler divergence:
$$KL(p \mid q) = \sum_{z} p(z) \left( \log \frac{p(z)}{q(z)} \right).$$
Cross-entropy¶
We are given a distribution $p$ and we want to find the optimal encoding function $f$ for it. We should then minimize $KL(p \mid p_f)$. This is equivalent to minimizing
$$KL(p \mid p_f) = \sum_{x, y} p(x, y) \left( \log \frac{p(x, y)}{p_f(x, y)} \right) = \text{const} - \sum_{x, y} p(x, y) \left( \log p_f(x, y) \right).$$
We call this objective the cross-entropy between $p$ and $p_f$. We can generalize this to any pair of distributions $p$ and $q$, leading to:
$$CE(p \mid q) = - \sum_{z} p(z) \left( \log q(z) \right).$$
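A short sketch computing $KL(p \mid q)$ and $CE(p \mid q)$ for two small distributions and checking that $CE(p \mid q) = H(p) + KL(p \mid q)$; the distributions are arbitrary illustrative choices.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # distribution induced by a (suboptimal) encoding

kl = np.sum(p * np.log(p / q))  # KL(p | q): extra cost of coding p with the code for q
ce = -np.sum(p * np.log(q))     # CE(p | q)
h = -np.sum(p * np.log(p))      # H(p)

print(kl, ce, h + kl)           # ce equals h + kl
```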