252-0535-00: Advanced Machine Learning
Section 1
Statistical Learning Overview
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 09/27/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Joachim Buhmann and Dr. Carlos Jimenez. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Statistical Learning
There are four paradigms in science:

| | Ontological: the world as it should be (necessary) | Epistemic: the world as it is (contingent) |
|---|---|---|
| Thinking: with brains (natural) | Mathematics (theoretical) | Physics (empirical) |
| Computing: with computers (artificial) | Computer science (computational) | Data science (data-driven) |
Classifying iris flowers is an example of data science, the combination of the computing and epistemic paradigms.
In the field of data science, we also have four paradigms: frequentism, Bayesianism, statistical learning, and non-parametric statistics.
Let us say we want to predict a target random variable $Y$ with range in $\mathcal{Y}$ given only a random vector $X$ with range in $\mathcal{X}$. Formally, we want to find a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that minimizes the expected risk:
$$R(f) = \mathbb{E}_{X,Y} \left[ \mathbb{1} \{ f(X) \neq Y \} \right]$$
This function cannot be computed because:

- We do not have access to the joint distribution of $X$ and $Y$.
- It is intractable to find this function $f$ without any assumptions on its structure.
- The indicator $\mathbb{1} \{f(X) \neq Y\}$ is not differentiable, so it is unclear how to minimize its expected value directly.
We can deal with this as follows:

- We restrict the space of possible choices of $f$ to a usually parameterizable set $\mathcal{H}$.
- We collect a sample $Z = \{(x_i, y_i)\}_{i \le n}$.
- We use a differentiable loss function $L: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ to approximate $\mathbb{1} \{f(X) \neq Y\}$, as in the sketch after this list.
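To see why a differentiable surrogate helps, here is a minimal numerical sketch (our own illustration, not from the lecture) comparing the 0-1 indicator loss with the logistic loss for binary labels $y \in \{-1, +1\}$ and a real-valued score $s = f(x)$; the function names and the choice of logistic loss are assumptions made for illustration.

```python
import numpy as np

# Minimal sketch (assumed setup, not from the lecture): binary labels
# y in {-1, +1} and a real-valued score s = f(x).

def zero_one_loss(y, s):
    # 1{sign(s) != y}: the loss we care about, but non-differentiable in s.
    return (np.sign(s) != y).astype(float)

def logistic_loss(y, s):
    # log2(1 + exp(-y*s)): smooth in s and upper-bounds the 0-1 loss.
    return np.log1p(np.exp(-y * s)) / np.log(2)

scores = np.linspace(-3.0, 3.0, 7)
print(zero_one_loss(+1, scores))  # [1. 1. 1. 1. 0. 0. 0.] -- a step at s = 0
print(logistic_loss(+1, scores))  # smooth decay, everywhere >= the 0-1 loss
```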
With these choices, we approximate $R(f)$ with the empirical risk:
$$\hat{R}(f) = \hat{L}(Z, f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$
We then approximate our desired $f$ with the empirical risk minimizer:
$$\hat{f} = \mathrm{argmin}_{f \in \mathcal{H}} \hat{R}(f)$$
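Putting the pieces together, the following is a minimal sketch of empirical risk minimization under stated assumptions: $\mathcal{H}$ is the class of linear scores $f_w(x) = w^\top x$, the surrogate is the logistic loss, the sample is synthetic, and plain gradient descent with a hand-picked step size performs the minimization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic sample Z = {(x_i, y_i)} with labels in {-1, +1}.
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0])             # assumed "true" weights
y = np.sign(X @ w_true + 0.3 * rng.normal(size=n))

def empirical_risk(w):
    # (1/n) sum_i L(y_i, f_w(x_i)) with the logistic surrogate loss.
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w):
    # Gradient of the empirical logistic risk with respect to w.
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

# Approximate f_hat = argmin over H of R_hat(f) by gradient descent.
w = np.zeros(d)
for _ in range(500):
    w -= 0.5 * gradient(w)                 # step size chosen by hand

zero_one_error = np.mean(np.sign(X @ w) != y)
print(f"surrogate risk: {empirical_risk(w):.3f}, 0-1 error: {zero_one_error:.3f}")
```

Note that the gradient descent only finds a minimizer within the chosen class $\mathcal{H}$; how well $\hat{f}$ approximates the true risk minimizer depends on both this restriction and the sample size $n$.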