252-0535-00: Advanced Machine Learning
Section 2
Anomaly Detection
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 09/28/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content may be incorrect. You should use this file properly and legally. We are not responsible for any consequences of using this file.
This personal note is adapted from the lectures of Professor Joachim Buhmann and Dr. Carlos Jimenez. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Anomaly detection is used everywhere. For example, banks need to identify fraudulent transactions; doctors need to identify abnormal cells in patients; law enforcement needs to identify suspicious behaviors. How can we detect "anomalies"?
Suppose there is a "normal" set $N \subseteq \mathbb{R}^D$; points not in $N$ are considered anomalies. The anomaly detection task can be stated formally as follows:
Given $n$ normal points $\{ x_1, ..., x_n \} \subseteq N$, estimate the anomaly detection function $\varphi : \mathbb{R}^D \rightarrow \{ 0, 1 \}$ such that $\varphi(x) = 1 \Leftrightarrow x \notin N.$
Note that we consider a new data point $x^*$ an anomaly if $\hat{\varphi}(x^*) = 1$, where $\hat{\varphi}$ is our estimate of $\varphi$.
Since directly characterizing the set $N$ in high dimensions is difficult, we instead assign each point an anomaly score. Linear projections of high-dimensional data often appear Gaussian in lower dimensions, so we use the following procedure (a code sketch follows the list):
Project the training data onto a low-dimensional space.
Fit a Gaussian Mixture Model (GMM) by maximizing likelihood.
Use the GMM to assign anomaly scores for detecting abnormal points.
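As a rough illustration of this pipeline, the sketch below uses scikit-learn's `PCA` and `GaussianMixture` as off-the-shelf stand-ins for the three steps above. The placeholder data and the choices of $d$, $K$, and the threshold `tau` are assumptions made purely for the example, not part of the course material.

```python
# A minimal sketch of the project-then-score pipeline, assuming scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 50))   # placeholder "normal" data in R^D

d, K = 5, 3                             # latent dimension and mixture size (assumed)
pca = PCA(n_components=d)
Z_train = pca.fit_transform(X_train)    # step 1: project to R^d

gmm = GaussianMixture(n_components=K, random_state=0)
gmm.fit(Z_train)                        # step 2: fit GMM by maximum likelihood

def anomaly_score(x_new):
    """Step 3: low log-density under the GMM = high anomaly score."""
    z = pca.transform(np.atleast_2d(x_new))
    return -gmm.score_samples(z)        # negative log-likelihood per point

tau = np.quantile(anomaly_score(X_train), 0.99)       # example threshold (assumed)
def phi_hat(x_new):
    return (anomaly_score(x_new) > tau).astype(int)   # 1 = anomaly
```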
Principal component analysis (PCA)¶
PCA is a method for dimension reduction. The idea is to project the data from $\mathbb{R}^D$ to $\mathbb{R}^d$ ($d$ is usually much smaller than $D$) such that the $d$ components have the largest variance (we believe that components with larger variance are more informative). Let $x_1, ..., x_n \in \mathbb{R}^D$ be the sample points and let $\bar{X} = \frac{1}{n} \sum_{i = 1}^n x_i$ be the sample mean.
Base case: $d = 1$¶
For any $u \in \mathbb{R}^D$ with $||u||_2^2 = u^T u = 1$, the sample mean of $\{ u^T x_1, ..., u^T x_n\}$ is
$$\frac{1}{n} \sum_{i = 1}^n u^T x_i = u^T \bar{X}$$
and the sample variance of $\{ u^T x_1, ..., u^T x_n \}$ is
$$ \begin{align} \frac{1}{n} \sum_{i = 1}^n \left( u^T x_i - u^T \bar{X} \right)^2 &= \frac{1}{n} \sum_{i = 1}^n u^T (x_i - \bar{X})(x_i - \bar{X})^T u \\ &= u^T \left( \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{X}) (x_i - \bar{X})^T \right) u. \end{align} $$
We define the sample covariance matrix
$$S := \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{X}) (x_i - \bar{X})^T.$$
Then the idea of PCA (maximizing the variance) gives us the first principal component (PC)
$$u_1 = \text{argmax}_{u \in \mathbb{R}^D, ||u||_2 = 1} u^T S u$$
We claim that $u_1$ is a unit eigenvector of $S$ corresponding to the largest eigenvalue. To see this, let $f(u) = u^T S u$ and $g(u) = 1 - u^T u$; the optimization problem then becomes maximizing $f(u)$ subject to $g(u) = 0$. The Lagrangian of the problem is
$$ \begin{align} L(u, \lambda) &= f(u) + \lambda g(u) \\ &= u^T S u + \lambda (1 - u^T u)\\ &= u^T(S - \lambda I_D) u + \lambda, \end{align} $$
where $I_D$ is the $D \times D$ identity matrix. As $u_1$ is the maximizer of the problem, we have
$$\frac{\partial L}{\partial u} (u_1, \lambda) = 2(S - \lambda I_D) u_1 = 0,$$
so we have
$$S u_1 = \lambda u_1$$
and $u_1$ is an eigenvector of $S$. To see why it is associated with the largest eigenvalue, suppose the eigenvalue associated with $u_1$ is $\lambda_1$. For any unit eigenvector $u^*$ of $S$ associated with eigenvalue $\lambda^*$, we have
$$\lambda_1 = \lambda_1 u_1^T u_1 = u_1^T S u_1 \ge u^{*T} S u^* = \lambda^* u^{*T} u^* = \lambda^*,$$
where the inequality is due to the definition of $u_1$. Therefore, $u_1$ is an eigenvector of $S$ associated with its largest eigenvalue.
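As a quick numerical sanity check of this claim (a minimal sketch on assumed synthetic data), one can compare the Rayleigh quotient $u^T S u$ at the top eigenvector of $S$ against that of a random unit vector:

```python
# Check that the top eigenvector of S maximizes the projected sample variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ np.diag([3.0, 1.0, 0.5, 0.1])  # anisotropic data
Xc = X - X.mean(axis=0)                  # center: x_i - X_bar
S = (Xc.T @ Xc) / len(X)                 # S = (1/n) sum (x_i - X_bar)(x_i - X_bar)^T

eigvals, eigvecs = np.linalg.eigh(S)     # eigh: ascending eigenvalues, orthonormal vectors
u1 = eigvecs[:, -1]                      # eigenvector for the largest eigenvalue

random_u = rng.normal(size=4)
random_u /= np.linalg.norm(random_u)
assert u1 @ S @ u1 >= random_u @ S @ random_u   # u1 attains the maximal variance
print("largest eigenvalue:", eigvals[-1], "variance along u1:", u1 @ S @ u1)
```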
General case: $d > 1$¶
For $d > 1$, the remaining principal components can be derived with a similar idea. Let $X := \{ x_1, ..., x_n \}$ be the dataset. With the first principal component $u_1$, we can define
$$X_1 := \{ x - \text{proj}_{u_1} x : x \in X \} = \{ x - (u_1^T x) u_1 : x \in X \}$$
as the dataset without the first principal component. We can then define the second principal component as
$$u_2 := \text{argmax}_{u \in \mathbb{R}^D, ||u||_2 = 1} u^T S_1 u,$$
where $S_1$ is the sample covariance matrix of $X_1$. The remaining principal components can be determined similarly. One can prove that $u_i^T u_j = 0$ for $i \neq j$ and that $u_i$ is the eigenvector of $S$ associated with the $i$-th largest eigenvalue.
Proof¶
In the general case of PCA where we want to find more than one principal component, the process involves iteratively finding the next direction of maximum variance orthogonal to the previously identified principal components. Let's break this down step by step.
Step 1: Finding the First Principal Component $ u_1 $¶
We already know the first principal component, $ u_1 $, is the eigenvector of the covariance matrix $ S $ corresponding to the largest eigenvalue. It maximizes the variance in the dataset. Formally:
$$ u_1 = \arg\max_{u \in \mathbb{R}^D, \|u\|_2 = 1} u^T S u $$
where $ S $ is the sample covariance matrix of the data:
$$ S := \frac{1}{n} \sum_{i=1}^n (x_i - \bar{X})(x_i - \bar{X})^T $$
Step 2: Removing the First Principal Component¶
After finding the first principal component $ u_1 $, we remove its contribution from the dataset. For each data point $ x \in X $, we subtract the projection of $ x $ onto $ u_1 $:
$$ X_1 := \{ x - \text{proj}_{u_1}(x) : x \in X \} = \{ x - (u_1^T x) u_1 : x \in X \}. $$
This new dataset $ X_1 $ no longer contains variance along the direction $ u_1 $.
The sample covariance matrix for $ X_1 $ is denoted by $ S_1 $.
Step 3: Finding the Second Principal Component $ u_2 $¶
We now find the second principal component, $ u_2 $, which maximizes the variance in $ X_1 $, subject to the constraint $ \|u_2\|_2 = 1 $. Formally:
$$ u_2 = \arg\max_{u \in \mathbb{R}^D, \|u\|_2 = 1} u^T S_1 u. $$
Since every point in $ X_1 $ is orthogonal to $ u_1 $, the second principal component $ u_2 $ will also be orthogonal to $ u_1 $.
Step 4: Generalizing to Higher Principal Components¶
We can generalize this process to find more principal components. After finding $ u_1, u_2, \dots, u_i $, the $ (i+1) $-th principal component is found by removing the contributions of all previous components and maximizing the variance in the remaining data.
Define the $ i $-th dataset $ X_i $ as:
$$ X_i := \{ x - \sum_{j=1}^{i} \text{proj}_{u_j}(x) : x \in X \} = \{ x - \sum_{j=1}^{i} (u_j^T x) u_j : x \in X \}. $$
The sample covariance matrix for $ X_i $ is denoted by $ S_i $.
The $ (i+1) $-th principal component is then:
$$ u_{i+1} = \arg\max_{u \in \mathbb{R}^D, \|u\|_2 = 1} u^T S_i u. $$
This ensures that each new principal component maximizes the remaining variance in directions orthogonal to the previous components.
Step 5: Orthogonality of Principal Components¶
The principal components $ u_1, u_2, \dots, u_d $ are mutually orthogonal, meaning $ u_i^T u_j = 0 $ for $ i \neq j $.
- The first principal component $ u_1 $ maximizes the variance.
- The second principal component $ u_2 $ is orthogonal to $ u_1 $ and maximizes the remaining variance in the dataset.
- This orthogonality extends to all subsequent components: $ u_{i+1} $ is orthogonal to $ u_1, u_2, \dots, u_i $ and maximizes the variance in the subspace orthogonal to them.
Thus, the principal components form an orthogonal set of vectors.
Step 6: Principal Components as Eigenvectors of $ S $¶
Each principal component $ u_i $ is the eigenvector of the original sample covariance matrix $ S $ associated with the $ i $-th largest eigenvalue $ \lambda_i $. This is because each component maximizes the variance, and variance is maximized along the directions corresponding to the largest eigenvalues of $ S $. Therefore, for each $ i $:
$$ S u_i = \lambda_i u_i, $$
where $ \lambda_i $ is the eigenvalue associated with the eigenvector $ u_i $, representing the amount of variance captured by the $ i $-th principal component.
Conclusion¶
The principal components $ u_1, u_2, \dots, u_d $ are the eigenvectors of the sample covariance matrix $ S $, corresponding to the $ d $ largest eigenvalues. They capture the directions of maximum variance in the data, and their corresponding eigenvalues represent the amount of variance explained by each component.
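The deflation argument can be checked numerically. The sketch below (assumed synthetic data) repeats the step "subtract projections, take the top eigenvector of the deflated covariance" and confirms that the resulting directions agree, up to sign, with the top-$d$ eigenvectors of the original $S$:

```python
# Deflation-based PCA vs. direct eigendecomposition of S (assumed synthetic data).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

d = 3
components = []
residual = Xc.copy()
for _ in range(d):
    S_i = residual.T @ residual / len(X)              # covariance of the deflated data X_i
    vals, vecs = np.linalg.eigh(S_i)
    u = vecs[:, -1]                                   # direction of maximum remaining variance
    components.append(u)
    residual = residual - np.outer(residual @ u, u)   # remove proj_u(x) from every point

U_deflation = np.column_stack(components)
U_direct = np.linalg.eigh(S)[1][:, ::-1][:, :d]       # top-d eigenvectors of the original S
# Agreement up to sign: |u_i^T v_i| should be ~1 for each column.
print(np.abs(np.sum(U_deflation * U_direct, axis=0)))
```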
Gaussian Mixture Model (GMM) and EM algorithm¶
Suppose we obtained the first $d$ principal components $u_1, ..., u_d$ through PCA, then we can project the data from $\mathbb{R}^D$ to $\mathbb{R}^d$:
$$x \rightarrow (u_1^T x, ..., u_d^T x)^T.$$
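In matrix form, this projection is a single product with the matrix $U$ whose columns are $u_1, ..., u_d$; a minimal sketch with placeholder data:

```python
# A minimal sketch of the projection x -> (u_1^T x, ..., u_d^T x).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))              # n = 100 points in R^D with D = 10
S = np.cov(X.T, bias=True)                  # sample covariance matrix
U = np.linalg.eigh(S)[1][:, ::-1][:, :3]    # columns u_1, u_2, u_3 (top-3 PCs), shape (D, d)
Z = X @ U                                   # row i of Z is (u_1^T x_i, ..., u_3^T x_i)
print(Z.shape)                              # (100, 3)
```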
Generating a data point from a GMM can be viewed as a two-step process:
Randomly pick one distribution from $K$ distinct Gaussian distributions.
Generate the data point according to the selected distribution.
Formally speaking, let $\{ \mathcal{N} (\mu_k, \Sigma_k) \}_{k = 1}^K$ be $K$ Gaussian distributions on $\mathbb{R}^d$ and let $\{ \pi_{k} \}_{k = 1}^K$ be a probability distribution (non-negative, $\sum_{k = 1}^K \pi_k = 1$). Then the pdf of the GMM is
$$p(x) = \sum_{k = 1}^K \pi_k \mathcal{N} (x \mid \mu_k, \Sigma_k),$$
where $\mathcal{N} (x \mid \mu_k, \Sigma_k)$ is the pdf of the distribution $\mathcal{N} (\mu_k, \Sigma_k)$. With the number of components $K$ fixed, the GMM is determined by the parameters $\theta = \{ \pi_k, \mu_k, \Sigma_k \}_{k = 1}^K$. We can then apply maximum-likelihood estimation (MLE) to estimate them. The log-likelihood is
$$\log p_{\theta} (X) = \sum_{i = 1}^n \log p_{\theta} (x_i) = \sum_{i = 1}^n \log \left( \sum_{k = 1}^K \pi_k \mathcal{N} (x_i \mid \mu_k, \Sigma_k) \right).$$
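For fixed parameters, this log-likelihood is straightforward to evaluate numerically; the sum inside the logarithm is best computed with a log-sum-exp for stability. The sketch below uses placeholder data and arbitrarily chosen parameters purely for illustration:

```python
# Evaluating log p_theta(X) for fixed (assumed) GMM parameters.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # projected data in R^d (placeholder)
pi = np.array([0.5, 0.3, 0.2])                      # mixing weights (assumed)
mus = [np.zeros(2), np.ones(2), -np.ones(2)]        # component means (assumed)
Sigmas = [np.eye(2)] * 3                            # component covariances (assumed)

# log_comp[i, k] = log pi_k + log N(x_i | mu_k, Sigma_k)
log_comp = np.column_stack([
    np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
    for k in range(len(pi))
])
log_likelihood = logsumexp(log_comp, axis=1).sum()  # sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)
print(log_likelihood)
```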
This objective is hard to maximize directly because the sum sits inside the logarithm, so we decompose it into tractable terms. Let us temporarily assume that we know which component each data point comes from.
Let $z_i \in \{ 1, ..., K \}$ be the index of the Gaussian component from which $x_i$ is generated. Then the log-likelihood of the extended dataset $\{ (x_i, z_i) \}_{i = 1}^n$ is
$$ \begin{align} \log p_{\theta} (X, Z) &= \sum_{i = 1}^n \log p_{\theta} (x_i, z_i) \\ &= \sum_{i = 1}^n \log (\pi_{z_i} \mathcal{N} (x_i \mid \mu_{z_i}, \Sigma_{z_i})) \\ &= \sum_{i = 1}^n \log \pi_{z_i} + \sum_{i = 1}^n \log \mathcal{N} (x_i \mid \mu_{z_i}, \Sigma_{z_i}), \end{align} $$
which is much easier to maximize. Suppose $q$ is a distribution over $\{ 1, ..., K \}^n$, i.e., over the latent assignments $Z = (z_1, ..., z_n)$; we then have
$$ \begin{align} \log p_{\theta}(X) &= \mathbb{E}_{z \sim q} (\log p_{\theta} (X)) \\ &= \mathbb{E}_{z \sim q} \left[ \log \left( \frac{p_{\theta}(X, Z)}{p_{\theta}(Z \mid X)} \cdot \frac{q(Z)}{q(Z)} \right) \right] \\ &= \mathbb{E}_{z \sim q} \left[ \log \left( \frac{p_{\theta}(X, Z)}{q(Z)} \cdot \frac{q(Z)}{p_{\theta}(Z \mid X)} \right) \right] \\ &= \mathbb{E}_{z \sim q} \left[ \log \left( \frac{p_{\theta}(X, Z)}{q(Z)} \right) + \log \left( \frac{q(Z)}{p_{\theta}(Z \mid X)} \right) \right] \\ &= \mathbb{E}_{z \sim q} \left[ \log \left( \frac{p_{\theta} (X, Z)}{q(Z)} \right) \right] + \mathbb{E}_{z \sim q} \left[ \log \left( \frac{q(Z)}{p_{\theta}(Z \mid X)} \right) \right]. \end{align} $$
We define
$$ \begin{align} M(q, \theta) &= \mathbb{E}_{z \sim q} \left[ \log \left( \frac{p_{\theta}(X, Z)}{q(Z)} \right) \right] \\ E(q, \theta) &= \mathbb{E}_{z \sim q} \left[ \log \left( \frac{q(Z)}{p_{\theta}(Z \mid X)} \right) \right]. \end{align} $$
We then have the following two properties.
$\log p_{\theta} (X) \ge M(q, \theta)$, $\forall q, \theta$
This property can be shown as follows:
$$ \begin{align} \mathbb{E}_{z \sim q} \left[ \log \left( \frac{q(Z)}{p_{\theta}(Z \mid X)} \right) \right] &= \mathbb{E}_{z \sim q} \left[ - \log \left( \frac{p_{\theta}(Z \mid X)}{q(Z)} \right) \right] \\ &\ge - \log \mathbb{E}_{z \sim q} \left[ \frac{p_{\theta} (Z \mid X)}{q(Z)} \right] \\ &= - \log \left( \sum_{Z \in \{ 1, ..., K \}^n} q(Z) \frac{p_{\theta}(Z \mid X)}{q(Z)} \right) \\ &= - \log \left( \sum_{Z \in \{ 1, ..., K \}^n} p_{\theta}(Z \mid X) \right) \\ &= - \log \left( 1 \right) \\ &= 0, \end{align} $$
where the second line is due to Jensen's Inequality and the fact that $-\log$ is a convex function.
$\log p_{\theta} (X) = M(q^*, \theta)$ when $q^* = p_{\theta}(\cdot \mid X)$, $\forall \theta$
Note that
$$M(q, \theta) = \log p_{\theta} (X) \Leftrightarrow E(q, \theta) = 0,$$
which is obviously satisfied when $q(Z) = p_{\theta} (Z \mid X)$.
Motivated by these properties, we obtain the following EM algorithm (the GMM-specific closed-form updates are stated after the listing).
Initialize $\theta_0$
For $t = 1, ...$, update
$q^* = \text{argmin}_q E(q, \theta^{t - 1}) = p_{\theta^{t - 1}}(\cdot \mid X)$
$\theta^t = \text{argmax}_{\theta} M(q^*, \theta) = \text{argmax}_{\theta} \mathbb{E}_{z \sim q^*} \left[ \log p_{\theta} (X, Z) \right]$
Until convergence.
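For the GMM, both updates have standard closed-form solutions (stated here for reference rather than derived in these notes). Writing $\gamma_{ik} := q^*(z_i = k)$ for the responsibilities, the steps read

$$ \begin{align} \gamma_{ik} &= \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j = 1}^K \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \quad \text{(E-step)}, \\ \pi_k &= \frac{1}{n} \sum_{i = 1}^n \gamma_{ik}, \qquad \mu_k = \frac{\sum_{i = 1}^n \gamma_{ik} x_i}{\sum_{i = 1}^n \gamma_{ik}}, \qquad \Sigma_k = \frac{\sum_{i = 1}^n \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i = 1}^n \gamma_{ik}} \quad \text{(M-step)}. \end{align} $$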
Note that the updates for both $q$ and $\theta$ are tractable, as discussed above. It is not hard to see that $\{ \log p_{\theta^t}(X) \}_{t \ge 0}$ is non-decreasing by the properties of $E$ and $M$.
To show the non-decreasing property, let $q^*(Z) = p_{\theta^{t}} (Z \mid X)$; we then have
$$ \begin{align} \log p_{\theta^{t + 1}} (X) &\ge M(q^*, \theta^{t + 1}) \\ &\ge M(q^*, \theta^t) \\ &= \log p_{\theta^t} (X). \end{align} $$
Despite its computational advantages, there is no guarantee that the EM algorithm converges to the global maximizer of $\log p_{\theta}(X)$; it may only converge to a local optimum.
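To make the algorithm concrete, here is a compact NumPy sketch of EM for a GMM on assumed synthetic data (not a reference implementation from the course); it also checks empirically that the log-likelihood sequence is non-decreasing, as argued above.

```python
# A compact EM sketch for a GMM (assumed synthetic data).
# It tracks log p_theta(X) across iterations and checks it never decreases.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(150, 2)), rng.normal(2, 1, size=(150, 2))])
n, d = X.shape
K = 2

# Initialization theta_0 (assumed: uniform weights, random means, identity covariances)
pi = np.full(K, 1.0 / K)
mus = X[rng.choice(n, K, replace=False)]
Sigmas = np.stack([np.eye(d)] * K)

log_liks = []
for t in range(50):
    # E-step: q^* = p_{theta}(. | X), i.e. responsibilities gamma[i, k]
    log_comp = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(K)
    ])
    log_norm = logsumexp(log_comp, axis=1, keepdims=True)
    gamma = np.exp(log_comp - log_norm)
    log_liks.append(log_norm.sum())          # current log p_theta(X)

    # M-step: maximize E_{Z~q^*}[log p_theta(X, Z)] in closed form
    Nk = gamma.sum(axis=0)
    pi = Nk / n
    mus = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

# Monotonicity check: the sequence {log p_{theta^t}(X)} should be non-decreasing
assert all(b >= a - 1e-8 for a, b in zip(log_liks, log_liks[1:]))
print(log_liks[0], "->", log_liks[-1])
```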