227-0391-00: Medical Image Analysis
Section 4
Segmentation
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/01/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ender Konukoglu, Professor Mauricio Reyes, and Doctor Ertunc Erdil. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Basic segmentation¶
Approach 1: Thresholding¶
Simple model
$$L(x) = \begin{cases} 1 & I(x) > \tau \\ 0 & I(x) \le \tau \end{cases}$$
or vice-versa.
Particularly useful for easy foreground-background separation.
Useful as a preprocessing step for region-of-interest determination.
Used routinely in histology and other microscopic images where stains provide the necessary contrast.
How to determine the threshold: histogram analysis¶
How to determine the threshold: automatic methods¶
Methods to determine peaks and troughs of the histogram
Otsu's thresholding method
$$\min p_1 \sigma_1^2 + p_2 \sigma_2^2$$
where $p_{1,2}$ are the proportions of pixels in the foreground and background, and $\sigma_{1,2}^2$ are the intensity variances within the two groups. (A code sketch follows after the remarks below.)
Remarks
Global thresholding (for the entire image) or local thresholding (per ROI or patch)
Image noise can lead to isolated FG or BG islands
Multiple objects would require multiple thresholds
A nice generalization is the K-Means algorithm
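A minimal NumPy sketch of Otsu's criterion above, assuming a grayscale image stored as a NumPy array (the synthetic image and the number of histogram bins are illustrative):

```python
import numpy as np

def otsu_threshold(img, n_bins=256):
    """Return the threshold that minimizes the within-class intensity variance."""
    hist, bin_edges = np.histogram(img.ravel(), bins=n_bins)
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    p = hist.astype(float) / hist.sum()              # histogram probabilities

    best_tau, best_score = centers[0], np.inf
    for k in range(1, n_bins):                       # candidate split between bins k-1 and k
        p1, p2 = p[:k].sum(), p[k:].sum()
        if p1 == 0 or p2 == 0:
            continue
        mu1 = (p[:k] * centers[:k]).sum() / p1
        mu2 = (p[k:] * centers[k:]).sum() / p2
        var1 = (p[:k] * (centers[:k] - mu1) ** 2).sum() / p1
        var2 = (p[k:] * (centers[k:] - mu2) ** 2).sum() / p2
        score = p1 * var1 + p2 * var2                # within-class variance p1*sigma1^2 + p2*sigma2^2
        if score < best_score:
            best_score, best_tau = score, centers[k]
    return best_tau

# Usage on a synthetic bimodal image
img = np.concatenate([np.random.normal(60, 10, 5000),
                      np.random.normal(170, 15, 5000)]).reshape(100, 100)
tau = otsu_threshold(img)
mask = img > tau   # L(x) = 1 where I(x) > tau
```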
Approach 2: K-Means clustering¶
Unsupervised clustering algorithm
Voxels / pixels are clustered according to features
- Intensity
- Color
- Temporal sequence
- Other features, e.g., gradient, wavelet transform, ...
Automatically assigns each voxel / pixel to one of N clusters.
Assume we have two features (shown in x and y axis), multiple pixels (each point) and two clusters.
Overview of the K-Means algorithm¶
We want to distribute pixels to N groups using features
Criteria
- Homogeneity within groups
- Reducing variance over features within group
- Take into account multiple features
Unsupervised - no prior information about the structures or groups
Iterative algorithm that assigns pixels to groups, computes group means and reassigns based on distance to group centers
Remarks
It aims to create clusters whose features are as homogeneous as possible within each cluster, i.e., to find cluster assignments and mean values that explain the data as well as possible with the given number of clusters.
Very easy to implement.
Choice of number of clusters has a major influence.
Methods exist to choose it automatically:
- Heuristic methods based on variance
- Non-parametric Bayesian methods
Initialization is important. K-Means can get stuck in local minima:
- Run it multiple times with different initializations
- Multi-scale methods for robustness
A probabilistic formulation is possible, which can be more robust and allows extensions, e.g., $$\max \ln p(I)$$
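A minimal sketch of K-Means clustering on pixel intensities, assuming a single intensity feature per pixel (the image, number of clusters, and stopping rule are illustrative):

```python
import numpy as np

def kmeans_segment(img, n_clusters=3, n_iters=50, seed=0):
    """Cluster pixel intensities with K-Means and return a label image."""
    rng = np.random.default_rng(seed)
    feats = img.reshape(-1, 1).astype(float)             # one feature per pixel: intensity
    centers = feats[rng.choice(len(feats), n_clusters, replace=False)]

    for _ in range(n_iters):
        # Assignment step: each pixel goes to the nearest cluster center
        dists = np.abs(feats - centers.T)                 # (n_pixels, n_clusters)
        labels = dists.argmin(axis=1)
        # Update step: recompute cluster means
        new_centers = np.array([feats[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(n_clusters)])
        if np.allclose(new_centers, centers):             # converged
            break
        centers = new_centers
    return labels.reshape(img.shape), centers

# Usage on a toy image
label_map, centers = kmeans_segment(np.random.rand(64, 64), n_clusters=3)
```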
Expectation-Maximization segmentation - mixture model¶
Similar idea as K-Means, but in a probabilistic view.
Start by assuming all pixels are independent and model the intensities as a mixture model.
Mixture model for a given feature vector $f$
$$p(f) = \sum_{n=1}^{N} p(f|c=n) p(c=n)$$
$p(c)$ is the prior probability of observing (latent) label $c$. Also called mixture coefficients.
$p(f|c)$ is the likelihood model defined at that point. Often taken as a Gaussian
$$p(f|c=n) = \mathcal{N}(f|\mu_n, \Sigma_n) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_n|}} \exp\left( -\frac{1}{2} (f - \mu_n)^T \Sigma_n^{-1} (f - \mu_n) \right)$$ where $d$ is the dimension of the feature vector $f$.
$\mu_n$ and $\Sigma_n$ are class specific parameters - Gaussian mixture model.
Mixture model - 1D example with 3 (latent) labels¶
Remark
Gaussian mixture model may not be a bad approximate model for the intensities.
How to use the mixture model?¶
Fit the model parameters to the image intensities in an unsupervised way. We only assume a number of classes N, assume the intensities at different pixels are independent of each other, and use them as the features to which we fit the model
$$\max_{\theta} \log p(I|\theta) = \max_{\theta} \log \prod_{x \in \Omega} p(I(x) | \theta) = \max_{\theta} \sum_{x \in \Omega} \log p(I(x) | \theta)$$
$$\log p(I(x)|\theta) = \log \sum_{n=1}^{N} p(I(x) | c(x) = n)p(c(x) = n)$$
$\theta$ is the set of model parameters:
Class probabilities: $p(c(x) = n) = \pi_n$
Likelihood parameters: $\mu_n$, $\Sigma_n$
After fitting, segmentation is defined through the posterior distribution using the optimal parameters
$$c^*(x) = \arg \max_{c(x)} p(c(x) | I(x))$$
We can optimize using gradient descent - possible but difficult to optimize
Better alternative: Expectation-Maximization
Expectation-Maximization algorithm¶
Key idea - two alternating steps¶
E-step: assume a set of likelihood parameters and "soft" assign samples to labels
M-step: assume soft assignments and maximize likelihood parameters, e.g., find the best mean and standard deviations of the Gaussians
Derivation¶
Dropping the dependence on $x$ for simplicity for now and using Bayes' rule
$$\log p(I|\theta) = \log p(I, c|\theta) - \log p(c|I, \theta)$$
Let us assume we are given an $\theta_{old}$, then
$$ \begin{align} \mathbb{E}_{p(c|I, \theta_{old})} [\log p(I|\theta)] &= \mathbb{E}_{p(c|I,\theta_{old})}[\log p(I, c | \theta)] - \mathbb{E}_{p(c|I, \theta_{old})}[\log p(c|I, \theta)]\\ \log p(I|\theta) &= \mathbb{E}_{p(c|I,\theta_{old})}[\log p(I, c | \theta)] - \mathbb{E}_{p(c|I, \theta_{old})}[\log p(c|I, \theta)]\\ &= Q(\theta | \theta_{old}) - \mathbb{E}_{p(c|I, \theta_{old})}[\log p(c|I, \theta)] \end{align} $$
The left-hand side does not depend on $c$, so taking the expectation leaves $\log p(I|\theta)$ unchanged.
One can show that
$$0 \ge \mathbb{E}_{p(c|I, \theta_{old})}[\log p(c|I, \theta_{old})] \ge \mathbb{E}_{p(c|I, \theta_{old})}[\log p(c|I, \theta)]$$
This means two things:
$\log p(I | \theta) \ge Q(\theta | \theta_{old})$
If we find a $\theta$ that increases $Q(\theta | \theta_{old})$ we also increase $\log p(I | \theta)$
Iterative algorithm with two step process¶
E-step: compute $p(c | I, \theta_{old})$ given $\theta_{old}$
$$ \begin{align} p(c(x) | I(x), \theta_{old}) &= \frac{p(I(x) | c(x), \theta_{old}) p(c(x) | \theta_{old})}{p(I | \theta_{old})} \\ &= \frac{\mathcal{N}(I(x) | \mu_{c(x)}^{old}, \Sigma_{c(x)}^{old}) \pi_{c(x)}^{old}}{\sum_{n=1}^{N}\mathcal{N}(I(x) | \mu_{n}^{old}, \Sigma_{n}^{old}) \pi_{n}^{old}} \end{align} $$
M-step: maximize $Q(\theta | \theta_{old})$, i.e., $\max_{\theta} \mathbb{E}_{p(c|I, \theta_{old})}[\log p(I, c|\theta)]$
$$ \begin{align} \pi_n &= \frac{\sum_{x \in \Omega} p(c(x) = n | I(x), \theta_{old})}{|\Omega|}\\ \mu_n &= \frac{\sum_{x \in \Omega} I(x) p(c(x) = n | I(x), \theta_{old})}{ \sum_{x \in \Omega} p(c(x) = n | I(x), \theta_{old})}\\ \Sigma_n &= \frac{\sum_{x \in \Omega} (I(x) - \mu_n)(I(x) - \mu_n)^T p(c(x) = n | I(x), \theta_{old})}{\sum_{x \in \Omega} p(c(x) = n | I(x), \theta_{old})} \end{align} $$
Set $\theta_{old} = \theta^*$
Start with a random $\theta_{old}$ and iterate the E and M steps (a sketch follows below)
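A minimal 1D sketch of the E and M updates above on pixel intensities, assuming scalar (single-feature) intensities so that $\Sigma_n$ reduces to a variance $\sigma_n^2$ (the synthetic image and initialization are illustrative):

```python
import numpy as np

def em_gmm_segment(img, n_classes=3, n_iters=100, seed=0):
    """Fit a 1D Gaussian mixture to pixel intensities with EM and return a hard segmentation."""
    rng = np.random.default_rng(seed)
    x = img.ravel().astype(float)
    # Initialization: random means, global variance, uniform mixture weights
    mu = rng.choice(x, n_classes)
    var = np.full(n_classes, x.var())
    pi = np.full(n_classes, 1.0 / n_classes)

    for _ in range(n_iters):
        # E-step: responsibilities p(c(x) = n | I(x), theta_old)
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik * pi
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for pi_n, mu_n, sigma_n^2
        Nk = resp.sum(axis=0) + 1e-12
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

    labels = resp.argmax(axis=1).reshape(img.shape)   # hard segmentation via the posterior
    return labels, (pi, mu, var)

# Usage on a synthetic 3-class image
img = np.random.normal([50, 120, 200], 10, size=(64, 64, 3)).reshape(64, 192)
labels, params = em_gmm_segment(img, n_classes=3)
```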
Example - 3 components¶
Example - 5 components¶
Remarks
Quite robust method
Initialization is important; a poor initialization can degrade the result
The number of components is important. We may need prior information regarding how many different structures we would like to extract from the images
Outputs already useful - gray matter density maps
Used very regularly - in famous tools such as SPM, FSL, and Freesurfer
Extension to include better prior information, e.g., atlas, is possible
Atlas-based segmentation¶
Adding prior information to the segmentation - no longer a simple clustering
Template at a reference frame
For each location, it provides information about what structure is expected there
Deterministic - world atlas
Probabilistic, i.e., $p(c(x))$ - brain atlas
Applicable to any anatomical structure
Created by averaging many different images. The one shown above was made using 152 different images
Formulation¶
Aligning to a reference frame¶
Align the atlas to the image
Linear or non-linear registration
The image is mapped on a reference frame
At each point in the image now we have a rough idea what structure to find
Probabilistic formulation¶
Start with the mixture model
$$ \begin{align} p(I(x) | \theta) &= \sum_{n=1}^{N} p(I(x) | c(x) = n, \theta) p(c(x) = n | \theta)\\ &= \sum_{n=1}^{N} \mathcal{N}(I(x) | \mu_n, \Sigma_n) \pi_n \end{align} $$
where $\theta = \{\mu_{n}, \Sigma_{n}, \pi_n\}_{n=1}^{N}$. Parameters are determined in an unsupervised way.
Add the atlas information
$$p(I(x) | \theta) = \sum_{n=1}^{N} \mathcal{N}(I(x) | \mu_n, \Sigma_n) p_{atlas} (c(x) = n)$$
where $\theta = \{\mu_n, \Sigma_{n}\}_{n=1}^N$. Atlas brings in prior information.
The posterior takes both the atlas and the image intensities into account
$$p(c(x) = n | I(x)) = \frac{\mathcal{N}(I(x) | \mu_n, \Sigma_n) p_{atlas} (c(x) = n)}{\sum_{n=1}^{N}\mathcal{N}(I(x) | \mu_n, \Sigma_n) p_{atlas} (c(x) = n)}$$
Information propagation from the atlas to the image
No need to optimize for class probabilities $p(c(x) = n) = \pi_n$
For MRI, we still need to optimize for $\mu_n$ and $\Sigma_n$
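A small sketch of how an aligned probabilistic atlas enters the posterior above, assuming scalar intensities and an already-aligned atlas array (the toy image, atlas prior, and Gaussian parameters are hypothetical):

```python
import numpy as np

def atlas_posterior(img, atlas_prior, mu, var):
    """Combine a Gaussian intensity likelihood with an aligned atlas prior.

    img:         (H, W) intensity image
    atlas_prior: (H, W, N) aligned atlas, atlas_prior[..., n] = p_atlas(c(x) = n)
    mu, var:     per-class Gaussian parameters (length-N arrays)
    """
    x = img[..., None]                                    # (H, W, 1) for broadcasting
    lik = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    post = lik * atlas_prior                              # N(I | mu_n, var_n) * p_atlas(c = n)
    post /= post.sum(axis=-1, keepdims=True)              # normalize over classes
    return post, post.argmax(axis=-1)                     # soft posterior and MAP labels

# Hypothetical example: 2 classes with a smooth left/right atlas prior
img = np.random.normal(100, 10, (32, 32)); img[:, 16:] += 60
prior_fg = np.tile(np.linspace(0.1, 0.9, 32), (32, 1))
atlas_prior = np.stack([1 - prior_fg, prior_fg], axis=-1)
posterior, labels = atlas_posterior(img, atlas_prior,
                                    mu=np.array([100.0, 160.0]),
                                    var=np.array([100.0, 100.0]))
```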
Without atlas¶
With atlas¶
Remarks
Much better segmentation.
No ambiguity in what the components represent. We know what we ask for.
Used very commonly.
Easily generalizable to other body parts with an appropriate atlas.
Can be generalized to other modalities. May fix all parameters for CT for instance.
An outlier class can be added to detect outliers.
Need an atlas for each body part.
Atlas itself is very important. Need an atlas for different age groups, gender, race...
Accuracy of image registration is very important.
Remedies
Multi-atlas segmentation: Instead of using one atlas, use many different atlases, propagate information from all of them, and aggregate.
Patch matching: Use patch matching to perform local searches in a sliding-window manner.
Use many atlases rather than one¶
Align all the atlas images to the test image
Aggregate information coming from all the atlases
Remodeling $p_{atlas}(c(x) = n)$
During aggregation, weight each atlas based on the quality of its registration (a sketch follows after the remarks below)
- $w_{m} \propto \| I(x) - A_{m} \|_2$
- $w_{m} \propto \| I(x) - T \circ A_m \|_2$
- $w_{m} \propto |J_{T_{A_m} \rightarrow I}|$
Remark
Solves some of the problems of atlas-based segmentation
Highly accurate results
State-of-the-art for a very long time
Leads to elegant probabilistic models
Often requires non-linear registration
Computationally very expensive due to many registrations
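A sketch of weighted label fusion over several registered atlases; the exponential intensity-based weighting and the toy "registered" atlases are assumptions for illustration, not the specific weighting used in the lecture:

```python
import numpy as np

def weighted_label_fusion(img, warped_atlas_imgs, warped_atlas_labels, n_classes, beta=1e-3):
    """Fuse labels propagated from M registered atlases.

    warped_atlas_imgs:   list of M atlas images already warped to the target image
    warped_atlas_labels: list of M label maps warped with the same transformations
    beta:                controls how strongly registration quality affects the weights
    """
    votes = np.zeros(img.shape + (n_classes,))
    for atlas_img, atlas_lab in zip(warped_atlas_imgs, warped_atlas_labels):
        # Weight each atlas voxel-wise by intensity agreement after registration
        w = np.exp(-beta * (img - atlas_img) ** 2)
        for n in range(n_classes):
            votes[..., n] += w * (atlas_lab == n)
    votes /= votes.sum(axis=-1, keepdims=True)   # re-modeled p_atlas(c(x) = n)
    return votes, votes.argmax(axis=-1)

# Hypothetical usage with 3 toy "registered" atlases
img = np.random.rand(16, 16)
imgs = [img + 0.05 * np.random.randn(16, 16) for _ in range(3)]
labs = [(im > 0.5).astype(int) for im in imgs]
prior, fused = weighted_label_fusion(img, imgs, labs, n_classes=2)
```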
Spatial constraints in segmentation¶
Morphological operations¶
Shift-invariant
Non-linear
Based on neighboring pixels defined through structural elements
Applied in a sliding-window approach, i.e., the structural element slides across the image
View binary images as sets of pixels
Defined in binary but extended to gray-level images
Two main operations
Dilation $$C = A \oplus B \stackrel{\Delta}{=} \{ x | B_x \cap A \neq \emptyset \}$$
Erosion $$C = A \ominus B \stackrel{\Delta}{=} \{x | B_x \subseteq A \}$$
$A$ is the image and $B$ is a structural element, e.g., $\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$.
Dilation and erosion - $B = 1_{7 \times 7}$¶
Dilation and erosion - $B = 1_{13 \times 13}$¶
Derived operations¶
Opening: $(A \ominus B) \oplus B$
Closing: $(A \oplus B) \ominus B$
Opening and closing are useful - $B = 1_{13 \times 13}$
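A short sketch of the four operations with SciPy's binary morphology routines and a square structuring element, assuming SciPy is available (the toy mask is illustrative):

```python
import numpy as np
from scipy import ndimage

# Binary mask with a small hole inside the object and an isolated noise pixel
A = np.zeros((40, 40), dtype=bool)
A[10:30, 10:30] = True
A[20, 20] = False          # hole inside the object
A[2, 2] = True             # isolated foreground island

B = np.ones((3, 3), dtype=bool)   # structuring element

dilated = ndimage.binary_dilation(A, structure=B)   # A (+) B: grows the object
eroded  = ndimage.binary_erosion(A, structure=B)    # A (-) B: shrinks the object
opened  = ndimage.binary_opening(A, structure=B)    # (A (-) B) (+) B: removes small islands
closed  = ndimage.binary_closing(A, structure=B)    # (A (+) B) (-) B: fills small holes
```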
Approach 2: Random field priors¶
Neighborhood consistency¶
Formulating neighborhood consistency in a probabilistic model / energy model
Main idea is to penalize inconsistent labeling of neighboring voxels
Use the Markovian property to define the penalty
Markov random fields: principle¶
Let us focus on the joint distribution of labels and intensities
$$p(I(x)) = \sum_{n=1}^{N} p(I(x) | c(x) = n)p(c(x) = n)$$
$$p(I(x), c(x)) = p(I(x) | c(x)) p(c(x))$$
$$p(I,c) = \prod_{x \in \Omega} p(I(x) | c(x))p(c(x))$$
where all pixels are independent.
In order to model connections between pixels, we change the prior, and the joint distribution becomes
$$p(I, c) = p(c) \prod_{x\in\Omega} p(I(x) | c(x))$$
Markovian property¶
MRF sets up the prior distribution based on the Markovian property
Specific form depends on the neighborhood structure
$$G(x) = \text{neighbors of }x$$
$$p(c(x) | c(/x)) = p(c(x) | c(G(x)))$$
Given all other labels, $c(x)$ depends only on its immediate neighbors.
Energy formulation¶
$$p(c(x) | c(/x)) = p(c(x) |c(G(x)))$$
A common way to define it is through defining an energy
$$E(c) = \sum_{x\in \Omega} \sum_{y \in G(x)} d(c(x), c(y))$$
$d(c(x), c(y))$ is a distance between the labels.
Given the energy we can define a probability distribution
$$p(c) = \frac{1}{Z} \exp\{-E(c)\}$$
$Z$ is the normalization constant. Lower distance means higher probability. It enforces consistency.
Gibbs distribution¶
$$E(c) = \sum_{x\in \Omega} \sum_{y \in G(x)} d(c(x), c(y)) \leftrightarrow p(c) = \frac{1}{Z} \exp\{-E(c)\}$$
Remark: Hammersley-Clifford theorem
Any probability distribution that satisfies a Markovian property is a Gibbs distribution for an appropriate locally-defined energy, and vice-versa.
Segmentation via posterior maximization¶
Given the energy function $E(c)$
$$p(c) = \frac{1}{Z} \exp \{-E(c)\}$$
Posterior distribution becomes
$$p(c|I) = \frac{p(I|c)p(c)}{p(I)}$$
Note that the posterior is no longer point-wise. It is a joint distribution. Segmentation via posterior maximization:
$$ \begin{align} \arg \max_{c} p(c|I) &= \arg \max_c p(I|c)p(c)\\ &= \arg \max_c \exp\{-E(c)\} \prod_{x\in \Omega}p(I(x) | c(x)) \\ \end{align} $$
assuming given the label of a voxel, its intensity is independent from the intensities of the other voxels.
If the data model is also exponential of another energy
$$p(I(x) | c(x)) \propto \exp\{-g(I(x) | \theta_{c(x)})\}$$
Then we end up with an energy minimization formulation
$$ \begin{align} \arg \max_c p(c|I) &= \arg \max_c \ln p(c|I)\\ &= \arg \max_c \left\{ -\sum_{x \in \Omega} g(I(x) | \theta_{c(x)}) - \sum_{x\in\Omega} \sum_{y\in G(x)} d(c(x), c(y)) \right\}\\ &= \arg \min_c \left\{ \underbrace{\sum_{x \in \Omega} g(I(x) | \theta_{c(x)})}_{\text{unary term}} + \underbrace{\sum_{x\in\Omega} \sum_{y\in G(x)} d(c(x), c(y))}_{\text{pairwise term}} \right\}\\ \end{align} $$
Unary term is also the data term in this model. Sometimes called data fidelity term.
However, it could also include any unary prior information on the labels.
Pairwise term is the consistency term that imposes neighboring pixels to have similar labels.
When $c(x)$ is a continuous variable, as in regression problems, the optimization can be tackled with usual techniques.
When $c(x)$ is a categorical variable, e.g., segmentation, then the optimization can be challenging.
Example¶
Observational model
$$p(I(x) | c(x)) = \mathcal{N}(I(x) | \mu_{c(x)}, \Sigma_{c(x)})$$
where $g(I(x) | \theta_{c(x)}) = (I(x) - \mu_{c(x)})^T \Sigma_{c(x)}^{-1} (I(x) - \mu_{c(x)})$. Energy model for the prior distribution - Ising / Potts Model.
$$d(c(x), c(y)) = \lambda\, \delta(c(x) \neq c(y)), \qquad \delta(c(x) \neq c(y)) = \begin{cases} 0 & c(x) = c(y) \\ 1 & c(x) \neq c(y) \end{cases}$$
Segmentation is the solution of
$$\arg \min_c \sum_{x \in \Omega} (I(x) - \mu_{c(x)})^T \Sigma_{c(x)}^{-1} (I(x) - \mu_{c(x)}) + \lambda \sum_{x \in \Omega} \sum_{y \in G(x)} \delta(c(x) \neq c(y))$$
where $\lambda$ becomes a trade-off parameter between data fidelity and neighborhood consistency.
Very difficult to compute the posterior distribution
Difficult to solve the optimization exactly in 2D and higher
NP-hard
Approximate energy minimization and posterior sampling
- Gibbs sampling
- Iterated conditional modes (ICM)
- Graph cuts
Stochastic relaxation / Gibbs sampling principle¶
Iterative method to approximate the solution
Monte-Carlo sampling method
Sampling from the full posterior distribution is hard
$$p(c | I) = \frac{p(I | c) p(c)}{p(I)} = \frac{p(I|c)p(c)}{\sum_{c} p(I|c)p(c)}$$
Much easier to sample from this
$$p(c(x) | c(/x), I) = \frac{p(I(x) | c(x)) p(c(x) | c(G(x)))}{\sum_{c(x)}p(I(x) | c(x)) p(c(x) | c(G(x)))}$$
Start from an initial assignment $c_0$, e.g., the maximum-likelihood labeling $\arg \max_c \prod_{x \in \Omega} p(I(x) | c(x))$
Sample from the posterior of each voxel given the others
Iterate over all voxels, repeating for a number of sweeps (a sketch follows below)
Convergence properties
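A minimal sketch of Gibbs sampling for the Gaussian-likelihood + Potts-prior model of the example above, assuming scalar intensities and a 4-connected neighborhood (the toy image, parameters, and number of sweeps are illustrative):

```python
import numpy as np

def gibbs_mrf_segment(img, mu, var, lam=1.0, n_sweeps=10, seed=0):
    """Gibbs sampling for the Gaussian-likelihood + Potts-prior segmentation model."""
    rng = np.random.default_rng(seed)
    H, W = img.shape
    n_classes = len(mu)
    # Unary (data) term: negative Gaussian log-likelihood, up to a constant
    unary = 0.5 * (img[..., None] - mu) ** 2 / var + 0.5 * np.log(2 * np.pi * var)
    # Initialize with the maximum-likelihood labeling (no spatial prior)
    c = unary.argmin(axis=-1)

    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # Pairwise (Potts) energy against 4-connected neighbors
                neigh = [c[i + di, j + dj] for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                         if 0 <= i + di < H and 0 <= j + dj < W]
                pair = lam * np.array([sum(n != k for n in neigh) for k in range(n_classes)])
                energy = unary[i, j] + pair
                prob = np.exp(-(energy - energy.min()))
                prob /= prob.sum()
                c[i, j] = rng.choice(n_classes, p=prob)   # sample p(c(x) | c(/x), I)
    return c

# Usage on a toy image: bright square on a dark background
img = np.random.normal(0, 1, (32, 32)); img[8:24, 8:24] += 3
labels = gibbs_mrf_segment(img, mu=np.array([0.0, 3.0]), var=np.array([1.0, 1.0]), lam=2.0)
```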
Result without MRF¶
Result with MRF¶
Result with MRF higher $\lambda$¶
Remarks
Very simple to implement and effective
Solution depends on parameters
Solution depends on initial conditions
May not converge to a pleasing result
Convergence may be slow
Stochastic in nature - getting something different every time
Other methods exist and may provide better results
Complicated optimization problem
Global fusion strategies - majority voting¶
Shape-based averaging fusion¶
Average the shapes of the segmentations, based on averaging the distance transforms of the segmentations.
Combining multiple segmentations: STAPLE¶
STAPLE (Simultaneous Truth and Performance Level Estimation) estimates the underlying ground-truth segmentation from multiple segmentations; at the same time, it tells us the performance (sensitivity and specificity) of each segmentation.
Sensitivity $p$: true positive fraction
Specificity $q$: true negative fraction
Can we find the performance of each segmentation $(p, q)$ and the ground truth ($T$) that maximize the likelihood of observing the segmentations ($D$)?
$D$ (data / expert segmentation we want to merge) is known
$T$ (the ground truth) is unknown
$p$, $q$ (performance of each segmentation) is unknown
- E-step: starting from initial estimates $p^{(0)}$, $q^{(0)}$, compute $f(T | D, p^{(0)}, q^{(0)})$
- M-step: use the current ground-truth estimate to compute the next $p^{(1)}$ and $q^{(1)}$ (a sketch follows below)
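A simplified binary-label sketch of the STAPLE EM iterations, assuming a fixed voxel-wise foreground prior and independent raters (the initial values and toy data are illustrative, not the original STAPLE implementation):

```python
import numpy as np

def staple_binary(D, n_iters=50, prior=None):
    """Simplified binary STAPLE.

    D: (R, N) array, one binary segmentation per row (R raters, N voxels).
    Returns the soft consensus W (N,) and per-rater sensitivity p and specificity q.
    """
    R, N = D.shape
    gamma = D.mean() if prior is None else prior      # prior probability of foreground
    p = np.full(R, 0.9)                               # initial sensitivities
    q = np.full(R, 0.9)                               # initial specificities

    for _ in range(n_iters):
        # E-step: posterior probability that each voxel is truly foreground
        a = gamma * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - gamma) * np.prod(np.where(D == 0, q[:, None], 1 - q[:, None]), axis=0)
        W = a / (a + b + 1e-12)
        # M-step: update each rater's sensitivity and specificity
        p = (D * W).sum(axis=1) / (W.sum() + 1e-12)
        q = ((1 - D) * (1 - W)).sum(axis=1) / ((1 - W).sum() + 1e-12)
    return W, p, q

# Hypothetical usage: 3 noisy raters of a 1D ground truth
truth = (np.random.rand(500) > 0.7).astype(int)
D = np.array([np.where(np.random.rand(500) < 0.9, truth, 1 - truth) for _ in range(3)])
W, p, q = staple_binary(D)
consensus = (W > 0.5).astype(int)
```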
Patch-based segmentation¶
Patch-based methods have been popular in the computer vision community
- Texture synthesis
- In-painting
- Restoration
- Single-frame super resolution
Basic idea
- Only coarse registration required (e.g., rigid or affine registration)
- Use the same concept as label fusion
Patch-based denoising¶
Uses a neighborhood averaging strategy called non-local means (NLM)
Assumes that every patch in a natural image has many similar patches in the same image
$$I_{nlm}(\mathbf{x}) = \frac{\sum_{\mathbf{y}\in \Omega} w(\mathbf{x}, \mathbf{y}) I(\mathbf{y})}{\sum_{\mathbf{y}\in \Omega} w(\mathbf{x}, \mathbf{y})}, \forall \mathbf{x} \in \Omega$$
where
$I_{nlm}(\mathbf{x})$ is the denoised image
$I(\mathbf{x})$ is the original image
$w(\mathbf{x}, \mathbf{y}) = f\left( \frac{\sum_{\mathbf{x}' \in \mathcal{P}_I(\mathbf{x}), \mathbf{y}' \in \mathcal{P}_{I}(\mathbf{y})}(I(\mathbf{x}') - I(\mathbf{y}'))^2}{2N\beta \hat{\sigma}^2}\right)$
- The $f(\cdot)$ is the weighting function, usually $\exp(-x)$
- $2N\beta \hat{\sigma}^2$ is the bandwidth for smoothing
Apply the same idea as in denoising to label fusion.
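A compact sketch of the non-local means weighting above, assuming $f(\cdot) = \exp(-\cdot)$ and small illustrative patch and search-window sizes (the smoothing parameter `h` plays the role of the bandwidth $2N\beta\hat{\sigma}^2$):

```python
import numpy as np

def nlm_denoise(img, patch=3, search=7, h=0.1):
    """Non-local means: replace each pixel by a weighted average of pixels
    whose surrounding patches look similar."""
    r, s = patch // 2, search // 2
    pad = np.pad(img, r + s, mode='reflect')
    H, W = img.shape
    out = np.zeros_like(img, dtype=float)

    for i in range(H):
        for j in range(W):
            ic, jc = i + r + s, j + r + s                      # center in padded coordinates
            ref = pad[ic - r:ic + r + 1, jc - r:jc + r + 1]    # patch around x
            weights, values = [], []
            for di in range(-s, s + 1):
                for dj in range(-s, s + 1):
                    cand = pad[ic + di - r:ic + di + r + 1, jc + dj - r:jc + dj + r + 1]
                    d2 = ((ref - cand) ** 2).mean()            # patch distance
                    weights.append(np.exp(-d2 / (h ** 2)))     # w(x, y) = f(distance)
                    values.append(pad[ic + di, jc + dj])
            out[i, j] = np.average(values, weights=weights)    # weighted intensity average
    return out

# Usage on a noisy toy image
noisy = np.zeros((32, 32)); noisy[8:24, 8:24] = 1.0
noisy += 0.2 * np.random.randn(32, 32)
denoised = nlm_denoise(noisy)
```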
Example: Patch-based multi-organ segmentation of abdominal CT¶
| Structure | Dice | Jaccard |
|---|---|---|
| Kidneys | 92.5 (7.2) [51.5, 98.2] | 86.8 (10.5) [34.6, 96.4] |
| Liver | 94.0 (2.8) [81.4, 97.3] | 88.9 (4.8) [68.7, 94.9] |
| Pancreas | 69.6 (16.7) [6.9, 90.0] | 55.5 (17.1) [3.6, 83.3] |
| Spleen | 92.0 (9.2) [26.4, 98.2] | 86.2 (12.7) [15.2, 96.4] |
Segmentation of CT from 150 subjects
Hierarchical, patch-based label fusion process plus graph-cut refinement
Patch-based labeling vs. sparse representation classification (SRC) vs. discriminative dictionary learning for segmentation (DDLS)¶
SRC assumes a target patch can be represented by a few representative patches.
DDLS adds a learned linear classifier
$p_t = a_1 p_1 + a_2 p_2 + \cdots + a_n p_n$
where $p_t$ is the target patch, $a_i$ is the coefficient (sparse), and $p_i$ is the patch library
$\hat{a} = \arg \min_{a} \frac{1}{2} \| p_t - P_{L} a \|_2^2 + \lambda_1 \| a \|_1 + \frac{\lambda_2}{2} \| a \|_2^2$
Decrease computational burden via smaller-size dictionary
Better exploit discriminative power of patch library
$\langle D, W, \alpha \rangle = \arg \min_{D, W, \alpha} \underbrace{\| P_L - D \alpha \|_2^2}_{\text{reconstruction error}} + \beta_1 \underbrace{\| H - W \alpha \|_2^2}_{\text{classification accuracy}}$, subject to $\| \alpha \|_0 \le T$
Grouping patch and label information: $\langle D, W, \alpha \rangle = \arg \min_{D, W, \alpha} \left\| \begin{pmatrix} P_L \\ \sqrt{\beta_1} H \end{pmatrix} - \begin{pmatrix} D \\ \sqrt{\beta_1} W \end{pmatrix} \alpha \right\|_2^2$, subject to $\| \alpha \|_0 \le T$
Impose sparsity: $\langle \tilde{D}, \alpha \rangle = \arg \min_{\tilde{D}, \alpha} \| \tilde{P}_L - \tilde{D} \alpha \|_2^2 + \beta_2 \| \alpha \|_1$, where $\tilde{D} = (D^t, \sqrt{\beta_1} W^t)^t$, $\tilde{P}_L = (P_L^t, \sqrt{\beta_1} H^t)^t$
Normalized voxel-wise dictionaries and classifiers $\tilde{D}_t = \left\{ \begin{pmatrix} \tilde{d}_1 \\ \sqrt{\beta_1} \tilde{w}_1 \end{pmatrix}, \begin{pmatrix} \tilde{d}_2 \\ \sqrt{\beta_1} \tilde{w}_2 \end{pmatrix}, \cdots, \begin{pmatrix}\tilde{d}_K \\ \sqrt{\beta_1} \tilde{w}_K \end{pmatrix} \right\}$
Final $\{D, W\}$ is
- $\hat{D}_t = \left\{ \frac{\tilde{d}_1}{\| \tilde{d}_1 \|_2}, \frac{\tilde{d}_2}{\| \tilde{d}_2 \|_2}, \cdots, \frac{\tilde{d}_K}{\| \tilde{d}_K \|_2} \right\}$
- $\hat{W}_t = \left\{ \frac{\tilde{w}_1}{\| \tilde{d}_1 \|_2}, \frac{\tilde{w}_2}{\| \tilde{d}_2 \|_2}, \cdots, \frac{\tilde{w}_K}{\| \tilde{d}_K \|_2} \right\}$
Three steps for DDLS:
- Extract target patch $p_t$
- Solve for $\hat{\alpha}_t = \arg \min_{\alpha_t} \left\| p_t - \hat{D}_t \alpha_t \right\|_2^2 + \beta_2 \| \alpha_t \|_1$
- Label voxel using classifier $\begin{cases} h_t = \hat{W} \hat{\alpha}_{t} \\ v = \arg \max_{j} h_t(j) \end{cases}$, where $h_t$ is the class label vector
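A simplified sparse-representation classification (SRC) sketch for labeling a single patch, using scikit-learn's Lasso as the sparse solver; the library construction and the residual-based labeling rule are assumptions for illustration and do not implement the learned DDLS dictionary:

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_label_patch(p_t, P_L, labels, alpha=0.01):
    """Sparse-representation classification of a target patch.

    p_t:    (d,) flattened target patch
    P_L:    (d, n) patch library, one flattened library patch per column
    labels: (n,) class label of each library patch
    """
    # Sparse code: p_t ~ P_L a with an L1 penalty on a
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(P_L, p_t)
    a = lasso.coef_
    # Assign the class whose patches best reconstruct the target (smallest residual)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(p_t - P_L[:, labels == c] @ a[labels == c]) for c in classes]
    return classes[int(np.argmin(residuals))], a

# Hypothetical library: class-0 patches are dark, class-1 patches are bright
rng = np.random.default_rng(0)
P_L = np.hstack([rng.normal(0.2, 0.05, (25, 30)), rng.normal(0.8, 0.05, (25, 30))])
labels = np.array([0] * 30 + [1] * 30)
target = rng.normal(0.8, 0.05, 25)           # a bright 5x5 patch, flattened
pred, code = src_label_patch(target, P_L, labels)
```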
Statistical shape modeling¶
Statistical shape modeling (SSM) represents a trade-off between population-level prior knowledge and patient-specific observations. For example
Patient-specific problems
- More observations
- Less prior
Standard atlas
- Fewer observations
- More prior
Principal component analysis (PCA)¶
Assume that we have 2D data points
In this example, data points are scattered along a principal direction
PCA yields the main direction (and secondary directions), and creates an orthogonal coordinate system that maximizes the variance of the points in each direction
Points can be referenced along the main component (axis) using a parameter $b$
SVD diagonalizes the covariance matrix of the original data
It directly yields the eigen-vectors and eigen-values of $Cov(M)$
$$M = \underbrace{U}_{\text{ eigen-vectors }} \underbrace{W}_{\text{ diagonal matrix of singular values }} V^T$$ where the squared singular values are proportional to the eigen-values of $Cov(M)$.
PCA assumes that the original data follows a Gaussian distribution (even if in reality it does not)
PCA results in an orthogonal lower-dimensional system, i.e., the principal axes are perpendicular
PCA is a linear method
There also exist several other linear and non-linear dimensionality reduction techniques such as factor analysis, Isomap, multidimensional scaling (MDS), etc.
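A small sketch of building a PCA shape model via the SVD of the centered data matrix and synthesizing a new shape from parameters $b$ (the toy ellipse training shapes are illustrative):

```python
import numpy as np

def pca_shape_model(shapes, n_modes=2):
    """Build a linear (PCA) shape model from aligned training shapes.

    shapes: (n_samples, n_points) matrix, one flattened shape vector per row.
    Returns the mean shape, the principal modes, and per-mode standard deviations.
    """
    mean = shapes.mean(axis=0)
    X = shapes - mean                               # center the data
    # SVD of the centered data: rows of Vt are eigenvectors of Cov(X)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    modes = Vt[:n_modes]                            # principal directions
    stds = S[:n_modes] / np.sqrt(len(shapes) - 1)   # sqrt of eigenvalues of Cov(X)
    return mean, modes, stds

def synthesize(mean, modes, stds, b):
    """Generate a new shape from mode parameters b (in units of standard deviations)."""
    return mean + (np.asarray(b) * stds) @ modes

# Hypothetical training set: noisy ellipses with varying elongation
t = np.linspace(0, 2 * np.pi, 50, endpoint=False)
shapes = np.array([np.concatenate([(1 + 0.3 * a) * np.cos(t), np.sin(t)])
                   + 0.01 * np.random.randn(100) for a in np.random.randn(40)])
mean, modes, stds = pca_shape_model(shapes, n_modes=2)
new_shape = synthesize(mean, modes, stds, b=[2.0, 0.0])   # mean + 2 sd along the first mode
```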
SSM for segmentation - active shape models (ASM)¶
General idea
Start with SSM and image
Initially place SSM in image
Sample along SSM surface normals, find largest gradient
Deform pose (scale, rotation, translation) and shape (PCA modes) of model to best fit to extracted image features (largest gradient)
Initialization is crucial
- Approximately align SSM with structure of interest
- Many different methods, choose the one for your application
- User interaction
- Centroid alignment
- Chamfer matching
- Atlas registration
- Global search on entire image (evolutionary algorithms)