227-0391-00: Medical Image Analysis
Section 1
Introduction and Preliminaries
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 06/26/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally; we are not responsible for any results from using this file.
This personal note is adapted from Professor Ender Konukoglu, Professor Mauricio Reyes, and Doctor Ertunc Erdil. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Basic image properties¶
Volumes¶
Key properties of a volume (illustrative figures omitted): image size, pixel / voxel size, and field-of-view (FOV).
Volume extended¶
Extensions of the basic volume (illustrative figures omitted): temporal information, metabolic information, and transformations.
Basic notation¶
Continuous version: A volumetric grayscale image is a function
- $\mathbf{I}: \Omega \rightarrow \mathbb{R}$, $\Omega \subset \mathbb{R}^3$ is the image domain
- $\mathbf{I}(\mathbf{x})$ is the intensity at point $\mathbf{x} \in \Omega$, $\mathbf{x} = [x_1, x_2, x_3]$
Discrete version:
- $\Omega \subset \mathbb{Z}^3$ is the discrete image domain, i.e., Cartesian grid
- $\mathbf{I}(\mathbf{x})$ is the intensity at $\mathbf{x} \in \Omega$, $\mathbf{x} = [i, j, k]$
For multiparametric images: such as diffusion-weighted MRI, spectroscopy, or multiple modalities combined
- $\mathbf{I}: \Omega \rightarrow \mathbb{R}^N$
- $\mathbf{I}(\mathbf{x})$ is the vector of intensities at $\mathbf{x}$
For dynamic images with temporal information
- $\mathbf{I}: \Omega \times \mathbb{R}^{+} \rightarrow \mathbb{R}^N$
- $\mathbf{I}(\mathbf{x}, t)$ is the intensity at point $\mathbf{x}$ and time $t$
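The discrete versions above map naturally onto multidimensional arrays. A minimal sketch (assuming NumPy; all sizes are hypothetical) of one common indexing convention:

```python
import numpy as np

# A discrete grayscale volume I on a Cartesian grid Omega ⊂ Z^3,
# stored as a 3-D array indexed by x = [i, j, k].
I = np.zeros((64, 64, 32))          # image size: 64 x 64 x 32 voxels
I[10, 20, 5] = 0.7                  # intensity I(x) at x = [10, 20, 5]

# Multiparametric image, I: Omega -> R^N, with N intensities per voxel.
N = 4
I_multi = np.zeros((64, 64, 32, N))
vec = I_multi[10, 20, 5]            # vector of intensities at x

# Dynamic image I(x, t): one volume per time point.
T = 10
I_dyn = np.zeros((T, 64, 64, 32))
frame = I_dyn[3]                    # volume at time index t = 3
```

Whether time or channels come first (or last) is purely a storage convention; libraries differ.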
Probabilistic modeling¶
PDF, CDF, PMF¶
Intensity at each point $\mathbf{I}(\mathbf{x}) \in \mathbb{R}$ is seen as a random variable (other functions can also be seen as random variables, e.g., transformation $T(\mathbf{x})$ or discrete labels $L(\mathbf{x})$)
For continuous random variables we define (dropping $(\mathbf{x})$ for simplicity):
- $p(i)$ as its probability density function (PDF)
- $P(i) = Pr[\mathbf{I} \le i] = \int_{-\infty}^{i} p(j) dj$ as its cumulative distribution function (CDF)
For discrete random variables, $L$, we define:
- $p(l) = Pr[L = l]$ as the probability mass function (PMF); the PMF is often loosely referred to as a PDF in scientific articles
Conditionals and the Bayes' Rule¶
When we have two variables such as $I$ and $L$:
- Joint distribution: $p(i, l)$
- Marginal distributions: $p(i) = \sum_{l=0}^{N}p(i,l)$ and $p(l) = \int_{-\infty}^{\infty} p(i, l) di$
- Conditional distributions: $p(i | l)$ and $p(l | i)$
- $p(i, l) = p(i|l)p(l) = p(l|i)p(i)$
The conditionals are linked by Bayes' rule
$$p(i|l) = \frac{p(l|i)p(i)}{p(l)},~p(l|i) = \frac{p(i|l)p(l)}{p(i)}$$
- In a large variety of problems, one of them is observed and the other not. Assume $i$ is observed in that case:
- Likelihood: $p(i|l)$
- Prior: $p(l)$
- Posterior: $p(l|i)$
Posterior distribution, MAP, and MLE¶
A large range of problems can be formulated as: given an observation $i$, estimate $l$.
Example problems: image enhancement, segmentation, and even registration
A generic solution to such problems is to determine the posterior distribution of $l$, i.e., Bayesian inference
$$p(l|i) = \frac{p(i|l)p(l)}{p(i)}$$
- $p(i)$ requires summing over all $l$ (or integrating over all $l$ if continuous), which can be infeasible. The alternative is to determine the $l$ that maximizes the posterior, i.e., the Maximum-A-Posteriori (MAP) estimate
$$\arg \max_{l} p(l | i) = \arg \max_{l} p(i|l)p(l) = \arg \max_{l} \log p(i|l) + \log p(l)$$
- Posterior distribution and MAP estimate require the prior. When such a prior does not exist, the alternative is to determine Maximum Likelihood Estimate (MLE)
$$\arg \max_{l} p(i|l)$$
- MLE is the same as MAP when the prior is uniform, i.e., $p(l) = c$, $\forall l$
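A minimal numeric sketch of MLE vs. MAP for a discrete label (all numbers are hypothetical; Gaussian likelihoods are an assumption, not the course's model):

```python
import numpy as np

# Observe an intensity i; estimate a discrete label l ∈ {0, 1}.
i = 0.8

# Assumed Gaussian likelihoods p(i | l) with hypothetical means and stds.
means, stds = np.array([0.2, 1.0]), np.array([0.3, 0.3])
lik = np.exp(-0.5 * ((i - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))

# Assumed prior p(l): label 0 is much more frequent.
prior = np.array([0.9, 0.1])

mle = np.argmax(lik)                 # arg max_l p(i | l)
map_ = np.argmax(lik * prior)        # arg max_l p(i | l) p(l); p(i) cancels

# Full posterior p(l | i); the denominator p(i) is the sum over l.
posterior = lik * prior / np.sum(lik * prior)
```

With these numbers the likelihood favors label 1 but the strong prior flips the MAP estimate to label 0, illustrating how MLE and MAP can disagree.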
Cost function, regularization, and optimization¶
Data term and regularization¶
- Most problems in medical image analysis are formulated as optimization problems
$$\arg \min_{\theta} \underbrace{\mathcal{L}(\theta)}_{\text{cost function}} = \underbrace{\mathcal{D}(\mathbf{I}; \theta)}_{\text{data term}} + \lambda \underbrace{\mathcal{R}(\theta)}_{\text{regularization}}$$
- Or the related form with explicit constraints
$$\arg \min_{\theta} \mathcal{D}(\mathbf{I}; \theta)~\text{subject to}~\mathcal{R}(\theta) = 0$$
Examples
- Image denoising: Assume $\mathbf{I}$ is a noisy image and we want to retrieve a denoised $J$
$$\arg \min_J \| I - J \|_2^2 + \lambda \| \nabla J(x) \|_1 = \arg \min_{J} \int (I(x) - J(x))^2 dx + \lambda \int |\nabla J(x)| dx$$
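The discretized denoising cost can be evaluated directly; a sketch on a 1-D toy signal (hypothetical data, not the course's code), with the data term and TV regularizer as in the formula above:

```python
import numpy as np

# Hypothetical noisy 1-D "image": a step signal plus Gaussian noise.
rng = np.random.default_rng(0)
I = np.sign(np.sin(np.linspace(0, 6, 100))) + 0.1 * rng.standard_normal(100)

def denoise_cost(J, I, lam):
    data_term = np.sum((I - J) ** 2)        # fidelity to the noisy image
    tv_term = np.sum(np.abs(np.diff(J)))    # discrete total-variation term
    return data_term + lam * tv_term

# The noisy image itself has zero data term but a large TV term; a
# smoothed candidate trades data fidelity for smoothness.
J_smooth = np.convolve(I, np.ones(5) / 5, mode="same")
c_noisy = denoise_cost(I, I, lam=1.0)
c_smooth = denoise_cost(J_smooth, I, lam=1.0)
```

Minimizing this cost over `J` (e.g., by gradient-based methods, treating the TV term with care at zero) yields the denoised image.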
- Image registration: Determine $T$ between images $I$ and $J$
$$\arg \min_{T} \int (I(x) - J(T(x)))^2 dx + \int \| \alpha \Delta T(x) + \beta \nabla (\nabla \cdot T(x)) - \gamma T(x) \|_2^2 dx$$
The MAP estimate is also in the same form; regularizers can be thought of as priors with $- \log p(\theta) \propto \mathcal{R}(\theta)$
$$\arg \max_{\theta} \log p(i | \theta) + \log p(\theta) = \arg \min_{\theta} - \log p(i | \theta) - \log p(\theta)$$
Optimization¶
Basic quantities for finite-dimensional $\theta = [\theta_1, \cdots, \theta_d]^T$:
Gradient $$\nabla \mathcal{L}(\theta) = \left[ \frac{\partial \mathcal{L}(\theta)}{\partial \theta_1}, \cdots, \frac{\partial \mathcal{L}(\theta)}{\partial \theta_d} \right]^T$$
Hessian $$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta_1^2} & \cdots & \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta_1 \partial \theta_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta_1 \partial \theta_d} & \cdots & \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta_d^2} \\ \end{bmatrix}$$
Gradient-based algorithms
- Gradient descent / ascent
- Newton's method
- Limited-memory BFGS
Gradient-free algorithms - used because the gradient is sometimes difficult to evaluate
- Nelder-Mead Simplex
- Simulated Annealing
- Powell's method, BOBYQA
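A toy sketch of two of the gradient-based methods above on a simple quadratic cost (the cost function is a hypothetical example, not from the course):

```python
import numpy as np

# Toy cost: L(theta) = (theta_1 - 1)^2 + 10 * (theta_2 + 2)^2,
# minimized at theta* = [1, -2].
def grad(theta):
    # Gradient of L, [dL/dtheta_1, dL/dtheta_2]
    return np.array([2 * (theta[0] - 1), 20 * (theta[1] + 2)])

# Hessian of L (constant, since L is quadratic).
H = np.array([[2.0, 0.0], [0.0, 20.0]])

# Gradient descent: theta <- theta - step * grad L(theta)
theta = np.zeros(2)
for _ in range(500):
    theta = theta - 0.05 * grad(theta)

# Newton's method: theta <- theta - H^{-1} grad L(theta);
# for a quadratic cost it converges in a single step.
theta_n = np.zeros(2) - np.linalg.solve(H, grad(np.zeros(2)))
```

Note how gradient descent needs many iterations (and a step size tuned to the curvature), while Newton's method uses the Hessian to jump directly to the minimum here.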
Calculus of variation¶
When $\theta$ is continuous, e.g., a deformation field or the denoised version of an image
The cost function is then called a functional
$$\mathcal{L}(\theta) = \int_{x_a}^{x_b} L(x, \theta, \nabla \theta) dx$$
- The quantity analogous to the gradient is the first variation, defined for an arbitrary $\eta$ that vanishes at the boundaries
$$\delta \mathcal{L}(\theta; \eta) \overset{\Delta}{=} \lim_{\epsilon \rightarrow 0} \frac{\mathcal{L}(\theta + \epsilon \eta) - \mathcal{L}(\theta)}{\epsilon}$$
- Setting first variation to 0 for all $\eta$ gives the Euler-Lagrange equation
$$\frac{\partial L}{\partial \theta} - \sum_{k=1}^{d} \frac{d}{dx_k} \frac{\partial L}{\partial \theta^{(k)}} = 0,~\theta^{(k)} = \frac{\partial \theta}{\partial x_k}$$
- The left-hand side is called the functional derivative $\delta \mathcal{L} / \delta \theta$, which is often used in gradient descent
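As a concrete instance, take the denoising functional with a quadratic smoothness term instead of TV (so that $L$ is differentiable everywhere), i.e., $L(x, \theta, \nabla \theta) = (I - \theta)^2 + \lambda \| \nabla \theta \|_2^2$. Then

$$\frac{\partial L}{\partial \theta} = -2(I - \theta),~\frac{\partial L}{\partial \theta^{(k)}} = 2 \lambda \theta^{(k)}$$

and the Euler-Lagrange equation becomes

$$-2(I - \theta) - 2 \lambda \Delta \theta = 0~\Leftrightarrow~\theta - \lambda \Delta \theta = I$$

so the optimal denoised image satisfies a screened Poisson equation.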
Linear basis models and function parameterizations¶
Basics for parameterization¶
Many problems require optimizing over a function, e.g., non-linear registration - transformations, bias correction - bias field
Discretizing the Euler-Lagrange equation at every point is an option, but something with fewer parameters would be very useful
Function parameterization for optimization: a few parameters to describe the function
In finite vector spaces $$\vec{v} = a_1 \vec{b}_1 + \cdots + a_d \vec{b}_d = \mathbf{B} \vec{\mathbf{a}}$$
$\vec{b}_i$ are the basis vectors and $a_i$ the coefficients
if the $\vec{b}_i$ are orthogonal, i.e., $\vec{b}_i^T\vec{b}_j = 0$, $\forall i \neq j$, then $a_i = \vec{v}^T \vec{b}_i / \| \vec{b}_i \|_2^2$
if not, then Ordinary Least Squares (OLS) regression must be performed
$$\hat{\vec{\mathbf{a}}} = \arg \min_{\vec{\mathbf{a}}} \| \vec{v} - \mathbf{B} \vec{\mathbf{a}} \|_2^2 = (\mathbf{B}^T \mathbf{B})^{-1} \mathbf{B}^T \vec{v}$$
for known $\vec{b}_i$, the coefficients $\vec{\mathbf{a}}$ can be a parameterization for $\vec{v}$
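A small sketch of the OLS solution for the coefficients, assuming a hypothetical (non-orthogonal) polynomial basis $1, x, x^2$:

```python
import numpy as np

# Sampled domain and basis matrix B whose columns are 1, x, x^2.
x = np.linspace(0, 1, 50)
B = np.stack([np.ones_like(x), x, x ** 2], axis=1)

# Hypothetical true coefficients; v = B a lives in the span of the basis.
a_true = np.array([0.5, -1.0, 2.0])
v = B @ a_true

# OLS via the normal equations: a = (B^T B)^{-1} B^T v.
a_hat = np.linalg.solve(B.T @ B, B.T @ v)
```

In practice `np.linalg.lstsq` is preferred over explicitly forming the normal equations, for numerical stability.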
Function parameterization with linear basis models¶
Global parameterizations
$$f(x) = \sum_{i=1}^d a_i b_i(x)$$
where $b_i(x)$ are smooth basis functions.
The parameterization is $\vec{a}$
To determine $\vec{a}$, the space is discretized and a $\mathbf{B}$ matrix is formed
Note that this parametrization cannot represent all functions
Examples:
- polynomials: bias-field correction in MRI
- splines: non-linear registration
Kernel-based parameterization¶
$$f(x) = \sum_{i = 1}^d a_i K(x, x_i)$$
where $x_i$ are the control points and $K(x, x_i)$ is a kernel function
Radial basis functions are often used: $K(x, x_i) = K(\| x - x_i \|_2)$
Popular radial basis functions:
Gaussian: $K(\| x - x_i \|_2) = \exp\left(- \frac{\| x - x_i \|_2^2}{\sigma^2}\right)$
Thin-plate spline: $K(\| x - x_i \|_2) = \| x - x_i \|_2^2 \ln \left( \| x - x_i \|_2 \right)$
Example: landmark based registration, linear / non-linear registration, kernel density estimation
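A sketch of a kernel-based parameterization in 1-D, with hypothetical control points and the Gaussian kernel defined above (the target function is an arbitrary choice for illustration):

```python
import numpy as np

# Evaluation points and control points x_i for the kernel expansion.
x = np.linspace(0, 1, 100)
centers = np.linspace(0, 1, 10)      # control points x_i
sigma = 0.15                         # assumed kernel width

# Gaussian radial basis: K(x, x_i) = exp(-||x - x_i||^2 / sigma^2).
K = np.exp(-((x[:, None] - centers[None, :]) ** 2) / sigma ** 2)

# Fit the coefficients a in f(x) = sum_i a_i K(x, x_i) by least squares.
f_target = np.sin(2 * np.pi * x)
a = np.linalg.lstsq(K, f_target, rcond=None)[0]

f_hat = K @ a
err = np.max(np.abs(f_hat - f_target))
```

With only 10 coefficients, the smooth target is represented to small error, which is the point of such parameterizations: the function is optimized through `a`, not point by point.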
Spatial transformations¶
Linear transformation¶
$$T(\vec{x}) = \mathbf{T} \vec{x}$$
Used in linear image registration
Rigid
- 3 dof in 2D - 2 translation and 1 rotation
- 6 dof in 3D - 3 translation and 3 rotation
- Intra-subject registration, multi-modal intra-subject registration
Similarity transformation
- 4 dof in 2D - rigid + scale
- 7 dof in 3D - rigid + scale
- Used for coarse alignment in inter-subject, initialization for non-linear
Affine transformation
- 6 dof in 2D
- 12 dof in 3D
- Often not used in full dof
- Instead, 9 dof transformation (3 translation + 3 rotation + 3 scale) may be preferred
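A minimal sketch of a 2-D rigid transformation (3 dof: 1 rotation + 2 translations) in homogeneous coordinates, with an assumed angle and translation; homogeneous coordinates let the translation be part of the single matrix $\mathbf{T}$:

```python
import numpy as np

# Assumed parameters: rotate by 90 degrees, translate by (1, 0).
theta, tx, ty = np.pi / 2, 1.0, 0.0

# Rigid transformation as a 3x3 homogeneous matrix.
T = np.array([
    [np.cos(theta), -np.sin(theta), tx],
    [np.sin(theta),  np.cos(theta), ty],
    [0.0,            0.0,           1.0],
])

x = np.array([1.0, 0.0, 1.0])   # point (1, 0) in homogeneous coordinates
y = T @ x                        # (1, 0) -> rotated to (0, 1) -> translated to (1, 1)
```

For a rigid transformation the upper-left block is a rotation, so its determinant is 1; adding scale (similarity) or shear (affine) changes that block.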
Non-linear transformation¶
$$T(\vec{x}) = \vec{x} + \underbrace{\vec{u}(\vec{x})}_{\text{displacement field}}$$
Used in non-linear registration
Different models are used
Linear basis function models - additive
More advanced techniques based on composition of transformations
$$T(\vec{x}) = T_n \circ T_{n - 1} \circ \cdots \circ T_1(\vec{x})$$
- where $\circ$ is basic function composition
- allows modeling diffeomorphisms
Transformation related identities¶
$$T(\vec{x}) = \left[ T_1(\vec{x}), T_2(\vec{x}), T_3(\vec{x}) \right]^T$$
Jacobian
$$\mathbf{J} = \begin{bmatrix} \frac{\partial T_1}{\partial x_1} & \frac{\partial T_1}{\partial x_2} & \frac{\partial T_1}{\partial x_3} \\ \frac{\partial T_2}{\partial x_1} & \frac{\partial T_2}{\partial x_2} & \frac{\partial T_2}{\partial x_3} \\ \frac{\partial T_3}{\partial x_1} & \frac{\partial T_3}{\partial x_2} & \frac{\partial T_3}{\partial x_3} \\ \end{bmatrix}$$
- Jacobian determinant quantifies local volumetric change
- $\det(\mathbf{J}) = 1$: no change
- $\det(\mathbf{J}) < 1$: compression
- $\det(\mathbf{J}) > 1$: expansion
Divergence: $\nabla \cdot T$ - trace of the Jacobian. Also gives information about the amount of compression and expansion
Curl: $\nabla \times T$ - information on the amount of infinitesimal rotation
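A small worked example of these identities for a hypothetical uniform scaling $T(\vec{x}) = 1.1 \, \vec{x}$ in 3-D, where the Jacobian is known in closed form:

```python
import numpy as np

# Uniform scaling T(x) = s * x has Jacobian J = s * I (identity matrix).
s = 1.1
J = s * np.eye(3)

det_J = np.linalg.det(J)   # Jacobian determinant: s^3 > 1, local expansion
div_T = np.trace(J)        # divergence = trace of the Jacobian = 3 * s
```

A pure scaling has zero curl; for a general transformation the Jacobian is not constant, and these quantities are evaluated per voxel, typically with the finite differences described below.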
Derivative approximations¶
Finite difference approximations are used most often
First order derivative at a grid point $x_0$ with grid spacing $\Delta x$
$$ \begin{align} \frac{df}{dx}|_{x_0} &\approx \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x} \\ &\approx \frac{f(x_0) - f(x_0 - \Delta x)}{\Delta x} \\ &\approx \frac{f(x_0 + \Delta x) - f(x_0 - \Delta x)}{2 \Delta x} \\ \end{align} $$
Second order derivative at a grid point $x_0$ with grid spacing $\Delta x$
$$\frac{d^2 f}{dx^2}|_{x_0} \approx \frac{f(x_0 + \Delta x) - 2f(x_0) + f(x_0 - \Delta x)}{\Delta x^2}$$
Many different approximations exist
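The approximations above can be checked numerically on a smooth function (the choice $f = \sin$ and the step size are arbitrary illustrations):

```python
import numpy as np

f = np.sin
x0, dx = 0.5, 1e-3   # grid point and grid spacing

# Forward, backward, and central first-order differences.
fwd = (f(x0 + dx) - f(x0)) / dx
bwd = (f(x0) - f(x0 - dx)) / dx
ctr = (f(x0 + dx) - f(x0 - dx)) / (2 * dx)   # O(dx^2) accurate

# Second-order central difference.
second = (f(x0 + dx) - 2 * f(x0) + f(x0 - dx)) / dx ** 2

# Exact values for comparison: f'(x0) = cos(x0), f''(x0) = -sin(x0).
```

The one-sided differences are only first-order accurate in $\Delta x$, while the central differences are second-order accurate, which is why the central forms are usually preferred on interior grid points.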