227-0391-00: Medical Image Analysis
Section 3
Image Transformation & Registration
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 06/30/2025
Disclaimer and Term of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file
This personal note is adapted from Professor Ender Konukoglu, Professor Mauricio Reyes, and Doctor Ertunc Erdil. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Sampling model¶
Moving image is the one that will be transformed
Target image is to target that we want to attain with the transform
Both images have their own pixel coordinates and coorepsonding Cartesian grids
Assume there exists a spatial transformation, i.e., a coordinate transformation, that matches the moving to the target image
In reality, images are discretized - we only have values at pixel locations
Transformation¶
Let us assume $x' = T(x)$ is the transformation model with $x, x' \in \mathbb{R}^2$ in this case, where $x$ and $x'$ represents points in the moving and target image coordinate systems, respectively. Discussions for 3D transformations is the same.
Let us also define $x = T^{-1}(x')$ as the inverse of the transformation model.
Number of possible models: $T(x) = Ax$ (linear) or $T(x) = x + u(x)$ (non-linear)
Main sampling model question
Given the images, respective pixel-grids and transformation, how do we generate the "output" transformed moving image?
There are two possibilities
Forward transformation
Backward transformation
Approch 1: Forward transformation¶
$$T(x) = \{x_1 / [0.15 \sin(2.5x_2) + 0.85], x_2\} = \{x_1', x_2'\} = x'$$
We transform the input grid. Blue contour is the object's countour in the target image and the gray area is the transformation of the moving image. So, everything works nicely.
However, here everything is continuous and analytically computed! Actually this is not how it would work in discrete case.
In the discrete case each pixel from the moving image will be transformed $x' = T(x)$ and the pixel intensity will be copied to the output image.
Problem:
Transfromed pixel cooordinates $x'$ does no necessarily fall on pixel exactly
Two pixels can be mapped to the same output pixel
Same output pixel can be mapped to two output pixels
Some output pixels may not even get a value assigned
Solution (quite complicated):
Intensities of the output pixels can be computed using partial areas
Ovelep between each transformed pixel and output pixels are computed
This overlap is used in a complex intensity assignment step
Approach 2: Backward transformation¶
$$T^{-1}(x') = \{ x_1' [0.15 \sin (2.5 x_2') + 0.85], x_2' \} = \{ x_1, x_2 \} = x$$
We transform the ouput grid. Blue countour is the object's countour in the target image and the gray area is the transformation of the moving image. Everything is continuous and analytical again.
In the backward mapping, for each ouput pixel the corresponding location in the input / moving image is determined. This is much easier to deal with.
Problem:
- Transformed locations of the output pixels od not generally fall exactly on input pixels.
Solution:
Interpolation
Pixel intensity (all channels if necessary) are interpolated using the nearest input image pixels
Remark
Backward transformation is always easier to use for transforming images through interpolation
Interpolation models¶
Problem source¶
Whenever an image is resampled interpolation is used to compute the intensities at the new pixels. For example, spatial transformations and resizing an image for pre-processing.
Intensities at each pixel are defined and we assume those intensities are defined at the center of the pixels, black nodes. How can we computer the intensity at the red node?
Approach 1: Neasest neighbor interpolation¶
Assign the intensity value of the closest pixel with known intensity. Assign the value of the "nearest neighbor."
Nearest neighbor interpolation is very fast
It does not introduce new intensity values in the image
May introduce heavy artifacts in gray level images
Nearest neighbor interpolation yields a piece-wise constant funciton, which is not continuous nor differentiable
Remark
While it is not used for gray level or multispectural images, it is used for transforming labels since it have the nice property of not introducing any new values
Approach 2: Bilinear interpolation¶
Uses closest 4 pixels' intensity values
Extension of linear interpolation in 2D
Trilinear interpolation in 3D
Let us denote the coordiantes of the four points as $p_{00} = \{ x_1^0, x_2^0 \}$, $p_{01} = \{ x_1^0, x_2^1 \}$, $p_{10} = \{ x_1^1, x_2^0 \}$, $p_{11} = \{ x_1^1, x_2^1 \}$, and $q = \{ x_1, x_2 \}$. Interpolated values at the blue intermediate points are:
$$ \begin{align} I(x_1, x_2^1) &= \frac{x_1^1 - x_1}{x_1^1 - x_1^0} I(x_1^0, x_2^1) + \frac{x_1 - x_1^0}{x_1^1 - x_1^0} I(x_1^1, x_2^1) \\ I(x_1, x_2^0) &= \frac{x_1^1 - x_1}{x_1^1 - x_1^0} I(x_1^0, x_2^0) + \frac{x_1 - x_1^0}{x_1^1 - x_1^0} I(x_1^1, x_2^0) \\ I(x_1, x_2) &= \frac{x_2^1 - x_2}{x_2^1 - x_2^0} I(x_1, x_2^0) + \frac{x_2 - x_2^0}{x_2^1 - x_2^0} I(x_1, x_2^1) \\ \end{align} $$
Bilinear interpolation is fast
It will introduce new intensity values in the image due to interpolation
Does not intriduce as big artifacts as nearest neighbor interpolation
Bilinear interpolation leads to piece-wise linear functions, which is continuous but not differentiable
It may smooth and blur the image, leading to decrease in spatial resolution
Trilinear interpolation in 3D repeats the same procedure for an additional dimension and uses 8 points - corners of a cube
On label maps bilinear interpolation produces extra values (look at the boundaries)
These additional labels arise due to interpolation
They doe not exist and should be avoided whenever resampling label maps
Alternative formulation of bilinear interpolation¶
The bilinear interpolation has the following form with four unknown parameters
$$I(x_1, x_2) = a + bx_1 + cx_2 + dx_1x_2$$
Corresponding to the four unknowns we have the intensity values at the four corners leading to a linear system of equation
$$ \begin{bmatrix} 1 & x_1^0 & x_2^0 & x_1^0 x_2^0 \\ 1 & x_1^0 & x_2^1 & x_1^0 x_2^1 \\ 1 & x_1^1 & x_2^0 & x_1^1 x_2^0 \\ 1 & x_1^1 & x_2^1 & x_1^1 x_2^1 \\ \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ \end{bmatrix} = \begin{bmatrix} I(x_1^0, x_2^0) \\ I(x_1^0, x_2^1) \\ I(x_1^1, x_2^0) \\ I(x_1^1, x_2^1) \\ \end{bmatrix} $$
One can solve te system of equations and use $a$, $b$, $c$, $d$ coefficients to interpolate at any $x$ in between $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$.
Trilinear interpolation can be defined similarly with the following parametric form with 8 unknowns
$$I(x_1, x_2, x_3) = a + bx_1 + cx_2 + dx_3 + ex_1x_2 + fx_1x_3 + gx_2x_3 + hx_1x_2x_3$$
Approach 3: Bicubic interpolation¶
Uses closest 4 pixels' information
However, it requires not only the intensity values but also the derivates at those points
Therefore, it is common to use 16 closest points in the interpolation
Extension of cubic interpolation in 2D
Tricubic interpolation in 3D
Similar idea as bilinear interpolation: cubic interpolation in one dimension followed by the other
Functional form¶
Bicubic interpolation has the following form with 16 unknowns
$$I(x_1, x_2) = \sum_{i = 0}^{3} \sum_{j = 0}^3 a_{ij} x_1^i x_2^j$$
Similar to the bilinear interpolation, a linear system of equations can be constructed and solved. Information to determine these unknowns are taken from the 4 closest neighbors as
Intensities at $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$ = 4 values
$\partial x_1$ and $\partial x_2$ derivatives at $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$ = 8 values
$\partial x_1 x_2$ mixed derivatives at $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$ = 4 values
Remark
In order to compute the derivatives further closest pixels must be used. Therefore, bicubic interpolation uses more pixels than the closest 4.
Approximating the derivatives can be done in different ways.
Convolutional form¶
One easy way to do bicubic interpolation is through convolution with the following kernel - a result by keys
$$ \begin{align} K(x) &= \begin{cases} \frac{3}{2} |x|^3 - \frac{5}{2} |x|^2 + 1 & 0 \le |x| \le 1 \\ -\frac{1}{2} |x|^3 + \frac{5}{2} |x|^2 - 4 |x| + 2 & 1 \le |x| \le 2 \\ 0 & 2 \le |x| \end{cases} \\ f(x) &= \sum_{i = -2}^{2} f(x^i) K(x - x^i) \end{align} $$
The kernel is used to covolve each dimension one after the other
In 1d its support is in $[-2, 2]$ taking into account 4 sampled points
In 2d the bicubic interpolation with Key's convolution takes into account $4 \times 4$ closest poixels to the interpolating point $x$
The kernel is called "Catmull-Rom" kernel
Results on images and label maps¶
Bicubic interpolation is better at retaining gray values than bilinear and nearest neighbor. It has the highest quality interpolation amongst them
Compulational requirement is the highest
Bicubic interpolation leads to smooth functions that are both contunuous and differentiable
Tricubic interpolation in 3D repeats the same procedure for additional dimension and larger number of points
One label maps bicubic interpolation produces extra vales
These additional labels arise due to interpolation
They do not exist and should be avoided whenever resampling label maps
Comparsions¶
Nearest neighbor yields bad quality
Bilinear is a good trade-off
During a registration, the interpolation needs to be computed many times during the optimization. Bilinear's computational advantage is important for this
One may use bilinear during the optimization but bicubic at the end
For transforming label maps, it is better to use nearest neightbor interpolation
Linear transformation models¶
The general form of the transformation is a matrix-vector product
$$x' = T(x) = \mathbf{A}x + t$$
where the transformation is defined by the matrix $\mathbf{A}$ and the translation vector $t$.
The dimension of this matrix ans the translation vector depend on the dimension of the input and output spaces. Transformations between two two-dimensional spaces
$$x \in \mathbb{R}^2, x' \in \mathbb{R}^2 \rightarrow \mathbf{A} \in \mathbb{R}^{2 \times 2}, t \in \mathbb{R}^2$$
Transformations between two three-dimensional spaces
$$x \in \mathbb{R}^3, x' \in \mathbb{R}^3 \rightarrow \mathbf{A} \in \mathbb{R}^{3 \times 3}, t \in \mathbb{R}^3$$
These are the most commonly used cases.
One can also consider transformation of a 3D space to 2D and a 2D space to 3D.
Transformations between 2D images and 3D volumes, e.g., 2D ultrasound and 3D CT for intervention, are also studied.
Remark
This is actually not linear due to the translation $t$, i.e., $T(\alpha x_1 + \beta x_2) \neq \alpha T(x_1) + \beta T(x_2)$. It is linear in the homogeneous coordinates.
Thy are gloabl and called linear transformation models because the Jacobian of the transformation is the same for all $x$.
Parameterization in the most generic case¶
In two-dimensions
$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, t = \begin{bmatrix} t_1 \\ t_2 \end{bmatrix}$$
4 + 2 = 6 free parameters
In three-dimensions
$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix}, t = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix}$$
9 + 3 = 12 free parameters
Linear transformations has small number of parameters compared to non-linear transformations
The same parameters apply to the entire image
Therefore, the number of parameters is independent of the number of pixels / voxels
$$T(x) = \mathbf{A}x + t, \forall x$$
If the matrix $\mathbf{A}$ is invertible, then the inverse transformation is given as
$$T^{-1}(x') = \mathbf{A}^{-1}x' + t', t' = -\mathbf{A}^{-1}t, \forall x'$$
Homogeneous coordinates¶
Alternative representation of the linear transformation can be written with the more compact homogeneous coordinates. Instead of
$$T(x) = \mathbf{A}x + t, \forall x$$
One can use the homogeneous coordinates
$$\tilde{T}(x) = \tilde{\mathbf{A}}\tilde{x}, \tilde{\mathbf{A}} = \begin{bmatrix} \mathbf{A} & t \\ \mathbf{0} & 1 \end{bmatrix}, \tilde{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}$$
where $\mathbf{0}$ is a matrix of appropriate size.
For example in 2D, transformation matrix with the homogeneous coordinates can be written as
$$\tilde{A} = \begin{bmatrix} a_{11} & a_{12} & t_{1} \\ a_{21} & a_{22} & t_2 \\ 0 & 0 & 1 \end{bmatrix}$$
Approach 1: Rigid body transformations¶
Rigid body transformation in 2D¶
Rigid body transformations is one of the most commonly used transformation models in medical image analysis
The transformation is defined through a rotation and translation
In 2D, it is parametrized with one angle $\theta$ and one translation vector $t$
$$x' = T(x) = \mathbf{R}x + t, \mathbf{R} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}$$
The transforamtion rotates the xy-plane about the origin counterclockwise by $\theta$
Three parameters in total, i.e., $\theta$, $x$, and $t$
Rigid body transformation in 3D¶
In 3D, it can be defined through three rotations and a translation
It is parametrized with three angles $\theta_{xy}$, $\theta_{xz}$, and $\theta_{yz}$ and a translation vector $t$
Each angle defines rotation about another axis
- $\theta_{xy}$: rotation about the z-axis (rotation of the xy plane)
- $\theta_{xz}$: rotation about the y-axis (rotation of the xz plane)
- $\theta_{yz}$: rotation about the x-axis (rotation of the yz plane)
$$ R(\theta_{xy}) = \begin{bmatrix} \cos(\theta_{xy}) & -\sin(\theta_{xy}) & 0 \\ \sin(\theta_{xy}) & \cos(\theta_{xy}) & 0 \\ 0 & 0 & 1\\ \end{bmatrix}, R(\theta_{xz}) = \begin{bmatrix} \cos(\theta_{xz}) & 0 & \sin(\theta_{xz}) \\ 0 & 1 & 0 \\ -\sin(\theta_{xz}) & 0 & \cos(\theta_{xz})\\ \end{bmatrix}, R(\theta_{yz}) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta_{yz}) & -\sin(\theta_{yz}) \\ 0 & \sin(\theta_{yz}) & \cos(\theta_{yz}) \end{bmatrix} $$
Remark
$R(\theta_{xy}) R(\theta_{yz}) R(\theta_{xz}) \neq R(\theta_{xy}) R(\theta_{xz}) R(\theta_{yz})$
The transformation can push the object out of the FOV
3D rigid transformation can change the visible anatomy in the same slice
Essentially a different cross-section ocupies the same after transformation
Interpolation is performed using bilinear and trilinear methods
Center of rotation is important
A tricky piece of information that needs to be defined is the centers of coordinate systems in both domains. In the previous examples, the centers of transformation was always taken as the center of the image. However, this can be changed and the reuslting transforamtion also change accordingly.
Cne of transforamtion applies to all transforamtions and is especially important for linear transformations where the same matrix applies to the entire image.
Approach 2: Similarity transformations¶
Similarity transforamtion builds on top of rigid body motion by integrating scaling
It is also used quite extensively in medical image analysis
The transformation is defined through a rotation, translation, and scaling
It is parameterized with a rotation matrix, a translation vector $t$ and a scaling factor
$$x' = T(x) = \mathbf{R} \mathbf{S} x + t, \mathbf{S} = \begin{bmatrix}s & 0 \\ 0 & s \end{bmatrix}~\text{or}~\mathbf{S} = \begin{bmatrix}s & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & s\\ \end{bmatrix}$$
The scaling "zooms" in or out of the image
The transformation rotates the xy-plane about the origin counterclockwise by $\theta$
The basic similarity transformation adds one parameter to rigid body motion - 4 parameters in 2D and 7 parameters in 3D
Remark
The order of rotation and scaling is not important in similarity transformation. $x' = \mathbf{SR}x$ is the same as $x' = \mathbf{RS}x$.
Extending similarity transformations with anisotropic scaling¶
Previously we defined similarity transformation that used only one scaling factor
One can also think about scaling with different factors in different dimensions
In this case, the transformation is defined through a rotation matrix, translation and a scaling matrix
$$x' = T(x) = \mathbf{RS}x + t, \mathbf{S} = \begin{bmatrix}s_1 & 0 \\ 0 & s_2 \end{bmatrix}~\text{or}~\mathbf{S} = \begin{bmatrix}s_1 & 0 & 0 \\ 0 & s_2 & 0 \\ 0 & 0 & s_3\\ \end{bmatrix}$$
The scaling "zooms" in or out of the image in different amounts in different dimensions
The anisotropic form is no longer a similarity transformation, i.e., shapes are not preserved
Number of parameters increases:
- 2D: 1 (rotation) + 2 (translation) + 2 (scaling) = 5 parameters
- 3D: 3 (rotation) + 3 (translation) + 3 (scaling) = 9 parameters
Remark
The rotation and scaling is very important when applying anisotropic scaling. $x' = \mathbf{SR}x$ is not the same as $x' = \mathbf{RS}x$.
Applying the transformation with $T(x) = \mathbf{SR}x + t$ yields shearing as seens in the third column. The transformation parameters are the same for the images on the second and third columns.
Approach 3: Affine transformations¶
Extends the mapping with anistropic scaling with shearing
The transformation is defined through
$$x' = T(x) = \mathbf{WRS}x + t, \mathbf{W} = \begin{bmatrix} 1 & w \\ 0 & 1 \end{bmatrix}~\text{or}~\mathbf{W} = \begin{bmatrix} 1 & w_1 & w_2 \\ 0 & 1 & w_3 \\ 0 & 0 & 1\\ \end{bmatrix}$$
with the additional shearing matrix $\mathbf{W}$
Shapes are no longer preserved
Number of parameters increases
- 2D: 1 (rotation) + 2 (translation) + 2 (scaling) + 1 (shearing) = 6 parameters
- 3D: 3 (rotation) + 3 (translation) + 3 (scaling) + 3 (shearing) = 12 parameters
Alternative parameterizations are possible
A full affine transformation would also have reflection. This can be achieved with negative scaling factor
Affine transformation with shearing is rarely used in medical image anslysis, but shearing model is not used often
Some application areas of linear transformations¶
Rigid transformations - rotation and translation
- Longitudinal analysis - aligning temporal sequences
- Motion correction - correcting rigid motion between slices in a volumetric acquisition
- Muti-modal registration - aligning different images of different modality acquired at the same time, e.g. PET / MRI / CT
Similarity transforamtions - rotation, translation and global scaling
- Spatial normalization for machine leaning applications - removing variations easy to eliminate
- Alignment for shape analysis - eliminating global variation in shape
Similarity with anisotropic scaling - rotation, translation and scaling in each dimension
- Initialization for non-linear registration
- Spatial normalization for machine learning applications
Affine transformation - rotation, translation, scaling in each dimension and shear
- Initialization for non-linear registration
- Studying mechanical tissue properties
- Mechanical modeling for intervention
- Used to define more complicated non-linear transformations
Summary
Linear transformations are used extensitvely
Rigid, similarity, and similarity + anisotropic scaling are common in medical image computing
Affine transformation is common in modeling interventions
Easy to implement
Non-linear deformable transformation models¶
The general form of the transforamtion is
$$x = T^{-1}(x') = x' + u(x')$$
where the transformation is defined by the function $u(x')$. Note that
This function can be non-linear
We directly model the inverse transformation from output (target) domain to the input (moving) domain
This is due to the difficulty in establishing the inverse transforamtion for a given arbitrary forward transformation
Remember that it is more beneficial from the sampling point of view to work with the inverse transforamtion
The question is "How do we parametrize $u(x')$?"
Naive parameterization¶
$$x = T^{-1}(x') = x' + u(x')$$
where the transformation is defined by the function $u(x')$.
In the naive parameterization, one defines a displacement vector for each pixel / voxel independent from each other
Number of parameters in 2D: (Number of pixels) $\times$ 2, two components fro each displacement vector
Number of parameters in 3D: (Number of voxels) $\times$ 3, three components fro each displacement vector
There are two important issues to consider:
- Very large number of parameters, very large degrees of freedom. It would be difficult to estimate
- This transformation model can generate very "noisy" transformations. uch transformation may not be realistic and even change the topology of the underlying image
$$x = T^{-1}(x') = x' + u(x), u(x) \sim \mathcal{N}(0, 1.5 \mathbf{I}_2)$$
Random transformations look noisy: resulting moved image is quite noisy
Obviously, this is not very desirable
It would be nice to be able to generate smooth transformation all the time
Overview on a registration algorithm¶
$$\theta^* = \arg \min_{\theta} \mathcal{L}(I, T_{\theta} \circ J)$$
During the optimization, we would like to only "search" through smooth transformations
This can be achieved by parameterization
We need models that can generate smooth displacement fields for any set of parameters. Effectively reducing the number of free parameters
There are two strategies
- Inerpolation (kernel) based
- Physical model based
Interpolation based methods¶
Define displacement vectors for sparse set of "control" points and interpolate in between with a smooth model
Radial basis functions
$$u(x') = \sum_{n=1}^{N}\phi(|x' - x'_n|)d_n, \phi : \mathbb{R}^{+} \rightarrow \mathbb{R}, d_n \in \mathbb{R}^{2\text{ or }3}$$
where $x_n$ are the control points, $\phi$ are the basis functions and $d_n$ are the non-linear coefficients, defining the transformation
Two commonly used forms
- Thin Plate Splines (TPS)
- Free Form Deformations (FFD) with B-spline
Approach 1: Thin Plate splines (TPS)¶
In TPS, the basis functions are defined as $\phi(|x' - x'_n|) = \phi(r) = r^2 \log r$
Sometimes defined together with an affine transformation as follows
$$x = \mathbf{A} x' + t + \sum_{n = 1}^{N} \phi(|x' - x'_n|)d_n$$
where $\mathbf{A}$ and $t$ define a global linear transformation and TPS define local refinement
Control points do not have to be uniformly spaced, they cna be randomly dispersed
Parameterization in TPS¶
$$u(x') = \sum_{n=1}^{N} \phi(|x' - x'_n|)d_n, \phi(r) = r^2 \log r$$
In landmark-based registration, where TPS is extensively used, control points are given by the user.
The parameters of the non-linear transformation model are the coefficients $\{d_n\}_{n-1}^{N}$.
- In 2D: (Number of control points) $\times$ 2
- In 3D: (Number of control points) $\times$ 3
If affine transformation is included then there are the parameters of the linear transformation $\mathbf{A}x' + t$
$$x = \mathbf{A}x' + t + \sum_{n=1}^{N} \phi(|x' - x_n'|)d_n$$
TPS is a non-linear deformable transformation model with very few number of parameters.
Increase the number of control points in TPS
Leads to smooth transformations that can also apply non-linear deformations to the object in the image. Increasing the number of control points can give more complicated transformations.
Transformations are bound to be smooth due to the interpolating kernel $\phi$
Effects of each control point and the parameters $d_n$ defined on it are global
Increasing the control points can lead to more complicated transformations
Mostly used for landmark-based registration not intensity-based registration
In landmark-based registration, correspondence between a number of points are given, and model parameters are found to satisfy those correspondences
Approach 2: Free form deformation with B-splines¶
Free Form Deformation (FFD) with B-spline is based on (3D)
$$u(x') = \sum_{l=0}^{3} \sum_{m=0}^{3} \sum_{n=0}^{3} B_{l}(\mu_1) B_{m}(\mu_2) B_{n}(\mu_3) d_{i+l, j+m, k+n}$$
There are $N_x \times N_{y} \times N_{z}$ control points placed in a uniform gird with spacing $\delta$.
Gray is the image grid and red is the grid fro the FFD control points. Displacement vectors $d$ are given on the control points only. $u(x')$ is the interpolated for the image grid points.
Parameters are defined as (in 3D)
Indices for the control points $$i = \lfloor \frac{x_1'}{\delta} \rfloor - 1, j = \lfloor \frac{x_2'}{\delta} \rfloor - 1, k = \lfloor \frac{x_3'}{\delta} \rfloor - 1$$
Distances to control $$\mu_1 = \frac{x_1'}{\delta} - \lfloor \frac{x_1'}{\delta} \rfloor, \mu_2 = \frac{x_2'}{\delta} - \lfloor \frac{x_2'}{\delta} \rfloor, \mu_3 = \frac{x_3'}{\delta} - \lfloor \frac{x_3'}{\delta} \rfloor$$
Basis functions are
$$ \begin{align} B_0(s) &= (1 - s)^3 / 6 \\ B_1(s) &= (3s^3 - 6s^2 + 4) / 6 \\ B_2(s) &= (-3s^3 + 3s^2 + 3s + 1) / 6 \\ B_3(s) &= s^3 / 6 \end{align} $$
Cubic B-spline (of order 4). It is continuous at anchor points up to its 2nd derivative. As a consequence, for example, it has continuous Jacobian and Divergence fields.
Parameterization in FFD with B-spline¶
$$u(x') = \sum_{l=0}^{3} \sum_{m=0}^{3} \sum_{n=0}^{3} B_{l}(\mu_1) B_{m}(\mu_2) B_{n}(\mu_3) d_{i+l, j+m, k+n}$$
$$i = \lfloor \frac{x_1'}{\delta} \rfloor - 1, j = \lfloor \frac{x_2'}{\delta} \rfloor - 1, k = \lfloor \frac{x_3'}{\delta} \rfloor - 1$$
$$\mu_1 = \frac{x_1'}{\delta} - \lfloor \frac{x_1'}{\delta} \rfloor, \mu_2 = \frac{x_2'}{\delta} - \lfloor \frac{x_2'}{\delta} \rfloor, \mu_3 = \frac{x_3'}{\delta} - \lfloor \frac{x_3'}{\delta} \rfloor$$
Grid of control points is often subsampled from the original image grid
The parameters of the non-linear transformation model are the coefficients $\{d_n\}_{n-1}^{N}$. (Same as the TPS model)
- In 2D: (Number of control points) $\times$ 2
- In 3D: (Number of control points) $\times$ 3
One can again include an affine transforamtion similar to TPS model.
One can also consider optimizing localtions of the control points - increasing the number of parameters.
Increasing the number of control points in FFD with B-splines
Leads to smooth transformations that can also apply non-linear deformations to the object in the image. Even when the displacements at the control points are large.
Note that increasing the number of control points gives us less smooth transformations.
Transformations are bound to be smooth due to the interpolating kernel
Effects of each control point and the displacement field on it are local
Increasing the number of control points, i.e., using a finer grid, can lead to less smooth transformations
This provides the ability to perform multi-resolutional transformations. Start with few and increase the number of control points
FFD with B-splines is mostly used for intensity-based registration ot necessarily for landmark-based
Physical models¶
Pixel / voxel-wise modeling - a displacement field fo each grid point
Assume a physical model of transformation
Often motivated from physical phenomenon, such as fluids and elastic body
Very large number of parameters but restricted transformations based on the underlying model
Corase classification
Elastic body models
Fluid flow models
Curvature registration
Diffusion model
Flows of diffeomorphisms
Approach 1: PDE-based models¶
In all the models one looks for $u(x')$ with a displacement vector for each pixel / voxel.
However, the transforamtion $u(x')$ must satisfy a model equation, hence it has lower effective degrees of freedom.
Elastic body models (linear)
$$\mu \Delta u(x') + (\mu + \lambda) \nabla (\nabla \cdot u(x')) + F(x') = 0$$
with $\Delta$ being the Laplace operator, $\mu$ the rigidity, $\lambda$ Lame's first coefficient, $\nabla$ gradient, $\nabla \cdot$ divergence and $F(x')$ the user defined force field.
Diffusion models
$$\Delta u(x') + F(x') = 0$$
the displacement field satisfies the Possion equation
Curvature registration
$$\Delta^2 u(x') + F(x') = 0$$
Approach 2: Flow models¶
The transformation of each point is defined as an Ordinary Differential Equation (ODE)
$$\frac{dx}{dt} = v(x, t), x(0) = x'$$
where the transformation is defined through the time-dependent "velocity" field $v(x, t)$.
The ODE can be solved through integration in time
$$x = \int_{1}^{1} v(x, t) dt + x' = u(x') + x'$$
Theory of ODE says that if $v(x, t)$ is smooth then the resulting transformation is a diffeomophism.
This framework leads to the Large Deformation Diffeomorphic Metrix Mapping (LDDMM) registraion algorithm.
Number of parameters are even higher - one for each pixel / voxel and time.
Used with regularization models on $v(x, t)$, e.g., $\| \nabla \cdot v(x, t) \|$.
Flow with stationary vector fields¶
A variation of the flow model is formulated with stationary vector field (SVF) ($v$ does not depend on time):
$$x = \int_{0}^{1} v(x)dt + x' = u(x') + x'$$
This model has fewer parameters - one velocity vector per pixel / voxel.
When the vector field is stationary, efficient integration schemes exist.
Classifying registration problems¶
Same subject's images | Different subject' images | |
---|---|---|
Same modality | Intra-subject, intra-modality | Inter-subject, intra-modality |
Different modality | Intra-subject, inter-modality | Inter-subject, inter-modality |
Remark
The case "Intra-subject, intra-modality" is the easiest, the case "Inter-subject, inter-modality" is the hardest.
Registering images of the same or different subjects (intra)
- Alignment strategy would change
- Transformations are different
- Example:
- Two CT: same modality
- CT & MR: inter-modality
Registering similar or different modalities
- Need to use different loss functions
- Intensity- based loss functions that are not affected by differences in modality variations
Intra-subject registration¶
Same subject, different time stamps.
Aligning two different images of the same person. Many application areas in research and in pratice. Easier problem then inter-subject because we expect the same anatomy.
Intra-subject registration can be difficult¶
The deformation of the anatomy during motion can be very complex. Transformation models may not be able to model this accurately. An example for this is the breathing motion and its effect on the abdomen.
Images of the same individual may differ vastly in intensity characteristics. Neuroimaging studies are a good example of this.
Differences in anatomies even when images show the same structures.
Lack of exact correspondence between two images.
Presence of pathologies make registration even more difficult.
Landmark-based registration¶
Landmarks in medical imaging
- Prominent locations in an image
- Distinct from their surroundings, similar to interest points from computer vision
- Easily identifiable
- Can be used to orient ourselves in an image
- Landmarks will be used to define correspondence between images
- Landmark definition depends on the imaged structure and type of images to align
- Most commonly manually placed
Examples for thorax
- Trachea bifurcation
- Most superior tip of the liver
- Tip of the pelvis bones - iliac crests
Landmarks for registration
- Should be indentifiable across images
- Same landmark in both of the images
- Corresponding landmarks will guide registration procedure and help align the images
- Correspondence between landmarks define correspondence between different pixels / voxels
Same modality, different subjects¶
Both CT images acquired with similar protocols, e.g., without contrast
Images of different people
Natural landmark positions are anatomical structures common to every person
More generally common to every sample
Different modality, different subjects¶
Different modalities: one CT and one MRI
Images of different people
Showing the same anatomical area
Natural landmark positions are anatomical structures common to every sample
Landmarks should be identifiable in both of the modalities
Same person, same modality¶
Same modality - MRI
Images of the same person taken at different time points
Usual landmark locations - anatomical structure common to everyone
In addition, sample-specific structural variations
Sample-specific landmarks will lead to more accurate alignment
Same person and different modalities is also very common and same idea applies
Landmarks in biological samples¶
The same concepts apply to bilogical images as well
Different modalities - Optical Coherence Tomography and Nissl-stained microscopy image
Image of the same tissue sample
Corresponding structures
Constellations of neurons in these examples is used
Defining such landmarks across different tissue samples can be difficult
How to align images with the landmarks¶
$$\mathcal{L} = \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - T_{\theta} \left( \mathbf{x}_n^{(2)} \right) \right\|_2^2, \mathbf{x} \in \mathbb{R}^d$$
$\mathbf{x}_n^{(1)}$ is the $n$-th landmark in the first image
$\mathbf{x}_n^{(2)}$ is the corresponding landmark in the second image
$T_{\theta}\left(\mathbf{x}_n^{(2)}\right)$ is the transformed version of $\mathbf{x}_n^{(2)}$
$\theta$ is the set of transformation parameters
$d$ is the dimension of the problem, e.g., 2D, 3D, 3D + time
$\| \cdot \|_2^2$ is the usual $l_2$ loss
other types of losses are also possible
Approach 1: Linear alignment¶
$$\mathcal{L}(\theta) = \sum_{n = 1}^{N} \left\| \mathbf{x}_n^{(1)} - T_{\theta}\left( \mathbf{x}_n^{(2)} \right) \right\|_2^2, \mathbf{x} \in \mathbb{R}^d$$
When aligning with linear transformation $T_{\theta}$ is defined through a matrix $\mathbf{A}$ and a trnaslation $\mathbf{t}$
$$\mathcal{L}(\mathbf{A}, \mathbf{t}) = \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - \left( \mathbf{A}\mathbf{x}_n^{(2)} + \mathbf{t} \right) \right\|_2^2, \mathbf{x} \in \mathbb{R}^d$$
The loss is minimized to determine the best transformation parameters $\mathbf{A}$ and $\mathbf{t}$
$$\arg \min_{\mathbf{A}, \mathbf{t}} \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - \left( \mathbf{A}\mathbf{x}_n^{(2)} + \mathbf{t} \right) \right\|_2^2$$
For other loss functions, the minimization can be solved using optimization algorithms.
Rigid transformation, same subject, same modality¶
Results from rigid registration - $T_{\theta}$ is a rigid transformation, i.e., rotation and translation
Using only 4 landmarks the rigid registration correctly aligns the images
Notice that the overlay is sharper - meaning good alignment
Rigid transformation, different subject, same modality¶
Rigid registration may not be enough to account for differences between different subjects
Landmarks: trachea bifunction, liver tip and pelvis tips
Notice the landmarks do not match perfectly
Affine transformation, different subject, same modality¶
Affine registration may be a bit better in accounting for inter-subject differences
Landmarks: trachea bifurcation, liver tip and pelvis tips
Notice the landmarks match better
How would adding more landmarks change?
At three landmarks, alignment for the landmarks are perfect but for the other parts it is not.
That is because affine transformation has 6 unknowns, and 6 equalities - given by three landmarks - gives an well-determined syste, of equations.
Increasing number of landmarks yield worst alignment for the landmarks but better for other areas.
More landmarks leads to an over-determined system of equations.
Why does the affine transformation not align all the landmarks perfectly?
With 6 parameters, it does not have the degrees of freedom to satisfy more than 3 correspondences in 2D. Note that 3 correspondences lead to 6 equationsl each coordinate will give one equation $$\mathcal{L} = \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - \left( \mathbf{A}\mathbf{x}_n^{(2)} + \mathbf{t} \right) \right\|_2^2, \mathbf{x} \in \mathbb{R}^d$$
Approach 2: Non-linear alignment with Thin Plate Splines (TPS)¶
$$\mathcal{L} = \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - T_{\theta} \left( \mathbf{x}_n^{(2)} \right) \right\|_2^2, \mathbf{x} \in \mathbb{R}^d$$
When aligning with TPS, the cost function becomes
$$\mathcal{L} = \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - \mathbf{x}_n^{(2)} - \sum_{k=1}^{N} \phi \left( \left| \mathbf{x}_n^{(2)} - \mathbf{x}_k^{(2)} \right| \right) d_k \right\|_2^2, \mathbf{x} \in \mathbb{R}^d$$
Note that the displacement vectors are defined through the same landmarks. Similar to linear, the loss is minimized to determined the best transformation paramters
$$\arg \min_{\{d_k\}_{k=1, \cdots, N}} \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - \mathbf{x}_n^{(2)} - \sum_{k=1}^{N} \phi \left( \left| \mathbf{x}_n^{(2)} - \mathbf{x}_k^{(2)} \right| \right) d_k \right\|_2^2$$
After determing the optimal parameters $d_k^*$ one can use it to find the displacement vectors everywhere in the target image, not just the landmarks
$$T_{\theta}\left( \mathbf{x}_n^{(2)} \right) = \mathbf{x}_n^{(2)} + \sum_{k=1}^{N} \phi \left( \left| \mathbf{x}^{(2)} - \mathbf{x}_k^{(2)} \right| \right) d_k^*$$
TPS transformation, same subject, same modality¶
Results from TPS registration - $T_{\theta}$ is a TPS transformation
Notice that the overlay is sharper - meaning good alignment
TPS has larger degrees of freedom and model complicated transformations
12 different landmarks are used to get a good alignment
Notice that the landmarks are perfectly aligned but further, the alignment in the rest of the image is also much better than linear transforamtions
Remark
More landmarks can lead to better alignment in TPS.
Notes on landmark-based registration
Other transformation models can also be used in the same optimization $$\mathcal{L} = \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - T_{\theta}\left( \mathbf{x}_n^{(2)} \right) \right\|_2^2, \mathbf{x} \in \mathbb{R}^d$$
We only considered placing landmarks manually, however, automatic determination of landmarks is also possible with machine learning algorithms
Robust losses are especially important if we are suspucious about the quality of landmarks. For example, when you use automatic methods to detect such landmarks.
Identifying all interest points and then using robust fitting methods, such as RANSAC, is also possible to use. This is particularly used for natural images but less so for medical images
Intensity-based registration¶
Intensity-based loss function
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J)$$
where $d(\cdot, \cdot)$ is a distance metric between intensities.
Distance $d(\cdot, \cdot)$ is defined over the intensities of the fixed image $I$ and the moving image $J$ after transformation, i.e., $T_{\theta} \circ J$.
There are no landmarks, hence no input information on corresponding locations in the images.
Harder problem as there is no unique solution to the problem.
Applicable in more general situations as it does not require landmark placement.
Definition of $d(\cdot, \cdot)$ is very important and depends on the type of registration.
Intra-modality loss metrics - SSD¶
One of the most commonly used metrics is the Sum of Squared Distances (SSD)
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J) = \sum_{\mathbf{x} \in \Omega} \left\| I(\mathbf{x}) - (T_{\theta} \circ J)(\mathbf{x}) \right\|_2^2$$
where $\Omega$ is the image domain of the fixed image and $(T_{\theta} \circ J)(\mathbf{x})$ refers to the intensity of the transformed moving image at $\mathbf{x}$.
Another way of writing the same distance using the explicit transformation is
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J) = \sum_{\mathbf{x} \in \Omega} \left\| I(\mathbf{x}) - J\left(T_{\theta}^{-1}(\mathbf{x})\right) \right\|_2^2$$
and the registration is defined as
$$\arg \min_{\theta} \sum_{\mathbf{x} \in \Omega} \left\| I(\mathbf{x}) - J\left(T_{\theta}^{-1}(\mathbf{x})\right) \right\|_2^2$$
When the images are the same then SSD becomes 0.
There is no unique solution to this problem. One can warp images in any way to perfectly minimize the SSD to its minimum value 0.
It assumes the intensities are comparable across the images. Hence, it would not work for registering images with different modalities.
Image gradients are important for SSD¶
Let us look at the optimization problem
$$\arg \min_{\theta} \sum_{\mathbf{x} \in \Omega} \left\| I(\mathbf{x}) - J\left(T_{\theta}^{-1}(\mathbf{x})\right) \right\|_2^2$$
Assume that there is only one parameter $\theta$ and we compute the derivation the cost with respect to this parameter
$$\frac{d \mathcal{L}(\theta)}{d\theta} \propto - \sum_{\mathbf{x} \in \Omega} \left( I(\mathbf{x}) - J\left(T_{\theta}^{-1}(\mathbf{x})\right) \nabla J\left(T_{\theta}^{-1}(\mathbf{x})\right)^T \frac{d T_{\theta}^{-1}(\mathbf{x})}{d \theta} \right)$$
Let us focus on the dot product at the far right
$$\nabla J\left(T_{\theta}^{-1}(\mathbf{x})\right)^T \frac{d T_{\theta}^{-1}(\mathbf{x})}{d \theta}$$
This term is non-zero when $\frac{d T_{\theta}^{-1}(\mathbf{x})}{d \theta}$ has a component parallel to the gradient of the transpose image $\nabla J$.
Therefore, only changes in transformation that are parallel to the image gradient can change, hence reduce, the loss.
Variations from SSD¶
Original SSD
$$d(I, T_{\theta} \circ J) = \sum_{\mathbf{x} \in \Omega} \left\| I(\mathbf{x}) - J\left(T_{\theta}^{-1}(\mathbf{x})\right) \right\|_2^2$$
One can consider multiple variations on the SSD to improve its performaance.
Using a robust loss instead of $ell_2$, such as $l_2$.
Optimization may become more difficult.
Instead of using intensities, one can consider features extracted from larger patches
$$d(I, T_{\theta} \circ J) = \sum_{\mathbf{x} \in \Omega} \left\| f(I(\mathcal{N}(\mathbf{x}))) - f\left( J\left( \mathcal{N} \left( T_{\theta}^{-1} (\mathbf{x}) \right) \right) \right) \right\|_2^2$$
where $f(\cdot)$ is a feature and $\mathcal{N}(\cdot)$ denotes a patch in the image with the argument indicating the center.
Inter-modality loss metrics¶
SSD assumes intensities are comparable.
In inter-modality registration, when we are aligning images from different modalities, we cannot assume they are directly comparable.
Inter-modality loss metrics are of statistical form.
Two of the most commonly used ones are
- Mutual information
- Normalized cross correlation
They both assume there is a statistical relationship between intensities of the two images.
Metric 1: Normalized Cross Correlation (NCC)¶
Loss based normalized cross correlation is defined as
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J) = -NCC^2(I, J)$$
$$NCC(I, J) = \frac{1}{|\Omega|} \frac{1}{\sigma_I \sigma_J} \sum_{\mathbf{x} \in \Omega} (I(\mathbf{x}) - \mu_{I}) \left( J\left( T_{\theta}^{-1}(\mathbf{x}) - \mu_J \right) \right)$$
where $|\Omega|$ denotes the number of pixels in the fixed image $I$, $\sigma_I$ and $\sigma_J$ are the standard deviations of $I$ and $T_{\theta} \circ J$, and $\mu_I$ and $\mu_J$ are the means of the same.
NCC is the 2D version of the Pearson correlation coefficient.
It assumes pixel intensities as independent random variables.
It achieves maximum socre of 1 if $I(\mathbf{x})$ and $J(T_{\theta}^{-1})$ are linearly related and the same relationship holds for all $\mathbf{x} \in \Omega$.
It can be thought to approximate
$$\mathbb{E}\left[ (I - \mu_I) \times (I - \mu_J) \right] / \sigma_I \sigma_J$$
if $I$ and $J$ were random varibales.
NCC can be noth positive and negative. Both cases are fine. That is why the loss is often defined in terms of $NCC^2$. We minimize the negative of the square.
Remark
Notice that the NCC does not go down to -1. This is because in reality, the relationship is never perfectly linear. Moreover, background pixels may influence the total loss.
Metric 2: Mutual information¶
Mutual information is defined as
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J) = MI(I, J)$$
$$MI(I, J) = D_{KL}(p(I, J) | p(I) p(J))$$
where $D_{KL}(P | Q)$ is the Kullback-Leibler divergence that measures a distance between two distributions $P$ and $Q$.
MI measures Kullback-Leibler divergence between the joint distributions of intensities and product of marginal distributions of the same.
It assumes pixel intensities as independent random variables.
MI takes the value 0 if $p(I, J) = p(I) p(J)$, when the intensities in the two images are independent from each other - no statistical relationship.
Images that are not aligned would have no statistical relationship, thus, low MI.
Perfectly aligned images should have a strong statistical relationship.
Hence, MI should be maximized during a registration process.
Computation of mutual information¶
The KL divergence is given in continuous domain as
$$MI(I, J) = \int_I \int_J p(I, J) \log \frac{p(I, J)}{p(I) p(J)} dIdJ$$
We can approximate $p(I)$ and $p(J)$ with histograms of $I$ and $T_{\theta} \circ J$, respectively. $p(I, J)$ is approximated with 2D histogram of $I$ and $T_{\theta} \circ J$. With these approximations
$$MI(I, J) = \sum_{i} \sum_{j} h(i, j) \log \frac{h(i, j)}{h(i) h(j)}$$
where $h$ denotes histograms and $i$, $j$ go through different bins of the histograms.
Comparasion of NCC and MI
MI is more powerful, it can capture complicated relationships.
For instance, it does not ha ripples for large angles in the rotation example.
MI can be difficult to compute.
NCC is easier to compute.
Optimization¶
We have seen different definitions for $\mathcal{L}(\theta)$
Landmark-based cost function
$$\arg \min_{\theta} \sum_{n=1}^{N} \left\| \mathbf{x}_n^{(1)} - T_{\theta}\left(\mathbf{x}_n^{(2)}\right) \right\|_2^2$$
Intensity-based cost function
$$\arg \min_{\theta} SSD(I, T_{\theta} \circ J)$$ $$\arg \max_{\theta} NCC^2(I, T_{\theta} \circ J)$$ $$\arg \max_{\theta} MI(I, T_{\theta} \circ J)$$
Optimization for image registration is a difficult problem in general.
Landmark-based cost function is easier of the two. As long as degrees of freedom.
Optimization for liear transformation models is easier than non-linear transformation models.
In general, intensity-based registration problems using non-linear transformation models (named as deformable registration) are ill-posed, i.e., there are many different solutions that can give similar results.
To overcome the ill-posed nature of the problem, regularization is often used when using non-linear transformation models.
Completing the cost function with continuous regularization models¶
To make the optimization well-posed, a regularization is often added to the loss functions
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J) + \lambda r(T_{\theta}), T_{\theta} = \mathbf{x} + u(\mathbf{x})$$
where $r(T_{\theta})$ prefers some sort of transformation over others and $\lambda$ is a weight. Tikhonov regularization schemes that prefer smoothly varying $u(\mathbf{x})$ are often used. The generic form is
$$r(T_{\theta}) = \int_{\Omega} \phi(\nabla u) d\mathbf{x}$$
where $\nabla$ denotes gradient in space $\phi$ is a convex function with minimum at zero.
Examples are
$l_2$ regularization that prefers smoothly varying displacement dields
$$r(T_{\theta}) = \int_{\Omega} (\nabla u)^2 d\mathbf{x}$$
$l_1$ regularization that prefers piece-wise constant u(x)
$$r(T_{\theta}) = \int_{\Omega} |\nabla u| d \mathbf{x}$$
Regularization models from continuous mechanics¶
To make the optimization well-posed, a regularization is oftenn added to the loss functions
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J) + \lambda r(T_{\theta}), T_{\theta} = \mathbf{x} + u(\mathbf{x})$$
where $r(T_{\theta})$ prefers some sort of trnasformations over others. Energy terms from continuum mechanics are also commonly used.
Linear elastic regularization uses the Cauchy strain tensor $V = V(u)$ and minimizes the strain energy caused by the displacement field
$$r(T_{\theta}) = \int_{\Omega} \lambda Tr(V)^2 + \mu Tr(V)^2 d \mathbf{x}, V(u) = (\nabla u + \nabla u^T) / 2$$
where $\nu$ and $\mu$ are Lame constants and $\nabla$ is the gradient in space. This model effectively prefers small $u(\mathbf{x})$ values.
Hyperelastic regularization allows for large deformations using the simplest material model, the Saint Venant-Kirchhoff model with the Green-St. Venan strain tensor $E = E(u)$
$$r(T_{\theta}) = \int_{\theta} \frac{\lambda}{2} Tr(E)^2 + \mu Tr(E^2) d\mathbf{x}, E(u) = \frac{1}{2}\nabla T_{\theta}^T \nabla T_{\theta} - \mathbb{I} = \frac{1}{2} \left[ \nabla u^T + \nabla u + \nabla u^T \nabla u \right]$$
which prefers locally rigid transformations.
Complete registration model is ready for optimization¶
The complete registration loss function, together with the data term $d(\cdot)$ and the regularization $r(\cdot)$ is ready for optimization.
Linear transformation models - linear registration problem:
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J), T_{\theta} = \mathbf{AX} + \mathbf{t}$$
Non-linear transformation models - non-linear registration problem:
$$\mathcal{L}(\theta) = d(I, T_{\theta} \circ J) + \lambda r(T_{\theta}), T_{\theta} = \mathbf{x} + u(\mathbf{x})$$
Most commonly, the optimization can be done via continuous methods:
$$\theta_{t+1} = \theta_t + \alpha g(\mathcal{L(\theta_t)})$$
where $\alpha$ is a step-size parameter, $t$ indices optimization iterations and $g(\cdot)$ is a gradient-based function, the simplest being the gradient descent
$$\theta_{t+1} = \theta_t - \alpha \frac{d \mathcal{L}(\theta)}{d \theta}|_{\theta_{t}}$$
second order optimization schemes and discrete formulations of the optimization is also possible. Gradient-free techniques are also possible.
Remarks
- It is common to first align images with linear registration followed by deformable non-linear registration. This helps to get a better alignment.
- Even with regularization terms, registration can get stuck in local minima. To aviod this multi-scale optimization is used. At coarse resolution, registration is easier because there are not that many details in the images. Registration starts a the coarsest scale where the problem of aligning the images is easier. Results are used as initialization of the next step.