227-0391-00: Medical Image Analysis
Section 6
Interpretability, Uncertainty, and Data-Efficient Methods
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/09/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content in the summary may be incorrect. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from lectures by Professor Ender Konukoglu, Professor Mauricio Reyes, and Dr. Ertunc Erdil. Please contact us to delete this file if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Pixel-wise predictions with deep neural networks¶
Segmentation¶
Dataset¶
For a training set, we need high-quality images and a large number of them. Note that we also need a validation set and a test set.
How big of a dataset do we need? It depends on the problem, in particular on:
- How variable is the structure?
- How variable is the background?
- How variable is the intensity profile?
Example: For brain MRI with T1w MPRAGE images, it takes 3-5 labeled volumes to train a good UNet for segmenting large anatomical structures, e.g., white matter, gray matter, hippocampus. Generalizing from such an example to other problems is difficult.
Ambiguity and uncertainty in segmentation labels is very common¶
Different people might create different labels.
Missing data in problems requiring multiple sequences is very common¶
We may need to impute missing data during training or testing.
Variations between similar images are very common¶
They are all acquired with similar sequences (T1w MPRAGE). However, small differences in scanners and acquisition protocols (sequence parameters) can change the contrast between different tissues considerably.
Changes can happen during training. They can also happen during inference.
Cost function¶
Basic classification loss (binary cross entropy)¶
Pixel-wise extension of the binary classification loss. The basic binary classification loss for a single sample is
$$BCE(y, f(x; \theta)) = -\mathbb{1}(y=1) \log f(x; \theta) - \mathbb{1}(y = 0) \log(1 - f(x; \theta))$$
assuming $f(x; \theta)$ represents the probability of being of class $y = 1$ predicted by the network.
Extension to all the pixels is to define the loss at each pixel and sum over pixels
$$L(y, f(x; \theta)) = \sum_{r=1}^{D} BCE(y(r), f(x; \theta)(r))$$
where $y(r)$ and $f(x; \theta)(r)$ are the ground truth labels and predictions at pixel $r$, respectively.
Note that the network output $f(x; \theta)$ is assumed to be an image.
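The pixel-wise BCE above can be sketched in NumPy as follows; the function name and the clipping constant are my own choices, not from the lecture:

```python
import numpy as np

def pixelwise_bce(y, p, eps=1e-7):
    """Pixel-wise binary cross entropy, summed over pixels.

    y : ground-truth mask y(r), values in {0, 1}, shape (H, W)
    p : predicted foreground probabilities f(x; theta)(r), shape (H, W)
    """
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p)))

# Toy 2x2 example: confident correct predictions give a small loss,
# uniform predictions give a larger one.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
p_good = np.array([[0.99, 0.01], [0.01, 0.99]])
p_bad = np.full((2, 2), 0.5)
```

In a real pipeline one would average over pixels and the batch, but the sum matches the formula as written above.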
Multi-class case (cross entropy)¶
Just like classification problems, segmentation problems often have multiple output labels. Then we extend the multi-class classification loss
$$CE(y, f(x; \theta)) = -\sum_{k} p(y = k) \log f_k(x; \theta)$$
Here we use a different notation:
The network assigns a probability to each output class, indicating the probability that the given sample $x$ belongs to class $k$.
If the ground truth has no uncertainty, which is the case in most applications, then $p(y = k)$ equals 1 for the true class and 0 for all others.
Extension to all the pixels is to define the loss at each pixel and sum over pixels
$$L(y, f(x; \theta)) = \sum_{r=1}^{D} CE(y(r), f(x; \theta)(r))$$
Loss over a set of samples is defined similarly for both cases: $\mathcal{L}(\mathcal{D}; \theta) = \sum_{n=1}^{N} L(y_n, f(x_n; \theta))$.
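A minimal NumPy sketch of the pixel-wise multi-class cross entropy, using the channels-first layout assumed here (the function name is my own):

```python
import numpy as np

def pixelwise_ce(y_prob, probs, eps=1e-7):
    """Pixel-wise multi-class cross entropy, summed over pixels.

    y_prob : p(y(r) = k), shape (K, H, W); one-hot when labels are certain
    probs  : softmax outputs f_k(x; theta)(r), shape (K, H, W)
    """
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(np.sum(-y_prob * np.log(probs)))

# Two classes on a 2x2 image: foreground one-hot map and its complement
y1 = np.array([[1.0, 0.0], [0.0, 1.0]])
y_onehot = np.stack([y1, 1.0 - y1])  # shape (2, 2, 2)
```

With perfect one-hot predictions the loss is (near) zero; with uniform predictions it equals $D \log K$, here $4 \log 2$.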
Challenge with the (binary) cross entropy¶
When there are small structures, either compared to the background or to other classes (in the multi-class problem), the losses associated with the small structures are dominated by the losses associated with the larger structures.
$$ \begin{align} L(y, f(x; \theta)) &= \sum_{r = 1}^{D}CE(y(r), f(x; \theta)(r))\\ &= \sum_{r \in \Omega_{small}} CE(y(r), f(x; \theta)(r)) + \sum_{r \in \bar{\Omega_{small}}} CE(y(r), f(x; \theta)(r)) \\ &= L_{\Omega_{small}} + L_{\bar{\Omega_{small}}} \end{align} $$
It is very likely that $L_{\Omega_{small}} \ll L_{\bar{\Omega}_{small}}$ because $|\Omega_{small}| \ll |\bar{\Omega}_{small}|$.
Alternative: Sorensen-Dice Coefficient¶
$$DSC = \frac{2 |A \cap B|}{|A| + |B|}$$
where $A$ and $B$ are masks.
Used for evaluating quality of segmentation predictions
DSC = 1 when the sets perfectly overlap
DSC = 0 when there is no overlap
The issue is that DSC on hard binary masks is not differentiable and cannot be used directly during training; the soft versions below replace the masks with predicted probabilities.
For binary segmentation
$$L(y, f(x; \theta)) = \frac{2 \sum_{r = 1}^{D} y(r) f(x; \theta)(r)}{\sum_{r=1}^{D}y(r)^2 + \sum_{r=1}^{D}f(x; \theta)(r)^2}$$
For multi-class segmentation
$$L(y, f(x; \theta)) = \sum_{k = 1}^{K} \frac{2 \sum_{r = 1}^{D} p(y(r) = k) f_k(x; \theta)(r)}{\sum_{r=1}^{D}p(y(r) = k)^2 + \sum_{r=1}^{D}f_k(x; \theta)(r)^2}$$
Each class's DSC term is equal in magnitude despite the classes having different sizes, contrary to cross entropy.
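A NumPy sketch of the soft multi-class Dice term above, with ground-truth probabilities and predictions in channels-first layout (function name and epsilon are my own):

```python
import numpy as np

def soft_dice(y_prob, probs, eps=1e-7):
    """Soft (differentiable) Dice coefficient, summed over classes.

    y_prob : p(y(r) = k), shape (K, H, W)
    probs  : softmax outputs f_k(x; theta)(r), shape (K, H, W)
    """
    num = 2.0 * np.sum(y_prob * probs, axis=(1, 2))          # per-class numerator
    den = np.sum(y_prob**2, axis=(1, 2)) + np.sum(probs**2, axis=(1, 2)) + eps
    return float(np.sum(num / den))

# Perfect prediction: each class contributes ~1, so the sum is ~K
y1 = np.array([[1.0, 0.0], [0.0, 1.0]])
y_onehot = np.stack([y1, 1.0 - y1])  # K = 2 classes
```

To *minimize* during training one typically uses the negative of this quantity (or $K$ minus it), so that a perfect overlap gives the lowest loss.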
Architecture¶
Binary segmentation¶
Given the following deep neural network, where the $\rightarrow$ represents convolutional link, $\Rightarrow$ represents the fully connected link, $I$ is the input image, $L\#$ are the intermediate layers, and $O \in [0, 1]$ is the output.
$$I \rightarrow L1 \rightarrow L2 \rightarrow L3 \rightarrow L4 \Rightarrow L5 \Rightarrow O$$
Keep in mind that
We would like an output of image size (the output size equals the input size)
The spatial size of the feature maps reduces from $L1$ to $L4$ due to pooling
$$I \rightarrow L1 \rightarrow L2 \rightarrow L3 \rightarrow L4 \rightarrow L5 \stackrel{\uparrow s}{\rightarrow} O$$
We upsample within the network and convert an intermediate layer $L5$ to the required image size. The upsampling factor $s$ depends on the spatial size of the feature maps at $L5$.
We can use convolution layers that keep the dimensions of the incoming channels the same, e.g., padded convolutions and no pooling. Moreover, for the last layer, we need a sigmoid activation to make sure the output is in $[0, 1]$.
$$I \rightarrow 32 \rightarrow 128 \rightarrow 256 \stackrel{\uparrow s}{\rightarrow} 128 {\color{red}~\rightarrow~} 32 {\color{blue}~\rightarrow~} O$$
$\rightarrow$: 1-padding with $3 \times 3$ convolutions, ReLU activations and pooling
${\color{red}\rightarrow}$: 1-padding with $3 \times 3$ convolutions, ReLU activations
${\color{blue}\rightarrow}$: 1-padding with $3 \times 3$ convolutions, Sigmoid activations
Multi-class segmentation¶
$$I \rightarrow L1 \rightarrow L2 \rightarrow L3 \rightarrow L4 \stackrel{\uparrow s}{\rightarrow} L5 {\color{blue}~\rightarrow~} O$$
The output $O$ has multiple channels in this case. Each channel is an image indicating probabilities of each pixel belonging to one class.
Softmax type non-linear activation
$$f_k(x; \theta)(r) = O(r) = \frac{\exp(a_k(r))}{\sum_{j} \exp(a_j(r))}$$
where $a_j(r)$ is the activation in channel $j$ at pixel $r$.
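The per-pixel softmax over channels can be sketched as follows (channels-first layout; the max subtraction is a standard numerical-stability trick, not from the lecture):

```python
import numpy as np

def pixelwise_softmax(a):
    """Softmax across the channel axis of activations a_k(r), shape (K, H, W)."""
    a = a - a.max(axis=0, keepdims=True)  # subtract per-pixel max for stability
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

# At every pixel, the K channel values now sum to 1
probs = pixelwise_softmax(np.random.randn(3, 4, 4))
```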
Challenges with this idea¶
This simple idea provides a segmentation, albeit at reduced resolution. The problem is that by $L5$, so many pooling operations have been applied that fine details are lost.
The key question is how to use both contextual information and retain high-resolution information.
Alternative 1: Fully convolutional networks (FCN) with hierarchical combination¶
Combining predictions from different scales retains high-resolution details in the segmentation maps. Note that this method also upsamples at intermediate layers.
Alternative 2: UNet¶
$\rightarrow_{\downarrow 2}$: convolution (same), non-linear activation, pooling
$\rightarrow_{\uparrow 2}$: bilinear upsampling, convolution (same), non-linear activation OR
$\rightarrow_{\uparrow 2}$: transposed convolution, non-linear activation
${\color{red}\rightarrow}$: convolution (same), non-linear activation
${\color{blue}\rightarrow}$: convolution (same), sigmoid or softmax activation
Remark
Skip connections can retain details.
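A shape-level sketch of what a UNet skip connection does: decoder features are upsampled and concatenated with the same-resolution encoder features along the channel axis. The shapes and the nearest-neighbour upsampling are illustrative assumptions, not the lecture's exact architecture:

```python
import numpy as np

# Encoder features at some resolution, decoder features at half that resolution
enc = np.random.rand(64, 32, 32)   # (C_enc, H, W)
dec = np.random.rand(128, 16, 16)  # (C_dec, H/2, W/2)

# Nearest-neighbour upsampling by a factor of 2 along H and W
dec_up = dec.repeat(2, axis=1).repeat(2, axis=2)  # (128, 32, 32)

# Skip connection: concatenate along channels; a convolution would then mix
# the high-resolution encoder detail with the upsampled context features
merged = np.concatenate([enc, dec_up], axis=0)    # (192, 32, 32)
```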
Some other alternatives¶
DeepMedic
3D extensions are available
Transformer technologies within UNet structures are immensely successful
Integrating transformer blocks in the encoding path
Architecture design is an active research area
Restoration¶
Dataset¶
The dataset poses similar issues to the ones we met in segmentation.
Missing data in multimodal cases
Images may be coming from different "domains," e.g., centers, scanners, sequences
Ground truth labels may not be available in all cases
- It may not be possible to acquire ground truth images because
  - it is difficult to acquire such images due to physical limitations of the acquisition system
  - it is unethical to acquire additional ground truth images
- Using synthetic data is a solution
Cost function¶
Basic cost functions¶
Mean absolute error (MAE)
$$MAE = \sum_{r} |y(r) - f(x; \theta)(r)|$$
Mean squared error (MSE)
$$MSE = \sum_{r} \| y(r) - f(x; \theta)(r) \|_2^2$$
Normalized mean squared error (NMSE)
$$NMSE = \frac{MSE}{\sum_{r}\| y(r) \|_2^2}~~~~\text{or}~~~~NMSE=\frac{MSE}{\| \max(y) - \min(y) \|_2^2}$$
Peak signal to noise ratio (PSNR)
$$PSNR = 10 \log_{10} \frac{\max(y)^2}{MSE}$$
Structural similarity index measure (SSIM)
$$SSIM(a, b) = \frac{(2 \mu_a \mu_b + c_1)(2 \sigma_{ab} + c_2)}{(\mu_a^2 + \mu_b^2 + c_1)(\sigma_a^2 + \sigma_b^2 + c_2)}$$
Computed over many patches $a$ and $b$ coming from $y$ and $f(x; \theta)$. Combination of local luminance, contrast and structure comparisons.
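The basic cost functions above can be sketched in NumPy; note that the summary writes MAE/MSE as sums over pixels, while PSNR is conventionally computed from the per-pixel *mean* squared error, which is what this sketch uses inside the log (function name is my own):

```python
import numpy as np

def restoration_metrics(y, f):
    """MAE, MSE, NMSE, and PSNR between ground truth y and prediction f."""
    mae = float(np.sum(np.abs(y - f)))        # sum of absolute errors
    mse = float(np.sum((y - f) ** 2))         # sum of squared errors
    nmse = mse / float(np.sum(y ** 2))        # normalized by signal energy
    # PSNR uses the per-pixel mean squared error and the peak value of y
    psnr = 10.0 * np.log10(float(y.max()) ** 2 / np.mean((y - f) ** 2))
    return mae, mse, nmse, psnr

y = np.array([[1.0, 0.5], [0.25, 0.75]])
f = y + 0.1  # a uniformly biased "prediction"
mae, mse, nmse, psnr = restoration_metrics(y, f)
```

With a constant error of 0.1 on a 4-pixel image, MAE is 0.4, the summed MSE is 0.04, and PSNR is $10 \log_{10}(1 / 0.01) = 20$ dB.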
Advanced cost function 1: Perceptual loss¶
While it is unclear what it means to capture "perceptual differences," one can use neural networks to do this.
$L1$, $L2$, $L3$, and $L4$ are layers of a previously trained CNN
Perceptual loss is defined as
$$\sum_{j} MSE(\phi_j^y, \phi_{j}^f)$$
where $\phi_j^y$ and $\phi_j^f$ are the feature maps at layer $j$ for the ground truth $y$ and the prediction $f(x; \theta)$, respectively.
The CNN can be a well-established network trained on natural images, e.g., VGG, or a network trained on CT images for another task
The CNN can be much deeper
Distances between "deep features" measure differences in contextual information, going beyond simple pixel-wise differences.
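Once the feature maps $\phi_j$ have been extracted (e.g., from a pretrained VGG), the perceptual loss itself is just a sum of per-layer MSEs. The sketch below takes the feature maps as given; the extractor itself is outside its scope:

```python
import numpy as np

def perceptual_loss(feats_y, feats_f):
    """Sum of per-layer MSEs between feature maps of y and f(x; theta).

    feats_y, feats_f : lists of arrays, one feature map phi_j per layer,
    extracted from a previously trained CNN (placeholders here).
    """
    return float(sum(np.mean((a - b) ** 2) for a, b in zip(feats_y, feats_f)))

# Placeholder feature maps for two layers of a hypothetical extractor
feats_y = [np.random.rand(8, 16, 16), np.random.rand(16, 8, 8)]
feats_f = [a + 0.05 for a in feats_y]  # slightly perturbed features
```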
Advanced cost function 2: Adversarial loss¶
Another way to capture perceptual differences is adversarial losses.