227-0391-00: Medical Image Analysis
Section 5
Deep Learning Models
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/08/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from the lectures of Professor Ender Konukoglu, Professor Mauricio Reyes, and Dr. Ertunc Erdil. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Convolutional Neural Network (CNN)¶
Introduction to neural networks¶
$$ \begin{align} y &= f^*(x) = g(x; \theta) \\ \theta^* &= \arg \min_{\theta} L(y, g(x; \theta)) + R(\theta)\\ g(x; \theta) &= g_L \circ \cdots \circ g_2 \circ g_1(x) \end{align} $$
where $\theta$ is the set of learnable parameters, $L(y, g(x; \theta))$ is the loss function, and $R(\theta)$ is the regularization term.
Multilayer perceptron (MLP)¶
$$g(x; \theta) = g_L(h^{L - 1}; W_L, b_L) \circ \cdots \circ g_1(x; W_1, b_1)$$
where $\sigma_L$ is the softmax activation for multi-class classification, $h^l = \sigma_{l}(W_l h^{l-1} + b_l)$, $h^l \in \mathbb{R}^{d_l \times 1}$, $h^0 = x$, and $h^L = g(x; \theta)$.
For a single layer in a CNN, we have $h^l = W_l * h^{l-1} + b_l$ (before the activation), where $W_l$ is the convolution filter and $*$ denotes convolution.
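To make the composition of layers concrete, here is a minimal NumPy sketch of an MLP forward pass; the layer sizes and the ReLU/softmax choices are illustrative assumptions, not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0)
softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()

# assumed layer sizes: 4096 -> 256 -> 64 -> 16 (for illustration only)
sizes = [4096, 256, 64, 16]
Ws = [0.01 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

h = rng.standard_normal(sizes[0])                  # h^0 = x
for l, (W, b) in enumerate(zip(Ws, bs), start=1):
    sigma = softmax if l == len(Ws) else relu      # sigma_L = softmax, otherwise ReLU
    h = sigma(W @ h + b)                           # h^l = sigma_l(W_l h^{l-1} + b_l)
print(h.shape, round(float(h.sum()), 6))           # (16,) 1.0
```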
Activation functions¶
Convolution layers¶
Convolution is the integral of the product of two functions, after one of them is reversed and shifted.
Convolution operation - 1D¶
Consider two functions of a single variable, $f(t)$ and $g(t)$
Continuous case: $(f * g)(t) = \int f(\tau) g(t - \tau) \, d\tau$
Discrete case: $(f * g)(t) = \sum_{\tau=-\infty}^{\infty} f(\tau) g(t - \tau)$
Convolution operation - 2D¶
Consider the convolution of two functions of two variables: an image $I(x, y)$ and a convolutional kernel $k(x, y)$
Discrete case: $(I * k)(x, y) = \sum_{m} \sum_{n} I(m, n) k(x - m, y - n)$
Convolution is commutative (this property is useful in some proofs)
$$(I * k)(x, y) = (k * I)(x, y) = \sum_{m} \sum_{n} I(x - m, y - n) k(m, n)$$
Convolution and cross-correlation¶
Convolution: $(k * I)(x, y) = \sum_{m} \sum_{n} I(x - m, y - n)k(m, n)$
Cross-correlation: $(k \star I)(x, y) = \sum_{m} \sum_{n} I(x + m, y + n)k(m, n)$
In practice, many machine learning libraries implement cross-correlation under the name of convolution
In the literature, convolutional kernels are also referred to as convolutional weights and convolutional filters
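To see the difference numerically, here is a small SciPy sketch (the toy image and kernel are assumptions); as stated above, deep learning libraries such as PyTorch compute cross-correlation under the name convolution.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

I = np.arange(16, dtype=float).reshape(4, 4)        # toy "image"
k = np.array([[1., 0.], [0., -1.]])                 # toy kernel

conv = convolve2d(I, k, mode="valid")               # kernel is flipped before sliding
corr = correlate2d(I, k, mode="valid")              # kernel is used as-is

# convolution equals cross-correlation with a 180-degree flipped kernel
print(np.allclose(conv, correlate2d(I, k[::-1, ::-1], mode="valid")))  # True
print(np.array_equal(conv, corr))                                      # False in general
```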
Fully connected vs. convolution¶
Note that the nonlinear activations remain the same in both cases.
Fully connected¶
- $a_{l, k} = \sum_{j} w_{l,k,j} h_{l-1,j}+b_{l,k}$
- Each $h_{l-1, j}$ is a number corresponding to the value of the $j^{th}$ neuron at layer $l - 1$, and so is $a_{l,k}$
- Each $h_{l}$ is a vector of neurons and $a_l$ is a vector of activations before applying the activation function
- There is a separate $w_{l,k,j}$ linking each neuron $h_{l-1, j}$ to each $a_{l,k}$
- High dimensions of $h_{l-1}$ and $a_l$ lead to a high number of weights
Convolution¶
- $a_{l, k} = \sum_{j} w_{l,k,j} * h_{l-1,j}+b_{l,k}$ (see the sketch after this list)
- Each $h_{l-1, j}$ is an image of neurons corresponding to the $j^{th}$ channel at layer $l-1$, and so is $a_{l,k}$
- There is a separate convolutional kernel linking images $h_{l-1,j}$ and $a_{l,k}$. Same kernel applies to the entire image of neurons
- $h_{l-1, j}$ are called channels
- $w_{l,k,j}$ are convolutional filters
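A minimal sketch of the convolutional case: one output channel $a_{l,k}$ is obtained by sliding a separate small kernel over each input channel $h_{l-1,j}$ and summing over $j$; the channel counts and kernel size are assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
d_prev, H, W = 3, 8, 8                       # d_{l-1} input channels of size 8x8 (assumed)
h_prev = rng.standard_normal((d_prev, H, W))
w_k = rng.standard_normal((d_prev, 3, 3))    # one 3x3 kernel w_{l,k,j} per input channel j
b_k = 0.1

# a_{l,k} = sum_j w_{l,k,j} * h_{l-1,j} + b_{l,k}: the same kernel slides over the whole channel
a_k = sum(correlate2d(h_prev[j], w_k[j], mode="valid") for j in range(d_prev)) + b_k
print(a_k.shape)                             # (6, 6): valid convolution, 8 - 3 + 1
```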
Convolutions instead of projections¶
Sparser connection¶
Layer size and number of parameters¶
- First layer
- Fully connected
- $x \in \mathbb{R}^{3 \times N_0 \times M_0}$
- $h_1 \in \mathbb{R}^{d_1}$
- $W_1 \in \mathbb{R}^{d_1 \times \{3 \times N_0 \times M_0\}}$
- Convolutional
- $x \in \mathbb{R}^{3 \times N_0 \times M_0}$
- $h_1 \in \mathbb{R}^{d_{1} \times (N_1 \times M_1)}$
- $W_1 = \{w_{1,1}, w_{1,2}, \cdots, w_{1, d_1}\}$
- $w_{1, j} \in \mathbb{R}^{3 \times k_1 \times k_2}$
- $3 \times k_1 \times k_2$: kernel size
- $W_1 \in \mathbb{R}^{d_1 \times \{3 \times k_1 \times k_2\}}$
- Intermediate layer
- Fully connected
- $h_{l - 1} \in \mathbb{R}^{d_{l - 1}}$
- $h_l \in \mathbb{R}^{d_l}$
- $W_l \in \mathbb{R}^{d_l \times d_{l - 1}}$
- Convolutional
- $h_{l-1} \in \mathbb{R}^{d_{l-1} \times N_{l-1} \times M_{l-1}}$
- $h_{l} \in \mathbb{R}^{d_l \times (N_l \times M_l)}$
- $W_l = \{w_{l,j,k}\}$, where $j = 1, \cdots, d_{l-1}$ and $k = 1, \cdots, d_l$
- $W_l \in \mathbb{R}^{d_l \times d_{l-1} \times (k_1 \times k_2)}$
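The weight shapes above can be checked directly in PyTorch; the concrete numbers ($d_{l-1} = 32$, $d_l = 64$, $5 \times 5$ kernels) are assumptions for illustration.

```python
import torch.nn as nn

d_prev, d_l, k1, k2 = 32, 64, 5, 5

fc = nn.Linear(d_prev, d_l)                           # W_l in R^{d_l x d_{l-1}}
conv = nn.Conv2d(d_prev, d_l, kernel_size=(k1, k2))   # W_l in R^{d_l x d_{l-1} x k1 x k2}

print(fc.weight.shape)     # torch.Size([64, 32])
print(conv.weight.shape)   # torch.Size([64, 32, 5, 5])
```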
Kernel¶
When the kernel is placed in the interior of the image, there is no problem
When it is placed on the boundary, the out-of-boundary values are not well defined
Two options
- Valid convolution: only evaluate the convolution when all the elements are defined
- The kernel is only evaluated where it fully overlaps the image (the red area in the figure)
- We lose pixels at the borders of the image ($k_1 - 1$ rows and $k_2 - 1$ columns in total)
- $h_l \in \mathbb{R}^{d_l \times (N_{l} \times M_{l})}$
- $h_{l-1} \in \mathbb{R}^{d_{l - 1} \times (N_{l - 1} \times M_{l - 1})}$
- $W_{l} \in \mathbb{R}^{d_l \times d_{l - 1} \times (k_1 \times k_2)}$
- $N_l = N_{l - 1} - k_1 + 1$
- $M_l = M_{l - 1} - k_2 + 1$
- Same convolution (padding): pad the boundaries so that the results of the convolution have the same size
- Alternatively, we can pad the image on the boundaries so that the channels will have the same size across layers
- $h_{l - 1} \in \mathbb{R}^{d_{l - 1} \times (N_{l - 1} \times M_{l - 1})}$
- $h_{l - 1} \in \mathbb{R}^{d_{l - 1} \times (N_{l - 1} + k_1 - 1) \times (M_{l - 1} + k_2 - 1)}$, after padding
- $h_l \in \mathbb{R}^{d_l \times (N_l \times M_l)}$
- $M_l = M_{l - 1}$
- $N_l = N_{l - 1}$
- The value we pad with is a design choice. Zero is commonly used, but one can also use symmetric padding
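A quick PyTorch shape check of the two options (input size and channel counts are assumed; `padding=2` realizes zero-padded same convolution for a $5 \times 5$ kernel):

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 3, 64, 64)                       # d_{l-1} = 3 channels of size 64x64

valid = nn.Conv2d(3, 8, kernel_size=5, padding=0)   # "valid" convolution
same = nn.Conv2d(3, 8, kernel_size=5, padding=2)    # "same": pad (k-1)/2 = 2 zeros per side

print(valid(x).shape)   # torch.Size([1, 8, 60, 60]): N_l = 64 - 5 + 1
print(same(x).shape)    # torch.Size([1, 8, 64, 64]): N_l = N_{l-1}
```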
Problems using MLPs with grid-like structures¶
$$h^l = \sigma_l (W_l h^{l - 1} + b_l)$$
The linear transformation has three issues
Fully connected links lead to too many parameters
Classification network with 16 possible output classes
Q: How many parameters are in the following fully connected network, where each arrow is a fully connected link, assuming one bias term per neuron in each layer? $$x \in \mathbb{R}^{64 \times 64} \rightarrow 256 \rightarrow 128 \rightarrow 64 \rightarrow 16$$
A: Input size is $64 \times 64 = 4096$.
- $4096 \rightarrow 256$
- Weights: $4096 \times 256 = 1048576$
- Biases: 256
- $256 \rightarrow 128$
- Weights: $256 \times 128 = 32768$
- Biases: 128
- $128 \rightarrow 64$
- Weights: $128 \times 64 = 8192$
- Biases: 64
- $64 \rightarrow 16$
- Weights: $64 \times 16 = 1024$
- Biases: 16
- $\text{Total parameters} = 1048576 + 256 + 32768 + 128 + 8192 + 64 + 1024 + 16 = 1091024$
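The count can be verified with PyTorch; this is only a parameter-count sketch, so activation functions are omitted (they add no parameters):

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(64 * 64, 256),
    nn.Linear(256, 128),
    nn.Linear(128, 64),
    nn.Linear(64, 16),
)
print(sum(p.numel() for p in mlp.parameters()))   # 1091024
```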
Q: How many parameters and how many hidden neurons are in each layer of the following convolutional neural network, where double arrows are "valid" convolution links with $5 \times 5$ kernels, single arrows are fully connected links, and assuming one bias per output channel for convolutional links? $$x \in \mathbb{R}^{64 \times 64} \Rightarrow 256 \Rightarrow 128 \Rightarrow 64 \rightarrow 16$$
A:
- Conv layer input $64 \times 64 \times 1 \rightarrow 256$ channels
- Kernel: $5 \times 5$, input channels $1 \rightarrow$ output channels $256$
- Output size: $(64 - 5 + 1) = 60 \rightarrow 60 \times 60 \times 256$
- Weights: $256 \times 1 \times 5 \times 5 = 6400$
- Biases: 256
- Conv layer $256 @ 60 \times 60 \rightarrow 128$ channels
- Output size: $60 - 5 + 1 = 56 \rightarrow 56 \times 56 \times 128$
- Weights: $128 \times 256 \times 5 \times 5 = 819200$
- Biases: 128
- Conv layer $128 @ 56 \times 56 \rightarrow 64$ channels
- Output size: $56 - 5 + 1 = 52 \rightarrow 52 \times 52 \times 64$
- Weights: $64 \times 128 \times 5 \times 5 = 204800$
- Biases: 64
- Fully connected layer $64 \times 52 \times 52 \rightarrow 16$
- Input size: $64 \times 52 \times 52 = 173056$
- Weights: $173056 \times 16 = 2768896$
- Biases: 16
- $\text{Hidden neurons} = 60 \times 60 \times 256 + 56 \times 56 \times 128 + 52 \times 52 \times 64 = 1496064$
- $\text{Total parameters} = 6400 + 256 + 819200 + 128 + 204800 + 64 + 2768896 + 16 = 3799760$
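The same kind of check for the convolutional network (again a sketch; activations and the final softmax are omitted since they carry no parameters):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 256, kernel_size=5),      # "valid" convolution: no padding
    nn.Conv2d(256, 128, kernel_size=5),
    nn.Conv2d(128, 64, kernel_size=5),
    nn.Flatten(),
    nn.Linear(64 * 52 * 52, 16),
)
print(sum(p.numel() for p in cnn.parameters()))     # 3799760
print(cnn[:3](torch.zeros(1, 1, 64, 64)).shape)     # torch.Size([1, 64, 52, 52])
```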
Images are composed of a hierarchy of local statistics
- Hierarchically aggregating local spatial features extracts task-specific features from the images as the layers progress
Lack of translation invariance
- Translation invariance is not native to fully connected networks
- These images will most likely lead to very different activations in the hidden layer
- The rest of the network will see different activations
- In various vision applications, these images should lead to identical outputs
- Convolution can help - it is translation equivariant (see the sketch after this list)
- Translation equivariance: applying a transformation to the input yields the same result as applying the transformation to the output
$$f(T(x)) = T(f(x))$$
- However, convolution does not have translation invariance
$$f(T(x)) \neq f(x)$$
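A small sketch of translation equivariance: a convolution with circular padding commutes exactly with circular shifts, while it is clearly not invariant. The kernel, image size, and the circular-padding choice are assumptions made for the demonstration.

```python
import torch
import torch.nn.functional as F

def conv(x, w):
    # 3x3 convolution with circular padding, so circular shifts commute exactly
    return F.conv2d(F.pad(x, (1, 1, 1, 1), mode="circular"), w)

T = lambda t: torch.roll(t, shifts=3, dims=-1)   # T: shift the image 3 pixels along the width

x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)

# f(T(x)) == T(f(x)): equivariance
print(torch.allclose(conv(T(x), w), T(conv(x, w)), atol=1e-5))   # True
# f(T(x)) != f(x): no invariance
print(torch.allclose(conv(T(x), w), conv(x, w), atol=1e-5))      # False
```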
Strides¶
Assume we use convolution kernels of size $5 \times 5$ and valid convolution in a recognition system; the final output channel size should be $1 \times 1$ (with the number of channels equal to the number of classes)
We would need many layers or very large kernels
In a normal convolution, we only move one pixel in each direction, not skipping any pixels
However, we can decide to skip several pixels while shifting the kernel, e.g., moving the kernel by 2 pixels at a time (stride 2)
This leads to a reduction in channel size
$$N_l = \left\lfloor \frac{N_{l - 1} - k_1}{s_1} \right\rfloor + 1, \quad M_l = \left\lfloor \frac{M_{l - 1} - k_2}{s_2} \right\rfloor + 1$$
The dimension drops very quickly. The rate of drop will be faster if stride increases
We lose information, higher stride means higher loss of information
We do not gain translation invariance
If used, it is most common to use stride 2 in all directions
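A quick shape check of the strided-convolution formula (sizes assumed):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=5, stride=2)    # valid convolution with stride 2
print(conv(torch.zeros(1, 1, 64, 64)).shape)       # torch.Size([1, 8, 30, 30]): floor((64-5)/2)+1
```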
Pooling¶
Pool information in a neighborhood
Represents the region with one number, summarizes information
Applied to each channel separately
Max-pooling - maximum of the activation values
Min-pooling - minimum of the activation values
Both are non-linear operations, like median filtering
Average pooling - a linear operator
Max-pooling is the most commonly used version
Max-pooling¶
Represents the entire region with the neuron that achieves the highest activation
Leads to partial local translation invariance
The maximum (617 in the highlighted area of the example) can be in any of the neurons within the pooling window; the pooled value will not change
Does not lead to complete translation invariance
Often applied with strides equal to the size of the pooling kernel. Note that the pooling kernel does not contain any learnable parameters
Leads to substantial dimensionality reduction
Even when the pooling kernel is of size $2 \times 2$, it can halve the image size in each dimension
As the size of the pooling kernel increases, the reduction increases as well
Non-linear dimensionality reduction
Only the most prominent activation is transmitted to the next layer
More advanced pooling mechanisms exist, e.g., CapsuleNets
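A short max-pooling sketch showing both the dimensionality reduction and the partial local translation invariance; the tensors and the peak value are toy assumptions, not the lecture example:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)     # stride equal to the pooling kernel size

x = torch.randn(1, 8, 56, 56)
print(pool(x).shape)                             # torch.Size([1, 8, 28, 28]): halved in each dimension

# partial local translation invariance: moving the peak within its 2x2 window changes nothing
y = torch.zeros(1, 1, 4, 4)
y[0, 0, 1, 0] = 9.0
y_shifted = torch.roll(y, shifts=1, dims=-1)     # peak moves from (1,0) to (1,1), same window
print(torch.equal(pool(y), pool(y_shifted)))     # True
```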
Transformers¶
Self-attention¶
Attention is a function of the input tokens $x_1, x_2, \cdots, x_N$, where each $x_i \in \mathbb{R}^{D \times 1}$
$$y_i = Attention(x_1, \cdots, x_i, \cdots, x_N) = \sum_{j=1}^{N} a_{ij} x_j$$
where $a_{ij} = \frac{\exp(x_i^T x_j)}{\sum_{j'=1}^{N} \exp(x_i^T x_{j'})}$ is the (scalar) attention coefficient, with constraints $a_{ij} \ge 0$ and $\sum_{j=1}^N a_{ij} = 1$.
In matrix notation, with $X \in \mathbb{R}^{N \times D}$ stacking the tokens as rows:
$$Y = Softmax(XX^T)X$$
Note that softmax is applied to each row separately.
Add learnable parameters¶
$$\tilde{X} = XU, U \in \mathbb{R}^{D \times D}$$
$$Y = Softmax(\tilde{X}\tilde{X}^T)\tilde{X}$$
$$Y = Softmax(XUU^TX^T)XU$$
Note that $U$ is the added parameter; however, $XUU^TX^T$ is symmetric, which is the problem: the attention score from token $i$ to token $j$ is forced to equal the score from $j$ to $i$ (before the row-wise softmax).
Add different learnable parameters for Key, Query, and Value¶
$$K = XW_K, W_K \in \mathbb{R}^{D \times D_K}$$
$$Q = XW_Q, W_Q \in \mathbb{R}^{D \times D_Q}$$
$$V = XW_V, W_V \in \mathbb{R}^{D \times D_V}$$
$$Y = Softmax(QK^T)V$$
Note that $QK^T$ is no longer constrained to be symmetric.
Usually $D_K = D_Q = D_V = D$ ($D_K$ and $D_Q$ must always match for $QK^T$ to be defined), but $D_V$ can be different sometimes.
Metaphor: Netflix¶
Hard attention - in retrieval we select the most similar movie to the query
In transformers
- Generalize hard attention to soft attention
- The new representation for the query becomes the weighted average of the values
Scaled self-attention¶
Assume the entries of $Q$ and $K$ are drawn from a univariate standard Gaussian
The gradient of the softmax function becomes exponentially small for very small and very large inputs
If the entries of $Q$ and $K$ are $N(0, 1)$, then the entries of $QK^T$ have variance $D_K$
$$Y = Softmax\left( \frac{QK^T}{\sqrt{D_K}} \right) V$$
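A minimal NumPy sketch of scaled self-attention with key, query, and value projections (the projection matrices are random here, standing in for learned parameters; all sizes are assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, D, D_K, D_V = 6, 16, 16, 16                  # assumed sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))
W_Q, W_K, W_V = (rng.standard_normal((D, d)) for d in (D_K, D_K, D_V))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = softmax(Q @ K.T / np.sqrt(D_K), axis=-1)    # row-wise softmax: each row sums to 1
Y = A @ V

print(Y.shape, A.sum(axis=1))                   # (6, 16) [1. 1. 1. 1. 1. 1.]
```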
Multi-head self-attention¶
Remember that we have multiple channels in CNNs: there might be different useful features to extract at each layer, so we use multiple filters
Similarly, there might be multiple patterns that require attention, so we use multiple attention heads
$$H_h = Attention(Q_h, K_h, V_h), \forall h \in [1, H]$$
where $Attention(Q_h, K_h, V_h) = Softmax \left( \frac{Q_h K_h^T}{\sqrt{D_{K_h}}} \right) V_h$, $K_h = XW_{K_h}$, $Q_h = XW_{Q_h}$, and $V_h = XW_{V_h}$.
As a result, we have
$$Y = Concat[H_1, H_2, \cdots, H_H] W_0$$
where $Concat[H_1, H_2, \cdots, H_H] \in \mathbb{R}^{N \times HD_V}$ and $W_0 \in \mathbb{R}^{HD_V \times D}$.
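For multi-head self-attention, PyTorch's built-in `nn.MultiheadAttention` performs the per-head projections, the concatenation, and the output projection internally; the dimensions below are assumptions:

```python
import torch
import torch.nn as nn

D, H, N = 64, 8, 10                                   # embedding dim, heads, tokens (assumed)
mha = nn.MultiheadAttention(embed_dim=D, num_heads=H, batch_first=True)

X = torch.randn(1, N, D)                              # one sequence of N tokens
Y, attn = mha(X, X, X)                                # self-attention: query = key = value = X
print(Y.shape, attn.shape)                            # torch.Size([1, 10, 64]) torch.Size([1, 10, 10])
```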
Transformer layers¶
Positional encoding¶
Remember that CNNs are translation equivariant
$$f(T(X)) = T(f(X))$$
Transformers are equivariant with respect to input permutations
$$f(P(X)) = P(f(X))$$
Encode the token order in the data
$$\tilde{x}_i = x_i + r_i$$
where $r_i, x_i \in \mathbb{R}^{D \times 1}$
An ideal positional encoding should
- Provide a unique representation for each position
- Be bounded
- Generalize to longer sequences (an encoding defined only up to a fixed length cannot adapt to longer sequences)
- Encode the relative positions of tokens
$$ r_{ni} = \begin{cases} \sin \left( \frac{n}{L^{i/D}} \right) & \text{if }i\text{ is even} \\ \cos \left( \frac{n}{L^{(i-1)/D}} \right) & \text{if }i\text{ is odd} \\ \end{cases} $$
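A NumPy sketch of the sinusoidal encoding above, assuming $D$ is even and using the common choice $L = 10000$:

```python
import numpy as np

def positional_encoding(num_positions, D, L=10000):
    # r_{n,i} = sin(n / L^{i/D}) for even i, cos(n / L^{(i-1)/D}) for odd i
    n = np.arange(num_positions)[:, None]
    i = np.arange(0, D, 2)[None, :]                 # the even indices i (and i-1 for the odd slots)
    R = np.zeros((num_positions, D))
    R[:, 0::2] = np.sin(n / L ** (i / D))
    R[:, 1::2] = np.cos(n / L ** (i / D))
    return R

R = positional_encoding(50, 32)
print(R.shape, R.min() >= -1, R.max() <= 1)         # (50, 32) True True: unique and bounded
```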
Transformers for images¶
The Vision Transformer (ViT) processes an image as a sequence of patches:
- Split the image into $N = HW / P^2$ patches $x_i \in \mathbb{R}^{P \times P \times C}$, $\forall i \in [1, N]$
- Patch embedding: $z_i = flatten(x_i)E$, where $E \in \mathbb{R}^{(P^2 C) \times D}$ and $z_i \in \mathbb{R}^{1 \times D}$
- Add positional encoding: $z_i^0 = z_i + E_{pos}^i$, where $E_{pos}^i \in \mathbb{R}^{1 \times D}$
- Encoder: $X_L = TransformerEncoder(X_0)$, where $X_0 = [z_0^0; z_1^0; \cdots; z_N^0]$ and $z_0^0$ is a learnable class token
- Classification head: $Y = MLP(z_0^L)$, where $z_0^L$ is the class-token output of the last encoder layer
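A minimal sketch of the patch-embedding step under assumed sizes ($64 \times 64$ grayscale image, $P = 16$, $D = 128$); in a real ViT the class token and positional embeddings are learnable parameters, whereas zero tensors are used here for simplicity:

```python
import torch
import torch.nn as nn

H = W = 64
P, C, D = 16, 1, 128                               # assumed patch size, channels, embedding dim
N = (H * W) // (P ** 2)                            # number of patches

img = torch.randn(1, C, H, W)
patches = img.unfold(2, P, P).unfold(3, P, P)      # (1, C, H/P, W/P, P, P)
patches = patches.reshape(1, C, N, P * P).permute(0, 2, 3, 1).reshape(1, N, P * P * C)

E = nn.Linear(P * P * C, D)                        # patch-embedding matrix E
cls_token = torch.zeros(1, 1, D)                   # class token z_0 (learnable in practice)
E_pos = torch.zeros(1, N + 1, D)                   # positional embeddings (learnable in practice)

X0 = torch.cat([cls_token, E(patches)], dim=1) + E_pos   # input X_0 to the transformer encoder
print(X0.shape)                                    # torch.Size([1, 17, 128]): N + 1 tokens of dim D
```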
Transformers vs. CNNs for images¶
In CNNs, locality and two-dimensional neighborhood are strong inductive biases for vision tasks.
In ViTs
- Only the MLP layers are local, while self-attention is global
- Patching is the only two-dimensional inductive bias; the model has to learn spatial relations from scratch
- Generally, a ViT requires more training data than a comparable CNN