263-5354-00: Large Language Model
Section 2
Modeling Foundations
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/13/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any consequences of using this file.
This personal note is adapted from the course taught by Professor Ryan Cotterell. Please contact us to have this file deleted if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
The task of language modeling
- Train an LM
  - Parameterized family of LMs
  - Select an LM given data
  - Strategies:
    - Maximum-likelihood estimation / regularization
    - Reinforcement learning (RLHF)
- Evaluate an LM
  - Measure log-likelihood on held-out data

Statistical setup
- $p^*$ is the true distribution
- $X^{(n)} \sim p^*$ is the training data (sampled $N$ times, i.i.d.)
- Choose $\hat{p}$ based on $\{X^{(n)}\}_{n=1}^{N}$
- Goal: $\hat{p}$ should be close to $p^*$
Representation-based language models¶
Vector space representations¶
The good representation principle states that the success of a machine learning model depends - in great part - on the representation that is chosen (or learned) for the objects that are being modeled. In the case of language modeling, the two most salient choice points are the representations chosen for the symbols, elements of $\bar{\Sigma}$, and the representations chosen for the contexts, elements of $\bar{\Sigma}^*$.
Vector space¶
A vector space over a field $\mathbb{F}$ is a set $\mathbb{V}$ together with two binary operations (vector addition and scalar multiplication by elements of $\mathbb{F}$) that satisfy the following axioms:
Associativity of vector addition: for all $\mathbf{v, u, q} \in \mathbb{V}$
$$(\mathbf{v} + \mathbf{u}) + \mathbf{q} = \mathbf{v} + (\mathbf{u} + \mathbf{q})$$
Commutativity of vector addition: for all $\mathbf{v, u} \in \mathbb{V}$
$$\mathbf{v} + \mathbf{u} = \mathbf{u} + \mathbf{v}$$
Identity element of vector addition: there exists $\mathbf{0} \in \mathbb{V}$ such that for all $\mathbf{v} \in \mathbb{V}$
$$\mathbf{v} + \mathbf{0} = \mathbf{v}$$
Inverse element of vector addition: for every $\mathbf{v} \in \mathbb{V}$ there exists a $-\mathbf{v} \in \mathbb{V}$ such that
$$\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$$
Compatibility of scalar multiplication with field multiplication: for all $\mathbf{v} \in \mathbb{V}$ and $x, y \in \mathbb{F}$
$$x(y \mathbf{v}) = (xy) \mathbf{v}$$
Identity element of scalar multiplication: for all $\mathbf{v} \in \mathbb{V}$
$$1 \mathbf{v} = \mathbf{v}$$
where $1$ is the multiplicative identity in $\mathbb{F}$.
Distributivity of scalar multiplication with respect to vector addition: for all $x \in \mathbb{F}$ and all $\mathbf{u, v} \in \mathbb{V}$
$$x(\mathbf{v} + \mathbf{u}) = x \mathbf{v} + x \mathbf{u}$$
Distributivity of scalar multiplication with respect to field addition: for all $x, y \in \mathbb{F}$ and all $\mathbf{v} \in \mathbb{V}$
$$(x + y)\mathbf{v} = x\mathbf{v} + y\mathbf{v}$$
Remark
An important characteristic of a vector space is its dimensionality, which corresponds to the number of independent directions - basis vectors - in the space. Any $\mathbf{v} \in \mathbb{V}$ can be expressed as a linear combination of the $D$ basis vectors. The coefficients of this linear combination can then be combined into a $D$-dimensional coordinate vector in $\mathbb{F}^D$.
Inner product spaces additionally define an inner product, mapping pairs of elements of the vector space to scalars.
Inner product space¶
An inner product space is a vector space $\mathbb{V}$ over a field $\mathbb{F}$ coupled with a map
$$\langle \cdot, \cdot \rangle : \mathbb{V} \times \mathbb{V} \rightarrow \mathbb{F}$$
such that the following axioms hold
Conjugate symmetry: for all $\mathbf{v}, \mathbf{u} \in \mathbb{V}$
$$\langle \mathbf{v}, \mathbf{u} \rangle = \overline{\langle \mathbf{u}, \mathbf{v} \rangle}$$
where $\bar{x}$ denotes the conjugate of the element $x \in \mathbb{F}$.
Linearity in the first argument: for all $\mathbf{v, u, z} \in \mathbb{V}$ and $x, y \in \mathbb{F}$
$$\langle x \mathbf{v} + y \mathbf{u}, \mathbf{z} \rangle = x \langle \mathbf{v}, \mathbf{z} \rangle + y \langle \mathbf{u}, \mathbf{z} \rangle$$
Positive-definiteness: for all $\mathbf{v} \neq \mathbf{0}$
$$\langle \mathbf{v, v} \rangle > 0$$
Remark
Inner products are often defined such that they capture some notion of similarity of the vectors in $\mathbb{V}$. Every inner product on a real or complex vector space induces a vector norm.
Norm¶
Given a vector space $\mathbb{V}$ over $\mathbb{R}$ or $\mathbb{C}$ and an inner product $\langle \cdot, \cdot \rangle$ over it, the norm induced by the inner product is defined as the function $\| \cdot \|: \mathbb{V} \rightarrow \mathbb{R}_{\ge 0}$, where
$$\| \mathbf{v} \| \stackrel{\text{def}}{=} \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}.$$
Remark
A Hilbert space is then an inner product space in which sequences of elements satisfy a useful property with respect to the norm induced by the inner product: every absolutely convergent series converges to a vector in $\mathbb{V}$.
Hilbert space¶
A Hilbert space is an inner product space that is complete with respect to the norm induced by the inner product. An inner product space is complete with respect to the norm if every Cauchy sequence (a sequence whose elements become arbitrarily close to each other) converges to an element of $\mathbb{V}$. Equivalently, an inner product space is complete if, for every series
$$\sum_{n=1}^{\infty} \mathbf{v}_n$$
such that
$$\sum_{n=1}^{\infty} \| \mathbf{v}_n \| < \infty$$
it holds that
$$\sum_{n=1}^{\infty} \mathbf{v}_n \in \mathbb{V}.$$
Remark
Even if an inner product space $\mathbb{V}$ is not necessarily a Hilbert space, $\mathbb{V}$ can always be completed to a Hilbert space.
Completion theorem for inner product spaces¶
Any inner product space can be completed into a Hilbert space.
Remark
The inner product space can be completed into a Hilbert space by completing it with respect to the norm induced by the inner product on the space. For this reason, inner product spaces are also called pre-Hilbert spaces.
Utility of different spaces
| Space | Utility |
| --- | --- |
| Vector space | A space in which representations of symbols and strings live. It also allows expressing vector representations in terms of the basis vectors. |
| Inner product space | Defines an inner product, which induces a norm and can measure similarity. |
| Hilbert space | There are no "holes" in the representation space with respect to the induced norm, since all Cauchy sequences converge to elements of the space. |
Representation function¶
Let $\mathcal{S}$ be a set and $\mathbb{V}$ a Hilbert space over some field $\mathbb{F}$. A representation function $f$ for the elements of $\mathcal{S}$ is a function of the form
$$f : \mathcal{S} \rightarrow \mathbb{V}.$$
Remark
The dimensionality of the Hilbert space of the representations, $D$, is determined by the modeler. Importantly, in the case that $\mathcal{S}$ is finite, we can represent the representation function as a matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{S}| \times D}$ (assuming $\mathbb{V} = \mathbb{R}^{D}$) whose $n^{\text{th}}$ row corresponds to the representation of the $n^{\text{th}}$ element of $\mathcal{S}$.
Symbol embedding function¶
Let $\Sigma$ be an alphabet. An embedding function $\mathbf{e}(\cdot) : \bar{\Sigma} \rightarrow \mathbb{R}^D$ is a representation function of individual symbols $\bar{y} \in \bar{\Sigma}$.
Remark
The representations $\mathbf{e}(\bar{y})$ are commonly referred to as embeddings. The simplest way to represent discrete symbols with real-valued vectors is with one-hot encodings:
Let $n: \bar{\Sigma} \rightarrow \{1, \cdots, |\bar{\Sigma}|\}$ be a bijection (i.e., an ordering of the alphabet, assigning an index to each symbol in $\bar{\Sigma}$). A one-hot encoding $[[ \cdot ]]$ is a representation function which assigns the symbol $\bar{y} \in \bar{\Sigma}$ the $n(\bar{y})^{\text{th}}$ basis vector:
$$[[\bar{y}]] \stackrel{\text{def}}{=} \mathbf{d}_{n(\bar{y})},$$
where $\mathbf{d}_n$ is the $n^{\text{th}}$ canonical basis vector, i.e., a vector of zeros with a 1 at position $n$.
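For concreteness, here is a minimal NumPy sketch of a one-hot encoding over a toy augmented alphabet; the symbols and the ordering $n(\cdot)$ are arbitrary choices for this illustration.

```python
import numpy as np

# Toy augmented alphabet; the symbols and their ordering n(.) are arbitrary here.
alphabet = ["a", "b", "c", "EOS"]
n = {sym: idx for idx, sym in enumerate(alphabet)}  # the bijection n(.)

def one_hot(symbol: str) -> np.ndarray:
    """Return the canonical basis vector d_{n(symbol)}."""
    vec = np.zeros(len(alphabet))
    vec[n[symbol]] = 1.0
    return vec

print(one_hot("b"))  # [0. 1. 0. 0.]
```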
Context encoding function¶
Let $\Sigma$ be an alphabet. A context encoding function $\text{enc}(\cdot): \bar{\Sigma}^* \rightarrow \mathbb{R}^D$ is a representation function of strings $\bar{\mathbf{y}} \in \bar{\Sigma}^*$.
Remark
To be completely consistent, the encoding function should be defined over the set $\left( \bar{\Sigma} \cup \{\text{BOS}\} \right)^*$ to allow for the case when $y_0 = \text{BOS}$. We refer to $\text{enc}(\bar{\mathbf{y}})$ as the encoding of $\bar{\mathbf{y}} \in \bar{\Sigma}^*$.
Compatibility of symbol and context¶
The smaller the angle between two representations, the more similar they are. In a Hilbert space, we define the cosine of the angle $\theta$ between two representations $\mathbf{u}$ and $\mathbf{v}$ as
$$\cos(\theta) \stackrel{\text{def}}{=} \frac{\langle \mathbf{u, v} \rangle}{\|\mathbf{u}\| \|\mathbf{v}\|}.$$
The Cauchy-Schwarz inequality immediately gives us that $\cos(\theta) \in [-1, 1]$, since $-\| \mathbf{u} \| \| \mathbf{v} \| \le \langle \mathbf{u, v} \rangle \le \| \mathbf{u} \| \| \mathbf{v} \|$.
Given a context representation $\text{enc}(\bar{\mathbf{y}})$, we can compute its inner products with all symbol representations $\mathbf{e}(\bar{y})$:
$$\langle \mathbf{e}(\bar{y}), \text{enc}(\bar{\mathbf{y}}) \rangle,$$
which can be achieved simply with a matrix-vector product:
$$\mathbf{E} \text{enc} (\bar{\mathbf{y}}).$$
$\mathbf{E} \text{enc}(\bar{\mathbf{y}}) \in \mathbb{R}^{|\bar{\Sigma}|}$ has the nice property that each of its individual entries corresponds to the similarity of a particular symbol to the context $\bar{\mathbf{y}}$. The entries of the vector $\mathbf{E} \text{enc} (\bar{\mathbf{y}})$ are often called scores or logits.
We now need to transform $\mathbf{E} \text{enc}(\bar{\mathbf{y}}) \in \mathbb{R}^{|\bar{\Sigma}|}$ into a valid discrete probability distribution by using a projection function.
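As a sketch of this score computation, here is a minimal NumPy example; the alphabet size, the embedding dimension, and the mean-pooling encoder `enc` are all stand-in assumptions (any real encoder could be dropped in).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, D = 4, 8

# Symbol embedding matrix E: row n is the embedding e(.) of the n-th symbol.
E = rng.normal(size=(vocab_size, D))

def enc(context_ids):
    """Hypothetical context encoder: mean of the context symbol embeddings."""
    return E[context_ids].mean(axis=0)

context = [0, 2, 1]        # indices of the context symbols
logits = E @ enc(context)  # inner products <e(y), enc(context)> for every symbol y
print(logits.shape)        # (4,) -- one score ("logit") per symbol in the alphabet
```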
Projecting onto the simplex¶
Projection functions: mapping vectors onto the probability simplex¶
$p_{SM}(\bar{y} | \bar{\mathbf{y}})$ is a categorical distribution with $|\bar{\Sigma}|$ categories, i.e., a vector of probabilities whose components correspond to the probabilities of the individual categories. The simplest way to represent a categorical distribution is as a vector on the probability simplex.
Probability simplex¶
A probability simplex $\mathbf{\Delta}^{D - 1}$ is the set of non-negative vectors in $\mathbb{R}^D$ whose components sum to 1:
$$\mathbf{\Delta}^{D - 1} \stackrel{\text{def}}{=} \left\{ \mathbf{x} \in \mathbb{R}^D \mid x_d \ge 0,\ d = 1, \cdots, D\text{~~and~~}\sum_{d=1}^{D}x_d = 1 \right\}$$
Remark
The definition of a simplex means that we can more formally express $p_{SM}$ as a projection from the Hilbert space of the context representations to $\mathbf{\Delta}^{|\bar{\Sigma}| - 1}$, i.e., $p_{SM} : \mathbb{V} \rightarrow \mathbf{\Delta}^{|\bar{\Sigma}| - 1}$.
Projection function¶
A projection function $\mathbf{f}_{\mathbf{\Delta}^{D-1}}$ is a mapping from a real-valued Hilbert space $\mathbb{R}^D$ to the probability simplex $\mathbf{\Delta}^{D-1}$
$$\mathbf{f}_{\mathbf{\Delta}^{D-1}}: \mathbb{R}^D \rightarrow \mathbf{\Delta}^{D-1}.$$
which allows us to define a probability distribution according to $\mathbf{E}\text{enc}(\bar{\mathbf{y}})$:
$$p_{SM}(\bar{y} \mid \bar{\mathbf{y}}) = \mathbf{f}_{\mathbf{\Delta}^{|\bar{\Sigma}|-1}} (\mathbf{E} \text{enc}(\bar{\mathbf{y}}))_{\bar{y}}.$$
Softmax¶
Let $\tau \in \mathbb{R}_{+}$ be the temperature. The softmax at temperature $\tau$ is the projection function defined as
$$\text{softmax}(\mathbf{x})_d \stackrel{\text{def}}{=} \frac{\exp\left[ \frac{1}{\tau} x_d \right]}{\sum_{j=1}^{D}\exp\left[ \frac{1}{\tau} x_j \right]}$$
for $d = 1, \cdots, D$.
Remark
The temperature parameter $\tau$ gives us a mechanism for controlling the entropy of the softmax function by scaling the individual scores in the input vector before their exponentiation. In the context of the Boltzmann distribution, it was used to control the "randomness" of the system:
- When the temperature is high, the softmax function outputs a more uniform probability distribution, whose probabilities are relatively evenly spread out among the different categories.
- When the temperature is low, the softmax function outputs a peaked probability distribution, where the probability mass is concentrated on the most likely category.
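A minimal sketch of the tempered softmax and the effect of $\tau$ on how peaked the output is; the scores are made up for the example.

```python
import numpy as np

def softmax(x, tau=1.0):
    """Softmax at temperature tau, computed directly from the definition."""
    expx = np.exp(x / tau)
    return expx / expx.sum()

x = np.array([2.0, 1.0, 0.1])   # made-up scores
print(softmax(x, tau=1.0))      # ~[0.66, 0.24, 0.10]: moderately peaked
print(softmax(x, tau=10.0))     # close to uniform
print(softmax(x, tau=0.05))     # essentially all mass on the argmax
```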
Limiting behavior of the softmax function¶
As $\tau$ approaches the extremes of the values it can assume, the following properties hold:
$$\lim_{\tau \rightarrow \infty} \text{softmax}(\mathbf{x}) = \frac{1}{D} \mathbf{1}$$
$$\lim_{\tau \rightarrow 0^+} \text{softmax}(\mathbf{x}) = \mathbf{e}_{\arg \max(\mathbf{x})}$$
where $\mathbf{e}_d$ denotes the $d^{\text{th}}$ basis vector in $\mathbb{R}^D$, $\mathbf{1} \in \mathbb{R}^D$ the vector of all ones, and
$$\arg \max (\mathbf{x}) \stackrel{\text{def}}{=} \left\{ d \mid x_d = \max_{d=1, \cdots, D} (x_d) \right\},$$
i.e., the index of the maximum element of the vector $\mathbf{x}$ (with ties broken by choosing the lowest such index). In words, this means that the output of the softmax approaches the uniform distribution as $\tau \rightarrow \infty$ and a single mode as $\tau \rightarrow 0^+$.
Variational characterization of the softmax¶
Given a set of real-valued scores $\mathbf{x}$, the following equality holds
$$ \begin{align} \text{softmax}(\mathbf{x}) &= \arg \max_{\mathbf{p} \in \mathbf{\Delta}^{D - 1}} \left( \mathbf{p}^T\mathbf{x} - \tau \sum_{d = 1}^{D} p_d \log p_d \right) \\ &= \arg \max_{\mathbf{p} \in \mathbf{\Delta}^{D - 1}} \left( \mathbf{p}^T\mathbf{x} + \tau \mathrm{H}(\mathbf{p}) \right) \end{align} $$
This tells us that the softmax can be given a variational characterization, i.e., it can be viewed as the solution to an optimization problem: it maximizes the score $\mathbf{p}^T\mathbf{x}$ while being encouraged, with strength $\tau$, to keep the entropy of the output distribution high.
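A quick numerical sanity check of this characterization (for $\tau = 1$, $D = 3$, and made-up scores), comparing the softmax output against a brute-force search over random points on the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
x, tau = np.array([1.0, 0.5, -0.3]), 1.0   # made-up scores, temperature 1

def objective(p):
    """p^T x + tau * H(p), the quantity maximized in the variational characterization."""
    nz = p > 0
    return p @ x + tau * (-(p[nz] * np.log(p[nz])).sum())

softmax_x = np.exp(x / tau) / np.exp(x / tau).sum()

# Brute force: evaluate the objective at many random points on the simplex.
candidates = rng.dirichlet(np.ones(3), size=100_000)
best = candidates[np.argmax([objective(p) for p in candidates])]

print(np.round(softmax_x, 3))  # the closed-form maximizer
print(np.round(best, 3))       # the best sampled point -- close to the softmax output
```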
Desirable properties of the softmax function¶
The softmax function with temperature parameter $\tau$ exhibits the following properties:
- In the limits $\tau \rightarrow 0^+$ and $\tau \rightarrow \infty$, the softmax recovers the argmax operator and the projection to the center of the probability simplex (at which lies the uniform distribution), respectively.
- $\text{softmax}(\mathbf{x} + c\mathbf{1}) = \text{softmax}(\mathbf{x})$ for $c \in \mathbb{R}$, i.e., the softmax is invariant to adding the same constant to all coordinates of $\mathbf{x}$.
- The softmax is continuous and differentiable everywhere, and its derivative can be computed in closed form.
- For all temperatures $\tau \in \mathbb{R}_+$, if $x_i \le x_j$, then $\text{softmax}(\mathbf{x})_i \le \text{softmax}(\mathbf{x})_j$. In other words, the softmax preserves the ranking of the entries of $\mathbf{x}$.
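The shift-invariance property above is what justifies the standard "subtract the max" trick for computing the softmax without overflow; a small sketch follows, with scores chosen deliberately large so that the naive version overflows.

```python
import numpy as np

def naive_softmax(x):
    return np.exp(x) / np.exp(x).sum()

def stable_softmax(x):
    # Shift invariance: softmax(x + c*1) = softmax(x), so subtracting max(x)
    # leaves the result unchanged while preventing overflow in exp.
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

x = np.array([1000.0, 999.0, 998.0])
print(naive_softmax(x))   # overflow -> [nan nan nan]
print(stable_softmax(x))  # well-defined; ranking of the entries of x is preserved
```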
Derivative of the sparsemax function¶
The derivative of the sparsemax with respect to its input $\mathbf{x}$ is as follows:
$$\frac{\partial \text{sparsemax}(\mathbf{x})_i}{\partial x_j} = \begin{cases} \delta_{ij} - \frac{1}{|S(\mathbf{x})|} & \text{if } i, j \in S(\mathbf{x}) \\ 0 & \text{otherwise} \end{cases}$$
where $S(\mathbf{x})$ denotes the support of $\text{sparsemax}(\mathbf{x})$ (the set of indices with nonzero output) and $\delta_{ij}$ is the Kronecker delta.
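Sparsemax itself is not defined in these notes, so purely for illustration, here is a hedged NumPy sketch of the standard sparsemax of Martins and Astudillo (2016) (the Euclidean projection of the scores onto the probability simplex) together with the Jacobian formula just stated; the example scores are arbitrary.

```python
import numpy as np

def sparsemax(x):
    """Sketch of sparsemax: Euclidean projection of x onto the probability simplex."""
    z = np.sort(x)[::-1]                   # scores in decreasing order
    k = np.arange(1, len(x) + 1)
    cumsum = np.cumsum(z)
    support = 1 + k * z > cumsum           # condition defining the support size
    k_z = k[support][-1]                   # |S(x)|, the size of the support
    tau = (cumsum[support][-1] - 1) / k_z  # threshold
    return np.maximum(x - tau, 0.0)

def sparsemax_jacobian(x):
    """Jacobian from the formula above: delta_ij - 1/|S(x)| on the support, 0 elsewhere."""
    s = sparsemax(x) > 0                   # the support S(x)
    J = np.zeros((len(x), len(x)))
    J[np.ix_(s, s)] = np.eye(s.sum()) - 1.0 / s.sum()
    return J

x = np.array([2.0, 1.5, 0.1])
print(sparsemax(x))           # [0.75, 0.25, 0.0] -- some entries are exactly zero
print(sparsemax_jacobian(x))
```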
Projection functions, together with symbol embeddings and the encoding function enc, now give us the tools to define a probability distribution over next symbols that encodes complex linguistic interactions.
Representation-based locally normalized models¶
Let enc be an encoding function. A representation-based locally normalized model is a model of the following form:
$$p_{SM}(\bar{y}_t | \bar{\mathbf{y}}_{<t}) \stackrel{\text{def}}{=} \mathbf{f}_{\mathbf{\Delta}^{|\bar{\Sigma}| - 1}} (\mathbf{E}\text{enc}(\bar{\mathbf{y}}_{<t}))_{\bar{y}_t}$$
where unless otherwise stated, we assume $\mathbf{f}_{\mathbf{\Delta}^{|\bar{\Sigma}| - 1}} = \text{softmax}$. It defines the probability of an entire string $\mathbf{y} \in \Sigma^*$ as
$$p_{LN}(\mathbf{y}) \stackrel{\text{def}}{=} p_{SM}(\text{EOS}|\mathbf{y}) \prod_{t=1}^{T} p_{SM}(y_t | \mathbf{y}_{<t})$$
where $y_0 \stackrel{\text{def}}{=} \text{BOS}$.
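Putting the pieces together, here is a minimal NumPy sketch of a representation-based locally normalized model; the toy alphabet, the extra BOS row in $\mathbf{E}$, and the mean-pooling encoder are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
symbols = ["a", "b", "EOS"]   # toy augmented alphabet
BOS = len(symbols)            # give BOS its own (extra) row in E
D = 8
E = rng.normal(size=(len(symbols) + 1, D))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def enc(context_ids):
    # Hypothetical encoder: mean of the context symbol embeddings.
    return E[context_ids].mean(axis=0)

def p_sm(context_ids):
    # p_SM(. | context) = softmax(E enc(context)), a distribution over the alphabet.
    return softmax(E[: len(symbols)] @ enc(context_ids))

def p_ln(string_ids):
    """Locally normalized probability of a string (given as a list of symbol indices)."""
    prob, context = 1.0, [BOS]                     # y_0 = BOS
    for y in string_ids + [symbols.index("EOS")]:  # generate the symbols, then EOS
        prob *= p_sm(context)[y]
        context.append(y)
    return prob

print(p_ln([0, 1, 0]))  # probability of the string "aba"
```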
Tightness of softmax representation-based models¶
Let $p_{SM}$ be a representation-based sequence model over the alphabet $\Sigma$. Let
$$s \stackrel{\text{def}}{=} \sup_{y \in \Sigma} \| \mathbf{e}(y) - \mathbf{e}(\text{EOS}) \|_2$$
i.e., the largest distance to the representation of the $\text{EOS}$ symbol, and
$$z_{\text{max}} \stackrel{\text{def}}{=} \max_{\mathbf{y} \in \Sigma^t} \| \text{enc}(\mathbf{y}) \|_2$$
i.e., the maximum attainable context representation norm for contexts of length $t$. Then the locally normalized model $p_{LN}$ induced by $p_{SM}$ is tight if
$$sz_{\text{max}} \le \log t.$$
Representation-based language models with bounded encodings are tight¶
A locally-normalized representation-based language model with uniformly bounded $\| \text{enc}(\mathbf{y}) \|_p$ (for some $p \ge 1$) is tight.
Estimating a language model from data¶
The Core Task: Language Modeling¶
Goal: To estimate the parameters of a model, $p_M$, that approximates the true, unknown probability distribution of natural language strings, $p_{LM}$.
Data: This is done using a corpus (dataset) $D = \{ \mathbf{y}^{(n)} \}_{n=1}^{N}$, which is a collection of text strings.
Fundamental Assumption: The data samples $\mathbf{y}^{(n)}$ are assumed to be independently and identically distributed (i.i.d.) according to $p_{LM}$.
Approach: The task is framed as an optimization problem to find the best model parameters, denoted by $\theta$.
Language Modeling Objectives¶
The objective function (or loss function) defines what we are trying to optimize. The goal is to find model parameters $\hat{\theta}$ that make our model $p_\theta$ as "close" as possible to the true distribution $p_{LM}$, which we approximate using the data.
Key Objectives & Equivalence¶
Three common objectives are closely related and often lead to the same solution:
Maximum Likelihood Estimation (MLE)¶
Principle: Find the parameters $\hat{\theta}$ that maximize the probability (likelihood) of observing the training data $D$.
Formula:
$$\hat{\theta}_{\text{MLE}} = \arg \max_{\theta \in \Theta} \sum_{n=1}^{N} \log p_\theta(\mathbf{y}^{(n)})$$
Note: We maximize the log-likelihood for numerical stability.
Cross-Entropy Minimization¶
Concept: Measures the difference between two probability distributions. It can be interpreted as the average number of bits needed to encode data from the true distribution using a code optimized for the model's distribution.
Formula:
$$H(p_{\theta^*}, p_\theta) = - \sum_{\mathbf{y} \in D} p_{\theta^*}(\mathbf{y}) \log p_\theta(\mathbf{y})$$
Here, $p_{\theta^*}$ is the empirical distribution from the data (it assigns probability $1/N$ to each observed sample in $D$).
Key Insight: Minimizing cross-entropy is equivalent to maximizing log-likelihood.
KL Divergence Minimization¶
Concept: A measure of how one probability distribution diverges from a second, expected probability distribution.
Relationship:
$$D_{\text{KL}}(p_{\theta^*} \,\|\, p_\theta) = H(p_{\theta^*}, p_\theta) - H(p_{\theta^*})$$
Since $H(p_{\theta^*})$ is constant with respect to the model parameters $\theta$, minimizing the KL divergence is also equivalent to minimizing cross-entropy and maximizing log-likelihood.
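A small numerical sanity check of these equivalences, using a toy model over a finite set of "strings"; both distributions below are made up for illustration.

```python
import numpy as np

# Toy setting: 4 possible "strings" and a dataset D of N = 8 observations.
data = np.array([0, 0, 1, 2, 0, 1, 3, 2])
N, K = len(data), 4

# Empirical distribution p_{theta*}: each observed sample contributes 1/N.
p_emp = np.bincount(data, minlength=K) / N

# Some model distribution p_theta (arbitrary, but normalized).
p_model = np.array([0.4, 0.3, 0.2, 0.1])

log_lik = np.sum(np.log(p_model[data]))                         # sum_n log p_theta(y^(n))
cross_entropy = -np.sum(p_emp * np.log(p_model))                # H(p_emp, p_theta)
entropy = -np.sum(p_emp[p_emp > 0] * np.log(p_emp[p_emp > 0]))  # H(p_emp)
kl = np.sum(p_emp[p_emp > 0] * np.log(p_emp[p_emp > 0] / p_model[p_emp > 0]))

# Maximizing log-likelihood == minimizing cross-entropy == minimizing KL (up to H(p_emp)):
print(np.isclose(cross_entropy, -log_lik / N))  # True
print(np.isclose(kl, cross_entropy - entropy))  # True
```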
Alternative Objectives¶
Masked Language Modeling (MLM)¶
Task: Instead of predicting the next word, the model predicts a "masked" word based on both its left and right context.
Example:
The students [MASK] to learn. -> predict: want
Note: A model trained with MLM (like BERT) is not a true language model because it doesn't define a valid probability distribution over an entire string. However, it's very effective for creating representations for downstream tasks.
Training Dynamics¶
Teacher Forcing: During training, the model is always given the correct, ground-truth token as context for the next prediction, even if its own previous prediction was wrong.
Exposure Bias: A problem caused by teacher forcing. The model is never exposed to its own mistakes during training, which can lead to compounding errors during inference when it must generate sequences based on its own previous outputs.
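A hedged PyTorch sketch of teacher forcing: at every position the model conditions on the ground-truth prefix, and the targets are simply the tokens shifted by one. The tiny GRU model and tensor shapes are assumptions for the example, not the course's reference implementation.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10, 16
embed = nn.Embedding(vocab_size, d_model)
rnn = nn.GRU(d_model, d_model, batch_first=True)
out_proj = nn.Linear(d_model, vocab_size)

# A batch of ground-truth token ids, shape (batch, time).
y = torch.randint(0, vocab_size, (2, 6))

# Teacher forcing: inputs at step t are the ground-truth tokens y_{<t};
# targets are the same sequence shifted by one position.
inputs, targets = y[:, :-1], y[:, 1:]
hidden, _ = rnn(embed(inputs))
logits = out_proj(hidden)  # (batch, time - 1, vocab)

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # gradients for one maximum-likelihood training step
```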
Parameter Estimation (Training)¶
This is the process of using numerical optimization to find the best parameters $\theta$.
Data Splitting¶
- Training Set: Used to compute the loss and update model parameters.
- Validation Set: Used to monitor the model's generalization during training, check for overfitting, and tune hyperparameters (like when to use early stopping).
- Test Set: Held-out data used for the final evaluation of the model's performance. It should not influence training in any way.
Numerical Optimization¶
Core Idea: Start with initial parameters $\theta_0$ and iteratively update them to minimize the loss function.
Algorithms:¶
- Gradient Descent: Uses the full dataset to compute gradients. Impractical for large datasets.
- Stochastic Gradient Descent (SGD): Uses mini-batches to approximate gradients, making training faster.
- Adam: A popular optimization algorithm that adapts the learning rate per parameter, combining momentum and RMSprop.
Other Considerations:
- Parameter Initialization: Good initialization is important to avoid poor convergence or suboptimal solutions.
- Early Stopping: A regularization technique where training is halted when validation performance stops improving, even if training loss decreases.
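A minimal PyTorch-style sketch of this loop with mini-batches, Adam, and early stopping on the validation loss; the model, the data loaders, the learning rate, and the patience value are all placeholders.

```python
import copy
import torch

def train(model, train_loader, val_loader, loss_fn, epochs=50, patience=3, lr=1e-3):
    """Sketch: mini-batch training with Adam and early stopping (placeholder loaders)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state, bad_epochs = float("inf"), None, 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:          # mini-batches -> stochastic gradient estimates
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

        model.eval()
        with torch.no_grad():              # monitor generalization on the validation set
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)

        if val < best_val:
            best_val, best_state, bad_epochs = val, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # early stopping
                break

    model.load_state_dict(best_state)      # restore the best validation checkpoint
    return model
```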
Regularization Techniques¶
Regularization refers to any modification to a learning algorithm that improves generalization, often at the cost of training performance.
Common Techniques:¶
Weight Decay (L2 Regularization):
- Adds a penalty to the loss based on the squared magnitude of the weights.
- Encourages simpler models with smaller weights.

Entropy Regularization:
- Penalizes low-entropy (peaky) output distributions.
- Prevents the model from becoming overconfident.

Label Smoothing:
- Distributes a small portion of the target label's probability mass to the other labels.
- Prevents the model from becoming too sure of its predictions.

Dropout:
- Randomly zeroes out a fraction of neuron activations during training.
- Forces the network to rely on multiple features rather than just a few.

Normalization: Rescales intermediate activations to stabilize training.
- Batch Normalization: Normalizes across the batch dimension.
- Layer Normalization: Normalizes across the features within a single example.
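Most of these techniques amount to a single argument or layer in common frameworks; here is a hedged PyTorch sketch of where weight decay, dropout, and label smoothing typically enter (the model, sizes, and coefficient values are illustrative only).

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Dropout(p=0.1),                  # dropout: randomly zero 10% of activations
    nn.Linear(d_model, vocab_size),
)

# Weight decay (L2 regularization) is an optimizer argument in PyTorch.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Label smoothing moves a small fraction of the target probability mass to other labels.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

x = torch.randint(0, vocab_size, (8,))  # a toy batch of symbol ids
y = torch.randint(0, vocab_size, (8,))  # next-symbol targets
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```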