263-5354-00: Large Language Model
Section 1
Probabilistic Foundations
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/11/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ryan Cotterell. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Informal definition of language modeling¶
Given
$\Sigma$: An alphabet, i.e., a finite, non-empty set, sometimes referred to as a vocabulary.
$\text{EOS}$: The end-of-sequence symbol, where $\text{EOS} \notin \Sigma$.
$\Sigma^*$: The set of all strings over the alphabet $\Sigma$, which is countable. It is usually called the Kleene closure of $\Sigma$. For example, if $\Sigma = \{a, b\}$, then $\Sigma^* = \{\varepsilon, a, b, aa, ab, ba, bb, \cdots\}$, where $\varepsilon$ is the empty string.
String: A string is a finite sequence of symbols drawn from an alphabet. For example, if $\Sigma = \{a, b\}$, then $abaa$ is a string.
A language model can be seen as a distribution $p$ over strings, where the probability of a string decomposes as
$$p(\mathbf{y}) = p(y_1, \cdots, y_T) = p(\text{EOS} | \mathbf{y}) \prod_{t = 1}^{T}p(y_t | \mathbf{y}_{<t})$$
where
$y \in \Sigma \cup \{\text{EOS}\}$
$\mathbf{y} \in \Sigma^*$
$p(y | \mathbf{y})$ represents the probability of the symbol $y$ occurring as the next symbol after the string $\mathbf{y}$.
Remark
We assume we have finite strings over an alphabet $\Sigma$ in this case. However, when dealing with infinite spaces, we will run into a classic paradox. The infinite coin toss is one such example.
Suppose we have $\{H, T\}$, where $H$ represents heads and $T$ represents tails. We want to place a distribution over $\{H, T\}^{\infty}$, the uncountable set of infinite sequences of $\{H, T\}$.
Intuitively, one would define, for every prefix $\mathbf{y}_{<t}$, $p(H | \mathbf{y}_{<t}) = p(T | \mathbf{y}_{<t}) = \frac{1}{2}$ and $p(\text{EOS} | \mathbf{y}_{<t}) = 0$.
However, each individual infinite sequence over $\{H, T\}$ should then be assigned probability $(1/2)^{\infty} = 0$. With these, we arrive at the paradox:
$$\begin{align}1 &= p\left(\{H, T\}\right) \\ &= p \left(\bigcup_{\omega \in \{H, T\}^{\infty}} \{\omega\} \right) \\ &= \sum_{\omega \in \{H, T\}^{\infty}} p\left(\{\omega\}\right) \\ &= \sum_{\omega \in \{H, T\}^{\infty}} 0 = 0 \end{align}$$
Language models: distributions over strings¶
Sets of strings¶
Alphabet¶
An alphabet is a finite, non-empty set. We denote an alphabet as $\Sigma$ or $\Delta$ here. We refer to the elements of an alphabet as symbols or letters and will denote them with lowercase letters: $a$, $b$, $c$.
String¶
A string (also referred to as a word) over an alphabet is any finite sequence of letters. Strings made up of symbols from $\Sigma$ will be denoted by bolded Latin letters, e.g., $\mathbf{y} = y_1 \cdots y_T$, where each $y_t \in \Sigma$.
Kleene star¶
Let $\Sigma$ be an alphabet. The Kleene star $\Sigma^*$ is defined as
$$\Sigma^* = \bigcup_{n = 0}^{\infty} \Sigma^n$$
where
$$\Sigma^n \stackrel{\text{def}}{=} \underbrace{\Sigma \times \cdots \times \Sigma}_{n \text{ times}}$$
Note that we define $\Sigma^0 \stackrel{\text{def}}{=} \{ \epsilon \}$. We call $\Sigma^*$ the Kleene closure of the alphabet $\Sigma$. We also define
$$\Sigma^+ \stackrel{\text{def}}{=} \bigcup_{n=1}^{\infty} \Sigma^n = \Sigma \Sigma^*.$$
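As a small illustration (our own sketch, not part of the course material), the countability of $\Sigma^*$ can be made concrete by enumerating it length by length, i.e., $\Sigma^0, \Sigma^1, \Sigma^2, \cdots$:

```python
from itertools import islice, product

def kleene_star(sigma):
    """Yield the strings of Sigma^*, grouped by increasing length (Sigma^0, Sigma^1, ...)."""
    n = 0
    while True:
        # Sigma^n: all length-n tuples of symbols; n = 0 yields only the empty string.
        for symbols in product(sorted(sigma), repeat=n):
            yield "".join(symbols)
        n += 1

print(list(islice(kleene_star({"a", "b"}), 7)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```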
Infinite sequences¶
Let $\Sigma$ be an alphabet. The set of all infinite sequences over $\Sigma$ is defined as:
$$\Sigma^{\infty} \stackrel{\text{def}}{=} \underbrace{\Sigma \times \cdots \times \Sigma}_{\infty \text{ times}}$$
Since strings are canonically finite in computer science, we will explicitly use the terms infinite sequence or infinite string to refer to elements of $\Sigma^{\infty}$.
Formal language¶
Let $\Sigma$ be an alphabet. A language $L$ is a subset of $\Sigma^*$.
For example, let $\Sigma = \{a, b, c\}$. Then $\Sigma^* = \{\varepsilon, a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, aaa, aab, aac, \cdots\}$. As a result, $L_1 \stackrel{\text{def}}{=} \{a, b, ab, ba\}$, $L_2 \stackrel{\text{def}}{=} \{\mathbf{y} \in \Sigma^* \,|\, y_1 = a\}$, and $L_3 \stackrel{\text{def}}{=} \{\mathbf{y} \in \Sigma^* \,|\, |\mathbf{y}| \text{ is even}\}$ are all valid languages.
String subelements¶
A subsequence of a string $\mathbf{y}$ is defined as a sequence that can be formed from $\mathbf{y}$ by deleting some or no symbols, leaving the order untouched. A substring is a contiguous subsequence.
For instance, ab and bc are substrings and subsequences of $\mathbf{y} = abc$, while ac is a subsequence but not a substring.
Prefixes and suffixes are special cases of substrings. A prefix is a substring that occurs at the very beginning of $\mathbf{y}$, and a suffix is a substring that occurs at the very end of $\mathbf{y}$. We will denote the prefix $y_1 \cdots y_{n-1}$ of the string $\mathbf{y} = y_1 \cdots y_T$ as $\mathbf{y}_{<n}$. We will also use the notation $\mathbf{y} \lhd \mathbf{y}'$ to denote that $\mathbf{y}$ is a suffix of $\mathbf{y}'$.
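The distinction between substrings and subsequences is easy to check in code. A minimal sketch (our own; the helper names are not from the notes):

```python
def is_substring(x: str, y: str) -> bool:
    """x is a contiguous subsequence of y."""
    return x in y

def is_subsequence(x: str, y: str) -> bool:
    """x can be formed from y by deleting symbols while keeping the order."""
    it = iter(y)
    return all(symbol in it for symbol in x)  # each symbol must appear after the previous one

assert is_substring("ab", "abc") and is_subsequence("ab", "abc")
assert not is_substring("ac", "abc") and is_subsequence("ac", "abc")
assert "abc".startswith("ab")   # "ab" is a prefix, i.e., y_{<3}
assert "abc".endswith("bc")     # "bc" is a suffix
```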
Defining a language model¶
Language model¶
Let $\Sigma$ be an alphabet. A language model is a (discrete) distribution $p_{LM}$ over $\Sigma^*$.
For example, let $\Sigma \stackrel{\text{def}}{=} \{a\}$. For $n \in \mathbb{N}_{\ge 0}$, define
$$p_{LM} (a^n) \stackrel{\text{def}}{=} 2^{-(n+1)},$$
where $a^0 \stackrel{\text{def}}{=} \epsilon$ and $a^n \stackrel{\text{def}}{=} \underbrace{a \cdots a}_{n\text{ times}}$.
We claim that $p_{LM}$ is a language model. To see that, we verify that it is a valid probability distribution over $\Sigma^*$. It is easy to see that $p_{LM}(a^n) \ge 0$ for any $n$. Additionally, we see that the probabilities of finite sequences indeed sum to 1:
$$\sum_{\mathbf{y} \in \Sigma^*} p_{LM}(\mathbf{y}) = \sum_{n=0}^{\infty} p_{LM}(a^n) = \sum_{n=0}^{\infty} 2^{-(n+1)} = \frac{1}{2} \sum_{n=0}^{\infty} 2^{-n} = \frac{1}{2} \frac{1}{1 - \frac{1}{2}} = 1.$$
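As a quick numerical sanity check (our own, not from the notes), the partial sums of this geometric series approach 1:

```python
# Partial sums of sum_{n >= 0} 2^{-(n+1)}; the truncated sum equals 1 - 2^{-50}.
print(sum(2.0 ** -(n + 1) for n in range(50)))  # ~= 1
```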
In our formal analysis of language models, we will also often refer to the language defined by a language model.
Weighted language¶
Let $p_{LM}$ be a language model. The weighted language of $p_{LM}$ is defined as
$$L(p_{LM}) \stackrel{\text{def}}{=} \{ (\mathbf{y}, p_{LM}(\mathbf{y})) | \mathbf{y} \in \Sigma^* \}$$
For example, the language of the language model from the previous example is
$$L(p_{LM}) \stackrel{\text{def}}{=} \left\{ \left( a^n, 2^{-(n+1)} \right) | n \in \mathbb{N}_{\ge 0} \right\}$$
Remark
A language model is itself a very simple concept - it is simply a distribution that weights strings (natural utterances) by their probability of occurring in a particular language.
For any (natural) language, the ground-truth language model $p_{LM}$ is unknown and complex. Therefore, we need computational models that can tractably represent distributions over strings, as well as ways of approximating (learning) the ground-truth distribution from finite datasets using such models.
Global and local normalization¶
Globally normalized models are normalized by summing over the entire (infinite) space of strings. Locally normalized models define a sequence of conditional distributions and make use of the chain rule of probability to define the joint probability of a whole string.
When specifying $p_{LM}$, we have two fundamental options. Depending on whether we model $p_{LM}(\mathbf{y})$ for each string $\mathbf{y}$ directly or we model the individual conditional probabilities $p_{LM}(y_t | \mathbf{y}_{<t})$, we distinguish globally and locally normalized models.
The beginning-of-sequence symbol. Conventionally, we introduce a special symbol over which both globally and locally normalized models operate: the beginning-of-sequence (BOS) symbol, which, as the name suggests, denotes the beginning of a string or a sequence. For a string $\mathbf{y} = y_1 \cdots y_T$, we will suggestively denote $y_0 \stackrel{\text{def}}{=} \text{BOS}$.
Globally normalized language models¶
Globally normalized models are also called energy-based language models. To define a globally normalized model, we start with the definition of an energy function.
Energy function¶
An energy function is a function $\hat{p}: \Sigma^* \rightarrow \mathbb{R}$.
An energy function can be used to define a very general class of probability distributions by normalizing its exponentiated negative values.
Globally normalized models¶
Let $\hat{p}_{GN}: \Sigma^* \rightarrow \mathbb{R}$ be an energy function. A globally normalized model (GNM) is defined as
$$p_{LM}(\mathbf{y}) \stackrel{\text{def}}{=} \frac{\exp\left[-\hat{p}_{GN} (\mathbf{y})\right]}{\sum_{\mathbf{y}' \in \Sigma^*} \exp\left[ -\hat{p}_{GN} (\mathbf{y}') \right]} \stackrel{\text{def}}{=} \frac{1}{Z_G} \exp\left[ - \hat{p}_{GN}(\mathbf{y}) \right]$$
where $Z_G \stackrel{\text{def}}{=} \sum_{\mathbf{y}' \in \Sigma^*} \exp\left[ -\hat{p}_{GN} (\mathbf{y}') \right]$, which is also called the normalization constant.
Globally normalized models are attractive because one only needs to define an (unnormalized) energy function $\hat{p}_{GN}$, which scores entire sequences at once. However, the downside is that it may be difficult to compute the normalizer $Z_G$.
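To make this concrete, here is a minimal sketch (our own, with an invented energy function) of a globally normalized model over a two-symbol alphabet. Since $\Sigma^*$ is infinite, the normalizer can only be approximated here by truncating the sum at a maximum string length:

```python
import math
from itertools import product

SIGMA = ["a", "b"]

def energy(y: str) -> float:
    # Toy energy (our own choice): grows with length, so exp(-energy) decays
    # quickly and the truncated normalizer is a good approximation.
    return 2.0 * len(y)

def approx_Z(max_len: int) -> float:
    """Truncated normalizer: sum exp(-energy) over all strings of length <= max_len."""
    return sum(
        math.exp(-energy("".join(w)))
        for n in range(max_len + 1)
        for w in product(SIGMA, repeat=n)
    )

Z = approx_Z(max_len=15)               # the exact Z_G would require summing over all of Sigma^*
print(Z, math.exp(-energy("ab")) / Z)  # approximate normalizer and approximate p_LM("ab")
```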
Normalizability¶
In defining the normalizer $Z_G \stackrel{\text{def}}{=} \sum_{\mathbf{y}' \in \Sigma^*} \exp\left[ -\hat{p}_{GN}(\mathbf{y}') \right]$, we notationally cover up a certain subtlety. The set $\Sigma^*$ is countably infinite, so $Z_G$ may diverge to $\infty$. In this case, $\frac{1}{Z_G} \exp\left[ - \hat{p}_{GN}(\mathbf{y}) \right]$ is not well-defined, which motivates the following definition.
Normalizable energy function¶
We say that an energy function is normalizable if the quantity $Z_G$ in $\frac{1}{Z_G} \exp\left[ - \hat{p}_{GN}(\mathbf{y}) \right]$ is finite, i.e., if $Z_G < \infty$.
With this definition, we can state a relatively trivial result that characterizes when an energy function can be turned into a globally normalized language model.
Theorem: Normalizable energy functions induce language models
Any normalizable energy function $\hat{p}_{GN}$ induces a language model, i.e., a distribution over $\Sigma^*$.
Proof: Given an energy function $\hat{p}_{GN}$, we have $\exp\left[ - \hat{p}_{GN}(\mathbf{y}) \right] \ge 0$ and
$$ \begin{align} \sum_{\mathbf{y} \in \Sigma^*} p_{LM}(\mathbf{y}) &= \sum_{\mathbf{y} \in \Sigma^*} \frac{\exp\left[-\hat{p}_{GN} (\mathbf{y})\right]}{\sum_{\mathbf{y}' \in \Sigma^*} \exp\left[ -\hat{p}_{GN} (\mathbf{y}') \right]} \\ &= \frac{1}{\sum_{\mathbf{y}' \in \Sigma^*} \exp\left[ -\hat{p}_{GN} (\mathbf{y}') \right]} \sum_{\mathbf{y} \in \Sigma^*} \exp\left[-\hat{p}_{GN} (\mathbf{y})\right] \\ &= 1, \end{align} $$
which means that the induced $p_{LM}$ is a valid probability distribution over $\Sigma^*$.
Although it is a big advantage that normalizable energy functions always form a language model, ensuring that they are normalizable can still be difficult and restrictive. When is an energy function normalizable?
Even if it is known that an energy function is normalizable, we still need an efficient algorithm to compute the normalizer. Efficiently computing $Z_G$ can be challenging: the fact that $\Sigma^*$ is infinite means that we cannot always compute $Z_G$ in a tractable way.
As a matter of fact, there are no general-purpose algorithms for this. Moreover, sampling from the model is similarly intractable, as entire sequences have to be drawn at once from the large space $\Sigma^*$.
Locally normalized language model¶
The inherent difficulty in computing the normalizer, an infinite summation over $\Sigma^*$, motivates the definition of locally normalized language models, which we will denote with $p_{LN}$. Rather than defining a probability distribution over $\Sigma^*$ directly, they decompose the problem into the problem of modeling a series of conditional distributions over the next possible symbol in the string given the context so far, i.e., $p_{LN}(y | \mathbf{y})$, which could be naively combined into the full probability of the string by multiplying the conditional probabilities.
In essence, this reduces the problem of having to normalize the distribution over an infinite set $\Sigma^*$ to the problem of modeling the distribution of the next possible symbol $y_n$ given the symbols seen so far $\mathbf{y}_{<n}$. This means that normalization would only ever require summation over $|\Sigma|$ symbols at a time, solving the tractability issues encountered by globally normalized models.
However, in order to be a language model, the distribution induced by the conditionals $p_{LN}(y | \mathbf{y})$ must constitute a probability distribution over $\Sigma^*$, and this may not be the case because locally normalized models can place positive probability mass on infinitely long sequences. Additionally, we also have to introduce a new symbol that tells us to "stop" generating a string, which we call the end-of-sequence symbol, EOS. We assume $\text{EOS} \notin \Sigma$ and we define
$$\bar{\Sigma} \stackrel{\text{def}}{=} \Sigma \cup \{ \text{EOS} \}.$$
Moreover, we explicitly denote elements of $\bar{\Sigma}^*$ as $\bar{\mathbf{y}}$ and symbols in $\bar{\Sigma}$ as $\bar{y}$. Given a sequence of symbols over $\bar{\Sigma}$, we take the string to be the sequence of symbols encountered before the first EOS symbol; EOS can thus be interpreted as denoting the end of the string, or as the language model terminating its generation.
Due to the issues with defining valid probability distributions over $\Sigma^*$, we will use the term sequence model to refer to any model that may place positive probability on infinitely long sequences. Thus, sequence models are strictly more general than language models, which, by definition, only place positive probability mass on strings, i.e., finite sequences.
Sequence model¶
Let $\Sigma$ be an alphabet. A sequence model (SM) over $\Sigma$ is defined as a set of conditional probability distributions
$$p_{SM}(y | \mathbf{y})$$
for $y \in \Sigma$ and $\mathbf{y} \in \Sigma^*$. We will refer to the string $\mathbf{y}$ in $p_{SM}(y | \mathbf{y})$ as the history or the context.
Note that we will mostly consider SMs over the set $\bar{\Sigma}$. To reiterate, we have just formally defined locally normalized sequence models rather than locally normalized language models. That has to do with the fact that, in contrast to a globally normalized model with a normalizable energy function, a SM might not correspond to a language model.
We will now work up to a locally normalized language model.
Locally normalized language model¶
Let $\Sigma$ be an alphabet. Next, let $p_{SM}$ be a sequence model over $\bar{\Sigma}$. A locally normalized language model (LNM) over $\Sigma$ is defined as
$$p_{LN}(\mathbf{y}) \stackrel{\text{def}}{=} p_{SM} (\text{EOS} | \mathbf{y}) \prod_{t=1}^{T} p_{SM} (y_t | \mathbf{y}_{<t})$$
for $\mathbf{y} \in \Sigma^*$ with $|\mathbf{y}| = T$. We say a locally normalized language model is tight if
$$\sum_{\mathbf{y} \in \Sigma^*} p_{LN}(\mathbf{y}) = 1.$$
Example: Locally and globally normalized language model
The image above is an example of a locally normalized language model. The values of the edges represent the conditional probability of observing the new word given the observed words (higher up on the path from the root node BOS). Note that the probabilities stemming from any inner node should sum to 1 - however, to avoid clutter, only a subset of the possible arcs is drawn.
The image above is an example of a globally normalized model which can for example generate sentences based on the probabilities determined by normalizing the assigned scores $\hat{p}_{GN}$.
We can compute the probabilities of various strings by starting at the root node BOS and choosing one of the paths to a leaf node, which will always be EOS. The values on the edges represent the conditional probabilities of observing the word at the target of the edge given the context seen on the path so far, i.e., $p_{LN}(\bar{y}_t | \bar{\mathbf{y}}_{<t})$ at level $t$ of the tree. For example, the probability of the string BOS "The best" EOS under this language model is $0.04 \cdot 0.13 \cdot 0.22 = 0.001144$. On the other hand, a globally normalized model would simply score all possible sentences using the score function $\hat{p}_{GN}(\mathbf{y})$.
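A toy locally normalized model can be written as a lookup table of conditional distributions. The sketch below is our own, with invented probabilities (not the values from the figure), and shows how $p_{LN}$ multiplies conditionals and terminates with EOS:

```python
EOS = "<EOS>"

# Invented conditional distributions p_SM(. | context); each row sums to 1.
COND = {
    (): {"the": 0.4, "a": 0.3, EOS: 0.3},
    ("the",): {"best": 0.5, "cat": 0.3, EOS: 0.2},
    ("the", "best"): {"cat": 0.4, EOS: 0.6},
}

def p_LN(tokens):
    """p_LN(y) = p_SM(EOS | y) * prod_t p_SM(y_t | y_{<t})."""
    prob, context = 1.0, ()
    for tok in tokens:
        prob *= COND.get(context, {}).get(tok, 0.0)
        context += (tok,)
    return prob * COND.get(context, {}).get(EOS, 0.0)

print(p_LN(("the", "best")))  # 0.4 * 0.5 * 0.6 = 0.12
```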
Locally normalizing a language model¶
When can a language model be locally normalized? In fact, every language model can be locally normalized.
Theorem: Any language model can be locally normalized
Let $p_{LM}$ be a language model. Then, there exists a locally normalized language model $p_{LN}$ such that, for all $\mathbf{y} \in \Sigma^*$ with $|\mathbf{y}| = T$,
$$p_{LM}(\mathbf{y}) = p_{LN}(\mathbf{y}) = p_{SM}(\text{EOS} | \mathbf{y}) \prod_{t=1}^{T} p_{SM}(y_t | \mathbf{y}_{<t}).$$
We now introduce the concept of prefix probabilities, which denote the sum of the probabilities of all strings beginning with a certain prefix.
Prefix probability¶
Let $p_{LM}$ be a language model. We define a $p_{LM}$'s prefix probability $\pi$ as
$$\pi(\mathbf{y}) \stackrel{\text{def}}{=} \sum_{\mathbf{y}' \in \Sigma^*} p_{LM}(\mathbf{yy}'),$$
that is, the probability that $\mathbf{y}$ is a prefix of any string $\mathbf{yy}'$ in the language, or, equivalently, the cumulative probability of all strings beginning with $\mathbf{y}$.
Note that, naturally, $\pi(\epsilon) = 1$.
Proof of the theorem
We define the individual conditional probability distributions over the next symbol of the SM $p_{SM}$ using the chain rule of probability. If $\pi(\mathbf{y}) > 0$, then define
$$p_{SM}(y | \mathbf{y}) \stackrel{\text{def}}{=} \frac{\pi(\mathbf{y}y)}{\pi(\mathbf{y})}$$
for $y \in \Sigma$ and $\mathbf{y} \in \Sigma^*$ such that $\pi(\mathbf{y}) > 0$. We still have to define the probabilities of ending the sequence using $p_{SM}$ by defining the EOS probabilities. We define, for any $\mathbf{y} \in \Sigma^*$ such that $\pi(\mathbf{y}) > 0$,
$$p_{SM}(\text{EOS} | \mathbf{y}) \stackrel{\text{def}}{=} \frac{p_{LM}(\mathbf{y})}{\pi(\mathbf{y})}$$
that is, the probability that the model generates exactly the string $\mathbf{y}$ and not any continuation $\mathbf{yy}'$ of it, given that $\mathbf{y}$ has already been generated. Each of the conditional distributions of this model is clearly defined over $\bar{\Sigma}$. This, therefore, defines a valid SM. To see that $p_{LN}$ constitutes the same distribution as $p_{LM}$, consider two cases.
Case 1: Assume $\pi(\mathbf{y}) > 0$. Then, we have
$$ \begin{align} p_{LN}(\mathbf{y}) &= \left[ \prod_{t = 1}^{T} p_{SM}(y_t | \mathbf{y}_{<t}) \right] p_{SM} (\text{EOS} | \mathbf{y}) \\ &= \frac{\pi(y_1)}{\pi(\epsilon)} \frac{\pi(y_1 y_2)}{\pi(y_1)} \cdots \frac{\pi(\mathbf{y}_{<T})}{\pi(\mathbf{y}_{< T-1})} \frac{\pi(\mathbf{y})}{\pi(\mathbf{y}_{<T})} p_{SM}(\text{EOS} | \mathbf{y})\\ &= \frac{\pi(\mathbf{y})}{\pi(\epsilon)} \frac{p_{LM}(\mathbf{y})}{\pi(\mathbf{y})} \\ &= p_{LM}(\mathbf{y}) \end{align} $$
where $\pi(\epsilon) = 1$.
Case 2: Assume $\pi(\mathbf{y}) = 0$. Let $\mathbf{y} = y_1 \cdots y_T$. Since $p_{LM}(\mathbf{y}) \le \pi(\mathbf{y})$, we have $p_{LM}(\mathbf{y}) = 0$. Moreover, there must exist a smallest $2 \le t' \le T + 1$ such that $\pi(\mathbf{y}_{<t'}) = 0$ (where $\mathbf{y}_{<T+1} \stackrel{\text{def}}{=} \mathbf{y}$), and by minimality $\pi(\mathbf{y}_{<t'-1}) > 0$. The factor $p_{SM}(y_{t'-1} | \mathbf{y}_{<t'-1}) = \frac{\pi(\mathbf{y}_{<t'})}{\pi(\mathbf{y}_{<t'-1})} = 0$, hence
$$p_{LN}(\mathbf{y}) = p_{SM}(\text{EOS} | \mathbf{y}) \prod_{t=1}^T p_{SM}(y_t | \mathbf{y}_{<t}) = 0 = p_{LM}(\mathbf{y}),$$
whereas the conditional probabilities conditioned on prefixes $\mathbf{y}_{<t}$ with $\pi(\mathbf{y}_{<t}) = 0$ can be arbitrarily defined, since they do not affect the string having probability 0.
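To make the construction concrete, here is a small sketch (our own) for a language model with finite support: we compute prefix probabilities $\pi$ and the induced conditionals, and recover $p_{LM}$ by local normalization.

```python
EOS = "<EOS>"

p_LM = {"": 0.25, "a": 0.25, "ab": 0.3, "b": 0.2}  # a toy LM over Sigma = {a, b}

def pi(prefix: str) -> float:
    """Prefix probability: total mass of strings that start with `prefix`."""
    return sum(p for y, p in p_LM.items() if y.startswith(prefix))

def p_SM(symbol: str, context: str) -> float:
    """Induced conditionals: p_SM(y | context) = pi(context y) / pi(context),
    and p_SM(EOS | context) = p_LM(context) / pi(context)."""
    if pi(context) == 0.0:
        raise ValueError("conditional is arbitrary when pi(context) = 0")
    if symbol == EOS:
        return p_LM.get(context, 0.0) / pi(context)
    return pi(context + symbol) / pi(context)

# Recover p_LM("ab") via local normalization:
print(p_SM("a", "") * p_SM("b", "a") * p_SM(EOS, "ab"))  # == 0.3
```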
When is a locally normalized language model a language model? LNMs which specify distributions over strings $p_{LN}(y_1 \cdots y_T)$ in terms of their conditional probabilities $p_{SM}(y_t | \mathbf{y}_{<t})$ for $t = 1, \cdots, T$ and $p_{SM}(\text{EOS} | \mathbf{y})$ have become the standard in the NLP literature. However, LNMs come with their own set of problems. An advantage of normalizable globally normalized models is that they, by definition, always define a valid probability distribution over $\Sigma^*$. Although this might be counterintuitive at first, the same cannot be said for LNMs - in this sense, locally normalized "language models" might not even be language models.
A measure-theoretic foundation¶
The goal of measure-theoretic probability is to assign probabilities to subsets of an outcome space $\Omega$. The standard approach to probability theory resorts to only assigning probability to certain "nice" (but not necessarily all) subsets of $\Omega$, which are referred to as events or measurable subsets.
The set of measurable subsets is commonly denoted as $\mathcal{F}$ and a probability measure $\mathbb{P}: \mathcal{F} \rightarrow [0, 1]$ is the function that assigns a probability to each measurable subset.
$\sigma$-algebra¶
Let $\mathcal{P}(\Omega)$ be the power set of $\Omega$. Then $\mathcal{F} \subseteq \mathcal{P}(\Omega)$ is called a $\sigma$-algebra (or $\sigma$-field) over $\Omega$ if the following conditions hold:
$\Omega \in \mathcal{F}$,
if $\varepsilon \in \mathcal{F}$, then $\varepsilon^c \in \mathcal{F}$,
if $\varepsilon_1, \varepsilon_2, \cdots$ is a finite or infinite sequence of sets in $\mathcal{F}$, then $\bigcup_n \varepsilon_n \in \mathcal{F}$.
If $\mathcal{F}$ is a $\sigma$-algebra over $\Omega$, we call the tuple $(\Omega, \mathcal{F})$ a measurable space.
The triple $(\Omega, \mathcal{F}, \mathbb{P})$ is collectively known as a probability space. A measurable space guarantees that operations on countably many sets are always valid, and hence permits the following definition of a probability measure.
Probability measure¶
A probability measure $\mathbb{P}$ over a measurable space $(\Omega, \mathcal{F})$ is a function $\mathbb{P}: \mathcal{F} \rightarrow [0, 1]$ such that
$\mathbb{P}(\Omega) = 1$,
if $\varepsilon_1, \varepsilon_2, \cdots$ is a countable sequence of disjoint sets in $\mathcal{F}$, then $\mathbb{P}(\bigcup_n \varepsilon_n) = \sum_n \mathbb{P}(\varepsilon_n)$.
In this case, we call $(\Omega, \mathcal{F}, \mathbb{P})$ a probability space.
Probability measures are only defined on certain "measurable" subsets of $\Omega$. Some subsets, called non-measurable sets, cannot consistently be assigned a probability. Caratheodory’s theorem provides a way to construct valid probability spaces (e.g., for sequences) that align with our intuition.
Algebra¶
$\mathcal{A} \subseteq \mathcal{P}(\Omega)$ is called an algebra (or field) over $\Omega$ if
$\Omega \in \mathcal{A}$,
if $\varepsilon \in \mathcal{A}$, then $\varepsilon^c \in \mathcal{A}$,
if $\varepsilon_1, \varepsilon_2 \in \mathcal{A}$, then $\varepsilon_1 \cup \varepsilon_2 \in \mathcal{A}$.
Probability pre-measure¶
Let $\mathcal{A}$ be an algebra over some set $\Omega$. A probability pre-measure over $(\Omega, \mathcal{A})$ is a function $\mathbb{P}_0: \mathcal{A} \rightarrow [0, 1]$ such that
$\mathbb{P}_0 (\Omega) = 1$,
if $\varepsilon_1, \varepsilon_2, \cdots$ is a (countable) sequence of disjoint sets in $\mathcal{A}$ whose (countable) union is also in $\mathcal{A}$, then $\mathbb{P}_0 \left(\bigcup_{n=1}^{\infty} \varepsilon_n\right) = \sum_{n=1}^{\infty} \mathbb{P}_0 (\varepsilon_n)$.
Note that the only difference between a $\sigma$-algebra and an algebra is that condition 3 is weakened from countable to finite, and the only difference between a probability measure and a pre-measure is that the latter is defined with respect to an algebra instead of a $\sigma$-algebra.
The idea behind the Caratheodory extension theorem is that there is often a simple construction of an algebra $\mathcal{A}$ over $\Omega$ on which there is a natural way to define a probability pre-measure, which can then be extended to a full probability measure.
Random variable¶
A mapping $x: \Omega \rightarrow S$ between two measurable spaces $(\Omega, \mathcal{F})$ and $(S, \mathcal{T})$ is an $(S, \mathcal{T})$-valued random variable, or a measurable mapping, if, for all $\mathcal{B} \in \mathcal{T}$,
$$x^{-1}(\mathcal{B}) \stackrel{\text{def}}{=} \{ \omega \in \Omega : x(\omega) \in \mathcal{B} \} \in \mathcal{F}.$$
Any measurable function (random variable) induces a new probability measure on the output $\sigma$-algebra based on the one defined on the original $\sigma$-algebra. This is called the pushforward measure, which we will denote by $\mathbb{P}_{*}$, given by
$$\mathbb{P}_{*} (x \in \varepsilon) \stackrel{\text{def}}{=} \mathbb{P}\left( x^{-1} (\varepsilon)\right),$$
that is, the probability of the result of $x$ being in some event $\varepsilon$ is determined by the probability of the event of all the elements which $x$ maps into $\varepsilon$, i.e., the pre-image of $\varepsilon$ given by $x$.
Tight language models¶
Tightness¶
Models whose generative process may fail to terminate are called non-tight.
A locally normalized language model $p_{LN}$ derived from a sequence model $p_{SM}$ is called tight if it defines a valid probability distribution over $\Sigma^*$:
$$\sum_{\mathbf{y} \in \Sigma^*} p_{LN}(\mathbf{y}) = \sum_{\mathbf{y} \in \Sigma^*} \left[ p_{SM} (\text{EOS} | \mathbf{y}) \prod_{t = 1}^{T} p_{SM}(y_t | \mathbf{y}_{<t}) \right] = 1.$$
A non-tight 2-gram model
$$p_{LN}(\Sigma^*) = \sum_{n=1}^{\infty} p_{LN}(a^n) = \sum_{n=1}^{\infty} (0.7)^{n-1} \cdot 0.1 = \frac{0.1}{1 - 0.7} = \frac{1}{3} < 1$$
Remark
Here any string that contains the symbol $b$ will have probability 0 because $p_{LN}(\text{EOS} | b) = p_{LN}(a | b) = 0$.
This also confirms that the local normalization does not necessarily yield $p_{LN}$ that is a valid distribution over $\Sigma^*$.
A tight 2-gram model
$$ \begin{align} \mathbb{P}(\Sigma^*) &= \sum_{n=1}^{\infty} \sum_{m=0}^{\infty} \mathbb{P}(a^n b^m) \\ &= \sum_{n=1}^{\infty} \left( \mathbb{P}(a^n) + \sum_{m=1}^{\infty} \mathbb{P}(a^n b^m) \right)\\ &= \sum_{n=1}^{\infty} \left( 0.1 \cdot (0.7)^{n-1} + \sum_{m=1}^{\infty} (0.7)^{n-1} \cdot 0.2 \cdot (0.9)^{m-1} \cdot 0.1 \right)\\ &= \sum_{n=1}^{\infty} \left( 0.1 \cdot (0.7)^{n-1} + (0.7)^{n-1} \cdot 0.2 \cdot \frac{0.1}{1 - 0.9} \right)\\ &= \sum_{n=1}^{\infty} \left( 0.1 \cdot (0.7)^{n-1} + (0.7)^{n-1} \cdot 0.2 \right)\\ &= \sum_{n=1}^{\infty} 0.3 \cdot (0.7)^{n-1}\\ &= \frac{0.3}{1 - 0.7} = 1 \end{align} $$
Remark
The non-tightness of the previous example is related to the fact that the conditional probability of $\text{EOS}$ is 0 at some states, in contrast to this example. However, making $p_{LN}(y_n = \text{EOS} | \mathbf{y}_{<n}) > 0$ for all prefixes $\mathbf{y}_{<n}$ is neither necessary nor sufficient to ensure tightness.
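As a quick numerical sanity check (our own), the two sums above can be evaluated by truncating the geometric series; the constants 0.7, 0.1, 0.2, and 0.9 are those appearing in the computations above.

```python
def non_tight_mass(N: int) -> float:
    # sum_{n=1}^{N} 0.1 * 0.7^(n-1)  ->  1/3
    return sum(0.1 * 0.7 ** (n - 1) for n in range(1, N + 1))

def tight_mass(N: int, M: int) -> float:
    # sum_{n=1}^{N} [ 0.1*0.7^(n-1) + sum_{m=1}^{M} 0.7^(n-1)*0.2*0.9^(m-1)*0.1 ]  ->  1
    return sum(
        0.1 * 0.7 ** (n - 1)
        + sum(0.7 ** (n - 1) * 0.2 * 0.9 ** (m - 1) * 0.1 for m in range(1, M + 1))
        for n in range(1, N + 1)
    )

print(non_tight_mass(200))   # ~0.333...: total mass < 1, so the model is non-tight
print(tight_mass(200, 500))  # ~1.0: total mass 1, so the model is tight
```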
Defining the probability measure of an LNM¶
We now outline the measure-theoretic treatment of LNMs in this section, with the goal of arriving at a precise characterization of $p_{LN}$. The final step corresponds to the sequence model (a probability measure over $\Sigma^* \cup \Sigma^{\infty}$) constructed for $p_{LN}$.
Sequence model¶
A sequence model is a probability space over the set $\Sigma^* \cup \Sigma^{\infty}$.
The set $\Sigma^{\infty} \subset \Sigma^* \cup \Sigma^{\infty}$ represents the event in which the sequence model does not terminate, i.e., it attempts to generate an infinitely long sequence.
Re-definition of a language model¶
A language model is a probability space over $\Sigma^*$. Equivalently, a language model is a sequence model such that $\mathbb{P}(\Sigma^{\infty}) = 0$.
Step 1: Defining an algebra over $\bar{\Sigma}^{\infty}$¶
We take $\bar{\Sigma}^{\infty}$, the set of all infinite sequences over $\bar{\Sigma}$ (i.e., sequences that may contain $\text{EOS}$ as a symbol), as the outcome space $\Omega$.
Cylinder set
Given any set $\mathcal{H} \subseteq \bar{\Sigma}^k$, i.e., a set of sequences of symbols from $\bar{\Sigma}$ of length $k$, define its cylinder set (of rank k) to be
$$\bar{C}(\mathcal{H}) \stackrel{\text{def}}{=} \left\{ \mathbf{y \omega} : \mathbf{y} \in \mathcal{H}, \mathbf{\omega} \in \bar{\Sigma}^{\infty} \right\}$$
For example, let $\bar{\Sigma} = \{H, T\}$ and $k = 2$. Taking $\mathcal{H} = \{HT\}$, the cylinder set $\bar{C}(\mathcal{H})$ consists of all infinite sequences beginning with $HT$, e.g., $HTHT\cdots$, $HTTH\cdots$, and so on. Taking $\mathcal{H} = \{H, T\}^2$ instead gives $\bar{C}(\mathcal{H}) = \bar{\Sigma}^{\infty}$.
We denote the collection of all rank-$k$ cylinder sets as
$$\bar{C}_k \stackrel{\text{def}}{=} \left\{ \bar{C}(\mathcal{H}) : \mathcal{H} \in \mathcal{P}(\bar{\Sigma}^k)\right\}$$
and define
$$\bar{C} \stackrel{\text{def}}{=} \bigcup_{k = 1}^{\infty} \bar{C}_k$$
to be the collection of all cylinder sets over $\Omega$.
Lemma
$\bar{C} \subseteq \mathcal{P}(\bar{\Sigma}^{\infty})$ is an algebra over $\Omega = \bar{\Sigma}^{\infty}$.
Proof: First, $\bar{\Sigma}^{\infty} = \bar{C}(\bar{\Sigma}^k)$ for any $k$, so $\Omega$ is itself a cylinder set (of any rank) and hence $\Omega \in \bar{C}$. Next, given a cylinder set $\bar{C}(\mathcal{H})$ of rank $k$, i.e., $\mathcal{H} \subseteq \bar{\Sigma}^{k}$, we have $(\bar{C}(\mathcal{H}))^c = \bar{C}(\bar{\Sigma}^k \backslash \mathcal{H})$, hence $\bar{C}$ is closed under complements. Finally, notice that the intersection of two cylinder sets of ranks $k_1 \le k_2$ is another cylinder set of rank $k_2$; together with closure under complements, this gives closure under finite unions. Hence $\bar{C}$ is an algebra over $\Omega$.
Step 2: Defining a pre-measure over $\bar{C}$¶
Given an LNM $p_{LN}$ and any set $\bar{C}(\mathcal{H}) \in \bar{C}$, let
$$\mathbb{P}_0 (\bar{C}(\mathcal{H})) \stackrel{\text{def}}{=} \sum_{\bar{\mathbf{y}} \in \mathcal{H}} p_{LN} (\bar{\mathbf{y}})$$
where we have defined
$$p_{LN} (\bar{\mathbf{y}}) \stackrel{\text{def}}{=} \prod_{t=1}^{|\bar{\mathbf{y}}|} p_{LN}(\bar{y}_t | \bar{\mathbf{y}}_{<t}).$$
Remark
$\mathbb{P}_0$ is a well-defined function.
$\mathbb{P}_0$ is a pre-measure over $\bar{C}$.
Let $\mathbb{P}_0$ be a finitely additive probability pre-measure over $\bar{C}$ such that, given a decreasing sequence of sets $A_1 \supset A_2 \supset \cdots$ in $\bar{C}$ where $\bigcap_{n=1}^{\infty} A_n = \emptyset$, $\lim_{n \rightarrow \infty} \mathbb{P}_0 (A_n) = 0$. Then $\mathbb{P}_0$ is also countably additive over $\bar{C}$.
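As a sketch of this step (our own toy example, with invented uniform conditionals), the pre-measure of a rank-$k$ cylinder set is just the total LNM probability of the length-$k$ histories it is built from; in particular, $\mathbb{P}_0(\bar{\Sigma}^{\infty}) = 1$.

```python
from itertools import product

EOS = "<EOS>"
SIGMA_BAR = ["a", "b", EOS]

def p_cond(symbol, context):
    # Toy conditional p_LN(symbol | context): uniform over Sigma-bar, for illustration.
    return 1.0 / len(SIGMA_BAR)

def pre_measure(H):
    """P_0(C(H)) = sum over length-k histories y in H of prod_t p_LN(y_t | y_{<t})."""
    total = 0.0
    for y in H:
        prob = 1.0
        for t, symbol in enumerate(y):
            prob *= p_cond(symbol, y[:t])
        total += prob
    return total

# The rank-2 cylinder set built from *all* length-2 histories is the whole
# outcome space, and its pre-measure is 1.
print(pre_measure(list(product(SIGMA_BAR, repeat=2))))  # 1.0 (up to float error)
```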
Step 3: Extending the pre-measure $\mathbb{P}_0$ into a measure $\mathbb{P}$¶
To extend $\mathbb{P}_0$ into a measure, we will use the Caratheodory theorem.
Caratheodory extension theorem
Given an algebra $\mathcal{A}$ over some set $\Omega$ and a probability pre-measure $\mathbb{P}_0 : \mathcal{A} \rightarrow [0, 1]$, there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that $\mathcal{A} \subset \mathcal{F}$ and $\mathbb{P}|_{\mathcal{A}} = \mathbb{P}_0$.
Furthermore, the $\sigma$-algebra $\mathcal{F}$ depends only on $\mathcal{A}$ and is minimal and unique, which we will also denote by $\sigma(\mathcal{A})$, and the probability measure $\mathbb{P}$ is unique.
Step 4: Defining a sequence model¶
We now convert the infinite, $\text{EOS}$-containing sequences in $\bar{\Sigma}^{\infty}$ into the finite or possibly infinite ($\text{EOS}$-free) strings over which a sequence model is defined.
A random variable is a measurable mapping between two measurable spaces. Here, we want to construct a $\sigma$-algebra over $\Sigma^* \cup \Sigma^{\infty}$ and a measurable mapping from $\bar{\Sigma}^{\infty}$ to $\Sigma^* \cup \Sigma^{\infty}$, so that we have the appropriate objects.
Given $\mathcal{H} \subseteq \Sigma^k$, define a rank-$k$ cylinder set in $\Sigma^* \cup \Sigma^{\infty}$ to be
$$\mathcal{C}(\mathcal{H}) \stackrel{\text{def}}{=} \{\mathbf{y \omega} : \mathbf{y} \in \mathcal{H}, \omega \in \Sigma^* \cup \Sigma^{\infty}\}.$$
Note that the suffixes $\omega$ of the elements in $\mathcal{C}(\mathcal{H})$ now lie in $\Sigma^* \cup \Sigma^{\infty}$ rather than $\bar{\Sigma}^{\infty}$, meaning that:
they do not contain $\text{EOS}$;
they, and hence the elements of $\mathcal{C}(\mathcal{H})$, can also be finite.
Let $\mathcal{C}_k$ be the set of all rank-$k$ cylinder sets. Define $\mathcal{C} \stackrel{\text{def}}{=} \bigcup_{k = 1}^{\infty}\mathcal{C}_k$. Then $\sigma(\mathcal{C})$ is a $\sigma$-algebra. We can now define the following random variable
$$ x(\omega) = \begin{cases} \omega_{<k} & \text{if } k \text{ is the position of the first EOS in } \omega \\ \omega & \text{otherwise (if EOS} \notin \omega) \end{cases} $$
given any $\omega \in \bar{\Sigma}^{\infty}$.
Remark
The function $x : (\bar{\Sigma}^{\infty}, \sigma(\bar{\mathcal{C}})) \rightarrow (\Sigma^* \cup \Sigma^{\infty}, \sigma(\mathcal{C}))$ is a measurable mapping.
$x$ intuitively "cuts out" the first stretch of $\omega$ before the first $\text{EOS}$ symbol (where an LNM would stop generating) or leaves the sequence intact if there is no termination symbol $\text{EOS}$.
$\mathbb{P}_*$ is a probability measure on $(\Sigma^* \cup \Sigma^{\infty}, \sigma(\mathcal{C}))$, hence $(\Sigma^* \cup \Sigma^{\infty}, \sigma(\mathcal{C}), \mathbb{P}_*)$ is a probability space.
To sum up, given an LNM $p_{LN}$, we have constructed a sequence model $p_{SM}$, a probability space over $\Sigma^* \cup \Sigma^{\infty}$, where the probabilities assigned to (finite) strings by $p_{SM}$ agree with $p_{LN}$.
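The mapping $x$ itself is easy to state in code. Here is a sketch (our own) in which infinite sequences are stood in for by ordinary Python iterables:

```python
EOS = "<EOS>"

def x(omega):
    """Cut omega at its first EOS; if EOS never occurs, omega is passed through intact."""
    for symbol in omega:
        if symbol == EOS:
            return
        yield symbol

print(list(x(["a", "b", EOS, "a", EOS])))  # ['a', 'b']
print(list(x(["a", "b", "a"])))            # ['a', 'b', 'a']  (no EOS: left intact)
```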
Interpreting the constructed probability space¶
Under the formulation of a probability space together with a random variable, useful probability quantities arise naturally and intuitively.
In the measurable space $(\Sigma^* \cup \Sigma^{\infty}, \sigma(\mathcal{C}))$, $\{\mathbf{y}\}$ is measurable for all $\mathbf{y} \in \Sigma^*$.
In the measurable space $(\Sigma^* \cup \Sigma^{\infty}, \sigma(\mathcal{C}))$, $\Sigma^{\infty}$ is measurable.
A sequence model $(\Sigma^* \cup \Sigma^{\infty}, \sigma(\mathcal{C}), \mathbb{P})$ is tight if and only if $\sum_{\mathbf{y} \in \Sigma^*} \mathbb{P}(\{\mathbf{y}\}) = 1$.
Characterizing tightness¶
Tight sequence model¶
A sequence model is said to be tight if $\mathbb{P}_*(x \in \Sigma^{\infty}) = 0$, in which case it is also a language model. Otherwise, we say that it is non-tight.
One way to obtain a non-tight sequence model is to set $p(\text{EOS} | \mathbf{x})$ to 0 for some contexts. But can $p(\text{EOS} | \mathbf{x})$ merely tend to 0 as $|\mathbf{x}|$ grows and the sequence model still be tight? The answer is yes.
Tight example:
- $\Sigma = \{a\}$
- $p(a | x) = 1 - \frac{1}{|x| + 1}$, $x \in \Sigma^*$
- $p(\text{EOS} | x) = \frac{1}{|x| + 1}$
- The EOS probabilities behave like the $p$-series with $p = 1$ (the harmonic series): their sum diverges, which makes the model tight
Non-tight example:
- $\Sigma = \{a\}$
- $p(a | x) = 1 - \frac{1}{|x|^2 + 1}$, $x \in \Sigma^*$
- $p(\text{EOS} | x) = \frac{1}{|x|^2 + 1}$
- The EOS probabilities behave like the $p$-series with $p = 2$: their sum converges, which makes the model non-tight (see the numerical sketch below)
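The following numerical sketch uses our own choice of per-step EOS probabilities, echoing the harmonic vs. $p = 2$ behaviour of the two examples: the probability of not having terminated after $N$ steps is $\prod_{t=1}^{N} (1 - p_{\text{EOS}}(t))$, which goes to 0 when $\sum_t p_{\text{EOS}}(t)$ diverges (tight) and stays bounded away from 0 when it converges (non-tight).

```python
def survival(p_eos, N: int) -> float:
    """Probability of not having generated EOS within the first N steps."""
    prob = 1.0
    for t in range(1, N + 1):
        prob *= 1.0 - p_eos(t)
    return prob

harmonic = lambda t: 1.0 / (t + 1)        # sum diverges  -> tight
quadratic = lambda t: 1.0 / (t + 1) ** 2  # sum converges -> non-tight

for N in (10, 100, 10_000):
    print(N, survival(harmonic, N), survival(quadratic, N))
# The first survival column tends to 0; the second tends to a positive limit,
# i.e., positive probability of an infinite sequence.
```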
A sufficient condition for tightness¶
An LNM is tight if and only if $\tilde{p}_{EOS}(t) = 1$ for some $t$ or $\sum_{t = 1}^{\infty} \tilde{p}_{EOS}(t) = \infty$.
Here $\tilde{p}_{EOS}(t) = \frac{\sum_{\omega \in \Sigma^{t - 1}}p_{LN}(\omega)\, p_{LN}(\text{EOS} | \omega)}{\sum_{\omega \in \Sigma^{t - 1}}p_{LN}(\omega)}$, which can be seen as the probability of generating $\text{EOS}$ at step $t$, averaged over all $\text{EOS}$-free prefixes $\omega$ of length $t - 1$ and weighted by their prefix probabilities $p_{LN}(\omega)$.
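As a final sketch (our own, with invented conditionals over $\Sigma = \{a, b\}$), $\tilde{p}_{EOS}(t)$ can be computed exactly for small $t$ by enumerating all length-$(t-1)$ $\text{EOS}$-free prefixes:

```python
from itertools import product

EOS = "<EOS>"

def p_cond(symbol, context):
    # Toy conditionals (our own): the EOS probability shrinks as the context grows.
    p_eos = 1.0 / (len(context) + 2)
    return p_eos if symbol == EOS else (1.0 - p_eos) / 2.0  # split the rest over {a, b}

def prefix_prob(omega):
    """p_LN(omega) = prod_t p(omega_t | omega_{<t}) for an EOS-free prefix omega."""
    prob = 1.0
    for t, symbol in enumerate(omega):
        prob *= p_cond(symbol, omega[:t])
    return prob

def p_tilde_eos(t):
    num = sum(prefix_prob(w) * p_cond(EOS, w) for w in product("ab", repeat=t - 1))
    den = sum(prefix_prob(w) for w in product("ab", repeat=t - 1))
    return num / den

print([round(p_tilde_eos(t), 3) for t in range(1, 8)])
# Here p~_EOS(t) = 1/(t+1), so sum_t p~_EOS(t) diverges (harmonic-like) and the
# criterion above implies this toy LNM is tight.
```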