263-5354-00: Large Language Model
Section 3
Classical Language Models
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/15/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ryan Cotterell. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Informal definition of a finite-state language model
A language model $p_{LM}$ is finite-state if it defines only finitely many unique conditional distributions $p_{LM}(y | \mathbf{y})$. In other words, the contexts $\mathbf{y}$ give rise to only finitely many distinct distributions over the next symbol, $p_{LM}(y | \mathbf{y})$.
Remark
Intuitively, this framework might be useful because it bounds the number of unique conditional distributions we have to learn. However, finite-state language models are not sufficient for modeling human language. Nevertheless, they can still offer a baseline for modeling more complex phenomena.
Finite-state automata (FSA)¶
Finite-state automata are among the simplest devices for defining a formal language.
A finite-state automaton (FSA) is a 5-tuple $(\Sigma, Q, I, F, \delta)$ where
$\Sigma$ is an alphabet
$Q$ is a finite set of states
$I \subseteq Q$ is the set of initial states
$F \subseteq Q$ is the set of final or accepting states
A finite multiset $\delta \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times Q$; elements of $\delta$ are generally called transitions. The fact that it is a multiset reflects that it can contain multiple copies of the same element (i.e., multiple transitions between the same pair of states with the same symbol).
Example
An FSA can be graphically represented as a labeled, directed multi-graph. The vertices in the graph represent the states $q \in Q$, and the (labeled) edges between them represent the transitions in $\delta$. The labels on the edges correspond to the input symbols $a \in \Sigma$ which are consumed when transitioning over the edges. The initial states $q_{\iota} \in I$ are marked by a special incoming arrow while the final states $q_{\varphi} \in F$ are indicated using a double circle.
We can specify it as
$\Sigma = \{a, b, c\}$
$Q = \{1, 2, 3\}$
$I = \{1\}$
$F = \{3\}$
$\delta = \{(1, a, 2), (1, b, 3), (2, b, 2), (2, c, 3)\}$
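To make the example concrete, here is a minimal Python sketch (not part of the course material) that encodes this FSA and checks whether a string is recognized; the class and method names are illustrative, and $\varepsilon$-transitions are not handled since this example has none.

```python
# Minimal sketch of the example FSA above; all names are illustrative.
class FSA:
    def __init__(self, Sigma, Q, I, F, delta):
        self.Sigma, self.Q, self.I, self.F = Sigma, Q, I, F
        self.delta = delta  # list of (source, symbol, target) transitions

    def accepts(self, string):
        """Return True iff some path labeled with `string` ends in a final state."""
        current = set(self.I)  # track a set of states to handle non-determinism
        for symbol in string:
            current = {q2 for (q1, a, q2) in self.delta
                       if q1 in current and a == symbol}
        return bool(current & self.F)

A = FSA(Sigma={"a", "b", "c"}, Q={1, 2, 3}, I={1}, F={3},
        delta=[(1, "a", 2), (1, "b", 3), (2, "b", 2), (2, "c", 3)])
print(A.accepts("abc"))  # True:  1 -a-> 2 -b-> 2 -c-> 3
print(A.accepts("ab"))   # False: ends in state 2, which is not final
```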
Deterministic finite-state automaton¶
An FSA $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ is deterministic if
it does not have any $\varepsilon$-transitions, i.e., transitions that allow the machine to move to a new state without consuming a symbol
for every $(q, a) \in Q \times \Sigma$, there is at most one $q' \in Q$ such that $q \stackrel{a}{\rightarrow} q' \in \delta$
there is a single initial state, i.e., $|I| = 1$
Otherwise, $\mathcal{A}$ is non-deterministic.
Language of a finite-state automaton¶
If, after reading a string, the automaton ends up in one of the final states $q_{\varphi} \in F$, we say that the automaton accepts that string. A finite-state automaton is therefore a computational device that determines whether a string satisfies a condition. A string that satisfies this condition is said to be recognized by the automaton, and the set of all strings satisfying this condition forms the language of the automaton.
Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a finite-state automaton. The language of $\mathcal{A}$, $L(\mathcal{A})$, is defined as
$$L(\mathcal{A}) \stackrel{\text{def}}{=} \{\mathbf{y} | \mathbf{y} \text{ is recognized by } \mathcal{A}\}.$$
Regular language¶
Abstractly, a finite-state automaton is hence a specification of a set of rules that strings must satisfy to be included in its language. The set of languages that finite-state automata can recognize is known as the class of regular languages.
A language $L \subseteq \Sigma^*$ is regular if and only if it can be recognized by an unweighted finite-state automaton, i.e., if there exists a finite-state automaton $\mathcal{A}$ such that $L = L(\mathcal{A})$.
Example
Additional simple examples of FSAs are shown below. The first FSA (on the left side) can formally be defined with
$\Sigma = \{a, b, c\}$
$Q = \{1, 2, 3, 4, 5, 6\}$
$I = \{1\}$
$F = \{6\}$
$\delta = \{(1, a, 2), (1, b, 3), (2, b, 2), (2, c, 4), (3, c, 4), (3, b, 5), (4, a, 6), (5, a, 6)\}$
The first FSA (on the left side) is deterministic while the second one (on the right side) is non-deterministic.
Weighted finite-state automata (WFSA)¶
A common and very useful augmentation of finite-state automata is the addition of weights on the transitions.
Real-weighted finite-state automaton (WFSA)¶
A real-weighted finite-state automaton (WFSA) $\mathcal{A}$ is a 5-tuple $(\Sigma, Q, \delta, \lambda, \rho)$ where
$\Sigma$ is a finite alphabet
$Q$ is a finite set of states
$\delta \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times \mathbb{R} \times Q$ a finite multiset of transitions
$\lambda : Q \rightarrow \mathbb{R}$ the initial weight function over $Q$
$\rho : Q \rightarrow \mathbb{R}$ the final weight function over $Q$
Remark
The initial and final states can be implicitly specified as the states given non-zero initial or final weights by the $\lambda$ and $\rho$ functions, i.e., $I = \{q \in Q | \lambda(q) \neq 0\}$ and $F = \{q \in Q | \rho(q) \neq 0\}$. We will also denote transition weights with $\omega \left(q \stackrel{a / w}{\rightarrow} q'\right) \stackrel{\text{def}}{=} w$.
Graphically, we write the transition weights on the edges of the graph representing the WFSA after the symbol, separated by a "/". The same separator is also used to separate the state name from its final weight, which is written in the node. The initial weights, however, are written on the incoming arrows denoting initial states.
An example graph is shown below
Transition matrix¶
The connection of WFSAs to graphs makes it natural to define a set of transition matrices specified by a WFSA.
Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a WFSA. For any $a \in \Sigma$, we define the symbol-specific transition matrix $\mathbf{T}^{(a)}$ as the transition matrix of the graph restricted to $a$-labeled transitions. We also define the (full) transition matrix as $\mathbf{T} \stackrel{\text{def}}{=} \sum_{a \in \Sigma} \mathbf{T}^{(a)}$.
Example
Consider the WFSA $\mathcal{A}$ above; the transition matrices for $\mathcal{A}$ are
$$\mathbf{T}^{(a)} = \begin{pmatrix} 0 & 0.5 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & \frac{1}{\pi\cdot e}\\ 0 & 0 & 0 & 0 & 0 & 0.29\\ 0 & 0 & 0 & 0 & 0 & 0\\ \end{pmatrix}, \mathbf{T}^{(b)} = \begin{pmatrix} 0 & 0 & \frac{1}{\pi} & 0 & 0 & 0\\ 0 & 0.63 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0.13 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ \end{pmatrix}, \mathbf{T}^{(c)} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0.9 & 0 & 0\\ 0 & 0 & 0 & 0.21 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ \end{pmatrix}$$
$$\mathbf{T}=\mathbf{T}^{(a)}+\mathbf{T}^{(b)}+\mathbf{T}^{(c)}=\begin{pmatrix} 0 & 0.5 & \frac{1}{\pi} & 0 & 0 & 0\\ 0 & 0.63 & 0 & 0.9 & 0 & 0\\ 0 & 0 & 0 & 0.21 & 0.13 & 0\\ 0 & 0 & 0 & 0 & 0 & \frac{1}{\pi\cdot e}\\ 0 & 0 & 0 & 0 & 0 & 0.29\\ 0 & 0 & 0 & 0 & 0 & 0\\ \end{pmatrix}$$
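The symbol-specific matrices can be assembled directly in NumPy; the following sketch (the helper name `transition_matrix` is illustrative) reproduces $\mathbf{T}^{(a)}$, $\mathbf{T}^{(b)}$, $\mathbf{T}^{(c)}$, and their sum for the six-state example above.

```python
import numpy as np

n = 6  # number of states; state q corresponds to row/column q - 1

def transition_matrix(transitions, n):
    """Build a symbol-specific transition matrix from (source, target, weight) triples."""
    T = np.zeros((n, n))
    for q1, q2, w in transitions:
        T[q1 - 1, q2 - 1] += w  # += accumulates parallel transitions (multiset)
    return T

T_a = transition_matrix([(1, 2, 0.5), (4, 6, 1 / (np.pi * np.e)), (5, 6, 0.29)], n)
T_b = transition_matrix([(1, 3, 1 / np.pi), (2, 2, 0.63), (3, 5, 0.13)], n)
T_c = transition_matrix([(2, 4, 0.9), (3, 4, 0.21)], n)
T = T_a + T_b + T_c  # full transition matrix
```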
Paths and path weight¶
Path¶
A path $\pi$ is an element of $\delta^*$ with consecutive transitions, meaning that it is of the form $\left( q_1 \stackrel{\cdot / \cdot}{\rightarrow} q_2, q_2 \stackrel{\cdot / \cdot}{\rightarrow} q_3, \cdots, q_{n-1} \stackrel{\cdot / \cdot}{\rightarrow} q_n \right)$, where $\cdot$ is a placeholder. The length of a path is the number of transitions in it; we denote the length as $|\pi|$.
We use $p(\pi)$ and $n(\pi)$ to denote the origin and the destination of a path, respectively. The yield of a path is the concatenation of the input symbols on the edges along the path, which we denote as $s(\pi)$. Furthermore, we denote sets of paths with capital $\Pi$. A few variants involving $\Pi$:
$\Pi(\mathcal{A})$ as the set of all paths in automaton $\mathcal{A}$
$\Pi(\mathcal{A}, \mathbf{y})$ as the set of all paths in automaton $\mathcal{A}$ with yield $\mathbf{y}\in \Sigma^*$
$\Pi(\mathcal{A}, q, q')$ as the set of all paths in automaton $\mathcal{A}$ from state $q$ to state $q'$
Path weight¶
The inner path weight $w_I(\pi)$ of a path $\pi = q_1 \stackrel{a_1 / w_1}{\rightarrow} q_2 \cdots q_{N} \stackrel{a_N / w_N}{\rightarrow} q_{N+1}$ is defined as
$$w_I(\pi) = \prod_{n=1}^{N}w_n.$$
The (full) path weight of the path $\pi$ is then defined as
$$w(\pi) = \lambda(p(\pi)) w_I(\pi) \rho(n(\pi)).$$
A path $\pi$ is called accepting or successful if $w(\pi) \neq 0$.
Remark
The inner path weight is therefore the product of the weights of the transitions on the path, while the (full) path weight is the product of the transition weights together with the initial and final weights of the origin and the destination of the path, respectively.
String acceptance weights and weighted regular languages¶
Stringsum¶
The stringsum, string weight, or acceptance weight of a string $\mathbf{y} \in \Sigma^*$ under a WFSA $\mathcal{A}$ is defined as
$$\mathcal{A}(\mathbf{y}) \stackrel{\text{def}}{=} \sum_{\pi \in \Pi(\mathcal{A}, \mathbf{y})} w(\pi).$$
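For a WFSA without $\varepsilon$-transitions, the stringsum can be computed without enumerating paths: writing $\vec{\lambda}$ and $\vec{\rho}$ for the vectors of initial and final weights, $\mathcal{A}(\mathbf{y}) = \vec{\lambda}^{\top}\mathbf{T}^{(y_1)}\cdots\mathbf{T}^{(y_N)}\vec{\rho}$. Below is a small self-contained NumPy sketch on a made-up two-state WFSA (all weights are illustrative, not taken from the notes).

```python
import numpy as np

# Made-up 2-state WFSA over {a, b}:
#   state 1: initial weight 1.0;  state 2: final weight 0.4
#   transitions: 1 -a/0.5-> 2,  2 -b/0.3-> 2
T = {"a": np.array([[0.0, 0.5], [0.0, 0.0]]),
     "b": np.array([[0.0, 0.0], [0.0, 0.3]])}
lam = np.array([1.0, 0.0])
rho = np.array([0.0, 0.4])

def stringsum(y):
    """A(y) = lambda^T T^(y_1) ... T^(y_N) rho (assumes no epsilon-transitions)."""
    v = lam
    for symbol in y:
        v = v @ T[symbol]
    return float(v @ rho)

print(stringsum("ab"))  # 1.0 * 0.5 * 0.3 * 0.4 = 0.06
print(stringsum("ba"))  # 0.0: no accepting path yields "ba"
```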
Weighted language of a weighted finite-state automaton¶
Let $\mathcal{A}$ be a WFSA. Its (weighted) language is defined as
$$L(\mathcal{A}) \stackrel{\text{def}}{=} \{(\mathbf{y}, \mathcal{A}(\mathbf{y})) | \mathbf{y} \in \Sigma^*\}.$$
Weighted regular language¶
A weighted language $L$ is a weighted regular language if there exists a WFSA $\mathcal{A}$ such that $L = L(\mathcal{A})$.
State-specific allsum¶
Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a WFSA. The allsum of a state $q \in Q$ is defined as
$$Z(\mathcal{A}, q) = \sum_{\pi \in \Pi(\mathcal{A}),\, p(\pi) = q} w_I(\pi) \rho(n(\pi)).$$
Remark
State-specific allsums are also referred to as the backward values in the literature and are often denoted as $\beta(q)$.
WFSA allsum¶
Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a WFSA. The allsum of $\mathcal{A}$ is defined as
$$Z(\mathcal{A}) = \sum_{\mathbf{y} \in \Sigma^*} \mathcal{A}(\mathbf{y}) = \sum_{\mathbf{y} \in \Sigma^*} \sum_{\pi \in \Pi(\mathcal{A}, \mathbf{y})} w(\pi) = \sum_{\pi \in \Pi(\mathcal{A})} w(\pi).$$
Remark
The second equality comes from the crucial observation that the double sum in the second term sums over precisely all paths of the automaton $\mathcal{A}$, which is where the name allsum comes from.
Accessibility and probabilistic weighted finite-state automata¶
(Co)-accessible and useful states¶
A state $q \in Q$ of a WFSA is accessible if there is a non-zero-weighted path to $q$ from some state $q_\iota$ with $\lambda(q_{\iota}) \neq 0$; it is co-accessible if there is a non-zero-weighted path from $q$ to some state $q_{\varphi}$ with $\rho(q_{\varphi}) \neq 0$. It is useful if it is both accessible and co-accessible, i.e., $q$ appears on some non-zero-weighted accepting path.
Trim automaton¶
Trimming a WFSA means removing its non-useful (useless) states. Removing the non-useful states means removing their rows and columns from $\mathbf{T}$ as well as their entries from $\vec{\lambda}$ and $\vec{\rho}$, yielding possibly smaller $\mathbf{T}'$, $\vec{\lambda}'$, and $\vec{\rho}'$.
Remark
We will use WFSAs to specify language models. However, not every WFSA is a language model, i.e., a distribution over strings. Generally, the weight of a string could be negative if we allow arbitrary real weights. Thus, a restriction we will impose on all weighted automata that represent finite-state language models is that the weights be non-negative.
Probabilistic weighted finite-state automaton (PFSA)¶
A specific class of WFSAs that will be of particular interest later is that of probabilistic WFSAs.
A WFSA $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ is probabilistic (a PFSA) if
$$\sum_{q \in Q} \lambda(q) = 1$$
and, for all $q \in Q$ and all outgoing transitions $q \stackrel{a / w}{\rightarrow} q' \in \delta$ it holds that
$$\lambda(q) \ge 0$$
$$\rho(q) \ge 0$$
$$w \ge 0$$
and
$$\sum_{q \stackrel{a / w}{\rightarrow} q' \in \delta} w + \rho(q) = 1.$$
Remark
This means that the initial weights of all the states of the automaton form a probability distribution (the initial weight of a state corresponds to the probability of starting in it), as well as that, for any state $q$ in the WFSA, the weights of its outgoing transitions (with any label) together with its final weight form a valid discrete probability distribution. In a certain sense, probabilistic finite-state automata naturally correspond to locally normalized language models.
Notice that the final weights in a PFSA play an analogous role to the $\text{EOS}$ symbol. The probability of ending a path in a specific state $q$ - and therefore ending a string - is $q$'s final weight. With this, when modeling language with weighted finite-state automata, we will be able to avoid the need to specify the special symbol and rather rely on the final weights, which are naturally part of the framework.
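As a quick sanity check of these conditions, the sketch below verifies that a small made-up automaton (not one from the notes) is probabilistic: the initial weights sum to one and, for every state, the outgoing transition weights plus the final weight sum to one.

```python
# Hypothetical two-state PFSA over {a, b}; all numbers are made up.
lam = {1: 1.0, 2: 0.0}                        # initial weights
rho = {1: 0.1, 2: 0.5}                        # final weights
delta = [(1, "a", 0.6, 2), (1, "b", 0.3, 1),  # (source, symbol, weight, target)
         (2, "a", 0.5, 2)]

def is_probabilistic(lam, rho, delta, tol=1e-9):
    """Check the PFSA conditions: non-negative weights, initial weights summing
    to 1, and per-state outgoing weights + final weight summing to 1."""
    if abs(sum(lam.values()) - 1.0) > tol:
        return False
    for q in lam:
        out = sum(w for (q1, _, w, _) in delta if q1 == q)
        if lam[q] < 0 or rho[q] < 0 or abs(out + rho[q] - 1.0) > tol:
            return False
    return all(w >= 0 for (_, _, w, _) in delta)

print(is_probabilistic(lam, rho, delta))  # True
```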
Finite-state language models¶
A language model $p_{LM}$ is finite-state if it can be represented by a weighted finite-state automaton, i.e., if there exists a WFSA $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ such that $L(\mathcal{A}) = L(p_{LM})$. Equivalently, we could say that $p_{LM}$ is finite-state if its language is a weighted regular language.
String probabilities in a probabilistic finite-state automaton¶
In a probabilistic FSA, any action from a state $q \in Q$ is associated with a probability. It is intuitive to see those probabilities as conditional probabilities of the next symbol given the input so far.
Path probability in a PFSA¶
We call the weight of a path $\pi \in \Pi(\mathcal{A})$ in a probabilistic FSA the probability of the path $\pi$.
String probability in a PFSA¶
Naturally, we define the probability of $\mathbf{y}$ as the sum of the weights of the individual paths that recognize it.
We call the stringsum of a string $\mathbf{y} \in \Sigma^*$ in a probabilistic FSA the probability of the string $\mathbf{y}$:
$$p_{\mathcal{A}}(\mathbf{y}) \stackrel{\text{def}}{=} \mathcal{A}(\mathbf{y}).$$
Remark
Note that the two definitions above do not require any normalization over all possible paths or strings. This closely resembles the way we defined locally normalized models based on the conditional probabilities of a sequence model.
String probabilities in a general weighted finite-state automaton¶
Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a normalizable WFSA (i.e., one with $0 < Z(\mathcal{A}) < \infty$) with non-negative weights. We define the probability of a string $\mathbf{y} \in \Sigma^*$ under $\mathcal{A}$ as
$$p_{\mathcal{A}}(\mathbf{y}) \stackrel{\text{def}}{=} \frac{\mathcal{A}(\mathbf{y})}{Z(\mathcal{A})}.$$
Language models induced by a WFSA¶
Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ be a WFSA. We define the language model induced by $\mathcal{A}$ as the following probability distribution over $\Sigma^*$
$$p_{LM_{\mathcal{A}}}(\mathbf{y}) \stackrel{\text{def}}{=} p_{\mathcal{A}}(\mathbf{y}).$$
Normalizing finite-state language models¶
This section outlines an algorithm for normalizing a language model defined by a Weighted Finite-State Automaton (WFSA). The primary goal is to compute the "allsum", denoted as $Z(\mathcal{A})$, which is the sum of weights of all possible paths in the automaton.
Converting a Matrix of Pairwise Pathsums to the Allsum¶
First, let's assume we have a matrix $M$ where the entry $M_{ij}$ represents the sum of the inner weights of all paths from state $i$ to state $j$:
$$ M_{ij} = \sum_{\pi \in \Pi(\mathcal{A}, i, j)} w_{I}(\pi) $$
The allsum $Z(\mathcal{A})$ can be calculated using this matrix $M$ along with the initial weight function $\lambda(i)$ and the final weight function $\rho(j)$. The derivation is as follows:
$$ \begin{align} Z(\mathcal{A}) &= \sum_{\pi \in \Pi(\mathcal{A})} w(\pi)\\ &= \sum_{i,j \in Q} \lambda(i) \left( \sum_{\pi \in \Pi(\mathcal{A}, i, j)} w_{I}(\pi) \right) \rho(j)\\ &= \sum_{i,j \in Q} \lambda(i) M_{ij} \rho(j) \end{align} $$
This can be expressed in a compact vector form:
$$ Z(\mathcal{A}) = \vec{\lambda}^{\top} M \vec{\rho} $$
where $\vec{\lambda}$ and $\vec{\rho}$ are vector representations of the initial and final weight functions.
Computing the Matrix of Pairwise Pathsums¶
The next challenge is to compute the matrix $M$ itself. Let $T$ be the transition matrix of the automaton $\mathcal{A}$, where $T_{ij}$ is the sum of weights of all direct transitions (paths of length 1) from state $i$ to $j$.
Lemma: The entry $(T^d)_{ij}$ is the sum of the inner weights of all paths of length exactly $d$ from state $i$ to state $j$.
For an automaton that may contain cycles, paths can be of any length. To account for all possible paths, we must sum over all path lengths from 0 to infinity. This gives us the matrix $T^*$, which is equivalent to the matrix $M$ discussed earlier.
$$ T^* = \sum_{d=0}^{\infty} T^d $$
This is a geometric series for matrices, and we can find a closed-form solution:
$$ T^* = I + T \sum_{d=1}^{\infty} T^{d-1} = I + T T^* $$
Rearranging the equation $T^* = I + T T^*$ gives:
$$ \begin{align} T^*(I - T) &= I\\ T^* &= (I - T)^{-1} \end{align} $$
This crucial result shows that the matrix of all pairwise pathsums can be found by computing the inverse of the matrix $(I - T)$.
Conclusion: A globally normalized finite-state language model can be normalized by computing a matrix inversion. The runtime for this matrix inversion is $\mathcal{O}(|Q|^3)$, where $|Q|$ is the number of states, which can be computationally expensive.
Convergence Condition¶
The infinite sum $T^* = \sum_{d=0}^{\infty} T^d$ converges if and only if the magnitudes of all eigenvalues of $T$ are less than 1, i.e., if the spectral radius of $T$ is less than 1. If this condition holds, $(I - T)$ is guaranteed to be invertible.
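Putting the pieces together, here is a minimal NumPy sketch of the allsum computation: check convergence via the spectral radius of $T$, then evaluate $Z(\mathcal{A}) = \vec{\lambda}^{\top}(I - T)^{-1}\vec{\rho}$. The automaton below is a made-up three-state example, not one from the notes.

```python
import numpy as np

# Illustrative 3-state WFSA; T, lam, rho are made-up placeholders.
T = np.array([[0.0, 0.4, 0.1],
              [0.0, 0.5, 0.3],
              [0.0, 0.0, 0.0]])
lam = np.array([1.0, 0.0, 0.0])  # initial weights
rho = np.array([0.0, 0.2, 1.0])  # final weights

# Convergence check: the spectral radius of T must be strictly below 1.
spectral_radius = max(abs(np.linalg.eigvals(T)))
assert spectral_radius < 1, "the sum over path lengths does not converge"

# Z(A) = lambda^T (I - T)^{-1} rho; O(|Q|^3) because of the matrix inversion.
T_star = np.linalg.inv(np.eye(len(T)) - T)
Z = lam @ T_star @ rho
print(Z)
```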
Speed-ups of the Allsum Algorithm¶
The $\mathcal{O}(|Q|^3)$ complexity can be prohibitive. Faster methods exist for specific automaton structures:
- Acyclic WFSAs: The allsum can be computed in time linear in the number of transitions.
- Decomposable WFSAs: If the automaton consists of sparsely connected components that form an acyclic graph of components, a hybrid approach can be more efficient.
Importantly, the allsum algorithm is differentiable, allowing it to be integrated into gradient-based training frameworks.
Locally Normalizing a Globally Normalized Model¶
Any globally normalized model can be converted into a locally normalized one. For finite-state models, this is done via a "weight pushing" algorithm.
Theorem: Normalizable WFSAs (with non-negative weights) and tight probabilistic FSAs (PFSAs) are equally expressive.
This means for any normalizable WFSA, we can construct an equivalent PFSA. The construction involves re-weighting the initial, final, and transition weights based on state-specific allsums, $Z(\mathcal{A}, q)$, which is the sum of weights of all paths starting from state $q$.
The new weights for the probabilistic automaton $\mathcal{A}_L$ are defined as:
- Initial weights: $\lambda_{\mathcal{A}_{L}}(q) \triangleq \lambda(q)\frac{Z(\mathcal{A},q)}{Z(\mathcal{A})}$
- Final weights: $\rho_{\mathcal{A}_{L}}(q) \triangleq \frac{\rho(q)}{Z(\mathcal{A},q)}$
- Transition weights: $\omega_{\mathcal{A}_{L}}(q \xrightarrow{a/w} q') \triangleq \frac{\omega(q \xrightarrow{a/w} q')Z(\mathcal{A},q')}{Z(\mathcal{A},q)}$
The proof shows that with these definitions, the path probabilities remain the same between the original WFSA and the new PFSA.
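Assuming a trim, normalizable WFSA with non-negative weights, weight pushing can be sketched as follows: compute all backward values $Z(\mathcal{A}, q)$ at once as $(I - T)^{-1}\vec{\rho}$ and re-weight according to the definitions above. The function and variable names, and the tiny example, are illustrative.

```python
import numpy as np

def push_weights(T_by_symbol, lam, rho):
    """Locally normalize a WFSA: return PFSA weights (lam', rho', per-symbol T')
    using the re-weighting formulas above. Assumes non-negative weights, a
    convergent T, and a trim automaton (all backward values non-zero)."""
    T = sum(T_by_symbol.values())
    beta = np.linalg.solve(np.eye(len(T)) - T, rho)  # backward values Z(A, q)
    Z = lam @ beta                                   # allsum Z(A)
    lam_new = lam * beta / Z                         # lambda'(q) = lambda(q) Z(A,q) / Z(A)
    rho_new = rho / beta                             # rho'(q)    = rho(q) / Z(A,q)
    T_new = {a: Ta * beta[None, :] / beta[:, None]   # w' = w Z(A,q') / Z(A,q)
             for a, Ta in T_by_symbol.items()}
    return lam_new, rho_new, T_new

# Tiny made-up example over {a, b}:
T_by_symbol = {"a": np.array([[0.0, 0.5], [0.0, 0.0]]),
               "b": np.array([[0.0, 0.0], [0.0, 0.3]])}
lam, rho = np.array([1.0, 0.0]), np.array([0.2, 0.4])
lam_p, rho_p, T_p = push_weights(T_by_symbol, lam, rho)
print(lam_p.sum())                                          # 1.0
print(T_p["a"].sum(axis=1) + T_p["b"].sum(axis=1) + rho_p)  # [1. 1.]
```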
Defining a Parametrized Globally Normalized Language Model¶
To train a finite-state language model, its weights can be parametrized by a set of learnable parameters $\theta$.
- A score function $f_{\theta}(q_1, y, q_2)$ determines the weights of transitions.
- Functions $f_{\theta}^{\lambda}$ and $f_{\theta}^{\rho}$ determine the initial and final weights.
This creates a parametrized automaton $\mathcal{A}_{\theta}$. The model's probability for a string $y$ is defined by its stringsum in $\mathcal{A}_{\theta}$, normalized by the allsum $Z(\mathcal{A}_{\theta})$.
To fit this into the standard framework of globally normalized models, an energy function can be defined as:
$$ \hat{p}_{GN}^{\mathcal{A}_{\theta}}(y) = -\log \mathcal{A}_{\theta}(y) $$
This formulation allows the model to be trained using gradient-based optimization methods.
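As one possible (illustrative, not the course's exact) parametrization: keep unconstrained scores in $\theta$, obtain non-negative weights via $\exp$, and train by minimizing $-\log \mathcal{A}_{\theta}(y) + \log Z(\mathcal{A}_{\theta})$, both terms of which are differentiable in $\theta$.

```python
import numpy as np

# Hypothetical parametrization: one unconstrained score per transition and per
# state; exp(.) keeps all weights positive. Scores are shifted so that the
# transition matrix stays convergent (spectral radius < 1) in this toy example.
rng = np.random.default_rng(0)
n_states, alphabet = 2, ["a", "b"]
theta = {"trans": 0.1 * rng.normal(size=(len(alphabet), n_states, n_states)) - 2.0,
         "init":  0.1 * rng.normal(size=n_states),
         "final": 0.1 * rng.normal(size=n_states)}

def neg_log_likelihood(y, theta):
    """-log p(y) = -log A_theta(y) + log Z(A_theta) for the globally normalized model."""
    T_sym = {a: np.exp(theta["trans"][i]) for i, a in enumerate(alphabet)}
    lam, rho = np.exp(theta["init"]), np.exp(theta["final"])
    v = lam                                  # stringsum A_theta(y)
    for symbol in y:
        v = v @ T_sym[symbol]
    stringsum = v @ rho
    T = sum(T_sym.values())                  # allsum Z(A_theta), assuming convergence
    Z = lam @ np.linalg.solve(np.eye(n_states) - T, rho)
    return -np.log(stringsum) + np.log(Z)

print(neg_log_likelihood("ab", theta))  # with an autodiff library, this loss can be minimized in theta
```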
The n-gram Assumption and Subregularity¶
N-gram models are a historically significant type of finite-state language model. They simplify the task of probability calculation by making the n-gram assumption: the probability of a given word depends only on the previous $n-1$ words. This is also known as an $(n-1)^{th}$ order Markov assumption.
For words near the beginning of a sequence, the context is padded with special "beginning-of-sequence" ($\text{BOS}$) symbols to ensure the history is always $n-1$ words long.
Representing n-grams as WFSAs¶
Every n-gram model can be formally represented as a Weighted Finite-State Automaton (WFSA). The intuition is that the model's finite number of possible histories can be represented as the states of an automaton.
The construction is as follows:
- States: The states $Q$ are all possible sequences of words (histories) up to length $n-1$.
- Transitions: A transition's weight is the conditional probability of seeing the next word given the current history.
- Initial State: The single starting state is a sequence of $n-1$ $\text{BOS}$ symbols.
- Final Weights: A state's final weight is the probability of the sequence ending after that history.
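A minimal sketch of this construction for the bigram case ($n = 2$), assuming the conditional probabilities are already given (here as made-up dictionaries `p_next` and `p_eos`); all names and numbers are illustrative.

```python
# Bigram (n = 2) case: histories are single symbols plus the BOS history.
# p_next[h][y] = p(y | h) and p_eos[h] = probability of ending after history h.
p_next = {"BOS": {"a": 0.6, "b": 0.4},
          "a":   {"a": 0.1, "b": 0.5},
          "b":   {"a": 0.3, "b": 0.2}}
p_eos = {"BOS": 0.0, "a": 0.4, "b": 0.5}

# WFSA components: one state per history, transition h -y/p(y|h)-> y.
Q = list(p_next)                                    # states = histories of length n-1
lam = {h: 1.0 if h == "BOS" else 0.0 for h in Q}    # start in the BOS history
rho = dict(p_eos)                                   # final weight = stopping probability
delta = [(h, y, p, y)                               # (source, symbol, weight, target)
         for h, dist in p_next.items() for y, p in dist.items()]

# Each state's outgoing weights plus its final weight sum to 1, so the
# resulting automaton is probabilistic (a PFSA).
for h in Q:
    assert abs(sum(p_next[h].values()) + p_eos[h] - 1.0) < 1e-9
```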
Subregularity¶
N-gram models correspond to a class of languages called subregular languages because they can be recognized by machinery that is simpler than a full finite-state automaton. Specifically, they are strictly $n$-local languages: the validity of a string is determined by its blocks of $n$ consecutive symbols (the previous $n-1$ symbols plus the next one), independently of the symbols that came before. This is exactly the locality that the n-gram assumption describes.
Representation-based n-gram Models¶
While simple n-gram models can be parameterized by counting word occurrences from a text corpus, this approach has significant drawbacks.
- Data Sparsity: The model will assign a zero probability to any valid sequence that didn't happen to appear in the training data.
- Lack of Generalization: It treats all words as distinct indices in a table and cannot use semantic similarity. For example, it can't infer that "adopt a cat" is a probable sentence just from seeing "adopt a dog," because it doesn't know that "cat" and "dog" are similar.
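For concreteness, here is a count-based bigram estimate on a toy corpus (made up for illustration), showing how an unseen but plausible continuation receives probability zero:

```python
from collections import Counter

# Toy corpus; BOS/EOS mark sentence boundaries for the bigram histories.
corpus = [["adopt", "a", "dog"], ["adopt", "a", "puppy"]]
bigrams, history_counts = Counter(), Counter()
for sentence in corpus:
    padded = ["BOS"] + sentence + ["EOS"]
    for h, y in zip(padded, padded[1:]):
        bigrams[(h, y)] += 1
        history_counts[h] += 1

def p(y, h):
    """Maximum-likelihood estimate p(y | h) = count(h, y) / count(h)."""
    return bigrams[(h, y)] / history_counts[h] if history_counts[h] else 0.0

print(p("dog", "a"))  # 0.5: seen in training
print(p("cat", "a"))  # 0.0: "adopt a cat" gets zero probability (data sparsity)
```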
Neural n-gram Models¶
To solve these problems, modern approaches use distributed word representations, or embeddings, where each word is represented by a vector $e(y)$. These models, first successfully applied by Bengio et al., use a neural network as a context-encoding function, $\text{enc}$.
This encoder processes the embeddings of the $n-1$ context words to produce a single vector. The probability of the next word is then calculated with a softmax layer. The model's key advantages are:
- Generalization: It can handle unseen n-grams by leveraging the similarity between word embeddings.
- Efficiency: It shares parameters across different contexts, requiring far fewer parameters than a table-based model where the size grows exponentially with $n$.
Despite using a neural network, this is still fundamentally an n-gram model because it is limited to a finite context of $n-1$ words and cannot model longer dependencies.
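A minimal PyTorch sketch of a Bengio-style neural n-gram model (assuming `torch` is available; the architecture sizes and names are illustrative): the embeddings of the $n-1$ context words are concatenated, passed through a hidden layer, and a softmax over the vocabulary gives the next-word distribution.

```python
import torch
import torch.nn as nn

class NeuralNGram(nn.Module):
    """Bengio-style n-gram model: a finite context of n-1 words, but shared
    embeddings allow generalization to unseen n-grams."""
    def __init__(self, vocab_size, n=3, d_emb=32, d_hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.enc = nn.Linear((n - 1) * d_emb, d_hidden)  # context encoder
        self.out = nn.Linear(d_hidden, vocab_size)       # softmax layer

    def forward(self, context):
        # context: (batch, n-1) indices of the previous n-1 words
        e = self.emb(context).flatten(start_dim=1)       # concatenate embeddings
        h = torch.tanh(self.enc(e))
        return torch.log_softmax(self.out(h), dim=-1)    # log p(y_t | context)

model = NeuralNGram(vocab_size=100)
context = torch.tensor([[4, 7]])      # one trigram context (two previous words)
log_probs = model(context)            # shape: (1, 100)
print(log_probs.exp().sum())          # ~1.0: a valid next-word distribution
```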