263-5354-00: Large Language Model
Section 4
Neural Network Language Models
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/16/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content in the summary may be incorrect. You should use this file properly and legally. We are not responsible for any consequences of using this file.
This personal note is adapted from Professor Ryan Cotterell's course. Please contact us to delete this file if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recurrent neural language models¶
This section introduces a class of models that can, in theory, handle the complexities of human language that simpler models like n-grams and context-free grammars cannot. Pay close attention to why we need RNNs, the formal definition of an RNN, and key variations like the LSTM and GRU. The limitations of RNNs are also a critical topic.
Why We Need More Than Context-Free Grammars¶
Core Idea: While context-free grammars (CFGs) can handle some recursive structures (like nested clauses), they are not powerful enough to model all phenomena in human language.
Example: Cross-serial dependencies in Swiss German.
- In the phrase "...mer d'chind em Hans s'huus lönd hälfe aastriiche" (...we the children Hans the house let help paint), the dependencies between nouns and verbs cross over each other:
  - d'chind (the children) links to lönd (let)
  - em Hans (Hans) links to hälfe (help)
Abstract Representation: This type of dependency can be abstractly represented as a language of the form
$$x\,A^{m}\,B^{n}\,y\,C^{m}\,D^{n}\,z,$$
where
- $x$ = "...mer"
- $A$ = "d'chind"
- $B$ = "em Hans"
- $y$ = "s'huus"
- $C$ = "lönd"
- $D$ = "hälfe"
- $z$ = "aastriiche"
Conclusion: A language with this structure cannot be described by a context-free grammar. This proves that to accurately model human language, we need more expressive formalisms. This is the primary motivation for turning to Recurrent Neural Networks (RNNs).
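To make the counting requirement concrete, here is a small Python sketch (the `is_cross_serial` helper and the fixed marker symbols are illustrative, not part of the course material). Membership requires checking that the number of $A$'s equals the number of $C$'s *and* the number of $B$'s equals the number of $D$'s — two count constraints whose dependencies cross, which is exactly what a context-free grammar cannot enforce.

```python
import re

def is_cross_serial(s: str) -> bool:
    """Check membership in the abstract language x A^m B^n y C^m D^n z.

    Illustrative sketch: x, y, z are fixed marker symbols here ('x', 'y', 'z').
    A CFG can enforce the overall shape, but not both count constraints at once
    when the dependencies cross.
    """
    match = re.fullmatch(r"x(A*)(B*)y(C*)(D*)z", s)
    if match is None:
        return False
    a, b, c, d = (len(g) for g in match.groups())
    return a == c and b == d  # the crossing count constraints

print(is_cross_serial("xAABBByCCDDDz"))   # True:  m = 2, n = 3
print(is_cross_serial("xAABBByCCCDDDz"))  # False: #A != #C
```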
Recurrent Neural Networks (RNNs)¶
An Informal Introduction¶
Core Idea: RNNs process sequences (like text) one element at a time, maintaining a hidden state ($h$) that acts as a summary or "memory" of the sequence seen so far.
Comparison to Finite-State Automata (FSA):
Similarities: Both process input sequentially and transition between states.
CRITICAL DIFFERENCE: An FSA has a finite number of states. An RNN's hidden state is a vector of real numbers (e.g., in a space like $\mathbb{R}^{D}$), giving it a theoretically infinite state space. This allows it to store much more information and model long-range dependencies that FSAs cannot.
Why "hidden"?
The state vector $h$ is an internal representation used to make predictions. The actual "visible" output is the probability distribution over the next word. The state itself is not the final output, hence it's "hidden."
A Formal Definition of RNNs¶
Definition: A deterministic real-valued Recurrent Neural Network (RNN) is a 4-tuple
$$(\Sigma, D, f, h_0),$$
where:
- $\Sigma$: The alphabet of input symbols (our vocabulary).
- $D$: The dimension of the hidden state space.
- $f: \mathbb{R}^{D} \times \Sigma \rightarrow \mathbb{R}^{D}$: The dynamics map or transition function. This is the core of the RNN; it defines how to update the hidden state.
- $h_0 \in \mathbb{R}^{D}$: The initial state vector.
Real vs. Rational RNNs:
- Real-valued RNNs (state space $\mathbb{R}^{D}$) are a mathematical abstraction useful for analysis with calculus (e.g., for gradient-based learning).
- Rational-valued RNNs (state space $\mathbb{Q}^{D}$) are what can actually be implemented on a computer.
RNNs for Language Modeling¶
- Recurrent Neural Encoding Function ($\text{enc}_{\mathcal{R}}$): This function uses the RNN to turn a sequence of any length into a fixed-size vector representation (the hidden state).
- Hidden State ($h_t$): The state of the RNN after reading the $t$-th symbol. It is defined recursively: $h_t = \text{enc}_{\mathcal{R}}(y_{<t+1}) = f(h_{t-1}, y_t)$, with initial state $h_0 = \text{enc}_{\mathcal{R}}(y_{<1})$.
- Recurrent Neural Sequence Model: To get probabilities, we use the hidden state. The probability of the next symbol $y_t$ given the context $y_{<t}$ (represented by $h_{t-1}$) is: $$p_{SM}(y_t | y_{<t}) = \text{softmax}(E \cdot h_{t-1})_{y_t}$$ Here, $E$ is a symbol representation matrix (an embedding matrix) that helps convert the hidden state into scores (logits) for each word in the vocabulary. The softmax function then turns these scores into a probability distribution.
Additional Definitions¶
- One-hot encoding: A way to represent a symbol as a binary vector with a single '1' at the index corresponding to that symbol and '0's everywhere else.
- Activation function ($\sigma$): A non-linear function applied element-wise during the state update. If the dynamics map is $h_t = \sigma(g(h_{t-1}, y))$, then $\sigma$ is the activation function. Examples include sigmoid, tanh, and ReLU.
Tightness of RNNs¶
"Tightness" is a formal property that asks: does our probability model assign a total probability of 1 to all possible finite strings? If not, it's "leaky," meaning some probability mass is assigned to infinite sequences, which is usually undesirable.
Key Result (Theorem): A softmax RNN is tight if the norm of its hidden states does not grow too fast. Specifically, it is tight if
$$s\,\|h_t\|_2 \le \log t \quad \text{for all sufficiently large } t,$$
where $s$ is a constant that depends on the output symbol embeddings.
Practical Takeaway (Corollary): An RNN with a bounded activation function (like $\text{sigmoid}$ or $\text{tanh}$, which always output values in a fixed range) is guaranteed to be tight.
Important Caveat: RNNs with unbounded activation functions (like $\text{ReLU}$, where $\text{ReLU}(x) = \max(0, x)$) are not guaranteed to be tight. They can be tight, but they can also be leaky depending on the other parameters. This is a subtle but important point.
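A sketch of why the corollary holds, assuming the activation is bounded so that every coordinate of $h_t$ lies in $[-c, c]$ (here $B$ is the resulting bound on the logits and $\overline{\Sigma}$ denotes the vocabulary including $\text{EOS}$):
$$\|h_t\|_2 \le c\sqrt{D}, \qquad \bigl|(E h_t)_y\bigr| \le \|e_y\|_2\,\|h_t\|_2 \le B,$$
so every logit lies in $[-B, B]$, and therefore
$$p_{SM}(\text{EOS} \mid y_{<t}) = \text{softmax}(E\,h_{t-1})_{\text{EOS}} \ge \frac{e^{-B}}{|\overline{\Sigma}|\,e^{B}} =: \varepsilon > 0.$$
Because the $\text{EOS}$ probability is bounded below by $\varepsilon$ at every step, the probability that a string never terminates is at most $\lim_{T \to \infty}(1-\varepsilon)^T = 0$, so no probability mass leaks to infinite sequences.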
Elman and Jordan Networks (Simple RNNs)¶
These are two of the earliest and simplest ways to define the dynamics map $f$.
- Elman Network (aka "Vanilla RNN"):
- Update Equation: $$h_t = \sigma(U h_{t-1} + V e'(y_t) + b_h)$$
- Components:
- $h_{t-1}$: Previous hidden state.
- $e'(y_t)$: Embedding (vector representation) of the current input symbol.
- $U$: The recurrence matrix (weights applied to the previous state).
- $V$: The input matrix (weights applied to the current input).
- $b_h$: A bias vector.
- Jordan Network: Very similar to the Elman network, but the update depends on a transformation of the previous output rather than on the previous hidden state. This is a less common architecture.
- Components:
- $h_t = \sigma(U r_{t-1} + V e'(y_t) + b_h)$
- $r_t = \sigma_{o}(E h_t)$
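To make the Elman update concrete, here is a minimal NumPy sketch (toy dimensions and random weights, purely for illustration) of one step of an Elman language model: update the hidden state, then produce the next-symbol distribution $p_{SM}(y_t \mid y_{<t}) = \text{softmax}(E \cdot h_{t-1})_{y_t}$.

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, D = 5, 8                   # toy vocabulary size and hidden dimension

# Parameters (randomly initialized for the sketch).
U = rng.normal(size=(D, D))        # recurrence matrix
V = rng.normal(size=(D, V_size))   # input matrix (acts on one-hot embeddings)
b_h = np.zeros(D)                  # bias vector
E = rng.normal(size=(V_size, D))   # output (symbol representation) matrix

def one_hot(y: int) -> np.ndarray:
    e = np.zeros(V_size)
    e[y] = 1.0
    return e

def elman_step(h_prev: np.ndarray, y: int) -> np.ndarray:
    """Elman update: h_t = tanh(U h_{t-1} + V e'(y_t) + b_h)."""
    return np.tanh(U @ h_prev + V @ one_hot(y) + b_h)

def next_symbol_distribution(h: np.ndarray) -> np.ndarray:
    """softmax(E h): probability of each symbol in the vocabulary."""
    logits = E @ h
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Encode the prefix y = (2, 0, 3) and predict the next symbol.
h = np.zeros(D)                       # h_0
for y in (2, 0, 3):
    h = elman_step(h, y)
print(next_symbol_distribution(h))    # sums to 1
```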
Variations on Recurrent Networks (Gated RNNs)¶
Motivation (Extremely Important): Simple RNNs like the Elman network struggle to learn long-term dependencies. This is due to the vanishing and exploding gradient problem, where during training, the gradients can become exponentially small (vanish) or large (explode) as they are propagated back through the sequence, making it impossible to learn connections between distant words.
Solution: Gating Mechanisms
- Idea: Introduce "gates" - vectors with values between 0 and 1 - that learn to control the flow of information. A gate can decide what information to forget from the old state and what new information to write to the new state.
Long Short-Term Memory (LSTM)¶
- The most famous and successful gated RNN.
- Key Components:
- Hidden State ($h_t$): The output state, just like in a simple RNN.
- Memory Cell ($c_t$): An additional state vector that acts as the "long-term memory." It's easier for information to flow unchanged through the cell.
- The Gates:
- Forget Gate ($f_t$): Decides what to throw away from the old memory cell $c_{t-1}$.
- Input Gate ($i_t$): Decides which new information to store in the memory cell.
- Output Gate ($o_t$): Decides what part of the memory cell to read from to produce the new hidden state $h_t$.
- LSTM Update Equations:
- $f_t = \sigma(U^f h_{t-1} + V^f e'(y_t) + b^f)$: Compute the forget gate.
- $i_t = \sigma(U^i h_{t-1} + V^i e'(y_t) + b^i)$: Compute the input gate.
- $g_t = \tanh(U^g h_{t-1} + V^g e'(y_t) + b^g)$: Compute a candidate new memory vector.
- $c_t = f_t \odot c_{t-1} + i_t \odot g_t$: Update the cell state: forget old information, add new information. (The $\odot$ symbol denotes element-wise multiplication.)
- $o_t = \sigma(U^o h_{t-1} + V^o e'(y_t) + b^o)$: Compute the output gate.
- $h_t = o_t \odot \tanh(c_t)$: Update the hidden state.
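The update equations transcribe almost directly into code. Below is a minimal NumPy sketch of a single LSTM step (toy dimensions, random weights, one-hot input embeddings — all illustrative choices, not the course's reference implementation).

```python
import numpy as np

rng = np.random.default_rng(1)
V_size, D = 5, 8                     # toy vocabulary and hidden sizes
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def one_hot(y):
    e = np.zeros(V_size); e[y] = 1.0
    return e

# One weight triple (U, V, b) per gate / candidate, randomly initialized.
params = {name: (rng.normal(size=(D, D)), rng.normal(size=(D, V_size)), np.zeros(D))
          for name in ("f", "i", "g", "o")}

def lstm_step(h_prev, c_prev, y):
    """One LSTM step, following the update equations above."""
    pre = {n: U @ h_prev + V @ one_hot(y) + b for n, (U, V, b) in params.items()}
    f = sigmoid(pre["f"])            # forget gate
    i = sigmoid(pre["i"])            # input gate
    g = np.tanh(pre["g"])            # candidate memory vector
    o = sigmoid(pre["o"])            # output gate
    c = f * c_prev + i * g           # new cell state (element-wise products)
    h = o * np.tanh(c)               # new hidden state
    return h, c

h, c = np.zeros(D), np.zeros(D)
for y in (2, 0, 3):
    h, c = lstm_step(h, c, y)
print(h.shape, c.shape)
```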
Gated Recurrent Unit (GRU)¶
- A popular, slightly simpler alternative to the LSTM.
- Reset Gate: $r_t = \sigma(U^r h_{t-1} + V^r e'(y_t) + b^r)$
- Update Gate: $z_t = \sigma(U^z h_{t-1} + V^z e'(y_t) + b^z)$
- Candidate Vector: $g_t = \tanh(U^g(r_t \odot h_{t-1}) + V^g e'(y_t) + b^g)$
- New State: $h_t = (1 - z_t) \odot g_t + z_t \odot h_{t-1}$
- Key Simplifications:
- Combines the hidden state and memory cell into a single state vector, $h$.
- Combines the input and forget gates into a single update gate ($z_t$).
- Introduces a reset gate ($r_t$).
- Intuition: The update gate ($z$) controls how much of the previous state to keep versus how much of the new candidate state to use. The reset gate ($r$) controls how much the previous state influences the calculation of the candidate state.
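For comparison with the LSTM sketch, here is the corresponding one-step GRU update in NumPy (again toy sizes, random weights, one-hot embeddings — illustrative only).

```python
import numpy as np

rng = np.random.default_rng(2)
V_size, D = 5, 8
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def one_hot(y):
    e = np.zeros(V_size); e[y] = 1.0
    return e

Ur, Vr, br = rng.normal(size=(D, D)), rng.normal(size=(D, V_size)), np.zeros(D)
Uz, Vz, bz = rng.normal(size=(D, D)), rng.normal(size=(D, V_size)), np.zeros(D)
Ug, Vg, bg = rng.normal(size=(D, D)), rng.normal(size=(D, V_size)), np.zeros(D)

def gru_step(h_prev, y):
    """One GRU step, following the update equations above."""
    e = one_hot(y)
    r = sigmoid(Ur @ h_prev + Vr @ e + br)          # reset gate
    z = sigmoid(Uz @ h_prev + Vz @ e + bz)          # update gate
    g = np.tanh(Ug @ (r * h_prev) + Vg @ e + bg)    # candidate state
    return (1 - z) * g + z * h_prev                 # interpolate old and new

h = np.zeros(D)
for y in (2, 0, 3):
    h = gru_step(h, y)
print(h.round(3))
```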
Parallelizability: The Achilles' Heel of RNNs¶
- The Problem: The calculation of the hidden state is inherently sequential. To compute $h_t$, you must have already computed $h_{t-1}$.
- The Consequence: You cannot parallelize the processing of a single long sequence across multiple processors during training. This makes training RNNs on very large datasets and long documents extremely slow compared to other architectures.
- Teacher's Note: This is the single biggest drawback of RNNs and the primary motivation for the development of the Transformer architecture, which we will see later. While generation is always sequential for these models, the inability to parallelize during training is a major bottleneck.
Representational capacity of recurrent neural networks¶
This section is highly theoretical but crucial for understanding what RNNs can and cannot do. We will explore two opposing views: one where RNNs are quite limited (equivalent to finite automata) and another where they are all-powerful (Turing complete). The difference lies entirely in the assumptions we make. Pay close attention to the key theorems and the reasons for the dramatic shift in computational power.
RNNs are Equivalent to Finite-State Automata (Under Strict Assumptions)¶
The main idea here is that if we place realistic constraints on an RNN, similar to how it would operate on a real computer, its power is dramatically limited.
- Key Assumption: We use a simplified, non-differentiable activation function called the Heaviside function.
- Definition: The Heaviside function, $H(x)$, outputs 1 if $x > 0$ and 0 otherwise. This forces the hidden state of the RNN to be a vector of only 0s and 1s. $$H(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
- Central Theorem: An Elman RNN that uses the Heaviside activation function (an "HRNN") is computationally equivalent to a deterministic probabilistic finite-state automaton (dPFSA).
Proof Idea:
An HRNN is no more powerful than a dPFSA (Lemma):
- Since the hidden state $h_t$ of an HRNN is a D-dimensional vector of 0s and 1s, there are only $2^D$ possible hidden states.
- A model with a finite number of states is, by definition, a finite-state automaton. We can construct a dPFSA where each of its states corresponds to one of the $2^D$ possible hidden states of the HRNN.
An HRNN can simulate any dPFSA (Lemma - Minsky's Construction):
- This is a proof by construction. Given a dPFSA, we can build an HRNN that recognizes the same weighted language.
- Hidden State: The HRNN's hidden state $h_t$ is a one-hot vector representing the pair (current dPFSA state, symbol that led to it). Its dimension is $D = |Q| \times |\Sigma|$, where $|Q|$ is the number of states and $|\Sigma|$ is the alphabet size of the dPFSA.
- Parameters: The RNN's matrices ($U$, $V$) and bias ($b_h$) are carefully set up to mimic the dPFSA's transition function using logical AND operations. The output matrix $E$ is set to store the log-probabilities of the dPFSA's transitions.
Implications (Very Important):
Under these practical assumptions (finite precision, leading to a finite number of states), RNNs can only recognize regular languages.
This means they are strictly less expressive than non-deterministic PFSAs.
RNNs can be seen as a very compact way to represent potentially huge dPFSAs.
Space Complexity and Lower Bounds¶
This section explores how efficiently (in terms of hidden state size) an HRNN can simulate a dPFSA.
- Minsky's Construction Size: Linear in both the number of states and alphabet size: $O(|Q||\Sigma|)$.
- More Efficient Constructions:
- It's possible to do better by using more clever state representations than one-hot vectors.
- Indyk's Construction is the most efficient, achieving a size of $O(|\Sigma|\sqrt{|Q|})$. This is asymptotically optimal for the unweighted case.
- Lower Bound for the Probabilistic Case:
- Key Question: Can we use Indyk's efficient $O(\sqrt{|Q|})$ construction for weighted automata (dPFSAs)?
- Answer: NO. The size of an HRNN needed to simulate a dPFSA has a lower bound of $\Omega(|Q|)$.
- Reason: The bottleneck is the output matrix E. A dPFSA can have $|Q|$ completely different, linearly independent probability distributions for the transitions leaving each state. To model these, the matrix E must be large enough to represent them all, forcing the hidden state dimension to grow at least linearly with $|Q|$.
Turing Completeness of RNNs (Under Theoretical Assumptions)¶
Now we flip the script. By relaxing the practical constraints, RNNs become maximally powerful.
- Key Assumptions:
- Arbitrary Precision: The weights and hidden state values can be real numbers with infinite precision.
- Unbounded Computation Time: The RNN can perform multiple internal update steps between reading input symbols.
- Key Building Block: The saturated sigmoid activation function. It's linear between 0 and 1 and clips values outside this range. $$\sigma(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{if } 0 < x \le 1 \\ 1 & \text{if } x > 1 \end{cases}$$
- Central Theorem: An Elman RNN with the saturated sigmoid activation function is Turing complete.
Proof Idea (High Level):
- The proof shows that such an RNN can simulate a two-stack pushdown automaton, which is known to be equivalent in power to a Turing machine.
- Simulating a Stack: The core trick is to encode the entire contents of a stack (an arbitrarily long sequence of symbols) into a single, high-precision number stored in one dimension of the RNN's hidden state.
- Stack Operations:
  - PUSH: Implemented with division and addition.
  - POP: Implemented with multiplication and subtraction.
- The RNN uses other hidden dimensions as flags and buffers to control these arithmetic operations, simulating the behavior of a PDA. By simulating two such stacks, it achieves the power of a Turing machine.
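The following sketch (plain Python, illustrative only; the classical construction uses a base-4 fractional encoding with arbitrary precision, whereas here a base-10 encoding and floating-point numbers are used for readability) shows how an arbitrarily deep stack of digits can live inside a single scalar, with PUSH realized by division and addition and POP by multiplication and subtraction, exactly as described above.

```python
import math

def push(stack: float, symbol: int) -> float:
    """PUSH via division and addition: prepend `symbol` as the new top digit."""
    return stack / 10 + symbol / 10       # 0.d1d2...  ->  0.(symbol)d1d2...

def pop(stack: float) -> tuple:
    """POP via multiplication and subtraction: read and remove the top digit."""
    shifted = stack * 10
    top = math.floor(shifted)
    return top, shifted - top             # remaining stack 0.d2d3...

s = 0.0
for d in (3, 1, 4):                       # push 3, then 1, then 4
    s = push(s, d)
print(round(s, 6))                        # 0.413 (the top of the stack is the leading digit)

top, s = pop(s)
print(top, round(s, 6))                   # 4 0.13
```

In the theoretical construction these operations are exact because the hidden state has arbitrary precision; floating-point numbers only approximate this, which is precisely why the Turing-completeness result does not carry over to finite-precision RNNs.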
Computational Power of Gated Variants (LSTM vs. GRU)¶
- LSTM: The separate memory cell ($c_t$) in an LSTM can act as a counter. This allows LSTMs to learn counting languages (like $a^n b^n c^n$) that simple RNNs and GRUs struggle with. LSTMs are sometimes compared to counter machines and are considered more expressive.
- GRU: Lacks the explicit memory cell of an LSTM. Theoretically and empirically, it behaves more like a simple Elman RNN and struggles with precise counting tasks.
Consequences of Turing Completeness¶
Being Turing complete sounds great, but it comes with a huge catch. It means that RNNs inherit all the fundamental limitations of general-purpose computers, as described by computability theory.
- The Halting Problem: The most famous undecidable problem. There is no algorithm that can determine, for all possible inputs, whether an arbitrary program (or Turing machine) will finish running or loop forever.
- Undecidability in RNNs: Because RNNs are Turing complete, many fundamental questions about them are also undecidable. The proofs for these results work by showing that if you could solve the problem for an RNN, you could also solve the Halting Problem, which is a contradiction.
Key Undecidable Problems for RNNs:
- Tightness: You cannot create a general algorithm to determine if an arbitrary RNN language model is tight.
- Highest Weighted String: You cannot create a general algorithm to find the single most probable string an RNN can generate.
- Equivalence: You cannot create a general algorithm to determine if two RNNs define the exact same probability distribution over strings.
- Minimization: You cannot create a general algorithm to find the smallest possible RNN (in terms of hidden units) that is equivalent to a given RNN.
Transformer-based language models¶
This section introduces the Transformer, arguably the most important neural network architecture in modern NLP. We will focus on two key questions: Why was it created (i.e., what problems with RNNs does it solve?), and How does it work? Pay very close attention to the Attention Mechanism, as it is the heart of the Transformer.
Motivation: Why We Need Transformers¶
Recurrent Neural Networks (RNNs), including LSTMs and GRUs, have two fundamental problems that the Transformer architecture was designed to solve.
The Parallelization Problem (The Biggest Reason!):
- RNNs are inherently sequential. To compute the hidden state $h_t$, you must first compute $h_{t-1}$, which depends on $h_{t-2}$, and so on.
- This sequential dependency makes it impossible to effectively parallelize the training process over a single long sequence, creating a major bottleneck when training on the massive datasets used today.
The Information Bottleneck Problem:
- An RNN must compress the entire history of a sequence, no matter how long, into a single, fixed-size hidden state vector $h_t$.
- Intuitively, as the sequence gets longer, it becomes harder to retain all relevant information in this single vector, creating a "bottleneck".
The Transformer architecture tackles these issues by completely removing recurrence. Instead of sequential processing, it computes representations for all tokens in a sequence simultaneously, allowing each token to directly "look at" or attend to all other tokens in its context.
Formal Definition of a Transformer¶
At a high level, a Transformer fits into our existing language modeling framework. The magic is in how it computes the context encoding.
Transformer Sequence Model (Definition): A language model that uses a Transformer to compute the context encoding $h_{t-1}$. The probability of the next word is still calculated in the standard way:
$$p_{SM}(\overline{y}_{t} \mid y_{<t}) = \text{softmax}(E \cdot h_{t-1})_{\overline{y}_t} = \text{softmax}(E \cdot \text{enc}_{\mathcal{T}}(y_{<t}))_{\overline{y}_t}$$
The entire novelty of the Transformer lies in the encoding function, $\text{enc}_{\mathcal{T}}$, which is built around the Attention Mechanism.
The Attention Mechanism (The Heart of the Transformer)¶
This is the most important concept in this section. If you understand this, you understand the core of the Transformer.
Attention is a mechanism that allows a model to compute a representation for a token by taking a weighted average of the representations of all other tokens in its context. The weights determine which tokens are most "relevant."
Key Ingredients: Attention works with three types of vectors:
- Query ($q$): Represents the current token that is "asking" for information.
- Keys ($K$): A set of vectors, one for each token in the context. The keys are like "labels" for the information that the context tokens hold.
- Values ($V$): A set of vectors, one for each token in the context. The values contain the actual information or representation of those tokens.
The Process (Definition):
- Compute Scores: For the current query $q$, compute a similarity score against every key $k_i$ in the context using a scoring function $f$ (most commonly, scaled dot-product: $f(q,k) = \frac{\langle q, k \rangle}{\sqrt{D}}$).
- Compute Weights: Normalize these scores into weights that sum to 1 using a projection function, almost always softmax. This step creates the "attention weights". The result is a vector $s$.
- Compute Output: The final output, $a$, is a weighted sum of all the value vectors $v_i$, using the attention weights $s_i$ as the coefficients. $$a_t = \text{Att}(q, K_t, V_t) = \sum_{i=1}^{t} s_i v_i$$
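Here is a minimal NumPy sketch of the three-step process above for a single query: scaled dot-product scores, softmax weights, and a weighted sum of the values (toy dimensions and random vectors, purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
t, D = 4, 8                              # context length and vector dimension (toy)

q = rng.normal(size=D)                   # query for the current token
K = rng.normal(size=(t, D))              # one key per context token
V = rng.normal(size=(t, D))              # one value per context token

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

scores = K @ q / np.sqrt(D)              # step 1: scaled dot-product f(q, k_i)
s = softmax(scores)                      # step 2: attention weights, summing to 1
a = s @ V                                # step 3: weighted sum of the values

print(s.round(3), a.shape)               # weights and the attended output a_t
```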
Types of Attention:
- Soft Attention: The standard approach using the softmax function. It produces a "soft" weighted average over all value vectors.
- Hard Attention: A theoretical variant where only the value corresponding to the key with the highest score is selected (an argmax operation). This is useful for analysis but not common in practice.
Transformer Layers and the Full Architecture¶
A full Transformer model is built by stacking multiple Transformer Layers on top of each other.
- Transformer Layer (Definition): A single processing block with two main sub-layers:
- A self-attention mechanism.
- A position-wise feed-forward neural network.
- Self-Attention: This is the application of attention where the Queries, Keys, and Values are all derived from the same input sequence from the previous layer.
- Residual Connections: To help with training deep networks, the input to each sub-layer is added to its output. This is a critical component for model performance.
- Masking (Crucial for Language Modeling!):
- For an autoregressive language model, the representation for token $y_t$ should only depend on past tokens ($y_{<t}$).
- However, the self-attention mechanism, computed efficiently with matrix multiplication, would allow every token to see every other token, including future ones.
- To fix this, we apply a causal mask before the softmax step. This mask sets the attention scores for all future positions to $-\infty$, ensuring their probability becomes 0 after the softmax.
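A compact NumPy sketch of causal masking in the matrix form of self-attention: scores for future positions are set to $-\infty$ before the softmax, so each row of the attention matrix only places weight on positions up to and including the current one (toy sizes and random projection matrices, illustrative only).

```python
import numpy as np

rng = np.random.default_rng(4)
T, D = 5, 8                                   # sequence length and model dimension (toy)
X = rng.normal(size=(T, D))                   # token representations from the previous layer

Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values from the same sequence

scores = Q @ K.T / np.sqrt(D)                 # (T, T) pairwise attention scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf                        # causal mask: hide all future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.round(2))                       # the upper triangle is exactly 0
out = weights @ V                             # masked self-attention output
```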
Other Important Components ("Bits and Bobs")¶
- Positional Encodings:
- Problem: The self-attention mechanism is permutation-invariant; it has no built-in notion of word order.
- Solution: We must explicitly add positional encoding vectors to the input word embeddings. These vectors provide the model with information about the position of each token in the sequence.
- Multi-Head Attention:
- Instead of one attention calculation, multi-head attention runs several attention "heads" in parallel. Each head has its own set of projection matrices for $Q$, $K$, and $V$.
- This allows the model to jointly attend to different types of information from different representation subspaces.
- Layer Normalization:
- A technique applied after each sub-layer to stabilize training. It normalizes the vector of activations at each layer to have a consistent mean and variance.
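The notes leave the exact form of the positional encodings open; one common concrete choice is the sinusoidal scheme from the original Transformer paper, sketched below (NumPy, toy sizes) and simply added to the word embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(T: int, D: int) -> np.ndarray:
    """PE[t, 2i] = sin(t / 10000^(2i/D)), PE[t, 2i+1] = cos(t / 10000^(2i/D))."""
    positions = np.arange(T)[:, None]                       # (T, 1)
    dims = np.arange(0, D, 2)[None, :]                      # (1, D/2)
    angles = positions / np.power(10000.0, dims / D)        # (T, D/2)
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

T, D = 6, 8
word_embeddings = np.zeros((T, D))                          # placeholder embeddings
inputs = word_embeddings + sinusoidal_positional_encoding(T, D)
print(inputs.round(2))
```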
Tightness of Transformer Models¶
Just like with RNNs, we need to know if our model defines a proper probability distribution. For Transformers, the answer is a clean "yes."
Main Result (Theorem): A Transformer language model that uses soft attention (the standard softmax version) is guaranteed to be tight.
Proof Idea (High Level):
- The inputs (word + positional embeddings) come from a finite vocabulary and are therefore bounded, living in a compact set.
- All operations in a Transformer layer (attention, matrix multiplication, softmax, layer norm) are continuous functions.
- A continuous function always maps a compact set to another compact set.
- Therefore, no matter how long the input sequence is, the final hidden state vectors are always contained within a fixed, single compact set.
- Since the output vectors are bounded, the probability of generating the $\text{EOS}$ token is always bounded below by some small positive number $\varepsilon > 0$.
- If the $\text{EOS}$ probability is always greater than zero, the model is guaranteed to be tight.
Representational capacity of transformer language models¶
This section explores the computational power of Transformers. The results here can seem contradictory at first glance, so it's critical to pay attention to the specific assumptions being made in each case (e.g., type of attention, precision, definition of equivalence). This is a hot area of research, and understanding these nuances is key.
A Tale of Two Transformers: Contradictory Findings¶
When we look at the theoretical power of Transformers, we find a confusing picture:
- On one hand (The Weak Transformer): Some studies show that Transformers can't even recognize very simple regular languages like FIRST (checking whether the first symbol is '1') or PARITY (checking for an odd number of '1's). They also fail at simple context-free languages like DYCK (balanced parentheses).
- On the other hand (The Strong Transformer): Other research proves that Transformers are Turing complete, meaning they are maximally powerful and can, in theory, compute anything a normal computer can.
How can both be true?
The "weak" results often assume unique hard attention or show that with soft attention, the model's confidence degrades as strings get longer because attention gets diluted.
The "strong" results assume infinite precision and use a clever "trick" to handle the Transformer's lack of a sequential state.
A Crucial Trick: Homomorphic Equivalence¶
This is the most important concept for understanding how we prove Transformers are powerful. It's a different, "weaker" kind of equivalence than we used for RNNs.
The Problem: Classic computational models (like Turing machines) have an internal, sequential state. Transformers are designed to be parallel and stateless. How can we compare them?
The "Cheat": We make the Transformer simulate a state by writing the state into the output string itself.
- Instead of generating a symbol $y_t$, the Transformer will generate an augmented symbol $y_t q_t$, where $q_t$ is the state of the machine being simulated.
- Because the Transformer can attend to the entire history, it can "read" the state $q_{t-1}$ from the last augmented symbol and use that to compute the next state $q_t$ and symbol $y_t$.
This leads to a different notion of equivalence:
- Model Equivalence: The standard, strong notion. Two models are equivalent if they recognize the exact same language ($L(\mathcal{C}_1) = L(\mathcal{C}_2)$). We proved this for RNNs.
- Homomorphic Equivalence: A weaker notion. Two models are homomorphically equivalent if you can map the language of one to the language of the other using a simple transformation (a homomorphism). The proofs of Turing completeness for Transformers rely on this weaker equivalence.
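A tiny Python sketch of the "augmented symbol" trick: a toy automaton's run is written out as pairs $y_t q_t$, and a homomorphism that simply erases the state component maps the augmented string back to the original one (the automaton and alphabet here are illustrative, not from the course notes).

```python
# Toy DFA over {'a', 'b'} tracking the parity of a's (illustrative).
delta = {("even", "a"): "odd", ("odd", "a"): "even",
         ("even", "b"): "even", ("odd", "b"): "odd"}

def augmented_run(string: str, q0: str = "even") -> list:
    """Emit augmented symbols (y_t, q_t): the symbol plus the state reached after it."""
    q, out = q0, []
    for y in string:
        q = delta[(q, y)]
        out.append((y, q))
    return out

def homomorphism(augmented: list) -> str:
    """Erase the state component; maps the augmented language back to the original."""
    return "".join(y for y, _ in augmented)

aug = augmented_run("abba")
print(aug)                    # [('a', 'odd'), ('b', 'odd'), ('b', 'odd'), ('a', 'even')]
print(homomorphism(aug))      # 'abba'
```

A Transformer that generates such augmented symbols can read the previous state off the last symbol it emitted, which is exactly how the Turing-completeness constructions recover a sequential state without recurrence.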
Climbing the Hierarchy: What Transformers Can Do¶
By making the right assumptions (e.g., infinite precision, averaging hard attention), we can show that Transformers can simulate models from the bottom to the top of the Chomsky hierarchy.
Transformers Can Simulate n-gram Models¶
This is the first positive result, showing Transformers can handle at least this basic class of models.
Theorem: For any n-gram language model, there exists a Transformer that can simulate it. This means Transformers can recognize strictly local languages.
Proof by Construction:
- Architecture: We construct a multi-head attention Transformer with $n-1$ heads. Each head is responsible for identifying one of the $n-1$ symbols in the context window.
- The Goal of Head $h$: To find and return the identity of the symbol at position $t-h$.
- How it Works:
- Positional Encodings: Use specific positional encodings to give the model awareness of absolute positions. A simple encoding like $f_{pos}(t) = (t, 1)^T$ works.
- Crafting Q, K, V: The Query, Key, and Value matrices are carefully designed. For Head $h$:
- Query (from position t): The query vector is designed to represent the target position, $t-h$.
- Key (from position j): The key vector is designed to represent its own position, $j$.
- Scoring: A scoring function like $-| \langle q_t, k_j \rangle | = -|t-h-j|$ is used. This score is maximized (at 0) exactly when $j = t-h$. With hard attention, the head is forced to attend only to the desired position.
- Value: The value transformation is set up to simply pass through the one-hot encoding of the symbol at the attended position.
- Combining Heads: The $n-1$ one-hot vectors returned by the heads are concatenated. A final feed-forward network then acts like a big lookup table: it identifies the full n-gram and outputs the corresponding pre-defined probability distribution.
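The construction above can be mimicked numerically. The sketch below (NumPy, a hypothetical toy trigram model with hard attention implemented as an argmax) builds the head-$h$ query so that the score $-|\langle q_t, k_j\rangle| = -|j - (t-h)|$ peaks exactly at position $j = t-h$, gathers the two preceding symbols, and looks up a trigram distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
V_size, n = 3, 3                              # toy vocabulary {0,1,2}, trigram model (n = 3)
table = rng.dirichlet(np.ones(V_size), size=(V_size, V_size))  # hypothetical p(y | y_{t-2}, y_{t-1})

y = [2, 0, 1, 2, 1]                           # symbols y_1..y_5 (positions are 1-indexed below)
T = len(y)
pos = np.array([[j, 1.0] for j in range(1, T + 1)])  # positional encodings f_pos(j) = (j, 1)

def head(h: int, t: int) -> int:
    """Hard-attention head h at position t: returns the symbol at position t - h."""
    Wq = np.array([[0.0, 1.0], [-1.0, float(h)]])    # maps (t, 1) to q_t = (1, h - t)
    q = Wq @ pos[t - 1]
    keys = pos[:t]                                   # only positions j <= t are visible
    scores = -np.abs(keys @ q)                       # -|<q_t, k_j>| = -|j - (t - h)|
    j = int(np.argmax(scores)) + 1                   # hard attention: argmax position
    return y[j - 1]                                  # value: the symbol at that position

t = 5                                                # predict the symbol following position 5
context = [head(h, t) for h in (2, 1)]               # (y_{t-2}, y_{t-1}) via two heads
print(context, table[context[0], context[1]])        # trigram distribution for the next symbol
```

A feed-forward lookup over the concatenated one-hot outputs of the heads would produce the same distribution as the direct table indexing used here; edge cases for very short prefixes (where $t - h < 1$) are glossed over in this sketch.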
This construction proves that Transformers are at least as powerful as n-gram models, using direct model equivalence. The more powerful results for context-free and Turing-complete languages will rely on the homomorphic equivalence "trick."