263-5354-00: Large Language Model
Section 5
Training, Fine Tuning and Inference
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/18/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results arising from the use of this file.
This personal note is adapted from the course taught by Professor Mrinmaya Sachan. Please contact us to delete this file if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Transfer Learning and Popular Pretrained Language Models¶
Transfer learning is the technique of applying knowledge gained from one task to solve different, but related, tasks. Inspired by how humans learn, this approach allows machine learning models to become more efficient. For instance, knowing English can make it easier to learn Dutch. In neural networks, a model is first trained on a general source task (like language modeling) and then adapted for a new target task. This process is often more efficient than training a model from scratch.
Key terminology includes:
- Pretrained model: A model that has been trained on a source task.
- Pretraining: The initial process of training on the source task.
- Fine-tuning: The process of updating the weights of a pretrained model for a new target task.
Early Models: CoVE and ELMo¶
While CoVe (Context Vectors), which takes contextual representations from the encoder of a machine translation model, was an early step, ELMo (Embeddings from Language Models) was one of the first widely successful transfer learning models based on language modeling.
ELMo¶
ELMo's key innovation was generating context-dependent word representations. Unlike older models that had static vectors for each word, ELMo produces different embeddings for the same word depending on the sentence it's in.
- Architecture: ELMo trains a multi-layer bidirectional LSTM (Long Short-Term Memory) language model, consisting of two stacked multi-layer LSTMs:
- A forward language model that reads the text from left to right.
- A backward language model that reads from right to left.
- How it works: For a specific task, the hidden-state representations from all layers of both the forward and backward LMs are combined. A weighted sum of these layers is learned to create a final, task-specific representation (a short sketch follows after this list): $$ELMo_{k}^{task} = \gamma^{task} \sum_{l=0}^{L} s_{l}^{task} h_{kl}^{LM}$$ where $\gamma^{task}$ is a learned scaling factor, $s_{l}^{task}$ are softmax-normalized learned weights for each layer, and $h_{kl}^{LM}$ is the biLM's hidden state for token $k$ at layer $l$.
- Impact: ELMo significantly improved the state-of-the-art on numerous NLP benchmarks, including question answering, sentiment analysis, and named entity recognition.
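As a concrete illustration of the weighted-sum combination above, here is a minimal PyTorch-style sketch; the layer count, dimensions, and class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ElmoScalarMix(nn.Module):
    """Task-specific weighted sum of biLM layer representations (a sketch)."""
    def __init__(self, num_layers: int):
        super().__init__()
        # s^task: one learnable scalar per layer, softmax-normalized at use time
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        # gamma^task: learnable global scaling factor
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, seq_len, dim) hidden states h_{k,l}^{LM}
        s = torch.softmax(self.scalar_weights, dim=0)          # s_l^task
        mixed = (s[:, None, None] * layer_states).sum(dim=0)   # sum_l s_l * h_l
        return self.gamma * mixed                              # gamma^task * (...)

# Toy usage: 3 layers (embedding + 2 biLSTM layers), 5 tokens, 1024-dim states.
states = torch.randn(3, 5, 1024)
elmo_mix = ElmoScalarMix(num_layers=3)
print(elmo_mix(states).shape)  # torch.Size([5, 1024])
```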
BERT: Bidirectional Encoder Representations from Transformers¶
BERT marked a major advancement by using the Transformer encoder architecture, which allows it to process the entire input sequence at once, making it truly bidirectional.
Architecture: BERT is a multi-layer bidirectional Transformer encoder. Two main versions were released:
- BERT_BASE: 12 layers, 768 hidden units, 12 attention heads (110M parameters).
- BERT_LARGE: 24 layers, 1024 hidden units, 16 attention heads (340M parameters).
Pre-training Objectives: BERT is pre-trained on two simultaneous tasks:
- Masked Language Modeling (MLM): About 15% of the tokens in the input text are replaced with a `[MASK]` token. The model's goal is to predict the original identity of these masked tokens based on the surrounding unmasked tokens. This allows the model to learn deep bidirectional context (see the sketch after this list).
- Next Sentence Prediction (NSP): The model receives two sentences, A and B, and must predict whether sentence B is the actual sentence that follows A in the original text. This helps BERT understand sentence relationships, which is crucial for tasks like question answering.
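To make the MLM corruption concrete, here is a small sketch of BERT-style masking; the 80%/10%/10% split among selected tokens follows the original BERT recipe, and the token ids are toy values:

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """BERT-style MLM corruption: select ~15% of tokens; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob   # positions to predict
    labels[~selected] = -100                             # ignore the rest in the loss

    corrupted = input_ids.clone()
    # 80% of selected positions become [MASK]
    replace_mask = (torch.rand(input_ids.shape) < 0.8) & selected
    corrupted[replace_mask] = mask_token_id
    # half of the remainder (10% overall) become a random token
    random_mask = (torch.rand(input_ids.shape) < 0.5) & selected & ~replace_mask
    corrupted[random_mask] = torch.randint(vocab_size, input_ids.shape)[random_mask]
    # the last 10% stay unchanged
    return corrupted, labels

ids = torch.randint(5, 1000, (1, 12))  # toy token ids
print(mask_tokens(ids, mask_token_id=103, vocab_size=1000))
```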
Fine-tuning: To adapt BERT to a specific task, a small classification layer is added on top of the core model, and all the parameters are fine-tuned end-to-end on task-specific labeled data.
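A minimal sketch of this setup with the Hugging Face transformers library; the checkpoint name, label count, and learning rate are illustrative choices, not prescribed by the course:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained BERT body + a randomly initialized classification head on top.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # all parameters are updated
outputs = model(**batch, labels=labels)  # cross-entropy loss on the pooled [CLS] state
outputs.loss.backward()
optimizer.step()
```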
BERT Variants¶
The success of BERT inspired numerous variants designed to improve its performance, efficiency, or training methodology.
RoBERTa (Robustly Optimized BERT Approach)¶
RoBERTa doesn't change BERT's architecture but optimizes the pre-training process.
- Key Changes:
- Trained on a much larger dataset (160GB of text).
- Trained with a larger batch size for a longer time.
- Used dynamic masking, where the masking pattern applied to the training data was changed during each epoch, unlike BERT's static mask.
ALBERT (A Lite BERT)¶
ALBERT focuses on parameter efficiency to create smaller, faster models.
- Key Changes:
- Factorized Embedding Parameterization: Decomposes the large vocabulary embedding matrix into two smaller matrices, reducing parameters.
- Cross-layer Parameter Sharing: All Transformer layers share the same parameters, drastically reducing the total parameter count.
- Sentence Order Prediction (SOP): Replaces NSP with a task to predict if two consecutive sentences are in their original order or swapped. This focuses on learning discourse coherence.
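To see why the factorized embedding parameterization saves parameters, here is a small back-of-the-envelope calculation; the vocabulary and dimension sizes are typical BERT/ALBERT-style values chosen only for illustration:

```python
# Embedding parameter counts with and without ALBERT's factorization.
# V = 30k vocabulary, H = 768 hidden size, E = 128 embedding size (illustrative).
V, H, E = 30_000, 768, 128

bert_style = V * H            # one V x H embedding matrix
albert_style = V * E + E * H  # V x E embedding matrix + E x H projection

print(f"{bert_style:,} vs {albert_style:,} parameters")  # 23,040,000 vs 3,938,304
```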
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)¶
ELECTRA introduces a more sample-efficient pre-training task called Replaced Token Detection (RTD).
- Key Changes:
- It uses two models: a small Generator (like a small BERT) and a larger Discriminator (ELECTRA itself).
- The Generator replaces some tokens in the input with plausible alternatives.
- The Discriminator's job is not to predict the original token, but to predict whether each token in the corrupted text was part of the original input or replaced by the Generator.
- This approach is more efficient because the model learns from every token in the input, not just the 15% that are masked in BERT.
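A small sketch of how the replaced-token-detection labels are derived once the generator has produced a corrupted sequence; the token ids are toy values and the generator itself is stubbed out:

```python
import torch

# Original input and a generator-corrupted copy (in practice the generator is a
# small masked LM that samples plausible replacements at the masked positions).
original  = torch.tensor([101, 2023, 3185, 2001, 2307, 102])
corrupted = torch.tensor([101, 2023, 3185, 2001, 4248, 102])  # one token replaced

# Discriminator target: 1 where the token was replaced, 0 where it is original.
rtd_labels = (corrupted != original).long()
print(rtd_labels)  # tensor([0, 0, 0, 0, 1, 0])

# The discriminator is trained with a per-token binary classification loss,
# so every position contributes a learning signal.
```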
GPT (Generative Pre-trained Transformer)¶
The GPT family consists of decoder-only Transformer models, which makes them true language models capable of generation. They are pre-trained with the standard language modeling objective of predicting the next token (a minimal sketch of this objective follows the list below).
- GPT: The first model in the family could be fine-tuned on various natural language understanding tasks by converting inputs into a single text sequence.
- GPT-2: This version scaled up the model size and training data significantly. A key finding was that large-scale generative pre-training enabled the model to perform many NLP tasks in a zero-shot setting, demonstrating abilities as an unsupervised multi-task learner.
- GPT-3: Further scaling the model to 175 billion parameters revealed a strong few-shot "in-context learning" ability. This means the model can perform a new task by simply seeing a few examples in the prompt, without any parameter updates.
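A minimal sketch of the shared next-token prediction objective mentioned above; the random logits and tiny vocabulary stand in for a real decoder-only model:

```python
import torch
import torch.nn.functional as F

# Toy "logits" from a decoder-only LM: (batch, seq_len, vocab_size).
logits = torch.randn(1, 6, 100)
tokens = torch.randint(0, 100, (1, 6))

# Next-token prediction: position t is trained to predict token t+1.
shift_logits = logits[:, :-1, :]
shift_labels = tokens[:, 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, 100), shift_labels.reshape(-1))
print(loss)
```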
Seq2seq TLMs (Encoder-Decoder Models)¶
These models utilize a full Transformer encoder-decoder architecture, which makes them particularly well-suited for sequence-to-sequence tasks like machine translation and text summarization. The encoder uses bidirectional attention to understand the input, while the decoder generates a fluent output.
T5 (Text-to-Text Transfer Transformer): T5 reframes every NLP task as a "text-to-text" problem.
- Architecture: T5 uses a standard Transformer encoder-decoder architecture.
- Key Idea: Instead of having different outputs for different tasks (e.g., a class label for classification, a text span for QA), T5 is trained to output a text string for every task. A prefix is added to the input to tell the model which task to perform (e.g., "summarize: ..." or "translate English to German: ...").
- Pre-training Objective: The primary objective is "text infilling" (span corruption): contiguous spans of text are dropped out, each replaced with a single sentinel (mask) token, and the model is trained to generate the missing spans.
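To illustrate the text-to-text framing, here are input/target pairs in the style used by T5; the task prefixes follow the paper's convention, while the sentences themselves are made up:

```python
# Every task becomes "text in, text out"; the prefix tells the model what to do.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("summarize: <long article text ...>", "<short summary ...>"),
    ("cola sentence: The course is jumping well.", "unacceptable"),  # classification as text
]
for source, target in examples:
    print(f"input : {source}\ntarget: {target}\n")
```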
BART: Another powerful encoder-decoder model, BART is pre-trained by corrupting text with various noise functions and then training the model to reconstruct the original text. The most effective combination of noising strategies was found to be text infilling and sentence permutation.
Parameter Efficient Finetuning of LMs (PEFT)¶
Motivation behind PEFT¶
As language models grow to billions of parameters, fine-tuning the entire model for every new task becomes computationally expensive and requires large amounts of storage. Parameter-Efficient Fine-Tuning (PEFT) methods were developed to address these challenges. They aim to adapt large pre-trained models to new tasks by updating only a small subset of the model's parameters, keeping the majority of the original model frozen. This significantly reduces computational costs, memory requirements, and the risk of overfitting on smaller datasets.
PEFT Methods¶
BitFit¶
BitFit is a simple yet surprisingly effective PEFT method.
- Core Idea: Freeze almost all of the model's weights and only fine-tune the bias terms. Bias terms exist in the attention mechanism and the feed-forward layers and constitute a tiny fraction (e.g., ~0.04%) of the total parameters.
- Benefit: Achieves strong performance on several benchmarks while being extremely parameter-efficient.
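A minimal sketch of BitFit-style freezing in PyTorch; the encoder below is just a stand-in for a pretrained LM, and the selection rule simply matches parameter names containing "bias":

```python
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4), num_layers=2
)  # stand-in for a pretrained LM

# Freeze everything except bias terms.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")
```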
Adapters¶
Adapters are small, trainable neural network modules inserted into a pre-trained model.
- Architecture: A typical adapter consists of two feed-forward layers with a bottleneck structure (a down-projection to a smaller dimension, a non-linearity, and an up-projection back to the original dimension).
- How it works: These adapter modules are inserted inside each Transformer layer, typically after the multi-head attention and feed-forward sublayers. During fine-tuning, the original pre-trained model weights are frozen, and only the newly added adapter layers are trained.
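A sketch of a single bottleneck adapter module; the bottleneck size and activation are common but not mandated choices, and the insertion points inside the Transformer layer are left out:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus residual."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # residual keeps the layer near-identity

# Toy usage on a (batch, seq_len, hidden) tensor.
x = torch.randn(2, 10, 768)
print(Adapter(768)(x).shape)  # torch.Size([2, 10, 768])
```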
Prefix Tuning¶
Prefix tuning is inspired by prompting but operates in the continuous embedding space.
- Core Idea: Instead of adding tunable parameters inside the model like adapters, prefix tuning prepends a sequence of continuous, task-specific vectors (the "prefix") to the input sequence.
- How it works: The parameters of the large language model are kept frozen. Only the vectors of the prefix are trained. These learned vectors steer the model's behavior for the specific downstream task.
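A simplified sketch of the idea at the embedding level: learnable prefix vectors are concatenated in front of frozen token embeddings. (Real prefix tuning learns prefix key/value vectors for every attention layer; this toy version only illustrates the prepending.)

```python
import torch
import torch.nn as nn

hidden_dim, prefix_len = 768, 10
# The prefix is the only trainable part; the LM itself stays frozen.
prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

token_embeddings = torch.randn(2, 16, hidden_dim)  # frozen model's input embeddings (toy)
prefixed = torch.cat([prefix.unsqueeze(0).expand(2, -1, -1), token_embeddings], dim=1)
print(prefixed.shape)  # torch.Size([2, 26, 768])
```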
LoRA (Low-Rank Adaptation)¶
LoRA is a highly effective and popular PEFT method that avoids the inference latency that can be introduced by adapters.
- Core Idea: LoRA hypothesizes that the change in weights during model adaptation has a "low intrinsic rank". It approximates the weight update matrix ($\Delta W$) with a low-rank decomposition.
- How it works: For a pre-trained weight matrix $W$, the update is represented by two smaller, low-rank matrices, $A$ and $B$, such that $\Delta W = BA$. During training, $W$ is frozen, and only $A$ and $B$ are updated. The modified forward pass becomes: $$h = Wx + \Delta Wx = Wx + \frac{\alpha}{r} \cdot BAx$$ During inference, the update can be merged with the original weights ($W' = W + \frac{\alpha}{r}BA$), meaning there is no extra inference latency. A sketch of a LoRA-augmented linear layer follows below.
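A sketch of a LoRA-augmented linear layer; the rank, scaling, and zero-initialization of $B$ follow the recipe described above, but the wrapped layer here is a fresh nn.Linear rather than a real pretrained weight:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # W is frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # A: small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # B = 0, so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)    # Wx + (α/r)·BAx

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```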
Prompting and Zero-shot Inference with LMs¶
Definition of Prompting¶
Prompting is a technique for guiding a pre-trained language model to perform a specific task by formulating the task input as a textual prompt. This method often allows the model to solve tasks without any task-specific training or parameter updates.
- How it works:
- An input is converted into a prompt: a text string with a slot for the model to fill in an answer (e.g., "The movie was great. Overall, it was a `[Z]` movie.").
- The language model then calculates the most probable text to fill the `[Z]` slot.
- This generated text is then mapped to a final answer.
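For masked-LM-style models, this fill-in-the-slot step can be tried directly with the Hugging Face fill-mask pipeline; the checkpoint choice is arbitrary, and mapping predicted words to sentiment labels is left to the user:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The movie was great. Overall, it was a [MASK] movie.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
# Words like "good" or "great" would then be mapped to the label "positive".
```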
In-Context Learning (Zero-shot and Few-shot Inference)¶
In-context learning is a remarkable emergent ability of large language models to perform new tasks based solely on information provided in the prompt, without any gradient updates.
- Zero-shot learning: The model is given a prompt describing the task and must perform it without any examples; for instance, the prompt "Translate English to French: cheese =>" asks the model to produce the translation directly.
- Few-shot learning: The model's prompt is augmented with a few examples (demonstrations) of the task. For example: "Translate English to French: sea otter => loutre de mer, peppermint => menthe poivrée, cheese =>". The model learns the task from these in-context examples.
This ability improves dramatically with model scale; larger models benefit much more from few-shot examples than smaller ones.
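A few-shot prompt is literally a concatenated string of demonstrations followed by the new query; a small sketch using the translation example above:

```python
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]
query = "cheese"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in demonstrations)
prompt += f"{query} =>"
print(prompt)
# The assembled prompt is sent to the LM as-is; no parameters are updated.
```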
Various Kinds of Prompting Strategies¶
Simple prompting is effective, but more sophisticated strategies can unlock better reasoning and performance.
Chain of Thought (CoT) Prompting¶
- Core Idea: Instead of asking for a direct answer, CoT prompting encourages the model to generate a step-by-step reasoning process that leads to the final answer.
- How it works: In a few-shot setting, the examples provided in the prompt include the intermediate reasoning steps. This teaches the model to "think out loud". For example, for a math problem, the prompt would show the steps to solve it, not just the final number.
- Zero-shot CoT: Remarkably, CoT reasoning can be triggered in a zero-shot manner by simply appending the phrase "Let's think step by step" to the question prompt.
- Benefit: Significantly improves performance on tasks requiring arithmetic, commonsense, and symbolic reasoning.
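A sketch of the two CoT variants as plain prompt strings; the arithmetic word problems are adapted from the examples commonly used in the CoT literature:

```python
# Few-shot CoT: the demonstration includes the reasoning steps, not just the answer.
few_shot_cot = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\nA:"
)

# Zero-shot CoT: no demonstrations, just a trigger phrase appended to the question.
zero_shot_cot = (
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\nA: Let's think step by step."
)
print(few_shot_cot, zero_shot_cot, sep="\n\n")
```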
Least-to-Most Prompting¶
- Core Idea: A strategy for solving complex problems by breaking them down into a series of smaller, more manageable sub-problems.
- How it works:
- First, the model is prompted to decompose the main problem into simpler sub-questions.
- Then, the model is prompted to solve each sub-question sequentially, where the answer to a previous sub-question can be used to help solve the next one.
- Benefit: Useful for tasks where the complexity exceeds that of the examples in a standard CoT prompt.
Self-Consistency¶
- Core Idea: A decoding strategy that improves upon the greedy approach of just taking the single most likely answer. It leverages the idea that a complex problem might have multiple valid reasoning paths that all lead to the correct answer.
- How it works:
- Instead of generating just one response, it samples multiple diverse reasoning paths (e.g., multiple chain-of-thought outputs) from the language model by using a non-zero temperature.
- It then aggregates the final answers from these different paths and selects the most consistent one (e.g., through a majority vote).
- Benefit: Makes the model's answers more robust and accurate, especially on challenging reasoning tasks.
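A sketch of the aggregation step; the hard-coded completions below stand in for multiple reasoning paths sampled from the LM at temperature > 0, and the answer-extraction heuristic is deliberately simplistic:

```python
import re
from collections import Counter

def extract_answer(completion: str) -> str:
    """Pull the final number out of a sampled reasoning path (simplistic heuristic)."""
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else ""

# Pretend these came from sampling the LM several times with non-zero temperature.
sampled_paths = [
    "23 - 20 = 3 apples left, then 3 + 6 = 9. The answer is 9.",
    "They bought 6, so 23 + 6 = 29, minus 20 used gives 9. The answer is 9.",
    "23 + 6 = 29. The answer is 29.",   # one faulty reasoning path
]

votes = Counter(extract_answer(path) for path in sampled_paths)
final_answer, _ = votes.most_common(1)[0]
print(final_answer)  # "9" wins the majority vote
```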