263-5354-00: Large Language Models
Section 6
Applications and the Benefits of Scale
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 07/19/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of this summary. Some of the course material may not be included, and some of the content in the summary may be incorrect. You should use this file properly and legally. We are not responsible for any consequences of using this file.
These personal notes are adapted from Professor Mrinmaya Sachan's course. Please contact us to delete this file if you believe your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Vision Language Models¶
At its core, a Vision Language Model (VLM) is an AI model that deals with multimodality, that is, information that comes in more than one form. While this can include things like knowledge graphs, the most common combination is vision (images/videos) and language (text).
Inspired by the success of Large Language Models (LLMs) in NLP, Vision-Language Pre-training (VLP) has become the standard approach for building models that can connect what they "see" with what they "read".
Vision-and-Language (VL) Tasks¶
VL tasks involve both visual and language modalities in their inputs and outputs. They are generally grouped into three categories.
Image-Text Tasks¶
These are the most prominent and researched tasks in the VL field.
Visual Question Answering (VQA): An AI system is given an image and a natural language question about it and must provide an answer. For example, when shown an image of a dog with a frisbee and asked, "What is the dog holding with its paws?", the system should answer "frisbee".
Image Captioning: This involves generating a descriptive and meaningful caption for an image. The goal is to understand the visual content and translate it into natural language text.
Image-Text Retrieval: The model retrieves the most relevant texts for an image query or the most relevant images for a text query from a large collection.
Visual Grounding: The model receives an image, a referring expression (text), and object bounding boxes. It then has to predict the bounding box that corresponds to the text query.
Text-to-Image Generation: Considered the reverse of image captioning, the system creates a high-quality image based on a text description.
Computer Vision (CV) Tasks as VL Problems¶
Core visual tasks like image classification and object detection are traditionally pure vision problems. However, newer models like CLIP have demonstrated that language supervision is highly beneficial. By considering the semantic meaning behind class labels, these CV tasks can be transformed into VL problems. This approach enables models to recognize concepts they weren't explicitly trained on, a capability known as open-vocabulary recognition (e.g., open-vocabulary object detection).
Video-Text Tasks¶
These tasks are the video counterparts to the image-text tasks, such as video captioning, retrieval, and question answering. A key difference is that video requires the AI system to capture not only spatial information within frames but also the temporal dependencies between them.
VLM Architectures: The Building Blocks¶
A standard VLM consists of a text encoder, a vision encoder, a fusion module, and sometimes a decoder.
Vision Encoder¶
This component processes the image to extract visual features. The three main types are:
Object Detector (OD): Uses models like Faster R-CNN, pre-trained on datasets such as Visual Genome, to detect objects and extract their features.
Convolutional Neural Network (CNN): Employs standard vision models like ResNet to convert an image into a set of feature vectors. Stronger CNN backbones generally lead to better performance.
Vision Transformer (ViT): An image is divided into patches, which are then flattened into vectors and processed by a Transformer, similar to how text is processed in NLP.
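To make the ViT encoder concrete, here is a minimal sketch (assuming a 224x224 RGB image, 16x16 patches, and a 768-dimensional embedding, all illustrative choices) of how an image becomes a token sequence a Transformer can process:

```python
import torch
import torch.nn as nn

# Illustrative sketch: split a 224x224 RGB image into non-overlapping 16x16 patches,
# flatten each patch, and project it to an embedding, as in a ViT-style encoder.
image = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patch_size = 16

patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.reshape(1, 3, -1, patch_size, patch_size)   # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)           # (1, 196, 768): one vector per patch

embed = nn.Linear(3 * patch_size * patch_size, 768)           # linear patch embedding
tokens = embed(patches)                                       # (1, 196, 768): sequence fed to the Transformer
```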
Text Encoder¶
This component processes the input text. It's typically based on pre-trained language models like BERT or RoBERTa. The text is tokenized into subwords, and these tokens are converted into feature vectors.
Multimodal Fusion Module¶
This is where the visual and text features are combined, allowing the model to learn cross-modal relationships. The two primary methods are:
Merged Attention: The text and visual features are simply concatenated and fed into a single Transformer block. This design is used in many models, including VisualBERT, UNITER, and ViLT.
Co-attention: The text and visual features are processed in separate Transformer streams but interact through cross-attention layers to enable cross-modal interaction. This design is used in models like LXMERT and ViLBERT.
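The difference between the two fusion designs can be sketched in a few lines of PyTorch (a minimal illustration; the feature shapes and single attention layers are assumptions, real models stack many such blocks):

```python
import torch
import torch.nn as nn

text_feats  = torch.randn(1, 20, 768)    # 20 text token features
image_feats = torch.randn(1, 196, 768)   # 196 image patch features

# Merged attention: concatenate both modalities and run one self-attention block over the joint sequence.
self_attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
merged = torch.cat([text_feats, image_feats], dim=1)          # (1, 216, 768)
fused_merged, _ = self_attn(merged, merged, merged)

# Co-attention: keep the two streams separate and let each attend to the other via cross-attention.
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
text_attends_image, _ = cross_attn(text_feats, image_feats, image_feats)
image_attends_text, _ = cross_attn(image_feats, text_feats, text_feats)
```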
Pre-training Objectives¶
VLMs are pre-trained on vast datasets of image-text pairs using several key objectives to guide the learning process.
Masked Language Modeling (MLM)¶
This objective is adapted from language model pre-training.
How it works: About 15% of the input words are randomly masked, and the model must predict these masked tokens.
The Goal: The prediction is based on the surrounding unmasked words and the corresponding image.
The Formula: The model works by minimizing the negative log-likelihood of predicting the correct masked word ($w_m$) given the unmasked words ($w_{\backslash m}$) and the image ($v$).
$$\mathcal{L}_{MLM}(\theta) = -\mathbb{E}_{(w,v) \sim D} \log P_{\theta}(w_m|w_{\backslash m},v)$$
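A minimal sketch of this objective is shown below (assuming a `model` that returns per-token vocabulary logits given masked tokens and image features; the helper name and interface are assumptions):

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids, image_feats, mask_id, mask_prob=0.15):
    """Sketch: mask ~15% of tokens, predict them from the unmasked words and the image."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob         # choose positions to mask
    labels[~mask] = -100                                    # only masked positions contribute to the loss
    masked_ids = token_ids.masked_fill(mask, mask_id)       # replace chosen tokens with [MASK]
    logits = model(masked_ids, image_feats)                 # (batch, seq_len, vocab)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```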
Image-Text Matching (ITM)¶
This objective trains the model to determine if an image and a caption are a correct match.
How it works: The model is presented with an image-caption pair, which is either a correct match or a mismatched pair with equal probability.
The Goal: It then predicts a binary label indicating whether the pair is matched.
The Formula: A binary cross-entropy loss is used for optimization, where $y$ is the true label and $s_{\theta}(w,v)$ is the model's predicted score for the pair.
$$\mathcal{L}_{ITM}(\theta) = -\mathbb{E}_{(w,v) \sim D} [y \log s_{\theta}(w,v) + (1-y)\log(1-s_{\theta}(w,v))]$$
Image-Text Contrastive Learning (ITC)¶
Popularized by models like CLIP and ALIGN, this objective teaches the model to align the vector representations of matched image-text pairs.
How it works: Within a batch of N image-text pairs, ITC aims to identify the N correct pairs out of all $N^2$ possible combinations.
The Goal: For any given image in the batch, the model is trained to assign the highest similarity score to its true corresponding text caption and lower scores to all other captions, and vice versa.
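A common way to implement this is the CLIP-style symmetric cross-entropy over an N x N similarity matrix; the sketch below assumes the image and text encoders have already produced one embedding per example (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    """In-batch contrastive sketch: the i-th image should match the i-th text among N candidates, and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))              # the diagonal holds the true pairs
    loss_i2t = F.cross_entropy(logits, targets)         # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text-to-image direction
    return (loss_i2t + loss_t2i) / 2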
Masked Image Modeling (MIM)¶
This is the visual equivalent of MLM, where parts of the input image are masked.
How it works: Some image patches or regions are randomly hidden, and their visual features are replaced by zeros.
The Goal: The model is trained to reconstruct the features of these masked regions using the remaining visible parts of the image and the associated text as context.
Note: While MIM is an intuitive concept, several state-of-the-art models either do not use it or have found it is not helpful for improving downstream performance.
Retrieval-Augmented Generation¶
The Core Problem: Static Knowledge¶
Language models store a vast amount of information learned during training directly in their parameters. This is called parametric knowledge. However, this leads to a significant problem:
Knowledge is Static: The information inside the model is frozen at the time of training. If a fact changes in the real world (e.g., a new president is inaugurated), the model will continue to provide the old, incorrect information.
Updates are Expensive: The only reliable way to update this parametric knowledge is to fine-tune or completely retrain the model, which are complex and costly operations.
The RAG Solution: Using a Knowledge Base¶
To solve this, we turn to non-parametric models, which rely on an external, updatable source of information called a knowledge base. This knowledge base can be a collection of documents like Wikipedia or a structured database.
The general process for these knowledge-enhanced models is:
During inference, the model generates a query or key.
This key is used to search the external knowledge base with a Retriever.
The retrieved information (an "artefact") is "fused" back into the model, which then uses this new context to generate a better output.
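The whole retrieve-then-fuse loop can be summarized in a short sketch (the `retriever` and `generator` objects and their methods are assumptions, standing in for any scoring function and any text-generation model):

```python
def rag_answer(query, knowledge_base, retriever, generator, k=3):
    """Minimal retrieve-then-generate sketch over a list of text documents."""
    # 1. Use the query to search the external knowledge base.
    scored = sorted(knowledge_base, key=lambda doc: retriever.score(query, doc), reverse=True)
    retrieved = scored[:k]
    # 2. Fuse the retrieved artefacts back into the model's input.
    prompt = "\n".join(retrieved) + "\n\nQuestion: " + query + "\nAnswer:"
    # 3. Generate the final output grounded in the retrieved context.
    return generator.generate(prompt)
```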
Key Advantages of RAG¶
Easy Updates: Knowledge can be kept current by simply editing the external knowledge base, with no expensive model retraining required.
Reduced Hallucination: By grounding the model in factual, retrieved documents, RAG helps mitigate "hallucination", the tendency of LLMs to generate text that is not factually correct.
Interpretability and Verification: You can inspect which documents were retrieved to generate an answer, making the model's output more trustworthy and verifiable.
Access to More Knowledge: Models can't memorize everything, especially long-tail knowledge. A knowledge base gives them access to a much larger world of information.
Efficiency: Non-parametric models with fewer parameters can sometimes achieve better results than much larger parametric models.
Prototypical RAG: The Two-Stage Approach¶
The simplest RAG approach involves two main stages where each component is trained independently.
Stage 1: Retrieval¶
The goal of the retriever is to search the entire knowledge base ($\mathcal{D}$) and find a small subset of the most relevant documents for a given query ($q$).
There are two main categories of retrieval methods:
1. Sparse Retrieval (e.g., TF-IDF, BM25)¶
These classic methods are based on term frequency.
TF-IDF (Term Frequency-Inverse Document Frequency): Measures how important a word is to a document in a collection.
Term Frequency (tf): A term is more important if it appears frequently in a document.
Inverse Document Frequency (idf): A term is less important if it appears in many documents (e.g., common words like "the").
Limitation: These methods struggle with synonyms and rely on exact keyword overlap between the query and the document.
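As a toy illustration of the sparse scoring above (and of why exact keyword overlap matters), here is a hand-rolled TF-IDF ranker; the smoothing of the idf term is a common simplification, not the exact BM25 formula:

```python
import math
from collections import Counter

def tfidf_scores(query, documents):
    """Toy TF-IDF ranking: score(doc) = sum over query terms of tf(term, doc) * idf(term)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    def idf(term):
        df = sum(term in doc for doc in tokenized)         # number of documents containing the term
        return math.log((1 + n_docs) / (1 + df)) + 1       # smoothed idf: rarer terms get higher weight

    scores = []
    for doc in tokenized:
        tf = Counter(doc)                                   # term frequency within the document
        scores.append(sum(tf[t] * idf(t) for t in query.lower().split()))
    return scores
```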
2. Dense Retrieval (e.g., DPR)¶
Modern approaches that overcome the limitations of sparse models by using dense vector embeddings.
Dense Passage Retrieval (DPR): A common example that uses a dual-encoder architecture (often based on BERT) to map both the query and the documents into the same high-dimensional vector space.
Similarity Score: The relevance between a query $q$ and a document $d$ is calculated using a simple dot product of their vector representations:
$$SIM(q,d) = ENCODER(q)^T \cdot ENCODER(d)$$
Training: The encoders are trained using contrastive learning. The goal is to make the similarity score high for a relevant (positive) document and low for all irrelevant (negative) documents. This is optimized using a negative log-likelihood loss function:
$$L(q_i, d_i^+, d_{i,1}^-, \cdots, d_{i,n}^-) = -\log \frac{e^{SIM(q_i, d_i^+)}}{e^{SIM(q_i, d_i^+)} + \sum_{j=1}^{n} e^{SIM(q_i, d_{i,j}^-)}}$$
Performance: Dense retrievers like DPR significantly outperform sparse methods like BM25.
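In practice, DPR is often trained with in-batch negatives (each query's positive passage serves as a negative for every other query), which makes the loss above reduce to a cross-entropy over the batch similarity matrix. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def dpr_in_batch_loss(query_emb, passage_emb):
    """In-batch negative sketch of the DPR objective: passage i is the positive for query i,
    all other passages in the batch act as negatives."""
    sim = query_emb @ passage_emb.t()        # dot-product similarities, shape (N, N)
    targets = torch.arange(sim.size(0))      # diagonal entries are the positive pairs
    return F.cross_entropy(sim, targets)     # negative log-likelihood over softmax of similarities
```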
Stage 2: Generation¶
Once the relevant documents are retrieved, they are passed to a second model, which is responsible for producing the final answer.
Extractor Models (e.g., DrQA): Early systems used a reading comprehension model to extract the answer span directly from the retrieved documents.
Generator Models (e.g., Fusion-in-Decoder): More recent methods prompt a generative model with the retrieved documents to generate a free-form answer, which has been shown to produce better results.
Limitation of Prototypical RAG
The retriever and the generator are trained separately. This means the retriever is not optimized for the final task, and the generator is not trained specifically on how to best leverage the retrieved information.
Advanced RAG Models & Fusion Techniques¶
More advanced models address the limitations of the prototypical approach, primarily by jointly training the retriever and generator. A key distinction between these models is how and when they fuse the retrieved artefact into the generation process.
REALM: Fuse in the Input Layer¶
REALM (Retrieval-Augmented Language Model) was a pioneering model that jointly optimizes the retriever and the language model.
How it works: For a given input, REALM's neural retriever first finds relevant documents from a corpus like Wikipedia. These documents are then simply concatenated to the original input and fed into a language model to produce the final output (e.g., predicting a masked word).
Joint Optimization: The entire system is trained end-to-end, so the retriever learns to find documents that are most useful for the language model to perform its task.
Key Result: REALM significantly outperformed both sparse-retrieval systems (BM25+BERT) and large parametric-only models like T5 on open-domain question answering.
RETRO: Fuse in Intermediate Layers¶
RETRO (Retrieval Enhanced Transformer) integrates retrieved information deeper inside the model architecture.
How it works: Instead of concatenating at the input, RETRO uses a special Chunked Cross-Attention (CCA) mechanism. This allows the model, at its intermediate layers, to look at the retrieved text chunks and incorporate that information while it is processing its own input.
Frozen Retriever: The retriever itself is a frozen BERT-based k-Nearest-Neighbor (kNN) retriever that is not updated during training.
Key Result: This architecture is extremely efficient. It has been shown that RETRO can outperform parametric models that are 25x larger.
kNN-LM: Fuse at the Output Layer¶
kNN-LM takes a different approach by mixing retrieved information at the final prediction step.
How it works:
Build Datastore: First, it processes a large training corpus and creates a massive datastore. This datastore maps the vector representation of every context (token prefix) in the corpus to the token that immediately follows it.
Retrieve: During inference, for a given context, it finds the k-nearest neighbors (i.e., the most similar contexts) from the datastore.
Interpolate: It then takes the next-word probabilities from these retrieved neighbors and interpolates them with the language model's own output probability distribution. A hyperparameter $\lambda$ controls how much weight is given to the retrieved distribution versus the model's own prediction.
Dynamic Gating: An improvement on this model learns a dynamic gating mechanism. This allows the model to decide on-the-fly how much to trust the kNN distribution versus its own parametric knowledge, based on the current context. For example, it can learn to rely less on retrieval for common words and more for rare, factual information.
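The retrieve-and-interpolate step can be sketched as follows (a minimal illustration with a fixed $\lambda$; the datastore tensors and their names are assumptions, and a real system would use an approximate nearest-neighbor index):

```python
import torch

def knn_lm_next_token(context_emb, lm_probs, datastore_keys, datastore_values, k=8, lam=0.25):
    """Sketch of kNN-LM output fusion: mix the LM's next-token distribution with a
    distribution built from the k nearest stored contexts."""
    dists = torch.cdist(context_emb.unsqueeze(0), datastore_keys).squeeze(0)  # distance to every stored context
    knn = torch.topk(-dists, k).indices                                        # indices of the k nearest neighbors
    weights = torch.softmax(-dists[knn], dim=0)                                # closer neighbors get more weight
    knn_probs = torch.zeros_like(lm_probs)
    knn_probs.index_add_(0, datastore_values[knn], weights)                    # aggregate weight per candidate next token
    return lam * knn_probs + (1 - lam) * lm_probs                              # interpolate with the LM's own prediction
```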
Alignment with Human Preferences¶
Instruction Tuning¶
Instruction tuning is a technique for fine-tuning a pre-trained language model on a collection of diverse NLP tasks that are described using natural language instructions.
How it Works¶
The process is straightforward: a short description of the task is prepended to the model's input during training. For example, instead of just feeding the model a movie review and a "positive" label, you would provide an input like: "Is the sentiment of this movie review positive or negative? Review: [movie review text]".
The key idea is to train the model on a wide mixture of different instructed datasets. This explicit multi-task training helps the model learn to follow instructions and generalize this ability to unseen tasks during inference.
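A minimal sketch of how such a mixture might be assembled (the templates, task names, and field names are purely illustrative, not the datasets used by any particular model):

```python
# Turn ordinary supervised examples into instruction-formatted training pairs, mixed across tasks.
templates = {
    "sentiment": "Is the sentiment of this movie review positive or negative? Review: {text}",
    "translation": "Translate the following sentence into German: {text}",
    "nli": "Does the premise entail the hypothesis? Premise: {premise} Hypothesis: {hypothesis}",
}

def to_instruction_example(task, fields, target):
    """Prepend the task description to the input; the target stays a natural-language answer."""
    return {"input": templates[task].format(**fields), "output": target}

example = to_instruction_example("sentiment", {"text": "A moving and beautifully shot film."}, "positive")
```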
Example: The FLAN Model¶
The FLAN (Finetuned Language Net) family of models is a prime example of instruction tuning.
Training: FLAN is fine-tuned on a large number of NLP tasks, such as translation, commonsense reasoning, and sentiment analysis.
Evaluation: To test its generalization, it's evaluated in a zero-shot setting on a task it has never seen during instruction tuning (e.g., natural language inference).
Results: FLAN models have been shown to significantly outperform larger models that are not instruction-tuned, even when the larger models are given few-shot examples. Scaling up the model size and the number of tasks further improves reasoning and generalization capabilities.
Reinforcement Learning from Human Feedback (RLHF)¶
RLHF is a powerful technique that uses human preferences as a direct reward signal to train a language model. The goal is to align the model with both explicit intentions (e.g., following instructions) and implicit intentions (e.g., being truthful, helpful, and not harmful).
The RLHF process involves three main steps:
Step 1: Train a Supervised Policy (SFT)¶
First, a pre-trained language model is fine-tuned on a high-quality dataset of human-written demonstrations. This initial model is called the Supervised Fine-Tuning (SFT) model, and it learns the desired style and behavior for responding to prompts.
Step 2: Train a Reward Model (RM)¶
This step teaches a model to understand human preferences.
Several different outputs are sampled from the SFT model for a given prompt.
A human labeler then ranks these outputs from best to worst.
This comparison data is used to train a separate Reward Model (RM). The RM's job is to take a prompt and a response and output a scalar score that predicts how a human would rate it.
The RM is trained with a loss function that encourages it to give a higher score to the winning response ($y_w$) than the losing response ($y_l$): $$\mathcal{L}(\theta_{RM}) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} [\log(\sigma(r_{\theta_{RM}}(x, y_w) - r_{\theta_{RM}}(x, y_l)))]$$
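A sketch of this pairwise ranking loss (assuming a `reward_model` that maps a prompt and a response to a scalar score; the interface is an assumption):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise ranking sketch: the human-preferred response should score higher than the dispreferred one."""
    r_w = reward_model(prompt, chosen)       # scalar score of the winning response y_w
    r_l = reward_model(prompt, rejected)     # scalar score of the losing response y_l
    return -F.logsigmoid(r_w - r_l).mean()   # -log sigma(r_w - r_l)
```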
Step 3: Optimize Policy with RL (PPO)¶
Finally, the SFT model is further fine-tuned using reinforcement learning.
A new prompt is sampled, and the SFT model (now the "policy") generates a response.
The Reward Model from Step 2 calculates a reward score for that response.
This reward is used to update the policy's parameters using an RL algorithm, typically Proximal Policy Optimization (PPO).
The objective function for this step includes a penalty term (a Kullback-Leibler, or KL, penalty) to prevent the policy from deviating too far from the original SFT model, which helps mitigate over-optimization of the reward model. Models like InstructGPT and ChatGPT were trained using this RLHF procedure.
Direct Preference Optimization (DPO)¶
While powerful, RLHF is complex, computationally expensive, and can be unstable to train. Direct Preference Optimization (DPO) is a simpler and more direct alternative.
How DPO Works¶
DPO achieves the goal of aligning with human preferences without needing an explicit reward model or a complex RL loop. It directly fine-tunes the language model on the same preference data (pairs of winning and losing responses) collected for RLHF.
The core insight is a mathematical simplification. The RLHF objective can be rearranged to show that the underlying reward function is directly related to the probabilities assigned by the language model policy. DPO leverages this to create a simple loss function that bypasses the need for an explicit reward model.
The DPO loss function directly optimizes the language model's parameters ($\theta$) based on the preference data: $$\mathcal{L}^{DPO}(\theta) = -E_{(x, y_w, y_l) \sim D} \left[\log \sigma\left(\beta \log \frac{\pi^{\theta}(y_w|x)}{\pi^{SFT}(y_w|x)} - \beta \log \frac{\pi^{\theta}(y_l|x)}{\pi^{SFT}(y_l|x)}\right)\right]$$
The intuition is simple: the loss function simultaneously increases the likelihood of the preferred response ($y_w$) and decreases the likelihood of the dispreferred response ($y_l$), relative to the original SFT model.
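A sketch of this loss, assuming the (summed) log-probabilities of each response under the trained policy and the frozen SFT reference have already been computed:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sketch of the DPO objective over a batch of preference pairs."""
    pi_logratio = logp_w - logp_l             # log pi_theta(y_w|x) - log pi_theta(y_l|x)
    ref_logratio = ref_logp_w - ref_logp_l    # same quantity under the SFT reference model
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```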
DPO vs. RLHF Summary
DPO does not train an explicit reward model.
DPO does not need to sample from the model during training to get rewards; it optimizes the language model weights directly.
Calibration¶
Calibration and Selective Prediction¶
What is Calibration?¶
Confidence calibration is the process of ensuring that a model's predicted probability scores accurately reflect the true likelihood of an event. For example, if a model makes 100 predictions, each with a stated confidence of 70%, we expect that about 70 of those predictions will be correct.
A system is considered perfectly calibrated if the following condition holds for all probability levels $p$:
$$\mathbb{P}(\hat{Y} = Y | \hat{P} = p) = p, \forall p \in [0, 1]$$
where $\hat{Y}$ is the predicted label and $\hat{P}$ is the predicted probability (confidence).
Calibration is crucial in high-stakes fields like medical diagnosis, where poorly calibrated models could lead to dangerous decisions. Modern deep neural networks, due to their high capacity and certain training methods, are often found to be poorly calibrated.
Measuring (Mis)Calibration¶
Two common tools are used to measure how well-calibrated a model is:
Expected Calibration Error (ECE): This is a widely used metric to quantify a model's miscalibration.
How it works: It first groups all predictions into a set number of bins based on their confidence scores (e.g., 0-10%, 10-20%, etc.). For each bin, it calculates the difference between the average confidence and the actual accuracy of the predictions in that bin.
The Formula: The ECE is the weighted average of these differences across all bins:
$$ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} |acc(B_m) - conf(B_m)|$$
where $B_m$ is the set of predictions in bin $m$, $n$ is the total number of samples, $acc(B_m)$ is the accuracy of predictions in the bin, and $conf(B_m)$ is the average confidence.
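A sketch of this computation with equal-width bins (array names are illustrative; `correct` holds 1 where the prediction was right and 0 otherwise):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then take the coverage-weighted gap between
    average confidence and accuracy in each bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()            # acc(B_m)
            conf = confidences[in_bin].mean()       # conf(B_m)
            ece += in_bin.mean() * abs(acc - conf)  # weighted by |B_m| / n
    return ece
```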
Reliability Diagrams: These are visualizations that plot the expected accuracy as a function of confidence.
How to read them: The average confidence for each bin is plotted on the x-axis, and the corresponding accuracy is plotted on the y-axis.
Ideal result: For a perfectly calibrated model, all points would fall along the diagonal line. Deviations from this diagonal represent miscalibration, with the gaps visually showing the ECE.
Calibration Methods¶
Temperature Scaling: This is a simple and effective post-processing method to recalibrate a model's outputs. It uses a single parameter, the temperature T, to "soften" the final probability distribution by scaling the logits before the softmax function:
$$p = \text{softmax}(z/T)$$
The optimal temperature $T$ is learned on a held-out validation set.
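A sketch of fitting the temperature on validation logits (using plain gradient descent for simplicity; in practice an optimizer such as L-BFGS is also common, and the step count and learning rate here are arbitrary):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Learn a single scalar T on a held-out validation set by minimizing the NLL of softmax(z / T)."""
    log_T = torch.zeros(1, requires_grad=True)           # optimize log T so that T stays positive
    optimizer = torch.optim.Adam([log_T], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_T.exp().item()

# At inference time: calibrated_probs = softmax(z / T)
```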
Structured Calibration: For tasks with a very large output space (like structured prediction), calibration can be generalized by focusing only on a user-defined set of "events of interest" rather than the entire output space.
Selective Prediction (The Reject Option)¶
Selective prediction allows a model to abstain from making a prediction when its confidence is low, which helps to reduce its error rate.
How it works: A selective classifier consists of a standard classifier, $f(x)$, and a selective function, $g(x)$, which is typically a confidence threshold. If the model's confidence is above the threshold, it makes a prediction; otherwise, it abstains.
The Coverage vs. Risk Trade-off: The performance of a selective classifier is measured by a trade-off between two key metrics:
Coverage: The fraction of examples on which the model does not abstain. $$\text{coverage} = \frac{1}{N} \sum_{i=1}^{N} g(x_i)$$
Risk: The error rate calculated only on the predictions that were not abstained. $$\text{risk} = \frac{\sum_{i=1}^{N} g(x_i) \cdot \mathbb{I}[y_i \neq \hat{y}_i]}{\sum_{i=1}^{N} g(x_i)}$$
Evaluation: The performance is often visualized using a Risk-Coverage Curve (RCC), which shows how the risk changes as the coverage is varied (by adjusting the confidence threshold). The Area Under the Curve (AUC) of the RCC can be used as a single performance metric.
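A sketch of how such a curve can be traced by sweeping the confidence threshold (array names are illustrative; `correct` holds 1 for correct predictions and 0 otherwise):

```python
import numpy as np

def risk_coverage_curve(confidences, correct):
    """Sweep the confidence threshold and record, for each coverage level, the error rate
    (risk) on the examples the model would answer."""
    confidences = np.asarray(confidences)
    errors = 1.0 - np.asarray(correct, dtype=float)
    order = np.argsort(-confidences)                  # answer the most confident examples first
    sorted_errors = errors[order]
    coverage = np.arange(1, len(errors) + 1) / len(errors)
    risk = np.cumsum(sorted_errors) / np.arange(1, len(errors) + 1)
    return coverage, risk                             # the area under this curve summarizes performance

# Example: coverage, risk = risk_coverage_curve([0.9, 0.6, 0.8], [1, 0, 1])
```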