263-3855-00: Cloud Computing Architecture
Section 3
Virtualization
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 03/01/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy or completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ana Klimovic. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Virtual machines¶
Virtualization¶
Systems are built on levels of abstraction. Higher levels hide details at lower levels. For example, files are an abstraction of a disk.
Virtualization creates a virtual representation of a resource on a physical machine. It provides a level of indirection between the abstract and the concrete, but does not necessarily hide low-level details.
Virtualization example: virtual memory¶
The OS gives each process its own virtual memory address space. It maps virtual pages from each process to physical pages in main memory. Each process has the illusion that it owns the whole physical memory.
Virtual machines (VM)¶
"Machine" from different perspectives:
OS developer: full CPU ISA & devices
Compiler developer: user ISA and OS ABI
Application developer: user ISA and library API
Leads to different types of VMs:
- System VMs such as Xen, KVM, and VMware. A Virtual Machine Monitor (VMM) honors existing hardware interfaces to create virtual copies of a complete hardware system.
- Process VMs such as Java Virtual Machine (JVM)
VM terminology¶
Host: the physical platform (hardware, sometimes also host OS)
Guest: the additional platforms that run on the VM (OS, apps, etc.)
Virtual Machine Monitor (VMM) or hypervisor: thin layer of software that supports virtualization
Key properties of virtualization¶
Partitioning | Encapsulation |
---|---|
Resource sharing; isolation (security) | Checkpoint / restore; migrate; execution replay |
Why use virtualization in the cloud?¶
Share hardware efficiently and securely between multiple users. For example, virtualization enables:
Server consolidation to improve resource utilization
Load balancing
Datacenter scaledown
Use case: server consolidation¶
Problem: Underutilized physical servers
Solution: Consolidate to improve utilization and decrease cost
Use case: load balancing¶
Balance load for performance and response time. Consolidate load to reduce power.
How should we implement virtualization?¶
Three main requirements:
Safety: isolation between guests, isolation between guests and VMM
Equivalency: fidelity of results with and without VMM
Efficiency: good performance (minimal overhead)
Several different approaches
Hosted interpretation
Direct execution with trap-and-emulate, with dynamic binary translation, or with hardware-assisted virtualization
Paravirtualization
Operating systems terminology¶
Trap: any kind of transfer of control to the operating system
System call: synchronous (i.e., planned), program-to-kernel transfer such as read file, allocate memory
Exception: synchronous program-to-kernel transfer caused by exceptional events such as divide by zero, page fault, page protection error, etc.
Interrupt: asynchronous device-initiated transfer such as network packet arrives, keyboard event, timer ticks
Hardware-enforced privilege rings¶
Code in a more privileged ring can read and write memory in a less privileged ring. Function calls between rings can only happen through hardware-enforced mechanisms.
Only ring 0 can execute privileged instructions. Rings 1, 2, and 3 will trap when executing privileged instructions.
Examples of privileged instructions¶
Update memory address mapping
Flush or invalidate data cache
Read or write system registers
Change the voltage and frequency of processor
Reset a processor
Perform I/O operations
Context switch, change from kernel mode to user mode
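As a small illustration (a sketch of my own, not from the course, assuming x86-64 Linux and gcc), the following C program shows the hardware enforcing this: executing the privileged hlt instruction from user mode (ring 3) causes a general-protection fault, which the kernel delivers to the process as SIGSEGV rather than halting the machine.

```c
/* Sketch: a privileged instruction traps when executed in ring 3. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void on_fault(int sig) {
    printf("caught signal %d: privileged instruction trapped in user mode\n", sig);
    exit(0);
}

int main(void) {
    signal(SIGSEGV, on_fault);   /* the #GP fault arrives as SIGSEGV */
    __asm__ volatile("hlt");     /* privileged: only ring 0 may execute this */
    printf("unreachable: hlt did not trap\n");
    return 1;
}
```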
Virtualization approach 1: Hosted interpretation¶
Run the VMM as a regular user application on top of a host OS
VMM maintains a software-level representation of physical hardware
VMM steps through instructions in the VM code, updating virtual hardware as necessary (e.g., register values)
Advantages
Complete isolation; no guest instruction is directly executed on host hardware
Easy to handle privileged instructions
Disadvantages
Emulating a modern processor is difficult
Interpretation is very slow (e.g., 100x slower than direct execution on hardware)
Virtualization approach 2: Direct execution with Trap-and-Emulate¶
Run VMM directly on host machine hardware.
But what happens when the guest OS wants to execute a privileged instruction (e.g., an instruction that modifies hardware state)?
Whenever the guest OS executes a privileged instruction, it results in a trap (i.e., transfer of control to VMM). The VMM uses a policy to handle the trap. For example, execute the instruction on behalf of the guest OS, kill the VM, etc.
Due to hardware-enforced ring protections, guest apps cannot tamper with the guest OS, guest apps and guest OS cannot tamper with the VMM. When the guest OS executes a privileged instruction, it will trap into the VMM. When a guest app generates a system call, the app will trap into the VMM.
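The following is a conceptual C sketch (not any real hypervisor's code; the types and handlers are invented for illustration) of the dispatch loop a trap-and-emulate VMM conceptually runs when a trap arrives from the guest.

```c
/* Conceptual sketch of a trap-and-emulate VMM dispatch loop. */
#include <stdio.h>

typedef enum { TRAP_PRIV_INSN, TRAP_SYSCALL, TRAP_IO, TRAP_FATAL } trap_cause_t;
typedef struct { unsigned long rip; unsigned long cr3; } vcpu_state;

static void handle_trap(trap_cause_t cause, vcpu_state *vcpu) {
    switch (cause) {
    case TRAP_PRIV_INSN:
        /* Emulate the privileged instruction against virtual hardware
         * state, then advance the guest instruction pointer. */
        vcpu->rip += 2;
        printf("emulated privileged instruction, rip now 0x%lx\n", vcpu->rip);
        break;
    case TRAP_SYSCALL:
        /* Reflect the system call back into the guest OS's own handler. */
        printf("forwarding syscall to guest kernel\n");
        break;
    case TRAP_IO:
        /* Perform the access against a virtual device model. */
        printf("emulating I/O access\n");
        break;
    case TRAP_FATAL:
        printf("policy decision: kill the VM\n");
        break;
    }
}

int main(void) {
    vcpu_state vcpu = { .rip = 0x1000, .cr3 = 0 };
    handle_trap(TRAP_PRIV_INSN, &vcpu);
    handle_trap(TRAP_SYSCALL, &vcpu);
    return 0;
}
```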
Advantages
- Faster than approach 1 (hosted interpretation)
Disadvantages
Still slow (emulation)
Doesn't always work
The processor needs to be "virtualizable", which was not the case for the classic x86 architecture
Example: managing page tables with trap-and-emulate¶
Need to give each VM's guest OS the illusion that it is managing physical memory pages.
Page-table shadowing
VMM intercepts paging operations: Constructs copy of page tables ("shadow" tables)
Guest page tables map: Guest virtual address to guest physical address
Shadow page tables map: Guest virtual address to host / machine physical address
Overheads
Trap to VMM adds to execution time
Shadow page tables consume significant memory
When is an architecture virtualizable?¶
Need at least two execution modes (kernel & user). All sensitive instructions must be privileged instructions. Sensitive instructions are those that change the hardware configuration (allocations, mappings, etc.) or whose outcome depends on the hardware configuration. Privileged instructions are those that cause a trap when executed in user mode.
If a processor is virtualizable, a VMM can interpose any sensitive instruction that the VM tries to execute. VMM can control how the VM interacts with the "outside world." VMM can fool the guest OS into thinking it runs at the highest privilege level.
For many years, x86 chips were not virtualizable. For example, on Pentium chip, 17 instructions were not virtualizable (sensitive, but no trap)
push instruction can push a register value onto the top of the stack
%cs register contains (among other things) 2 bits representing the current privilege level
A guest OS running in ring 1 could push %cs and see that the privilege level is not ring 0
To be virtualizable, push should cause a trap when invoked from ring 1, allowing the VMM to push a fake %cs value which indicates that the guest OS is running in ring 0.
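A small demonstration of this hole (my own sketch, assuming x86-64 Linux and gcc): user-mode code can read %cs without trapping and extract the current privilege level (CPL) from its low 2 bits.

```c
/* Sketch: %cs is readable from user mode without a trap (sensitive, not privileged). */
#include <stdio.h>

int main(void) {
    unsigned short cs;
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));    /* no trap occurs */
    printf("%%cs = 0x%hx, CPL = %d\n", cs, cs & 3); /* prints CPL = 3 in user mode */
    return 0;
}
```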
Virtualization approach 3: Direct execution with binary translation¶
The VMM dynamically rewrites non-virtualizable instructions
VMM scans the guest instruction stream and identifies sensitive instructions
VMM dynamically rewrites the binary code to add instructions that will trap
Advantages
Can run unmodified guest OSes and apps
Most instructions run at bare-metal speed
Disadvantages
- Implementing the VMM is difficult, and translation impacts performance. For example, addresses must be translated from guest virtual memory to guest physical memory to machine memory
Virtualization approach 4: Para-virtualization¶
Direct execution with binary translation is tricky, so modify the guest OS to remove sensitive-but-unprivileged instructions
Example: Xen hypervisor
Guest OS is modified to inform VMM of changes to page table mappings
Guest OS modified to install "fast" syscall handlers
Guest applications are unmodified
Advantages
Faster than direct execution with translation
Fewer context switches
Less bookkeeping
Disadvantages
- Requires substantial modifications to the OS. This can be even harder to implement than binary translation logic
Virtualization approach 5: Direct execution with hardware support¶
Direct execution with binary translation is tricky, so add hardware support for virtualization to the CPU. For example, Intel VT-x, AMD-V
Add new privilege model "VT root mode"
- Allow direct execution of VM on the processor (vmentry) until a privileged instruction is executed (then vmexit)
Add Virtual Machine Control Structure (VMCS)
Can be configured to control which instructions trigger vmexit, e.g., interrupts, memory faults, IO, etc.
Can only be accessed via privileged VMREAD and VMWRITE instructions.
Advantages
- Fast and supported on most CPUs today. This approach is most commonly used today.
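As a hedged sketch of how this looks in practice on Linux: the KVM module exposes VT-x / AMD-V through ioctls on /dev/kvm. The snippet below only creates a VM and a vCPU; a real hypervisor would also register guest memory regions and loop on the KVM_RUN ioctl, handling each vmexit in user space (error handling abbreviated).

```c
/* Minimal sketch of hardware-assisted virtualization via the Linux KVM API. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    int kvm = open("/dev/kvm", O_RDWR);          /* handle to the KVM module */
    if (kvm < 0) { perror("open /dev/kvm"); return 1; }

    printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);     /* a VM: its own guest memory, devices */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);    /* a vCPU: backed by a VMCS the CPU runs */
    printf("created vm fd=%d, vcpu fd=%d\n", vm, vcpu);

    close(vcpu); close(vm); close(kvm);
    return 0;
}
```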
Virtualizing memory: challenges¶
Address translation
Guest OS expects contiguous, zero-based physical memory
VMM must provide this illusion
Page-table shadowing
VMM intercepts paging operations
Constructs copy of page tables
Overheads
VM exits add to execution time
Shadow page tables consume significant memory
Hardware support for memory virtualization: Extended page tables¶
Regular page tables: map guest virtual to guest physical addresses.
Extended page tables (EPT): Map guest physical to host physical (or "machine") address; new hardware page-table walker
Performance benefits
Guest OS can modify its own page tables freely
Avoid VM exit due to page fault
Memory savings
Without EPT, would require a shadow page table for each guest user process
A single EPT supports entire VM
Address translation with page tables¶
x86-64 uses multi-level page tables for address translation. The translation lookaside buffer (TLB) caches address translation mappings.
With 32-bit addressing: 2-level page walk
With 64-bit addressing: 4-level page walk
With virtualization: up to 24 steps in the page walk
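A quick back-of-the-envelope check of that number, assuming 4-level guest page tables and a 4-level EPT: each of the 4 guest page-table levels lives at a guest-physical address, so reading each guest entry requires a 4-step EPT walk plus the entry read itself, and the final guest-physical data address needs one more EPT walk:

```latex
\underbrace{4 \times (4 + 1)}_{\text{guest levels} \,\times\, (\text{EPT walk} + \text{PTE read})}
\;+\; \underbrace{4}_{\text{EPT walk for the data address}} \;=\; 24 \ \text{memory accesses}
```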
Virtualizing IO¶
In addition to virtualizing the CPU and memory, we also need to virtualize I/O devices (networking, storage, etc.)
Issue: there are lots of I/O devices (rich & diverse), making virtualization challenging
Problem: Writing device drivers for all I/O devices in the VMM layer is not practical
Insight: Device drivers already written for popular operating systems
Solution: Present virtual I/O devices to guest VMs
Challenges¶
Virtual device interface
Traps device commands
Translates DMA operations
Injects virtual interrupts
Software methods
I/O device emulation
Paravirtualized device interface
Challenges
Overheads of copying I/O buffers
Controlling DMA and interrupts
SR-IOV: Single-Root I/O Virtualization¶
Extension of the PCIe specification
Allows a PCIe device to appear as multiple separate physical PCIe devices
Physical Function (PF): original device
Virtual Function (VF): extra "device", limited functionality, VFs can be created / destroyed dynamically
Example: Hypervisor can create a virtual NIC for each VM
Performance implication of virtualization¶
VMs do not provide performance isolation guarantees when running on shared CPU cores. Bolt is a system that can:
Inject performance interference on collocated (victim) VMs
Collect statistics to predict which type of application the victim VM is running
Execute a targeted attack to degrade the performance of the victim VM
Containers¶
What is a container¶
A container provides lightweight operating-system-level virtualization. Containers share the host OS kernel, but have their own system binaries, libraries, and dependencies.
Containers vs. VMs¶
Containers are lighter-weight than VMs:
Can achieve higher density sharing of resources on a machine
Faster startup and shutdown time
Bare-metal like performance since no VMM traps or binary translation
However
Containers provide less secure isolation than VMs
Require applications to run on the same host OS
Two notions of container¶
Docker | Linux Containers |
---|---|
Goal is packaging: a standard way to package an app and its dependencies so it can move easily between environments; the production environment matches the test environment. | Goal is performance isolation (not security isolation; use VMs for that): resource isolation with namespaces; CPU cores, memory, and bandwidth limits managed with cgroups. |
A brief history of containers¶
1979: chroot system call
- Change the root directory for a process and limit the files it can access
2000: FreeBSD jails
- Isolated subsystems called jails, characterized by their file system, hostname, IP address and run command
2001: Linux VServer project
- Support for multiple isolated Linux userspace instances
2008: Linux Containers (LXC)
- Containers implemented with Linux kernel namespaces and cgroups
2013: Docker added several features on top of LXC
An application-centric view of containers; packaging all dependencies into a single object
Tools and interface for building, running, and managing the lifecycle of containers
Versioning and layering of container images
Linux Containers (LXC)¶
Use three key isolation mechanisms
namespaces: abstract and limit which global resources a process can see
cgroups: monitor and limit the amount of resources a process can use
seccomp-bpf: limit the system calls that a process can call
Linux kernel namespaces¶
A namespace abstracts a global system resource (e.g. network interface)
Goal: restrict what a container can see
Process-level isolation of global resources
Processes have the illusion they are the only processes in the system
Changes to the global resource are visible to other processes in the same namespace, but not to processes in other namespaces
Examples of Linux kernel namespaces¶
MNT: what file systems and mount points are visible?
PID: what other processes are visible?
NET: which network devices are visible and how are the routing tables configured?
Users: what user IDs are visible?
IPC: which inter-process communication channels are available?
A Linux system starts with a single default namespace of each type, used for all processes. Processes can create new namespaces and join them.
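A minimal C sketch of namespace creation (assuming Linux with root privileges or an unprivileged user namespace; the hostname value is made up): the process unshares the UTS namespace and changes the hostname, and the change is visible only inside the new namespace.

```c
/* Sketch: move into a new UTS namespace and change the hostname there. */
#define _GNU_SOURCE
#include <sched.h>      /* unshare, CLONE_NEWUTS */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    if (unshare(CLONE_NEWUTS) != 0) {            /* new UTS (hostname) namespace */
        perror("unshare(CLONE_NEWUTS)");
        return 1;
    }
    const char *name = "inside-container";
    sethostname(name, strlen(name));             /* visible only in this namespace */

    char buf[64];
    gethostname(buf, sizeof(buf));
    printf("hostname in new namespace: %s\n", buf);
    return 0;
}
```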
Linux control groups (cgroups)¶
Goal: limit the resources that a container has access to
Group processes based on resource limits
Enable resource accounting / monitoring for the group
Implementation
cgroups are created, deleted, and modified by altering the structure of a virtual file system called cgroupfs
Support nested groups, child processes inherit attributes of parent
Each cgroup has several subsystems whose limits can be set, e.g., cpu, memory, devices, etc.
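A small sketch of how a cgroup is driven through cgroupfs, assuming cgroup v2 mounted at /sys/fs/cgroup and sufficient privileges; the group name "demo" and the 256 MiB limit are arbitrary choices for illustration.

```c
/* Sketch: create a cgroup, set a memory limit, and join it via cgroupfs. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s", value);
    return fclose(f);
}

int main(void) {
    mkdir("/sys/fs/cgroup/demo", 0755);                         /* create the cgroup */
    write_file("/sys/fs/cgroup/demo/memory.max", "268435456");  /* 256 MiB limit */

    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/demo/cgroup.procs", pid);        /* move this process in */

    printf("process %s now runs under the 'demo' memory limit\n", pid);
    return 0;
}
```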
Linux seccomp-bpf¶
Goal: limit the system calls the container can call
- Reduce the "attack surface" of the kernel, making it more secure
seccomp-bpf stands for Secure Computing with Berkeley Packet Filters
Implementation
- Users specify filters for incoming system calls and / or their parameters using the Berkeley Packet Filter interface (originally used to enable user-space programs to specify network packet filters)
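A minimal seccomp-bpf filter in C, as a sketch (x86-64 Linux assumed; a production filter would also check the architecture field of seccomp_data): once the filter is installed, the mkdir system call fails with EPERM instead of reaching the kernel, while all other system calls are allowed.

```c
/* Sketch: install a seccomp-bpf filter that denies the mkdir syscall. */
#define _GNU_SOURCE
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct sock_filter filter[] = {
        /* load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if it is mkdir, return EPERM; otherwise fall through and allow */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);             /* required for unprivileged use */
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);  /* install the filter */

    if (syscall(__NR_mkdir, "/tmp/blocked-by-seccomp", 0755) != 0)
        perror("mkdir syscall");                        /* expected: Operation not permitted */
    return 0;
}
```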
Docker¶
Docker builds on top of Linux Containers, making features like cgroups and namespaces easy for application developers to use. Standard platform for building and sharing containerized applications.
Docker basic concepts¶
Image: Read-only template for creating a container. An image is often based on another image, e.g., start with ubuntu image and add web server on top.
Container: Runnable instance of an image that you can create, stop, move, or delete.
Registry: A shared storage service for Docker images.
Engine: Open source software that creates and runs containers. Consists of the daemon process dockerd.
Client: Communicates with dockerd by executing docker client commands, e.g., docker build
Container images¶
Images are divided into a sequence of layers. An image is defined by a file called a Dockerfile, which acts like a script for setting up the container; each command in the Dockerfile creates a new layer on top of the previous layers.
Docker runs on top of a container runtime¶
Container runtime: library responsible for starting & managing containers.
Takes as input the root file system and a configuration file for the container
Unshares namespaces
Creates a cgroup and sets resource limits
Executes the container command in the cgroup
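Putting the steps together, here is a conceptual C sketch of that flow (needs root; a real runtime such as runc additionally sets up the root file system, cgroup limits, and seccomp filters from the configuration file):

```c
/* Conceptual sketch of a low-level container runtime launching a command. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* 1. Unshare namespaces so the command gets its own PID/mount/UTS view. */
    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS) != 0) {
        perror("unshare");
        return 1;
    }

    /* 2. (Omitted) create a cgroup and write resource limits, as sketched above. */

    /* 3. Fork so the child becomes PID 1 of the new PID namespace, then
     *    execute the container command inside it. */
    pid_t child = fork();
    if (child == 0) {
        char *const argv[] = { "/bin/sh", "-c", "echo hello from pid $$", NULL };
        execv(argv[0], argv);
        perror("execv");
        return 1;
    }
    waitpid(child, NULL, 0);
    return 0;
}
```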
Why / when not to use containers?¶
Containers provide less secure isolation than virtual machines
Container "escape" (i.e., getting root access on the machine) is possible with a single kernel vulnerability
The kernel is typically a much larger code base than the VMM / hypervisor, and is hence more likely to have vulnerabilities
Secure isolation vs. performance tradeoff¶
gVisor makes containers more secure¶
A new kind of low-level container runtime: creates a container and sandboxes it.
- Intercepts syscalls and performs them in a secure user-space kernel instead of in the actual host kernel.
Good for running untrusted applications like serverless.
WebAssembly¶
What is WebAssembly?¶
Programs run in different ways with the assistance of the operating system: languages such as C, C++, Go, and Rust are compiled into a binary executable format that runs directly on the OS; scripting languages (JavaScript, Python, Ruby, PHP, and Perl) have their source code read and executed in rapid succession; others (Java and C#) are compiled to a bytecode format but require a special program (a runtime or virtual machine) to run them.
WebAssembly is a binary instruction format for secure, fast, and portable execution. Originally designed for the web, it is now also used in cloud and edge computing.
Key features¶
Strong security
Small binary sizes
Fast loading and running
Support for many operating systems and architectures
Interoperability with the browser or cloud services
VM vs. Container vs. WebAssembly¶
VM | Container | WebAssembly | |
---|---|---|---|
Isolation | Strong | Moderate | Strong |
Performance | Slower (emulation overhead) | Fast (lightweight, shared kernel) | Near-native (no emulation overhead) |
Startup Time | Slow (seconds to minutes) | Fast (milliseconds to seconds) | Instant (milliseconds) |
Footprint | Heavy (GBs) | Moderate (MBs) | Light (KBs) |