263-3010-00: Big Data
Section 3
Cloud Storage
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 10/02/2024
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Ghislain Fourny. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Storing Data¶
Suppose a dataset is 273 TB, organized into 176 million files across 680,000 directories, with 1.23 billion objects, including 260 million stars and 208 million galaxies. No single machine could store this in 2021, so a single-machine solution is infeasible. How do we manage data at this scale?
The fact is that petabytes do not fit on a single machine. The good news is that we do not need to reinvent the wheel: about 99% of what we have learned over 48 years of SQL and relational technology can be reused:
Relational algebra: selection, projection, grouping, sorting, joining
Language: SQL, declarative languages, functional languages, optimizations, query plans, indices
Components of tables: table, rows, columns, primary key
NoSQL is the solution. Although it relaxes constraints that SQL enforces, such as tabular integrity, domain integrity, atomic integrity (1NF), 2NF, 3NF, and BCNF, it offers features that SQL does not. It supports:
Heterogeneous data: Data that is not in first normal form.
Nested data: Data that does not fulfill domain integrity (it may not even have a schema) and may not fulfill relational integrity either.
Denormalized data: At large scales, it becomes desirable to sometimes denormalize, rather than normalize data, in order to get better performance.
The technology stack¶
The technology stack we are going to rebuild goes from a level very close to the hardware (storage) and all the way to interaction with the end user (query language or even voice assistant).
The following table gives examples and a short description for each layer of the stack:
Layer | Examples | Description |
---|---|---|
Storage | Local file system, NFS, HDFS, S3 | Systems to store data, including local, network, distributed, and cloud storage options. |
Encoding | UTF-8, BSON | Methods to encode data into bits for computer processing, including textual encodings like UTF-8 and binary encodings like BSON. |
Syntax | CSV, XML, JSON, RDF/XML | Formats used to represent data in a human-readable text form, or in binary form for efficient storage. |
Data models | Relational model, RDF triples, XML infoset | High-level representations of data, such as models for tables (relational), trees, and graphs. |
Validation | XML Schema, JSON Schema, XBRL taxonomies | Languages used to impose constraints and validate the structure of data. |
Processing | MapReduce, Spark, Flink | Frameworks that process and analyze data, ranging from older two-stage systems to newer dataflow engines. |
Indexing | Hash indices, B+-trees, spatial indices | Techniques used to speed up data lookups, generally independent of the data's shape. |
Data stores | RDBMS, MongoDB, Snowflake | Products that provide a complete set of functionalities for storing and managing data, including relational databases and document stores. |
Querying | SQL, SPARQL, Cypher | Languages used to query different data models, such as SQL for tables, SPARQL for graphs, and others for trees and graphs. |
User interfaces | Tableau, Qlikview, Siri | Tools and applications that abstract away the query language, providing an easy-to-use interface for users. |
In this section, we only focus on the storage stack.
Databases vs. Data Lakes¶
Data must be stored somewhere, and there are two main paradigms for storing and retrieving it:
Extract-Transform-Load (ETL): Data is imported into a database, requiring extra work before querying. This is used by relational databases and some NoSQL systems like MongoDB and HBase. ETLing data allows for faster queries by storing it in a proprietary format and building indices.
Data Lake Paradigm: Data is stored on a file system (local, distributed, or cloud) and queried in place using tools like Pandas, Apache Spark, or RumbleDB. This approach allows immediate querying without ETL but is generally slower, best suited for full data scans (e.g., MapReduce, Spark).
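For instance, with the data lake paradigm, a query tool can read a dataset straight from object storage without any prior import. A minimal sketch with pandas (assuming the s3fs package is installed; the bucket and path are hypothetical):

```python
# Query a CSV file in place on an object store (sketch).
# Requires the s3fs package; the bucket and path are made up.
import pandas as pd

df = pd.read_csv("s3://my-data-lake/datasets/stars/part-0001.csv")
print(df.head())
```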
From Your Laptop to a Data Center¶
Local file systems¶
On a single machine, data is typically stored on the local file system, limited by disk capacity to a few terabytes. SSDs (solid-state drives), though more expensive and often smaller in capacity, now commonly hold one or more terabytes. A file on the local file system consists of content and metadata. The content is sliced into blocks (around 4 kB each) and read block by block for better performance. Metadata includes file attributes like name, access rights, owner, modification time, and size. Files are organized hierarchically in directories, a structure familiar to most users.
More users, more files¶
How can this file system be made accessible to more users? One way is to make a disk accessible through the network, a local area network (LAN), and share it with other people.
There also exist larger networks, wide area networks (WANs), but it is difficult to scale concurrent access, and problems can arise when two people work on a file at the same time.
Another difficulty is that a local file system cannot support billions of files easily.
To make this scale possible, the approach is called object storage:
- throw away the hierarchy: there are no directories
- make the metadata flexible: attributes can differ from file to file (no schema)
- use a very simple, if not trivial, data model: a flat list of files (called objects), each identified with an identifier (ID); blocks are not exposed to the user, i.e., an object is a black box
- use a large number of cheap machines rather than a few supercomputers; an analogy is the battery of Tesla vehicles, which is made of many small cells
Scale¶
Aspect | Scale Up | Scale Out |
---|---|---|
Definition | Increase the resources (memory, CPU, disk) of a single machine. | Add more machines and distribute the workload across them. |
Cost | Becomes exponentially more expensive as resources increase. | Typically more cost-effective; allows adding lower-cost machines. |
Complexity | Relatively simpler, but limited by the capacity of a single machine. | Involves more complexity due to distributed system management. |
Limitations | Has hard limits in terms of cost and hardware capabilities. | Can theoretically scale indefinitely with more machines. |
Use Case | Useful for small-to-moderate data sizes that fit on a single machine. | Best for handling large datasets that don't fit on a single machine or need high throughput. |
Performance Gain | Improves performance linearly but with diminishing returns. | Performance improves by adding more machines, with better parallelism. |
Example | Upgrading a server from 8 GB to 32 GB of RAM. | Adding multiple servers to handle 50 TB of data. |
Scale up process (from left to right)
Scale out process (from left to right)
Optimizing code should always come before scaling up or scaling out. Generally speaking, scaling out in the context of data processing is typically worth it in two cases:
the data does not fit on the single local machine
the single local machine cannot read the data fast enough
Data Centers¶
In data centers, clusters typically consist of thousands to tens of thousands of machines, but scaling beyond this is limited due to power and cooling constraints. A data center can consume as much electricity as an airport, making it difficult to manage over 100,000 machines. Organizations like CERN are shifting toward fewer servers but with more cores per server, with modern servers (or "nodes") having between 1 and 64 cores, and memory ranging from 16 GB to 6 TB. Local storage varies between 1 to 20 TB, while network bandwidth typically ranges from 1 to 100 Gb/s, with higher speeds possible in High-Performance Computing (HPC) contexts. The highest network speeds are within the same cluster, but connections across clusters or data centers can be slower.
Data centers are organized into racks, which hold flat, rectangular servers or other components like storage disks and network switches. Each rack module typically takes up 1 to 4 rack units (RU), and racks are arranged in rooms to form a cluster. Large cloud providers, like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud, dominate the market, with most companies renting virtual machines or high-level services from these providers instead of managing their own physical servers. One common service offered is object storage.
Object Stores¶
With thousands of affordable commodity servers ready, we can focus on how to store large amounts of data. Most cloud providers offer object storage as a service, allowing users to utilize these solutions without needing to install anything.
Amazon S3¶
Amazon's object storage, known as Simple Storage Service (S3), organizes data into buckets identified by Bucket IDs, with each object (or file) assigned a unique Object ID.
Objects in S3 can be any file type, and unlike traditional file systems, S3 uses a flat structure rather than a hierarchical one. Each object can be up to 5 TB, but uploads larger than 5 GB must use a multipart upload, in which the object is split into several parts. Users can create up to 100 buckets by default, with the option to request more.
Azure Blob Storage¶
Azure Blob Storage is another object storage service similar to Amazon S3, but with key differences. It uses three identifiers: Account, Container, and Blob, and provides more transparency about how objects are divided into blocks.
Azure differentiates between Block Blobs, Append Blobs, and Page Blobs, with varying maximum sizes (up to 190.7 TB for Block Blobs). Physically, Azure is organized into storage stamps across data centers, each holding up to 30 PB but limited to 80% capacity to prevent full storage, reallocating data as needed.
Comparison between Amazon S3 and Azure Blob Storage¶
Aspect | Amazon S3 | Azure Blob Storage |
---|---|---|
Object ID | Bucket + Object | Account + Container + Blob |
Object API | Parts | Block / Append / Page |
Limit | 5 TB | 190.7 TB (block) / 195 GB (append) / 8 TB (page) |
Guarantees and Service Levels¶
Service Level Agreements¶
Cloud services come with a Service Level Agreement (SLA), a contract that outlines guarantees like durability and availability. For example, Amazon S3 promises 99.999999999% durability (losing less than one object in 100 billion annually) and 99.99% availability (less than 1 hour of downtime per year).
SLAs often include uptime percentages: 99% allows about 4 days of downtime per year, while 99.99999% allows less than 4 seconds. S3 doesn't guarantee latency, as it depends on location and network speed, but it does guarantee throughput: 5,500 reads/s and 3,500 writes/s (per prefix).
The following is a table that shows the SLA and corresponding outage:
SLA | Outage |
---|---|
99% | 4 days / year |
99.9% | 9 hours / year |
99.99% | 53 minutes / year |
99.999% | 6 minutes / year |
99.9999% | 32 seconds / year |
99.99999% | 4 seconds / year |
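These figures follow from a simple back-of-the-envelope computation; a minimal sketch:

```python
# Maximum yearly outage allowed by an uptime SLA (sketch).
SECONDS_PER_YEAR = 365 * 24 * 3600

def allowed_downtime_seconds(sla_percent: float) -> float:
    """Return the maximum yearly downtime, in seconds, for a given uptime SLA."""
    return SECONDS_PER_YEAR * (1 - sla_percent / 100)

for sla in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    print(f"{sla}% -> {allowed_downtime_seconds(sla):,.0f} s/year")
```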
The CAP Theorem¶
Early relational database management systems relied on ACID properties. In order to scale out, many distributed systems have to compromise on the transactional guarantees that they offer. This is best explained with the so-called CAP theorem.
The CAP theorem is basically an impossibility triangle: a system cannot guarantee all of the following at the same time:
Atomic consistency: at any point in time, the same request to any server returns the same result; in other words, all nodes see the same data.
Availability: the system is available for requests at all times (SLA with very high availability).
Partition tolerance: the system continues to function even if the network linking its machines is occasionally partitioned.
The CAP theorem is, in fact, not formally proven and it would be more appropriate to call it a conjecture. An intuitive way of explaining it is as follows.
When an update is made via some server, this update must propagate to other nodes to the extent that a specific replication factor for the data is met.
If there is a partition of the network leading to two disconnected sub-networks and the propagation of some data did not have a chance to complete, then there are two possible design decisions:
- either the system is temporarily put to a halt (synchronous propagation) and thus becomes unavailable until the network partition is resolved and the propagation can be completed. In this case the system is atomically consistent but not available (CP).
- or the servers continue to serve requests, but the data served by the two disconnected sub-networks will be different. In this case the system is available but not atomically consistent (AP). Such systems propagate updates asynchronously and are also often called eventually consistent in the sense that, if one hypothetically would no longer receive updates, there exists a time at which the system will be consistent. In practice of course, as updates keep arriving, the system rarely becomes fully consistent - a common misconception about eventual consistency.
A third option is for the system to forgo partition tolerance, remaining available and consistent (AC) as long as no network partition occurs. During a partition, no guarantees apply. Be cautious of systems claiming to be available, consistent, and partition-tolerant (CAP) simultaneously—this is often marketing. In reality, companies may downplay network partitions due to their rarity, but when one occurs, they must still choose between availability (A) or consistency (C), even if users don't notice.
Generalized CAP: PACELC theorem
PACELC stands for Partition, Availability, Consistency, Else, Latency, and Consistency. The theorem states that in the event of a network partition, a distributed system must choose between availability and consistency; otherwise, it must choose between latency and consistency.
In the normal case (rows) vs. in case of a partition (columns) | Consistency | Availability |
---|---|---|
Consistency | PC/EC | PA/EC |
Low latency | PC/EL | PA/EL |
REST APIs¶
Data stores, including object stores, key-value stores, document stores, wide column stores, and distributed file systems, typically provide their functionality through APIs. A significant majority of these data stores offer a REST API, which stands for REpresentational State Transfer. REST is essentially "HTTP done right," as it effectively utilizes HTTP methods and resources, providing a standardized way for clients to interact with data stores over the web.
The advantages of offering a REST API are substantial. It simplifies integration with various programming languages, as HTTP clients are available for almost every host language, including Java, Python, PHP, and R. This ubiquity allows developers to create "wrapper APIs" in their preferred language that forward requests through the REST API to the underlying data store. Consequently, designing a new system that supports a REST API is often an obvious choice for developers.
While REST APIs are common, other integration options exist, such as native drivers that allow direct communication with the data store. However, creating these drivers requires separate implementations for each programming language, which can be resource-intensive. MongoDB is a notable example of a document store that utilizes native drivers to facilitate this direct interaction.
Another protocol worth mentioning is the XML-based SOAP protocol, which also enables communication between clients and servers. The foundation for REST, the HTTP protocol, was invented by Sir Tim Berners-Lee in 1989, laying the groundwork for the modern web. In this client-server architecture, communication occurs through HTTP methods applied to various resources, enabling a wide range of functionalities for data management and retrieval.
A resource can be anything: a document, a PDF, a person, a calendar entry, or even a physical object such as a smart plug ("Web of things"). A resource is referred to with what is called a URI. URI stands for Uniform Resource Identifier. A URI looks like so:
http://www.example.com/api/collection/foobar?id=foobar#head
where
- `http` is the scheme
- `//www.example.com` is the authority
- `/api/collection/foobar` is the path
- `?id=foobar` is the query
- `#head` is the fragment
Example of http request
Bucket resource, object resource
There exist other schemes than http, for example mailto, ftp, hdfs, file, and so on. However, http is the most popular scheme, even for resources that are "offline", i.e., not actually reachable via HTTP.
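As a quick illustration, Python's standard library can decompose the example URI above into these five components:

```python
# Decompose the example URI into scheme, authority, path, query and fragment.
from urllib.parse import urlparse

parts = urlparse("http://www.example.com/api/collection/foobar?id=foobar#head")
print(parts.scheme)    # 'http'
print(parts.netloc)    # 'www.example.com' (the authority)
print(parts.path)      # '/api/collection/foobar'
print(parts.query)     # 'id=foobar'
print(parts.fragment)  # 'head'
```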
A client can act on resources by invoking methods, with an optional body. The most important methods are:
- `GET` (without a body): this method returns a representation of the resource in some format (text, XML, JSON...). `GET` should have no side effects (beyond logging, of course).
- `PUT`: this method creates or updates a resource from a representation of a newer version of it, in some format (text, XML, JSON...). `PUT` has the side effect that a subsequent `GET` asking for the same format should return the same representation. `PUT` is idempotent, in that calling it twice with the same resource and body is identical to calling it just once.
- `DELETE` (without a body): this method deletes a resource. `DELETE` has the side effect that a subsequent `GET` asking for a representation of the resource should fail with a not-found (404) error.
- `POST`: this method is a blank check, in that it acts on a resource in any way the data store sees fit; the behavior, of course, should be publicly documented. A typical use of `POST` is to create new resources while letting the REST server pick a resource URI for them.
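A minimal sketch of these methods in Python, against a purely hypothetical endpoint (real object stores additionally require authentication, e.g., signed requests):

```python
# REST calls against a hypothetical resource URI (sketch).
import requests

resource = "https://objectstore.example.com/my-bucket/my-object"

# PUT: create or update the resource from a new representation.
requests.put(resource, data=b'{"hello": "world"}',
             headers={"Content-Type": "application/json"})

# GET: retrieve a representation of the resource (no side effects).
response = requests.get(resource)
print(response.status_code, response.text)

# DELETE: remove the resource; a subsequent GET should return 404.
requests.delete(resource)
```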
Object Stores in Practice¶
Object stores utilize a flat key-value model, but most services emulate a file hierarchy by allowing slashes (/) in keys, interpreting them as virtual paths. While slashes are treated as regular characters in the storage layer, this logical hierarchy enhances the usability of object stores for various applications.
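A minimal sketch with boto3 (assuming AWS credentials are configured; the bucket name and keys are hypothetical):

```python
# Slashes in S3 keys emulate a hierarchy, although the key space is flat (sketch).
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # hypothetical bucket

# The key looks like a path, but to S3 it is one flat key.
s3.put_object(Bucket=bucket,
              Key="datasets/stars/part-0001.json",
              Body=b'{"ra": 10.68, "dec": 41.27}')

# Listing with a prefix and delimiter makes the flat keys feel like directories.
response = s3.list_objects_v2(Bucket=bucket, Prefix="datasets/", Delimiter="/")
for common_prefix in response.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])  # e.g. 'datasets/stars/'
```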
Static website hosting¶
Object storage solutions like S3 and Azure Blob Storage facilitate straightforward static website hosting. Static websites serve content without dynamic generation, relying on HTML pages, images, CSS stylesheets, and client-side JavaScript. Storing these assets in object storage is convenient and cost-effective, with cloud providers offering easy access via HTTP URLs. Additionally, using a content delivery network (CDN) on top of the storage bucket can improve performance by caching files globally, ensuring faster delivery, especially during high traffic periods.
Dataset Storage¶
Object stores are also popular as data lakes, allowing datasets to be stored as objects in various formats such as CSV or JSON. Typically, a dataset comprises multiple objects, enabling larger data sizes beyond the individual object limits (e.g., 5 TB on S3). This practice, known as sharding, is essential for handling large datasets and will be crucial as we explore concepts like MapReduce and Spark in this course.
Key-value Stores¶
Object stores like Amazon S3 and Azure Blob Storage can store large objects (up to TBs) but have higher latency (100 ms or more) compared to traditional databases (1-9 ms). They use a key-value model with a simple API, but unlike key-value stores, they allow for metadata storage alongside objects.
Key-value stores, known for "eventual consistency," are optimized for smaller objects (up to hundreds of kB) and typically lack complex APIs and atomic consistency guarantees. They excel in scalability and are suited for use cases like shopping carts or social media interactions, while object stores are more about large, unstructured data.
A key-value store differs from a typical relational database in three aspects:
Its API is considerably simpler than that of a relational database (which comes with query languages)
It does not ensure atomic consistency; instead, it guarantees eventual consistency
A key-value store scales out well, in that it is very fast also at large scales
Key-value store API¶
A key-value store, at a high level, is based on the key-value model; it can be seen as a gigantic associative array (also known as a "map" or "dict") that is constantly modified and accessed.
A key-value store typically exposes three access methods, which should not come as a surprise:
- `get(key)` retrieves the value associated with a key.
- `put(key, value)` associates a new value with the key.
- `delete(key)` removes any association of a value with the key.
In practice, the API is slightly more complex because of conflict resolution. In the case of Dynamo, for example, `get(key)` not only returns value(s), but also a context indicating potential conflicts, which for now can be seen as a black box that we will get back to. Likewise, `put` has a third parameter, context (which is typically forwarded "as is" from a previous `get` call), and `put(key, context, value)` associates the key with a new value while resolving any conflict.
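The following in-memory toy mimics this API shape; the context handling is heavily simplified (a real store like Dynamo returns vector clocks and possibly several conflicting values):

```python
# A toy key-value store with a Dynamo-style get/put context (sketch).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        """Return (values, context); here at most one value and a dummy context."""
        value = self._data.get(key)
        values = [] if value is None else [value]
        context = {"key": key}  # placeholder for conflict information
        return values, context

    def put(self, key, context, value):
        """Associate the key with a new value, using the context to resolve conflicts."""
        self._data[key] = value

    def delete(self, key):
        """Remove any association of a value with the key."""
        self._data.pop(key, None)

store = KeyValueStore()
store.put("cart-42", context=None, value=["book", "pen"])
values, context = store.get("cart-42")
store.put("cart-42", context, values[0] + ["mug"])
```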
Design principles¶
We will focus on a specific key-value technology for our explanations: Amazon Dynamo. It is itself based (with some modifications) on the Chord protocol, which is a Distributed Hash Table.
Distributed hash tables are generally highly scalable, robust against failure, and self-organizing. However, they do not support range queries, do not guarantee data integrity (which is pushed to the user), and do not address security issues.
On the physical level, a distributed hash table is made of nodes (the machines we have in a data center, piled up in racks) that work following a few design principles.
The first design principle is incremental stability. This means that new nodes can join the system at any time, and nodes can leave the system at any time, sometimes gracefully, sometimes in a sudden crash.
The second principle is symmetry: no node is special in any way.
The third principle is decentralization: there is no "central node" that orchestrates the others. One may wonder what the difference from symmetry is: a symmetric system could still have protocols that elect a leader, which then takes up an orchestration role. This is not the case in a distributed hash table.
The fourth principle is heterogeneity: the nodes may have different CPU power, amounts of memory, etc.
A ring to rule them all¶
Nodes can be connected in two main ways: peer-to-peer (every node communicates with every other node) and centralized (a Coordinator Node controls Worker Nodes). In distributed hash tables, a peer-to-peer network is preferred for its symmetry and decentralization.
A key aspect of distributed hash table design, especially in the Chord protocol, is that each logical key is hashed into IDs represented in bits. For Dynamo, this hash consists of 128 bits (16 bytes).
Mathematically, all possible IDs naturally form a ring structure (consider $\mathbb{Z} / 2^{128} \mathbb{Z}$). However, in this context, we only need to conceptualize this ring as a circle, without relying on its mathematical properties.
The next conceptual step of the design is that all nodes get a random position on that ring, i.e., a random 128-bit number.
Which node stores what key is then decided based on the ring topology. Given a key $k$, its 128-bit hash is taken and it corresponds to a position on the ring. From that position on, one follows the ring clockwise, until one reaches one of the nodes (remember, they were logically assigned a position on the ring); this node is responsible for storing key $k$.
The domain of responsibility of a single node $n$ is thus the ring interval separating it from the next node counterclockwise: all keys whose hash lands in that interval are stored on, and the responsibility of, node $n$.
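A toy version of this lookup, assuming MD5 as the 128-bit hash and made-up node names:

```python
# Consistent hashing on a ring (sketch): a key is stored on the first node
# found clockwise from the key's position.
import bisect
import hashlib

def ring_hash(s: str) -> int:
    """128-bit position on the ring (MD5 used purely for illustration)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

nodes = ["node-A", "node-B", "node-C", "node-D"]
ring = sorted((ring_hash(n), n) for n in nodes)
ring_positions = [pos for pos, _ in ring]

def responsible_node(key: str) -> str:
    # First node position clockwise from the key's hash, wrapping around.
    i = bisect.bisect_right(ring_positions, ring_hash(key)) % len(ring)
    return ring[i][1]

print(responsible_node("shopping-cart-42"))
```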
Now, what happens when nodes come and go? When a new node $n$ joins the cluster, it gets assigned a position on the ring. This position is in the domain of responsibility of some existing node $m$. This ring interval is now split: the part between $n$ and $m$ remains the responsibility of $m$, while the part from the previous node up to $n$ becomes the responsibility of $n$. The corresponding data needs to be transferred from $m$ to $n$.
What about when a node leaves? Let us start with the case of a graceful exit, i.e., the node (call it $n$) informs the rest of the cluster before leaving. Then, it will transfer all its interval of responsibility to the next node clockwise (call it $m$) on the ring: two intervals merge into a single one. Here too, data needs to be transferred from $n$ to $m$.
Now, let us look into an abrupt exit: the node crashes with no opportunity to inform the cluster or prepare its exit.
And we have a problem, because the data on that node is gone. This problem is solved with a fundamental principle that will follow us throughout the course: data replication. We slightly change the design: the data within an interval is stored not only on the node that follows the interval clockwise, but on the next $N$ nodes that follow the interval clockwise. Equivalently, a node is responsible not only for the interval immediately preceding it (counterclockwise), but for the $N$ intervals preceding it.
Preference lists¶
To locate the node(s) holding the value for a given key (via its 128-bit hash) in a decentralized system, clients must connect to specific nodes. The Chord protocol uses a finger table for efficient searching, allowing nodes to reach others in jumps of powers of two, which yields logarithmic time complexity. In contrast, Dynamo employs "preference lists," where each node knows which nodes are responsible for each key (or key range), in decreasing priority. These lists are maintained through peer-to-peer synchronization or shared databases.
Each list must have at least N nodes, with the first node designated as the coordinator. Two parameters, R and W (where $R + W > N$), dictate read and write operations: $R$ is the minimum nodes from which to read a value, and $W$ is the minimum nodes that must confirm storage before a write succeeds. Initially, a request is sent to a random node via a load balancer, which then redirects it to the coordinator. If that fails, the request is sent to the next node. The coordinator waits for $R$ or $W$ confirmations from the other nodes before responding to the client.
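The condition $R + W > N$ guarantees that any read quorum intersects any write quorum, so a read always reaches at least one node holding the latest acknowledged write. A small sketch that checks this exhaustively for made-up parameters N = 3, R = 2, W = 2:

```python
# Quorum intersection check (sketch): with R + W > N, every possible read set
# overlaps every possible write set of replicas.
from itertools import combinations

N, R, W = 3, 2, 2
replicas = range(N)

always_overlap = all(set(reads) & set(writes)
                     for reads in combinations(replicas, R)
                     for writes in combinations(replicas, W))
print(always_overlap)  # True whenever R + W > N
```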
An improvement: tokens and virtual nodes¶
The design described so far has two issues: first, we might be out of luck with the (logical) distribution of the nodes on the ring, with some nodes responsible for large intervals and some others for smaller ones. Second, the nodes may not all have the same hardware resources: some may have more memory or less CPU, etc.
These two issues are addressed by virtualizing the nodes: instead of directly mapping each node to one position on the ring, a larger quantity of (virtual) nodes, called tokens, are positioned on the ring.
For example, there might be 100 nodes in the cluster, and we will position 1000 tokens on the ring. Then, the tokens are assigned to the nodes, in numbers that match their resources: a large node may get, say, 20 tokens, while a small one may get 3. Then the system works as above, each physical node handling the responsibilities assigned with all of its assigned tokens.
Adding or removing a node is straightforward: all tokens of a node leaving the network are redistributed to other nodes (not necessarily all to the same node), and a node joining the network takes over tokens from other nodes (not necessarily all from the same node). A sketch of token placement follows below.
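Continuing the toy ring above, tokens can be placed on the ring in numbers proportional to each physical node's capacity (node names and token counts are made up):

```python
# Virtual nodes (tokens) on the ring (sketch): larger machines get more tokens.
import hashlib

def ring_hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

tokens_per_node = {"node-A": 20, "node-B": 3, "node-C": 10}

# One ring position per token; token "node-A#7" is handled by physical node-A.
ring = sorted(
    (ring_hash(f"{node}#{i}"), node)
    for node, count in tokens_per_node.items()
    for i in range(count)
)
# A key is then stored on the physical node owning the first token clockwise
# from it, exactly as in the previous sketch.
```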
Partition tolerance and vector clocks¶
We now analyze how a distributed hash table behaves in the presence of network partitions. As we saw previously, when there is a network partition a system will either crash (CA), become temporarily unavailable (CP), or accept inconsistencies, downgrading the guarantee to eventual consistency (AP). Distributed hash tables, including Dynamo, are typically AP.
A fundamental conceptual tool in AP systems is the use of vector clocks. In a system with no partitions, versions of the data can be seen as linear, and could be identified with integers (in fact, some code repository systems do exactly this).
But in case there are network partitions, the versions may fork, forming a DAG (Directed Acyclic Graph) structure rather than a line. Vector clocks are a way to annotate the versions when they follow a DAG structure.
A vector clock can logically be seen as a map from nodes (machines) to integers, i.e., the version number is incremented per machine rather than globally. A version number increases for a machine when this machine processes the corresponding entry (key-value pair) and updates it. For machines that have not yet touched a given entry, the corresponding number (implicitly 0) can be left out of the vector clock: it gets added the first time the machine updates the entry.
In the following example, node A creates the entry and associates it with the vector clock
{ "A" : 1 }
Then node A (which is probably at the top of the preference list) updates the entry again and it increases:
{ "A" : 2 }
Then for some reason, the entry was processed, separately and without knowing about each other (network partition!) by nodes B and C. Node B updates the value and associates it with vector clock
{ "A" : 2, "B" : 1 }
while node C, who does not know about node B, does the same and associates it with vector clock
{ "A" : 2, "C" : 1 }
We now have two conflicting versions, which will be returned to the user upon getting the value for that key (assuming it sees both node B and C, i.e., the network partition is over), together with the context, which "merges" the conflicting vector clocks with the max integers:
{ "A" : 2, "B" : 1, "C" : 1 }
This context is passed (via put()) to node A for another update to resolve the conflict, leading to the new vector clock:
{ "A" : 3, "B" : 1, "C" : 1 }
Note that vector clocks can be compared to each other with a partial order relation $\le$. A partial order is a relation that is antisymmetric (if the order holds in both directions, then the vector clocks must be equal), reflexive (the order always holds if both sides are the same vector clock), and transitive (if the order holds between vector clocks A and B as well as between B and C, then it must hold between A and C), but it might not be total, in the sense that two vector clocks might not be comparable (the order is undefined between them).
More specifically, a vector clock is smaller or equal to another if for each machine, the two associated integers are also smaller or equal (in the same direction). For example
{ "A" : 1 } ≤ { "A" : 1 }
{ "A" : 1 } ≤ { "A" : 2 }
{ "A" : 1 } ≤ { "A" : 2, "B" : 1 }
{ "A" : 1 } ≤ { "A" : 2, "C" : 1 }
{ "A" : 1 } ≤ { "A" : 3, "B" : 1, "C" : 1 }
{ "A" : 2 } ≤ { "A" : 2 }
{ "A" : 2 } ≤ { "A" : 2, "B" : 1 }
{ "A" : 2 } ≤ { "A" : 2, "C" : 1 }
{ "A" : 2 } ≤ { "A" : 3, "B" : 1, "C" : 1 }
{ "A" : 2, "B" : 1 } ≤ { "A" : 3, "B" : 1, "C" : 1 }
{ "A" : 2, "C" : 1 } ≤ { "A" : 3, "B" : 1, "C" : 1 }
A partial order, unlike a total order, may not have a maximum (greater than all others), but it may have several maximal elements instead (not smaller than any other). When there is a unique maximal element (which is then the maximum because there is a finite number of vector clocks), there is no conflict and only one value is associated with the key throughout the DHT; and conflicts can be recognized because there are several maximal elements (typically one for each former partition), at least until the conflict is resolved.
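A minimal sketch of this partial order on vector clocks represented as Python dicts (missing entries count as 0), reusing the clocks from the example above:

```python
# Vector clocks as dicts from node name to counter (sketch).
def leq(a: dict, b: dict) -> bool:
    """True if vector clock a is smaller than or equal to vector clock b."""
    return all(counter <= b.get(node, 0) for node, counter in a.items())

def in_conflict(a: dict, b: dict) -> bool:
    """Two clocks conflict when neither is smaller than or equal to the other."""
    return not leq(a, b) and not leq(b, a)

def merge(a: dict, b: dict) -> dict:
    """The merged context: the maximum counter per node."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

v_b = {"A": 2, "B": 1}
v_c = {"A": 2, "C": 1}
print(leq({"A": 1}, {"A": 2, "B": 1}))  # True
print(in_conflict(v_b, v_c))            # True: two maximal elements, i.e., a conflict
print(merge(v_b, v_c))                  # {'A': 2, 'B': 1, 'C': 1}
```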