The CAP Theorem

The first explanation I heard about the CAP Theorem reduced it to "consistency, availability, network partition tolerance: pick two out of three". However, as catchy as this statement was, I discovered that it wasn't accurate. In fact, the CAP Theorem has been widely misunderstood and misused since it was named in 2000 and formalized in 2002. In this blog post, I'll compile and clarify what I have learned from papers, blogs, books, and my distributed systems engineering class.

C, A, and P 

Let's define each of the letters before talking about the CAP theorem.

Consistency

Gilbert's and Lynch's 2002 paper considers only the linearizability consistency model. Linearizability is a guarantee on individual reads and writes of a single object. A total order on all operations must exist such that each operation appears to have executed instantly and atomically at some point between when it was called and when it sent back a response.

Time constraint for linearizability. Assume the value of x is initially 0. The blue bars are Client 1's requests. The red bars are Client 2's requests. 
Specifically,  the total order must follow two kinds of constraints:
  1. Time. Non-concurrent operations must obey real time constraints. For example, the first scenario in the diagram shows how reads that return before a write is called must return the old value. The second scenario in the diagram shows how reads that are called after a write returns must return the new value. Finally, in the third scenario, the read and write are concurrent (overlapping), so the read may return either the old value or the new value. 
  2. Value. Once a read returns some value, later reads must also return that value (or the value of a later write).
Note: Linearizability and serializability are two different concepts! Serializability is a property of transactions involving isolation. While linearizability concerns a single object, serializability is about potentially multiple objects. A system with serializable transactions is one where transactions appear to run serially even though they may have run concurrently in reality. In fact, serializability is the strongest isolation level.  

Availability

Availability means that "every request received by a non-failing node in the system must result in a response" eventually. Note that this means any node. If a client is disconnected from a leader in a single-leader system,  

Network Partition Tolerance 

Network partitions are not a design feature - they will happen in every distributed system. They are defined as faults where nodes are alive but disconnected from one another over an asynchronous network. 

What is the CAP Theorem? 

A distributed system cannot be both consistent and available. 

CP

Under network partitions, applications that need to be linearizable may become unavailable. For example, they may return an error, or wait until the network is up again. 

AP 

Under network partitions, applications that remain unavailable can continue to respond to requests despite nodes being disconnected from other. (Each node can continue to respond to requests independently.) 

What are the limits of the CAP Theorem? 

Criticism of relying too heavily and simplistically on the CAP Theorem (e.g. from Stonebraker, Abadi, Kleppmann), can be summarized as:
  • The CAP Theorem doesn't cover many, many types of errors that are frequently encountered in real world distributed systems. For example, nodes that went down, human errors, and application errors are not covered. 
  • Designing a distributed system should not boil down to such a simple statement. There are many complicated tradeoffs to make. 
  • Using the CAP Theorem to automatically give up consistency is bad practice. As Stonebraker wrote, "In the real world, giving up consistency does not improve availability. Hence, you are giving up consistency in exchange for nothing. This is a horrible engineering tradeoff, and the CAP theorem, therefore, encourages engineers to make awful decisions." This is because network partitions don't happen as often as other errors, so giving up consistency doesn't give us anything in return. 

Conclusion 

The CAP Theorem should be used with caution when working with distributed systems because it can be misleading, and is specific to network partition faults. However, it was historically influential, especially for the NoSQL databases trend. 

Popular posts from this blog

Building A Toy Language Interpreter in Go

Space Race: a simple Rust game built on the Bevy framework

Building a Toy Language Compiler in Go