tldr: Dynamo, Amazon's Highly Available Key-Value Store

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

Dynamo [2007]

Introduction

Dynamo is an eventually consistent, distributed key-value store built by Amazon to be highly reliable and highly scalable. It's been deployed to handle storage for some of Amazon's core services. The diagram below shows Amazon's platform, and where Dynamo fits in.

System Architecture

The figure below summarizes the distributed systems techniques that Dynamo uses.
Its interface is made up of two operations: 
  • get(key)
    • Locates object replicas corresponding to the key
    • Returns object(s) with conflicting versions, and a context (defined below)
  • put(key, context, object)
    • Figures out where to store the replicas of the object based on the key
    • Writes replicas to disk
    • The context encodes system metadata (e.g. version) about the object, and helps the system validate the object in the put request
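The shape of this interface can be sketched with a toy, single-node stand-in (the class and method bodies below are illustrative only, not Amazon's actual client API):

```python
class Context:
    """Opaque to callers; carries system metadata such as the object's
    version back from get() into the next put()."""
    def __init__(self, vector_clock=None):
        self.vector_clock = vector_clock or {}

class ToyDynamo:
    """Toy in-memory stand-in that only illustrates the call shapes."""
    def __init__(self):
        self._store = {}  # key -> (list of versions, Context)

    def get(self, key):
        # Returns ALL conflicting versions plus a context for later puts.
        return self._store.get(key, ([], Context()))

    def put(self, key, context, obj):
        # Real Dynamo uses the context to version and place replicas;
        # this toy just records the latest write.
        self._store[key] = ([obj], context)
```

A real client passes the context it received from get() back into put(), which is how the system ties a write to the version(s) the client read.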

Partitioning

In order to scale incrementally, Dynamo uses consistent hashing to dynamically partition the data over a set of nodes. As in other systems, the hash function's output range is treated as a ring. Each node's ID is hashed to a position on this ring, and each data object's key is hashed to a position as well. Objects are assigned to nodes by position: the first node clockwise from the object's position stores that object. When a node is added or removed, only its immediate neighbors are affected. Since basic consistent hashing places nodes non-uniformly, which leads to uneven load, Dynamo uses virtual nodes: each physical node is assigned several positions on the ring.
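A minimal sketch of this scheme in Python (the vnode count and node names are illustrative; the paper does use MD5 to map keys onto the ring):

```python
import hashlib
from bisect import bisect_right

def ring_position(s: str) -> int:
    # Treat the MD5 output as a position on a circular hash space.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, vnodes: int = 8):
        self.vnodes = vnodes  # virtual nodes per physical node
        self.ring = []        # sorted (position, physical node) pairs

    def add_node(self, node: str) -> None:
        # A physical node claims several ring positions ("virtual nodes"),
        # smoothing out the uneven spacing of raw consistent hashing.
        for i in range(self.vnodes):
            self.ring.append((ring_position(f"{node}#{i}"), node))
        self.ring.sort()

    def node_for(self, key: str) -> str:
        # An object lands on the first node clockwise from its position;
        # the modulo wraps around the end of the ring.
        positions = [p for p, _ in self.ring]
        idx = bisect_right(positions, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]
```

Because lookups depend only on hashed positions, adding a node reassigns only the key ranges adjacent to its new positions, which is what makes scaling incremental.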

Replication 

In order to be highly available and durable, Dynamo replicates each data item on N hosts, where N is a parameter configured per instance. Each key has a coordinator node that stores the key locally and replicates it at the N-1 nodes that follow the coordinator clockwise in the ring.

The preference list is the list of distinct physical nodes responsible for storing a key. Every node knows the preference list for any key, as will be explained later. The preference list contains more than N nodes, since nodes can fail. 
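Building a preference list can be sketched as a clockwise walk over the ring, assuming a sorted list of (position, physical node) pairs as above (helper names are mine, not the paper's):

```python
import hashlib
from bisect import bisect_right

def ring_position(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def preference_list(key: str, ring, n: int):
    """Walk clockwise from the key's position, collecting the first n
    DISTINCT physical nodes. With virtual nodes, consecutive ring
    positions can belong to the same physical node, so duplicates are
    skipped -- the list is over physical, not virtual, nodes."""
    positions = [p for p, _ in ring]
    idx = bisect_right(positions, ring_position(key)) % len(ring)
    prefs = []
    for step in range(len(ring)):
        node = ring[(idx + step) % len(ring)][1]
        if node not in prefs:
            prefs.append(node)
        if len(prefs) == n:
            break
    return prefs
```

The first entry is the key's coordinator; the remaining entries are the replicas (plus extras, in the real system, to cover failures).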

Data Versioning 

Dynamo is eventually consistent. The shopping cart is one application where eventual consistency works well: if the newest version of the cart is unavailable, a user can still update an older version, and the divergent versions are reconciled later.

In Dynamo, each update produces a new, immutable version, and an object can have multiple concurrent versions. Usually, new versions subsume older ones. Sometimes, however, versions conflict, and it's up to the client application to collapse the version branches together. For the shopping cart application, "add to cart" operations are never lost, but deleted items may resurface.

Dynamo uses vector clocks to keep track of objects' versions. A vector clock is a list of (node, counter) pairs, and there's one vector clock per version per object. To compare two versions of an object: if every counter in the first version's clock is less than or equal to the corresponding counter in the second's, the first version is an ancestor of the second and can be discarded. Otherwise, the two versions conflict and must be reconciled.
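The comparison rule can be written out directly, representing a clock as a dict from node name to counter (a sketch, not the paper's code; a node absent from a clock counts as 0):

```python
def descends(a: dict, b: dict) -> bool:
    """True if the version with clock `a` descends from `b`: every
    counter recorded in b is <= the matching counter in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    a_newer, b_newer = descends(a, b), descends(b, a)
    if a_newer and b_newer:
        return "equal"
    if a_newer:
        return "a supersedes b"   # b is an obsolete ancestor
    if b_newer:
        return "b supersedes a"
    return "conflict"             # concurrent: client must reconcile
```

For example, `{"sx": 2}` supersedes `{"sx": 1}`, while `{"sx": 1, "sy": 1}` and `{"sx": 2}` are concurrent: neither clock dominates the other, so the client sees both versions.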

Ring Membership 

Because node outages are often transient, membership changes are explicit rather than automatic: an admin manually connects to a Dynamo node to add a node to a ring, or remove a node from a ring. The node that serves this request writes the (membership change, timestamp) record to a persistent store, and a gossip protocol propagates membership changes to the other nodes.

Adding/Removing Storage Nodes

When a new node is added to the system, it is assigned a set of positions scattered randomly around the ring. Existing nodes that currently own key ranges now covered by the new node transfer those keys to it. Removal works similarly, but in reverse: the leaving node's key ranges are handed back to the remaining nodes.
