tldr: Spinnaker
Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.
Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore (2011)
Rao et al.

Introduction
Spinnaker is a datastore developed at IBM to run on a large cluster of commodity servers in a single data center. Its features include:
- Key-based range partitioning
- 3-way replication
- Transactional get-put API, with a choice of either:
  - Strong read consistency
  - Timeline read consistency
Paxos
Spinnaker's replication protocol is a variation of the Multi-Paxos protocol. It will be discussed in further detail later, but for now, here's an overview of Paxos.
Paxos lets 2F+1 replicas reach consensus on a value while tolerating up to F failures. Once consensus is reached, the value can be accessed as long as a majority of replicas are up. There are 2 phases in Paxos:
- Leader election:
  - Propose. A node that wants to propose a value v sends a prepare message with a proposal number n.
  - Promise. When a node receives a prepare message, it checks n:
    - If n is greater than the n of any prepare message it has already responded to, the node replies with a promise message. This message also includes the highest-numbered proposal (its n and corresponding v) that the receiver has already accepted, if any.
    - Otherwise, the node replies with a nack.
- Acceptance. The leader sends the value to be accepted, and the followers accept it with an ack:
  - Accept. If the proposer receives promises from a majority of nodes, it sends an accept message with v and n to those nodes. The accept message carries the v corresponding to the highest-numbered proposal returned in the promises, so it is not necessarily the value the proposer originally proposed.
  - Ok. When a node receives an accept message, it replies with an ok if n is at least as large as the highest proposal number it has promised.
The above process summarizes Paxos when there is a single value to agree on. Spinnaker uses a variation of the Multi-Paxos protocol, which reaches consensus on a sequence of values.
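The single-value protocol above can be sketched as an acceptor's state machine. This is a hypothetical illustration (the paper gives no code); state is kept in memory, and a real acceptor would have to persist it and exchange real messages.

```python
# Minimal single-decree Paxos acceptor sketch (hypothetical names).
class Acceptor:
    def __init__(self):
        self.promised_n = -1   # highest proposal number promised so far
        self.accepted_n = -1   # proposal number of the accepted value, if any
        self.accepted_v = None

    def on_prepare(self, n):
        # Promise only if n beats every proposal number seen so far.
        if n > self.promised_n:
            self.promised_n = n
            # Return any previously accepted (n, v) so the proposer can adopt it.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("nack",)

    def on_accept(self, n, v):
        # Accept unless a higher-numbered prepare has been promised since.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("ok",)
        return ("nack",)
```

Note how `on_prepare` hands back the highest-numbered accepted proposal: this is what forces a later proposer to re-propose an already-chosen value instead of its own.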
Architecture
Data is organized into tables of rows, where each row has a unique key. As shown in the figure above, rows are distributed across the cluster using range partitioning. Each group of nodes (3 by default) that replicates a specific key range is called a cohort. (A node can be in more than one cohort.)

Nodes
Each node uses a single shared write-ahead log for performance, while each cohort logs under its own unique log sequence numbers (LSNs). Each node also keeps a commit queue in main memory, holding writes that are not yet committed. Committed writes are stored in a memtable, which is periodically flushed to an SSTable structure on disk.

Zookeeper
Spinnaker uses Zookeeper, a coordination service that manages metadata and events.

Replication Protocol
Spinnaker replicates using its Paxos variant. Each cohort consists of an elected leader and its followers. The protocol is divided into a leader election phase and a quorum phase, in which the leader proposes a write and the followers accept it.

Write
The figure above shows the write process:
- A client submits a write, which gets routed to the leader of the affected key range.
- The leader appends the write to its log. Then, in parallel, the leader:
- Begins a log force to disk.
- Appends the write to its commit queue and sends propose messages with the write to its followers.
- When the followers receive a propose message, they force a log record for the write to disk. Then, they append the write to their commit queue, and send an ack to the leader.
- After the leader gets an ack from at least one follower, a majority of the cohort has logged the write, so the leader applies the write to its memtable, committing it.
- The leader then sends a reply to the client.
The leader sends periodic asynchronous commit messages to the followers to apply all pending writes up to a specific log sequence number to their memtable. This log sequence number is called the last committed LSN, and is saved by all the machines for recovery.
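The leader's side of this write path can be sketched as follows. The class and method names are assumptions, and log forcing, propose messages, and the periodic commit messages are stubbed out; only the commit rule is modeled.

```python
# Hypothetical sketch of the leader's write path for a three-node cohort.
class Leader:
    def __init__(self, cohort_size=3):
        self.cohort_size = cohort_size
        self.next_lsn = 0
        self.commit_queue = {}       # lsn -> (key, value), not yet committed
        self.memtable = {}           # committed key -> value
        self.last_committed_lsn = -1
        self.acks = {}               # lsn -> number of follower acks received

    def on_client_write(self, key, value):
        lsn = self.next_lsn
        self.next_lsn += 1
        # (1) Append to the log and begin forcing it to disk -- elided here.
        # (2) In parallel: append to the commit queue and propose to followers.
        self.commit_queue[lsn] = (key, value)
        self.acks[lsn] = 0
        return lsn

    def on_follower_ack(self, lsn):
        self.acks[lsn] += 1
        # Leader plus acks form a majority of the cohort: commit the write.
        if 1 + self.acks[lsn] > self.cohort_size // 2 and lsn in self.commit_queue:
            key, value = self.commit_queue.pop(lsn)
            self.memtable[key] = value           # applied => committed
            self.last_committed_lsn = max(self.last_committed_lsn, lsn)
            return True                          # safe to reply to the client
        return False
```

With a cohort of three, a single follower ack plus the leader itself is already a majority, which is why the leader can commit after one ack.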
Read
Strongly consistent reads are routed to a cohort's leader. Timeline consistent reads can be routed to any node, so the client might receive a stale value.

Leader Election
Each leader is elected independently within each cohort. Whenever a cohort's leader has failed or the system restarts, leader election begins. Here's a simplified version of the process:
- Each node in the cohort announces that it is a candidate by broadcasting its last log sequence number.
- Once a majority of the cohort becomes candidates, each node in the cohort checks to see if it's the new leader. The new leader is the candidate with the maximum last log sequence number.
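The simplified election rule above can be sketched as a single function (the function name and tie-breaking by node id are assumptions; the post doesn't specify how ties are resolved):

```python
# Hypothetical sketch of the simplified election rule.
def elect_leader(candidates, cohort_size=3):
    """candidates: dict of node_id -> last log sequence number announced."""
    if len(candidates) <= cohort_size // 2:
        return None  # not yet a majority of candidates
    # The candidate with the maximum last LSN wins; ties broken by node id.
    return max(candidates, key=lambda node: (candidates[node], node))
```

Picking the maximum last LSN is what guarantees the new leader's log contains every committed write, since each committed write reached a majority of the cohort.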
Recovery
Follower Recovery
We first discuss how to recover from follower failure. It happens in 2 phases:
- Local recovery. The follower safely re-applies log records from its most recent checkpoint through its last committed LSN.
- Catch up. The follower sends its last committed LSN to the leader, and the leader responds with all committed writes after that LSN, bringing the follower up to date.
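The catch-up phase can be sketched as follows (a hypothetical illustration: the leader's log is modeled as a dict from LSN to write, which elides real log storage):

```python
# Hypothetical sketch of follower catch-up.
def catch_up(leader_log, leader_last_committed_lsn, follower_last_committed_lsn):
    """Return the committed writes the follower is missing, in LSN order."""
    return [
        (lsn, leader_log[lsn])
        for lsn in sorted(leader_log)
        if follower_last_committed_lsn < lsn <= leader_last_committed_lsn
    ]
```

Only committed writes are streamed: anything past the leader's last committed LSN is still pending and is handled by the normal quorum protocol, not by catch-up.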
Leader Recovery
When a leader fails, a new leader is elected (as described previously in this post). This new leader contains all committed writes from the old leader in its log. However, some writes from the old leader may have been not been committed yet. The new leader first catches up each follower to the new leader's last committed LSN. Then, once at least one follower has caught up, the leader sends writes not yet committed to the followers. Finally, the leader opens the cohort back up for new writes.