tldr: FaRM
Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.
Optimistic Concurrency Control + FaRM [2015]
Dragojević et al.
Introduction
Before FaRM, distributed transactions were thought of as convenient for programmers, but slow. However, FaRM guarantees 90-100 million transactions/second on 90 machines! Those transactions are replicated 3 times (by default) and persistent. What is FaRM and how did it accomplish this?
Systems often have to make tradeoffs between availability, consistency, performance, and simplicity. However, FaRM is a main memory platform for data centers that doesn't compromise on any of these because it minimizes network, storage, and CPU bottlenecks by:
Systems often have to make tradeoffs between availability, consistency, performance, and simplicity. However, FaRM is a main memory platform for data centers that doesn't compromise on any of these because it minimizes network, storage, and CPU bottlenecks by:
- Assuming the whole dataset can fit in RAM, so no disk I/O
- Using non-volatile RAM, so no disk I/O
- Using a one-sided RDMA for a fast network
- Bypassing the OS kernel for fast network access
- Designing transaction and replication protocols optimized for one-sided RDMA
Programming model
An application using FaRM sees an abstraction of a global address space, divided into 2GB regions. Each region is replicated on a primary and some number of backups. Each server in a FaRM cluster stores several regions in non-volatile DRAM.
The figure below shows a cluster with 4 servers. Each server stores one incoming message queue for each of the other servers. Another server can send messages to its corresponding queue stored in the receiver through RDMA. The receiving server periodically polls.
FaRM uses ZooKeeper to ensure machines agree on and store the configuration of the cluster. However, it does not use ZooKeeper for failure detection, recovery, or lease management. Instead, FaRM uses RDMA for fast recovery.
To execute a transaction, FaRM uses one-sided RDMA to read objects, and locally buffer writes. The coordinator keeps track of the addresses and version of all objects that were accessed. (Each object has a corresponding 64-bit version.) After executing the transaction, FaRM tries to commit, as shown in the below figure:
Let's go through the steps of commit in more detail:
- Lock. The coordinator writes LOCK to the log of each server that's the primary for an accessed written object. (In the diagram, that means servers P1 and P2.)
- The LOCK record consists of the transaction ID, IDs of regions with objects written by the transaction, and addresses, versions, and values of those objects.
- Upon receiving LOCK, primaries attempt to lock the objects at the versions, and reply if it was successful. (It can fail if the object's version changed since the transaction read it, or if another transaction locked it.)
- If the locking attempt failed, the coordinator aborts the transaction.
- Validate. The coordinator reads the versions of objects that were read but not written by the transaction from their corresponding primaries. (In the diagram, that means server P3.)
- Validation fails if any object has changed. The coordinator then aborts the transaction.
- Validation uses one-sided RDMA by default.
- Commit backups. The coordinator writes COMMIT-BACKUP to the non-volatile logs of each backup. (In the diagram, this means servers B1 and B2.)
- The COMMIT-BACKUP consists of the same data as a LOCK (which is described above.)
- The backups send back an ACK.
- Commit primaries. After the coordinator receives all ACKs, it writes a COMMIT-PRIMARY to the logs of each primary. (In the diagram, this means servers P1 and P2.)
- The COMMIT-PRIMARY contains the transaction ID to commit.
- Upon receiving COMMIT-PRIMARY, the primaries update the objects in place, increment their versions, and unlock them.
- As soon as one primary sends back an ACK, the coordinator reports to the application that the transaction was completed.
- Truncate. The coordinator can truncate logs in primaries and backups after receiving ACKs from all primaries.
FaRM transactions use optimistic concurrency control. This means transactions can access a data object without acquiring a lock. Before it commits, the transaction must verify that no other transaction has modified the data object. If so, the commit is valid. Otherwise, it aborts and retries.
Furthermore, FaRM guarantees strict serializability of successfully committed transactions. Reads of an object get the value of the most recent committed write. Reads across different objects are not necessarily atomic, but individual reads are.
Handling Failures
Recovery from failure happens in 5 phases:
- Failure detection. FaRM uses leases to determine whether a machine failed. Leases are extremely short, which help ensure the system's availability.
- Reconfiguration. This protocol moves a FaRM instance to a new configuration, and uses one-sided RDMA to get good performance.
- Transaction State Recovery. FaRM uses logs distributed across the servers to recover transaction state. It does this quickly by giving work to threads and machines in parallel.
- Recovering data. FaRM recovers data at backups for each region.
- Recovering allocator state. The allocator splits regions into 1 MB blocks to allocate small objects.
Overall, recovery is fast, and FaRM can bounce back from failure to peak throughput in less than 50 ms.
Conclusion
FaRM shows that distributed transactions can be fast without sacrificing consistency and availability. It uses optimistic concurrency control, which is faster than locking provided that there aren't too many conflicts between transactions.