tldr: Fault-Tolerant Virtual Machines
Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.
The Design of a Practical System for Fault-Tolerant Virtual Machines [2010]
Scales et al.
Introduction
The primary/backup approach is a common way of implementing fault-tolerant servers. The primary server continuously ships all of its state changes to a backup server, so when the primary fails, the backup can take over without data loss and without clients noticing that a failure occurred. However, this approach requires a large amount of bandwidth.
Another way of implementing fault-tolerant servers is the state-machine approach, which uses far less bandwidth. Instead of replicating the data itself, this approach replicates the operations (inputs) that cause the state to change. A state machine is deterministic: if state machines A and B start in the same state and receive the same input requests in the same order, they end in the same state. All servers execute all operations.
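To make the determinism property concrete, here is a minimal sketch (my own toy example, not from the paper): two replicas of the same deterministic state machine, fed the same operations in the same order, end up in the same state.

```python
# Toy illustration of the state-machine approach (hypothetical example):
# replicate the operations, not the state, and both replicas converge.

class KVStateMachine:
    """A trivially deterministic key-value state machine."""
    def __init__(self):
        self.state = {}

    def apply(self, op):
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)

primary, backup = KVStateMachine(), KVStateMachine()
ops = [("put", "x", 1), ("put", "y", 2), ("delete", "x", None)]

# Each replica applies the same ops in the same order.
for op in ops:
    primary.apply(op)
    backup.apply(op)

assert primary.state == backup.state == {"y": 2}
```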
Determinism is easier to achieve on VMs than on physical servers because the VMM can record exactly what happened on the primary VM and replay it correctly and efficiently on the backup VM, even in the presence of non-determinism. (An example of a non-deterministic event is a virtual interrupt; an example of a non-deterministic operation is reading the processor's clock cycle counter.) Specifically, the VMM records the inputs to the VM and all possible non-determinism in a stream of log entries. This is called deterministic replay.
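One way to picture a deterministic-replay log entry (the field names below are my own guesses for exposition; the paper does not specify VMware's format): each non-deterministic event is recorded along with the exact position in the instruction stream where it must be re-injected during replay.

```python
# Illustrative shape of a deterministic-replay log entry (field names are
# made up for exposition; the paper does not describe the actual format).
from dataclasses import dataclass

@dataclass
class LogEntry:
    instr_count: int   # exact position in the instruction stream
    kind: str          # e.g. "virtual_interrupt", "rdtsc", "disk_io_completion"
    payload: bytes     # data needed to reproduce the event (e.g. the value read)

# The primary's VMM streams entries like these over the logging channel;
# the backup's VMM re-injects each event at the same instruction position,
# so even non-deterministic events produce identical execution.
log = [
    LogEntry(instr_count=1_000_204, kind="rdtsc",
             payload=(123456789).to_bytes(8, "little")),
    LogEntry(instr_count=1_002_517, kind="virtual_interrupt", payload=b"\x21"),
]
```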
The authors have designed a system called VMware FT for fault-tolerant virtual machines based on the state-machine approach. Specifically, the system uses deterministic replay to implement instruction-level replication, and it targets fail-stop hardware failures. However, at the time, the system could not handle multiprocessor VMs because of the resulting major performance issue: almost every access to shared memory can be a non-deterministic operation.
Design
To implement a fault-tolerant primary VM, the system maintains a backup VM on a physically different server. The two VMs are in virtual lockstep: mostly in sync, save for a small time lag.
The virtual disks are kept on shared storage. The authors decided against separate (non-shared) virtual disks because the two disks would have to be synced up explicitly on start-up and re-synced after a failure, and that sync would cover disk state, not just running state. Additionally, with non-shared disks the system couldn't use shared storage to handle the split-brain problem (which will be discussed in a later section). Non-shared storage would, however, be useful when shared storage is unavailable or too expensive, or when the primary and backup are far apart.
All inputs go to the primary VM, including network inputs. The primary forwards all received input to the backup VM over a network connection called the logging channel. Although the primary and backup VMs always execute identically, only the primary's outputs are returned to clients; the backup's outputs are dropped by its VMM.
Protocol
Log entries used for deterministic replay are sent to the backup VM through the logging channel. Additionally, the system creates a special log entry for each output operation. The primary may not send an output to the external world until the backup has received and acked this special log entry.
The authors designed this protocol to ensure that when the primary fails and the backup takes over, no data is lost and the clients notice no service interruption or inconsistency. The paper defines the Output Requirement as:
If the backup ever takes over after a failure of the primary, the backup will continue executing in a way that is entirely consistent with all outputs that the primary has sent to the external world.
To optimize performance, the primary continues executing even if the backup hasn't acked yet; the output produced in the meantime is simply buffered until the ack arrives. Symmetrically, the backup buffers the log entries it receives before replaying them. We discuss these buffers in more detail in a later section.
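Here is a rough sketch of the Output Rule plus output buffering as I understand it (my own illustration, not VMware's code): the primary creates a log entry for each output, buffers the output, keeps executing, and releases the output only once the backup has acked up to that entry.

```python
# Sketch of the Output Rule with buffering (my own illustration, not
# VMware's code): outputs go to the outside world only after the backup
# has acked the special log entry created for them.
from collections import deque

class Primary:
    def __init__(self):
        self.next_seq = 0
        self.acked_up_to = -1     # highest log sequence number acked by the backup
        self.pending = deque()    # buffered (seq, output) pairs awaiting an ack

    def produce_output(self, output, send_log_entry):
        seq = self.next_seq
        self.next_seq += 1
        send_log_entry(seq, output)          # special log entry for this output
        self.pending.append((seq, output))   # execution continues; the output waits here

    def on_ack(self, acked_seq, send_to_client):
        self.acked_up_to = max(self.acked_up_to, acked_seq)
        # Release every buffered output whose log entry the backup has now acked.
        while self.pending and self.pending[0][0] <= self.acked_up_to:
            _, output = self.pending.popleft()
            send_to_client(output)

# Tiny usage example: the output is released only after the ack arrives.
p = Primary()
p.produce_output("response #1", send_log_entry=lambda seq, entry: None)
p.on_ack(0, send_to_client=lambda out: print("sent:", out))   # prints: sent: response #1
```

Note that the primary never stops executing here; only the externally visible send is delayed.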
The primary and the backup could both send the same output, so the system does not guarantee exactly-once output events. For example, the primary could send an output and then immediately crash. The backup cannot know whether the primary crashed before or after sending that output, so when it goes live it sends the same output again. However, the authors argue this is not a problem, because clients (and the network infrastructure, including TCP) must already handle duplicate packets anyway.
Failure response
If the backup fails, the primary simply stops sending entries over the logging channel and continues executing normally.
If the primary fails, the backup first replays the log entries it has received and acked but not yet consumed. Once it has finished replaying, it goes live: it is promoted to the new primary and starts executing normally.
After a failure, the clustering service starts a VM on another host to be the new backup.
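Putting the two failover paths together, here is a rough pseudocode sketch (the callback names are hypothetical stand-ins for VMM and clustering-service operations, not actual APIs):

```python
# Sketch of the two failover paths (my own pseudocode; callbacks are
# hypothetical stand-ins for VMM / clustering-service operations).

def on_backup_failure(stop_logging, continue_execution):
    # Primary survives: stop sending log entries and keep running normally.
    stop_logging()
    continue_execution()

def on_primary_failure(unreplayed_entries, replay_entry, go_live, start_new_backup):
    # Backup survives: first consume every log entry it has already
    # received and acked, so it reaches the primary's last logged state...
    for entry in unreplayed_entries:
        replay_entry(entry)
    # ...then promote itself and execute live (its outputs now reach clients).
    go_live()
    # Finally, redundancy is restored by starting a fresh backup elsewhere.
    start_new_backup()
```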
Failure detection
To detect VM failure, the system uses periodic heartbeat messages between the servers and monitors traffic on the logging channel. It declares a failure if heartbeats or logging traffic stop for longer than a specific timeout (on the order of seconds).
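A minimal sketch of the detection logic (the timeout constant is made up; the paper only says it is on the order of seconds):

```python
import time

# Illustrative timeout; the paper only says "on the order of seconds".
TIMEOUT_S = 5.0

class FailureDetector:
    def __init__(self):
        now = time.monotonic()
        self.last_heartbeat = now
        self.last_log_traffic = now

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def on_log_traffic(self):
        # Any log entry or ack counts as traffic on the logging channel.
        self.last_log_traffic = time.monotonic()

    def peer_failed(self):
        now = time.monotonic()
        # Declare failure if heartbeats or logging traffic have stopped
        # for longer than the timeout.
        return (now - self.last_heartbeat > TIMEOUT_S or
                now - self.last_log_traffic > TIMEOUT_S)
```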
The authors avoid the split-brain problem (both VMs going live and serving clients at once) by using the shared storage: before the primary or backup goes live after a detected failure, it must execute an atomic test-and-set operation on the shared storage, and it may go live only if that operation succeeds.
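A sketch of the go-live check, assuming a hypothetical shared-storage primitive that behaves like an atomic test-and-set (the paper doesn't give implementation details beyond that):

```python
import threading

class SharedStorageFlag:
    """Stand-in for the atomic test-and-set on shared storage (hypothetical)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = False

    def test_and_set(self):
        # Atomically claim the right to go live; only the first caller wins.
        with self._lock:
            if self._claimed:
                return False
            self._claimed = True
            return True

def try_go_live(vm_name, flag):
    if flag.test_and_set():
        print(f"{vm_name} goes live")
        return True
    # The other VM already went live; halt instead of splitting the brain.
    print(f"{vm_name} halts itself")
    return False

flag = SharedStorageFlag()
try_go_live("backup", flag)   # -> backup goes live
try_go_live("primary", flag)  # -> primary halts itself
```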
More on the logging channel
Performance is not a problem if the backup's buffer is empty: the backup simply pauses execution until the primary sends the next log entry, and clients aren't affected. However, if the primary's buffer is full, the primary must stop executing until entries are flushed to the backup; since the primary is unresponsive until then, clients are affected. A large backlog of unreplayed log entries is also a problem because, if the primary then fails, the backup takes longer to go live:
time for the backup to go live = time for the system to detect the failure + time for the backup to replay the log entries it has already acked but not yet replayed
Thus, the authors try to prevent this situation by making sure the backup's physical server isn't overloaded with many other VMs. Furthermore, if the backup starts lagging by more than some threshold (~1 second), the system slows the primary down by telling the scheduler to give it fewer CPU cycles. As the backup catches up, the scheduler gradually gives the primary more CPU cycles back. In practice, this kind of slowdown is very rare.
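One way to picture the feedback loop (the control law below is my own illustration; the paper only says the primary's CPU limit is gradually lowered while the backup lags and raised again as it catches up):

```python
# Illustrative slow-down feedback loop (the exact control law is my guess).
LAG_THRESHOLD_S = 1.0

def adjust_primary_cpu_limit(current_limit_pct, backup_lag_s):
    if backup_lag_s > LAG_THRESHOLD_S:
        # Backup is falling behind: give the primary fewer CPU cycles.
        return max(10, current_limit_pct - 10)
    # Backup has (nearly) caught up: give CPU back to the primary.
    return min(100, current_limit_pct + 10)

limit = 100
for lag in [0.2, 1.5, 2.0, 0.8, 0.3]:
    limit = adjust_primary_cpu_limit(limit, lag)
    print(f"backup lag {lag:.1f}s -> primary CPU limit {limit}%")
```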
Disk I/O
One problem with disk operations is that there can be races with VM memory accesses. For example, a VM application or the OS could access the same memory at the same time as a disk read, which would lead to a non-deterministic outcome.
To solve this problem, the authors use bounce buffers. A bounce buffer is a temporary buffer the same size as the memory being accessed by the disk operation. A disk read first reads the data into the bounce buffer, and the data is copied into guest memory only when the I/O completes. A disk write first copies the data to be written into the bounce buffer, so the actual write is unaffected by any later modifications to guest memory.
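A sketch of the bounce-buffer idea (hypothetical function names; the real logic lives inside the VMM's disk emulation layer):

```python
# Sketch of bounce buffers for disk reads and writes (hypothetical names).

def disk_read(disk, offset, length, guest_memory, guest_addr):
    # 1. The device writes into a private bounce buffer, not into guest
    #    memory, so concurrent guest accesses cannot race with the DMA.
    bounce = bytes(disk[offset:offset + length])
    # 2. Only when the I/O completes is the data copied into guest memory,
    #    at a point the VMM can log and replay deterministically.
    guest_memory[guest_addr:guest_addr + length] = bounce

def disk_write(disk, offset, length, guest_memory, guest_addr):
    # 1. Copy the guest data into the bounce buffer first, so later guest
    #    modifications cannot change what actually gets written.
    bounce = bytes(guest_memory[guest_addr:guest_addr + length])
    # 2. Perform the actual write from the bounce buffer.
    disk[offset:offset + length] = bounce

disk = bytearray(b"hello world.....")    # 16-byte toy "disk"
guest_memory = bytearray(16)
disk_read(disk, 0, 5, guest_memory, 0)   # guest memory sees "hello" only after the copy
```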
Network I/O
The system disables asynchronous network performance optimizations, since they introduce non-determinism. Disabling them hurts networking performance, however, so the authors compensate in two ways:
- Clustering optimizations to reduce VM traps and interrupts.
- Reduce the delay for transmitted packets by reducing the time required for the primary to send a log entry to the backup and for the backup to acknowledge it.
Conclusion
VMware FT is a commercial product that uses the state-machine approach to achieve fault tolerance. It significantly reduces network bandwidth consumption compared to the common primary/backup approach of shipping all state changes. Currently, it is limited to uniprocessor VMs due to the performance cost of handling shared-memory non-determinism. Furthermore, instruction-level replication is expensive; many other systems (e.g. MapReduce, GFS, Raft) instead replicate state and operations at the application level, which is much more efficient.