tldr: Frangipani
Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.
Frangipani: A Scalable Distributed File System [1997]
Thekkath et. al
Introduction
Frangipani is a scalable distributed file system that manages disks on multiple machines. Its designers focused on making it simple and easy to use. It achieves good performance via caching.
Frangipani's built on top of Petal, a distributed storage system that provides virtual disks, each with a 2^64 byte address space.
Frangipani's built on top of Petal, a distributed storage system that provides virtual disks, each with a 2^64 byte address space.
The above figure shows that Frangipani servers run on top of a shared Petal virtual disk and coordinate their actions with locks. More servers can be added to scale up.
System Overview
With Frangipani, all user programs see the same files, even if they are running on different machines. Furthermore, their views are coherent: changes to a file or directory on one machine are immediately visible to all other machines. (More on this in a later section.) Changes are not guaranteed to be saved in non-volatile storage until a sync system call, but metadata changes are logged and can be made non-volatile.
The Frangipani file server module on each machine runs in the OS kernel as one of the file system implementations in the file system switch. Each file server keeps track of its own redo log of pending changes in a Petal disk. If a server crashes, another server can access the log in the Petal disk.
The lock service provides multiple-reader/single-writer locks to clients on the network. It's responsible for managing virtual disk access and keeping buffer caches coherent.
The figure below shows a possible Frangipani structure. Note that servers don't communicate with each other. They only communicate with Petal and the lock server.
Disk Layout
The figure above shows how Frangipani divides up its virtual disk space on Petal:
- The first region stores configuration parameters.
- The second region stores logs.
- Each server has its own partition of this region to store its own log.
- The third region stores allocation bitmaps.
- These data structures store which blocks in the remaining regions are free.
- Each server has its own private partition of this region.
- The fourth region stores inodes.
- Each file has a corresponding inode that stores metadata (e.g. timestamps and pointers to data location.)
- There's a fixed mapping between bits in the allocation bitmap and the inodes.
- Each server can allocate inodes for new files only in its corresponding part of this region (based on its allocation bitmap).
- Each server can read, write, or free any existing file's inode.
- The fifth region stores small data blocks.
- Each small block is 4 KB.
- The remaining part of the Petal disk stores large blocks.
- Each large block is 1 TB.
Now that we have an overview of Frangipani, we go into how it implements crash recovery, cache coherence, and atomicity.
Crash Recovery
Frangipani handles failures by using WAL. Before each time a server updates metadata, it appends a record of the update to its log in memory. The log is periodically written to Petal. A server actually updates its metadata only after the corresponding record is stored in Petal.
Frangipani detects when a server crashes either through the server's client, or when the lock service gets no reply from the server. During recovery, any incomplete records in the failed server's log are applied. Note that Frangipani only needs to process the failed server's log.
To ensure correct recovery, the system does the following:
- The locking protocol ensures that updates requested to the same data by different servers are serialized. A write lock can only be released once the data has been written to Petal.
- Recovery never replays a log record for an update that's already completed.
- At any moment, only one recovery process is trying to replay a server's log region.
Cache Coherence
There are two kinds of locks in Frangipani:
- A server can read and cache the corresponding data from disk if it holds a read lock.
- A server can read/write and cache the corresponding data from disk if it holds a write lock. The cached data can differ from the on-disk data, but before a server releases the write lock, it must apply its data to disk.
Furthermore, Frangipani follows 3 rules:
- A workstation must acquire a lock, then read from Petal.
- The workstation writes to its log and Petal, and then releases lock.
- Caching without holding a lock is not allowed.
Atomicity
Transactions must grab all the needed locks. (For example, when creating a file, a workstation must grab locks on the directory, inode, and bitmap.) Doing this will avoid conflicts, such as when two workstations try to create files in the same directory. (This means they'd be stored in the same free slot.)