tldr: Google File System

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

The Google File System (2003)

Ghemawat et al.

The Google File System (GFS) is a "scalable, distributed file system for large distributed data-intensive applications". It was developed at Google, and by the time the paper was written, GFS clusters of up to 1,000 nodes with 300 TB of disk storage were already deployed and accessed by hundreds of clients. Its design prioritizes performance and simplicity over consistency.

Guiding Assumptions

Here are the assumptions that guided how GFS was designed:
  • The file system consists of 100s or 1000s of commodity machines, and failure is common.
    • GFS must continuously monitor, tolerate, and recover from machine failures.
  • The files that are being stored are large, usually more than 100 MB.
    • GFS should optimize for multi-GB files (the common case), and support but not optimize for small files (rarer, on the order of KBs). Caching file data offers little benefit at these sizes, so neither GFS clients nor servers cache file data.
  • The read workload consists of large streaming reads (more than 1 MB) and small random reads (on the order of KBs). The large reads are usually sequential and the small reads are at arbitrary offsets. 
    • Performance-conscious applications often batch and sort their small reads so they advance steadily through a file rather than going back and forth.
  • The write workload consists of large, frequent, sequential writes that append data to files, instead of overwriting existing data. Once written, files are usually only read. 
    • GFS should optimize for the larger writes (common scenario), and support but not optimize for smaller, arbitrary writes (rarer). 
  • Most applications have stricter requirements on time to process bulk data than on response time for individual reads and writes. 
    • GFS should prioritize high bandwidth over low latency.

Interface

GFS hierarchically organizes files in directories. It supports the following operations:
  • Create, Delete
  • Open, Close
  • Read, Write 
  • Snapshot: Creates a copy of a file or a directory tree at low cost
  • Record append: Allows multiple clients to append data to the same file concurrently, while guaranteeing the atomicity of each client's append
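To make the interface concrete, here is a rough sketch of what a client-facing API could look like in Go. The names and signatures are my own illustration; the paper doesn't specify an API, and GFS is not POSIX-compliant:

  // Hypothetical client-side view of the GFS interface; illustrative only.
  package gfs

  type Client interface {
      Create(path string) error
      Delete(path string) error
      Open(path string) (File, error)
      Snapshot(src, dst string) error // copy a file or directory tree
  }

  type File interface {
      Read(offset int64, n int) ([]byte, error)
      Write(offset int64, data []byte) error
      // RecordAppend appends data atomically at least once and returns
      // the offset GFS chose for the record.
      RecordAppend(data []byte) (offset int64, err error)
      Close() error
  }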

Architecture

A GFS cluster has a single master and multiple chunkservers, and is accessed by many clients.

Client Code

Each application is linked with GFS client code that allows it to read and write data by communicating with both the master and the chunkservers. Specifically, clients communicate with the master for metadata operations, and communicate with the chunkservers for data operations. 

Chunkservers

Each cluster has multiple chunkservers, each of which is usually run on a commodity machine. Files are divided into fixed-size chunks, which are identified by an immutable and globally unique 64-bit chunk handle. Chunkservers store these chunks on their local disks. By default, each chunk is replicated on 3 chunkservers for reliability. 

The authors decided on a chunk size of 64 MB. The advantages of this large a chunk size are:
  • Clients don't need to interact with the master as much. Only 1 request to the master for a specific chunk is needed for all reads and writes to that chunk. 
  • A persistent TCP connection to a chunkserver over a larger interval of time reduces network overhead. (This happens because clients usually perform many operations on each chunk.)
  • The master has to store less metadata, so it can store all the metadata in memory. 
One of the main disadvantages of a large chunk size is that chunkservers storing small files can become hot spots. A small file consists of only a few chunks (possibly just one), so it is stored on only a few chunkservers, and many clients accessing the same small file all hit those servers at once. In practice, this issue isn't major because most applications store large files.

Master

Each cluster also has a single master, which maintains three types of metadata:
  1. File and chunk namespaces
  2. Mapping from files to chunks
  3. Locations of each chunk's replicas 
The first two types of metadata are kept persistent by logging mutations to an operation log. Chunk locations, on the other hand, are not stored persistently: the master simply asks each chunkserver for its chunks at startup, and stays up to date afterwards through periodic HeartBeat messages to the chunkservers.

All the metadata is stored in-memory, so the master can perform operations quickly and periodically scan and maintain its entire state. 
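A rough sketch of how this in-memory metadata might be laid out (my own illustration; the paper describes what is stored, not the data structures):

  // Illustrative in-memory metadata kept by the master.
  package master

  type ChunkHandle uint64

  type ChunkInfo struct {
      Version  uint64   // chunk version number, bumped on each new lease grant
      Replicas []string // chunkserver addresses; not persisted, rebuilt from
                        // chunkserver reports at startup and via HeartBeats
  }

  type FileInfo struct {
      Chunks []ChunkHandle // chunk index -> chunk handle; logged persistently
  }

  type Master struct {
      namespace map[string]*FileInfo       // file namespace, logged persistently
      chunks    map[ChunkHandle]*ChunkInfo // per-chunk state
  }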

The authors try to prevent the master from becoming a bottleneck by not letting clients read or write file data directly through the master. However, the master is responsible for other kinds of operations, which are discussed in a later section.

Consistency Model

GFS trades consistency for performance and simplicity. Its relaxed consistency model has the following guarantees:
  • File namespace modifications are atomic and handled by the master. 
  • Data mutations' effects on a file region depend on several factors. We first define a file region as consistent if all clients always see the same data, regardless of which replica they read from. A file region is defined after a mutation if it is consistent and clients see what the mutation wrote in its entirety.
    • For a write operation, the client specifies the file offset where the data is to be written.
      • After a serial, successful mutation, the file region is defined (and thus, also consistent).
      • After a concurrent, successful mutation, the file region is consistent but undefined.
    • For a record append operation, GFS picks the file offset to write the data, and appends it atomically at least once. GFS might insert padding or duplicate records (See the append record section later on for why). 
      • Successful mutations, whether serial or concurrent, result in defined regions interspersed with inconsistent regions. (Further explained in a later section.)
    • After any failed mutation, the file region is inconsistent (and thus, also undefined). 
    • After a sequence of successful mutations, the mutated file region is guaranteed to be defined and to contain the data written by the last mutation. GFS guarantees this by:
      • Applying mutations to a chunk in the same order on all its corresponding replicas. (Further explained in a later section.) 
      • Detecting stale replicas and excluding them: they are never involved in mutations, and the master stops handing their locations out to clients.
  • Since clients cache chunk locations, they can still read from a stale replica until the cache entry times out or the file is reopened.
  • A chunk is lost irreversibly only when all its corresponding replicas are lost before the master can react. 
The target applications can tolerate this relaxed consistency model because most applications append instead of overwrite data. 
  • Appending is more efficient and more resilient to failure than overwriting. 
  • Periodic checkpoints allow writers to recover and prevent readers from reading incomplete data. 
  • After record appends from multiple writers, readers can identify and throw away padding or duplicate records using checksums. Or, readers can use unique identifiers to filter out duplicate records. 
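As a concrete example of the last point, a reader might filter duplicates using an application-level record ID. This is a minimal sketch with a hypothetical record format, not something defined by GFS:

  // Hypothetical record format: each record carries a writer-assigned unique ID.
  package reader

  type Record struct {
      ID   string
      Data []byte
  }

  // dedupe drops records whose ID has already been seen, discarding the
  // duplicates that retried record appends can leave behind.
  func dedupe(records []Record) []Record {
      seen := make(map[string]bool)
      var out []Record
      for _, r := range records {
          if seen[r.ID] {
              continue
          }
          seen[r.ID] = true
          out = append(out, r)
      }
      return out
  }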

Operations

Read

The process for a read operation is:
  • The client translates the file name and byte offset into a chunk index within the file, using the fixed chunk size (see the sketch after this list). 
  • The client sends a request to the master containing the file name and chunk index. 
  • The master replies with the corresponding chunk handle, and locations of the replicas which are storing the chunk. 
  • The client caches the information it has received from the master. 
  • The client sends a request to a replica (usually, the closest one) containing the chunk handle and a byte range. 
    • If the client sends more requests for the same chunk, it doesn't have to communicate with the master unless the cached information expires or the file is reopened. 
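The translation in the first step is plain arithmetic over the fixed 64 MB chunk size; a minimal sketch:

  package gfsclient

  const chunkSize = 64 << 20 // 64 MB

  // chunkIndex maps a byte offset in a file to the index of the chunk that
  // contains it, plus the offset within that chunk. The client sends
  // (file name, chunk index) to the master and (chunk handle, byte range)
  // to a chunkserver.
  func chunkIndex(offset int64) (index, offsetInChunk int64) {
      return offset / chunkSize, offset % chunkSize
  }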

Write

Chunk leases are used to maintain a consistent mutation order across replicas. When a chunk is about to be mutated, the master grants a chunk lease to one of the chunkservers holding a replica of that chunk. This chunkserver is called the primary, and it chooses the mutation order; the other replicas, called secondaries, must follow that order. A lease has an initial timeout of 60 seconds, but the primary can keep extending it as long as the chunk is being mutated. If the primary becomes unreachable, the master grants a new lease to another replica once the old lease expires.
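A sketch of the lease record the master might keep per chunk (illustrative; only the 60-second timeout and the extension mechanism come from the paper):

  package master

  import "time"

  // Illustrative lease state for a chunk. The initial timeout is 60 seconds,
  // but the primary can keep extending it (piggybacked on HeartBeat messages)
  // for as long as the chunk is being mutated.
  type Lease struct {
      Primary string    // address of the chunkserver acting as primary
      Expires time.Time // when the lease lapses if not extended
  }

  func (l Lease) Valid(now time.Time) bool {
      return now.Before(l.Expires)
  }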
With chunk leases in mind, the process for a write operation is:
  1. The client asks the master for the identity of the primary chunkserver and the locations of the secondaries.
  2. The master replies with this information, and the client caches it. Thus, the client doesn't need to contact the master again until the primary becomes unreachable or no longer holds the chunk lease.
  3. The client pushes the data to all replicas along a linear chain, which fully utilizes each machine's outbound bandwidth. The data flow is scheduled based on network topology: each machine forwards the data to the closest machine that hasn't received it yet. (The primary isn't necessarily the first to receive the data.)
  4. The client sends a write request to the primary once all replicas have acknowledged receiving the data. The primary assigns consecutive serial numbers to decide the mutation order and applies the mutations locally in that order (see the sketch after this list). 
  5. The primary forwards the write request and mutation order to the secondaries. The secondaries apply the mutations in the same order as the primary. 
  6. All secondaries reply to the primary when they've finished.
  7. The primary replies to the client with the status of the write. For example, if some secondaries failed, the client retries steps 3-7.
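A sketch of how the primary in steps 4 and 5 could impose a single mutation order. The paper says the primary assigns consecutive serial numbers to the mutations it receives; the code structure is my own:

  package primary

  import "sync"

  // Mutation is an illustrative write request as seen by the replicas.
  type Mutation struct {
      Serial int64 // position in the primary's chosen mutation order
      Offset int64
      Data   []byte
  }

  type Primary struct {
      mu         sync.Mutex
      nextSerial int64
  }

  // order stamps a mutation with the next serial number. The primary applies
  // mutations locally in this order and forwards the same order to every
  // secondary, which must apply them identically.
  func (p *Primary) order(m Mutation) Mutation {
      p.mu.Lock()
      defer p.mu.Unlock()
      m.Serial = p.nextSerial
      p.nextSerial++
      return m
  }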

Atomic Record Append

Record appends work almost like writes (as described in the previous section). The only difference is at step 4: the primary checks whether appending the record to the current chunk would push it past the 64 MB maximum. If it would, the primary pads the rest of the chunk, tells the secondaries to pad as well, and then tells the client to retry the record append on the next chunk.
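A sketch of that step-4 decision (the 64 MB limit is from the paper; the function is illustrative):

  package primary

  const chunkSize = 64 << 20 // 64 MB

  // planAppend is an illustrative version of the primary's check: either the
  // record fits at the current end of the chunk, or the chunk must be padded
  // and the client told to retry the append on the next chunk.
  func planAppend(chunkLen int64, record []byte) (offset int64, pad bool) {
      if chunkLen+int64(len(record)) > chunkSize {
          return 0, true // pad the rest of the chunk; retry on the next chunk
      }
      return chunkLen, false // append at the current end of the chunk
  }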

If a record append fails at any replica, the client retries the operation. Thus, the data might contain duplicates.

For a record append to be reported as successful, the data must have been written at the same offset on all replicas of the chunk.

Master Operations

Chunk Re-replication

The master re-replicates a chunk when the number of available replicas drops below a user-specified goal. This can happen because a chunkserver becomes unavailable, a replica is reported corrupted, one of the chunkserver's disks is disabled, or the replication goal is increased.

Replica Re-balancing

The master periodically re-balances replicas. It scans the current replica distribution, and moves replicas to improve load balancing and disk space utilization.

Stale Replica Detection

Chunk replicas become stale when a chunkserver fails and misses mutations to its chunks while it is down. To detect this, the master maintains a chunk version number for each chunk. Whenever the master grants a new lease on a chunk, it increments the chunk's version number and informs the up-to-date replicas. A replica that later reports an older version number is stale; the master ignores it and has it garbage collected.
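The check itself is a simple comparison; a minimal sketch of what the master might do when a chunkserver reports a replica (illustrative):

  package master

  // isStale reports whether a replica missed mutations: its version number is
  // lower than the version the master recorded when it last granted a lease.
  // Stale replicas are excluded from client requests and garbage collected.
  func isStale(masterVersion, replicaVersion uint64) bool {
      return replicaVersion < masterVersion
  }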

Master Replication

As mentioned in previous sections, chunks are replicated; the master's state is replicated as well, by keeping copies of its operation log and checkpoints on multiple machines. If the master fails, an external monitor starts a new master process on another machine using the replicated operation log. In addition, "shadow" masters provide read-only access to the file system even while the primary master is down.

Conclusion

GFS is a foundational system that MapReduce and Bigtable were built on, and the GFS paper inspired later systems such as HDFS. It also demonstrates how consistency can be traded away for simplicity and performance.
