tldr: Zookeeper

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

ZooKeeper: Wait-free coordination for Internet-scale systems (2010)

Hunt et al.

Introduction

ZooKeeper is a widely used coordination service for large-scale distributed applications, developed at Yahoo!. The authors designed ZooKeeper as a coordination kernel on top of which developers implement their own coordination primitives. The API manipulates wait-free data objects, which gives it high performance and fault tolerance. Additionally, the authors chose to guarantee both FIFO ordering of client operations and linearizable writes, a combination they found efficient to implement and useful to developers.

The ZooKeeper service interface 

Clients submit requests through the ZooKeeper API. Each method in the API has a synchronous version, which blocks until the single operation completes, and an asynchronous version, which returns immediately so a client can keep many operations in flight in parallel.

A session starts when a client connects to ZooKeeper; the client holds a handle to this session while it issues requests. Sessions end when:
  • The client closes the session handle. 
  • ZooKeeper decides that the client is faulty. (Sessions have timeouts, and a client is considered faulty if ZooKeeper receives nothing from it within the timeout interval.) 
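
As a concrete illustration (not from the paper), here is a minimal sketch using the Apache ZooKeeper Java client; the connect string, timeout, and paths are placeholders. It opens a session, issues one synchronous and one asynchronous create, and then closes the session handle explicitly.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SessionExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Start a session; the handle stays valid until close() or a session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, (WatchedEvent e) -> {
            if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Synchronous create: blocks until the single operation completes.
        zk.create("/app1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Asynchronous create: returns immediately and invokes the callback later,
        // so a client can keep many operations outstanding in parallel.
        zk.create("/app1/task", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
                (int rc, String path, Object ctx, String name) ->
                        System.out.println("created " + name + " (rc=" + rc + ")"),
                null);

        zk.close(); // explicitly ends the session
    }
}
```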

Data Model 

ZooKeeper organizes znodes, its in-memory data nodes, into a hierarchical name space called the data tree. Znodes are not meant for general data storage; they usually hold metadata used for coordination, and they map to abstractions in the client application. Znodes are referred to by standard UNIX file-system path notation (this path is their key). 
For example, in the figure from the paper, the two subtrees correspond to two client applications, Application 1 and Application 2. The three znodes under Application 1 correspond to client processes and persist only while those processes are running. 

Clients can create two types of znodes: 
  • Regular znodes are created and deleted explicitly. 
  • Ephemeral znodes are created explicitly. They can be deleted explicitly by the client, or are deleted automatically by the system when the session that created them ends. 
If the client specifies a sequential flag when creating a znode, a monotonically increasing counter is appended to its name. 

Finally, ZooKeeper implements watches on znodes so clients can receive notifications of changes without polling. Watches are one-time triggers: they are unregistered once they fire or once the session closes. 
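
The sketch below (again with the Java client, hypothetical paths) shows the znode flavors and a one-shot watch; it assumes an already-connected handle like the one in the previous sketch.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeExamples {
    static void createZnodes(ZooKeeper zk) throws KeeperException, InterruptedException {
        // Regular znode: stays until a client deletes it explicitly.
        zk.create("/app1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when the creating session ends.
        zk.create("/app1/p_1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential flag: ZooKeeper appends a monotonically increasing counter,
        // so the returned name looks like "/app1/worker-0000000003".
        String actual = zk.create("/app1/worker-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + actual);

        // Watches are one-shot: this registers the session's default watcher,
        // which fires once on the next change to /app1 and must then be re-set.
        Stat stat = zk.exists("/app1", true);
    }
}
```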

Guarantees 

The authors chose to guarantee:
  • FIFO client ordering of operations
    • All requests from a given client are executed in the order that they were sent by the client.
  • Linearizable writes
    • All requests that update ZooKeeper state are serializable and respect precedence.
Because writes are linearizable in the asynchronous sense (what the paper calls A-linearizability), a client can have multiple outstanding operations, and they are executed in FIFO order. Read requests are processed locally by whichever server the client is connected to, so read throughput scales with the number of servers in the system. 
Here's an example scenario:
  • A new leader is elected.
  • The leader changes configuration parameters and notifies other processes once it finishes.
ZooKeeper needs to ensure that other processes don't use configuration parameters that are still being changed, and don't use half-changed configuration parameters if the leader dies partway through. To do this, the new leader designates a path as the ready znode; other processes use the configuration only once that znode exists. Specifically (see the sketch after this list):
  • The new leader first deletes ready.
  • The new leader updates the configuration znodes.
    • These changes are issued asynchronously and pipelined for high performance. 
  • The new leader creates ready.
    • If a process sees the ready znode, the ordering guarantees ensure it also sees all of the configuration changes issued before it. 
    • If the ready znode is not there, processes ignore the configuration changes.
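
Here is one way this pattern might look with the Java client; the paths (/config/ready, /config/<name>) and the parameter map are hypothetical, and the sketch assumes the individual parameter znodes already exist.

```java
import java.util.Map;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigPublisher {
    // The new leader republishes configuration behind a "ready" znode.
    static void publish(ZooKeeper zk, Map<String, byte[]> params) throws Exception {
        // 1. Delete ready so readers ignore the configuration while it changes.
        try {
            zk.delete("/config/ready", -1); // -1 matches any version
        } catch (KeeperException.NoNodeException ignored) {
            // First time publishing: ready does not exist yet.
        }

        // 2. Update the configuration znodes. The asynchronous calls are
        //    pipelined, so the leader does not wait on each one.
        for (Map.Entry<String, byte[]> p : params.entrySet()) {
            zk.setData("/config/" + p.getKey(), p.getValue(), -1,
                    (rc, path, ctx, stat) -> { /* log failures here */ }, null);
        }

        // 3. Re-create ready. FIFO client ordering guarantees the updates
        //    above reach the service before this create does.
        zk.create("/config/ready", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Readers use the configuration only if /config/ready exists;
    // the watch notifies them when it appears or disappears.
    static boolean configIsReady(ZooKeeper zk) throws Exception {
        return zk.exists("/config/ready", true) != null;
    }
}
```

Because a reader that sees /config/ready also sees every update issued before it, the leader never has to wait for the individual setData calls to complete.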

Example Primitives 

Here's a list of some common primitives we can implement with the ZooKeeper API:
  • Dynamic configuration management (wait-free; see the watch-loop sketch after this list)
    • Configuration is stored in a znode z.
    • Initially, processes read z and set a watch on it. 
    • If z is updated, processes are notified; they read the new configuration and set a watch on it again.  
  • Rendezvous (involves a client waiting for an event)
    • A client starts a master and several workers, but doesn't yet know what addresses and ports to give the workers so they can connect to the master. 
    • The client creates a rendezvous znode z and passes its pathname to the master and the workers.
    • The master fills in z once it knows which addresses and ports it is using.
    • The workers read z with a watch set on it; if z is not yet filled in, they wait to be notified when it is updated.
  • Group membership (uses ephemeral nodes, wait-free; see the membership sketch after this list)
    • Group membership is represented by a znode z.
    • When a process starts, it creates an ephemeral child node under z. 
    • If the process fails or ends, its child node is automatically deleted by ZooKeeper. 
    • To get the group membership, a process simply lists z's children. To monitor membership, it sets a watch when it lists the children and re-sets the watch each time it is notified of a change. 
  • Simple locks (blocking primitive)
    • A lock is represented by a znode z.
    • To acquire the lock, a client attempts to create z with the ephemeral flag. 
      • If the create succeeds, the client has acquired the lock.
      • Otherwise, the client reads z with a watch set on it, so it is notified when the lock is released.
    • A client releases the lock by deleting z explicitly, or implicitly by failing (its session ends and the ephemeral znode is removed).
      • Other clients that have been watching z can then try to acquire the lock.
  • Ticket locks, i.e., locks without the herd effect (blocking primitive; see the lock sketch after this list)
    • The simple lock described above suffers from the herd effect: if many clients are waiting to acquire the lock, they all contend for it when it is released.
    • To avoid the herd effect, line up the clients waiting for the lock: 
      • Again let a lock be represented by a znode z.
      • To acquire the lock, each client creates an ephemeral child of z with the sequential flag, so it receives a sequence number in the queue.
      • If a client's child has the lowest sequence number of all of z's children, that client holds the lock. 
      • Otherwise, the client watches the child znode with the largest sequence number smaller than its own, and re-checks when that znode goes away. 
    • This implementation avoids the herd effect because each znode is watched by exactly one client; thus, only one client wakes up when the lock is released. 
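
Below are hedged sketches of three of these primitives using the Java client; every path and name is hypothetical. First, the wait-free configuration watch loop: read the configuration znode and re-register the watch each time it fires.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher {
    // Returns the current configuration and leaves a watch behind; when the
    // watch fires, the process reads again and re-registers the watch.
    static byte[] readAndWatch(ZooKeeper zk, String configPath)
            throws KeeperException, InterruptedException {
        return zk.getData(configPath,
                event -> {
                    try {
                        byte[] latest = readAndWatch(zk, configPath); // watch again
                        // ... hand `latest` to the application ...
                    } catch (Exception e) {
                        // session problems would be handled here
                    }
                },
                new Stat());
    }
}
```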
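
Next, group membership with ephemeral children: joining is a single create, and a crashed member disappears automatically when its session expires.

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembership {
    // Each process announces itself with an ephemeral child of the group znode.
    static void join(ZooKeeper zk, String groupPath, String memberName, byte[] info)
            throws KeeperException, InterruptedException {
        zk.create(groupPath + "/" + memberName, info,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // No explicit "leave": if the process dies, its session expires and
        // ZooKeeper deletes the ephemeral znode for it.
    }

    // Listing the children gives the current membership; the watch flag means
    // the caller is notified once on the next change and should then re-list.
    static List<String> members(ZooKeeper zk, String groupPath)
            throws KeeperException, InterruptedException {
        return zk.getChildren(groupPath, true);
    }
}
```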
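
Finally, the lock without the herd effect: each client joins the queue with an ephemeral, sequential child and watches only its immediate predecessor.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TicketLock {
    private final ZooKeeper zk;
    private final String lockPath; // e.g. "/locks/my-lock" (hypothetical)
    private String myNode;         // full path of this client's queue entry

    TicketLock(ZooKeeper zk, String lockPath) {
        this.zk = zk;
        this.lockPath = lockPath;
    }

    void lock() throws KeeperException, InterruptedException {
        // 1. Join the queue: an ephemeral, sequential child of the lock znode.
        myNode = zk.create(lockPath + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = myNode.substring(lockPath.length() + 1);

        while (true) {
            List<String> children = zk.getChildren(lockPath, false);
            Collections.sort(children);

            // 2. The lowest sequence number holds the lock.
            if (myName.equals(children.get(0))) {
                return;
            }

            // 3. Otherwise watch only the immediate predecessor, so a release
            //    wakes exactly one waiter.
            String predecessor = children.get(children.indexOf(myName) - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(lockPath + "/" + predecessor, event -> gone.countDown()) != null) {
                gone.await();
            }
            // The predecessor is gone (released or its session died); re-check,
            // because it may have died without ever holding the lock.
        }
    }

    void unlock() throws KeeperException, InterruptedException {
        zk.delete(myNode, -1); // explicit release; a crash releases it implicitly
    }
}
```

Watching only the predecessor is what removes the herd effect: a release wakes a single waiter, which still re-lists the children because its predecessor might have crashed while waiting rather than while holding the lock.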

Implementation 

The ZooKeeper service is an ensemble of servers. Every server can serve clients, but each client connects to exactly one server to submit its requests. As shown in the paper's architecture figure, when a client sends the service a request, that request is first prepared for execution by the request processor.

Write requests require coordination among the servers, so the service uses an atomic broadcast protocol to reach agreement before committing changes to the replicated database (each server holds a full replica). Specifically, write requests are forwarded to the leader; the followers receive message proposals containing state changes from the leader and agree on them.  

Read requests don't require coordination, so the server just reads the local database and sends a response back. 

The database on each server is in-memory and contains the entire data tree. To be able to recover, updates are write-ahead logged to disk before they are applied to the in-memory database.

Request processor

When the leader receives a write request, it calculates what the state will be once the write is applied and converts the request into a transaction that captures this new state; transactions are constructed to be idempotent. The leader must compute the future state because there may be outstanding transactions that have not yet been applied to the database. For example, a conditional write depends on the version a znode will have after those pending transactions are applied. 

Atomic broadcast

When the leader executes write requests, it broadcasts them through Zab, an atomic broadcast protocol. A majority of servers must agree on a proposal, so ZooKeeper can only work if a majority of servers are up and connected. 

To achieve high performance, the implementation:
  • Tries to keep the request processing pipeline full (e.g. thousands of requests) for high throughput. 
  • Uses TCP so message order is maintained by the network. 
  • Chooses the same leader to both create and propose transactions. 

Replicated database 

To ensure recovery doesn't take a long time, ZooKeeper periodically takes snapshots by doing a depth first traversal of the data tree. These snapshots are fuzzy because the system doesn't lock the state when taking the snapshot, so changes might occur during the snapshot. This won't affect the recovery, however, since transactions are idempotent. 

Read requests

Read requests are tagged with a zxid, the ID of the last transaction seen by the server that handles them. This zxid orders each read with respect to the write requests. 

As mentioned before, reads are processed locally, so read-dominant workloads are fast. However, a read may return a stale value, because the serving server may not yet have seen a write that has already committed elsewhere. For applications that need to read the most up-to-date value, ZooKeeper provides the asynchronous sync operation: a client issues sync before reading to guarantee that the read reflects all changes that happened before the sync was issued. 
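
A minimal sketch of the sync-then-read pattern with the Java client (path hypothetical). The client does not need to wait for sync's callback, because FIFO client ordering already guarantees that the read is processed after the sync.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncRead {
    static byte[] readLatest(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        // sync is asynchronous; the empty callback simply ignores the result.
        zk.sync(path, (rc, p, ctx) -> { }, null);
        // FIFO ordering: this read is processed after the sync, so it sees
        // all writes that committed before the sync was issued.
        return zk.getData(path, false, new Stat());
    }
}
```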

Conclusion

ZooKeeper is now an open source project. It has been a very popular core component for many big data processing frameworks like Hadoop and Kafka. It uses wait-free objects, and achieves high throughput on read-dominant workloads. ZooKeeper's simple interface allows other applications to build a variety of useful primitives on top of it. Finally, compared to Paxos, ZooKeeper emphasizes leader election and following the leader. Raft, developed years later, would follow this same approach.

