tldr: Spark

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

Spark (Resilient Distributed Datasets) [2012]

Zaharia et al.

Introduction 

This paper is motivated by two real-world use cases where existing cluster computing frameworks (e.g. MapReduce) are inefficient: 
  • Iterative ML/graph algorithms
  • Interactive data mining (running multiple ad-hoc queries over the same subset of the data)
Both use cases reuse intermediate results across several computations. A framework like MapReduce handles this inefficiently because the only way to share data between computations is to write it out to an external stable storage system (e.g. a distributed file system) and read it back.

The authors instead introduce a data structure called the resilient distributed dataset (RDD), designed to efficiently persist intermediate data in memory for a broad class of applications. They implement RDDs in a system called Spark, and expose them through an interface based on coarse-grained transformations (e.g. map or filter) that apply the same operation to many data items at once. 

Resilient Distributed Datasets 

An RDD is "a read-only, partitioned collection of records". Users create RDDs by applying transformation operations (e.g. map or filter) to data in stable storage or to other RDDs. They then use action operations, which return a value to the application or export data to a storage system. Some example actions:
  • Count, which returns the number of elements in the dataset. 
  • Save, which exports the dataset to a storage system. 
Separately from actions, a persist operation marks RDDs that will be reused in the future. Persisted RDDs are stored in memory by default and spill to disk if there isn't enough space; the user can also specify which RDDs to prioritize keeping in memory. 
If a partition of an RDD is lost, the RDD has enough lineage information to recompute just that partition by replaying its transformations. This makes recovery efficient: the system doesn't have to checkpoint and log all the data, and it doesn't have to recompute partitions that were never lost. 
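As a concrete illustration, here is a minimal Scala sketch loosely based on the console log-mining example in the paper. It assumes spark is an already-constructed SparkContext handle, the HDFS paths are placeholders, and persist()/saveAsTextFile() stand in for the paper's persist and save operations.

    // Build an RDD from a file in stable storage (placeholder path).
    val lines = spark.textFile("hdfs://...")
    // Transformation: derive a new RDD containing only error messages.
    val errors = lines.filter(line => line.startsWith("ERROR"))
    // Hint that this RDD will be reused, so keep it in memory if possible.
    errors.persist()
    // Actions trigger actual computation and return results to the application.
    val numErrors = errors.count()
    // Export the dataset back to stable storage (placeholder path).
    errors.saveAsTextFile("hdfs://...")

If a node holding a partition of errors fails, only that partition is rebuilt by re-running the filter on the corresponding partition of lines.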

RDD versus DSM

The paper tabulates a comparison of RDDs with distributed shared memory (DSM), a model in which applications read and write to arbitrary locations in a global address space. 


Some of the advantages of RDDs that the authors highlight are:
  • RDDs are created through coarse-grained transformations, which means efficient fault tolerance. 
  • RDDs are immutable, so stragglers can be mitigated by running backup copies of slow tasks, as in MapReduce. 
  • Since operations are bulk operations, the runtime can optimize performance by scheduling tasks to improve data locality. 

Spark Programming Interface 

To use Spark, programmers write a driver program that connects to a cluster of workers. (The authors chose Scala as Spark's programming language.) The driver program defines RDDs and invokes actions on them, while the workers store the partitions of persisted RDDs in RAM. 
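To make the driver/worker split concrete, here is a hedged sketch of a small driver program. The import path and the SparkContext constructor are assumptions modeled on early Spark releases, and the master URL and input path are placeholders.

    import spark.SparkContext

    object ExampleDriver {
      def main(args: Array[String]): Unit = {
        // Connect the driver to a cluster of workers via a master URL (placeholder).
        val sc = new SparkContext("spark://master:7077", "tldr-example")
        // RDDs are defined here in the driver; their partitions live in workers' RAM.
        val data = sc.textFile("hdfs://...")
        val lengths = data.map(line => line.length)
        // An action ships tasks to the workers and brings a result back to the driver.
        println(lengths.reduce(_ + _))
      }
    }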

To a programmer, five parts of an RDD are exposed (a rough sketch of this interface follows the list): 
  • A set of partitions: atomic pieces of the dataset
  • A set of dependencies on parent RDDs
  • A function for computing the dataset based on its parents
  • Metadata about its partitioning scheme
  • Metadata about data placement
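The following Scala sketch approximates the shape of that common interface as described in the paper; it is not Spark's actual internal API, and the auxiliary types are placeholders.

    // Placeholder types standing in for Spark's real ones.
    trait Partition
    trait Dependency
    trait Partitioner

    // Approximate shape of the common RDD interface described in the paper.
    abstract class RDD[T] {
      def partitions: Seq[Partition]                         // atomic pieces of the dataset
      def dependencies: Seq[Dependency]                      // edges to parent RDDs
      def compute(split: Partition): Iterator[T]             // build a partition from its parents
      def partitioner: Option[Partitioner]                   // metadata: hash/range partitioning, if any
      def preferredLocations(split: Partition): Seq[String]  // metadata: placement hints for data locality
    }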
There are two types of dependencies:
  • Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD.
    • Allow pipelined execution on one cluster node, e.g. for each element, apply map then filter. 
    • Recovery after node failure is more efficient since only lost parent partitions need to be recomputed. 
  • Wide dependencies: each partition of the parent RDD may be used by multiple child partitions. 
    • Require data from all parent partitions to be available, typically via a shuffle across nodes.
    • Recovery after node failure is more expensive, since a lost partition may require recomputing partitions from all of its parents.
For example, map and filter yield narrow dependencies, while groupByKey yields wide ones; the sketch below illustrates both kinds.
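As a hedged illustration (same assumptions as the earlier sketches: spark is a SparkContext and the path is a placeholder), a simple word count exercises both kinds of dependency.

    val words = spark.textFile("hdfs://...")
    // Narrow: each output partition of flatMap/map depends on exactly one parent
    // partition, so the two steps can be pipelined on the node holding it.
    val pairs = words.flatMap(line => line.split(" ")).map(word => (word, 1))
    // Wide: groupByKey needs records from all parent partitions, forcing a shuffle.
    val counts = pairs.groupByKey().mapValues(group => group.sum)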

Conclusion 

RDDs are an efficient, fault-tolerant, widely applicable abstraction for in-memory computation on clusters. Spark was open-sourced in 2010 and is now maintained as an Apache project.
