Posts

Showing posts from February, 2018

tldr: Spinnaker

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper. Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore (2011), Rao et al. Introduction: Spinnaker is a datastore developed at IBM to run on a large cluster of commodity servers in a single data center. Its features include key-based range partitioning, 3-way replication, and a transactional get-put API with a choice of strong read consistency or timeline read consistency. Paxos: Spinnaker's replication protocol is a variation of the Multi-Paxos protocol. It will be discussed in further detail later, but for now, here's an overview of Paxos. Paxos lets 2F+1 replicas reach consensus on a value while tolerating up to F failures. Once consensus is reached, the value can be accessed as long as a majority of replicas are up. Paxos has two phases: leader election and propose. A node that wants to propose a…
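The 2F+1/F relationship above comes down to majority quorums: any two majorities of 2F+1 replicas intersect, so a chosen value survives up to F failures. Here's a toy illustration (my own helper names, not Spinnaker's actual protocol):

```python
# Toy illustration of Paxos-style majority quorums: with 2F+1 replicas,
# any two majorities intersect, so a chosen value survives up to F failures.

def majority(n_replicas: int) -> int:
    """Smallest set size that forms a quorum among n_replicas."""
    return n_replicas // 2 + 1

def is_decided(acks: int, n_replicas: int) -> bool:
    """A value is chosen once a majority of replicas have accepted it."""
    return acks >= majority(n_replicas)

F = 2                      # failures to tolerate
n = 2 * F + 1              # 5 replicas
print(majority(n))         # 3
print(is_decided(3, n))    # True: the value stays readable with F=2 replicas down
print(is_decided(2, n))    # False: 2 acks is not a quorum of 5
```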

tldr: Raft + Implementation Tips

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper. Raft (Extended Version) + Implementation Tips (2014), Ongaro et al. Note: for implementation tips, see the end of this blog post. Introduction: Consensus algorithms. The replicated state machine approach is useful for providing fault tolerance in distributed systems. It lets a cluster of unreliable machines continue operating even when some of its members fail. In this approach, machines usually use a replicated log to stay in sync. As shown in the figure below, each server's log stores the same commands in the same order, and every state machine executes the commands from its own log in order. State machines are deterministic, so each produces the same sequence of outputs. Once a majority of machines have responded, the command is done executing. Consensus algorithms ensure that the replicated log stays consistent. In the figure above, clients send commands to…
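The replicated-log idea above can be sketched in a few lines: if two servers apply the same log of commands in the same order to a deterministic state machine, they end up in identical states. (A toy of my own, not Raft itself; the command set is made up.)

```python
# Minimal replicated-state-machine sketch: two servers fold the same
# command log, in order, into the same final state.

def apply_log(log):
    """Deterministically apply a log of (op, amount) commands."""
    state = 0
    for op, amount in log:
        if op == "add":
            state += amount
        elif op == "mul":
            state *= amount
    return state

log = [("add", 5), ("mul", 3), ("add", 1)]
server_a = apply_log(log)
server_b = apply_log(log)
assert server_a == server_b == 16  # same log, same order => same state
```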

tldr: Fault-Tolerant Virtual Machines

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper. Design of a Practical System for Fault-Tolerant Virtual Machines (2010), Scales et al. Introduction: The primary/backup approach is a common way of implementing fault-tolerant servers. The primary server continuously sends all of its changes to a backup server, so when the primary fails, the backup can take over without data loss and without clients noticing that a failure happened. However, this approach requires a large amount of bandwidth. Another way of implementing fault-tolerant servers is the state-machine approach, which uses less bandwidth: instead of replicating data, it replicates the inputs that cause the data/state to change. A state machine is deterministic: if state machine A and state machine B start in the same state and receive the same input requests in the same order, they will end in the same state. All…
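The bandwidth saving comes from shipping inputs instead of state. A toy sketch of that idea (the class and request format are my own, not the paper's): the primary forwards only each request to the backup, and determinism guarantees the backup's replay reproduces the primary's state.

```python
# Toy state-machine replication: forward requests, not data, and rely on
# determinism so replaying them yields an identical state on the backup.

class KVMachine:
    """A deterministic key-value state machine."""
    def __init__(self):
        self.state = {}

    def apply(self, request):
        key, value = request
        self.state[key] = value

primary, backup = KVMachine(), KVMachine()
requests = [("x", 1), ("y", 2), ("x", 3)]  # the only data on the wire
for r in requests:
    primary.apply(r)   # primary executes the request...
    backup.apply(r)    # ...and forwards just the request to the backup
assert primary.state == backup.state == {"x": 3, "y": 2}
```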

tldr: Google File System

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper. The Google File System (2003), Ghemawat et al. The Google File System (GFS) is a "scalable, distributed file system for large distributed data-intensive applications". It was developed at Google, and at the time the paper was written, GFS clusters as large as 1,000 nodes with 300 TB of disk storage were already deployed and being accessed by hundreds of clients. Its design prioritizes performance and simplicity over consistency. Guiding Assumptions: Here are the assumptions that guided GFS's design: The file system consists of hundreds or thousands of commodity machines, and failure is common; GFS must continuously monitor, tolerate, and recover from machine failures. The files being stored are large, usually more than 100 MB; GFS should optimize for multi-GB files (the common case) and support, but not optimize for, small files (…

tldr: MapReduce + A Major Step Backwards?

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper. MapReduce: Simplified Data Processing on Large Clusters (2004), Dean et al. MapReduce is a functional programming pattern, popularized by Google for big data processing. It abstracts away parallelization, fault tolerance, and load balancing, allowing programmers to run jobs on a distributed system through a simple but powerful two-function API. Programming Model: The MapReduce library consists of two functions. Map: (k1, v1) --> list(k2, v2). A user-defined function that takes a key-value pair and outputs a list of intermediate key-value pairs. Reduce: (k2, list(v2)) --> list(v2). A user-defined function that takes an intermediate key and a list of its corresponding values, and merges those values into a possibly smaller list of values. The process from beginning to end: the user specifies a list of input key-value pairs. Map takes each input…
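The two signatures above map directly onto the classic word-count example. A minimal single-process sketch (the real library distributes the map, shuffle, and reduce steps across machines):

```python
# Word count following the Map/Reduce signatures:
#   map:    (k1, v1)       -> list((k2, v2))
#   reduce: (k2, list(v2)) -> list(v2)
from collections import defaultdict

def map_fn(filename, contents):
    """Emit (word, 1) for every word occurrence in one input file."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """Merge a word's counts into a (possibly smaller) list of values."""
    return [sum(counts)]

inputs = [("a.txt", "the cat sat"), ("b.txt", "the cat ran")]

# Shuffle: group intermediate pairs by key before reducing.
groups = defaultdict(list)
for k1, v1 in inputs:
    for k2, v2 in map_fn(k1, v1):
        groups[k2].append(v2)

result = {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}
print(result)  # {'the': [2], 'cat': [2], 'sat': [1], 'ran': [1]}
```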

tldr: Processing Data Warehousing Queries on GPU Devices

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices (2013), Yuan et al. Much research has shown that, at the microbenchmark level, databases using GPUs have the potential to be much faster than databases using CPUs: CPUs have a few cores, while GPUs can have thousands of smaller cores designed to execute in parallel. However, no major data warehouse (e.g. Teradata, DB2, Oracle, SQL Server, Hive, Pig, Tenzing) truly uses GPUs in production. The paper investigates why by analyzing query processing on GPUs. Specifically: Yin: PCIe data transfer between host memory (CPU) and device memory (GPU). Yang: kernel execution, where the query runs on data in device memory. (A kernel is a function compiled to execute on a GPU.) The authors focus on OLAP queries with a star schema and a column-store format for their query engine. They use the S…
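The Yin/Yang split can be made concrete with back-of-the-envelope arithmetic: total GPU query time is PCIe transfer plus kernel execution. The bandwidth and throughput numbers below are illustrative assumptions of mine, not figures from the paper; they just show how the host-to-device copy can dominate.

```python
# Back-of-the-envelope split of GPU query time into transfer + kernel.
# All rates are assumed, for illustration only.

def gpu_query_time_s(data_gb, pcie_gb_per_s, kernel_gb_per_s):
    transfer = data_gb / pcie_gb_per_s    # Yin: host -> device copy over PCIe
    kernel = data_gb / kernel_gb_per_s    # Yang: on-device kernel execution
    return transfer, kernel

transfer, kernel = gpu_query_time_s(8.0, pcie_gb_per_s=8.0, kernel_gb_per_s=100.0)
print(f"transfer={transfer:.2f}s kernel={kernel:.2f}s")
# With these assumed rates, the PCIe copy takes far longer than the kernel.
```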

tldr: C-Store

Today I'm starting a new series, tldr, to summarize papers I'm currently reading. Inspired by the morning paper. C-Store: A Column-oriented DBMS (2005), Stonebraker et al. C-Store is a relational DBMS optimized for reads. It's suitable for OLAP systems and data warehouses, which load and aggregate data from several OLTP systems. The data in data warehouses is often arranged in a star schema, so named because a large fact table in the middle of the "star" contains columns of attributes and foreign-key references to dimension tables on the points of the "star". Star Schema. Source: https://qph.ec.quoracdn.net/main-qimg-f58dd8f80702e7ec349d293d531200b3-c In a row-store representation, where data values in the same row are stored together, OLAP queries can be very inefficient when they need to access many rows. Fact tables often have many columns (e.g. 100), which means that a typical query touching only a few columns (e.g. 5) will w…
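The 100-columns-versus-5 point can be illustrated with a toy layout comparison (the list-of-lists representation is my own simplification, not C-Store's storage format): a row-store scan touches every value in each row, while a column-store scan reads only the referenced columns.

```python
# Toy contrast of row-store vs column-store scans for a query that
# touches 5 of 100 fact-table columns.

n_rows, n_cols, cols_needed = 1_000, 100, 5
row_store = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
col_store = [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]

# Row store: each row is contiguous, so the scan reads all 100 columns.
row_values_read = sum(len(row) for row in row_store)                  # 100,000

# Column store: only the 5 referenced columns are scanned.
col_values_read = sum(len(col_store[c]) for c in range(cols_needed))  # 5,000

print(row_values_read // col_values_read)  # 20x fewer values read
```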