tldr: C-Store

Today I'm starting a new series, tldr, to summarize papers I'm currently reading. It's inspired by The Morning Paper.

C-Store: A Column-oriented DBMS (2005)

Stonebraker et al.

C-Store is a relational DBMS optimized for reads. It's suitable for OLAP systems and data warehouses, which load and aggregate data from several OLTP systems. The data in data warehouses are often arranged with a star schema. (Named because a large fact table in the middle of the "star" contains columns of attributes and foreign key references to dimension tables on the points of the "star".)

Star Schema. Source: https://qph.ec.quoracdn.net/main-qimg-f58dd8f80702e7ec349d293d531200b3-c
In a row-store representation, where data values in the same row are stored together, OLAP queries can be very inefficient when they need to access many rows. Fact tables often have many columns (e.g. 100), which means that a typical query touching only a few of them (e.g. 5) wastes disk bandwidth and memory loading data from the many irrelevant columns.
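A back-of-the-envelope sketch of that waste, using the hypothetical numbers above (100 columns, 5 touched) and an assumed uniform 4-byte value size:

```python
# Hypothetical fact table: 1M rows x 100 columns of 4-byte values.
# A query touching 5 columns still scans whole rows in a row store,
# but reads only the 5 relevant columns in a column store.
NUM_ROWS = 1_000_000
NUM_COLS = 100
COLS_TOUCHED = 5
VALUE_BYTES = 4

row_store_bytes = NUM_ROWS * NUM_COLS * VALUE_BYTES      # every row read whole
col_store_bytes = NUM_ROWS * COLS_TOUCHED * VALUE_BYTES  # only relevant columns

print(row_store_bytes // col_store_bytes)  # -> 20
```

So under these assumptions a row store reads 20x more data than it needs, before any compression is even considered.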

Column-store representation

This brings us back to C-Store. It uses a column-store representation, which means that:
  • Reads only have to load the columns relevant to a given query. 
  • Values in columns usually have lower entropy than values in rows. This means that scarce disk bandwidth can be saved by using more CPU cycles to compress columns. 
    • Data values can be "densepacked" in storage, since within each column they're the same data type.
    • Data values themselves can be compressed into a more compact form, since within each column they have relatively few distinct values.
C-Store also stores overlapping groups of columns sorted on different attributes, which lets it pick the best sort order for each query; compression keeps the storage overhead of this redundancy manageable. This will be further explained later in the post.

Hybrid architecture for both updates and reads 

"There is a tension between providing updates and optimizing data structures for reading. ...columns of data are maintained in entry sequence order ... allows efficient insertion of new data items, either in batch or transactionally, at the end of the column. However, the cost is a less-than optimal retrieval structure, because most query workloads will run faster with the data in some other order. However, storing columns in non-entry sequence will make insertions very difficult and expensive."
C-Store's architecture tries to handle this "tension" between updates and reads by having a small writeable store (WS) for updates and inserts, and a large read-optimized column store (RS) for reads. A tuple mover asynchronously transfers batches of data from the WS to the RS. This kind of insert is the only insert that the RS supports.

SQL inserts go to the WS; deletes are marked in the RS and later purged by the tuple mover. An update is implemented as an insert plus a delete.

In the RS, columns of data are compressed with encodings depending on:
  1. Whether the column is sorted according to its own values or to values in another column.
  2. The proportion of distinct values it contains. 
For example, a column sorted according to another column's values and with only a few distinct values is represented by a list of tuples (v, b), where v is the data value and b is a bitmap of that data value's positions in the column.

The WS is also a column store, but does not compress data values since it's relatively much smaller than the RS.

Data model of overlapping projections

Instead of the tables and corresponding indices standard in row-store systems, C-Store implements projections. A projection is made up of at least one column from a table, T. It can contain columns from other tables, but it has the same number of rows as T. A projection is horizontally partitioned into at least one segment to take advantage of C-Store's shared-nothing machine environment. (A data warehouse is most practically run on a grid of cheap machines, each with its own memory, disk, and CPU.)

To answer arbitrary queries, every column in every table must be stored in at least one projection. Furthermore, segments from different projections must be joined to reconstruct logical rows. C-Store achieves this through:
  • Storage keys. A different storage key corresponds to each data value in every column in a segment. A data value's storage key is its column-based position in the segment: for example, the fifth data value in a column has storage key 5. Within a segment, data values from different columns with equal storage keys are in the same logical row. 
  • Join indices. Let's say a table, T, can be covered by two projections, T1 and T2. Assume T1 has M segments and T2 has N segments. Then a join index from T1 to T2 consists of M tables, one per segment of T1. Each row contains the segment ID and storage key of the data value in T2 that corresponds to a data value in T1. (The paper's figure shows the join index that maps the EMP3 projection to the EMP1 projection.) In general, reconstructing a table T covered by projections T1, T2 ... TK means finding a path of join indices. 
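A toy reconstruction through a join index, under the scheme above. All names and data are made up for illustration, and storage keys are 0-based here for simplicity (the paper counts from 1):

```python
# A join index maps each storage key in projection T1 to the
# (segment_id, storage_key) of the matching row in projection T2.

# T1: one segment, sorted on name.
t1 = {"name": ["alice", "bob", "carol"]}

# T2: two segments, sorted on salary.
t2_segments = {
    1: {"salary": [50, 60]},
    2: {"salary": [70]},
}

# One join-index entry per row of T1: alice->60, bob->70, carol->50.
join_index = [(1, 1), (2, 0), (1, 0)]

def reconstruct(t1, join_index, t2_segments):
    """Rebuild logical rows by following the join index."""
    rows = []
    for storage_key, name in enumerate(t1["name"]):
        seg_id, t2_key = join_index[storage_key]
        rows.append((name, t2_segments[seg_id]["salary"][t2_key]))
    return rows

print(reconstruct(t1, join_index, t2_segments))
# -> [('alice', 60), ('bob', 70), ('carol', 50)]
```

Within a segment, equal storage keys identify the same logical row, so only the join index (not the data itself) has to record the cross-projection mapping.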

Snapshot isolation

C-Store implements efficient snapshot isolation for read-only transactions. A read-only transaction can only access data at a point in time when there are guaranteed to be no uncommitted transactions. This means that no locks are needed for read-only transactions. 

The problem thus becomes: which data values in the WS and the RS are visible to a read-only transaction at a specific point in time? C-Store solves this with a high water mark and metadata vectors. 

High water mark

The high water mark is defined as the most recent event time in the past "before which we can guarantee that there are no uncommitted transactions." Its units are epochs, whose length is configurable by the user but usually on the order of seconds. The HWM's value is determined by one node, which the paper calls the timestamp authority (TA).

Assume the current epoch is e. Then, periodically:
  1. The TA sends an "end of epoch" message to the other nodes.
  2. The other nodes increment their own local epoch counters to e+1.
  3. New transactions now run with the incremented epoch, e+1.
  4. Nodes send an "epoch complete" message to the TA once all transactions that began in epoch e (or earlier) are done.
  5. Once the TA receives all "epoch complete" messages for epoch e, it sets the HWM to be e and sends the value e to the other nodes. 
Queries run by users usually access the most recent data, using a maintained map of epochs to real-world times. 
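The five steps above can be sketched as a single-process simulation. The `TimestampAuthority` and `Node` classes are my own illustrative names, and message passing is collapsed into direct method calls:

```python
# Minimal sketch of the epoch / high-water-mark protocol described above.

class Node:
    def __init__(self):
        self.epoch = 0
        self.active = {}  # transaction id -> epoch in which it began

    def begin(self, txn):
        self.active[txn] = self.epoch

    def commit(self, txn):
        del self.active[txn]

    def oldest_active_epoch(self):
        return min(self.active.values(), default=None)

class TimestampAuthority:
    def __init__(self, nodes):
        self.nodes = nodes
        self.epoch = 0
        self.hwm = -1  # no epoch guaranteed quiescent yet

    def end_epoch(self):
        """Steps 1-3: tell every node to move to the next epoch."""
        closing = self.epoch
        self.epoch += 1
        for n in self.nodes:
            n.epoch = self.epoch
        return closing

    def try_raise_hwm(self, closing):
        """Steps 4-5: once no node has a transaction that began in the
        closing epoch or earlier, the HWM advances to that epoch."""
        if all(n.oldest_active_epoch() is None or
               n.oldest_active_epoch() > closing
               for n in self.nodes):
            self.hwm = closing
        return self.hwm

nodes = [Node()]
ta = TimestampAuthority(nodes)
nodes[0].begin("t1")          # t1 starts in epoch 0
closing = ta.end_epoch()      # epoch advances to 1
print(ta.try_raise_hwm(closing))  # -> -1 (t1 still running)
nodes[0].commit("t1")
print(ta.try_raise_hwm(closing))  # -> 0 (epoch 0 is now quiescent)
```

The key property is that the HWM only ever moves to an epoch with no straggling transactions, so a read-only query pinned at the HWM never sees uncommitted data.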

Insertion vector and deleted record vector

Two vectors are referenced to check each data value's visibility to a query. An insertion vector is stored for each segment in the WS, which contains the epochs in which each record was inserted. A deleted record vector for each projection is also maintained, which contains the epochs in which each record was deleted or 0 if the record has not been deleted. 
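Putting the two vectors together with the HWM, the visibility rule can be sketched as follows (the function and variable names are illustrative; real epochs are assumed to start at 1 so that 0 can mean "never deleted"):

```python
# A record is visible at the HWM if it was inserted in an epoch at or
# before the HWM and was not deleted at or before the HWM.

def visible(insertion_epoch, deletion_epoch, hwm):
    inserted = insertion_epoch <= hwm
    deleted = deletion_epoch != 0 and deletion_epoch <= hwm
    return inserted and not deleted

insertion_vector = [1, 2, 5]  # epoch in which each record was inserted
deleted_vector   = [0, 4, 0]  # deletion epoch, or 0 if never deleted

hwm = 4
print([visible(iv, dv, hwm)
       for iv, dv in zip(insertion_vector, deleted_vector)])
# -> [True, False, False]
```

Record 0 was inserted before the HWM and never deleted; record 1 was deleted within the snapshot; record 2 was inserted after the HWM, so neither is visible.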

Conclusion

This paper has stood the test of time because it shows how to build a scalable database using the relational data model and SQL (around 2005, NoSQL was gaining popularity). It further popularized the column-store representation for OLAP workloads. Since then, the commercial version of C-Store, Vertica, was founded and later sold. 
