tldr: What Goes Around Comes Around

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

tldr: The Relational Model ft. What Goes Around Comes Around [2005]

Stonebraker et al.

Stonebraker and Hellerstein go over the history of data model proposals, since many similar ideas have been recycled over the years. The goal of this paper is to help "future researchers avoid replaying history". 

Hierarchical (IMS): Late 1960s and 1970s

"Lesson 2: Tree structured data models are very restrictive"

This data model arranges data in trees of record types (a collection of named fields with data types). Each record type except for the root has a unique parent record type. In a tree structure, data must often be repeated (which is undesirable because data could become inconsistent). Also, the tree structure can't represent a record type that doesn't have a parent.

"Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard."

IMS's data manipulation language is a "record-at-a-time" language, which means the programmer must first implement the algorithm for their query, which IMS then executes. This makes the programmer's life hard because there are often multiple ways to write a query, and optimization can be unclear.

"Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data"

IMS supported several possible storage formats for the data. For performance reasons, they also imposed rules on possible operations. For example, record types each have a unique hierarchical sequence key (HSK), created by concatenating the HSKs of ancestor records in depth-first, left-to-right order. However, the storage format which hashes root records' keys didn't allow "get next", since there isn't an easy way to order  hashed records by HSKs. These restrictive rules made it hard to tune applications by changing the storage format.

"Lesson 1: Physical and logical data independence are highly desirable"

Physical data independence is the ability of a database application to continue to run, regardless of tuning at physical level. This is important because when new code is added to the application, tuning the new version of the application may involve changing the storage format. (IMS has limited physical data independence.)

Logical data independence is the ability of a database application to continue to run, regardless of logical changes, like new record types. (IMS has logical data independence since its data manipulation language is based on a logical database, not on an actual physical database.)

In general, data independence is important for database applications to have a longer lifetime without expensive maintenance.

Network (CODASYL): 1970s

"Lesson 5: Networks are more flexible than hierarchies but more complex"

The network data model is a network of record types (with keys) connected by directed relationship arcs. A record type can have more than one parent. Parent record instances have a 1-to-n relationship with child record instances. Finally, there must be at least one entry point, which is a record type that's not any other record type's child. 

Networks avoid being boxed into some of the hierarchical model's constraints - for example, record types no longer need parents. However, they have poorer physical and no logical data independence. 

Furthermore, a CODASYL programmer is faced with a much more complex program: they have to move around in multi-dimensional space, and have to keep track of the last record touched by the application, the last record of each record type touched, and the last record of each relationship arc touched. 

"Lesson 6: Loading and recovering networks is more complex than hierarchies"

Loading CODASYL networks is more complex because larger number of records have to ordered into relationship arcs, and involve many disk seeks, so optimization became important. Unfortunately, each application had to write its own load program.

Relational: 1970s and early 1980s

In the relational model, a database is a collection of at least one relation. Specifically, a relation is made up of a relation schema and a relation instance. 

relation schema contains the relation's name, field names (e.g. "name"), and field domains (e.g. "string"). 

relation instance is a table, which contains a set of unique records. Each record has the same number of fields as the relation schema. 

The degree of a relation is the number of fields. The cardinality of a relation instance is the number of records.

"Lesson 8: Logical data independence is easier with a simple data model than with a complex one."

The tables' simplicity makes logical data independence more likely. Furthermore, the relational model is flexible enough to represent almost anything.

"Lesson 7: Set-a-time languages are good, regardless of the data model, since they offer much improved physical data independence."

Data is accessed through a high-level set-at-a-time data manipulation language, which provides more physical data independence. Thus, there was actually no need for a physical storage proposal.

"Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers." 

System R and INGRES showed that efficient relational model implementations are possible, and that query optimizers can beat almost any programmer constructing query plans using record-at-a-time languages. 

Popular posts from this blog

Space Race: a simple Rust game built on the Bevy framework

Building a Toy Language Compiler in Go

tldr: Raft + Implementation Tips