tldr: Architecture of a Database System


Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

tldr: Architecture of a Database System [2007]

Hellerstein, Stonebraker, Hamilton

This paper focuses on DBMSs, specifically RDBMSs, the "most mature and widely used database systems in production today."
The figure above shows a generic RDBMS. Before going into detail on each of the 5 main components, the paper walks through an example of a single-query transaction: 
  • At the airport, a gate agent clicks on a link to request a flight's passenger list. 
  • The gate agent's computer (the client) calls an API which establishes a connection over a network with the Client Communications Manager
    • This component is responsible for: 
      • Establishing and remembering the connection
      • Responding to SQL commands from the client
      • Returning data and control messages
    • There are a couple types of arrangements:
      • Client-server system: direct connection between client and database server (e.g. via JDBC)
      • Three-tier system: client connects with middle-tier server (e.g. web server), which facilitates communication between the client and the database
  • The Process Manager receives the SQL command. This component is responsible for: 
    • Assigning a thread of computation to the command
    • Handling admission control, which means deciding whether the system should immediately execute the query or defer until enough system resources are available 
  • The Relational Query Processor begins to execute the SQL command. This component is responsible for:
    • Checking that the user is authorized to run the query
    • Compiling the SQL query into a query plan
    • Running the query plan with the plan executor, which is made up of a set of operators (e.g. joins, selection, projection, aggregation, sorting)
  • Some of the query plan's operators request data from the Transactional Storage Manager. 
    • This component is responsible for:
      • Managing data access calls (read)
      • Managing data manipulation calls (create, update, delete) 
    • This component's subcomponents are:
      • Access methods: algorithms and data structures (e.g. tables and indexes) to organize and access data on disk 
      • Buffer manager: decides what data to transfer between disk and memory buffers, and when
      • Lock manager: locks are acquired before accessing data to ensure correct execution with concurrent queries 
      • Log manager: if the gate agent's query had included updates to the data, the log manager would ensure that the transaction is durable
  • Once all the query operators have the data they requested, the stack of activities unwinds: 
    • Access methods return to query operators, which compute result tuples
    • Result tuples are shipped in buffers to the Client Communications Manager
    • Client Communications Manager returns the result tuples to the client 
    • The transaction manager, process manager, and communications manager clean up state 
The above example didn't mention any of the Shared Components and Utilities, which are involved throughout the process. For example: 
  • Catalog manager: Query processor uses this component during authentication, parsing, and query optimization
  • Memory manager: Used whenever memory is dynamically allocated or deallocated 

The Process Manager

Process Models

The paper first discusses several possible DBMS process models, which sketch out how concurrent user requests are modeled, and how to map them to OS processes/threads. Before going over the process models, here are some important definitions: 
  • OS process
    • OS thread + private, unique address space
    • Scheduled by kernel
  • OS thread
    • Program execution unit
    • Has access to memory of other threads executing in the same multi-threaded OS process
    • Scheduled by kernel scheduler
  • Lightweight thread
    • Scheduled by application-level thread scheduler
    • Appears to the OS scheduler as a single thread of execution, since all the lightweight threads run within a single OS process
    • Comparison with OS threads:
      • Faster thread switching
      • But blocking operations in any lightweight thread will block all other lightweight threads
  • DBMS worker
    • Thread of execution in DBMS, which handles a client's requests 
    • 1:1 between DBMS worker and DBMS client 
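To make the lightweight-thread definition concrete, here is a toy cooperative scheduler in Python (the names and the generator-based design are invented for illustration; generators stand in for lightweight threads, all running inside one OS thread):

```python
from collections import deque

def scheduler(threads):
    """Round-robin over lightweight threads; returns the interleaved trace."""
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))   # run the thread until it yields
            ready.append(t)         # re-queue it for another turn
        except StopIteration:
            pass                    # thread finished; drop it
    return trace

def worker(name, steps):
    for i in range(steps):
        # If this were a blocking OS call instead of a cooperative yield,
        # the whole process, and every other lightweight thread, would stall.
        yield f"{name}:{i}"

trace = scheduler([worker("A", 2), worker("B", 2)])
# trace interleaves the two workers: ['A:0', 'B:0', 'A:1', 'B:1']
```

The fast switching (a plain function call) and the shared fate on blocking calls both fall out of the fact that the OS only sees one thread.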
Now, we're ready to look at 3 DBMS process models:

  • Process per DBMS worker model
    • One OS process hosts each DBMS worker (direct mapping)
    • Pros:
      • Most straightforward model 
      • OS scheduler manages timesharing for DBMS workers
      • OS can detect common bugs like memory overruns 
    • Cons:
      • DBMS connections share some in-memory data structures, so they must be explicitly allocated in shared memory 
      • Scaling issues: process uses more memory than a thread, and process switches are slower 
    • Current use:
      • Still heavily used
        • IBM DB2: defaults to this model on OSs that don't support scalable OS threads
        • PostgreSQL: exclusive process model 
        • Oracle: default process model
  • Thread per DBMS worker model 
    • Single multi-threaded process hosts all DBMS workers 
    • Each client connection allocates a new thread 
    • Pros:
      • Scales well to large numbers of concurrent connections 
      • Data sharing is easy 
    • Cons:
      • OS doesn't protect threads from common bugs (like memory overruns, stray pointers)
      • Debugging is tricky (race conditions) 
      • Porting is harder (threading interfaces and multi-threaded scaling behavior differ across OSs) 
    • Current use:
      • Commonly used:
        • IBM DB2: defaults to this model on OSs that support scalable OS threads
        • Microsoft SQL Server
        • MySQL 
      • Two major current variants:
        • OS thread per DBMS worker
        • DBMS thread per DBMS worker 
          • Pros: Avoids OS scheduler scaling/performance issues
          • Cons: High implementation costs, poor development tools support, substantial software maintenance 
  • Process pool model
    • A pool of processes hosts DBMS workers 
    • Client requests are sent to a process in the pool, or wait until a process becomes available 
    • Pros:
      • Same pros as process per DBMS worker model
      • But more memory efficient 
      • And scales well to large numbers of users
    • Current use:
      • Still currently used
        • Oracle: optional model suggested by Oracle when there are large numbers of concurrently connected users 
      • Variant: thread pool model
        • Microsoft SQL Server: default model
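The pooling idea can be sketched with Python's standard library (this uses the thread-pool variant via concurrent.futures; handle_query is an invented stand-in for the parse/optimize/execute path, not a real DBMS call):

```python
from concurrent.futures import ThreadPoolExecutor

def handle_query(sql):
    # stand-in for parsing, optimizing, and executing the statement
    return f"result of {sql!r}"

# Two pooled workers serve five client requests; requests beyond the pool
# size wait in the executor's queue until a worker frees up.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(handle_query, [f"SELECT {i}" for i in range(5)]))
```

This is the memory-efficiency argument in miniature: five concurrent clients are served with only two workers' worth of state.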

Parallel Architecture 

The paper also discusses DBMSs on parallel hardware. There are several models of parallelism:

  • Shared Memory
    • All processors access the same RAM and disk
    • The 3 process models we discussed in the previous section run well on shared memory hardware architecture 
    • Supported by all major commercial DBMS providers 
  • Shared Nothing 
    • Cluster of independent machines communicate over network 
    • One machine can't directly access another machine's memory or disk 
    • Tables are spread out over several machines (horizontal data partitioning) through hash-based or range-based partitioning, or round-robin 
    • Scalable 

  • Shared Disk
    • All processors can access all disks
    • Processors can't access each other's RAM 
    • Lower administration cost 
      • Don't have to partition tables across machines to get parallelism 
      • Although very large databases still have to be partitioned anyway 
    • One node's failure doesn't affect the other nodes' ability to access the entire database 
    • The disk is a single point of failure 
    • Need to explicitly coordinate sharing data across machines
      • Distributed lock manager
      • Cache-coherence protocol
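The horizontal partitioning used by shared-nothing systems can be sketched in a few lines (the modulo "hash" and the boundary scheme are toy choices for illustration; real systems use stronger hash functions):

```python
def hash_partition(key, n_nodes):
    # toy hash partitioning: assign a row to a node by hashing its key
    return key % n_nodes

def range_partition(key, boundaries):
    # range partitioning with boundaries [100, 200]: node 0 gets key < 100,
    # node 1 gets 100 <= key < 200, node 2 gets the rest
    for node, bound in enumerate(boundaries):
        if key < bound:
            return node
    return len(boundaries)

# Spread a small table over 3 nodes by hashing the key
rows = {1: "ann", 2: "bob", 7: "eve"}
nodes = {n: [] for n in range(3)}
for key, val in rows.items():
    nodes[hash_partition(key, 3)].append(key)
# nodes == {0: [], 1: [1, 7], 2: [2]}
```

Round-robin, the third scheme mentioned above, simply deals rows out in turn without looking at the key.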

Buffers 

SQL requests are moved to the server processes, and results are moved back to the client in buffers. 
  • Disk I/O Buffers: reads and writes to shared data store 
    • Database I/O requests
      • Persistent data is staged in the buffer pool:
        • In the thread per DBMS worker model, the buffer pool is heap-resident, and can be accessed by all threads
        • In the other models, the buffer pool is allocated in shared memory, and can be accessed by all processes 
      • When a thread reads a page, it sends a request with the disk address and the pointer to a free memory location in the buffer pool where the result can be stored
      • When a thread flushes a page, it sends a request with the page's current memory location in the buffer pool and the destination disk address 
    • Log I/O requests
      • The log is an array of entries stored on at least one disk
      • It's generated while processing transactions, and staged in an in-memory queue (the log tail) that's periodically flushed to disk 
        • In the thread per DBMS worker model, the log tail is heap-resident
        • In the other models, either:
          • A separate process manages the log
          • Or the log tail is allocated in shared memory 
  • Client communication buffers
    • DBMS worker uses the client communications socket as a queue for tuples, prefetching in advance of client requests 
    • Or, use client-side caching to store results that will probably be fetched soon 
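The buffer pool's role in the disk I/O path can be sketched as a fixed set of in-memory frames with pin counts (the class, the FIFO eviction choice, and the page contents are all invented for illustration):

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, n_frames, disk):
        self.n_frames = n_frames
        self.disk = disk              # stand-in for disk: page_id -> contents
        self.frames = OrderedDict()   # page_id -> (contents, pin_count)

    def read_page(self, page_id):
        if page_id in self.frames:    # hit: no disk I/O needed
            data, pins = self.frames[page_id]
            self.frames[page_id] = (data, pins + 1)
            return data
        if len(self.frames) >= self.n_frames:
            # pool is full: evict the oldest page nobody has pinned
            # (a real system would block or pick a better victim)
            victim = next(p for p, (_, pins) in self.frames.items() if pins == 0)
            del self.frames[victim]
        data = self.disk[page_id]     # miss: fetch from disk into a free frame
        self.frames[page_id] = (data, 1)
        return data

    def unpin(self, page_id):
        data, pins = self.frames[page_id]
        self.frames[page_id] = (data, pins - 1)

disk = {"p1": "tuples-1", "p2": "tuples-2", "p3": "tuples-3"}
pool = BufferPool(2, disk)
pool.read_page("p1")
pool.unpin("p1")        # p1 is now a valid eviction victim
pool.read_page("p2")    # p2 stays pinned
pool.read_page("p3")    # pool full: evicts the unpinned p1
```

The read/flush requests described above are exactly this interface plus a destination: a disk address and a frame pointer.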

Admission Control

As workload increases, a DBMS's throughput rises to some maximum, beyond which the system can thrash and throughput collapses. This is often because of memory pressure: the DBMS is busy replacing pages since it can't keep the working set of pages in the buffer pool. It's also sometimes because of lock contention: transactions deadlock with each other, and continually abort and restart. 

Thus, the job of the admission control policy is to defer query execution until enough resources are available. Ideally, the system will degrade gracefully: latencies of transactions will increase with the rate of queries, but throughput will hold up at its maximum threshold. 

Admission control has 2 layers:
  • The dispatcher process simply limits the number of client connections to a threshold. 
  • The relational query processor decides what to do with a query after it has been parsed and optimized. Specifically, 
    • The query can be postponed, execute with fewer resources, or execute as normal.
    • The decision is based on estimates of which disk devices the query will access and the number of random and sequential I/Os for each device, the CPU load based on operators and number of input tuples, and the memory footprint of the query's data structures (e.g. when sorting or hashing large inputs during joins).
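The second layer's decision can be sketched as a function of the optimizer's estimates (the thresholds, field names, and three-way outcome here are invented for illustration):

```python
def admit(query_plan_cost, free_memory_pages, cpu_load):
    """Decide what to do with an optimized query from toy estimates.
    query_plan_cost: {"memory": pages needed, "io": estimated I/Os}."""
    if query_plan_cost["memory"] <= free_memory_pages and cpu_load < 0.8:
        return "execute"
    if query_plan_cost["memory"] // 2 <= free_memory_pages:
        # run with a reduced memory budget, e.g. smaller sort/hash buffers
        return "execute with fewer resources"
    return "postpone"

decisions = [
    admit({"memory": 100, "io": 10}, free_memory_pages=500, cpu_load=0.2),
    admit({"memory": 800, "io": 10}, free_memory_pages=500, cpu_load=0.9),
    admit({"memory": 5000, "io": 10}, free_memory_pages=100, cpu_load=0.2),
]
# decisions: ["execute", "execute with fewer resources", "postpone"]
```

Postponing is what produces the graceful degradation described above: latency grows with load, but throughput holds at its maximum.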

The Query Processor 

A SQL statement passes through several modules of the query processor:

Query parser 

The query parser does several things: 
  • Resolve names and references 
    • Canonicalize table names (e.g. into the form server.database.schema.table)
    • Invoke catalog manager to check if the table is registered in the system catalog
  • Check that the query is correctly specified
    •  Ensure attributes are correct 
    • Apply other standard SQL syntax checks 
  • Verify that the user is authorized to execute the query 
  • Convert query into internal format used by the optimizer

Query rewriter

The query rewriter simplifies and normalizes the query without accessing the data in the tables. Specifically: 
  • View expansion
    • For each view, the rewriter gets the view definition from the catalog manager, and expands the view into tables 
  • Constant arithmetic evaluation
    • e.g. 10 + 2 becomes 12
  • Rewrite predicates
    • e.g. NOT Employee.Salary > 1000 becomes Employee.Salary <= 1000
  • Semantic optimization
    • e.g. redundant join elimination - a query could join two tables but not reference any of the columns in one of the tables 
  • Subquery flattening and other heuristic rewrites 
    • Query normalization: rewrite the query into a form that's better suited for the query optimizer 
    • Flatten nested subqueries to help query optimizer, which usually optimizes within a query block, and not across query blocks
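Two of the rewrites above, predicate negation and constant folding, can be sketched on predicates represented as nested tuples (the representation and the rewrite function are invented for illustration; a real rewriter works on the parser's internal format):

```python
# How each comparison operator flips under NOT
NEGATE = {">": "<=", "<": ">=", ">=": "<", "<=": ">", "=": "<>", "<>": "="}

def rewrite(pred):
    op = pred[0]
    if op == "NOT" and pred[1][0] in NEGATE:
        # push NOT into the comparison: NOT x > c  ->  x <= c
        inner_op, col, const = pred[1]
        return (NEGATE[inner_op], col, const)
    if op == "+" and isinstance(pred[1], int) and isinstance(pred[2], int):
        # constant arithmetic evaluation: 10 + 2 -> 12
        return pred[1] + pred[2]
    return pred

# rewrite(("NOT", (">", "salary", 1000))) -> ("<=", "salary", 1000)
# rewrite(("+", 10, 2)) -> 12
```

Note that neither rewrite touches table data; both operate purely on the query's structure, which is what distinguishes the rewriter from the optimizer.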

Query optimizer 

The optimizer transforms an internal query representation into an efficient query plan. A query plan is like a data-flow diagram that pipes data from the tables through a series of query operators. (An example query plan is shown in the figure above.) 

Queries are first broken down into SELECT-FROM-WHERE query blocks before optimization. After optimizing each block, a few post-processing operators are added (e.g. to compute GROUP BY, ORDER BY, HAVING), and the blocks are stitched back together in order.

Optimizers today build on Selinger's System R paper in the following directions:
  • Plan space
    • System R only focused on left-deep query plans, where the righthand table input of a join is a base table 
      • Optimizers today also consider bushy trees, where the righthand input of a join can itself be the result of another join 
    • System R postponed Cartesian products, such that they appear only after all joins
      • Optimizers today also consider earlier Cartesian products 
  • Selectivity estimation 
    • System R's selectivity estimation was based on simple table and index cardinalities 
      • Optimizers today also base selectivity estimation on distributions of values in attributes via histograms and sampling techniques, as well as other factors 
  • Search algorithms 
    • System R used a dynamic programming approach
      • Some optimizers today use a top-down search. However, both approaches can work well in production, and have exponential time and space complexity in the number of tables involved in the query. 
  • Parallelism 
    • Optimizers today support some parallel processing 
      • Both single-phase and two-phase implementations exist in production 
  • Auto-tuning 
    • Ongoing research on how to let DBMS automatically make tuning decisions 
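The System R dynamic-programming idea mentioned above can be sketched for left-deep plans: for each subset of tables, remember only the cheapest plan found. The cardinalities, the flat selectivity, and the pairs-compared cost model below are all toy assumptions for illustration:

```python
from itertools import combinations

def best_left_deep_plan(tables, card, sel=0.1):
    """card[t]: estimated row count of table t; sel: flat join selectivity."""
    # best[frozenset of tables] = (cost, left-deep join order, estimated rows)
    best = {frozenset([t]): (0, (t,), card[t]) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            key = frozenset(subset)
            for right in subset:                    # righthand input: a base table
                left = key - {right}
                lcost, lorder, lrows = best[left]
                cost = lcost + lrows * card[right]  # toy cost: pairs compared
                rows = lrows * card[right] * sel    # toy cardinality estimate
                if key not in best or cost < best[key][0]:
                    best[key] = (cost, lorder + (right,), rows)
    return best[frozenset(tables)]

card = {"A": 10, "B": 1000, "C": 100}
cost, order, rows = best_left_deep_plan(["A", "B", "C"], card)
# joins the two small tables first: order == ("C", "A", "B")
```

Even this toy version shows the exponential space the section mentions: there is one table entry per subset of tables, so 2^n entries for n tables.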

Query Executor

The query executor operates on a query plan. Most query executors today use an iterator model (shown in the above figure).  Operators in the query plan are implemented as subclasses of the iterator class. Any subclass of iterator can be used as input to any other. 
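The iterator model can be sketched with Python generators, which naturally provide the "pull one tuple at a time" interface (the operator names and the dict-per-tuple representation are illustrative choices, not the paper's):

```python
def table_scan(rows):
    # leaf operator: produces tuples from a base table
    for row in rows:
        yield row

def selection(child, predicate):
    # passes through only tuples matching the predicate
    for row in child:
        if predicate(row):
            yield row

def projection(child, columns):
    # keeps only the requested columns of each tuple
    for row in child:
        yield {c: row[c] for c in columns}

# Any operator can consume any other's output, so a plan is built by
# composition: scan -> select -> project
employees = [{"name": "ann", "salary": 15000},
             {"name": "bob", "salary": 30000}]
plan = projection(selection(table_scan(employees),
                            lambda r: r["salary"] < 20000),
                  ["name"])
# list(plan) == [{"name": "ann"}]
```

Because each operator pulls from its child on demand, the first result tuple can reach the client before the scan finishes, which is what makes the time-to-first-row metric below meaningful.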

Some important performance metrics are:
  • Time to query completion (most common)
  • Maximizing DBMS throughput 
  • Time to first row (popular with interactive applications) 
Writing a query plan for a query that updates is more complicated than for a query that just reads. The Halloween Problem was discovered by the System R group on October 31. It's about a plan that reads and writes the same table:  for example, a query like "give everyone whose salary is under $20K a 10% raise". 

If the query plan was simply an index scan iterator fed into an update iterator, the query executor might read and update a salary again and again until it's over $20K! 

A correct query plan has to be more complicated. One solution is a batch read-then-write scheme: first scan and record the IDs of all tuples to be modified in a temporary file, then update exactly those tuples in a second pass. 
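The two-phase fix can be sketched directly (the table layout and function are invented for illustration; note that a 10% raise on $15K yields $16.5K, still under $20K, which is exactly the tuple a naive re-scanning plan could raise again):

```python
def give_raise(table, threshold=20000, pct=0.10):
    """table: {tuple_id: row}. Batch read-then-write update."""
    # Phase 1 (read only): record the IDs of qualifying tuples,
    # standing in for the temporary file of tuple IDs.
    to_update = [tid for tid, row in table.items()
                 if row["salary"] < threshold]
    # Phase 2 (write only): update exactly those tuples, once each.
    for tid in to_update:
        table[tid]["salary"] = round(table[tid]["salary"] * (1 + pct), 2)

emp = {1: {"salary": 15000.0}, 2: {"salary": 25000.0}}
give_raise(emp)
# emp[1] raised once to 16500.0; emp[2] untouched
```

Because the qualifying set is frozen before any write happens, a tuple moved by the update can never be read (and raised) a second time.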

Access Methods

The query processor accesses data structures on disk through access methods. Most systems in production use heaps and B+-tree indexes for their data structures. There are also other types of indexes in use, such as hash indexes for equality lookups and bitmap indexes for read-heavy data warehouses.

To integrate access methods into the iterator API, we add a parameter to the init() method: a SARG (search argument, of the form column operator constant). get_next() returns tuples that satisfy the SARG, and then NULL if there are no remaining tuples. SARGs are passed into access methods for a couple reasons:
  • B+-trees and other indexes need SARGs to be efficient.
  • The access methods can batch-test the SARG on a page of tuples at a time, and get_next() returns only when a tuple satisfies the SARG. 
    • If the SARG is instead checked by a function which then calls an access method, the access method must return either a handle to a tuple in a pinned page in the buffer pool, or a copy of the tuple. Then the function which called the access method is responsible for decrementing the page's pin count, or deleting the copied tuple respectively. Either way, this is a huge CPU overhead! 
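The batching argument can be sketched as a heap scan that qualifies tuples inside the access method, one page at a time (the page layout, OPS table, and heap_scan function are invented for illustration):

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}

def heap_scan(pages, sarg=None):
    """pages: list of pages, each a list of row dicts.
    sarg: (column, op, constant), tested inside the access method."""
    for page in pages:                     # batch: one page of tuples at a time
        for row in page:
            if sarg is None:
                yield row
            else:
                col, op, const = sarg
                if OPS[op](row[col], const):
                    yield row              # only matching tuples escape the scan

pages = [[{"id": 1, "qty": 5}, {"id": 2, "qty": 50}],
         [{"id": 3, "qty": 7}]]
hits = list(heap_scan(pages, sarg=("qty", "<", 10)))
# hits contains the rows with id 1 and 3
```

Non-matching tuples never cross the operator boundary, so there is no per-tuple pin/unpin or copy/delete bookkeeping for the caller, which is the CPU overhead the bullet above describes.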
Indexes also need to reference the rows in a base table. There are a few ways this is implemented:
  • Many DBMSs use row IDs, which are the actual physical disk addresses of the base table rows.
    • Pro: Fast
    • Con: Base table row movement is expensive - for example, many rows need to move when a B+-tree splits. 
  • Other DBMSs use the row primary key.
    • Pro: avoids problems that row movement causes
    • Con: slightly slower performance when an index accesses the base table.
  • Oracle uses a mix of the above two implementations
    • If a row hasn't been moved, use a physical pointer
    • If a row has been moved, use the primary key. 
