tldr: Architecture of a Database System


Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

tldr: Architecture of a Database System [2007]

Hellerstein, Stonebraker, Hamilton

This paper focuses on DBMSs, specifically RDBMSs, the "most mature and widely used database systems in production today."
The figure above shows a generic RDBMS. Before going into detail into each of the 5 main components, the paper walks through an example of a single-query transaction: 
  • At the airport, a gate agent clicks on a link to request a flight's passenger list. 
  • The gate agent's computer (the client) calls an API which establishes a connection over a network with the Client Communications Manager
    • This component is responsible for: 
      • Establishing and remembering the connection
      • Responding to SQL commands from the client
      • Returning data and control messages
    • There are a couple types of arrangements:
      • Client-server system: direct connection between client and database server (e.g via JDBC)
      • Three-tier system: client connects with middle-tier server (e.g. web server), which facilitates communication between the client and the database
  • The Process Manager receives the SQL command. This component is responsible for: 
    • Assigning a thread of computation to the command
    • Handling admission control, which means deciding whether the system should immediately execute the query or defer until enough system resources are available 
  • The Relational Query Processor begins to execute the SQL command. This component is responsible for:
    • Checking that the author is authorized to run the query
    • Compiling the SQL query into a query plan
    • Running the query plan with the plan executor, which is made up of a set of operators (e.g. joins, selection, projection, aggregation, sorting)
  • Some of the query plan's operators request data from the Transactional Storage Manager. 
    • This component is responsible for:
      • Managing data access calls (read)
      • Managing data manipulation calls (create, update, delete) 
    • This component's subcomponents are:
      • Access methods: algorithms and data structures (e.g. tables and indexes) to organize and access data on disk 
      • Buffer manager: decides what data to transfer between disk and memory buffers, and when
      • Lock manager: locks are acquired before accessing data to ensure correct execution with concurrent queries 
      • Log manager: if the gate agent's query had included updates to the data, the log manager would ensure that the transaction is durable
  • Once all the query operators have the data they requested, the stack of activities unwinds: 
    • Access methods return to query operators, which compute result tuples
    • Result tuples are shipped in buffers to the Client Communications Manager
    • Client Communications Manager returns the result tuples to the client 
    • The transaction manager, process manager, and communications manager clean up state 
The above example didn't mention any of the Shared Components and Utilities, which are involved throughout the process. For example: 
  • Catalog manager: Query processor uses this component during authentication, parsing, and query optimization
  • Memory manager: Used whenever memory is dynamically allocated or deallocated 

The Process Manager

Process Models

The paper first discusses several possible DBMS process models, which sketch out how concurrent user requests are modeled, and how to map them to OS processes/threads. Before going over the process models, here are some important definitions: 
  • OS process
    • OS thread + private, unique address space
    • Scheduled by kernel
  • OS thread
    • Program execution unit
    • Has access to memory of other threads executing in the same multi-threaded OS process
    • Scheduled by kernel scheduler
  • Lightweight thread
    • Scheduled by application-level thread scheduler
    • Appears to the OS scheduler as a single thread of execution, since all the lightweight threads run within a single OS process
    • Comparison with OS threads:
      • Faster thread switching
      • But blocking operations in any lightweight thread will block all other lightweight threads
  • DBMS worker
    • Thread of execution in DBMS, which handles a client's requests 
    • 1:1 between DBMS worker and DBMS client 
Now, we're ready to look at 3 DBMS process models:

  • Process per DBMS worker model
    • One OS process hosts each DBMS worker (direct mapping)
    • Pros:
      • Most straight-forward model 
      • OS scheduler manages timesharing for DBMS workers
      • OS can detect common bugs like memory overruns 
    • Cons:
      • DBMS connections share some in-memory data structures, so they must be explicitly allocated in shared memory 
      • Scaling issues: process uses more memory than a thread, and process switches are slower 
    • Current use:
      • Still heavily used
        • IBM DB2: defaults to this model on OSs that don't support scalable OS threads
        • PostgreSQL: exclusive process model 
        • Oracle: default process model
  • Thread per DBMS worker model 
    • Single multi-threaded process hosts all DBMS workers 
    • Each client connection allocates a new thread 
    • Pros:
      • Scales well to large numbers of concurrent connections 
      • Data sharing is easy 
    • Cons:
      • OS doesn't protect threads from common bugs (like memory overruns, stray pointers)
      • Debugging is tricky (race conditions) 
      • Porting (between different threading interfaces and multi-threaded scaling) 
    • Current use:
      • Commonly used:
        • IBM DB2: defaults to this model on OSs that support scalable OS threads
        • Microsoft SQL Server
        • MySQL 
      • Two major current variants:
        • OS thread per DBMS worker
        • DBMS thread per DBMS worker 
          • Pros: Avoids OS scheduler scaling/performance issues
          • Cons: High implementation costs, poor development tools support, substantial software maintenance 
  • Process pool model
    • A pool of processes hosts DBMS workers 
    • Client requests sent to a process in the process pool, or waits until a process is unavailable 
    • Pros:
      • Same pros as process per DBMS worker model
      • But more memory efficient 
      • And scales well to large number of users
    • Current use:
      • Still currently used
        • Oracle: optional model suggested by Oracle when there are large numbers of concurrently connected users 
      • Variant: thread pool model
        • Microsoft SQL Server: default model

Parallel Architecture 

The paper also discusses DBMSs on parallel hardware. There are several models of parallelism:

  • Shared Memory
    • All processors access the same RAM and disk
    • The 3 process models we discussed in the previous section run well on shared memory hardware architecture 
    • Supported by all major commercial DBMS providers 
  • Shared Nothing 
    • Cluster of independent machines communicate over network 
    • One machine can't directly access another machine's memory or disk 
    • Tables are spread out over several machines (horizontal data partitioning) through hash-based or range-based partitioning, or round-robin 
    • Scalable 

  • Shared Disk
    • All processor can access disk
    • Processors can't access other processor's RAM 
    • Lower administration cost 
      • Don't have to partition tables across machines to get parallelism 
      • Although very large databases have to be partitioned anyways 
    • One node's failure doesn't affect the other nodes' ability to access the entire database 
    • The disk is a single point of failure 
    • Need to explicitly coordinate sharing data across machines
      • Distributed lock manager
      • Cache-coherence protocol

Buffers 

SQL requests are moved to the server processes, and results are moved back to the client in buffers. 
  • Disk I/O Buffers: reads and writes to shared data store 
    • Database I/O requests
      • Persistent data is staged in the buffer pool:
        • In the thread per DBMS worker model, the buffer pool is heap-resident, and can be accessed by all threads
        • In the other models, the buffer pool is allocated in shared memory, and can be accessed by all processes 
      • When a thread reads a page, it sends a request with the disk address and the pointer to a free memory location in the buffer pool where the result can be stored
      • When a thread flushes a page, it sends a request with the page's current memory location in the buffer pool and the destination disk address 
    • Log I/O requests
      • The log is an array of entries stored on at least one disk
      • It's generated while processing transactions, and staged in an in-memory queue (the log tail) that's periodically flushed to disk 
        • In the thread per DBMS worker model, the log tail is heap-resident
        • In the other models, either:
          • A separate process manages the log
          • Or the log tail is allocated in shared memory 
  • Client communication buffers
    • DBMS worker uses the client communications socket as queue for tuples, to prefetch in advance of client requests 
    • Or, use client-side caching to store results that will probably be fetched soon 

Admission Control

DBMSs thrash as workload increases and throughput increases to a threshold. This is often because of memory pressure: the DBMS is busy replacing pages since it can't keep the working set of pages in the buffer pool. It's also sometimes because of lock contention: transactions deadlock with each other, and continually abort and restart. 

Thus, the job of the admission control policy is to defer query execution until enough resources are available. Ideally, the system will degrade gracefully: latencies of transactions will increase with the rate of queries, but throughput will hold up at its maximum threshold. 

Admission control has 2 layers:
  • The dispatcher process simply limits the number of client connections to a threshold. 
  • The relational query processor decides what to do with a query after it has been parsed and optimized. Specifically, 
    • The query can be postponed, execute with fewer resources, or execute as normal.
    • The decision is based on estimates of which disk devices the query will access and the number of random and sequential I/Os for each device, the CPU load based on operators and number of input tuples, and the memory footprint of the query's data structures (e.g. when sorting or hashing large inputs during joins).

The Query Processor 

A SQL statement passes through several modules of the query processor:

Query parser 

The query parser does several things: 
  • Resolve names and references 
    • Canonicalize table names (e.g. into the form server.database.schema.table)
    • Invoke catalog manager to check if the table is registered in the system catalog
  • Check that the query is correctly specified
    •  Ensure attributes are correct 
    • Apply other standard SQL syntax checks 
  • Verify that the user is authorized to execute the query 
  • Convert query into internal format used by the optimizer

Query rewriter

The query rewriter simplifies and normalizes the query without accessing the data in the tables. Specifically: 
  • View expansion
    • For each view, the rewriter gets the view definition from the catalog manager, and expands the view into tables 
  • Constant arithmetic evaluation
    • e.g. 10 + 2 becomes 12
  • Rewrite predicates
    • e.g. NOT Employee.Salary > 1000 becomes Employee.Salary <= 1000)
  • Semantic optimization
    • e.g. redundant join elimination - a query could join two tables but not reference any of the columns in one of the tables 
  • Subquery flattening and other heuristic rewrites 
    • Query normalization: rewrite the query into a form that's better suited for the query optimizer 
    • Flatten nested subqueries to help query optimizer, which usually optimizes within a query block, and not across query blocks

Query optimizer 

The optimizer transforms an internal query representation into an efficient query plan. A query plan is like a data-flow diagram that pipes data from the tables through a series of query operators. (An example query plan is shown in the figure above.) 

Before optimizing the queries, they are first broken down into SELECT-FROM-WHERE query blocks. After optimizing, a few post-processing operators are added (e.g. to compute GROUP BY, ORDER BY, HAVING), and the blocks are put together in order.

Optimizers today build on Selinger's System R paper in the following directions:
  • Plan space
    • System R only focused on left-deep query plans, where the righthand table input of a join is a base table 
      • Optimizers today also consider bushy trees, where the righthand table input of a join is made up of nested tables
    • System R postponed Cartesian products, such that they appear only after all joins
      • Optimizers today also consider earlier Cartesian products 
  • Selectivity estimation 
    • System R's selectivity estimation was based on simple table and index cardinalities 
      • Optimizers today also base selectivity estimation on distributions of values in attributes via histograms and sampling techniques, as well as other factors 
  • Search algorithms 
    • System R used a dynamic programming approach
      • Some optimizers today use a top-down search. However, both approaches can work well in production, and have exponential time and space complexity in the number of tables involved in the query. 
  • Parallelism 
    • Optimizers today support some parallel processing 
      • Both single-phase and two-phase implementations exist in production 
  • Auto-tuning 
    • Ongoing research on how to let DBMS automatically make tuning decisions 

Query Executor

The query executor operates on a query plan. Most query executors today use an iterator model (shown in the above figure).  Operators in the query plan are implemented as subclasses of the iterator class. Any subclass of iterator can be used as input to any other. 

Some important performance metrics are:
  • Time to query completion (most common)
  • Maximizing DBMS throughput 
  • Time to first row (popular with interactive applications) 
Writing a query plan for a query that updates is more complicated than for a query that just reads. The Halloween Problem was discovered by the System R group on October 31. It's about a plan that reads and writes the same table:  for example, a query like "give everyone whose salary is under $20K a 10% raise". 

If the query plan was simply an index scan iterator fed into an update iterator, the query executor might read and update a salary again and again until it's over $20K! 

A correct query plan would have to be more complicated. One solution is a batch read-then-write scheme. A temporary file stores the IDs of tuples to be modified. 

Access Methods

The query processor accesses data structures on disk through access methods. Most systems in production use heaps and B+-tree indexes for their data structures. There are also other types of indexes in use, such as hash indexes for equality lookups and bitmap indexes for read-heavy data warehouses

To integrate access methods into the iterator API, we add a parameter to the init() method: a SARG (search argument in the form of a constant column operator). get_next() returns tuples that satisfies the SARG, and then NULL if there are no remaining tuples. SARGs are passed into access methods for a couple reasons:
  • B+-trees and other indexes need SARGs to be efficient.
  • The access methods can batch test the SARGs on a page of tuples at a time, and get_next() only returns if a tuple satisfies the SARG. 
    • If the SARG is instead checked by a function which then calls an access method, the access method must return either a handle to a tuple in a pinned page in the buffer pool, or a copy of the tuple. Then the function which called the access method is responsible for decrementing the page's pin count, or deleting the copied tuple respectively. Either way, this is a huge CPU overhead! 
Indexes also need to reference the rows in a base table. There are a few ways this is implemented:
  • Many DBMSs use row IDs, which are the actual physical disk addresses of the base table rows.
    • Pro: Fast
    • Con: Base table row movement is expensive - for example, many rows need to move when a B+-tree splits. 
  • Other DBMSs use the row primary key.
    • Pro: avoids problems that row movement causes
    • Con: slightly slower performance when an index accesses the base table.
  • Oracle uses a mix of the above two implementations
    • If a row hasn't been moved, use a physical pointer
    • If a row has been moved, use the primary key. 

Popular posts from this blog

Space Race: a simple Rust game built on the Bevy framework

Building a Toy Language Compiler in Go

tldr: Raft + Implementation Tips