tldr: Concurrency Control & Recovery

Part of the tldr series, which summarizes papers I'm currently reading. Inspired by the morning paper.

tldr: Concurrency Control & Recovery


Concurrency control makes sure that users see consistent states of the database even though users' operations may be concurrent. Recovery makes sure that software or hardware failures don't corrupt the database state. Together, these DBMS components make things easier for developers who create applications on top of a database. 


A transaction groups a set of logical operations into a unit, that all abort or all commit as one. If a transaction commits, all its updates become permanent and visible to other transactions. If a transaction aborts, all its updates are erased. The properties that a transaction should have are:
  • Atomicity: all or none of the operations in a transaction succeed.
  • Consistency: the transaction leaves the database in a consistent* state. 
    • *A circular definition, but consistent means that the database obeys the integrity constraints set by a user. An example of an integrity constraint is "all the social security numbers in a column should be unique". 
  • Isolation: Concurrent transactions don't affect each other. In other words, a transaction runs as if it were running alone in the database. 
  • Durability: A committed transaction's results remain in the database despite failures.
A, C, and D ensure correctness for transactions that serially execute. In the classic bank account example, $50 is transferred from Account 1 to Account 2: 
  • Atomicity: We're at the point where $50 was subtracted from Account 1's balance, but hasn't been added yet to Account 2's balance. The atomicity property says that if at this point, the database fails, the update to Account 1 should be rolled back. Otherwise, the database just lost $50! 
  • Consistency: The sum of the balances of A and B should remain constant.
  • Durability: Once $50 has been transferred from Account 1 to Account 2, a user should be able to assume that the $50 in Account 2 is there to stay. 
Furthermore, the I property ensures correctness for transactions that concurrently execute. (We need concurrent executions since the database would get really slow otherwise.) 

Concurrency Control

The Basics

Concurrency control handles the I property of a transaction. To reason about the correctness of concurrent transactions, we first need to define what a schedule is. For some set of transactions, a schedule is an ordering of the operations that are part of those transactions such that all the operations of a transaction must appear in the same order in the schedule as they did in the transaction.
A schedule is serializable if and only if it's equivalent to some serial schedule in terms of results. In a serial schedule, all the operations of each transaction appear consecutively. We can say that a serializable schedule is correct

If we don't use serializable schedules, the following conflicts will happen*:

  • WR conflict: Dirty reads
  • RW conflict: Non-repeatable reads  
  • WW conflict: Overwriting uncommitted data 

*All these conflicts will be discussed in a later section, Isolation Levels

There are two ways to test serializability: 

  • View serializability: not easy to implement, so not practical
  • Conflict serializability: easy to implement & practical
    • Two schedules are conflict equivalent if they contain the same transactions with the same operations and order all conflicting operations of committed transactions in the same way. 
    • The outcome of a schedule depends only on the order of conflicting operations. Conflicting operations are ones that operate on the same data item, and at least one is a write. 
    • A schedule is conflict serializable if it's conflict equivalent to some serial schedule.
We can use a precedence graph to test whether a schedule is conflict serializable. A precedence graph is a directed graph, which contains edges from a transaction i to a transaction j if there's an operation in i that must precede an operation of j. In other words, if:

  • Transaction i reads/writes before transaction j writes the same data item
  • Transaction i writes before transaction j reads the same data item

A schedule is conflict serializable if and only its precedence graph is acyclic

Two Phase Locking 

The most popular implementation of concurrency control is locking. There are two kinds of locks: 
  • Shared locks: used to protect read access to data 
  • Exclusive locks: used to protect write access to data 
Two different transactions can hold shared locks at the same time. However, a transaction can't hold any kind of lock on an item if another transaction is holding an exclusive lock on it. The transaction is blocked until all conflicting locks held by other transactions are released. 

Two phase locking is sufficient (but not necessary) for implementing serializability, and consists of two rules:
  • All transactions must be well-formed. 
    • A transaction is well-formed with respect to reads if it always holds a shared or an exclusive lock on an item when reading it. 
    • A transaction is well-formed with respect to writes if it always holds an exclusive lock on an item when writing it.
  • Once a transaction has released a lock, it can't acquire any additional locks.
The second rule gives 2PL its name:
  • There's a growing phase in which the transaction is acquiring locks.
  • The instant the transaction releases a lock it goes into the shrinking phase, in which the transaction is releasing locks.
2PL is not cascadeless, which means one transaction's abort causes other transactions to abort. Cascadeless schedules are recoverable 

There are several variants of 2PL:
  • Strict 2PL
    • Holds exclusive locks for long duration 
    • Prevents cascading rollbacks, thus guaranteeing cascadeless recoverability 
  • Rigorous 2PL
    • Holds both exclusive and shared locks for long duration 
    • Prevents cascading rollbacks, thus guaranteeing cascadeless recoverability 
    • For two conflicting transactions, their commit order is their serializability order 
  • Conservative 2PL 
    • Transactions must acquire all the locks before the transactions begin. 
    • This ensures that a transaction that's already acquired some locks doesn't wait for other transactions to release their locks, thus avoiding deadlocks.

To implement 2PL, the lock manager grants and blocks lock requests, managers queues of waiting transactions, and deals with deadlock situations. 

Deadlock happens when each transaction in a group is waiting for another transaction in the group to release a lock, so no transaction can make progress. There are two ways to deal with deadlocks:
  • Avoidance
    • Impose an order in which locks can be acquired on data items 
    • Transactions must declare their locking requests beforehand
  • Detection 
    • Timeouts are the easiest way to implement detection: if a transaction is blocked beyond some time threshold, assume that there's deadlock. 
    • Another way is using a waits-for graph, which is a directed graph which contains an edge from transaction i to transaction j if transaction i is waiting on a lock held by transaction j. Any cycle of transactions are deadlocked. Then, at least one transaction should be rolled back, automatically releasing held locks and thus breaking the deadlock.

Isolation Levels

Levels define how transactions trade concurrency for consistency. In all levels, write operations are well-formed and exclusive locks are long duration (they are held until the end of a transaction). If exclusive locks were short duration, this would allow the following tricky scenario where transaction 0 writes A, transaction 1 writes A, and transaction 0 aborts. This is because:
  • Restoring A to transaction 0's original stored value of A is wrong because it overwrites transaction 1
  • Ignoring transaction 0's abort is wrong because if transaction 1 aborts, restoring A to transaction 1's original stored value restores transaction 0's write
Thus, the difference between the levels depends on whether:
  • Read operations are well-formed or not
  • Read locks are long duration or short duration
The levels from weakest to strongest are:
  • Read uncommitted 
    • Allows transactions to read data that has been written by other uncommitted transactions
    • Read operations are not well-formed
    • Doesn't prevent dirty reads (and problems from stronger levels)
      • Seeing updates that'll eventually be rolled back
    • Doesn't prevent seeing some but not all updates made by a transaction
  • Read committed
    • Transactions only see updates made by committed transactions 
    • Read operations are well-formed, but read locks are short duration 
    • Doesn't prevent non-repeatable reads (and problems from stronger levels)
      • Transaction 1 acquires shared lock on A
      • Transaction 1 reads A = 0 
      • Transaction 1 releases shared lock on A 
      • Transaction 2 acquires exclusive lock on A
      • Transaction 2 writes A = 1
      • Transaction 2 commits, releasing exclusive lock on A 
      • Transaction 1 acquires shared lock on A 
      • Transaction 1 reads A = 1
      • Problem! Transaction 1 read A = 0, then A = 1
    • Doesn't prevent lost updates (and problems from stronger levels)
      • Transaction 1 acquires shared lock on A 
      • Transaction 1 reads A = 100
      • Transaction 1 releases shared lock on A 
      • Transaction 2 acquires shared lock on A
      • Transaction 2 reads A = 100
      • Transaction 2 upgrades to exclusive lock on A
      • Transaction 2 writes A + 100 = 200
      • Transaction 2 commits
      • Transaction 2 releases exclusive lock on A
      • Transaction 1 acquires exclusive lock on A
      • Transaction 1 writes A + 100 = 200
      • Transaction 1 commits 
      • Problem! A should be 100 + 100 + 100 = 300, but is only 200.
  • Repeatable read
    • Reads to individual data items are repeatable
    • Read operations are well-formed, and read locks are long duration 
    • Doesn't prevent phantom problem (and problems from stronger levels)
      • Transaction i reads some set of tuples that satisfies a query predicate, e.g. READ(age > 50)
      • Transaction j inserts a new phantom tuple that satisfies the predicate, e.g. INSERT(age = 70)
      • Transaction i executes the query again, and discovers that the new result is different from the old result
  • Serializable
    • Equivalent to some arbitrary serial execution 
    • Read operations are well-formed, read locks are long duration, and reads on predicates are well-formed
    • Protects against all problems
An alternative level to serializable is snapshot isolation. In snapshot isolation,

  • A transaction has two timestamps: one where it reads, and one where it writes. At the point of the read timestamp, the transaction takes a snapshot of the database. That transaction won't see anything that happens after this point.
  • Transactions are concurrent if the read->write intervals overlap. Concurrent transactions are banned from writing to the same memory locations. The system enforces this rule by forcing violating transactions to abort.

Snapshot isolation prevents all the issues that serializable does:

  • Prevents dirty reads
  • Prevents lost updates 
  • Prevents phantom problem because a transaction takes a snapshot when it first starts, and bases its operations on that snapshot. Thus, it wouldn't read different values since the snapshot remains the same. 
However, snapshot isolation is not the same thing as serializable. Consider this issue known as write skew:
  • There are two doctors in a hospital. At least one must be on duty at all times. However, both the doctors would like to go on vacation.
  • One doctor starts transaction 1: reads a value to see if the other is on vacation. 
  • Meanwhile, the other doctor starts transaction 2: reads a value to see if the other is on vacation.
  • Both doctors see that the other is on duty, so both update their values to on vacation
  • The writes are to different locations, so this is allowed in snapshot isolation!

Hierarchical Locking 

Locking can happen not only at the tuple level, but also at the page, relation, or database level. There's always the trade off between concurrency and locking overhead: fine granularity means maximum concurrency, but also means a transaction will have to acquire a larger number of locks. 

In hierarchical locking, there are three new kinds of locks in addition to shared and exclusive:
  • Intention Shared: holder of a lock intends to acquire a shared lock on at least one finer level
  • Intention Exclusive: holder of a lock intends to acquire an exclusive lock on at least one finer level
  • Shared with Intention Exclusive: combines shared lock with an intention exclusive lock (for example, useful when scanning tuples in a relation but only updating some of them based on their values)
Another concept to know for hierarchical locking is lock escalation, which allows a DBMS to automatically adjust the granularity: if a transaction is acquiring locks on a large fraction of the granules that make up a larger granule, the system can try to grant the transaction a lock on the larger granule. This is useful since the access pattern of a transaction is usually not known until runtime. 

Optimistic Concurrency Control

Locking is pessimistic because it assumes transactions will probably conflict with each other. On the other hand, optimistic concurrency control allows transactions to execution without acquiring any locks. 

The pros of OCC are:
  • Never has to wait for locks
  • Doesn't run into deadlocks
The cons of OCC are:
  • Conflicting transactions must be restarted
  • Transactions may be starved (e.g. repeatedly restarted but never actually run) 
There are three phases in OCC:
  • Read phase
    • Transactions execute, but all updates only affect a local copy of the data 
      • A local copy is used to not make dirty results visible to other transactions and to be able to undo
    • The read and write sets for each transaction are recorded 
  • Validate phase
    • Verifies that the transaction didn't conflict with already committed transactions
    • An ordering, based on when the read phase finishes, is assigned to the transactions
    • For a transaction to be validated:
      • It reads all writes of every earlier transaction 
      • It finished all its writes after the writes of every earlier transaction 
    • This means at least one of the following conditions must be true (or the transaction will be aborted):
      • In the first condition, transaction i completed the write phase before transaction j started the read phase. (No overlap)
      • In the second condition, transaction i's write set doesn't intersect transaction j's read set. Transaction i completed the write phase before transaction starts its read phase. (Transaction j ends up seeing some but not all of transaction i's writes, since j reads while i writes)
      • In the third condition, transaction i's write set doesn't intersect transaction j's read or write set. Transaction i completes the read phase before transaction j completes the read phase. 
  • Write phase
    • Writes local copy to database
OCC might be more appropriate for in memory databases since in locking, a shared data structure like the lock table might be a bottleneck. On the other hand, in OCC, each transaction's thread has a local read and write set. 


The Basics

Recovery handles the A and D properties of a transaction. It must handle:
  • Transaction failures. When an in-progress transaction reaches a state where it can't commit successfully, all updates that it made should be rolled back
  • System failures. If the contents in main memory are lost, all the updates of all transactions that committed before the crash should be saved in the database, and all the updates of all transactions that didn't commit (in-progress or aborted) should be erased from the database.
  • Media failures. If the contents on disk are lost, the database must be restored from archival data and updated using logs. 
An UNDO is the process of erasing the effects of an unfinished/aborted transaction to get atomicity. A REDO is the process of restring the effects of a committed transaction to get durability. How a recovery system implements these two processes depend on how the buffer manager handles data being updated by in-progress transactions. 

The buffer manager transfers data between main memory and disk in pages. Pages in the volatile buffer pool are updated, and eventually written out to disk. The buffer management policy is made up of two properties:
  • STEAL: buffer manager allows updates from uncommitted transaction to overwrite the most recent committed value of a data item on disk 
    • If there's a crash, the system must UNDO when it restarts.
  • FORCE: buffer manager makes sure all updates made by a transaction are on disk before the transaction can commit 
    • If there's a crash, the system doesn't have to REDO when it restarts. 
NO-STEAL/FORCE gives the least amount of recovery work. However, it really slows down the database during normal operation: NO-STEAL makes the buffer manager maintain updated data in-memory or in a temporary location on disk until a transaction commits. FORCE makes writes slow.  Thus, most buffer managers follow the STEAL/NO-FORCE policy. 

To handle UNDOs and REDOs, DBMSs rely on logs. A log is a file that stores info about the transactions and the database state at certain points. Each entry in the log is a log record. The LSN (log sequence number) is a unique identifier for each log record. They are usually assigned in a monotonically increasing sequence. When an update is made to a data item, the corresponding LSN is written to the page containing the updated data item for easier recovery. Additionally, there are periodic checkpoints, which write checkpoint records about the current state of the database to the log. 

There are two kinds of logging, and a compromise between them:
  • Physical logging. Records location (e.g. offset on a specific page) of the updated data. If UNDOs need to be implemented, then the original value of the data before the update is written to the log record. If REDOs need to be implemented, then the new value of the data after the update is written to the log record. Recovery actions from physical log records are idempotent, which becomes important if the system fails and restarts repeatedly. 
  • Logical logging. Records high-level info about the operations. If UNDOs need to be implemented, then the set of actions that make up the inverse of the operation is written to the log record. If REDOs need to be implemented, then the set of actions that represent the operation is written to the log record. 
    • Advantage: It hides implementation details, and minimizes the amount of data to write to the log. 
    • Disadvantage: Recovery is hard to implement because it's hard to figure out which parts of the logical updates are saved in the database state after a crash. 
  • Physiological logging. Log records are constrained to refer to a single page, but can reflect logical operations on that page. 
The WAL protocol makes sure that if the database crashes, the log contains enough data to UNDO and REDO when a STEAL/NO-FORCE buffer management policy is used. The protocol specifies:
  • All log records corresponding to an updated page are written to disk before the page itself is allowed to be over-written in disk. 
    • This gives all the info that UNDOs need. 
  • A transaction is not committed until all of its log records (including its commit record) have been written to disk. 
    • This gives all the info that REDOs need. 

Popular posts from this blog


tldr: Raft + Implementation Tips

Building A Toy Language Interpreter in Go