LevelDB
I've been playing with the LevelDB library recently, so I took some time digging around its source code (https://github.com/google/leveldb) and reading articles/papers. In this post, I’m going to share some things related to LevelDB that I most enjoyed learning about.
If you don’t know much about LevelDB, here’s a quick overview: LevelDB is a C++ library that stores key value pairs of byte arrays sorted by key. Operations include Put(key, value), Get(key) and Delete(key). It’s called LevelDB because it is arranged by levels: cache, log, level 0, level 1, level 2 … In general, after Level x reaches a limit of 10^x MB, one of its files gets compacted (merged periodically) into Level x+1. Level 0 is a special level: when it hits its limit of 4 files, at least one of its file gets compacted into Level 1. Furthermore, the files in Level 0 can contain overlapping keys, whereas the keys in the files in the other levels cannot.
One of the things that interested me was the underlying data structures used in LevelDB. Each of the files in the levels is a sorted string table (SST), which is an map of string keys to string values sorted by key on disk. Values can be either the current value corresponding to the key or a deletion marker, and duplicate keys are fine. You look up values through an index that keeps track of the keys and the locations of their values. The log is stored in a memtable, which is just a SST in memory, and periodically flushed to a SST. Overall, this is one implementation of a log structured merge tree (LSM), which is known for handling write-heavy workloads well. Compared to B trees, which must update roughly O(number of keys) nodes often located in random locations for each write, LSMs only have to do one sequential batch of writes.
However, B trees are usually faster at reading since unoptimized LSMs have to look through each of the levels (which are on disk). This leads me to another thing that interested me -- bloom filters. Bloom filters eliminate the need to look through so many levels. This data structure tells us that the key we're looking up either 1) has a high chance of being in the SST or 2) is definitely not in the SST. (Sometimes, bloom filters give us false positives.) Then, to add a key to a bloom filter, we hash it several times (either with several different hash functions or double hashing with the same hash function several times). For each of the resulting indices, we set the array element at that index to 1. LevelDB uses double hashing:
static uint32_t BloomHash(const Slice& key) {
return Hash(key.data(), key.size(), 0xbc9f1d34);
}
for (int i = 0; i < n; i++) {
// Use double-hashing to generate a sequence of hash values.
// See analysis in [Kirsch,Mitzenmacher 2006].
uint32_t h = BloomHash(keys[i]);
const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
for (size_t j = 0; j < k_; j++) {
const uint32_t bitpos = h % bits;
array[bitpos/8] |= (1 << (bitpos % 8));
h += delta;
}
}
Notice also from the above code that LevelDB hashes k_ times. As always, there's a tradeoff: less hash functions means more false positives, and more hash functions means a slower bloom filter. LevelDB calculates k_ as follows:
// We intentionally round down to reduce probing cost a little bit
k_ = static_cast<size_t>(bits_per_key * 0.69); // 0.69 =~ ln(2)
if (k_ < 1) k_ = 1;
if (k_ > 30) k_ = 30;
When looking up a key using a bloom filter, we hash the key using the same hashing method, and check to see if the elements at the resulting indices in the array are equal to 1. If they all are, then the key is very likely in the SST. (We are not 100% certain the key is in the SST because hashing some other keys could have coincidentally set all these elements to 1 - thus, a false positive.) However, if not all the elements at the resulting indices are equal to 1, we know for sure that the element is not in the SST. Here's part of LevelDB's implementation of looking up a key:
uint32_t h = BloomHash(key);
const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
for (size_t j = 0; j < k; j++) {
const uint32_t bitpos = h % bits;
if ((array[bitpos/8] & (1 << (bitpos % 8))) == 0) return false;
h += delta;
}
return true;
According to LevelDB's documentation, using bloom filters "reduces the number of unnecessary disk reads need for Get() calls by a factor of 100."
Wrapping up, LevelDB is a cool C++ library, and there's a lot we can learn by reading the code - this post is just scratching the surface :)
If you don’t know much about LevelDB, here’s a quick overview: LevelDB is a C++ library that stores key value pairs of byte arrays sorted by key. Operations include Put(key, value), Get(key) and Delete(key). It’s called LevelDB because it is arranged by levels: cache, log, level 0, level 1, level 2 … In general, after Level x reaches a limit of 10^x MB, one of its files gets compacted (merged periodically) into Level x+1. Level 0 is a special level: when it hits its limit of 4 files, at least one of its file gets compacted into Level 1. Furthermore, the files in Level 0 can contain overlapping keys, whereas the keys in the files in the other levels cannot.
One of the things that interested me was the underlying data structures used in LevelDB. Each of the files in the levels is a sorted string table (SST), which is an map of string keys to string values sorted by key on disk. Values can be either the current value corresponding to the key or a deletion marker, and duplicate keys are fine. You look up values through an index that keeps track of the keys and the locations of their values. The log is stored in a memtable, which is just a SST in memory, and periodically flushed to a SST. Overall, this is one implementation of a log structured merge tree (LSM), which is known for handling write-heavy workloads well. Compared to B trees, which must update roughly O(number of keys) nodes often located in random locations for each write, LSMs only have to do one sequential batch of writes.
However, B trees are usually faster at reading since unoptimized LSMs have to look through each of the levels (which are on disk). This leads me to another thing that interested me -- bloom filters. Bloom filters eliminate the need to look through so many levels. This data structure tells us that the key we're looking up either 1) has a high chance of being in the SST or 2) is definitely not in the SST. (Sometimes, bloom filters give us false positives.) Then, to add a key to a bloom filter, we hash it several times (either with several different hash functions or double hashing with the same hash function several times). For each of the resulting indices, we set the array element at that index to 1. LevelDB uses double hashing:
static uint32_t BloomHash(const Slice& key) {
return Hash(key.data(), key.size(), 0xbc9f1d34);
}
for (int i = 0; i < n; i++) {
// Use double-hashing to generate a sequence of hash values.
// See analysis in [Kirsch,Mitzenmacher 2006].
uint32_t h = BloomHash(keys[i]);
const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
for (size_t j = 0; j < k_; j++) {
const uint32_t bitpos = h % bits;
array[bitpos/8] |= (1 << (bitpos % 8));
h += delta;
}
}
Notice also from the above code that LevelDB hashes k_ times. As always, there's a tradeoff: less hash functions means more false positives, and more hash functions means a slower bloom filter. LevelDB calculates k_ as follows:
// We intentionally round down to reduce probing cost a little bit
k_ = static_cast<size_t>(bits_per_key * 0.69); // 0.69 =~ ln(2)
if (k_ < 1) k_ = 1;
if (k_ > 30) k_ = 30;
When looking up a key using a bloom filter, we hash the key using the same hashing method, and check to see if the elements at the resulting indices in the array are equal to 1. If they all are, then the key is very likely in the SST. (We are not 100% certain the key is in the SST because hashing some other keys could have coincidentally set all these elements to 1 - thus, a false positive.) However, if not all the elements at the resulting indices are equal to 1, we know for sure that the element is not in the SST. Here's part of LevelDB's implementation of looking up a key:
uint32_t h = BloomHash(key);
const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
for (size_t j = 0; j < k; j++) {
const uint32_t bitpos = h % bits;
if ((array[bitpos/8] & (1 << (bitpos % 8))) == 0) return false;
h += delta;
}
return true;
According to LevelDB's documentation, using bloom filters "reduces the number of unnecessary disk reads need for Get() calls by a factor of 100."
Wrapping up, LevelDB is a cool C++ library, and there's a lot we can learn by reading the code - this post is just scratching the surface :)