The following ideas are from a ‘blog’ of one of my friends. He posted this ‘blog’ on Wechat, and Wechat doesn’t provide url or allow external users to access it. So I can’t referenece a link here.
BigTable is most diffult to understand paper because it 1) names lots of its terminologies quite misleadingly 2) barely illustrated any detail.
I’ve been confused of BigTable, until one day I read a blog called Understanding HBase and BigTable. I strongly recommending everyone find and read that blog.
Here is the first sentence of the “Data Model” section:
> A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
So, BigTable is a Map, not a Table! Right, you’ve been misleaded to associate BigTable with a database table because of namings. What is a map? A map is key-value data structure! Thus, BigTable is a key-value store. Probably Google should better name it BigMap instead of BigTable!
The BigTable paper continues, explaining that:
> The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
So how’s the map look like? The Map’s value is an array of bytes. Fairly simple. Map’s key as
<key1, key2, key3> is a combination of
<row, column, timestamp>,
column as both string,
timestamp as a 64 bit number.
column key is more complex, it’s formatted as
prefix is called column family, which is of limited number in each BigMap.
suffix can be of unlimited number.
Till now, you can tell that BigTable has nothing to do with a regular table in transactional database. It’s key-value data model.
BigMap is a SortedMap, not a HashMap. HBase/BigTable the key/value pairs are kept in strict alphabetical order.
Because these systems tend to be so huge and distributed, this sorting feature is actually very important. The spacial propinquity of rows with like keys ensures that when you must scan the table, the items of greatest interest to you are near each other.
This is important when choosing a row key convention. For example, consider a table whose keys are domain names. It makes the most sense to list them in reverse notation (so “com.jimbojw.www” rather than “www.jimbojw.com") so that rows about a subdomain will be near the parent domain row.
Continuing the domain example, the row for the domain “mail.jimbojw.com” would be right next to the row for “www.jimbojw.com" rather than say “mail.xyz.com” which would happen if the keys were regular domain notation.
It’s important to note that the term “sorted” when applied to HBase/BigTable does not mean that “values” are sorted. There is no automatic indexing of anything other than the keys, just as it would be in a plain-old map implementation.
Persistence merely means that the data you put in this special map “persists” after the program that created or accessed it is finished. This is no different in concept than any other kind of persistent storage such as a file on a filesystem.
As already mentioned, a given row can have any number of columns in each column family, or none at all. The other type of sparseness is row-based gaps, which merely means that there may be gaps between keys.
This, of course, makes perfect sense if you’ve been thinking about HBase/BigTable in the map-based terms of this article rather than perceived similar concepts in RDBMS’s.
BigTable provides several ways of searching
timestamp, return the largest element that is less than
column, return the element with largest
column, return all elements with the same
Though there’s not much details in BigTable paper. Luckily, Google open-sourced LevelDB, a key-value store which is well recognized as the implementation of BigTable on a single node. Thus, LevelDB can be a pretty authentic source for us to learn implementation details of BigTable.
BigTable is composed of a client library, a Master Server, and lots of Tablet Servers. A concrete BigTable would be broken into tablets of size from 100MB to 200MB, and those tablets will be distributed to some Tablet Servers to server client requests.
When system runs, the number of Tablet Servers is not fixed. The number will scale up and down according to the actual workload, and is controlled by the Master Server. Tablet Server doesn’t store any actual files - they act as a service and proxy to visit actual files in Google File System (See the foundamental impact of GFS/HDFS?).
Unlike Tablet Server, Master Server always exists. Master Server stores the metadata of which tablet is distributed to which Tablet Server, whether scale up or down Tablet Servers, and how to load balance among all Tablet Servers. Adding new Tables or modifying existing Tables, for instance adding a new column family, is all handled by Master Server through Tablet Server. While Master Server is not responsible for managing any Tablet.
Well, unlike what most people imagne, clients doesn’t need to talk to Master Server when it tries to visit a BigTable. This design greatly mitigates the load of Master Server. So, how does client visit BigTable? It uses Chubby!
Chubby is a highly-available distributed lock service. The open-source ZooKeeper project is a copycat of it. Chubby implements a file-system-like structure. Clients can visit those files to get lock of the visited object. According to the BigTable paper, Chubby is used by BigTable clients to locate Tablet and by Master Server to monitor Tablet Servers.
The most important for us to know is how clients manipulate BigTable data?
The first step is to visit metadata in three layers. First layer is a Chubby file. Through that Chubby file, clients can locate Root Tablet, which is the second layer. What’s special about Root Tablet is that is’s never partitioned. The last layer is all Metadata Tablets hold by Root Tablet. Thus, generally speaking, clients can locate data by visiting a Chubby file, a Root Tablet, and a Metadata Tablet.
Clients will cache the metadata it has found for future reference, but caching is not required. Besides, because of Chubby, Master Server doesn’t take any responsibilities in data locating, thus it’s workload is quite small.
The paper doesn’t talk about what’s the format of BigTable’s metadata. The most detailed description is this - “The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet’s table identifier and its end row.” Well, it’s actually not important. Engineers can always come up with their own formats.
In BigTable, SSTable (Sorted Strings Table) is a basic unit. Each Tablet has several SSTables. Paper doesn’t reveal how SSTable is implemented. But, we can deduce that based on our understanding of the open-sourced LevelDB. The implementation brings back to life a data structure called LSM-Tree developed by Patrick O’Neil, a retired professor from UMass Boston. Here’s the paper The Log-Structured Merge-Tree (LSM-Tree).
SSTable in LevelDB is composed of 1) memTable and 2) SSTable on disk. In memory, it uses skip list. Thus, write is only towards memory, and is very very fast. When memory is full, a memTable is turned into an immutable memTable. Then LevelDB opens a new writable memTable, and starts a new process to write the immutable memTable to disk as a SSTable. When this step finishes, space of the immutable memTable in memory is released. Over and over, disk accumulates lots of SSTables, and here comes the compaction. SSTables have different levels, like level1, level2, etc. SSTables that are of level1 will go through compaction called minor compact. There’s major compact in further levels. From level2, no more two SSTable will have overlapped keys in their trees, which is not guaranteed in level1.
Therefore, a read request needs to access memTable, immutable memTable, tree in level1, and one tree for any further level after and including level2. This indicates that a read transaction is more expensive than a write transaction.
Well, another noticeable thing is that, if clients try to access latest data, that data might be in memory. Thus, LevelDB’s design optimizes read for data of new version by sacrificing read for cold data.
BTW, one of biggest differences between LevelDB and RocksDB, a Facebook’s copycat of LevelDB, is that RocksDB introduced something called universal compact. I’m not familiar with what it is.
Of course, like lots of other similar system, BigTable’s recovery strategy is based on logs. All write transactions are persisted in logs before going into memory. LevelDB’s log format is of the classic append only strategy.
What’s funny yet sad is that LSM-Tree is such a gorgeous invention but its author hasn’t earned the matching reputation.