HBaseTDG Architecture

Seek vs. Transfer

I have previously written a detailed comparison of B+ trees and LSM trees:

http://www.cnblogs.com/fxjwind/archive/2012/06/09/2543357.html

The last post there gives a good analysis of the essential differences between B+ trees and LSM trees (Log-Structured Merge-Trees): the balance between read and write efficiency, global ordering versus local ordering, and so on.
But at the time I did not really understand the title "Seek vs. Transfer", so that is the point I want to explain here.

A B+ tree also has to be stored on disk, and the unit of exchange with the disk is the page. Following the example in the book, when a page exceeds the configured size it is split.

The problem, as described below, is that adjacent pages are not necessarily adjacent on disk; they may be far apart from each other.

The issue here is that the new pages aren't necessarily next to each other on disk. So now if you ask to query a range from key 1 to key 3, it's going to have to read two leaf pages which could be far apart from each other.

So whether we read or write a node of the B+ tree, the first step is a disk seek to locate the page holding that node, which is very inefficient. And as the data below shows, the problem keeps getting worse as disks grow larger.

While CPU, RAM and disk size double every 18-24 months the seek time remains nearly constant at around 5% speed-up per year.

For reads we can partially work around this with a buffer cache, but for random writes the issue cannot be avoided, and the writes also produce a lot of page fragmentation.
So for workloads with heavy random writes, a B+ tree is not a suitable index structure.
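
To make the fragmentation concrete, here is a toy Java sketch (my own illustration, not code from the book or any real B-tree implementation). Pages live in a list that stands in for the page file, and a split appends the new right sibling at the end of that file, so walking the leaf chain in key order means hopping between distant file offsets:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy model of B+ tree leaf pages: a split appends the new page at the END
// of the page file, so logically adjacent leaves end up physically far apart.
public class LeafSplitSketch {
    static final int MAX_KEYS_PER_PAGE = 4;            // assumed page capacity

    static class Page {
        final List<Integer> keys = new ArrayList<>();
        int nextPageId = -1;                           // leaf chain used for range scans
    }

    // The "page file": the list index stands in for the on-disk offset of a page.
    static final List<Page> file = new ArrayList<>();

    public static void main(String[] args) {
        file.add(new Page());                          // page 0 = first leaf
        List<Integer> keys = new ArrayList<>();
        for (int k = 1; k <= 30; k++) keys.add(k);
        Collections.shuffle(keys, new Random(42));     // random insertion order
        for (int k : keys) insert(k);

        // Walk the leaf chain in key order: the file offsets jump back and forth.
        for (int id = 0; id != -1; id = file.get(id).nextPageId) {
            System.out.println("offset " + id + " -> " + file.get(id).keys);
        }
    }

    static void insert(int key) {
        // Find the leaf whose key range covers the new key.
        int id = 0;
        while (file.get(id).nextPageId != -1
                && key > file.get(file.get(id).nextPageId).keys.get(0)) {
            id = file.get(id).nextPageId;
        }
        Page leaf = file.get(id);
        leaf.keys.add(key);
        Collections.sort(leaf.keys);
        if (leaf.keys.size() > MAX_KEYS_PER_PAGE) {    // page overflow -> split
            Page right = new Page();
            int mid = leaf.keys.size() / 2;
            right.keys.addAll(leaf.keys.subList(mid, leaf.keys.size()));
            leaf.keys.subList(mid, leaf.keys.size()).clear();
            right.nextPageId = leaf.nextPageId;
            file.add(right);                           // new page lands at the file end
            leaf.nextPageId = file.size() - 1;
        }
    }
}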

The LSM tree, by contrast, optimizes random writes quite well. Of course random writes to disk are hard to speed up directly, so the strategy is to buffer the random writes in memory, sort them there, and finally flush them to disk in batches, which turns random writes into sequential writes. For a detailed introduction to LSM trees, see http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html

This effectively avoids the seek problem: data is simply transferred to disk in a continuous stream, which makes writes much more efficient.
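
Here is a minimal Java sketch of that write path (my own illustration under simplified assumptions, not HBase code): puts go into a sorted in-memory buffer, and once the buffer reaches a threshold it is flushed to a new file in one sequential pass. The file names and the tiny threshold are made up for the example.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// LSM-style write path: absorb random writes in a sorted in-memory buffer,
// then flush the whole buffer sequentially into a new immutable file.
public class MemStoreSketch {
    private static final int FLUSH_THRESHOLD = 4;          // tiny, just for the demo

    // A sorted map keeps the buffered edits in key order at insert time.
    private final TreeMap<String, String> memstore = new TreeMap<>();
    private int flushCount = 0;

    public void put(String rowKey, String value) throws IOException {
        memstore.put(rowKey, value);                        // random write hits memory only
        if (memstore.size() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    private void flush() throws IOException {
        // One sequential pass writes the already-sorted data; no per-key seek.
        Path file = Paths.get("storefile-" + (flushCount++) + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            for (Map.Entry<String, String> e : memstore.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue());
                out.newLine();
            }
        }
        memstore.clear();                                   // start a fresh in-memory buffer
    }

    public static void main(String[] args) throws IOException {
        MemStoreSketch store = new MemStoreSketch();
        String[] rows = {"row9", "row2", "row7", "row1", "row5", "row3", "row8", "row4"};
        for (String r : rows) {
            store.put(r, "value-of-" + r);                  // arrives in random key order
        }
    }
}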

The book gives numbers to show how much more efficient LSM tree writes are:

When updating 1% of entries (100,000,000) it takes:

• 1,000 days with random B-tree updates
• 100 days with batched B-tree updates
• 1 day with sort and merge

The downside, of course, is that global ordering can no longer be guaranteed, so reads become somewhat less efficient; this is addressed through merging (compaction) and Bloom filters.
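
To illustrate the Bloom filter part, here is a minimal sketch (my own simplified filter with made-up hash positions, not the implementation HBase ships): before paying a disk seek on a store file, the reader asks the file's filter whether the row key can possibly be present, and skips the file on a definite no.

import java.util.BitSet;

// Minimal Bloom filter: set bits at hash positions on add; on lookup, a clear
// bit proves absence, while all-set bits mean "maybe present" (false positives possible).
public class BloomSketch {
    private final BitSet bits;
    private final int size;

    public BloomSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash positions derived from the key; real filters use several
    // independent hash functions sized for a target false-positive rate.
    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + 17, size); }

    public void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    // false -> the key is definitely not in the file (no disk seek needed)
    // true  -> the key may be in the file (possible false positive)
    public boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1024);
        filter.add("row-0001");
        filter.add("row-0042");
        System.out.println(filter.mightContain("row-0042"));  // true
        System.out.println(filter.mightContain("row-9999"));  // false (almost certainly)
    }
}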

The B+ tree is the traditional approach and supports full CRUD. Practice has shown that supporting update and delete really does make a data system much more complex, and in essence the arrival of new data does not negate the fact that the old data once existed.
So simplifying down to create and read can greatly simplify the system and also makes it easier to tolerate failures.
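
A tiny sketch of what "create and read only" looks like in practice (my own illustration; the tombstone idea matches how LSM-style stores generally handle deletes, not HBase code): updates and deletes are expressed as newly appended records, and a delete is a tombstone that shadows older versions at read time.

import java.util.ArrayList;
import java.util.List;

// Append-only store: writes only ever add records; reads pick the newest one.
public class AppendOnlySketch {
    static class Cell {
        final String rowKey;
        final long seq;              // write order; the newest record wins
        final String value;
        final boolean tombstone;
        Cell(String rowKey, long seq, String value, boolean tombstone) {
            this.rowKey = rowKey;
            this.seq = seq;
            this.value = value;
            this.tombstone = tombstone;
        }
    }

    private final List<Cell> log = new ArrayList<>();   // append-only store
    private long nextSeq = 0;

    public void put(String rowKey, String value) {
        log.add(new Cell(rowKey, nextSeq++, value, false));   // "update" = newer append
    }

    public void delete(String rowKey) {
        // Nothing is removed in place: a tombstone record is appended instead.
        log.add(new Cell(rowKey, nextSeq++, null, true));
    }

    public String get(String rowKey) {
        // Read the newest record for the key; a tombstone hides older values.
        Cell newest = null;
        for (Cell c : log) {
            if (c.rowKey.equals(rowKey) && (newest == null || c.seq > newest.seq)) {
                newest = c;
            }
        }
        return (newest == null || newest.tombstone) ? null : newest.value;
    }

    public static void main(String[] args) {
        AppendOnlySketch store = new AppendOnlySketch();
        store.put("row1", "v1");
        store.put("row1", "v2");
        System.out.println(store.get("row1"));   // v2
        store.delete("row1");
        System.out.println(store.get("row1"));   // null
    }
}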

The above is my own understanding; below is the book's original comparison.

Comparing B+ trees and LSM-trees is about understanding where they have their relative strengths and weaknesses.
B+ trees work well until there are too many modifications, because they force you to perform costly optimizations to retain that advantage for a limited amount of time.
The more and faster you add data at random locations, the faster the pages become fragmented again. Eventually you may take in data at a higher rate than the optimization process takes to rewrite the existing files.
The updates and deletes are done at disk seek rates, and force you to use one of the slowest metric a disk has to offer.


LSM-trees work at disk transfer rates and scale much better to handle vast amounts of data.
They also guarantee a very consistent insert rate, as they transform random writes into sequential ones using the log file plus in-memory store.
The reads are independent from the writes, so you also get no contention between these two operations.
The stored data is always in an optimized layout. So, you have a predictable and consistent bound on number of disk seeks to access a key, and reading any number of records following that key doesn't incur any extra seeks. In general, what could be emphasized about an LSM-tree based system is cost transparency: you know that if you have five storage files, access will take a maximum of five disk seeks. Whereas you have no way to determine the number of disk seeks a RDBMS query will take, even if it is indexed.

Storage

I have already taken notes on this part in this post:

http://www.cnblogs.com/fxjwind/archive/2012/08/21/2649499.html

Write-Ahead Log

The region servers keep data in-memory until enough is collected to warrant a flush to disk, avoiding the creation of too many very small files. While the data resides in memory it is volatile, meaning it could be lost if the server loses power, for example. This is a typical problem, as explained in the section called “Seek vs. Transfer”.
A common approach to solving this issue is write-ahead logging[87]:
each update (also called "edit") is written to a log, and only if that has succeeded the client is informed that the operation has succeeded.
The server then has the liberty to batch or aggregate the data in memory as needed.

The idea is really simple: data needs to be buffered in memory and flushed to disk in batches, but data held only in memory is easily lost, so a WAL is used to solve that problem.
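
A minimal sketch of that ordering (my own illustration, not the HBase HLog implementation; the file name is made up): the edit is appended and synced to a log file first, and only after the sync succeeds is it applied to the in-memory store and acknowledged.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

// Write-ahead logging: persist the edit before touching volatile memory.
public class WalSketch {
    private final RandomAccessFile wal;
    private final TreeMap<String, String> memstore = new TreeMap<>();

    public WalSketch(String walPath) throws IOException {
        this.wal = new RandomAccessFile(walPath, "rw");
        this.wal.seek(this.wal.length());                    // append at the end of the log
    }

    public void put(String rowKey, String value) throws IOException {
        // 1. Persist the edit to the log and force it to disk.
        byte[] record = (rowKey + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
        wal.write(record);
        wal.getFD().sync();                                   // if this fails, the whole put fails

        // 2. Only now apply it to the volatile in-memory store; after a crash
        //    the memstore content can be rebuilt by replaying the log.
        memstore.put(rowKey, value);
    }

    public static void main(String[] args) throws IOException {
        WalSketch server = new WalSketch("region-server.wal");
        server.put("row1", "value1");                         // acknowledged only after the sync
    }
}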

Overview

The WAL is the lifeline that is needed when disaster strikes. Similar to a binary log in MySQL, it records all changes to the data.
This is important in case something happens to the primary storage. If the server crashes it can effectively replay the log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails, the whole operation must be considered a failure.

Since it is shared by all regions hosted by the same region server it acts as a central logging backbone for every modification.
All regions hosted by a region server share one WAL. See the Refinements section of the Bigtable paper for the optimizations it applies to this logging mechanism.

HLog Class

The class which implements the WAL is called HLog. When a HRegion is instantiated the single HLog instance is passed on as a parameter to the constructor of HRegion. When a region receives an update operation, it can save the data directly to the shared WAL instance.
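
A small sketch of that wiring (simplified stand-ins for HRegion and HLog, not the real classes): the server creates one log object and hands the same instance to every region it hosts, so all regions append their edits to the same shared WAL.

import java.util.ArrayList;
import java.util.List;

// One shared log instance per server, injected into each hosted region.
public class SharedLogSketch {
    static class Log {                                   // stands in for HLog
        synchronized void append(String regionName, String rowKey, String value) {
            System.out.println(regionName + " -> WAL: " + rowKey + "=" + value);
        }
    }

    static class Region {                                // stands in for HRegion
        private final String name;
        private final Log sharedLog;                     // passed in by the server
        Region(String name, Log sharedLog) { this.name = name; this.sharedLog = sharedLog; }
        void put(String rowKey, String value) {
            sharedLog.append(name, rowKey, value);       // write to the shared WAL first
            // ...then update this region's own in-memory store.
        }
    }

    public static void main(String[] args) {
        Log serverLog = new Log();                       // single WAL for the whole server
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("table,start-a", serverLog));
        regions.add(new Region("table,start-m", serverLog));
        for (Region r : regions) r.put("some-row", "some-value");
    }
}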

HLogKey Class

Currently the WAL is using a Hadoop SequenceFile, which stores records as sets of key/values.
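
For illustration, here is a stand-alone sketch of writing and reading key/value records with a Hadoop SequenceFile (it assumes the Hadoop client libraries are on the classpath, and it uses Text keys and values for simplicity; the real WAL stores HLogKey keys with edit values rather than plain strings).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Append key/value records to a SequenceFile, then read them back in order.
public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("wal-example.seq");

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        try {
            // Each WAL entry becomes one key/value record appended to the file.
            writer.append(new Text("region-a/row1/seq-1"), new Text("put row1 cf:col=v1"));
            writer.append(new Text("region-a/row2/seq-2"), new Text("put row2 cf:col=v2"));
        } finally {
            writer.close();
        }

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value);
            }
        } finally {
            reader.close();
        }
    }
}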

Read Path

This section is not written very systematically, but the point it wants to make is that reading a single row is not a simple get but a scan. Why? Because of the LSM tree structure: the data of one row can be scattered across memory and several different files.
For details see http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html

This also shows how important continuous compaction is; otherwise, faced with a large number of files, reads would be very slow. On top of that, Bloom filters and timestamps are used to filter the files further and improve read efficiency.
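
A simplified read-path sketch (my own illustration, not the HBase scanner classes): a get has to consult the memstore and every store file, because versions of a row may live in any of them; files that cannot contain the row are skipped first (here a plain key set stands in for the Bloom filter; a per-file time range check would sit in the same place), and the newest version wins.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// A "get" is really a small scan over the memstore and all candidate store files.
public class ReadPathSketch {
    static class Versioned {
        final long ts;
        final String value;
        Versioned(long ts, String value) { this.ts = ts; this.value = value; }
    }

    static class StoreFile {
        final TreeMap<String, Versioned> rows = new TreeMap<>();  // sorted, immutable once flushed
        final Set<String> bloomStandIn = new TreeSet<>();         // stands in for a Bloom filter
        void put(String rowKey, long ts, String value) {
            rows.put(rowKey, new Versioned(ts, value));
            bloomStandIn.add(rowKey);
        }
    }

    private final TreeMap<String, Versioned> memstore = new TreeMap<>();
    private final List<StoreFile> storeFiles = new ArrayList<>();

    public String get(String rowKey) {
        Versioned newest = memstore.get(rowKey);
        for (StoreFile f : storeFiles) {
            if (!f.bloomStandIn.contains(rowKey)) continue;       // skip: cannot contain the row
            Versioned v = f.rows.get(rowKey);
            if (v != null && (newest == null || v.ts > newest.ts)) newest = v;
        }
        return newest == null ? null : newest.value;
    }

    public static void main(String[] args) {
        ReadPathSketch region = new ReadPathSketch();
        StoreFile older = new StoreFile();
        older.put("row1", 1L, "old-value");
        StoreFile newer = new StoreFile();
        newer.put("row1", 2L, "newer-value");
        region.storeFiles.add(older);
        region.storeFiles.add(newer);
        region.memstore.put("row1", new Versioned(3L, "memstore-value"));
        System.out.println(region.get("row1"));   // memstore-value: the newest version wins
    }
}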

Region Lookups

For the clients to be able to find the region server hosting a specific row key range HBase provides two special catalog tables called -ROOT- and .META..

The -ROOT- table is used to refer to all regions in the .META. table.
The design considers only one root region, i.e., the root region is never split to guarantee a three level, B+ tree like lookup scheme:
the first level is a node stored in ZooKeeper that contains the location of the root table's region, in other words the name of the region server hosting that specific region.
The second level is the lookup of a matching meta region from the -ROOT- table,
and the third is the retrieval of the user table region from the .META. table.

See section 5.1, Tablet Location, of the Bigtable paper; the scheme is exactly the same.
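
A toy sketch of the three-level lookup (my own illustration with made-up host names, not the HBase client code): ZooKeeper points at the -ROOT- region, -ROOT- maps row keys to the .META. region that covers them, and .META. maps row keys to the server hosting the user region. Each catalog level is modelled as a sorted map keyed by region start key.

import java.util.TreeMap;

// Three-level lookup: ZooKeeper -> -ROOT- -> .META. -> user region server.
public class RegionLookupSketch {
    // Level 1: ZooKeeper node with the server hosting the single -ROOT- region.
    static final String ROOT_REGION_SERVER = "rs-root.example.com";

    // Level 2: -ROOT- table: start key of a .META. region -> server hosting it.
    static final TreeMap<String, String> ROOT_TABLE = new TreeMap<>();

    // Level 3: .META. table: start key of a user region -> server hosting it.
    static final TreeMap<String, String> META_TABLE = new TreeMap<>();

    static {
        ROOT_TABLE.put("", "rs-meta-1.example.com");
        META_TABLE.put("", "rs-17.example.com");          // user region [ "", "row-m" )
        META_TABLE.put("row-m", "rs-42.example.com");     // user region [ "row-m", ... )
    }

    static String locateRegionServer(String rowKey) {
        // 1. Ask ZooKeeper where the -ROOT- region lives (a constant here).
        String rootServer = ROOT_REGION_SERVER;
        // 2. Ask the -ROOT- region which .META. region covers the row.
        String metaServer = ROOT_TABLE.floorEntry(rowKey).getValue();
        // 3. Ask that .META. region which server hosts the matching user region.
        String userRegionServer = META_TABLE.floorEntry(rowKey).getValue();
        System.out.println("root@" + rootServer + ", meta@" + metaServer);
        return userRegionServer;
    }

    public static void main(String[] args) {
        System.out.println(locateRegionServer("row-a"));  // rs-17.example.com
        System.out.println(locateRegionServer("row-z"));  // rs-42.example.com
    }
}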

Original post: https://www.cnblogs.com/fxjwind/p/2721616.html