SQL for HBase

本文转自：http://www.orzota.com/sql-for-hbase/

HBase provides random, read-write access to Big Data stored in very large tables as a distributed columnar store. HBase thus gained immediately popularity as a Big Data technology that unlike Hadoop which was primarily used in the backend data warehousing infrastructure could be deployed to service online transactions as well. Just as Hive brought SQL to Hadoop (and now to HBase as described in my earlier blog post), there are many alternate projects providing SQL for HBase which is the subject of this blog article.

Phoenix

Phoenix is a technology developed by Salesforce.com to put a SQL skin over HBase. This is not adding a new layer, rather it exposes HBase functionality through SQL using an embedded JDBC Driver that allows clients to run at native HBase speed. The JDBC driver compiles SQL into native HBase calls.

The claim is that performance of Phoenix is much better than that of Hive over HBase – upto 10x at 100 million rows. Performance is also more constant as the number of rows increases. Optimizations are transparent to the user.

Impala

Cloudera has been promoting Impala heavily since its announcement in Strata last year. Impala is a SQL engine that can run on HDFS or HBase or both. They too claim that Impala performance is much better than that of Hive – over 45 times! However this comparison seems to be based on native Hive, not Hive over HBase. The performance gains are achieved by not writing intermediate results to disk (like MapReduce), no spin up/down times, optimized code as Impala is written in C++ rather than Java. This means that it needs to use JNI to make calls into Java for HBase integrations.

Impala also includes authentication via Kerberos support, JDBC/ODBC drivers (similar to Phoenix), an interactive shell and support for popular hadoop file formats such as parquet, avro, thrift and sequence files. Parquet was specifically engineered for Impala to get the best performance.

The use cases for HBase and Impala include the following:

Data is being streamed in
Need to run queries on data as it is being created
Need real time edits of data in-place
Need random access to computed results (results can be placed in HBase)

Drill

Apache Drill (the only in this series which is an Apache project) was inspired by Google Dremel. It is the most interesting in terms of its emphasis on interactive analysis of large scale datasets. It is similar to Impala in many ways but is community driven. Although Hive is also a community project, the need for Drill rose because of the tight coupling of Hive to MapReduce which is based on a pessimistic execution principle i.e. all jobs will be long running and process lots of data. This causes a lot of overhead for short-running jobs which Drill addresses.

Drill does great when execution times of 100ms to 10 seconds are required. Phoenix provides even better response times – below 10ms.

Drill is unique in its focus on analytical queries, so a lot of emphasis is placed on bulk reads and aggregations. Phoenix is more focused on write optimizations as in the RDBMS world.

Drill also provides extensions to SQL for nested data types similar to Big Query. Many systems nest data structures inside of HBase cells (for example a cell can be a whole log record which in turn has many nested fields). It is much better to treat these nested cells and process them in the query language for optimal performance.

Drill leverages recent research approaches (late record materialization, vectorized operators, etc.) As such it definitely seems the most promising for analytical work.

Summary

With all of these different options and rich technologies that provide SQL for HBase, HBase’s popularity is bound to grow. Of course with choice comes decisions. Which technology is right for you? Orzota can help analyze your use case and recommend the right set of technologies based not just on technical merits alone, but also taking into consideration its maturity, cost and vendor support. Please contact usfor more information.