Notes on a data-inconsistency problem when querying HBase through newAPIHadoopRDD

Symptom:

+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|totalCount|January|February|March|April|  May|June|July|August|September|October|November|December|totalMileage|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|     33808|      0|       0|    0|    0|33798|   0|   0|     0|        0|      0|       0|       0|     79995.0|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+

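The table is currently pre-split into 10 regions.

Going by this month's data, the test table holds 33798 rows in total.

A count in HBase itself also returns 33798.

The strange part: the count returned by Spark SQL querying HBase is 33808.

The SQL statement at the time was: select count(1) from orderData

This is odd, because the SQL query reports 10 more rows than the table actually holds.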

============================================================

Cause:

The job sets HBase's SCAN_BATCHSIZE, which in turn sets the batch size of the scan. The documentation for this setting says:

Set the maximum number of values to return for each call to next()

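I had always assumed this controlled how many rows are read at a time. In fact, "values" appears to mean cells (columns) rather than rows, and enabling the setting makes an HBase scan return partial results for a row, so a single row can arrive as more than one Result.

Commenting this setting out made the program behave correctly.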
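To make the effect concrete, here is a minimal sketch of how returning a row in pieces inflates a naive count. It is a toy model in plain Scala, not HBase code: `Cell`, `PartialResult`, and `scan` are hypothetical stand-ins, and the split points here are driven only by the batch size, whereas in the real client HBase decides where a row is cut.

```scala
object BatchSplitDemo {
  // Hypothetical stand-ins for HBase types: a cell belongs to a row, a result is one group of cells.
  final case class Cell(rowKey: String, column: String)
  final case class PartialResult(cells: Seq[Cell])

  /** Split each row's cells into groups of at most `batch` cells; batch <= 0 returns whole rows. */
  def scan(rows: Map[String, Seq[String]], batch: Int): Seq[PartialResult] =
    rows.toSeq.sortBy(_._1).flatMap { case (rowKey, columns) =>
      val cells = columns.map(Cell(rowKey, _))
      if (batch <= 0) Seq(PartialResult(cells))
      else cells.grouped(batch).map(PartialResult(_)).toSeq
    }

  def main(args: Array[String]): Unit = {
    val table = Map(
      "row1" -> Seq("a", "b", "c"),
      "row2" -> Seq("a", "b", "c", "d", "e")
    )
    val unbatched = scan(table, batch = -1)
    val batched = scan(table, batch = 2)
    println(s"results without batch: ${unbatched.size}")  // 2, one per row
    println(s"results with batch=2: ${batched.size}")     // 5, rows arrive in pieces
    println(s"distinct row keys: ${batched.map(_.cells.head.rowKey).distinct.size}")  // 2
  }
}
```

Counting results (as count(1) over the RDD effectively does) gives 5, while counting distinct row keys still gives 2.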

Going a step further, let's look at this setting in the HBase client code. HBase's Scan has two member variables:

private boolean allowPartialResults = false;
private int batch = -1;

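allowPartialResults is obviously the switch for returning partial results, but what about batch? setBatch() does not set allowPartialResults. However, in the client scanner's getResultsToAddToCache() function (in ClientScanner rather than in Scan itself), isBatchSet is set to true when the batch value is greater than 0. It is followed by this code: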

    // If the caller has indicated in their scan that they are okay with seeing partial results,
    // then simply add all results to the list. Note that since scan batching also returns results
    // for a row in pieces we treat batch being set as equivalent to allowing partials. The
    // implication of treating batching as equivalent to partial results is that it is possible
    // the caller will receive a result back where the number of cells in the result is less than
    // the batch size even though it may not be the last group of cells for that row.
    if (allowPartials || isBatchSet) {
      addResultsToList(resultsToAddToCache, resultsFromServer, 0,
          (null == resultsFromServer ? 0 : resultsFromServer.length));
      return resultsToAddToCache;
    }
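The branch above can be modeled roughly as follows. This is a simplified sketch, not the HBase implementation: `Piece` and `resultsToCache` are hypothetical names, and the "merge adjacent pieces of the same row" step is only an approximation of what the real client does when neither flag is set.

```scala
object ClientCacheModel {
  // Hypothetical simplified type: one piece of a row as returned by the server.
  final case class Piece(rowKey: String, cells: Seq[String])

  /** Decide what the client hands to the caller for one batch of server results. */
  def resultsToCache(fromServer: Seq[Piece], allowPartials: Boolean, batch: Int): Seq[Piece] = {
    val isBatchSet = batch > 0
    if (allowPartials || isBatchSet) {
      // Batch being set is treated like allowing partials: pieces are passed through untouched.
      fromServer
    } else {
      // Otherwise adjacent pieces of the same row are stitched back into one complete row.
      fromServer.foldLeft(Vector.empty[Piece]) { (acc, piece) =>
        acc.lastOption match {
          case Some(last) if last.rowKey == piece.rowKey =>
            acc.init :+ last.copy(cells = last.cells ++ piece.cells)
          case _ => acc :+ piece
        }
      }
    }
  }
}
```

With pieces for r1, r1, r2 coming from the server, the model returns two complete rows when batch is -1 and three results when batch is 10000, which is the kind of over-count seen in the symptom.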

The earlier, faulty code (the culprit is TableInputFormat.SCAN_BATCHSIZE):

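import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer

lazy val buildScan = {

    // Connection, table, column and row-range settings read by TableInputFormat
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", GlobalConfigUtils.hbaseQuorem)
    hbaseConf.set(TableInputFormat.INPUT_TABLE, hbaseTableName)
    hbaseConf.set(TableInputFormat.SCAN_COLUMNS, queryColumns)
    hbaseConf.set(TableInputFormat.SCAN_ROW_START, startRowKey)
    hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, endRowKey)
    hbaseConf.set(TableInputFormat.SCAN_BATCHSIZE, "10000") // TODO this line caused the inconsistent query results
    hbaseConf.set(TableInputFormat.SCAN_CACHEDROWS, "10000")
    hbaseConf.set(TableInputFormat.SHUFFLE_MAPS, "1000")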

    val hbaseRdd = sqlContext.sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result]
    )

    // Convert each HBase Result into a Spark SQL Row, one value per configured field
    val rs: RDD[Row] = hbaseRdd.map(tuple => tuple._2).map(result => {

      val values = new ArrayBuffer[Any]()
      hbaseTableFields.foreach { field =>
        values += Resolver.resolve(field, result)
      }
      Row.fromSeq(values.toSeq)
    })
    rs
  }

Fix:

Simply remove the TableInputFormat.SCAN_BATCHSIZE setting.
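As a configuration fragment, the change might look like the sketch below. The quorum string and table name are hypothetical placeholders, not values from the original job. If the original intent was to control how many rows are fetched per round trip, TableInputFormat.SCAN_CACHEDROWS (which sets the scan's caching) is the setting that works in rows, so it is kept.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")     // hypothetical quorum
hbaseConf.set(TableInputFormat.INPUT_TABLE, "order_data")  // hypothetical table name

// SCAN_BATCHSIZE is deliberately NOT set, so each Result handed to Spark is a complete row.
// hbaseConf.set(TableInputFormat.SCAN_BATCHSIZE, "10000")

// Rows fetched per scanner round trip; this is the row-oriented knob.
hbaseConf.set(TableInputFormat.SCAN_CACHEDROWS, "10000")
```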

Query result after removing it:

+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|totalCount|January|February|March|April|  May|June|July|August|September|October|November|December|totalMileage|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
|     33798|      0|       0|    0|    0|33798|   0|   0|     0|        0|      0|       0|       0|     79995.0|
+----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+

Problem solved.