HBase: optimizing data reads

1. Set scan caching

scan.setCaching(1000);

This sets the number of rows transferred from the server to the client per round trip (RPC).
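
To see why the caching value matters, the following is a rough sketch of how the number of round trips shrinks as caching grows. It is a simplified model, not HBase's actual accounting: it assumes every batch is filled to the caching limit and ignores size-based limits and the calls that open and close the scanner. The class name ScanRpcEstimate and the table size are hypothetical.

```java
/** Rough estimate of client/server round trips for a scan (simplified model). */
final class ScanRpcEstimate {
    private ScanRpcEstimate() {}

    /**
     * Number of batches needed to return totalRows rows when each batch
     * carries at most `caching` rows. Assumption: batches are always full;
     * size-based limits and scanner open/close calls are ignored.
     */
    static long roundTrips(long totalRows, int caching) {
        if (totalRows < 0 || caching <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and caching > 0");
        }
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching + " -> "
                    + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

A larger value means fewer round trips but more client memory per batch, so the right number depends on row size.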

2. Specify columns explicitly

scan.addColumn(cf, column);

Fetch only the columns you need. This reduces the amount of data transferred and the I/O cost.

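3. Close the ResultScanner when you are done with it (it is Closeable, so try-with-resources works). Otherwise the server may keep the scanner open for some time and its resources cannot be released,

which can make server-side resources unavailable and may trigger other RegionServer problems.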

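4. Disable the block cache for large scans: scan.setCacheBlocks(false) keeps a one-off full scan from evicting frequently read blocks from the cache.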

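5. Optimize row-key queries

Use a Filter to filter the data. Filtering runs on the RegionServer side, so it is efficient

and reduces the load on the client.

Code that fetches only the row keys: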

FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);

filterList.addFilter(new FirstKeyOnlyFilter());

filterList.addFilter(new KeyOnlyFilter());

scan.setFilter(filterList);

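6. Querying data from multiple threads

HTable is not thread-safe.

Use HTablePool. Note that HTablePool was removed in HBase 1.0; with the Connection API used in item 8, share one Connection and have each thread obtain its own Table via connection.getTable().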

7. Use batch reads

For Gets, use table.get(List<Get>) to fetch many rows in one call.

8. Use a coprocessor to count rows

Code:

Register the coprocessor

// Open a connection and get the Admin handle
Connection connection = ConnectionFactory.createConnection(conf);
Admin admin = connection.getAdmin();
TableName tableName1 = TableName.valueOf(tableName);
// Built-in aggregation endpoint that AggregationClient calls
String coprocessClassName = "org.apache.hadoop.hbase.coprocessor.AggregateImplementation";
// The table is offline while its descriptor is changed
admin.disableTable(tableName1);
HTableDescriptor htd = admin.getTableDescriptor(tableName1);
htd.addCoprocessor(coprocessClassName);
admin.modifyTable(tableName1, htd);
admin.enableTable(tableName1);

Query code

AggregationClient ac = new AggregationClient(conf);



try {

    System.out.println(ac.rowCount(tableName1,new LongColumnInterpreter(),scan));

} catch (Throwable throwable) {

    throwable.printStackTrace();

}

9. Consider building a cache on the client side
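
One way such a client-side cache could look is a small LRU read-through cache in front of the lookup. This is only a sketch under assumptions: the class name RowCache is hypothetical, the loader function stands in for the real HBase Get, the cache is not thread-safe, and it never invalidates entries when the table is written to.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Minimal LRU read-through cache (hypothetical helper; not thread-safe). */
final class RowCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader; // stands in for the real HBase lookup

    RowCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict once over capacity
            }
        };
    }

    /** Returns the cached value, loading and caching it on a miss. */
    V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            if (value != null) {
                entries.put(key, value);
            }
        }
        return value;
    }

    /** True if the key is currently cached; does not change LRU order. */
    boolean isCached(K key) {
        return entries.containsKey(key);
    }

    public static void main(String[] args) {
        RowCache<String, String> cache = new RowCache<>(2, k -> "value-of-" + k);
        cache.get("row1");
        cache.get("row2");
        cache.get("row1"); // served from the cache
        cache.get("row3"); // evicts row2, the least recently used key
        System.out.println(cache.isCached("row2"));
    }
}
```

Such a cache only pays off for rows that are read repeatedly and can tolerate slightly stale values.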

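10. How to check the row count of a single region with the hbase shell?

The same approach works for any row-key range. (An API-based approach (Spark, MapReduce, Hive SQL, the HBase API) is more convenient. The shell's count command did not accept STARTROW/STOPROW in the version used here, so the scan here is a fairly crude workaround; the row count is printed on the last line of the scan output.)

scan 'test:no_compact_shixq02_20190305',{STARTROW => '100',STOPROW => '123'}

Set STARTROW and STOPROW to the region's start key and end key, that is, the split keys that bound the region.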
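
To repeat this for every region, the start/stop pairs can be derived from the table's split keys. The sketch below assumes the split keys are already known and sorted; the class name RegionRanges and the example keys are hypothetical. An empty string stands for the open start of the first region and the open end of the last one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Derives per-region [startRow, stopRow) pairs from sorted split keys. */
final class RegionRanges {
    private RegionRanges() {}

    /** Returns one {startRow, stopRow} pair per region; "" means unbounded. */
    static List<String[]> fromSplitKeys(List<String> sortedSplitKeys) {
        List<String[]> ranges = new ArrayList<>();
        String start = ""; // first region starts at the beginning of the table
        for (String split : sortedSplitKeys) {
            ranges.add(new String[] {start, split});
            start = split; // next region starts where this one stops
        }
        ranges.add(new String[] {start, ""}); // last region runs to the end
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical split keys; n split keys give n + 1 regions
        for (String[] r : fromSplitKeys(Arrays.asList("100", "123"))) {
            System.out.println("STARTROW='" + r[0] + "' STOPROW='" + r[1] + "'");
        }
    }
}
```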

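11. How to check the storage size of a region?

In Table Details on the 60010 web UI, each region has a name, for example: test:no_compact_shixq02_20190305,,1554773814183.a48ba410b7cd9dd926fd0379995c097e.

Here a48ba410b7cd9dd926fd0379995c097e is the name of the region's directory on HDFS under this table.
bin/hadoop fs -du -h /hbp_root/tenantA/hbase/data reports directory sizes; the first column of the output is the stored data size.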
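
As a small illustration, the encoded region name can be pulled out of the full region name and turned into a directory path to pass to hadoop fs -du. This sketch assumes the standard layout <root>/data/<namespace>/<table>/<encoded region name>, and that /hbp_root/tenantA/hbase is the HBase root directory, as the command above suggests; the class name RegionDirs is hypothetical.

```java
/** Helpers for locating a region's HDFS directory (assumes standard layout). */
final class RegionDirs {
    private RegionDirs() {}

    /** Extracts the encoded name from "table,startKey,timestamp.encoded." */
    static String encodedName(String regionName) {
        String trimmed = regionName.endsWith(".")
                ? regionName.substring(0, regionName.length() - 1)
                : regionName;
        int dot = trimmed.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("not a full region name: " + regionName);
        }
        return trimmed.substring(dot + 1);
    }

    /** Builds <root>/data/<namespace>/<table>/<encoded>; layout is an assumption. */
    static String regionDir(String hbaseRoot, String namespace, String table, String encoded) {
        return hbaseRoot + "/data/" + namespace + "/" + table + "/" + encoded;
    }

    public static void main(String[] args) {
        String name = "test:no_compact_shixq02_20190305,,1554773814183."
                + "a48ba410b7cd9dd926fd0379995c097e.";
        String dir = regionDir("/hbp_root/tenantA/hbase", "test",
                "no_compact_shixq02_20190305", encodedName(name));
        System.out.println("bin/hadoop fs -du -s -h " + dir);
    }
}
```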