Advanced Hive (11): Optimization (11) HQL Syntax Optimization (2) Multi-Table Optimization

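6 CBO Optimization

In a join, table order matters: the earlier tables are buffered in memory, while the last table is streamed (scanned from disk).

select a.*, b.*, c.* from a join b on a.id = b.id join c on a.id = c.id;

Starting with Hive 0.14.0, a "Cost Based Optimizer" (CBO) was added to optimize HQL execution plans. It is switched on with "hive.cbo.enable", and since Hive 1.1.0 it is enabled by default. It can automatically reorder multiple Joins in an HQL statement and choose a suitable Join algorithm.

CBO is a cost-based optimizer: the execution plan with the lowest cost is considered the best plan. In traditional databases, the cost-based optimizer relies on statistics to compute the optimal plan.

Hive's cost-based optimizer works the same way. Before submitting a query for final execution, Hive optimizes its logical and physical execution plans. This work is delegated to the underlying layer. Further optimization is then carried out based on query cost, which can lead to different decisions: how to order joins, which type of join to perform, the degree of parallelism, and so on.

To use cost-based optimization (CBO), set the following parameters at the start of the query: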

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
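
The optimizer can only weigh costs if table and column statistics are available. A minimal sketch of collecting them, assuming the bigtable table created later in this article already exists and holds data:

```sql
-- Collect table-level statistics (row count, data size) for bigtable
analyze table bigtable compute statistics;

-- Collect column-level statistics for bigtable
analyze table bigtable compute statistics for columns;

-- Inspect the collected statistics in the table parameters
describe formatted bigtable;
```

Whether statistics are also gathered automatically on insert depends on the cluster configuration, so running the analyze statements explicitly is the safer assumption.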

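7 Predicate Pushdown

Predicate pushdown evaluates the where predicates of a SQL statement as early as possible, reducing the amount of data processed downstream. The corresponding logical optimizer is PredicatePushDown, and the configuration property is hive.optimize.ppd, which defaults to true.

Hands-on example:

1) Enable the predicate pushdown property

hive (default)> set hive.optimize.ppd = true; -- predicate pushdown, default is true

2) View the execution plan for joining the two tables first and then filtering with a where condition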

hive (default)> explain select o.id from bigtable b join bigtable o on
o.id = b.id where o.id <= 10;

3) View the execution plan for filtering in a subquery first and then joining

hive (default)> explain select b.id from bigtable b
join (select id from bigtable where id <= 10) o on b.id = o.id;

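8 Small Table JOIN Large Table (MapJoin)

Put the small table whose keys are relatively scattered on the left side of the join. A map join can then load the small dimension table into memory first and complete the join on the map side.

In practice, testing shows that newer versions of Hive have optimized both small-table JOIN large-table and large-table JOIN small-table, so it no longer matters whether the small table is on the left or the right.

MapJoin distributes the smaller of the two joined tables directly into the memory of each Map process and performs the Join there, so no Reduce step is needed, which speeds up the query. If MapJoin is not specified, or its conditions are not met, the Hive parser converts the Join into a Common Join, meaning the Join is completed in the Reduce phase, where data skew easily occurs. MapJoin avoids Reducer processing by loading the whole small table into memory and joining on the Map side.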

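1) MapJoin parameter settings

(1) Enable automatic MapJoin selection

set hive.auto.convert.join=true; -- default is true

(2) Threshold between large and small tables (by default, tables under about 25 MB are treated as small):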

set hive.mapjoin.smalltable.filesize=25000000;

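2) How MapJoin works

MapJoin distributes the smaller of the two joined tables directly into the memory of each Map process. The Join runs inside the Map process, so no Reduce step is needed and the query runs faster.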
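
One way to check whether a join was actually converted is to read its plan. The following is a sketch only, assuming the bigtable and smalltable tables from this article exist and that smalltable is below the small-table threshold:

```sql
set hive.auto.convert.join = true;

-- Inner join between the large table and the small table
explain
select b.id
from bigtable b
join smalltable s
on s.id = b.id;

-- If the join was converted, the plan typically contains a
-- "Map Join Operator" instead of a reduce-side "Join Operator".
```

The exact operator names in the plan can vary with the Hive version and execution engine.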

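3) Hands-on example:

(1) Enable MapJoin

set hive.auto.convert.join = true; -- default is true

(2) Run the small-table JOIN large-table statement

Note: here the small table is the driving (left) table of a left join, so all of its rows must be emitted. The query therefore goes through reduce, and MapJoin does not apply.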

Explain insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable b
on s.id = b.id;

Time taken: 24.594 seconds

(3) Run the large-table JOIN small-table statement

Explain insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;

4) Statements for creating the large table, the small table, and the table that holds the JOIN result

-- Create the large table
create table bigtable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string) row format delimited
fields terminated by '\t';

-- Create the small table
create table smalltable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string) row format delimited
fields terminated by '\t';

-- Create the table that holds the join result
create table jointable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string) row format delimited
fields terminated by '\t';

5) Load data into the large table and the small table

hive (default)> load data local inpath '/opt/module/data/bigtable' into
table bigtable;
hive (default)> load data local inpath '/opt/module/data/smalltable' into
table smalltable;

6) Small-table JOIN large-table statement

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable b
on b.id = s.id;

7) Large-table JOIN small-table statement

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;

9 Large Table JOIN Large Table (Key Topic)

9.1 SMB Join

SMB Join: Sort Merge Bucket Join

1) Create a second large table

create table bigtable2(
 id bigint,
 t bigint,
 uid string,
 keyword string,
 url_rank int,
 click_num int,
 click_url string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable2;

2) Test a direct JOIN of the two large tables

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable a
join bigtable2 b
on a.id = b.id;

Test result: Time taken: 72.289 seconds

3) Create bucketed table 1

create table bigtable_buck1(
 id bigint,
 t bigint,
 uid string,
 keyword string,
 url_rank int,
 click_num int,
 click_url string)
clustered by(id)
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/data/bigtable' into table 
bigtable_buck1;

4) Create bucketed table 2, whose bucket count must be a multiple of the first table's bucket count

create table bigtable_buck2(
 id bigint,
 t bigint,
 uid string,
 keyword string,
 url_rank int,
 click_num int,
 click_url string)
clustered by(id)
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/data/bigtable' into table 
bigtable_buck2;

5) Set the parameters

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

6) Test (Time taken: 34.685 seconds)

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;
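
An SMB join only helps if the rows really are distributed and sorted by the bucketing column. In some Hive versions, load data simply moves the file into the table directory without re-bucketing it. As a hedged alternative, the sketch below populates the bucketed tables through insert ... select, assuming bigtable and bigtable2 have already been loaded as shown above:

```sql
-- Hive versions before 2.0 only (assumption: later versions always
-- enforce bucketing and sorting on insert, and no longer have these properties):
-- set hive.enforce.bucketing = true;
-- set hive.enforce.sorting = true;

-- Rewrite the rows so that each one lands in the bucket chosen by hash(id),
-- sorted by id within the bucket
insert overwrite table bigtable_buck1 select * from bigtable;
insert overwrite table bigtable_buck2 select * from bigtable2;
```

After populating the tables this way, each table directory should contain 6 bucket files, matching the "into 6 buckets" clause.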

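9.2 Filtering Null Keys

Sometimes a join times out because certain keys have too many rows. Rows with the same key are all sent to the same reducer, which then runs out of memory. In that case, analyze the abnormal keys carefully. In many cases the rows for these keys are bad data and should be filtered out in the SQL statement, for example when the key column is null:

Hands-on example

(1) Configure the history server

Configure mapred-site.xml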

<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
</property>
<property>
 <name>mapreduce.jobhistory.webapp.address</name>
 <value>hadoop102:19888</value>
</property>

Start the history server

sbin/mr-jobhistory-daemon.sh start historyserver

View the jobhistory

http://hadoop102:19888/jobhistory

(2) Create the table for the raw data with null ids

-- Create the null-id table
create table nullidtable(id bigint, t bigint, uid string, keyword string,
url_rank int, click_num int, click_url string) row format delimited
fields terminated by '\t';

(3) Load the raw data and the null-id data into their respective tables

hive (default)> load data local inpath '/opt/module/data/nullid' into
table nullidtable;

(4) Test without filtering null ids

hive (default)> insert overwrite table jointable select n.* from
nullidtable n left join bigtable o on n.id = o.id;

(5) Test with null ids filtered out

hive (default)> insert overwrite table jointable select n.* from (select
* from nullidtable where id is not null) n left join bigtable o on n.id =
o.id;
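
Before deciding to filter, it helps to measure how skewed the keys actually are. A small diagnostic sketch, assuming the nullidtable table from this section has been loaded:

```sql
-- How many rows have a null join key?
select count(*) as null_id_rows
from nullidtable
where id is null;

-- Which keys carry the most rows? (top 10; a null id shows up as its own group)
select id, count(*) as cnt
from nullidtable
group by id
order by cnt desc
limit 10;
```

If a single key, null or otherwise, dominates the counts, that key is the likely cause of the overloaded reducer.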

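9.3 Converting Null Keys

Sometimes a null key accounts for many rows, yet those rows are not bad data and must appear in the join result. In that case, assign a random value to the null key column in table a so that the rows are spread randomly and evenly across different reducers. For example:

Hands-on example:

Without randomly distributing the null values:

(1) Set the number of reducers to 5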

set mapreduce.job.reduces = 5;

(2) JOIN the two tables

insert overwrite table jointable
select n.* from nullidtable n left join bigtable b on n.id = b.id;

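Result: as shown in the figure below, data skew appears; some reducers consume far more resources than others.

With the null values randomly distributed:

(1) Set the number of reducers to 5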

set mapreduce.job.reduces = 5;

(2) JOIN the two tables

insert overwrite table jointable
select n.* from nullidtable n full join bigtable o on 
nvl(n.id,rand()) = o.id;

Result: as shown in the figure below, the data skew is eliminated and resource consumption is balanced across the reducers.
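
A commonly seen variant of the same idea replaces each null key with a random string that can never equal a real id, and keeps the left join of the non-random test so the two runs are directly comparable. This is a sketch, not the author's statement: the 'null_' prefix is an arbitrary, hypothetical choice, and both sides are cast to string so that the comparison is done on strings.

```sql
insert overwrite table jointable
select n.*
from nullidtable n
left join bigtable o
on (case when n.id is null
         -- random placeholder key, spreads null rows across reducers
         then concat('null_', cast(rand() as string))
         else cast(n.id as string)
    end) = cast(o.id as string);
```

Because the placeholder keys never match a row in bigtable, the null-id rows still appear in the result with no match, exactly as in a plain left join.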

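10 Cartesian Product

When a Join has no on condition, or an invalid one, Hive cannot find a Join key and can only use 1 Reducer to compute the Cartesian product. When Hive is in strict mode (hive.mapred.mode=strict, as opposed to nonstrict), Cartesian products are not allowed in HQL statements.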
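
The effect of strict mode can be sketched as follows, assuming the smalltable table from section 8 exists. On newer Hive versions the same check is, as far as I know, governed by hive.strict.checks.cartesian.product rather than hive.mapred.mode, so treat the property used here as version-dependent.

```sql
set hive.mapred.mode = strict;

-- In strict mode a join without a join key is expected to be rejected:
-- select a.id, b.id from smalltable a join smalltable b;

set hive.mapred.mode = nonstrict;

-- In nonstrict mode the Cartesian product runs; keep the inputs small
select a.id, b.id
from smalltable a
cross join smalltable b
limit 10;
```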

This article is from cnblogs (博客园), author: 秋华. When reposting, please cite the original link: https://www.cnblogs.com/qiu-hua/p/15143811.html