Hive Summary

1. With hive.mapred.mode=strict, a query on a partitioned table must include a partition condition in its where clause.
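A minimal sketch of the difference, assuming the data_example table used later in these notes is partitioned by year, month and day (the partition layout is an assumption for illustration):

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode: no partition column appears in the where clause.
-- select * from data_example;

-- Accepted: the where clause restricts the partition columns.
select *
from data_example
where year = '2020' and month = '10' and day = '01';
```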

2. Creating a table with Hive's built-in storage format
-- Note: orc.compress only takes effect for ORC tables; it has no effect on a textfile table.
create table parameter_example
( id string,
name string,
age int
) partitioned by (year string,month string)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile
tblproperties('orc.compress'='snappy');

# Using an external SerDe
create table parameter_example
( id string,
name string,
age int
) partitioned by (year string,month string)
row format serde 'com.linkedin.haivvreo.AvroSerDe'
with serdeproperties ('schema.url'='http://schema_provider/kst.avsc')
stored as
inputformat 'com.linkedin.haivvreo.AvroContainerInputFormat'
outputformat 'com.linkedin.haivvreo.AvroContainerOutputFormat';

3. Modifying partitions
alter table partition_example add if not exists
partition(year='2020',month='09',day='30') location '/test/hdp/hive/external/parameter_example/2020/09/30'
partition(year='2020',month='10',day='01') location '/test/hdp/hive/external/parameter_example/2020/10/01';

alter table parameter_example partition(year='2020',month='10',day='01') set fileformat textfile;

4. Importing and exporting data
# Import
load data local inpath '/usr/hdp/data/aa.csv'
overwrite into table data_example
partition (year='2020',month='10',day='01');

from data_source se
insert overwrite table data_example partition(year='2020',month='10',day='01')
select * where se.year='2020' and se.month='10' and se.day='01'
insert overwrite table data_example partition(year='2020',month='10',day='02')
select * where se.year='2020' and se.month='10' and se.day='02';

# Export
hdfs dfs -cp source_path target_path
insert overwrite local directory 'target_path' select a,b,c from data_example;

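5. Dynamic partitions
hive.exec.dynamic.partition: set to true to enable dynamic partitioning; the default is false (true from Hive 0.9.0 on)
hive.exec.dynamic.partition.mode: set to nonstrict to allow all partition columns to be dynamic; the default is strict
hive.exec.max.dynamic.partitions.pernode: maximum number of partitions each mapper or reducer may create; the default is 100
hive.exec.max.dynamic.partitions: maximum number of dynamic partitions one statement may create; the default is 1000
hive.exec.max.created.files: maximum number of files that may be created globally; the default is 100000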
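As a sketch of how these settings are used together, the following dynamic-partition insert replaces the two static inserts from item 4. It assumes that data_source has the columns a, b, c, year, month and day, which the notes do not state:

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition columns must come last in the select list,
-- in the same order as in the partition clause.
insert overwrite table data_example partition (year, month, day)
select se.a, se.b, se.c, se.year, se.month, se.day
from data_source se;
```

Hive creates one partition per distinct (year, month, day) combination in the result, subject to the limits listed above.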

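6. Hive does not support non-equi joins (inequality conditions in the on clause), for example:
from tab a join tabs b on a.year<=b.year
Hive also does not support the OR predicate in the on clause.

Hive does not support a select subquery inside in, for example:
select s.* from stocks s where s.ymd,s.symbol in (select d.ymd,d.symbol from dividends d)
Use left semi join to work around this. Columns of the right-hand table may only be referenced in the on clause, which is similar to how exists works.
Note: later Hive releases relax some of these limits (in/exists subqueries in the where clause are supported from Hive 0.13).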
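The in query above can be rewritten with left semi join roughly as follows. This is a sketch using the same stocks and dividends tables:

```sql
-- Returns each stocks row that has at least one matching dividends row.
-- Columns of d may appear only in the on clause, not in select or where.
select s.*
from stocks s
left semi join dividends d
  on s.ymd = d.ymd and s.symbol = d.symbol;
```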

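7. How Hive splits SQL into jobs
Example sql: select a.year,b.month,c.day,a.name,b.age,c.id from stocks a join stocks b on a.year=b.year join stocks c on a.id=c.id
In most cases Hive starts one MR job for each join, processing the tables from left to right.
For this sql, Hive starts one MR job to join tables a and b, then a second MR job to join the output of the first job with table c.

When the join predicates of all the tables use the same join key, Hive optimizes the query and joins the three tables in a single MR job.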
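For comparison, a variant of the query in which both joins use the same key (a.id). This is an illustrative rewrite, not an equivalent query; running explain on each version should show the difference in the number of stages:

```sql
-- Both on clauses join on a.id, so Hive can plan a single MR job.
explain
select a.year, b.month, c.day, a.name, b.age, c.id
from stocks a
join stocks b on a.id = b.id
join stocks c on a.id = c.id;
```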

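8. Hive optimization
1) As item 7 shows, joins are processed from left to right, so order the tables so that their size increases from left to right (largest table last).
Alternatively, mark the table to stream with a hint: select /*+STREAMTABLE(s)*/ s.ymd,s.symbol,d.dividend from stocks s join dividends d on s.ymd=d.ymd and s.symbol=d.symbol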
2) Map-side join
Set hive.auto.convert.join=true
hive.mapjoin.smalltable.filesize=250000 (the value is in bytes; Hive's default is 25000000)
3) SQL execution plans
explain and explain extended
4) Enabling local mode
hive.exec.mode.local.auto=true: small data sets are processed on the local machine instead of the cluster
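5) Parallel execution
Hive converts a query into one or more stages; hive.exec.parallel=true lets stages that do not depend on each other run in parallel.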
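6) Strict mode
hive.mapred.mode=strict blocks three kinds of queries:
(1) On a partitioned table, a query whose where clause does not filter on a partition column is not allowed
(2) A statement that uses order by must also use limit
(3) Cartesian product queries are restricted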
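7) Number of mappers and reducers
Hive determines the number of reducers from the size of the input data; use dfs -count to measure the input size

hive.exec.reducers.bytes.per.reducer: the default is 1 GB (256 MB from Hive 0.14 on)
Set mapred.reduce.tasks to fix the number of reducers explicitly

For large jobs, hive.exec.reducers.max is an important setting
Rule of thumb: total number of reduce slots in the cluster * 1.5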
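8) JVM reuse: for large jobs, enabling JVM reuse lets a JVM instance be used N times within the same job
Configure in mapred-site.xml
mapred.job.reuse.jvm.num.tasks=10
The drawback is that the JVM keeps holding the task slots it has used until the job finishes.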
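9) Speculative execution, configured in mapred-site.xml
For a slow task, speculative execution launches a backup copy of the task; whichever copy finishes first wins and the other is killed
mapred.map.tasks.speculative.execution=true
mapred.reduce.tasks.speculative.execution=true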

10) To combine multiple group by operations in a query into a single MapReduce job, enable this optimization
hive.multigroupby.singlemr=true (the default is false)
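Most of the switches in this section can also be set per session before a query runs. A sketch follows; the numeric values are illustrative assumptions, not recommendations:

```sql
-- Convert joins against small tables into map-side joins.
set hive.auto.convert.join=true;
-- Run independent stages of one query in parallel.
set hive.exec.parallel=true;
-- Let Hive run small queries in local mode.
set hive.exec.mode.local.auto=true;
-- Target input size per reducer (256 MB here) and an upper bound on reducers.
set hive.exec.reducers.bytes.per.reducer=268435456;
set hive.exec.reducers.max=100;

-- Inspect the resulting plan before running the query itself.
explain
select s.ymd, s.symbol, d.dividend
from stocks s
join dividends d on s.ymd = d.ymd and s.symbol = d.symbol;
```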

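9. sort by is a local sort: the data is sorted within each reducer
If hive.mapred.mode=strict, see item 6) of section 8 (order by requires limit).

distribute by sends rows with the same key to the same reducer; it controls how data flows from map to reduce and must be written before sort by
cluster by combines distribute by and sort by on the same columns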
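A sketch of the three clauses on the stocks table from the earlier items:

```sql
-- All rows of one symbol go to the same reducer,
-- and each reducer sorts its rows by symbol and then by date.
select s.ymd, s.symbol
from stocks s
distribute by s.symbol
sort by s.symbol asc, s.ymd asc;

-- Shorthand for "distribute by s.symbol sort by s.symbol".
select s.ymd, s.symbol
from stocks s
cluster by s.symbol;
```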

10. Type conversion with cast, for example cast(b as string)
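A small sketch on the parameter_example table from item 2. The column aliases are hypothetical:

```sql
-- age is an int column and id is a string column in parameter_example.
-- When a string cannot be parsed as a number, cast returns NULL instead of failing.
select cast(age as string) as age_text,
       cast(id as int) as id_num
from parameter_example;
```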

11. Data sampling
Block sampling: samples by HDFS data block
Parameter hive.sample.seednumber
select * from tba tablesample(0.1 percent) s;
Sampling with the rand function
select * from tba tablesample(bucket 3 out of 10 on rand()) s;

12. Schema design
Partitioning, bucketing, and combining partitioning with bucketing
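A sketch of a table that combines both techniques. The table name user_events_example, the bucketing column and the bucket count are hypothetical choices for illustration:

```sql
-- Partition directories narrow a query to year/month;
-- within each partition, rows are hashed on id into 16 bucket files.
create table user_events_example (
  id string,
  name string,
  age int
)
partitioned by (year string, month string)
clustered by (id) into 16 buckets
stored as orc;
```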

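13. Dynamic partition parameter settings
hive.exec.max.dynamic.partitions=100000 sets the maximum total number of dynamic partitions
hive.exec.max.dynamic.partitions.pernode=100000 sets the maximum number of dynamic partitions per node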

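14. Hive virtual columns
1) INPUT__FILE__NAME: name of the input file being processed
2) BLOCK__OFFSET__INSIDE__FILE: block offset within the file
3) ROW__OFFSET__INSIDE__BLOCK: row offset, available only with the setting below

hive.exec.rowoffset=true
select INPUT__FILE__NAME,BLOCK__OFFSET__INSIDE__FILE,line from test where line like '%hive%' limit 2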

15. Enabling compression of intermediate data
hive.exec.compress.intermediate=true
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec

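16. Hive ORC transactional tables: Hive tables created through Ambari are transactional by default. Enabling LLAP in Ambari improves query speed.
Spark can be integrated with Hive.

insert generates directories whose names begin with delta_
update generates directories whose names begin with delete_ (delete_delta_)

delta and delete files are merged by two kinds of compaction: minor compaction and major compaction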
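Compaction normally runs automatically in the background, but it can also be requested by hand. A sketch, assuming a transactional table named acid_example partitioned by year and month (both hypothetical):

```sql
-- Minor compaction merges the delta files into larger delta files.
alter table acid_example partition (year='2020', month='10') compact 'minor';

-- Major compaction rewrites the base and delta files into a new base.
alter table acid_example partition (year='2020', month='10') compact 'major';

-- Lists queued, running and finished compaction requests.
show compactions;
```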