Sorting and Clustering in Hive

//The five clauses must appear in this strict order:
where → group by → having → order by → limit

//Difference between where and having:
//where filters first and then groups (it filters the raw rows), so where cannot reference aggregate functions
hive> select count(*),age from tea where id>18 group by age;

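//having groups first and then filters (it filters each group); the condition is written against aggregates or grouping columns, here the alias c from the select list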
hive> select age,count(*) c from tea group by age having c>2;
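
To make the difference concrete, here is a minimal Python model (not Hive code) of the two queries above. The sample rows and the function names where_then_group and group_then_having are made up for illustration only.

```python
from collections import Counter

# Hypothetical sample of the tea table: (id, age) pairs.
tea = [(10, 20), (19, 20), (20, 20), (21, 30), (22, 30), (30, 20)]

def where_then_group(rows):
    """Model of: select count(*), age from tea where id>18 group by age"""
    kept = [age for (id_, age) in rows if id_ > 18]  # WHERE: filter raw rows first
    return dict(Counter(kept))                       # GROUP BY age + count(*)

def group_then_having(rows):
    """Model of: select age, count(*) c from tea group by age having c>2"""
    counts = Counter(age for (_, age) in rows)       # GROUP BY age + count(*)
    # HAVING: drop whole groups whose count is not above 2.
    return {age: c for age, c in counts.items() if c > 2}

print(where_then_group(tea))   # counts per age, computed on the filtered rows
print(group_then_having(tea))  # only the groups that pass the count test
```

With this sample, where removes one row before any counting happens, while having removes the whole age=30 group after counting.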

//a column that is not in group by must not appear in select either (unless it is inside an aggregate function)
hive> select ip,sum(load) as c from logs  group by ip sort by c desc limit 5;

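//distinct returns unique values (here it deduplicates on the (age, id) combination, so a row is dropped only when both age and id repeat)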
hive> select distinct age,id from tea;
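
A small Python sketch of that behaviour, assuming a few made-up (age, id) rows. The function distinct is only a model; Hive itself does not promise any particular output order.

```python
# Hypothetical (age, id) rows from the tea table.
rows = [(20, 1), (20, 2), (20, 1), (30, 1)]

def distinct(rows):
    """Keep the first occurrence of each full (age, id) tuple."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:   # compare the whole tuple, not single columns
            seen.add(row)
            out.append(row)
    return out

print(distinct(rows))
```

(20, 1) and (20, 2) share the same age but both survive, because only the exact repeat of (20, 1) counts as a duplicate.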

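//Hive versions before 1.2.0 support only Union All; plain Union (Union Distinct) was added in 1.2.0
//Union All in Hive is a bit stricter than in standard sql: every select must return the same number of columns with matching column names, but the column types need not be identical (presumably because compatible types are converted implicitly)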
select name,age from tea where id<80
union all
select name,age from stu where age>18;

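Order By characteristics:

Sorts the data globally. Only one reducer task does the work, so it is slow on large inputs.
The difference from order by in mysql: in strict mode a limit must be specified, otherwise the query fails.

• Use set hive.mapred.mode; to show the current mode
• Use set hive.mapred.mode=strict; to set the mode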

hive> select * from logs where date='2015-01-02' order by te;
FAILED: SemanticException 1:52 In strict mode,
 if ORDER BY is specified, LIMIT must also be specified. 
Error encountered near token 'te'

For a partitioned table, strict mode also requires the query to filter explicitly on a partition column:

hive> select * from logs order by te limit 5;                
FAILED: SemanticException [Error 10041]: 
No partition predicate found for Alias "logs" Table "logs"

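Sort By characteristics:

Can run with multiple reduce tasks. By default Hive estimates the number from the input size (it is not determined by the DISTRIBUTE BY columns), or it can be set by hand: set mapred.reduce.tasks=4;
Data inside each reduce task is sorted, but there is no global order.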

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs                         
    sort by te;

The query above saves its result to the local directory /root/hive/b, which ends up with 2 result files: 000000_0 and 000001_0. Each file is sorted by the te column.
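
The contrast between sort by and order by can be modelled in a few lines of Python. This is only a sketch: the te values are invented, and round-robin spreading is an assumption standing in for however Hive actually assigns rows to reducers when no distribute by is given.

```python
# Hypothetical te values of the logs rows.
te_values = [7, 3, 9, 1, 8, 2]

def sort_by(values, num_reducers):
    """Model of SORT BY: rows are spread over reducers, each sorts its own share."""
    # Assumption: round-robin spreading, not Hive's real assignment.
    reducers = [values[i::num_reducers] for i in range(num_reducers)]
    return [sorted(part) for part in reducers]

def order_by(values):
    """Model of ORDER BY: a single reducer sees and sorts everything."""
    return sorted(values)

files = sort_by(te_values, 2)   # one list per result file
print(files)
print(order_by(te_values))
```

Each inner list plays the role of one result file: sorted on its own, but reading the files one after another does not give a globally sorted sequence.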

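Distribute By characteristics:

Splits the rows by the given columns across the reducers, and therefore across the reduce output files
distribute by plays the role of the partitioner in MR; by default it is hash based
distribute by is usually combined with sort by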

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs
    distribute by date
    sort by te;
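
A rough Python model of what the query above does, assuming made-up (date, te) rows. zlib.crc32 is used here only as a stand-in hash; Hive uses its own hash function, so the real assignment of dates to files will differ.

```python
import zlib

# Hypothetical (date, te) rows from the logs table.
logs = [("2015-01-02", 5), ("2015-01-03", 1), ("2015-01-02", 2),
        ("2015-01-03", 4), ("2015-01-04", 3)]

def distribute_sort(rows, num_reducers):
    """Model of: distribute by date sort by te."""
    reducers = [[] for _ in range(num_reducers)]
    for date, te in rows:
        # Assumption: crc32 stands in for Hive's own hash function.
        idx = zlib.crc32(date.encode()) % num_reducers
        reducers[idx].append((date, te))
    # Each reducer then sorts only its own rows by te.
    return [sorted(part, key=lambda r: r[1]) for part in reducers]

for i, part in enumerate(distribute_sort(logs, 2)):
    print(i, part)
```

The point to notice is that all rows with the same date always land in the same reducer, and inside each reducer the rows are ordered by te.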

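Cluster By characteristics:

If Sort By and Distribute By use exactly the same columns, the two can be shortened to Cluster By, which names the columns for both at once.
Note that columns given to cluster by are always sorted in ascending order; asc and desc cannot be specified. Typically used with bucketed tables.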

set mapred.reduce.tasks = 2;
insert overwrite local directory '/root/hive/b'
    select * from logs
    cluster by date;
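
As a sanity check of the shorthand, the Python sketch below models cluster by date as hash partitioning on date followed by an ascending sort on the same column. The date values are invented and crc32 is again only a stand-in for Hive's hash.

```python
import zlib

# Hypothetical date column values from the logs table.
dates = ["2015-01-03", "2015-01-02", "2015-01-04", "2015-01-02", "2015-01-03"]

def partition(values, num_reducers):
    """Send each value to a reducer chosen by hashing it (DISTRIBUTE BY)."""
    reducers = [[] for _ in range(num_reducers)]
    for v in values:
        # Assumption: crc32 stands in for Hive's own hash function.
        reducers[zlib.crc32(v.encode()) % num_reducers].append(v)
    return reducers

def distribute_then_sort(values, num_reducers):
    """Model of: distribute by date sort by date."""
    return [sorted(part) for part in partition(values, num_reducers)]

def cluster_by(values, num_reducers):
    """Model of: cluster by date (same partitioning, ascending sort only)."""
    return [sorted(part, reverse=False) for part in partition(values, num_reducers)]

print(cluster_by(dates, 2) == distribute_then_sort(dates, 2))
```

In this model the two forms give identical per-reducer output, which is all the shorthand claims.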