Cube
hive (hdata)> select * from test;
test.f1 test.f2 test.f3 test.cnt
A A B 1
B B A 1
A A A 2
hive (hdata)> SELECT f1,
> f2,
> f3,
> sum(cnt)
> FROM test
> GROUP BY f1,
> f2,
> f3;
f1 f2 f3 _c3
A A A 2
A A B 1
B B A 1
hive (hdata)> SELECT f1,
> f2,
> f3,
> sum(cnt),
> GROUPING__ID,
> rpad(reverse(bin(cast(GROUPING__ID AS bigint))),3,'0')
> FROM test
> GROUP BY f1,
> f2,
> f3 WITH CUBE;
f1 f2 f3 _c3 grouping__id _c5
NULL NULL NULL 4 7 000
NULL NULL A 3 6 001
NULL NULL B 1 6 001
NULL A NULL 3 5 010
NULL A A 2 4 011
NULL A B 1 4 011
NULL B NULL 1 5 010
NULL B A 1 4 011
A NULL NULL 3 3 100
A NULL A 2 2 101
A NULL B 1 2 101
A A NULL 3 1 110
A A A 2 0 111
A A B 1 0 111
B NULL NULL 1 3 100
B NULL A 1 2 101
B B NULL 1 1 110
B B A 1 0 111
cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,他会统计所选列中值的所有组合的聚合。
GROUPING SETS
GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来,下面是几个实例可以帮助我们了解,
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id))
等价于
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
select device_id,os_id,app_id,count(user_id) from test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))
等价于
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by
device_id
Rollup
可以实现从右到左递减多级的统计,显示统计某一层次结构的聚合。
SELECT
device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by
device_id,os_id
UNION ALL
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by
device_id
等价于
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
Grouping_ID函数
当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup