Hive之cube和roolup

 

Cube

hive (hdata)> select * from test;

 

test.f1     test.f2     test.f3     test.cnt

A            A            B            1

B            B            A            1

A            A            A            2

 

hive (hdata)> SELECT f1,

            >        f2,

            >        f3,

            >        sum(cnt)

            > FROM test

            > GROUP BY f1,

            >          f2,

            >          f3;

f1    f2    f3    _c3

A     A     A     2

A     A     B     1

B     B     A     1

 

hive (hdata)> SELECT f1,

            >        f2,

            >        f3,

            >        sum(cnt),

            >        GROUPING__ID,

            >        rpad(reverse(bin(cast(GROUPING__ID AS bigint))),3,'0')

            > FROM test

            > GROUP BY f1,

            >          f2,

            >          f3 WITH CUBE;

 

f1           f2           f3           _c3         grouping__id        _c5

NULL      NULL      NULL      4            7                          000

NULL      NULL      A            3            6                          001

NULL      NULL      B            1            6                          001

NULL      A            NULL      3            5                          010

NULL      A            A            2            4                          011

NULL      A            B            1            4                          011

NULL      B            NULL      1            5                          010

NULL      B            A            1            4                          011

A            NULL      NULL      3            3                          100

A            NULL      A            2            2                          101

A            NULL      B            1            2                          101

A            A            NULL      3            1                          110

A            A            A            2            0                          111

A            A            B            1            0                          111

B            NULL      NULL      1            3                          100

B            NULL      A            1            2                          101

B            B            NULL      1            1                          110

B            B            A            1            0                          111

 

cube简称数据魔方,可以实现hive多个任意维度的查询,cube(a,b,c)则首先会对(a,b,c)进行group by,然后依次是(a,b),(a,c),(a),(b,c),(b),(c),最后在对全表进行group by,他会统计所选列中值的所有组合的聚合。

 

GROUPING SETS

GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起来,下面是几个实例可以帮助我们了解,

select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id))

等价于

SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id

 

select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id))

等价于

SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id 
UNION ALL 
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id

 

Rollup

可以实现从右到左递减多级的统计,显示统计某一层次结构的聚合。

SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id 
UNION ALL 
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id

等价于

select device_id,os_id,app_id,client_version,from_id,count(user_id) 

from test_xinyan_reg 

group by device_id,os_id,app_id,client_version,from_id 

grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

 

Grouping_ID函数

当我们没有统计某一列时,它的值显示为null,这可能与列本身就有null值冲突,这就需要一种方法区分是没有统计还是值本来就是null。(写一个排列组合的算法,就马上理解了,grouping_id其实就是所统计各列二进制和)

 

https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup

原文地址:https://www.cnblogs.com/BetterThanEver_Victor/p/9364081.html