hive:(group by, having;order by)的使用;group by+多个字段,以及wiki说的group by两种使用限制验证

hive> select * from app_data_stats_historical where os='1' group by dt limit 100;
出现结果如下:
2014-01-01
2014-01-06
......
2014-02-07
2014-02-10
2014-02-14
2014-02-17
2014-02-24
(只返回了一列日期。说明* 不起作用。不过这样,可以查看总共哪些日期,有效,存在数据)
加上having命令,having只作用于group by中的字段,非group的字段不行:
select * from app_data_stats_historical where os='1' group by dt having dt>' limit 1002014-02-01' limit 100;
2014-02-03
2014-02-04
2014-02-07
2014-02-10
2014-02-14
2014-02-17
2014-02-24
(可以用日期进行比较。)
 
hive> select * from app_data_stats_historical where os='1' order by dt desc limit 100;
字段是全部返回的。降序。
 

group by 多个字段:

2周内分联盟(5是多盟)分平台(分平台指的是分ios和android),分时段的曝光及点击。 select substr(createtime,12,2)hour,logtype,os_id,count(distinct logtype) from wizad_mdm_raw_hdfs where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and adn=5 group by substr(createtime,12,2), logtype, os_id; 注意,时段hour提取函数substr和substring 是通用的!   或者分步做:将group的字段按取值加到where中去(os_id取值1是android,2是iOS): select substr(createtime,12,2)hour,logtype,count(*) from wizad_mdm_raw_hdfs where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and os_id='2' and adn=5 group by substr(createtime,12,2), logtype; 以及: select substr(createtime,12,2),logtype,count(*) from wizad_mdm_raw_hdfs where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and os_id='1' and adn=5 group by substr(createtime,12,2), logtype;   substr(createtime,12,2)提取小时字段,起hour别名,group by 不支持。 group 里面不能起别名hour,直接用或引用都不行。 (1)Group by定义别名hour,报错,解析错误 selectsubstr(createtime,12,2),logtype,count(*) from wizad_mdm_raw_hdfs where day >= '2014-12-01' group by substr(createtime,12,2) hour,logtype;   (2)引用定义的hour别名也不行: Line 4:9 Invalid table alias or column reference 'hour' select substr(createtime,12,2)hour,logtype,count(*) from wizad_mdm_raw_hdfs where day >= '2014-12-01' group by hour, logtype;       hive wiki上说group by有两种使用情况限制 (1)group by只有一列,则distinct只能作用一列(可以多次作用同一列)  INSERT OVERWRITE TABLE pv_gender_sum  SELECT pv_users.gender, count (DISTINCT pv_users.userid)  FROM pv_users  GROUP BY pv_users.gender;    像这样 SELECTpv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCTpv_users.userid)  只distinct一列,但出现多次是可以的。 下面的查询错误,不允许DISTINCT多个列  INSERT OVERWRITE TABLE pv_gender_agg  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCTpv_users.ip)  FROM pv_users  GROUP BY pv_users.gender; 但我测试发现是可以的。有大神可以给我解释么?   如 group by统计去重distinct个数 select substr(createtime,12,2)hour,logtype,os_id,count(distinct logtype), count(distinct os_id) from wizad_mdm_raw_hdfs where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and adn=5 group by substr(createtime,12,2), logtype, os_id;     或者一些测试脚本都说明了,我们 select logtype, count(distinct os_id),count(distinctip),count(distinct id) from wizad_mdm_raw_hdfs where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01' group by logtype; 结果: 1      2       8493314 77579300 2      2       267685  211119   select substr(createtime,12,2) hour, count(distinctid),count(distinct ip) from wizad_mdm_raw_hdfs where ad_id in('19057','19058','812b4ba287f5ee0bc9d43bbf5bbe87fb') and day >= '2014-12-01'and adn=5 group by substr(createtime,12,2);                                                       部分结果如下: 00     1598136 154053 04     989745  51201 …… 18     1711493 201436 21     3644241 374243     (2)另一个wiki限制说明:group by后,除了作用列和聚合函数统计项,多余列不能存在。 这个我验证过确实是不行的。 SELECT a,sum(b) FROM t1 GROUP BY a; 是正确的   下面是错误的。 SELECT a,b FROM t1 GROUP BY a; 因为有多余列b,其不在group by的字段属性,(且不是聚合函数). 查询结果会是这样 a   b ------ 100 1 100 2 100 3 gourp by a后,b不能成为集合{1,2,3}返回,你可以count,但不能直接返回b。b是多值的。hive摒弃了这种猜测无效的SQL(HQL,要准确):有一列在select子句中,却不包含在GROUPBY子句中。 pig是可以构成集合返回的。    


原文地址:https://www.cnblogs.com/cl1024cl/p/6205474.html