Hive学习小记-(7)group by原理&tips

原理

参考:https://blog.csdn.net/u013668852/article/details/79866931

group by多字段

select rank, isonline, count(*) from city group by rank, isonline;

将GroupBy的字段组合为map的输出key值,利用MapReduce的排序,在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下(当然这里只是说明Reduce端的非Hash聚合过程)

image

group by单字段

group by 单字段和多字段时的不同在于key上,以如下例子为例(出处太多):
SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;

hive> SELECT * FROM logs;
a   苹果  5
a   橙子  3
a   苹果  2
b   烧鸡  1

hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
a   10
b   1

其过程如下图:
image

默认设置了hive.map.aggr=true,所以会在mapper端先group by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的mode是complete.

image

TIPS

减少group by字段

措施一:先group by ;再关联

参考:https://blog.51cto.com/imysqldba/1364808

措施二:若是其中有些字段是固定取值,可以限定死

参考:https://www.cnblogs.com/hoojjack/p/7460574.html

措施三:本身在其他字段取了group by 就已经能确保取值唯一的字段,如张三项目组的职责一定是洗碗,李四组的职责一定是刷盘子,可以给字段打上聚合函数max,min等:select proj_group, duty, sum(salary) from test_table group by proj_group, duty  可以改写为:select proj_group, max(duty), sum(salary) from test_table group by proj_group

hive与MySQL group by的区别

1.hive执行顺序group 在前,select在后;MySQL select在前,group 在后

参考:https://blog.csdn.net/m0_48714980/article/details/109136659

eg:

--在year(date)起别名
-- 1.在Mysql中顺利执行
select count(*),year(date) a from student group by a ;

-- 2.但在hive中出现报错
select count(*),year(date) a from student group by a ;
Invalid table alias or column reference ‘a’: (possible column names are: s_id, s_name, s_birth, s_sex

-- 原因分析:select执行顺序中,group by 在select前先执行,所以此时别名在解析器中没法识别

2.hive要将非聚合字段都加到group中;MySQL没加到group的字段可能会随机取:

eg:

mysql> select * from youhua1;
+--------+--------------+----------------+----------------+
| custid | jijin_change | baoxian_change | cunkuan_change |
+--------+--------------+----------------+----------------+
|      1 |          1.1 |            1.2 |            1.3 |
|      2 |         2.67 |           2.34 |            2.1 |
|      3 |          3.1 |            3.4 |           3.78 |
|      3 |          1.1 |            2.3 |            7.8 |
+--------+--------------+----------------+----------------+
4 rows in set (0.00 sec)

mysql> select custid,jijin_change,sum(baoxian_change) from youhua1 group by custid;;
+--------+--------------+---------------------+
| custid | jijin_change | sum(baoxian_change) |
+--------+--------------+---------------------+
|      1 |          1.1 |  1.2000000476837158 |
|      2 |         2.67 |  2.3399999141693115 |
|      3 |          3.1 |   5.700000047683716 |
+--------+--------------+---------------------+
3 rows in set (0.00 sec)
原文地址:https://www.cnblogs.com/foolangirl/p/14208537.html