Hive 窗口函数

表数据如下

select * from business;

business.name	business.orderdate	business.cost
jack	2017-01-01	10
tony	2017-01-02	15
jack	2017-02-03	23
tony	2017-01-04	29
jack	2017-01-05	46
jack	2017-04-06	42
tony	2017-01-07	50
jack	2017-01-08	55
mart	2017-04-08	62
mart	2017-04-09	68
neil	2017-05-10	12
mart	2017-04-11	75
neil	2017-06-12	80
mart	2017-04-13	94

  

一、聚合函数开窗

  1.全局范围

  查询顾客姓名及总人数(多次购买只算一人)

select name,count(*) over() from business group by name;

name	count_window_0
jack	4
mart	4
neil	4
tony	4

  2.排序后的范围

  对所有人的消费明细,将 cost 按照日期进行累加

select *,sum(cost) over(order by orderdate) from business;

business.name	business.orderdate	business.cost	sum_window_0
jack	2017-01-01	10	10
tony	2017-01-02	15	25
tony	2017-01-04	29	54
jack	2017-01-05	46	100
tony	2017-01-07	50	150
jack	2017-01-08	55	205
jack	2017-02-03	23	228
jack	2017-04-06	42	270
mart	2017-04-08	62	332
mart	2017-04-09	68	400
mart	2017-04-11	75	475
mart	2017-04-13	94	569
neil	2017-05-10	12	581
neil	2017-06-12	80	661

  3.分区加排序后的范围。分区排序搭配:distribute by + sort by或者partition by + order by。

  将不同人的消费明细,按日期累加

select *,sum(cost) over(distribute by name sort by orderdate) from business;

business.name	business.orderdate	business.cost	sum_window_0
jack	2017-01-01	10	10
jack	2017-01-05	46	56
jack	2017-01-08	55	111
jack	2017-02-03	23	134
jack	2017-04-06	42	176
mart	2017-04-08	62	62
mart	2017-04-09	68	130
mart	2017-04-11	75	205
mart	2017-04-13	94	299
neil	2017-05-10	12	12
neil	2017-06-12	80	92
tony	2017-01-02	15	15
tony	2017-01-04	29	44
tony	2017-01-07	50	94

  4.其他范围

  ①开始到当前行

over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )

  ②当前行到最后

over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING)

  ③前一行、当前行和下一行

over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING)

  

二、其他函数开窗

  1.lag()函数

  查询每个顾客的购买明细及上一次购买时间

select *,lag(orderdate,1,'1970-01-01') over(distribute by name sort by orderdate) from business;

business.name	business.orderdate	business.cost	lag_window_0
jack	2017-01-01	10	1970-01-01
jack	2017-01-05	46	2017-01-01
jack	2017-01-08	55	2017-01-05
jack	2017-02-03	23	2017-01-08
jack	2017-04-06	42	2017-02-03
mart	2017-04-08	62	1970-01-01
mart	2017-04-09	68	2017-04-08
mart	2017-04-11	75	2017-04-09
mart	2017-04-13	94	2017-04-11
neil	2017-05-10	12	1970-01-01
neil	2017-06-12	80	2017-05-10
tony	2017-01-02	15	1970-01-01
tony	2017-01-04	29	2017-01-02
tony	2017-01-07	50	2017-01-04

  2.lead()函数

  查询每个顾客的购买明细及下一次购买时间

select *,lead(orderdate,1,'9999-99-99') over(distribute by name sort by orderdate) from business;

jack	2017-01-01	10	2017-01-05
jack	2017-01-05	46	2017-01-08
jack	2017-01-08	55	2017-02-03
jack	2017-02-03	23	2017-04-06
jack	2017-04-06	42	9999-99-99
mart	2017-04-08	62	2017-04-09
mart	2017-04-09	68	2017-04-11
mart	2017-04-11	75	2017-04-13
mart	2017-04-13	94	9999-99-99
neil	2017-05-10	12	2017-06-12
neil	2017-06-12	80	9999-99-99
tony	2017-01-02	15	2017-01-04
tony	2017-01-04	29	2017-01-07
tony	2017-01-07	50	9999-99-99

  3.ntile()函数

把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n必须为int类型。即把数据分成n份,n为编号。

  查询前20%时间的订单信息

select * from (select *,ntile(5) over(order by orderdate) nt from business) t1 where t1.nt = 1;

t1.name	t1.orderdate	t1.cost	t1.nt
jack	2017-01-01	10	1
tony	2017-01-02	15	1
tony	2017-01-04	29	1

总结:所谓开窗,就是把数据先查出来,然后增加一个窗口列,对已经查出的数据进行再加工。

原文地址:https://www.cnblogs.com/noyouth/p/12733030.html