hive窗口函数

1.相关函数说明

OVER()指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化

CURRENT ROW:当前行

n PRECEDING:往前n行数据

n FOLLOWING:往后n行数据

UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点

LAG(col,n):往前第n行数据

LEAD(col,n):往后第n行数据

NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n必须为int类型。

2.数据准备:name,orderdate,cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

3.需求

(1)查询在20174月份购买过的顾客及总人数

(2)查询顾客的购买明细及月购买总额

(3)上述的场景,要将cost按照日期进行累加

(4)查询顾客上次的购买时间

(5)查询前20%时间的订单信息

4.创建数据库并将文件的数据导入

create table business(
    > name string,
    > orderdate string,
    > cost int)
    > row format delimited fields terminated by ',';//列分割符

load data local inpath "/home/hadoop/file/business" into table business;

5.编码实现及结果

(1)查询在20174月份购买过的顾客及总人数

select name,count(*) over () 
    > 
    > from business 
    > 
    > where substring(orderdate,1,7) = '2017-04' 
    > 
    > group by name;

 

(2)查询顾客的购买明细及月购买总额

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;

 

(3)上述的场景,要将cost按照日期进行累加

select name,orderdate,cost, 
sum(cost) over() as sample1,--所有行相加 
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加 
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加 
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 
from business;

(4)查询顾客上次的购买时间

 select name,orderdate,cost,
    > lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate) as time1,
    > lag(orderdate,2) over (partition by name order by orderdate) as time2
    > from business;

 

 (5)查询前20%时间的订单信息

 select * from(
    > select name,orderdate,cost,ntile(5) over (order by orderdate) sorted from business)t
    > where sorted=1;

 

注:

lag 和lead 可以 获取结果集中,按一定排序所排列的当前行的上下相邻若干offset 的某个行的某个列(不用结果集的自关联);
lag ,lead 分别是向前,向后;
lag 和lead 有三个参数,第一个参数是列名,第二个参数是偏移的offset,第三个参数是 超出记录窗口时的默认值)

原文地址:https://www.cnblogs.com/837634902why/p/11468773.html