【Hive】HiveQL实战之分析函数&窗口函数(补充)

本篇承接【Hive】HiveQL实战之分析函数&窗口函数并对其进行补充。

一 语法结构

分析函数的使用常和Over、Partition By、Order By 和窗口说明一起,具体语法如下:
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
[<window_clause>])

1 Function (arg1,..., argn)

函数可以是以下类型:
  • 标准聚合函数:Count()、Sum()、Min()、MAX()或AVG();
  • Rank;
  • Dense_Rank;
  • Row_number;
  • Cume_dist;
  • Percent_Rank;
  • NTile;
  • Lead;
  • Lag;
  • First_Value;
  • Last_Value;
2 PARTITION BY <...>

对数据行进行分组,和Group By类似。

3 ORDER BY <....>

对数据进行排序,默认为升序。

4 <window_clause>

窗口说明进一步对结果进行分区并应用分析函数,有两种类型的窗口,分别为Row类型和Range类型。

1)Row类型窗口定义

ROWS BETWEEN <start_expr> AND <end_expr>

2)窗口说明

<start_expr>只能是以下之一:
  • Unbounded preceding;
  • Current row;
  • N Preceding 或 Following;

<end_expr>只能是以下之一:
  • Unbounded following;
  • Current row;
  • N preceding 或Following;

3)图形展示


二 实战示例

1 数据准备
0: jdbc:hive2://localhost:10000/hive> create table ana(name string , depart int,salary int)
. . . . . . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . . . . . > fields terminated by '|';
No rows affected (3.863 seconds)
0: jdbc:hive2://localhost:10000/hive> load data local inpath '/home/hadoop/tt' overwrite into table ana;
No rows affected (1.044 seconds)
0: jdbc:hive2://localhost:10000/hive> select *from ana;
+-----------+-------------+-------------+
| ana.name  | ana.depart  | ana.salary  |
+-----------+-------------+-------------+
| Lucy      | 1000        | 5500        |
| Michael   | 1000        | 5000        |
| Steven    | 1000        | 6400        |
| Will      | 1000        | 4000        |
| Will      | 1000        | 4000        |
| Jess      | 1001        | 6000        |
| Lily      | 1001        | 5000        |
| Mike      | 1001        | 6400        |
| Richard   | 1002        | 8000        |
| Wei       | 1002        | 7000        |
| Yun       | 1002        | 5500        |
+-----------+-------------+-------------+
11 rows selected (0.377 seconds)
2 示例一
0: jdbc:hive2://localhost:10000/hive> select row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > name,depart,salary,count(*)over(partition by depart) as row_cnt,
. . . . . . . . . . . . . . . . . . > sum(salary)over(partition by depart order by depart) as deptTotal,
. . . . . . . . . . . . . . . . . . > sum(salary)over(order by depart) as runningTotal,
. . . . . . . . . . . . . . . . . . > sum(salary)over(order by depart rows unbounded preceding) as runningTotal2,
. . . . . . . . . . . . . . . . . . > sum(salary)over(order by depart rows between unbounded preceding and current row) as runningTotal3
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart,name;
+------+----------+---------+---------+----------+------------+---------------+----------------+----------------+
| rnk  |   name   | depart  | salary  | row_cnt  | depttotal  | runningtotal  | runningtotal2  | runningtotal3  |
+------+----------+---------+---------+----------+------------+---------------+----------------+----------------+
| 1    | Lucy     | 1000    | 5500    | 5        | 24900      | 24900         | 5500           | 5500           |
| 2    | Michael  | 1000    | 5000    | 5        | 24900      | 24900         | 10500          | 10500          |
| 3    | Steven   | 1000    | 6400    | 5        | 24900      | 24900         | 16900          | 16900          |
| 5    | Will     | 1000    | 4000    | 5        | 24900      | 24900         | 24900          | 24900          |
| 4    | Will     | 1000    | 4000    | 5        | 24900      | 24900         | 20900          | 20900          |
| 6    | Jess     | 1001    | 6000    | 3        | 17400      | 42300         | 30900          | 30900          |
| 7    | Lily     | 1001    | 5000    | 3        | 17400      | 42300         | 35900          | 35900          |
| 8    | Mike     | 1001    | 6400    | 3        | 17400      | 42300         | 42300          | 42300          |
| 9    | Richard  | 1002    | 8000    | 3        | 20500      | 62800         | 50300          | 50300          |
| 10   | Wei      | 1002    | 7000    | 3        | 20500      | 62800         | 57300          | 57300          |
| 11   | Yun      | 1002    | 5500    | 3        | 20500      | 62800         | 62800          | 62800          |
+------+----------+---------+---------+----------+------------+---------------+----------------+----------------+
11 rows selected (241.63 seconds)
3 示例二
0: jdbc:hive2://localhost:10000/hive> select row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > name,depart,salary,
. . . . . . . . . . . . . . . . . . > rank() over(partition by depart order by salary) as rank,
. . . . . . . . . . . . . . . . . . > dense_rank()over(partition by depart order by salary) as dense_rank,
. . . . . . . . . . . . . . . . . . > row_number()over() as row_num,
. . . . . . . . . . . . . . . . . . > round(cume_dist()over(partition by depart order by salary),1) as cume_dist,
. . . . . . . . . . . . . . . . . . > percent_rank()over(partition by depart order by salary) as percent_rank,
. . . . . . . . . . . . . . . . . . > ntile(4) over(partition by depart order by salary) as ntile
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart;
+------+----------+---------+---------+-------+-------------+----------+------------+---------------+--------+
| rnk  |   name   | depart  | salary  | rank  | dense_rank  | row_num  | cume_dist  | percent_rank  | ntile  |
+------+----------+---------+---------+-------+-------------+----------+------------+---------------+--------+
| 5    | Will     | 1000    | 4000    | 1     | 1           | 11       | 0.4        | 0.0           | 1      |
| 4    | Will     | 1000    | 4000    | 1     | 1           | 10       | 0.4        | 0.0           | 1      |
| 2    | Michael  | 1000    | 5000    | 3     | 2           | 9        | 0.6        | 0.5           | 2      |
| 1    | Lucy     | 1000    | 5500    | 4     | 3           | 8        | 0.8        | 0.75          | 3      |
| 3    | Steven   | 1000    | 6400    | 5     | 4           | 7        | 1.0        | 1.0           | 4      |
| 7    | Lily     | 1001    | 5000    | 1     | 1           | 6        | 0.3        | 0.0           | 1      |
| 6    | Jess     | 1001    | 6000    | 2     | 2           | 5        | 0.7        | 0.5           | 2      |
| 8    | Mike     | 1001    | 6400    | 3     | 3           | 4        | 1.0        | 1.0           | 3      |
| 11   | Yun      | 1002    | 5500    | 1     | 1           | 3        | 0.3        | 0.0           | 1      |
| 10   | Wei      | 1002    | 7000    | 2     | 2           | 2        | 0.7        | 0.5           | 2      |
| 9    | Richard  | 1002    | 8000    | 3     | 3           | 1        | 1.0        | 1.0           | 3      |
+------+----------+---------+---------+-------+-------------+----------+------------+---------------+--------+
11 rows selected (229.274 seconds)
4 示例三
0: jdbc:hive2://localhost:10000/hive> select name,depart,salary,
. . . . . . . . . . . . . . . . . . > row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > lead(salary,2)over(partition by depart order by salary) as lead,
. . . . . . . . . . . . . . . . . . > lag(salary,2)over(partition by depart order by salary) as lag,
. . . . . . . . . . . . . . . . . . > first_value(salary)over(partition by depart order by salary) as first_value,
. . . . . . . . . . . . . . . . . . > last_value(salary)over(partition by depart order by salary) as last_value_default,
. . . . . . . . . . . . . . . . . . > last_value(salary)over(partition by depart order by salary rows between unbounded preceding and unbounded following) as last_value
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart;
+----------+---------+---------+------+-------+-------+--------------+---------------------+-------------+
|   name   | depart  | salary  | rnk  | lead  |  lag  | first_value  | last_value_default  | last_value  |
+----------+---------+---------+------+-------+-------+--------------+---------------------+-------------+
| Steven   | 1000    | 6400    | 3    | NULL  | 5000  | 4000         | 6400                | 6400        |
| Lucy     | 1000    | 5500    | 1    | NULL  | 4000  | 4000         | 5500                | 6400        |
| Michael  | 1000    | 5000    | 2    | 6400  | 4000  | 4000         | 5000                | 6400        |
| Will     | 1000    | 4000    | 4    | 5500  | NULL  | 4000         | 4000                | 6400        |
| Will     | 1000    | 4000    | 5    | 5000  | NULL  | 4000         | 4000                | 6400        |
| Mike     | 1001    | 6400    | 8    | NULL  | 5000  | 5000         | 6400                | 6400        |
| Jess     | 1001    | 6000    | 6    | NULL  | NULL  | 5000         | 6000                | 6400        |
| Lily     | 1001    | 5000    | 7    | 6400  | NULL  | 5000         | 5000                | 6400        |
| Richard  | 1002    | 8000    | 9    | NULL  | 5500  | 5500         | 8000                | 8000        |
| Wei      | 1002    | 7000    | 10   | NULL  | NULL  | 5500         | 7000                | 8000        |
| Yun      | 1002    | 5500    | 11   | 8000  | NULL  | 5500         | 5500                | 8000        |
+----------+---------+---------+------+-------+-------+--------------+---------------------+-------------+
11 rows selected (172.483 seconds)
5 示例四
0: jdbc:hive2://localhost:10000/hive> select name,depart,salary,
. . . . . . . . . . . . . . . . . . > row_number()over(order by depart,name) rnk,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 2 preceding and current row) win1,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 2 preceding and unbounded following) win2,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 1 preceding and 2 following) win3,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between 1 following and 2 following) win5,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between current row and current row) win6,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between current row and 1 following) win7,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between current row and unbounded following) win8,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between unbounded preceding and current row) win9,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between unbounded preceding and 1 following) win10,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows between unbounded preceding and unbounded following) win11,
. . . . . . . . . . . . . . . . . . > max(salary)over(partition by depart order by name rows 2 preceding) win12
. . . . . . . . . . . . . . . . . . > from ana
. . . . . . . . . . . . . . . . . . > order by depart,name;
+----------+---------+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+
|   name   | depart  | salary  | rnk  | win1  | win2  | win3  | win5  | win6  | win7  | win8  | win9  | win10  | win11  | win12  |
+----------+---------+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+
| Lucy     | 1000    | 5500    | 1    | 5500  | 6400  | 6400  | 6400  | 5500  | 5500  | 6400  | 5500  | 5500   | 6400   | 5500   |
| Michael  | 1000    | 5000    | 2    | 5500  | 6400  | 6400  | 6400  | 5000  | 6400  | 6400  | 5500  | 6400   | 6400   | 5500   |
| Steven   | 1000    | 6400    | 3    | 6400  | 6400  | 6400  | 4000  | 6400  | 6400  | 6400  | 6400  | 6400   | 6400   | 6400   |
| Will     | 1000    | 4000    | 4    | 6400  | 6400  | 4000  | NULL  | 4000  | 4000  | 4000  | 6400  | 6400   | 6400   | 6400   |
| Will     | 1000    | 4000    | 5    | 6400  | 6400  | 6400  | 4000  | 4000  | 4000  | 4000  | 6400  | 6400   | 6400   | 6400   |
| Jess     | 1001    | 6000    | 6    | 6000  | 6400  | 6400  | 6400  | 6000  | 6000  | 6400  | 6000  | 6000   | 6400   | 6000   |
| Lily     | 1001    | 5000    | 7    | 6000  | 6400  | 6400  | 6400  | 5000  | 6400  | 6400  | 6000  | 6400   | 6400   | 6000   |
| Mike     | 1001    | 6400    | 8    | 6400  | 6400  | 6400  | NULL  | 6400  | 6400  | 6400  | 6400  | 6400   | 6400   | 6400   |
| Richard  | 1002    | 8000    | 9    | 8000  | 8000  | 8000  | 7000  | 8000  | 8000  | 8000  | 8000  | 8000   | 8000   | 8000   |
| Wei      | 1002    | 7000    | 10   | 8000  | 8000  | 8000  | 5500  | 7000  | 7000  | 7000  | 8000  | 8000   | 8000   | 8000   |
| Yun      | 1002    | 5500    | 11   | 8000  | 8000  | 7000  | NULL  | 5500  | 5500  | 5500  | 8000  | 8000   | 8000   | 8000   |
+----------+---------+---------+------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+--------+
11 rows selected (190.488 seconds)
6 示例五
0: jdbc:hive2://localhost:10000/hive> select name,depart,salary,                                                                                                      
. . . . . . . . . . . . . . . . . . > row_number()over(order by depart,name) rnk,                                               
. . . . . . . . . . . . . . . . . . > max(salary)over w1 as win1,                                                               
. . . . . . . . . . . . . . . . . . > max(salary)over w1 as win2,                                                               
. . . . . . . . . . . . . . . . . . > max(salary)over w3 as win3                                                                
. . . . . . . . . . . . . . . . . . > from ana 
. . . . . . . . . . . . . . . . . . > window 
. . . . . . . . . . . . . . . . . . > w1 as (partition by depart order by name rows between 2 preceding and current row),
. . . . . . . . . . . . . . . . . . > w3 as (partition by depart order by name rows between 1 preceding and 2 following)
. . . . . . . . . . . . . . . . . . > order by depart,name;
+----------+---------+---------+------+-------+-------+-------+
|   name   | depart  | salary  | rnk  | win1  | win2  | win3  |
+----------+---------+---------+------+-------+-------+-------+
| Lucy     | 1000    | 5500    | 1    | 5500  | 5500  | 6400  |
| Michael  | 1000    | 5000    | 2    | 5500  | 5500  | 6400  |
| Steven   | 1000    | 6400    | 3    | 6400  | 6400  | 6400  |
| Will     | 1000    | 4000    | 4    | 6400  | 6400  | 4000  |
| Will     | 1000    | 4000    | 5    | 6400  | 6400  | 6400  |
| Jess     | 1001    | 6000    | 6    | 6000  | 6000  | 6400  |
| Lily     | 1001    | 5000    | 7    | 6000  | 6000  | 6400  |
| Mike     | 1001    | 6400    | 8    | 6400  | 6400  | 6400  |
| Richard  | 1002    | 8000    | 9    | 8000  | 8000  | 8000  |
| Wei      | 1002    | 7000    | 10   | 8000  | 8000  | 8000  |
| Yun      | 1002    | 5500    | 11   | 8000  | 8000  | 7000  |
+----------+---------+---------+------+-------+-------+-------+
11 rows selected (189.893 seconds)



原文地址:https://www.cnblogs.com/alen-liu-sz/p/12975622.html