【Hive】HiveQL实战之操作符和函数

Hive中的操作符合函数，和关系型数据库的类似，本篇主要讲解Hive的一些函数。

一函数分类

Hive中的函数可以分为以下几种：

数学函数：主要用于数学运算，例如：Randy（）和E（）；
集合函数：主要用于查找Size、Keys和复杂类型的值，例如：Size（Array<T>）；
类型转换函数：主要是Cast和Binary，用于将一种类型转为另一种类型；
日期函数：用于执行与日期相关的操作，例如：Year（string date）和Month（string data）；
条件函数：主要使用返回值来检查特定的条件，例如：Coalesce、IF和CASE WHEN；
字符函数：主要用于字符相关的操作，例如：Upper（string A）和Trim（string A）；
聚合函数：主要用于聚合操作，例如：Sum（）、Count（*）；
表生成函数：主要用于将单个输入转换为多行输出，例如：Explode（Map）和Json_tuple（就送String ，k1,k2……）；
自定义函数：由Java代码创建，作为Hive的函数的扩展。

二函数查看

在Hive中，可以使用如下方法查看Hive支持的函数，以及函数的使用：

1 查看Hive中的函数

0: jdbc:hive2://localhost:10000/hive> show functions;

2 查看函数的使用

0: jdbc:hive2://localhost:10000/hive> desc function upper;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
+----------------------------------------------------+
1 row selected (0.254 seconds)
0: jdbc:hive2://localhost:10000/hive> desc function extended upper;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
| Synonyms: ucase                                    |
| Example:                                           |
|   > SELECT upper('Facebook') FROM src LIMIT 1;     |
|   'FACEBOOK'                                       |
| Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper |
| Function type:BUILTIN                              |
+----------------------------------------------------+
7 rows selected (0.22 seconds)

注：后者查看的内容更详细。

三函数示例

1 示例数据

0: jdbc:hive2://localhost:10000/hive> select *from employee;
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| employee.name  |   employee.work_place   |      employee.sex_age      | employee.skills_score  |         employee.depart_title          |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| Michael        | ["Montreal","Toronto"]  | {"sex":"Male","age":30}    | {"DB":80}              | {"Product":["Developer","Lead"]}       |
| Will           | ["Montreal"]            | {"sex":"Male","age":35}    | {"Perl":85}            | {"Product":["Lead"],"Test":["Lead"]}   |
| Shelley        | ["New York"]            | {"sex":"Female","age":27}  | {"Python":80}          | {"Test":["Lead"],"COE":["Architect"]}  |
| Lucy           | ["Vancouver"]           | {"sex":"Female","age":57}  | {"Sales":89,"HR":94}   | {"Sales":["Lead"]}                     |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
4 rows selected (0.353 seconds)

2 复杂数据类型

1）SIZE

该函数主要用于计算Map、Array或者嵌套Map/Array的大小，如果大小未知，则返回-1.

0: jdbc:hive2://localhost:10000/hive> select size(work_place) as array_size,
. . . . . . . . . . . . . . . . . . > size(skills_score) as map_size,
. . . . . . . . . . . . . . . . . . > size(depart_title) as complex_size,
. . . . . . . . . . . . . . . . . . > size(depart_title['Product']) as nest_size
. . . . . . . . . . . . . . . . . . > from employee;
+-------------+-----------+---------------+------------+
| array_size  | map_size  | complex_size  | nest_size  |
+-------------+-----------+---------------+------------+
| 2           | 1         | 1             | 2          |
| 1           | 1         | 2             | 1          |
| 1           | 1         | 2             | -1         |
| 1           | 2         | 1             | -1         |
+-------------+-----------+---------------+------------+
4 rows selected (0.38 seconds)

2）ARRAY_CONTAINS / SORT_ARRAY

该函数用于检查一个数组是否包含某一个值，如果包含，则返回True，否则返回FALSE，而SORT_ARRAY用于对数组进行升序排序。

0: jdbc:hive2://localhost:10000/hive> select name,array_contains(work_place,'Toronto') as is_Toronto,
. . . . . . . . . . . . . . . . . . > sort_array(work_place) as sort_array
. . . . . . . . . . . . . . . . . . > from employee;
+----------+-------------+-------------------------+
|   name   | is_toronto  |       sort_array        |
+----------+-------------+-------------------------+
| Michael  | true        | ["Montreal","Toronto"]  |
| Will     | false       | ["Montreal"]            |
| Shelley  | false       | ["New York"]            |
| Lucy     | false       | ["Vancouver"]           |
+----------+-------------+-------------------------+
4 rows selected (0.547 seconds)

3 日期函数

1）From_unixtime(unix_timestamp())

该函数类似于Oracle中的SYSDATE（）函数，用于动态返回当前时间。

0: jdbc:hive2://localhost:10000/hive> select from_unixtime(unix_timestamp()) as current_time
. . . . . . . . . . . . . . . . . . > from employee limit 1;
+----------------------+
|     current_time     |
+----------------------+
| 2018-07-09 11:04:31  |
+----------------------+
1 row selected (0.473 seconds)

2）unix_timestamp

该函数可用于比较两个日期，或者可以用于Order by后进行排序。

0: jdbc:hive2://localhost:10000/hive> select from_unixtime(unix_timestamp()),(unix_timestamp()-unix_timestamp('2018-07-09 7:00:00'))/60/60/24 as daydiff from employee limit 1;
+----------------------+----------------------+
|         _c0          |       daydiff        |
+----------------------+----------------------+
| 2018-07-09 11:12:06  | 0.17506944444444442  |
+----------------------+----------------------+
1 row selected (0.905 seconds)

3）To_date

该函数用于返回年月日格式的日期。

0: jdbc:hive2://localhost:10000/hive> select to_date(from_unixtime(unix_timestamp())) as day from employee limit 1;
+-------------+
|     day     |
+-------------+
| 2018-07-09  |
+-------------+
1 row selected (0.485 seconds)

4 CASE

0: jdbc:hive2://localhost:10000/hive> select 
. . . . . . . . . . . . . . . . . . > case when 1 is null then 'True' else 0 end as case_result 
. . . . . . . . . . . . . . . . . . > from employee limit 1;
+--------------+
| case_result  |
+--------------+
| 0            |
+--------------+
1 row selected (0.37 seconds)

5 解析器和搜索

1）数据准备

0: jdbc:hive2://localhost:10000/hive> insert into employee
. . . . . . . . . . . . . . . . . . > select 'Steven' as name,array(null) as work_place,
. . . . . . . . . . . . . . . . . . > named_struct('sex','Male','age',30) as sex_age,
. . . . . . . . . . . . . . . . . . > map('Python',90) as skills_score,
. . . . . . . . . . . . . . . . . . > map('R&D',array('Developer')) as depart_title
. . . . . . . . . . . . . . . . . . > from employee limit 1;
No rows affected (194.183 seconds)
0: jdbc:hive2://localhost:10000/hive> select *from employee;
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| employee.name  |   employee.work_place   |      employee.sex_age      | employee.skills_score  |         employee.depart_title          |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| Steven         | NULL                    | {"sex":"Male","age":30}    | {"Python":90}          | {"R&D":["Developer"]}                  |
| Michael        | ["Montreal","Toronto"]  | {"sex":"Male","age":30}    | {"DB":80}              | {"Product":["Developer","Lead"]}       |
| Will           | ["Montreal"]            | {"sex":"Male","age":35}    | {"Perl":85}            | {"Product":["Lead"],"Test":["Lead"]}   |
| Shelley        | ["New York"]            | {"sex":"Female","age":27}  | {"Python":80}          | {"Test":["Lead"],"COE":["Architect"]}  |
| Lucy           | ["Vancouver"]           | {"sex":"Female","age":57}  | {"Sales":89,"HR":94}   | {"Sales":["Lead"]}                     |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
5 rows selected (0.786 seconds)

2）Lateral View

该函数会忽视Expode返回NULL的行。

0: jdbc:hive2://localhost:10000/hive> select name,workplace,skills,score
. . . . . . . . . . . . . . . . . . > from employee
. . . . . . . . . . . . . . . . . . > lateral view explode(work_place) wp as workplace
. . . . . . . . . . . . . . . . . . > lateral view explode(skills_score) ss as skills,score;
+----------+------------+---------+--------+
|   name   | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
+----------+------------+---------+--------+
6 rows selected (0.459 seconds)

3）Outer Lateral View

该函数会可以Expode返回NULL的行。

0: jdbc:hive2://localhost:10000/hive> select name,workplace,skills,score                    
. . . . . . . . . . . . . . . . . . > from employee                                         
. . . . . . . . . . . . . . . . . . > lateral view outer explode(work_place) wp as workplace
. . . . . . . . . . . . . . . . . . > lateral view explode(skills_score) ss as skills,score;
+----------+------------+---------+--------+
|   name   | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Steven   | NULL       | Python  | 90     |
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
+----------+------------+---------+--------+
7 rows selected (0.366 seconds)

4）Split

该函数用于将字符串拆分为数组。

0: jdbc:hive2://localhost:10000/hive> select split('/user/hive/warehouse/hive.db/employee','/') from employee limit 1;
+----------------------------------------------------+
|                        _c0                         |
+----------------------------------------------------+
| ["","user","hive","warehouse","hive.db","employee"] |
+----------------------------------------------------+
1 row selected (0.303 seconds)

5）Reverse

该函数用于反转字符。

0: jdbc:hive2://localhost:10000/hive> select reverse(split(reverse('/user/hive/warehouse/hive.db/employee'),'/')[0]) from employee limit 1;
+-----------+
|    _c0    |
+-----------+
| employee  |
+-----------+
1 row selected (0.392 seconds)

6）Collect_set

该函数用于合并为一行，但会去掉重复的值。

0: jdbc:hive2://localhost:10000/hive> select collect_set(work_place[0]) from employee;
+--------------------------------------+
|                 _c0                  |
+--------------------------------------+
| ["Montreal","New York","Vancouver"]  |
+--------------------------------------+
1 row selected (140.198 seconds)

7）Collect_list

该函数和Collect_set类似，但不会去重。

0: jdbc:hive2://localhost:10000/hive> select collect_list(work_place[0]) from employee;
0: jdbc:hive2://localhost:10000/hive> select collect_list(work_place[0]) from employee;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+-------------------------------------------------+
|                       _c0                       |
+-------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"]  |
+-------------------------------------------------+
1 row selected (133.142 seconds)

更多内容，可参考Hive函数（Show Functions）。