【Hive】HiveQL实战之操作符和函数

Hive中的操作符合函数,和关系型数据库的类似,本篇主要讲解Hive的一些函数。

一 函数分类

Hive中的函数可以分为以下几种:
  • 数学函数:主要用于数学运算,例如:Randy()和E();
  • 集合函数:主要用于查找Size、Keys和复杂类型的值,例如:Size(Array<T>);
  • 类型转换函数:主要是Cast和Binary,用于将一种类型转为另一种类型;
  • 日期函数:用于执行与日期相关的操作,例如:Year(string date)和Month(string data);
  • 条件函数:主要使用返回值来检查特定的条件,例如:Coalesce、IF和CASE WHEN;
  • 字符函数:主要用于字符相关的操作,例如:Upper(string A)和Trim(string A);
  • 聚合函数:主要用于聚合操作,例如:Sum()、Count(*);
  • 表生成函数:主要用于将单个输入转换为多行输出,例如:Explode(Map)和Json_tuple(就送String ,k1,k2……);
  • 自定义函数:由Java代码创建,作为Hive的函数的扩展。
二 函数查看

在Hive中,可以使用如下方法查看Hive支持的函数,以及函数的使用:

1 查看Hive中的函数
0: jdbc:hive2://localhost:10000/hive> show functions;
2 查看函数的使用
0: jdbc:hive2://localhost:10000/hive> desc function upper;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
+----------------------------------------------------+
1 row selected (0.254 seconds)
0: jdbc:hive2://localhost:10000/hive> desc function extended upper;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
| Synonyms: ucase                                    |
| Example:                                           |
|   > SELECT upper('Facebook') FROM src LIMIT 1;     |
|   'FACEBOOK'                                       |
| Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper |
| Function type:BUILTIN                              |
+----------------------------------------------------+
7 rows selected (0.22 seconds)
注:后者查看的内容更详细。

三 函数示例

1 示例数据
0: jdbc:hive2://localhost:10000/hive> select *from employee;
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| employee.name  |   employee.work_place   |      employee.sex_age      | employee.skills_score  |         employee.depart_title          |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| Michael        | ["Montreal","Toronto"]  | {"sex":"Male","age":30}    | {"DB":80}              | {"Product":["Developer","Lead"]}       |
| Will           | ["Montreal"]            | {"sex":"Male","age":35}    | {"Perl":85}            | {"Product":["Lead"],"Test":["Lead"]}   |
| Shelley        | ["New York"]            | {"sex":"Female","age":27}  | {"Python":80}          | {"Test":["Lead"],"COE":["Architect"]}  |
| Lucy           | ["Vancouver"]           | {"sex":"Female","age":57}  | {"Sales":89,"HR":94}   | {"Sales":["Lead"]}                     |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
4 rows selected (0.353 seconds)
2 复杂数据类型

1)SIZE

该函数主要用于计算Map、Array或者嵌套Map/Array的大小,如果大小未知,则返回-1.
0: jdbc:hive2://localhost:10000/hive> select size(work_place) as array_size,
. . . . . . . . . . . . . . . . . . > size(skills_score) as map_size,
. . . . . . . . . . . . . . . . . . > size(depart_title) as complex_size,
. . . . . . . . . . . . . . . . . . > size(depart_title['Product']) as nest_size
. . . . . . . . . . . . . . . . . . > from employee;
+-------------+-----------+---------------+------------+
| array_size  | map_size  | complex_size  | nest_size  |
+-------------+-----------+---------------+------------+
| 2           | 1         | 1             | 2          |
| 1           | 1         | 2             | 1          |
| 1           | 1         | 2             | -1         |
| 1           | 2         | 1             | -1         |
+-------------+-----------+---------------+------------+
4 rows selected (0.38 seconds)
2)ARRAY_CONTAINS / SORT_ARRAY

该函数用于检查一个数组是否包含某一个值,如果包含,则返回True,否则返回FALSE,而SORT_ARRAY用于对数组进行升序排序。
0: jdbc:hive2://localhost:10000/hive> select name,array_contains(work_place,'Toronto') as is_Toronto,
. . . . . . . . . . . . . . . . . . > sort_array(work_place) as sort_array
. . . . . . . . . . . . . . . . . . > from employee;
+----------+-------------+-------------------------+
|   name   | is_toronto  |       sort_array        |
+----------+-------------+-------------------------+
| Michael  | true        | ["Montreal","Toronto"]  |
| Will     | false       | ["Montreal"]            |
| Shelley  | false       | ["New York"]            |
| Lucy     | false       | ["Vancouver"]           |
+----------+-------------+-------------------------+
4 rows selected (0.547 seconds)
3 日期函数

1)From_unixtime(unix_timestamp())

该函数类似于Oracle中的SYSDATE()函数,用于动态返回当前时间。
0: jdbc:hive2://localhost:10000/hive> select from_unixtime(unix_timestamp()) as current_time
. . . . . . . . . . . . . . . . . . > from employee limit 1;
+----------------------+
|     current_time     |
+----------------------+
| 2018-07-09 11:04:31  |
+----------------------+
1 row selected (0.473 seconds)
2)unix_timestamp

该函数可用于比较两个日期,或者可以用于Order by后进行排序。
0: jdbc:hive2://localhost:10000/hive> select from_unixtime(unix_timestamp()),(unix_timestamp()-unix_timestamp('2018-07-09 7:00:00'))/60/60/24 as daydiff from employee limit 1;
+----------------------+----------------------+
|         _c0          |       daydiff        |
+----------------------+----------------------+
| 2018-07-09 11:12:06  | 0.17506944444444442  |
+----------------------+----------------------+
1 row selected (0.905 seconds)
3)To_date

该函数用于返回年月日格式的日期。
0: jdbc:hive2://localhost:10000/hive> select to_date(from_unixtime(unix_timestamp())) as day from employee limit 1;
+-------------+
|     day     |
+-------------+
| 2018-07-09  |
+-------------+
1 row selected (0.485 seconds)
4 CASE
0: jdbc:hive2://localhost:10000/hive> select 
. . . . . . . . . . . . . . . . . . > case when 1 is null then 'True' else 0 end as case_result 
. . . . . . . . . . . . . . . . . . > from employee limit 1;
+--------------+
| case_result  |
+--------------+
| 0            |
+--------------+
1 row selected (0.37 seconds)
5 解析器和搜索

1)数据准备
0: jdbc:hive2://localhost:10000/hive> insert into employee
. . . . . . . . . . . . . . . . . . > select 'Steven' as name,array(null) as work_place,
. . . . . . . . . . . . . . . . . . > named_struct('sex','Male','age',30) as sex_age,
. . . . . . . . . . . . . . . . . . > map('Python',90) as skills_score,
. . . . . . . . . . . . . . . . . . > map('R&D',array('Developer')) as depart_title
. . . . . . . . . . . . . . . . . . > from employee limit 1;
No rows affected (194.183 seconds)
0: jdbc:hive2://localhost:10000/hive> select *from employee;
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| employee.name  |   employee.work_place   |      employee.sex_age      | employee.skills_score  |         employee.depart_title          |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
| Steven         | NULL                    | {"sex":"Male","age":30}    | {"Python":90}          | {"R&D":["Developer"]}                  |
| Michael        | ["Montreal","Toronto"]  | {"sex":"Male","age":30}    | {"DB":80}              | {"Product":["Developer","Lead"]}       |
| Will           | ["Montreal"]            | {"sex":"Male","age":35}    | {"Perl":85}            | {"Product":["Lead"],"Test":["Lead"]}   |
| Shelley        | ["New York"]            | {"sex":"Female","age":27}  | {"Python":80}          | {"Test":["Lead"],"COE":["Architect"]}  |
| Lucy           | ["Vancouver"]           | {"sex":"Female","age":57}  | {"Sales":89,"HR":94}   | {"Sales":["Lead"]}                     |
+----------------+-------------------------+----------------------------+------------------------+----------------------------------------+
5 rows selected (0.786 seconds)
2)Lateral View

该函数会忽视Expode返回NULL的行。
0: jdbc:hive2://localhost:10000/hive> select name,workplace,skills,score
. . . . . . . . . . . . . . . . . . > from employee
. . . . . . . . . . . . . . . . . . > lateral view explode(work_place) wp as workplace
. . . . . . . . . . . . . . . . . . > lateral view explode(skills_score) ss as skills,score;
+----------+------------+---------+--------+
|   name   | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
+----------+------------+---------+--------+
6 rows selected (0.459 seconds)
3)Outer Lateral View

该函数会可以Expode返回NULL的行。
0: jdbc:hive2://localhost:10000/hive> select name,workplace,skills,score                    
. . . . . . . . . . . . . . . . . . > from employee                                         
. . . . . . . . . . . . . . . . . . > lateral view outer explode(work_place) wp as workplace
. . . . . . . . . . . . . . . . . . > lateral view explode(skills_score) ss as skills,score;
+----------+------------+---------+--------+
|   name   | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Steven   | NULL       | Python  | 90     |
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
+----------+------------+---------+--------+
7 rows selected (0.366 seconds)
4)Split

该函数用于将字符串拆分为数组。
0: jdbc:hive2://localhost:10000/hive> select split('/user/hive/warehouse/hive.db/employee','/') from employee limit 1;
+----------------------------------------------------+
|                        _c0                         |
+----------------------------------------------------+
| ["","user","hive","warehouse","hive.db","employee"] |
+----------------------------------------------------+
1 row selected (0.303 seconds)
5)Reverse

该函数用于反转字符。
0: jdbc:hive2://localhost:10000/hive> select reverse(split(reverse('/user/hive/warehouse/hive.db/employee'),'/')[0]) from employee limit 1;
+-----------+
|    _c0    |
+-----------+
| employee  |
+-----------+
1 row selected (0.392 seconds)
6)Collect_set

该函数用于合并为一行,但会去掉重复的值。
0: jdbc:hive2://localhost:10000/hive> select collect_set(work_place[0]) from employee;
+--------------------------------------+
|                 _c0                  |
+--------------------------------------+
| ["Montreal","New York","Vancouver"]  |
+--------------------------------------+
1 row selected (140.198 seconds)
7)Collect_list

该函数和Collect_set类似,但不会去重。
0: jdbc:hive2://localhost:10000/hive> select collect_list(work_place[0]) from employee;
0: jdbc:hive2://localhost:10000/hive> select collect_list(work_place[0]) from employee;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
+-------------------------------------------------+
|                       _c0                       |
+-------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"]  |
+-------------------------------------------------+
1 row selected (133.142 seconds)
更多内容,可参考Hive函数(Show Functions)。

原文地址:https://www.cnblogs.com/alen-liu-sz/p/12975623.html