hive 总结二

本文参考：黑泽君相关博客
本文是我总结日常工作中遇到的坑，结合黑泽君相关博客，选取、补充了部分内容。

查询函数（Hive高级）

NVL(cloumn,replace_with)

如果cloumn为NULL，则NVL函数返回 replace_with 的值；
否则返回cloumn的值；
如果两个参数都为NULL，则返回NULL。

hive> select nvl(a,1) from (select null as a) as  aa;
1
Time taken: 0.147 seconds, Fetched: 1 row(s)

hive> select nvl(a,1) from (select 2 as a) as  aa;
2

case … when … then … else … end 函数

select 与 case结合使用最大的好处有两点：
一是在显示查询结果时可以灵活的组织格式；
二是有效避免了多次对同一个表或几个表的访问。

Case具有两种格式：
简单Case函数
Case搜索函数

简单Case函数
hive> CASE sex WHEN '1' THEN '男' WHEN '2' THEN '女' ELSE '其他' END

Case搜索函数 
hive> CASE WHEN sex = '1' THEN '男' WHEN sex = '2' THEN '女' ELSE '其他' END  
种方式，可以实现相同的功能，  
简单Case函数的写法相对比较简洁，但是和Case搜索函数相比，功能方面会有些限制。  
比如写判断式，或者对多个列判断简单Case函数都有点不方便。   
在Case函数中Else部分的默认值是NULL

行转列

CONCAT(string A/col, string B/col, …)
返回输入字符串连接后的结果，支持任意个输入字符串。

CONCAT_WS(separator, str1, str2,…)
它是一个特殊形式的CONCAT()。
第一个参数是剩余参数间的分隔符；
分隔符可以是与剩余参数一样的字符串；
如果分隔符是 NULL，返回值也将为NULL；
这个函数会跳过分隔符参数后的任何 NULL和空字符串； >分隔符将被加到被连接的字符串之间。

COLLECT_SET(col)
函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总，产生array类型字段
注意：CONCAT()和CONCAT_WS()都是UDTF函数
COLLECT_SET()函数类似聚合函数。

列转行

EXPLODE(col)
将hive一列中复杂的array或者map结构拆分成多行。
LATERAL VIEW

用法：LATERAL VIEW udtf(expression) tableAlias AS columnAlias
解释：lateral view用于和split,explode等UDTF函数一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。

hive> select movie explode(category) from  movie_info;
上面是错误的。假设能执行的话，得到的是笛卡尔积。

正确写法
hive> select  movie,  category_name from 
  movie_info  lateral view explode(category) table_tmp as category_name;   
lateral view对原表的字段进行了侧写，得到侧写表和侧写字段。

像split,explode等UDTF函数，是不能跟原表的字段直接进行查询的，UDTF函数一定要和lateral view联合在一块用。

窗口函数

OVER()：指定分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的变化而变化（比如在首行或者最后一行）。
CURRENT ROW：当前行。
n PRECEDING：往前n行数据。
n FOLLOWING：往后n行数据。
UNBOUNDED：起点，UNBOUNDED PRECEDING 表示从前面的起点， UNBOUNDED FOLLOWING 表示到后面的终点。
LAG(col,n)：往前第n行数据。
LEAD(col,n)：往后第n行数据。
NTILE(n)：把有序分区中的行分发到指定数据的组中，各个组有编号，编号从1开始，对于每一行，NTILE返回此行所属的组的编号。注意：n必须为int类型。

hive> desc  business;
OK
name                    string
orderdate               string
cost                    int

hive> select * from business;
OK
jack    2017-01-01      10
tony    2017-01-02      15
jack    2017-02-03      23
tony    2017-01-04      29
jack    2017-01-05      46
jack    2017-04-06      42
tony    2017-01-07      50
jack    2017-01-08      55
mart    2017-04-08      62
mart    2017-04-09      68
neil    2017-05-10      12
mart    2017-04-11      75
neil    2017-06-12      80
mart    2017-04-13      94


只有一个分区，所有行相加，得一个值
hive> select * ,sum(cost) over() as sample1 from business ;
OK
mart    2017-04-13      94      661
neil    2017-06-12      80      661
mart    2017-04-11      75      661
neil    2017-05-10      12      661
mart    2017-04-09      68      661
mart    2017-04-08      62      661
jack    2017-01-08      55      661
tony    2017-01-07      50      661
jack    2017-04-06      42      661
jack    2017-01-05      46      661
tony    2017-01-04      29      661
jack    2017-02-03      23      661
tony    2017-01-02      15      661
jack    2017-01-01      10      661

按orderdate排序，只有一个分区，区内数据累加
hive> select * ,sum(cost) over(order by orderdate)  from business ;
OK
jack    2017-01-01      10      10
tony    2017-01-02      15      25
tony    2017-01-04      29      54
jack    2017-01-05      46      100
tony    2017-01-07      50      150
jack    2017-01-08      55      205
jack    2017-02-03      23      228
jack    2017-04-06      42      270
mart    2017-04-08      62      332
mart    2017-04-09      68      400
mart    2017-04-11      75      475
mart    2017-04-13      94      569
neil    2017-05-10      12      581
neil    2017-06-12      80      661

等效上一个语句
hive> select * ,sum(cost) over(order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)  from business ;
OK
jack    2017-01-01      10      10
tony    2017-01-02      15      25
tony    2017-01-04      29      54
jack    2017-01-05      46      100
tony    2017-01-07      50      150
jack    2017-01-08      55      205
jack    2017-02-03      23      228
jack    2017-04-06      42      270
mart    2017-04-08      62      332
mart    2017-04-09      68      400
mart    2017-04-11      75      475
mart    2017-04-13      94      569
neil    2017-05-10      12      581
neil    2017-06-12      80      661

按name分区，按orderdate排序，有多个分区，区内数据各自累加
hive> select * ,sum(cost) over(partition by name order by orderdate)  from business ;
OK
jack    2017-01-01      10      10
jack    2017-01-05      46      56
jack    2017-01-08      55      111
jack    2017-02-03      23      134
jack    2017-04-06      42      176
mart    2017-04-08      62      62
mart    2017-04-09      68      130
mart    2017-04-11      75      205
mart    2017-04-13      94      299
neil    2017-05-10      12      12
neil    2017-06-12      80      92
tony    2017-01-02      15      15
tony    2017-01-04      29      44
tony    2017-01-07      50      94

同上
hive> select * ,sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)  from business ;

当前行和前一行累积
hive> select * ,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and CURRENT ROW) from business ;
OK
jack    2017-01-01      10      10
jack    2017-01-05      46      56
jack    2017-01-08      55      101
jack    2017-02-03      23      78
jack    2017-04-06      42      65
mart    2017-04-08      62      62
mart    2017-04-09      68      130
mart    2017-04-11      75      143
mart    2017-04-13      94      169
neil    2017-05-10      12      12
neil    2017-06-12      80      92
tony    2017-01-02      15      15
tony    2017-01-04      29      44
tony    2017-01-07      50      79

前一行、当前行、下一行做累积
hive> select * ,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) from business ;
OK
jack    2017-01-01      10      56
jack    2017-01-05      46      111
jack    2017-01-08      55      124
jack    2017-02-03      23      120
jack    2017-04-06      42      65
mart    2017-04-08      62      130
mart    2017-04-09      68      205
mart    2017-04-11      75      237
mart    2017-04-13      94      169
neil    2017-05-10      12      92
neil    2017-06-12      80      92
tony    2017-01-02      15      44
tony    2017-01-04      29      94
tony    2017-01-07      50      79

当前行和后续所有行做累积
hive> select * ,sum(cost) over(partition by name order by orderdate rows between CURRENT ROW and UNBOUNDED FOLLOWING) from business ;
OK
jack    2017-01-01      10      176
jack    2017-01-05      46      166
jack    2017-01-08      55      120
jack    2017-02-03      23      65
jack    2017-04-06      42      42
mart    2017-04-08      62      299
mart    2017-04-09      68      237
mart    2017-04-11      75      169
mart    2017-04-13      94      94
neil    2017-05-10      12      92
neil    2017-06-12      80      80
tony    2017-01-02      15      94
tony    2017-01-04      29      79
tony    2017-01-07      50      50


按月份分区
hive> select *,
    > sum(cost) over(distribute by month(orderdate)) 
    > from business;
OK
jack    2017-01-01      10      205
jack    2017-01-08      55      205
tony    2017-01-07      50      205
jack    2017-01-05      46      205
tony    2017-01-04      29      205
tony    2017-01-02      15      205
jack    2017-02-03      23      23
mart    2017-04-13      94      341
jack    2017-04-06      42      341
mart    2017-04-11      75      341
mart    2017-04-09      68      341
mart    2017-04-08      62      341
neil    2017-05-10      12      12
neil    2017-06-12      80      80

按月份分区（同上）
hive> select *,
    > sum(cost) over(partition by month(orderdate)) 
    > from business;
OK
jack    2017-01-01      10      205
jack    2017-01-08      55      205
tony    2017-01-07      50      205
jack    2017-01-05      46      205
tony    2017-01-04      29      205
tony    2017-01-02      15      205
jack    2017-02-03      23      23
mart    2017-04-13      94      341
jack    2017-04-06      42      341
mart    2017-04-11      75      341
mart    2017-04-09      68      341
mart    2017-04-08      62      341
neil    2017-05-10      12      12
neil    2017-06-12      80      80

按日期累加 从开始到现在
hive> select *,
    > sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
    > from business;
OK
jack    2017-01-01      10      10
tony    2017-01-02      15      25
tony    2017-01-04      29      54
jack    2017-01-05      46      100
tony    2017-01-07      50      150
jack    2017-01-08      55      205
jack    2017-02-03      23      228
jack    2017-04-06      42      270
mart    2017-04-08      62      332
mart    2017-04-09      68      400
mart    2017-04-11      75      475
mart    2017-04-13      94      569
neil    2017-05-10      12      581
neil    2017-06-12      80      661

按日期累加 某一天和前后一天 统计(当前行和前边一行及后面一行)
hive> select *,
    > sum(cost) over(sort by orderdate rows between  1 PRECEDING and 1 FOLLOWING)
    > from business;
OK
jack    2017-01-01      10      25
tony    2017-01-02      15      54
tony    2017-01-04      29      90
jack    2017-01-05      46      125
tony    2017-01-07      50      151
jack    2017-01-08      55      128
jack    2017-02-03      23      120
jack    2017-04-06      42      127
mart    2017-04-08      62      172
mart    2017-04-09      68      205
mart    2017-04-11      75      237
mart    2017-04-13      94      181
neil    2017-05-10      12      186
neil    2017-06-12      80      92

按用户 日累积
hive> select *,
    > sum(cost) over(distribute by name sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
    > from business;
OK
jack    2017-01-01      10      10
jack    2017-01-05      46      56
jack    2017-01-08      55      111
jack    2017-02-03      23      134
jack    2017-04-06      42      176
mart    2017-04-08      62      62
mart    2017-04-09      68      130
mart    2017-04-11      75      205
mart    2017-04-13      94      299
neil    2017-05-10      12      12
neil    2017-06-12      80      92
tony    2017-01-02      15      15
tony    2017-01-04      29      44
tony    2017-01-07      50      94

查询上一次，当前购买时间
hive> select *,
    > lag(orderdate, 1) over(distribute by name sort by orderdate) ,
    > lead(orderdate, 1) over(distribute by name sort by orderdate)
    > from business;
OK
jack    2017-01-01      10      NULL    2017-01-05
jack    2017-01-05      46      2017-01-01      2017-01-08
jack    2017-01-08      55      2017-01-05      2017-02-03
jack    2017-02-03      23      2017-01-08      2017-04-06
jack    2017-04-06      42      2017-02-03      NULL
mart    2017-04-08      62      NULL    2017-04-09
mart    2017-04-09      68      2017-04-08      2017-04-11
mart    2017-04-11      75      2017-04-09      2017-04-13
mart    2017-04-13      94      2017-04-11      NULL
neil    2017-05-10      12      NULL    2017-06-12
neil    2017-06-12      80      2017-05-10      NULL
tony    2017-01-02      15      NULL    2017-01-04
tony    2017-01-04      29      2017-01-02      2017-01-07
tony    2017-01-07      50      2017-01-04      NULL

查询前20%时间的订单信息
hive> select *
    > from (select *,
    >        ntile(5) over(order by orderdate) as gid
    >        from business) as t
    > where t.gid=1;
OK
jack    2017-01-01      10      1
tony    2017-01-02      15      1
tony    2017-01-04      29      1

注： ntile(n) 将有序分区中的行平均分发到指定数据的组中,每个组中记录数量为total/n，  
那么取前20%就是前五分之一，只要分为五个组，去第一个组即可
hive> select *,
    > ntile(5) over(sort by orderdate)
    > from business;
OK
jack    2017-01-01      10      1
tony    2017-01-02      15      1
tony    2017-01-04      29      1
jack    2017-01-05      46      2
tony    2017-01-07      50      2
jack    2017-01-08      55      2
jack    2017-02-03      23      3
jack    2017-04-06      42      3
mart    2017-04-08      62      3
mart    2017-04-09      68      4
mart    2017-04-11      75      4
mart    2017-04-13      94      4
neil    2017-05-10      12      5
neil    2017-06-12      80      5

rank函数

RANK()：排序相同时会重复，总数不会变。（两个100分为列第一名和第二名，99分的为第三名）
DENSE_RANK()：排序相同时会重复，总数会减少。（两个100分并列第一，99分的为第二名）
ROW_NUMBER()：会根据顺序计算。
注意：使用rank函数需要配合over函数（窗口函数）使用

准备数据
vim score.txt
李白,语文,100
李白,数学,90
李白,天文,85
屈原,语文,100
屈原,数学,85
屈原,天文,90
诸葛亮,语文,95
诸葛亮,数学,99
诸葛亮,天文,110
祖冲之,语文,90
祖冲之,数学,100
祖冲之,天文,90


hive> create table score(
    > name string,
    > subject string,
    > score int)
    > row format delimited fields terminated by ",";
OK
hive> load data local inpath '/root/tmp_lillcol/score.txt' into table score;
Loading data to table iptv.score
Table iptv.score stats: [numFiles=1, totalSize=226]
OK
Time taken: 0.325 seconds
hive> select * from score;
OK
李白    语文    100
李白    数学    90
李白    天文    85
屈原    语文    100
屈原    数学    85
屈原    天文    90
诸葛亮  语文    95
诸葛亮  数学    99
诸葛亮  天文    110
祖冲之  语文    90
祖冲之  数学    100
祖冲之  天文    90

根据学科排名，排序相同时会重复，总数不会变。
hive> select *,
    > rank() over(partition by subject order by score desc)
    > from score;
OK
诸葛亮  天文    110     1
祖冲之  天文    90      2
屈原    天文    90      2
李白    天文    85      4
祖冲之  数学    100     1
诸葛亮  数学    99      2
李白    数学    90      3
屈原    数学    85      4
屈原    语文    100     1
李白    语文    100     1
诸葛亮  语文    95      3
祖冲之  语文    90      4
Time taken: 19.103 seconds, Fetched: 12 row(s)

根据学科排名，排序相同时会重复，总数会减少。
hive> select *,
    > dense_rank()over(partition by subject order by score desc)
    > from score;
OK
诸葛亮  天文    110     1
祖冲之  天文    90      2
屈原    天文    90      2
李白    天文    85      3
祖冲之  数学    100     1
诸葛亮  数学    99      2
李白    数学    90      3
屈原    数学    85      4
屈原    语文    100     1
李白    语文    100     1
诸葛亮  语文    95      2
祖冲之  语文    90      3

根据学科排名，返回的是行号，分数相同排序不同
hive> select *,
    > row_number() over(partition by subject order by score desc)
    > from score;
OK
诸葛亮  天文    110     1
祖冲之  天文    90      2
屈原    天文    90      3
李白    天文    85      4
祖冲之  数学    100     1
诸葛亮  数学    99      2
李白    数学    90      3
屈原    数学    85      4
屈原    语文    100     1
李白    语文    100     2
诸葛亮  语文    95      3
祖冲之  语文    90      4

函数

查看系统自带的函数
hive> show functions;
OK
!
!=
%
&
*
+
-
/
<
<=
<=>
<>
=
==
>
>=
^
abs
acos
add_months
and
array
array_contains
ascii
asin
assert_true
atan
avg
base64
between
bin
case
cbrt
ceil
ceiling
coalesce
collect_list
collect_set
compute_stats
concat
concat_ws
context_ngrams
conv
corr
cos
count
covar_pop
covar_samp
crc32
create_union
cume_dist
current_database
current_date
current_timestamp
current_user
date_add
date_format
date_sub
datediff
day
dayofmonth
dayofweek
decode
degrees
dense_rank
div
e
elt
encode
ewah_bitmap
ewah_bitmap_and
ewah_bitmap_empty
ewah_bitmap_or
exp
explode
field
find_in_set
first_value
floor
format_number
from_unixtime
from_utc_timestamp
get_json_object
greatest
hash
hex
histogram_numeric
hour
if
in
in_file
index
initcap
inline
instr
isnotnull
isnull
java_method
json_tuple
lag
last_day
last_value
lcase
lead
least
length
levenshtein
like
ln
locate
log
log10
log2
logged_in_user
lower
lpad
ltrim
map
map_keys
map_values
matchpath
max
md5
min
minute
month
months_between
named_struct
negative
next_day
ngrams
noop
noopstreaming
noopwithmap
noopwithmapstreaming
not
ntile
nvl
or
parse_url
parse_url_tuple
percent_rank
percentile
percentile_approx
pi
pmod
posexplode
positive
pow
power
printf
radians
rand
rank
reflect
reflect2
regexp
regexp_extract
regexp_replace
repeat
reverse
rlike
round
row_number
rpad
rtrim
second
sentences
sha2
sign
sin
size
sort_array
soundex
space
split
sqrt
stack
std
stddev
stddev_pop
stddev_samp
str_to_map
struct
substr
substring
sum
tan
to_date
to_unix_timestamp
to_utc_timestamp
translate
trim
trunc
ucase
unbase64
unhex
unix_timestamp
upper
uuid
var_pop
var_samp
variance
version
weekofyear
when
windowingtablefunction
xpath
xpath_boolean
xpath_double
xpath_float
xpath_int
xpath_long
xpath_number
xpath_short
xpath_string
year
|
~

显示自带函数用法
hive> desc function extended  minute;
OK
minute(date) - Returns the minute of date
date is a string in the format of 'yyyy-MM-dd HH:mm:ss' or 'HH:mm:ss'.
Example:
   > SELECT minute('2009-07-30 12:58:59') FROM src LIMIT 1;
  58
  > SELECT minute('12:58:59') FROM src LIMIT 1;
  58

简单显示
hive> desc function minute;
minute(date) - Returns the minute of date

自定义函数

Hive自定义函数包括三种UDF、UDAF、UDTF
　　UDF(User-Defined-Function) 一进一出
　　UDAF(User- Defined Aggregation Funcation)聚集函数，多进一出。Count/max/min
　　UDTF(User-Defined Table-Generating Functions) 一进多出，如lateral view explore()
　　使用方式：在HIVE会话中add自定义函数的jar文件，然后创建function继而使用函数

UDF(User-Defined-Function) 一进一出

依赖：
sbt:
libraryDependencies += "org.apache.hive" % "hive-exec" % "1.2.1"  

maven:  
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>

关键代码：  
import org.apache.hadoop.hive.ql.exec.UDF;
/**
 * @author lillcol
 * 2019/7/15-17:56
 */
public class MyUDF extends UDF {
    public int evaluate(String str) {
        try{
            int n = Integer.parseInt(str);
            return n*n;
        }catch(Exception e){
            e.printStackTrace();
            return -1;
        }
    }
}

上述代码打包为udf.jar

将jar包添加到hive的class path  
hive> add jar /root/tmp_lillcol/udf.jar;
Added [/root/tmp_lillcol/udf.jar] to class path
Added resources: [/root/tmp_lillcol/udf.jar]

创建临时函数与开发好的java class关联  
hive> create temporary function square as "com.iptv.test.MyUDF";
OK
hive> select * ,square(score) from score ;
OK
李白    语文    100     10000
李白    数学    90      8100
李白    天文    85      7225
屈原    语文    100     10000
屈原    数学    85      7225
屈原    天文    90      8100
诸葛亮  语文    95      9025
诸葛亮  数学    99      9801
诸葛亮  天文    110     12100
祖冲之  语文    90      8100
祖冲之  数学    100     10000
祖冲之  天文    90      8100