Hive-03 Common Functions

An overview of commonly used Hive functions

How to use split, coalesce, and collect_list in Hive (with examples)
split converts a string into an array.
split('a,b,c,d' , ',') ==> ["a","b","c","d"]
COALESCE(T v1, T v2, …) returns the first non-NULL argument; if all arguments are NULL, it returns NULL.
collect_list gathers every value of the column within a group, without deduplication:  select collect_list(id) from table;
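split and collect_list are shown with examples above; here is a minimal COALESCE sketch (the table and column names are hypothetical):

```sql
-- Fall back through phone columns; 'unknown' when both are NULL
select id,
       coalesce(mobile, home_phone, 'unknown') as contact
from users;
-- coalesce(null, 'a', 'b') ==> 'a'
```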

Commonly used date/time functions:

Converting the time zone of MongoDB timestamps: from UTC to GMT+8, an 8-hour difference

date_format(from_utc_timestamp( CONCAT_WS(' ',substring(updatetime,1,10),substring(updatetime,12,8) ) ,'GMT+8'),'yyyy-MM-dd HH:mm:ss') updatetime

select date_format(from_utc_timestamp(create_time,"UTC"),'yyyy-MM-dd HH:mm:ss') as local_time   -- passing "UTC" leaves the value unchanged

select date_format(from_utc_timestamp(create_time,"GMT+8"),'yyyy-MM-dd HH:mm:ss') as local_time -- shifts UTC forward 8 hours
Unix timestamps: seconds (s) have 10 digits; milliseconds (ms) have 13 digits.

date_format(from_unixtime(cast(h.updatetime as int)),'yyyy-MM-dd HH:mm:ss')
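from_unixtime expects seconds, so a 13-digit millisecond value must be integer-divided by 1000 first (a sketch reusing h.updatetime; casting to bigint avoids int overflow):

```sql
-- 10-digit seconds: use directly
select date_format(from_unixtime(cast(h.updatetime as bigint)), 'yyyy-MM-dd HH:mm:ss');
-- 13-digit milliseconds: drop the last three digits first
select date_format(from_unixtime(cast(h.updatetime as bigint) div 1000), 'yyyy-MM-dd HH:mm:ss');
```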

substring(h.id, 10, 24) extracts the 24-character hex id 59409d1d2cdcc90b91c62be5 from the string ObjectId(59409d1d2cdcc90b91c62be5) — "ObjectId(" occupies the first 9 characters.

Today: select date_format(current_timestamp,'yyyy-MM-dd')
Yesterday: select date_sub(current_date,1);
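A few related date built-ins worth keeping nearby (standard Hive functions, added here as a supplement to the notes):

```sql
select datediff('2019-10-08', '2019-10-01');    -- 7 (days between two dates)
select date_add(current_date, 7);               -- one week from today
select add_months('2019-10-08', 1);             -- 2019-11-08
select unix_timestamp('2019-10-08 12:00:00');   -- string -> epoch seconds
```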

Concatenating fields in Hive

concat: simple concatenation;

CONCAT_WS("/",r.province,r.city,a.area) channel_address  => 北京/北京市/朝阳区; the arguments must be string (or array<string>);
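A difference worth remembering when choosing between the two: concat returns NULL as soon as any argument is NULL, whereas concat_ws simply skips NULL arguments:

```sql
select concat('北京', null, '朝阳区');          -- NULL
select concat_ws('/', '北京', null, '朝阳区');  -- 北京/朝阳区
```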

concat(payid,' ',carlogid)  => 01a893092b914703b75941b713767ebf 408693
concat(substr(summary_time,9,2),'',substr(summary_time,6,2),'',substr(summary_time,1,4),'_',concat(market_id),'_' ,concat(mid))
09022019_108_0
12022019_108_0
21022019_108_0

concat_ws("_" ,substr(summary_time,9,2),substr(summary_time,6,2),substr(summary_time,1,4),concat(market_id),concat(mid))
09_03_2019_108_0
13_03_2019_108_0
21_03_2019_108_0

concat(sum(total_amount_pur),'&&',sum(total_amount_sig),'&&',sum(total_amount_jump))
0.0&&16665.0&&0.0
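Downstream, such a '&&'-packed string can be unpacked again with split (a small sketch; split takes a regex, but '&&' contains no special regex characters):

```sql
-- '0.0&&16665.0&&0.0' -> pull out the second packed metric
select split('0.0&&16665.0&&0.0', '&&')[1];   -- 16665.0
```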
order by date desc, market_id asc;
date    market_id    total_fee
2019-10-08    110    23000
2019-10-08    110    13000
2019-10-08    141    1400
2019-10-08    141    2250
2019-10-08    218    4000
2019-10-08    218    1100
2019-10-08    218    2300
2019-10-08    218    4500
2019-10-08    234    0
2019-10-08    234    0

Find values that appear exactly once

SELECT name,count(name) AS count_times  FROM tb_test  GROUP BY name  HAVING count_times = 1;

Return every record whose value in a column appears more than once

select * from stu where sname in (select sname from stu group by sname having count(sname)>1);
Records with a duplicated sname, plus the count for each (note: Hive rejects select * combined with group by, so select only the grouped column):
select sname, count(sname) as count from stu group by sname having count(sname) > 1;

How to select non-GROUP BY columns in Hive (aggregate functions such as count, sum, avg, and max can of course still be applied)

  Method ①

  The input has multiple names per ID; the output must have one row per ID, with any one of the names.

   Hive provides collect_set, which plays the role of MySQL's group_concat: it gathers the other column of each group into a collection (to get group_concat's comma-joined string, wrap it in concat_ws):

collect_set(col)
Return type: array
Returns the deduplicated set of values in the group

    

select sid,collect_set(class_id) from table1 group by sid;
===>>
1 [11,12,13]
2 [11,14]
3 [12,15]
The second column can then be counted or summed with Hive's aggregate functions count and sum.
For our goal here, taking the first element of the array is enough:
select sid,collect_set(class_id)[0] from table1 group by sid;
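To get an actual group_concat-style string rather than an array, wrap collect_set in concat_ws (a sketch against the same table1; the cast is needed because concat_ws only accepts strings):

```sql
select sid,
       concat_ws(',', collect_set(cast(class_id as string))) as classes
from table1
group by sid;
-- e.g. sid 1 -> '11,12,13' (element order is not guaranteed)
```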

Method ②: group by every column that appears in the select list

select sid,class_id from table1 group by sid, class_id;

 Rows share the same user_id but have different create_time values; the requirement is to keep the earliest create_time per user. The idea: partition by user_id, order by create_time ascending, and keep the rows where row_number() = 1.

SELECT * FROM (
    SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY create_time ASC) rk
    FROM test_table
) t
WHERE t.rk = 1;

Running totals in Hive queries

1. The requirement

Given a Hive table record with the contents below.

Column meanings: channel_type (string), dt (string), num (int) — the channel type, the date, and that day's count. The mock data:

select * from record;
channel_type  dt     num
A    2015-01-01    8
A    2015-01-02    4
A    2015-01-02    5
C    2015-02-01    1
A    2015-01-04    5
A    2015-01-05    6
B    2015-01-03    2
B    2015-01-02    3
A    2015-01-03    2
C    2015-01-30    8
C    2015-01-30    7
B    2015-01-02    9
B    2015-01-01    1
C    2015-02-02    3

For each channel, compute the maximum single-day count so far and the cumulative total up to each day:

-- First compute each channel's daily total:
create table record_nj as  
select
channel_type,dt,sum(num) as new_join
from record
group by channel_type,dt;
channel_type    dt    new_join
A    2015-01-01    8
A    2015-01-02    9
A    2015-01-03    2
A    2015-01-04    5
A    2015-01-05    6
B    2015-01-01    1
B    2015-01-02    12
B    2015-01-03    2
C    2015-01-30    15
C    2015-02-01    1
C    2015-02-02    3
Method 1: the Hive window functions max() and sum() with over

select 
    channel_type, dt, new_join, 
    sum(new_join) over(partition by channel_type order by dt) as sum_count,
    max(new_join) over(partition by channel_type order by dt) as max_count
from record_nj;

Method 2: group by with a self-join
select 
    t1.channel_type,t1.dt,t1.new_join, 
    sum(t2.new_join) sum_count, max(t2.new_join) max_count 
from record_nj t1 join record_nj t2 on t1.channel_type = t2.channel_type 
where t1.dt >= t2.dt 
group by t1.channel_type,t1.dt,t1.new_join order by t1.channel_type,t1.dt;

The result:
channel_type    dt    new_join    sum_count    max_count
A    2015-01-01    8    8    8
A    2015-01-02    9    17    9
A    2015-01-03    2    19    9
A    2015-01-04    5    24    9
A    2015-01-05    6    30    9
B    2015-01-01    1    1    1
B    2015-01-02    12    13    12
B    2015-01-03    2    15    12
C    2015-01-30    15    15    15
C    2015-02-01    1    16    15
C    2015-02-02    3    19    15

 Running sum, running product, and running max:

select 
    channel_type,
    new_join, 
    sum(new_join) over(partition by channel_type order by dt) as sum_count, --running total (default frame)
    sum(new_join) over(partition by channel_type order by dt rows between unbounded preceding and current row) sum_count2, --running total (explicit frame)
    round(power(10, sum(log(10, new_join)) over(partition by channel_type order by dt rows between unbounded preceding and current row))) as tired, --running product via log/power (breaks if new_join is 0 or negative)
    max(new_join) over(partition by channel_type order by dt) as max_count --running max
from record_nj;

Pivot: rows to columns

The original data:

select * from score;
name    subject    score
孙悟空    语文    87
孙悟空    数学    95
孙悟空    英语    68
大海    语文    94
大海    数学    56
大海    英语    84
kris    语文    64
kris    数学    86
kris    英语    84
婷婷    语文    65
婷婷    数学    85
婷婷    英语    78

Find the students whose Chinese (语文) score is higher than their math (数学) score:

Method 1: self-join
select
s1.name,s1.subject, s1.score from score s1 inner join score s2 on s1.name = s2.name where s1.score > s2.score and s1.subject = '语文' and s2.subject = '数学';
s1.name    s1.subject    s1.score
大海    语文    94
Method 2: pivot with case when

create table t1 AS
select 
 name,
 case subject when '语文' then score else 0 end as chinese_score,
 case subject when '数学' then score else 0 end as math_score 
from score;
name    chinese_score    math_score
孙悟空    87    0
孙悟空    0    95
孙悟空    0    0
大海    94    0
大海    0    56
大海    0    0
kris    64    0
kris    0    86
kris    0    0
婷婷    65    0
婷婷    0    85
婷婷    0    0
create table t2 AS
select
name,max(chinese_score) chinese_score,max(math_score) math_score
from t1 group by name;
name    chinese_score    math_score
kris    64    86
大海    94    56
婷婷    65    85
孙悟空    87    95
select 
  name, chinese_score, math_score
from t2 where chinese_score > math_score;
Or merge the three HQL statements into one:
select name, chinese_score, math_score
from (
    select name,
           max(chinese_score) chinese_score,
           max(math_score) math_score
    from (
        select name,
               case subject when '语文' then score else 0 end as chinese_score,
               case subject when '数学' then score else 0 end as math_score
        from score
    ) t1
    group by name
) t2
where chinese_score > math_score;
name    chinese_score    math_score
大海    94    56

Unpivot: columns to rows

The data:

id  sname   math    computer    english
1   Jed     34      58          58
2   Tony    45      87          45
3   Tom     76      34          89
-- union all is preferred here: plain union adds a deduplication step
select id, sname, 'math' as course, math as score from score
union all
select id, sname, 'computer' as course, computer as score from score
union all
select id, sname, 'english' as course, english as score from score
order by id, sname, course;

The result:

id  sname   course      score
1   Jed     computer    58
1   Jed     english     58
1   Jed     math        34
2   Tony    computer    87
2   Tony    english     45
2   Tony    math        45
3   Tom     computer    34
3   Tom     english     89
3   Tom     math        76
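The same unpivot can also be done in a single pass with the stack() UDTF instead of three unions (a sketch against the same score table):

```sql
-- stack(3, ...) emits three (course, score) rows per input row
select id, sname, t.course, t.score
from score
lateral view stack(
    3,
    'math',     math,
    'computer', computer,
    'english',  english
) t as course, score;
```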

Window functions

1. Aggregate functions: sum, avg, max, min

Create the table and load the data:
create table cookie(
  cookie_id  string,
  create_time string,
  pv int
)row format delimited fields terminated by ",";

a_cookie,2019-06-10,1
a_cookie,2019-06-11,9
a_cookie,2019-06-12,7
a_cookie,2019-06-13,3
a_cookie,2019-06-14,2
a_cookie,2019-06-15,4
a_cookie,2019-06-16,4
b_cookie,2019-08-17,6
b_cookie,2019-08-18,9
b_cookie,2019-08-19,5
b_cookie,2019-08-17,2
load data local inpath  "/opt/module/datas/cookie.txt" into table cookie;
select 
    cookie_id,create_time,pv,
    sum(pv) over(partition by cookie_id) as sum1, --sum all pv within the partition
    sum(pv) over(partition by  cookie_id order by create_time) as accu_sum  --running total from the first row to the current row
from cookie;

The result:

a_cookie    2019-06-10    1    30    1
a_cookie    2019-06-11    9    30    10
a_cookie    2019-06-12    7    30    17
a_cookie    2019-06-13    3    30    20
a_cookie    2019-06-14    2    30    22
a_cookie    2019-06-15    4    30    26
a_cookie    2019-06-16    4    30    30
b_cookie    2019-08-17    2    22    8
b_cookie    2019-08-17    6    22    8
b_cookie    2019-08-18    9    22    17
b_cookie    2019-08-19    5    22    22
select
    cookie_id,create_time,pv,
    sum(pv) over(partition by cookie_id) as sum1, --sum all pv within the partition
    sum(pv) over(partition by  cookie_id order by create_time) as accu_sum, --running total from the first row to the current row
    sum(pv) over(partition by cookie_id order by create_time rows between unbounded preceding and current row) as sum3,
                         --the default frame already runs from the start to the current row, so "between unbounded preceding and current row" can be omitted
    sum(pv) over(partition by  cookie_id order by create_time rows between current row and unbounded following) as sum4,
                       --from the current row to the end of the partition: the mirror image of the previous one
    sum(pv) over(partition by  cookie_id order by create_time rows between 3 preceding and current row) as sum5,
                       --from 3 rows before the current row to the current row: 4 rows in total
    sum(pv) over(partition by  cookie_id order by create_time rows between 3 preceding and 1 following) as sum6
                        --from 3 rows before to 1 row after the current row
from cookie;

The result:

a_cookie    2019-06-10    1    30    1    1    30    1    10
a_cookie    2019-06-11    9    30    10    10    29    10    17
a_cookie    2019-06-12    7    30    17    17    20    17    20
a_cookie    2019-06-13    3    30    20    20    13    20    22
a_cookie    2019-06-14    2    30    22    22    10    21    25
a_cookie    2019-06-15    4    30    26    26    8    16    20
a_cookie    2019-06-16    4    30    30    30    4    13    13
b_cookie    2019-08-17    2    22    8    2    22    2    8
b_cookie    2019-08-17    6    22    8    8    20    8    17
b_cookie    2019-08-18    9    22    17    17    14    17    22
b_cookie    2019-08-19    5    22    22    22    5    22    22

sum, avg, max, and min all share the window syntax shown above.

2. ntile

 ntile(n) splits the ordered rows of each partition into n buckets and returns the bucket number of the current row; ntile does not support rows between. If the rows do not divide evenly, the earlier buckets get the extra rows.
Split the partitions, in order, into 1, 2, 3, and 4 buckets:
select 
    cookie_id,create_time,pv,
    ntile(1) over(partition by  cookie_id order by create_time) as nt1,
    ntile(2) over(partition by  cookie_id order by create_time) as nt2,
    ntile(3) over(partition by  cookie_id order by create_time) as nt3,
    ntile(4) over(partition by  cookie_id order by create_time) as nt4
from cookie;
Result: the a_cookie group has 7 rows, so the first bucket gets the extra row
cookie_id    create_time pv nt1 nt2 nt3 nt4
a_cookie    2019-06-10    1    1    1    1    1
a_cookie    2019-06-11    9    1    1    1    1
a_cookie    2019-06-12    7    1    1    1    2
a_cookie    2019-06-13    3    1    1    2    2
a_cookie    2019-06-14    2    1    2    2    3
a_cookie    2019-06-15    4    1    2    3    3
a_cookie    2019-06-16    4    1    2    3    4
b_cookie    2019-08-17    6    1    1    1    1
b_cookie    2019-08-17    2    1    1    1    2
b_cookie    2019-08-18    9    1    2    2    3
b_cookie    2019-08-19    5    1    2    3    4

 Use case: for each cookie, find the top 1/3 of days by pv

select 
    cookie_id,create_time,pv
from(
select 
    cookie_id,create_time,pv,
    ntile(3) over(partition by  cookie_id order by pv desc) as nt1 --sort by pv descending; bucket nt1 = 1 is the answer
from cookie
)t where nt1 = 1;
a_cookie    2019-06-11    9
a_cookie    2019-06-12    7
a_cookie    2019-06-16    4
b_cookie    2019-08-18    9
b_cookie    2019-08-17    6

 3. Ranking: row_number(), rank(), dense_rank()

    row_number() numbers the rows of each partition in order, starting from 1; ties still get distinct numbers;
    rank(): ranks within the partition; ties share the same rank, and the ranks that follow skip (gaps are left);
    dense_rank(): ties share the same rank, with no gaps.
select 
    cookie_id,create_time,pv,
    row_number() over(partition by  cookie_id order by pv desc) as row_index,
    rank() over(partition by  cookie_id order by pv desc) as rank_index,
    dense_rank() over(partition by  cookie_id order by pv desc) as dense_index
from cookie;
cookie_id    create_time    pv    row_index    rank_index    dense_index
a_cookie    2019-06-11    9    1    1    1
a_cookie    2019-06-12    7    2    2    2
a_cookie    2019-06-16    4    3    3    3
a_cookie    2019-06-15    4    4    3    3
a_cookie    2019-06-13    3    5    5    4
a_cookie    2019-06-14    2    6    6    5
a_cookie    2019-06-10    1    7    7    6

Common use: top-N per group, e.g. each cookie's top 3 rows by pv

create table tmp as
select cookie_id, create_time, pv,
row_number() over(partition by cookie_id order by pv desc) as rk
from cookie;

select * from tmp where rk <= 3;

 4. lag | lead | first_value | last_value

LAG(col, n, DEFAULT) returns the value from the n-th row above the current row within the window.

The first argument is the column name;
the second is how many rows up (optional, default 1);
the third is the default value (used when the n-th row above does not exist; NULL if unspecified).

select 
    cookie_id,create_time,pv, 
    lag(pv,1) over (partition by cookie_id order by create_time) as last_1_pv --to default to 0 instead of NULL, write lag(pv,1,0) over(...)
 from cookie;

The result:

Result: no default value was set, so rows with no previous row show NULL
cookie_id  create_time  pv  last_1_pv
a_cookie    2019-06-10    1    NULL
a_cookie    2019-06-11    9    1
a_cookie    2019-06-12    7    9
a_cookie    2019-06-13    3    7
a_cookie    2019-06-14    2    3
a_cookie    2019-06-15    4    2
a_cookie    2019-06-16    4    4
b_cookie    2019-08-17    6    NULL
b_cookie    2019-08-17    2    6
b_cookie    2019-08-18    9    2
b_cookie    2019-08-19    5    9
lead is the opposite of lag:
LEAD(col, n, DEFAULT) returns the value from the n-th row below the current row within the window.
The first argument is the column name;
the second is how many rows down (optional, default 1);
the third is the default value (used when the n-th row below does not exist; NULL if unspecified).
select 
    cookie_id,create_time,pv, 
    lead(pv,1) over(partition by cookie_id order by create_time) as next_1_pv,
    lead(create_time,1,'1970-01-01') over(partition by cookie_id order by create_time) as next_1_time 
from cookie;
The result:
cookie_id create_time pv next_1_pv next_1_time 
a_cookie    2019-06-10    1    9    2019-06-11
a_cookie    2019-06-11    9    7    2019-06-12
a_cookie    2019-06-12    7    3    2019-06-13
a_cookie    2019-06-13    3    2    2019-06-14
a_cookie    2019-06-14    2    4    2019-06-15
a_cookie    2019-06-15    4    4    2019-06-16
a_cookie    2019-06-16    4    NULL    1970-01-01
b_cookie    2019-08-17    6    2    2019-08-17
b_cookie    2019-08-17    2    9    2019-08-18
b_cookie    2019-08-18    9    5    2019-08-19
b_cookie    2019-08-19    5    NULL    1970-01-01

first_value: after ordering within the partition, the first value up to the current row

select 
    cookie_id,create_time,pv, 
    first_value(pv) over (partition by cookie_id  order by pv) as first_pv 
from cookie;
The result:
cookie_id create_time  pv first_pv 
a_cookie    2019-06-10    1    1
a_cookie    2019-06-14    2    1
a_cookie    2019-06-13    3    1
a_cookie    2019-06-16    4    1
a_cookie    2019-06-15    4    1
a_cookie    2019-06-12    7    1
a_cookie    2019-06-11    9    1
b_cookie    2019-08-17    2    2
b_cookie    2019-08-19    5    2
b_cookie    2019-08-17    6    2
b_cookie    2019-08-18    9    2

last_value: after ordering within the partition, the last value up to the current row (with the default frame, that is the current row itself)

select 
    cookie_id,create_time,pv, 
    last_value(pv) over (partition by cookie_id  order by pv) as last_pv 
from cookie;
cookie_id create_time  pv last_pv 
a_cookie    2019-06-10    1    1
a_cookie    2019-06-14    2    2
a_cookie    2019-06-13    3    3
a_cookie    2019-06-16    4    4
a_cookie    2019-06-15    4    4
a_cookie    2019-06-12    7    7
a_cookie    2019-06-11    9    9
b_cookie    2019-08-17    2    2
b_cookie    2019-08-19    5    5
b_cookie    2019-08-17    6    6
b_cookie    2019-08-18    9    9
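To get the true last value of the whole partition rather than the current row, the frame must be widened explicitly; this is a common last_value gotcha:

```sql
select
    cookie_id, create_time, pv,
    last_value(pv) over (
        partition by cookie_id
        order by pv
        rows between unbounded preceding and unbounded following  -- full partition
    ) as partition_last_pv
from cookie;
-- every a_cookie row now shows 9; every b_cookie row also shows 9
```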

See also:

grouping sets, grouping__id, cube, and rollup;

cume_dist, percent_rank, and related functions

https://www.jianshu.com/nb/19948302

Source: https://www.cnblogs.com/shengyang17/p/11204631.html