[Hive_12] Hive User-Defined Functions

0. Overview

　　UDF 　　//user-defined function
　　　　　　//one row in, one row out, similar to format_number(age,'000')

　　UDTF 　　//user-defined table-generating function
　　　　　　 //one row in, multiple rows out, similar to explode(array)

　　UDAF 　　//user-defined aggregate function
　　　　　　 //multiple rows in, one row out, similar to sum(xxx)

　　Section 1.3 uses a Hive UDF to parse the JSON column of the temptags table.
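The three kinds of function listed above differ only in how many rows go in and how many come out. The plain-Java analogy below may help fix the idea. It uses no Hive classes, and the names FunctionShapes, pad3, explode and sum are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy for the three function shapes; no Hive classes are involved.
class FunctionShapes {
    // UDF shape: one value in, one value out (compare format_number(age,'000')).
    static String pad3(int age) {
        return String.format("%03d", age);
    }

    // UDTF shape: one value in, many rows out (compare explode(array)).
    static List<String> explode(String[] items) {
        return Arrays.asList(items);
    }

    // UDAF shape: many rows in, one value out (compare sum(xxx)).
    static long sum(List<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }
}
```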

1. UDF

　　1.1 Code example

　　Code
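The original listing for com.share.udf.MyUDF is not reproduced here, so what follows is only a sketch of the usual shape of a simple UDF, not that class. It is written as plain Java so that it compiles without the hive-exec jar. In a real build the class would extend org.apache.hadoop.hive.ql.exec.UDF, and Hive would locate the matching evaluate overload by reflection. The addition logic and the name MyUdfSketch are assumptions chosen for illustration.

```java
// Plain-Java sketch of a simple Hive UDF; NOT the original com.share.udf.MyUDF.
// In a real project: class MyUDF extends org.apache.hadoop.hive.ql.exec.UDF
class MyUdfSketch {
    // Would be chosen for a two-argument call such as myudf(a, b).
    public int evaluate(int a, int b) {
        return a + b;
    }

    // Would be chosen for a three-argument call such as myudf(a, b, c).
    public int evaluate(int a, int b, int c) {
        return a + b + c;
    }
}
```

Overloading evaluate is how one function name can accept different argument lists.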

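　　1.2 Using a user-defined function

　　1. Package the Hive user-defined function as a jar and copy it to /soft/hive/lib
　　2. Restart Hive
　　3. Register the function

-- Permanent function
　　create function myudf as 'com.share.udf.MyUDF';

-- Temporary function
　　create temporary function myudf as 'com.share.udf.MyUDF';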

　　1.3 Demo

　　Use a Hive UDF to parse the temptags data

　　0. Prepare the data

　　1. Create the table

    create table temptags(id int,json string) row format delimited fields terminated by '\t';

　　2. Load the data

    load data local inpath '/home/centos/files/temptags.txt' into table temptags;

　　3. Write the code

　　Code
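Neither the original com.share.udf.ParseJson nor the layout of the temptags JSON is shown in this post, so the sketch below only illustrates the general idea: take the JSON string, return the tags as a list that explode can later expand. The {"tags":[...]} layout is a made-up stand-in, and a small regex replaces fastjson so the snippet compiles on its own; it does not handle escaped quotes or nested arrays. In a real build the class would extend org.apache.hadoop.hive.ql.exec.UDF and would more sensibly use a JSON library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: NOT the original ParseJson. The {"tags":[...]} layout is hypothetical.
class ParseJsonSketch {
    // Captures the text between the brackets of the (assumed) "tags" array.
    private static final Pattern TAGS = Pattern.compile("\"tags\"\\s*:\\s*\\[(.*?)\\]");
    // Captures the content of each double-quoted string (no escape handling).
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the tag strings, or an empty list for null input or a missing array.
    public List<String> evaluate(String json) {
        List<String> tags = new ArrayList<>();
        if (json == null) {
            return tags;
        }
        Matcher array = TAGS.matcher(json);
        if (array.find()) {
            Matcher item = QUOTED.matcher(array.group(1));
            while (item.find()) {
                tags.add(item.group(1));
            }
        }
        return tags;
    }
}
```

Returning a list is what makes the explode(parsejson(json)) call in the test step possible, since explode expects an array.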

　　4. Package

　　5. Copy fastjson-1.2.47.jar and myhive-1.0-SNAPSHOT.jar to /soft/hive/lib

　　6. Restart Hive

　　7. Register a temporary function

    create temporary function parsejson as 'com.share.udf.ParseJson';

　　8. Test

select id, parsejson(json) as tags from temptags;

-- Explode the tag array into one (id, tag) row per tag
select id, tag from temptags lateral view explode(parsejson(json)) xx as tag;

-- Count how often each tag occurs per merchant
select id, tag, count(*) as count
from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag;

-- Rank the tags within each merchant by count
select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count
      from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
      group by id, tag) b;

-- Concatenate each tag with its count, keeping the top 10 tags per merchant
select id, concat(tag, '_', count)
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
      from (select id, tag, count(*) as count
            from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
            group by id, tag) b) c
where rank <= 10;

-- Aggregate the strings into one line per merchant
--   concat_ws(',', array): joins the array elements with ','
--   collect_set(name): collects the column's values into an array, duplicates removed
--   collect_list(name): collects the column's values into an array, duplicates kept
select id, concat_ws(',', collect_set(concat(tag, '_', count))) as tags
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
      from (select id, tag, count(*) as count
            from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
            group by id, tag) b) c
where rank <= 10
group by id;
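To make the nesting in the last query easier to follow, here is a plain-Java walk-through of the same steps: explode, count per tag, rank by count, keep the top N, then join. It is only an illustration and runs outside Hive. The class name TopTagsSketch, the input map and the tie-break by tag name are assumptions; the SQL above does not define an order for tags with equal counts.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java walk-through of the final query; the input stands in for
// (id, parsejson(json)) rows and is not real data.
class TopTagsSketch {
    static Map<Integer, String> topTags(Map<Integer, List<String>> tagsById, int limit) {
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> row : tagsById.entrySet()) {
            // lateral view explode + group by id, tag + count(*)
            Map<String, Long> counts = row.getValue().stream()
                    .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
            String joined = counts.entrySet().stream()
                    // row_number() over (partition by id order by count desc);
                    // ties are broken by tag name here (an assumption, see above)
                    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                            .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                    // where rank <= limit
                    .limit(limit)
                    // concat(tag, '_', count)
                    .map(e -> e.getKey() + "_" + e.getValue())
                    // concat_ws(',', ...)
                    .collect(Collectors.joining(","));
            result.put(row.getKey(), joined);
        }
        return result;
    }
}
```

For a merchant whose exploded tags are a, b, a, c, a, b, calling topTags with a limit of 2 yields a_3,b_2.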

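　　1.4 Virtual columns: lateral view

　　lateral view joins each input row with the rows that a table-generating function such as explode produces from it, so an array column can be turned into one row per element.

　　123456 good taste_10,clean environment_9

　　id　　 tags
　　1 　　[good taste, clean environment]　　 =>　　 1 good taste
　　　　　　　　　　　　　　　　　　　　　　　　　　　1 clean environment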

select name, workplace from employee lateral view explode(work_place) xx as workplace;

　　1.5 ClassNotFoundException

　　How to fix a ClassNotFoundException caused by missing jars

　　Problem

　　Caused by: java.lang.ClassNotFoundException: com.share.udf.ParseJson

　　Solution

　　1. Copy the fastjson and myhive jars to /soft/hadoop/share/hadoop/common/lib

　　cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/

　　cp /soft/hive/lib/fastjson-1.2.47.jar /soft/hadoop/share/hadoop/common/lib/

　　2. Sync them to the other nodes

　　xsync.sh /soft/hadoop/share/hadoop/common/lib/fastjson-1.2.47.jar

　　xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

　　3. Restart Hadoop and Hive

　　stop-all.sh

　　start-all.sh

　　hive

2. UDTF

　　2.0 Overview

　　Word count in Hive can be implemented in either of two ways:

　　array => explode

　　string => split => explode

　　Here the word splitting is done directly by a UDTF instead:

　　string => myudtf

　　2.1 Write the code

　　Code

　　2.2 Package

　　Copy myhive-1.0-SNAPSHOT.jar to /soft/hive/lib

　　2.3 Restart Hive

　　2.4 Register a temporary function

　　create temporary function myudtf as 'com.share.udtf.MyUDTF';

　　2.5 Test

    select myudtf(line) from wc2;

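　　2.6 Flow analysis

　　1. In initialize, check the types and the number of the arguments passed to the function

　　2. Return the schema of the output table (field names + field types)

　　3. In process, read the argument values

　　4. After processing them, emit each output row with forward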
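The four steps above can be mimicked in plain Java. This is not the original com.share.udtf.MyUDTF, only a sketch of the flow. A real implementation would extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF, whose initialize works with ObjectInspectors and returns a StructObjectInspector. Here the schema is reduced to a string array and the row collector to a Consumer so the snippet compiles without Hive. The column name word and the whitespace split are assumptions.

```java
import java.util.function.Consumer;

// Plain-Java sketch of the UDTF flow; NOT the original MyUDTF.
class WordSplitSketch {
    // Stands in for the collector that Hive hands to a UDTF.
    private final Consumer<Object[]> collector;

    WordSplitSketch(Consumer<Object[]> collector) {
        this.collector = collector;
    }

    // Steps 1 and 2: check the argument count, then describe the output columns.
    String[] initialize(int argCount) {
        if (argCount != 1) {
            throw new IllegalArgumentException("expected exactly one argument");
        }
        return new String[] {"word:string"};
    }

    // Step 3: read the argument value and split it into words.
    void process(Object[] args) {
        if (args[0] == null) {
            return;
        }
        for (String word : args[0].toString().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                forward(new Object[] {word});
            }
        }
    }

    // Step 4: emit one output row.
    void forward(Object[] row) {
        collector.accept(row);
    }
}
```

Each call to forward produces one output row, which is why a single input line can turn into many rows.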