[Hive_add_11] Hive 使用 UDTF 实现日志降维

0. 说明

　　对日志进行降维处理，将日志分为几个小表

　　通过编写 UDTF ，对日志降维，将日志聚合体相关字段抽取出来，形成新表。

1. 操作流程

　　1.0 日志部分内容

1532868065.518#192.168.23.1#1532868067548#200#{"appChannel":"appstore","appErrorLogs":

[{"createdAtMs":1530455040000,"errorBrief":"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)",
"errorDetail":"at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67) at sun.reflect.DelegatingMethodAccessorImpl.
invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606)"},{"createdAtMs":1530393180000,
"errorBrief":"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)","errorDetail":
"at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:606)"}],"appEventLogs":[{"createdAtMs":1530412800000,"eventId":
"share","logType":"event","mark":"4","musicID":"傲红尘"}],"appPageLogs":[{"createdAtMs":1530449520000,
"logType":"page","nextPage":"list.html","pageId":"list.html","pageViewCntInSession":0,"visitIndex":"2"}],
"appPlatform":"ios","appStartupLogs":[{"brand":"联想","carrier":"中国联通","country":"china","createdAtMs":
1530385560000,"logType":"startup","network":"cell","province":"hebei","screenSize":"960 * 640"},
{"brand":"魅族","carrier":"中国铁通","country":"america","createdAtMs":1530412860000,"logType":"startup",
"network":"3g","province":"guangxi","screenSize":"480 * 320"}],"appUsageLogs":[{"createdAtMs":1530379200000,
"logType":"usage","singleDownloadTraffic":"12800","singleUploadTraffic":"128","singleUseDurationSecs":"123"}],
"appVersion":"1.0.0","deviceId":"Device000099","deviceStyle":"oppo 1","osType":"1.4.0"}

　　1.1 创建 logAgg表

　　创建 logAgg表，分区表 => year, month, day

    create table logAgg(serverTime string,remoteIp string,clientTime string,status string, json string)
    partitioned by(year string, month string, day string)
    row format delimited
    fields terminated by '#' ;

　　1.2 load 数据到 logAgg表

　　load data local inpath '/home/centos/files/2018-07-01.log' into table logagg partition(year='2018',month='07',day='01');

　　1.3 降维处理

　　1. 代码编写

　　ParseJsonUtil.java

　　ParseEvent.java

　　2. 上传并同步

　　先打包再放入 /soft/hive/lib 中

    cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/

    xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

　　3. 注册临时函数

    create temporary function parseEvent as 'com.share.udtf.ParseEvent';

　　4.测试

    select parseEvent(json) from logAgg;

　　1.4 创建 logEvent表

    create table logevent(deviceId string, createdAtMs string, eventId string, logType string , mark string, musicID string)
    stored as parquet tblproperties('parquet.compression'='GZIP');

　　1.5 转储

    insert into logevent 
    select parseEvent(json) from logagg
    where year='2018' and month='07' and day='01';

　　1.6 对 logEvent表进行操作

　　1. 计算每个用户对每首歌的评分

select deviceid, musicid, sum(cast(mark as int)) as sum
from logevent
where musicId is not null
group by deviceid, musicid；

　　2. 计算每个用户对每首歌的评分与最高评分

select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
from (
select deviceid, musicid, sum(cast(mark as int)) as sum
from logevent where musicId is not null group by deviceid, musicid
)a；

　　3. 使用 sql 计算出，每个用户最喜欢(评分最高)的歌曲，及其评分

select deviceid , musicid, sum from (
select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
from (
select deviceid, musicid, sum(cast(mark as int)) as sum
from logevent where musicId is not null group by deviceid, musicid
)a
)b 
where sum=sum2 ;

2. 附加内容

　　2.1 Hive 加载 jar 方式

　　1. 将 jar 放在 Hive 的 lib 下,并重启 Hive
　　　　/soft/hive/lib

　　2. 通过配置文件指定 jar，并重启 Hive
　　　　hive.aux.jars.path=/x/x/x.jar

　　3. 临时加载 jar
　　　　hive> add jar /x/x/x.jar

　　2.2 注册函数方法

　　1. create function as '';　　　　// 永久

　　2. create temporary function as '';　　　　// 临时

　　3. create function xxx as '' using jar 'hdfs://mycluster/xxx.jar';　　　　// 将 jar 包放入 HDFS 中，避免重启