yhd Log Analysis (Part 1)
Using the yhd log files, compute per-hour PV and UV statistics for each day:
- Create a Hive table whose column delimiter matches the log files
- Load the data into the Hive table
- Write Hive SQL to compute PV and UV, saving the results to a second Hive table
- Use Sqoop to export the results from the second Hive table to MySQL
MySQL table format: daily_hour_visit(date, hour, uv, pv)
Environment:
CentOS 6.4
hadoop-2.5.0-cdh5.3.6
hive-0.13.1-cdh5.3.6
sqoop-1.4.5-cdh5.3.6
1. Create the Hive table
Given the yhd log data format, we use a partitioned table, with partition columns date and hour:
create table if not exists yhd_log(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
partitioned by(date string, hour int)
row format delimited fields terminated by ' '
stored as textfile;
2. Load data into the Hive table
First upload the two data files, 2015082818 and 2015082819, to /home/hadoop (my home directory), then load them into the Hive table:
LOAD DATA LOCAL INPATH '/home/hadoop/2015082818' OVERWRITE INTO TABLE yhd_log PARTITION (date='20150828', hour=18);
LOAD DATA LOCAL INPATH '/home/hadoop/2015082819' OVERWRITE INTO TABLE yhd_log PARTITION (date='20150828', hour=19);
Check that the data was loaded successfully:
select url, endUserId from yhd_log where date='20150828' and hour=18 limit 10;
select url, endUserId from yhd_log where date='20150828' and hour=19 limit 10;
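As an additional sanity check, you can confirm that both partitions were registered; the partition values should match the two LOAD statements:

```sql
-- List the partitions registered for the table; both hours should appear:
-- date=20150828/hour=18
-- date=20150828/hour=19
show partitions yhd_log;
```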
3. Hive SQL statistics
Compute PV and UV and write the results into a new table, yhd_log_stat.
The first run creates the table with CTAS:
create table if not exists yhd_log_stat
as
select '20150828' as date, 18 as hour, count(url) as pv, count(distinct guid) as uv from yhd_log where date='20150828' and hour=18;
Results for later hours are appended with an INSERT:
from yhd_log
insert into table yhd_log_stat
select '20150828' as date, 19 as hour, count(url) as pv, count(distinct guid) as uv
where date='20150828' and hour=19;
View the data in yhd_log_stat:
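With only two partitions, the per-hour statements above are manageable, but once more hours are loaded the same statistics can be recomputed for all partitions in one pass. A sketch, using the same table names as above:

```sql
-- Recompute PV and UV for every loaded partition in a single query,
-- replacing the table's previous contents.
insert overwrite table yhd_log_stat
select date, hour, count(url) as pv, count(distinct guid) as uv
from yhd_log
group by date, hour;
```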
4. Export yhd_log_stat to MySQL with Sqoop
First create the table on the MySQL side:
CREATE TABLE `yhd_log_stat` (
`date` VARCHAR(50) NOT NULL DEFAULT '0',
`hour` TINYINT(4) NOT NULL DEFAULT '0',
`pv` BIGINT(20) NULL DEFAULT '0',
`uv` BIGINT(20) NULL DEFAULT '0',
PRIMARY KEY (`date`, `hour`),
INDEX `date` (`date`),
INDEX `hour` (`hour`)
);
Use sqoop export to move the data from HDFS into MySQL:
sqoop export --connect jdbc:mysql://mysql-server:3306/test --username mysql --password mysql-pwd --table yhd_log_stat --export-dir /user/hive/warehouse/db0731.db/yhd_log_stat --columns date,hour,pv,uv --input-fields-terminated-by '\001'
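Because yhd_log_stat was created with CTAS and therefore uses Hive's default SerDe, its field delimiter is the non-printing control character \001 (Ctrl-A), which is the value --input-fields-terminated-by must name. A quick way to see what that byte looks like in the warehouse files (the sample values here are made up):

```shell
# Hive's default field delimiter is the \001 (Ctrl-A) byte, invisible in plain
# output; cat -v renders it as ^A, so you can verify the file layout before
# running sqoop export.
printf '20150828\00118\001999\001100\n' | cat -v
# prints: 20150828^A18^A999^A100
```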
View the exported results in MySQL:
Attachment: yhd log data