搜狗搜索日志分析系统

 

 

1数据预处理(Linux环境)

搜狗数据的数据格式:

访问时间 用户ID [查询词] URL在返回结果中的排名 用户点击的顺序号 用户点击的URL

其中,用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值,即同一次使用浏览器输入的不同查询对应同一个用户ID

 

1.1数据扩展 awk -F  {print ()} sogo1w  > sogo1w_e

 awk -F' ''{print$0" "substr($1,1,4)" "substr($1,5,2)" "substr($1,7,2)" "substr($1,9,2)}' sougoulog.txt >  sougoulog1.txt

将时间字段拆分并拼接,添加年、月、日、小时字段,结果如图所示:

 

2基于Hive构建日志数据的数据仓库

2.1创外部表

1 创建扩展4个字段(年、月、日、小时)数据的外部表

create external table sogou1(time string,uid string,name string,rank int,oder int,url string, year string,month string,day string,hour string) row format delimited fields terminated by ' ';

2 装载数据

load data local inpath '/usr/local/hivedata/sougoulog1.txt' into table sogou1;

2.2创建带分区的表(按照年、月、天、小时分区)

Create table(,) partitioned by (,,,,)

Create table sogou2 (time string,uid string,name string,rank int,oder int,url string) partitioned by (year string,month string,day string,hour string) row format delimited fields terminated by ;

2.3灌入数据-->动态分区

Insert into table sogou2 patition (year,month,day,hour) select time,uid,name,rank,oder,url,year,month,day,hour from sogou1;

 

3条数统计

3.1数据总条数

Select count(*) from sogou2;

3.2关键字的非空查询条数

Select count(*) from sogou2 where name is not null;

3.3独立UID总数

Select count(*) from (select distinct uid from sogou2) s;

Select count(*) from (select count(*) from sogou2 group by uid)s;

 

单个uid出现的次数

Select count(*) from (select count(*) from sogou2 group by uid having count(*)=1)s;

 

 

 

 

4 UID分析

4.1 UID的查询次数分布(查询1次的UID个数,...查询N次的UID个数)

select s.c ,count(*) from (select count(*) c from sogou2 group by uid)s group by s.c;

 

4.2 每个UID平均查询次数

select avg(s.c) from (select count(*) c from sogou2 group by uid) s ;

 

4.3查询次数大于2次的用户总数

select sum(s.c) from (select count(*) c from sogou2 group by uid ) s where s.c>2; 

 

4.3查询次数大于2次的用户占比

select sum(m.c)/sum(n.d) from (select count(*) c from sogou2 group by uid having c>2) m , (select count(*) d from sogou2 group by uid) n;

 

5 Rank和关键词的分析

5.1 Rank在10以内的点击次数占比

select m.c/n.d from (select count(*)c from sogou2 where runk < 10)m,(select count(*)d from sogou2 )n;

5.2 直接输入URL作为查询词的比例

select m.c/n.d from (select count(*)c from sogou2 where name = url)m,(select count(*) d from sogou2)n;

 

 

 

 

 

 

原文地址:https://www.cnblogs.com/beiyi888/p/9634472.html