In Action: Log Analysis (KPI)


This post follows an earlier article: http://blog.fens.me/hadoop-mapreduce-log-kpi/

I ported it from Hadoop 1.x to 2.x, though nothing major had to change. (Honestly, the accompanying video isn't worth watching; reading the article is the best way to learn this.)


First, set up the Hadoop project with Maven:

Download Maven, add the environment variables, replace Eclipse's default Maven configuration, move the default local repository location, and so on.

Unlike the original author, I didn't create the project from the Maven command line; creating it directly in Eclipse is convenient enough.

The important part is adding the dependencies to pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.admln</groupId>
  <artifactId>getKPI</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>getKPI</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.4</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>jdk.tools</groupId>
        <artifactId>jdk.tools</artifactId>
        <version>1.7</version>
        <scope>system</scope>
        <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
  </dependencies>
</project>

Then just let Maven download the jars. (The first download is slow because there is a lot to fetch; after that it's fast.) This dependency list is also the main difference from 1.x: the single hadoop-core artifact of 1.x is split in 2.x across hadoop-common, hadoop-hdfs, and the mapreduce client artifacts.


Now for the MapReduce jobs.

Their task is to extract KPI metrics from the logs.

Log format:

222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

Fields of interest:

  • remote_addr: client IP address, 222.68.172.190
  • remote_user: client user name (here the ignored placeholder "-")
  • time_local: access time and time zone, [18/Sep/2013:06:49:57 +0000]
  • request: requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
  • status: request status (200 means success), 200
  • body_bytes_sent: size of the response body sent to the client, 19939
  • http_referer: the page the visitor came from, "http://www.angularjs.cn/A00n"
  • http_user_agent: client browser information, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

KPI targets:

  • PV (PageView): page view counts
  • IP: distinct IPs per page
  • Time: PV per hour
  • Source: referrer domain counts
  • Browser: client browser/device counts

The MapReduce code, one class per KPI plus a shared parser:

KPI.java

package org.admln.kpi;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * @author admln
 *
 */
public class KPI {
    private String remote_addr;// client IP address
    private String remote_user;// client user name; the "-" placeholder is ignored
    private String time_local;// access time and time zone
    private String request;// requested URL and HTTP protocol
    private String status;// request status; 200 means success
    private String body_bytes_sent;// response body size sent to the client
    private String http_referer;// the page the request came from
    private String http_user_agent;// client browser information

    private boolean valid = true;// whether the record is valid


    public static KPI parser(String line) {
        KPI kpi = new KPI();
        String[] arr = line.split(" ");
        if(arr.length>11) {
            kpi.setRemote_addr(arr[0]);
            kpi.setRemote_user(arr[1]);
            kpi.setTime_local(arr[3].substring(1));
            kpi.setRequest(arr[6]);
            kpi.setStatus(arr[8]);
            kpi.setBody_bytes_sent(arr[9]);
            kpi.setHttp_referer(arr[10]);

            if(arr.length>12) {
                kpi.setHttp_user_agent(arr[11]+" "+arr[12]);
            }else {
                kpi.setHttp_user_agent(arr[11]);
            }

            // 4xx/5xx are HTTP errors; >= 400 rather than > 400, since 400 itself is Bad Request
            if(Integer.parseInt(kpi.getStatus())>=400) {
                kpi.setValid(false);
            }

        }else {
            kpi.setValid(false);
        }

        return kpi;

    }
    public static KPI filterPVs(String line) {
        KPI kpi = parser(line);
        Set<String> pages = new HashSet<String>();
        pages.add("/about");
        pages.add("/black-ip-list/");
        pages.add("/cassandra-clustor/");
        pages.add("/finance-rhive-repurchase/");
        pages.add("/hadoop-family-roadmap/");
        pages.add("/hadoop-hive-intro/");
        pages.add("/hadoop-zookeeper-intro/");
        pages.add("/hadoop-mahout-roadmap/");

        if (!pages.contains(kpi.getRequest())) {
            kpi.setValid(false);
        }
        return kpi;
    }

    public String getRemote_addr() {
        return remote_addr;
    }

    public void setRemote_addr(String remote_addr) {
        this.remote_addr = remote_addr;
    }

    public String getRemote_user() {
        return remote_user;
    }

    public void setRemote_user(String remote_user) {
        this.remote_user = remote_user;
    }

    public String getTime_local() {
        return time_local;
    }

    public Date getTime_local_Date() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss",Locale.US);
        return df.parse(this.time_local);
    }
    // for aggregating by hour
    public String getTime_local_Date_Hour() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
        return df.format(this.getTime_local_Date());
    }

    public void setTime_local(String time_local) {
        this.time_local = time_local;
    }

    public String getRequest() {
        return request;
    }

    public void setRequest(String request) {
        this.request = request;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getBody_bytes_sent() {
        return body_bytes_sent;
    }

    public void setBody_bytes_sent(String body_bytes_sent) {
        this.body_bytes_sent = body_bytes_sent;
    }

    public String getHttp_referer() {
        return http_referer;
    }

    public void setHttp_referer(String http_referer) {
        this.http_referer = http_referer;
    }

    public String getHttp_user_agent() {
        return http_user_agent;
    }

    public void setHttp_user_agent(String http_user_agent) {
        this.http_user_agent = http_user_agent;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }
}
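
To sanity-check the parser, a throwaway main like the following (my addition, not part of the original project) feeds it the sample log line from above and prints the extracted fields:

package org.admln.kpi;

// Quick manual check for KPI.parser, using the sample log line from the post.
public class KPIParserCheck {
    public static void main(String[] args) throws Exception {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
                + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 "
                + "\"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
        KPI kpi = KPI.parser(line);
        System.out.println(kpi.isValid());                 // true
        System.out.println(kpi.getRemote_addr());          // 222.68.172.190
        System.out.println(kpi.getTime_local());           // 18/Sep/2013:06:49:57
        System.out.println(kpi.getTime_local_Date_Hour()); // 2013091806
        System.out.println(kpi.getRequest());              // /images/my.jpg
        System.out.println(kpi.getStatus());               // 200
        System.out.println(kpi.getHttp_referer());         // "http://www.angularjs.cn/A00n" (quotes included)
    }
}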

KPIBrowser.java

package org.admln.kpi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.Text;

/**
 * @author admln
 *
 */
public class KPIBrowser {

    public static class browserMapper extends Mapper<Object,Text,Text,IntWritable> {
        Text word = new Text();
        IntWritable ONE = new IntWritable(1);
        @Override
        public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if(kpi.isValid()) {
                word.set(kpi.getHttp_user_agent());
                context.write(word, ONE);
            }
        }
    }

    public static class browserReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        int sum;
        public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
            sum = 0;
            for(IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/browser/output");

        Configuration conf = new Configuration();

        @SuppressWarnings("deprecation")
        Job job = new Job(conf,"get KPI Browser");

        job.setJarByClass(KPIBrowser.class);

        job.setMapperClass(browserMapper.class);
        job.setCombinerClass(browserReducer.class);
        job.setReducerClass(browserReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job,input);
        FileOutputFormat.setOutputPath(job,output);

        System.exit(job.waitForCompletion(true)?0:1);

    }
}

KPIIP.java

package org.admln.kpi;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 *
 */
public class KPIIP {
    // Mapper: emit (page, ip) pairs
    public static class ipMapper extends Mapper<Object,Text,Text,Text> {
        private Text word = new Text();
        private Text ips = new Text();

        @Override
        public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if(kpi.isValid()) {
                word.set(kpi.getRequest());
                ips.set(kpi.getRemote_addr());
                context.write(word, ips);
            }
        }
    }

    // Reducer: count distinct IPs per page
    public static class ipReducer extends Reducer<Text,Text,Text,Text> {
        private Text result = new Text();
        private Set<String> count = new HashSet<String>();

        public void reduce(Text key,Iterable<Text> values,Context context) throws IOException, InterruptedException {
            count.clear();// reset per key, or IPs from one page leak into the next

            for (Text val : values) {
                count.add(val.toString());
            }
            result.set(String.valueOf(count.size()));
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/ip/output");

        Configuration conf = new Configuration();

        @SuppressWarnings("deprecation")
        Job job = new Job(conf,"get KPI IP");
        job.setJarByClass(KPIIP.class);

        job.setMapperClass(ipMapper.class);
        // no combiner here: distinct counting is not associative, so running the
        // reducer as a combiner would feed per-split counts (not IPs) to the reducer
        job.setReducerClass(ipReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job,input);
        FileOutputFormat.setOutputPath(job,output);
        System.exit(job.waitForCompletion(true)?0:1);

    }
}
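
A note on the IP job: the original version registered ipReducer as a combiner, but a distinct-count reducer is not a valid combiner, since after local combining the reducer would receive per-split counts rather than IP addresses. The listing above simply drops the combiner. If shuffle volume matters, one alternative (a sketch of mine, not from the original article) is in-mapper deduplication, emitting each (page, ip) pair at most once per map task:

// Hypothetical drop-in for KPIIP.java (uses the same imports as the class above).
public static class dedupIpMapper extends Mapper<Object,Text,Text,Text> {
    private final Set<String> seen = new HashSet<String>();
    private final Text word = new Text();
    private final Text ips = new Text();

    @Override
    public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
        KPI kpi = KPI.parser(value.toString());
        // Set.add returns false for duplicates, so each pair goes out once per task
        if(kpi.isValid() && seen.add(kpi.getRequest()+"\t"+kpi.getRemote_addr())) {
            word.set(kpi.getRequest());
            ips.set(kpi.getRemote_addr());
            context.write(word, ips);
        }
    }
}

The seen set grows with the number of distinct (page, ip) pairs in a split, so this trades mapper memory for shuffle traffic.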

KPIPV.java

package org.admln.kpi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 *
 */
public class KPIPV {

    public static class pvMapper extends Mapper<Object,Text,Text,IntWritable> {
        private Text word = new Text();
        private final static IntWritable ONE = new IntWritable(1);

        public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.filterPVs(value.toString());
            if(kpi.isValid()) {
                word.set(kpi.getRequest());
                context.write(word, ONE);
            }
        }
    }

    public static class pvReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        IntWritable result = new IntWritable();
        public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key,result);
        }
    }

    public static void main(String[] args) throws Exception {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/pv/output");

        Configuration conf = new Configuration();

        @SuppressWarnings("deprecation")
        Job job = new Job(conf,"get KPI PV");

        job.setJarByClass(KPIPV.class);

        job.setMapperClass(pvMapper.class);
        job.setCombinerClass(pvReducer.class);
        job.setReducerClass(pvReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job,input);
        FileOutputFormat.setOutputPath(job,output);

        System.exit(job.waitForCompletion(true)?0:1);

    }

}

KPISource.java

package org.admln.kpi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 *
 */
public class KPISource {

    public static class sourceMapper extends Mapper<Object,Text,Text,IntWritable> {
        Text word = new Text();
        IntWritable ONE = new IntWritable(1);
        @Override
        public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if(kpi.isValid()) {
                word.set(kpi.getHttp_referer());
                context.write(word, ONE);
            }
        }
    }

    public static class sourceReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        int sum;
        public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
            sum = 0;
            for(IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/source/output");

        Configuration conf = new Configuration();

        @SuppressWarnings("deprecation")
        Job job = new Job(conf,"get KPI Source");

        job.setJarByClass(KPISource.class);

        job.setMapperClass(sourceMapper.class);
        job.setCombinerClass(sourceReducer.class);
        job.setReducerClass(sourceReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job,input);
        FileOutputFormat.setOutputPath(job,output);

        System.exit(job.waitForCompletion(true)?0:1);
    }
}

KPITime.java

package org.admln.kpi;

import java.io.IOException;
import java.text.ParseException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author admln
 *
 */
public class KPITime {

    public static class timeMapper extends Mapper<Object,Text,Text,IntWritable> {
        Text word = new Text();
        IntWritable ONE = new IntWritable(1);
        @Override
        public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
            KPI kpi = KPI.parser(value.toString());
            if(kpi.isValid()) {
                try {
                    word.set(kpi.getTime_local_Date_Hour());
                    // write inside the try: if the timestamp fails to parse,
                    // skip the record instead of emitting a stale key
                    context.write(word, ONE);
                } catch (ParseException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static class timeReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        int sum;
        public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
            sum = 0;
            for(IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Path input = new Path("hdfs://hadoop:9001/fens/kpi/input/");
        Path output = new Path("hdfs://hadoop:9001/fens/kpi/time/output");

        Configuration conf = new Configuration();

        @SuppressWarnings("deprecation")
        Job job = new Job(conf,"get KPI Time");

        job.setJarByClass(KPITime.class);

        job.setMapperClass(timeMapper.class);
        job.setCombinerClass(timeReducer.class);
        job.setReducerClass(timeReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job,input);
        FileOutputFormat.setOutputPath(job,output);

        System.exit(job.waitForCompletion(true)?0:1);

    }

}

All five MR jobs are much the same: each is WordCount with small changes. (The original author's version seems to have a few small bugs, which I spotted and fixed in the listings above.)
The Hadoop environment: Hadoop 2.2.0, JDK 1.7, a pseudo-distributed VM at IP 192.168.111.132.

The results: the original author filtered for his own fixed set of pages (the whitelist in KPI.filterPVs); in practice, extract whichever pages your own requirements call for.


Full code and the log file: http://pan.baidu.com/s/1qW5D63M

You can also get practice log data from elsewhere, for example from Sogou Labs: http://www.sogou.com/labs/dl/q.html


On CRON: I think one workable approach is this. Suppose the logs are produced by Tomcat, configured so that each day's logs are written into a directory named after the date. A shell script, run daily, executes hadoop commands to copy that day's Tomcat log directory up to HDFS (the naming scheme on HDFS also needs some thought), then runs the MR jobs; when they finish, the script copies the results on to HBase, Hive, MySQL, Redis, or wherever else applications need them. A sketch of the ingest step follows below.
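
As a sketch of that ingest step (hypothetical paths and class name, not from the original article), the daily copy could also be a small Java driver using the HDFS FileSystem API, triggered once a day by cron:

package org.admln.kpi;

import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a daily ingest driver; adjust the paths to your own layout.
public class DailyLogIngest {
    public static void main(String[] args) throws Exception {
        // Tomcat is assumed to write each day's logs into a directory named after the date
        String day = new SimpleDateFormat("yyyyMMdd").format(new Date());

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop:9001"), conf);

        Path local = new Path("/var/log/tomcat/" + day);  // hypothetical local log directory
        Path remote = new Path("/fens/kpi/input/" + day); // mirrors the jobs' input root

        // Copy the whole day's directory into HDFS, then hand off to the MR jobs
        fs.copyFromLocalFile(local, remote);
        // From here: run the KPI jobs against the new input, then export the
        // results to HBase/Hive/MySQL/Redis for the applications to use.
    }
}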


Corrections for anything I got wrong are welcome.


If you would be a great tree, do not contend with the grass; if the mind does not stir, what can the wind do?
Original post: https://www.cnblogs.com/admln/p/InAciton-logKPI.html