Hadoop之计数器与自定义计数器及Combiner的使用

1，计数器：

显示的计数器中分为四个组，分别为：File Output Format Counters、FileSystemCounters、File Input Format Counters和Map-Reduce Framkework。

分组File Input Format Counters包括一个计数器Bytes Read，表示job执行结束后输出文件的内容包括的字节数(空格、换行都是字符)

关于以上这段计数器日志中详细的说明请见下面的注释：

1    Counters: 19 // Counter表示计数器，19表示有19个计数器（下面一共4计数器组）
 2    File Output Format Counters // 文件输出格式化计数器组
 3      Bytes Written=19 // reduce输出到hdfs的字节数，一共19个字节
 4    FileSystemCounters// 文件系统计数器组
 5      FILE_BYTES_READ=481
 6      HDFS_BYTES_READ=38
 7      FILE_BYTES_WRITTEN=81316
 8      HDFS_BYTES_WRITTEN=19
 9    File Input Format Counters // 文件输入格式化计数器组
10      Bytes Read=19 // map从hdfs读取的字节数
11    Map-Reduce Framework // MapReduce框架
12      Map output materialized bytes=49
13      Map input records=2 // map读入的记录行数，读取两行记录,”hello you”,”hello me”
14      Reduce shuffle bytes=0 // 规约分区的字节数
15      Spilled Records=8
16      Map output bytes=35
17      Total committed heap usage (bytes)=266469376
18      SPLIT_RAW_BYTES=105
19      Combine input records=0 // 合并输入的记录数
20      Reduce input records=4 // reduce从map端接收的记录行数
21      Reduce input groups=3  // reduce函数接收的key数量，即归并后的k2数量
22      Combine output records=0 // 合并输出的记录数
23      Reduce output records=3 // reduce输出的记录行数。<helllo,{1,1}>,<you,{1}>,<me,{1}>
24      Map output records=4 // map输出的记录行数，输出4行记录

2，自定义计数器：

由于不同的场景有不同的计数器应用需求，我们可以自定义不同的计数器：

一：敏感词准备：

将文件中的hello设为敏感词，自定义计数器的目的是将文件中出现的计数器的次数给记录出来

二：代码部分：

仅需要修改之前博客中WordClass中的map的代码部分：

public class WordClass {
    public static class MyMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
           //动态申明
　　　　　　  Counter sensitiveCounter = context.getCounter("Sensitive Words", "hello");
            String line = value.toString();
           //假设hello是一个敏感词
            if (line.contains("hello")){
                sensitiveCounter.increment(1L);

            }
            String[] split = line.split("	");
            for (String word :
                    split) {
                context.write(new Text(word), new LongWritable(1));

            }

        }
    }

计数器声明

1.通过枚举声明 context.getCounter(Enum enum)

2.动态声明 context.getCounter(String groupName,String counterName)

计数器操作

counter.setValue(long value);//设置初始值

counter.increment(long incr);//增加计数

三：结果

在日志信息中可以看到文件中的敏感词信息

3，Combiner的使用

Combiner的作用就是在map端对数据进行合并，从而提高网络的通讯速率，减少map到reduce的网络带宽

我们可以本地把Map的输出做一个合并计算，把具有相同key的数据做一个计算，然后再把此输出作为reduce的输入，这样传给reduce的数据就少了很多。Combiner是用reducer来定义的，多数的情况下Combiner和reduce处理的是同一种逻辑，所以job.setCombinerClass()的参数可以直接使用定义的reduce，括号中可以直接传入类似于MyReducer.Class，当然也可以单独去定义一个有别于reduce的Combiner，继承Reducer，写法基本上定义reduce一样。

----那么，既然Combiner这么有用为什么不能将它作为默认设置呢？

----因为当有类似于求平均数的任务时，在map端执行Combiner会影响最终的结果，所以有些操作并不适合用Combiner，在工作使用中我们应该先选取小部分的数据进行测试，如果结果无误的话，则可以使用Combiner进行map端的数据合并