Hadoop: the MapReduce Combiner

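Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop lets the user specify a combiner function to run on the map output; the combiner's output then forms the input to the reduce function. Because the combiner is an optimization, Hadoop does not guarantee how many times it will call it for a particular map output record, if at all. In other words, calling the combiner zero, one, or many times must produce the same output from the reducer.
This contract constrains the type of function that may be used as a combiner. Consider a job that finds the maximum temperature per year: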

Suppose the first map produces the following output:

(1950, 0)
(1950, 20)
(1950, 10)

and the second map produces:

(1950, 25)
(1950, 15)

The reduce function is then called with a list of all the values:

(1950, [0, 20, 10, 25, 15])

and its output is:

(1950, 25)

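If we use a combiner function that, like the reduce function, finds the maximum temperature in each map's output, then the reduce function is instead called with: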

(1950, [20, 25])

The reduce output is the same as before. The function calls on the temperature values can be expressed as follows:

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
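
The identity above can be checked with a small plain-Java simulation. This is only a sketch that mimics the data flow with in-memory lists; it does not use the Hadoop API, and the class and method names (MaxCombinerDemo, max) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class MaxCombinerDemo {
    // The same logic serves as both "combiner" and "reducer": keep the largest value.
    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        List<Integer> map1 = Arrays.asList(0, 20, 10); // output of the first map
        List<Integer> map2 = Arrays.asList(25, 15);    // output of the second map

        // Without a combiner: the reducer sees every value from every map.
        int direct = max(Arrays.asList(0, 20, 10, 25, 15));

        // With a combiner: each map's output is collapsed to one value first.
        int combined = max(Arrays.asList(max(map1), max(map2)));

        System.out.println(direct + " " + combined); // prints "25 25"
    }
}
```

Only two values (20 and 25) reach the reduce step in the combined case, instead of five.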

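Not all functions have this property. For example, if we were calculating mean temperatures, we could not use the mean as the combiner function, because: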

mean(0, 20, 10, 25, 15) = 14

but:

mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
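
The mismatch can be reproduced in plain Java. The sketch below also shows one common workaround that the text above does not cover, so treat it as an assumption rather than part of the original example: the combiner emits a partial (sum, count) pair, and the division happens only once at the end. The names MeanCombinerDemo, SumCount, and combine are hypothetical, and nothing here uses the Hadoop API.

```java
class MeanCombinerDemo {
    // Partial aggregate a combiner could safely emit: running sum and number of readings.
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Merging partial aggregates gives the same result in any grouping.
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double mean() {
            return (double) sum / count;
        }
    }

    // "Combiner" step: collapse one map's readings into a (sum, count) pair.
    static SumCount combine(int... values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SumCount(sum, values.length);
    }

    // Plain arithmetic mean, used to show the mean-of-means mistake.
    static double mean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double naive = mean(mean(0, 20, 10), mean(25, 15));           // mean(10, 20)
        double exact = combine(0, 20, 10).merge(combine(25, 15)).mean(); // 70 / 5
        System.out.println(naive + " " + exact); // prints "15.0 14.0"
    }
}
```

The mean of means weights the two maps equally even though they hold different numbers of readings, which is why it drifts from the true value.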

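The combiner function does not replace the reduce function: the reducer is still needed to process records with the same key that come from different maps. What the combiner can do is cut down the amount of data transferred between mappers and reducers.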

Specifying a combiner

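    // Imports needed by this driver fragment:
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Inside the driver's main method (INPUT_PATH and OUT_PATH are defined elsewhere):
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureJob.class);
    job.setJobName("max temperature");

    // The names differ (add vs. set) because a job may read from several
    // input paths but writes to exactly one output directory.
    FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
    FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    // Set the combiner; the reducer class can be reused because max is
    // associative and commutative.
    job.setCombinerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // job.setInputFormatClass(...);  // not set, so the default TextInputFormat is used

    // Exit with 0 on success and 1 on failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);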
