Mapreduce实例--求平均值

求平均数是MapReduce比较常见的算法，求平均数的算法也比较简单，一种思路是Map端读取数据，在数据输入到Reduce之前先经过shuffle，将map函数输出的key值相同的所有的value值形成一个集合value-list，然后将输入到Reduce端，Reduce端汇总并且统计记录数，然后作商即可。具体原理如下图所示：

操作环境：

Centos 7

jdk 1.8

hadoop-3.2.0

IDEA2019

实现内容：

将自定义的电商关于商品点击情况的数据文件，包含两个字段(商品分类，商品点击次数)，用" "分割，类似如下：

商品分类 商品点击次数
52127    5
52120    93
52092    93
52132    38
52006    462
52109    28
52109    43
52132    0
52132    34
52132    9
52132    30
52132    45
52132    24
52009    2615
52132    25
52090    13
52132    6
52136    0
52090    10
52024    347

使用mapreduce统计出每类商品的平均点击次数

商品分类 商品平均点击次数
52006    462
52009    2615
52024    347
52090    11
52092    93
52109    35
52120    93
52127    5
52132    23
52136    0

一、启动Hadoop集群，上传数据集文件到hdfs

hadoop fs -mkdir -p /mymapreduce4/in
hadoop fs -put /data/mapreduce4/goods_click /mymapreduce4/in

二、在IDEA中创建java项目，导入Jar包，如果清楚自己使用的集群的jar包，可使用maven导入指定的jar包。

　　为了避免Jar包冲突，使用hadoop/share目录下的所有jar包。

三、编写java代码程序

package mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MyAverage{
    public static class Map extends Mapper<Object , Text , Text , IntWritable>{
        private static Text newKey=new Text();
        public void map(Object key,Text value,Context context) throws IOException, InterruptedException{
            //将输入的纯文本文件数据转化成string
            String line=value.toString();
            System.out.println(line);
            //将值通过split()方法截取出来
            String arr[]=line.split("	");
            newKey.set(arr[0]);
            int click=Integer.parseInt(arr[1]);
            //将数据和值输入到reduce处理
            context.write(newKey, new IntWritable(click));
        }
    }
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>{
        public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{
            int num=0;
            int count=0;
            for(IntWritable val:values){
                //每个元素求和num
                num+=val.get();
                //统计元素的次数count
                count++;
            }
            //统计次数
            int avg=num/count;
            context.write(key,new IntWritable(avg));
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException{
        Configuration conf=new Configuration();
        System.out.println("start");
        Job job =new Job(conf,"MyAverage");
        job.setJarByClass(MyAverage.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path in=new Path("hdfs://172.18.74.137:9000/mymapreduce4/in/goods_click");
        Path out=new Path("hdfs://172.18.74.137:9000/mymapreduce4/out");
        FileInputFormat.addInputPath(job,in);
        FileOutputFormat.setOutputPath(job,out);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

四、运行查看结果

hadoop fs -ls /mymapreduce4/out
hadoop fs -cat /mymapreduce4/out/part-r-00000