A walkthrough of the WordCount process in Hadoop MapReduce

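Splitting the files

File 1, with the <key, value> pair each line yields after splitting:

hello world -> <0, "hello world">

this is wordcount -> <12, "this is wordcount">

File 2:

hello china -> <0, "hello china">

hello IT -> <12, "hello IT">

The test files are small, so each test file ends up as a single split.

The MapReduce framework performs this splitting itself. With the default text input format, each line becomes one record: the key is the byte offset at which the line starts in the file, and the value is the text of the line. The second line starts at offset 12 because the first line takes 11 characters plus a newline.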
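The way lines turn into <offset, line> records can be imitated in a few lines of plain Python. This is only a sketch of the idea, not Hadoop code: the function name `split_into_records` is made up for illustration, and it assumes the whole file fits in one split.

```python
def split_into_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: where the line starts. Value: the line without its line ending.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        # Advance by the full line length, including the newline.
        offset += len(raw_line)


file1 = b"hello world\nthis is wordcount\n"
print(list(split_into_records(file1)))
# [(0, 'hello world'), (12, 'this is wordcount')]
```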

Next, the resulting <key, value> pairs are handed to the user-defined map() method, which emits new <key, value> pairs:

<0, "hello world"> -> map() -> <hello,1> <world,1>

<12, "this is wordcount"> -> map() -> <this,1> <is,1> <wordcount,1>

<0, "hello china"> -> map() -> <hello,1> <china,1>

<12, "hello IT"> -> map() -> <hello,1> <IT,1>
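The map step can be sketched as a small Python generator. The name `wordcount_map` is hypothetical, and splitting on whitespace is an assumption about how words are tokenized; a real job would implement this logic in a Hadoop Mapper instead.

```python
def wordcount_map(offset, line):
    """Emit (word, 1) for every whitespace-separated word in the line.

    The offset key is accepted but not used, since word counting only
    needs the text of the line.
    """
    for word in line.split():
        yield word, 1


print(list(wordcount_map(12, "this is wordcount")))
# [('this', 1), ('is', 1), ('wordcount', 1)]
```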

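Between map() and reduce() there is a shuffle phase, which sorts each map task's output by key:

File 1:

<hello,1> <world,1> <this,1> <is,1> <wordcount,1> -> shuffle -> <hello,1> <is,1> <this,1> <wordcount,1> <world,1>

File 2:

<hello,1> <china,1> <hello,1> <IT,1> -> shuffle -> <IT,1> <china,1> <hello,1> <hello,1>

Hadoop compares Text keys byte by byte, so the uppercase IT sorts ahead of the lowercase words.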
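The sorting part of the shuffle can be approximated as follows. This sketch assumes keys are ordered by their UTF-8 bytes; `sort_by_key` is an illustrative name, not a Hadoop function.

```python
def sort_by_key(pairs):
    """Return the pairs sorted by the UTF-8 bytes of their key.

    sorted() is stable, so pairs with equal keys keep their original order.
    """
    return sorted(pairs, key=lambda kv: kv[0].encode("utf-8"))


map_output_file2 = [("hello", 1), ("china", 1), ("hello", 1), ("IT", 1)]
print(sort_by_key(map_output_file2))
# [('IT', 1), ('china', 1), ('hello', 1), ('hello', 1)]
```

Byte order puts uppercase ASCII letters before lowercase ones, which is why IT comes first here.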

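Grouping: within each map task's output, pairs with the same key are merged. Here the merged values are summed, as a combiner would do, so the two <hello,1> pairs from file 2 become the single value 2:

File 1:

<hello,1> -> <hello,list(1)>

<is,1> -> <is,list(1)>

<this,1> -> <this,list(1)>

<wordcount,1> -> <wordcount,list(1)>

<world,1> -> <world,list(1)>

File 2:

<IT,1> -> <IT,list(1)>

<china,1> -> <china,list(1)>

<hello,1> <hello,1> -> <hello,list(2)>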
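A map-side merge of this kind can be sketched with `itertools.groupby`. The sketch assumes the input is already sorted by key and that merging means summing the counts; `combine` is a hypothetical name.

```python
from itertools import groupby


def combine(sorted_pairs):
    """Sum the counts of adjacent pairs that share a key.

    groupby only groups neighbours, so the input must be sorted by key.
    """
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


sorted_file2 = [("IT", 1), ("china", 1), ("hello", 1), ("hello", 1)]
print(list(combine(sorted_file2)))
# [('IT', 1), ('china', 1), ('hello', 2)]
```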

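The results from both files are then merged by key, which gives the input to reduce():

<IT,list(1)>

<china,list(1)>

<hello,list(1,2)>

<is,list(1)>

<this,list(1)>

<wordcount,list(1)>

<world,list(1)>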
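Merging the per-file results into one <key, list of values> pair per key might look like this in plain Python. `group_for_reduce` is an illustrative name, and the sketch assumes a single reducer receives every key.

```python
def group_for_reduce(*map_outputs):
    """Collect the values of each key across all map outputs.

    Returns (key, values) pairs sorted by the UTF-8 bytes of the key.
    """
    grouped = {}
    for output in map_outputs:
        for key, value in output:
            grouped.setdefault(key, []).append(value)
    return sorted(grouped.items(), key=lambda kv: kv[0].encode("utf-8"))


file1 = [("hello", 1), ("is", 1), ("this", 1), ("wordcount", 1), ("world", 1)]
file2 = [("IT", 1), ("china", 1), ("hello", 2)]
for key, values in group_for_reduce(file1, file2):
    print(key, values)
```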

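These <key, value> pairs are then passed to the user-defined reduce() method, which produces the final <key, value> pairs; together they form the output of wordcount:

<IT,1>

<china,1>

<hello,3>

<is,1>

<this,1>

<wordcount,1>

<world,1>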
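The reduce step itself is just a sum over each key's list of values. A minimal sketch, with `wordcount_reduce` as a hypothetical name and the reduce input written out by hand:

```python
def wordcount_reduce(key, values):
    """Return the key with the total of all counts collected for it."""
    return key, sum(values)


reduce_input = [
    ("IT", [1]),
    ("china", [1]),
    ("hello", [1, 2]),
    ("is", [1]),
    ("this", [1]),
    ("wordcount", [1]),
    ("world", [1]),
]
result = [wordcount_reduce(key, values) for key, values in reduce_input]
for word, count in result:
    print(word, count)
```

Because the function only sums, it gives the same total whether the values arrive as 1, 1, 1 or as partial sums such as 1, 2.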