Partitioner: redirecting output from the Mapper

A common misconception for first-time MapReduce programmers is to use only a single reducer. After all, a single reducer sorts all of your data before processing, and who doesn't like sorted data? Our discussions regarding MapReduce expose the folly of such thinking: we would have ignored the benefits of parallel computation. With one reducer, our compute cloud has been demoted to a compute raindrop.

With multiple reducers, we need some way to determine the appropriate one to send each (key/value) pair outputted by a mapper. The default behavior is to hash the key to determine the reducer, and Hadoop enforces this strategy by use of the HashPartitioner class. Sometimes, though, the HashPartitioner will steer you awry. Let's return to the Edge class introduced in section 3.2.1.
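For reference, the strategy the HashPartitioner enforces boils down to hashing the entire key and taking the result modulo the number of reduce tasks. A minimal sketch of that idea against the same old-style org.apache.hadoop.mapred API (the class name here is illustrative, not Hadoop's own) looks like this:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of the default strategy: hash the whole key, clear the sign bit,
// and take the remainder modulo the number of reduce tasks.
public class HashStylePartitioner<K, V> implements Partitioner<K, V>
{
    @Override
    public int getPartition(K key, V value, int numReduceTasks)
    {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    @Override
    public void configure(JobConf conf) { }
}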
Suppose you used the Edge class to analyze flight information data to determine the
number of passengers departing from each airport. Such data may be

 (San Francisco, Los Angeles) Chuck Lam
(San Francisco, Dallas) James Warren

If you used HashPartitioner, the two rows could be sent to different reducers. The
number of departures would be processed twice and both times erroneously.

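To make the split concrete, assume (hypothetically, since the full Edge class is not reproduced in this excerpt) that Edge.hashCode() mixes both the departure and the arrival node. The two San Francisco records then hash to different values, and after the modulo against the reducer count they need not land in the same partition:

// Hypothetical illustration: when the hash covers both endpoints,
// two flights leaving the same airport need not share a partition.
public class SplitDemo
{
    public static void main(String[] args)
    {
        int numPartitions = 4; // example reducer count
        int h1 = ("San Francisco" + "Los Angeles").hashCode() & Integer.MAX_VALUE;
        int h2 = ("San Francisco" + "Dallas").hashCode() & Integer.MAX_VALUE;
        System.out.println(h1 % numPartitions); // these two partition numbers
        System.out.println(h2 % numPartitions); // are not guaranteed to match
    }
}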

How do we customize the partitioner for our applications? In this situation, we want all edges with a common departure point to be sent to the same reducer. This is done easily enough by hashing the departureNode member of the Edge:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class EdgePartitioner implements Partitioner<Edge, Writable>
{
    @Override
    public int getPartition(Edge key, Writable value, int numPartitions)
    {
        // Hash only the departure node so edges sharing a departure point go
        // to the same reducer; clearing the sign bit keeps the result non-negative.
        return (key.getDepartureNode().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(JobConf conf) { }
}

A custom partitioner only needs to implement two functions: configure() and
getPartition(). The former uses the Hadoop job configuration to configure the
partitioner, and the latter returns an integer between 0 and the number of reduce
tasks indexing the reducer to which the (key/value) pair will be sent.
The exact mechanics of the partitioner may be difficult to follow. Figure 3.2 illustrates
this for better understanding.
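To put EdgePartitioner to work, the job driver has to register it and request more than one reducer; otherwise partitioning is moot. A minimal driver sketch against the old JobConf API might look like the following (the driver class and job name are illustrative, and the usual mapper, reducer, and input/output settings are omitted):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class EdgePartitionerDriver
{
    public static void main(String[] args) throws Exception
    {
        // Illustrative driver fragment; class and job names are hypothetical.
        JobConf conf = new JobConf(EdgePartitionerDriver.class);
        conf.setJobName("departures-per-airport");

        // Route every Edge with the same departure node to the same reducer.
        conf.setPartitionerClass(EdgePartitioner.class);

        // Partitioning only matters when there is more than one reduce task.
        conf.setNumReduceTasks(4);

        // Mapper, reducer, key/value types, and input/output paths would be
        // configured here as usual before submitting the job.
        JobClient.runJob(conf);
    }
}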
Between the map and reduce stages, a MapReduce application must take the output
from the mapper tasks and distribute the results among the reducer tasks. This process
is typically called shuffling, because the output of a mapper on a single node may be
sent to reducers across multiple nodes in the cluster.

Original source: https://www.cnblogs.com/chenli0513/p/2290870.html