hadoop之定制自己的Partitioner

　　partitioner负责shuffle过程的分组部分,目的是让map出来的数据均匀分布在reducer上,当然,如果我们不需要数据均匀,那么这个时候可以自己定制符合要求的partitioner. 下面内容涉及到的源代码请参考https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html

Partitioner

1.Maperduce提供了四中的partitioner,如下图所示,它们都实现了Partitioner基类中的方法,源代码如下:

  public abstract class Partitioner<KEY, VALUE> {
  /** 
   * Get the partition number for a given key (hence record) given the total 
   * number of partitions i.e. number of reduce-tasks for the job.
   *   
   * <p>Typically a hash function on a all or a subset of the key.</p>
   *
   * @param key the key to be partioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  
}

2.默认的为 HashPartitioner,实现的分组方式如下:

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

3.BinaryPatitioner是partitioner的偏特化自雷,该类提供leftOffset和rightOffset,在计算时对key的[leftOffset,rightOffset]这个区间取hash.

4.KeyFieldBasedPartioner也是基于hash的partitionwe,和BinaryPatitioner不同,它提供了多个区间计算hash,当区间数为0时,退化成HashParitioner.5.TotalOrderPartitioner这个类可以实现输出的全排序。不同于以上3个partitioner，这个类并不是基于hash的。在另外一篇文章:hadoop之全排列

定制自己的partitioner

实现自己的partitioner只需要将自己的类继承Partitioner即可,如下面自定义分组代码:

public class MyPartition extends Partitioner<Text, LongWritable> {
    
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        return key.toString().contains("luoliang") ? 0 : 1;
    }    
}

在main函数中添加,实现定制的patitioner:

job.setPartitionerClass(MyPartitioner.class);

在文章Hadoop 2.2.0词频统计,实现了定制partitioner和combiner的单词统计的完整例子,可以作为参考.

待续