MapReduce Library Classes

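Besides letting developers write their own map and reduce functions, Hadoop ships a library of commonly used mappers, reducers, and partitioners. These classes live in the org.apache.hadoop.mapred.lib package; in version 1.2.1 the package contains one interface and a number of classes. The org.apache.hadoop.mapreduce.lib package offers a related library, and some classes appear in both. The mapred package is the old API and the mapreduce package is the refactored new API; both can be used.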

Interface:

InputSampler.Sampler<K,V> Interface to sample using an InputFormat.

Classes:

BinaryPartitioner<V>	Partition `BinaryComparable` keys using a configurable part of the bytes array returned by `BinaryComparable.getBytes()`.
ChainMapper	The ChainMapper class allows multiple Mapper classes to be used within a single Map task.
ChainReducer	The ChainReducer class allows multiple Mapper classes to be chained after a Reducer within the Reducer task.
CombineFileInputFormat<K,V>	An abstract `InputFormat` that returns `CombineFileSplit`s from the `InputFormat.getSplits(JobConf, int)` method.
CombineFileRecordReader<K,V>	A generic RecordReader that can hand out different recordReaders for each chunk in a `CombineFileSplit`.
CombineFileSplit	A sub-collection of input files.
DelegatingInputFormat<K,V>	An `InputFormat` that delegates behaviour of paths to multiple other InputFormats.
DelegatingMapper<K1,V1,K2,V2>	A `Mapper` that delegates behaviour of paths to multiple other mappers.
FieldSelectionMapReduce<K,V>	This class implements a mapper/reducer class that can be used to perform field selections in a manner similar to unix cut.
HashPartitioner<K2,V2>	Partition keys by their `Object.hashCode()`.
IdentityMapper<K,V>	Implements the identity function, mapping inputs directly to outputs.
IdentityReducer<K,V>	Performs no reduction, writing all input values directly to the output.
InputSampler<K,V>	Utility for collecting samples and writing a partition file for `TotalOrderPartitioner`.
InputSampler.IntervalSampler<K,V>	Sample from s splits at regular intervals.
InputSampler.RandomSampler<K,V>	Sample from random points in the input.
InputSampler.SplitSampler<K,V>	Samples the first n records from s splits.
InverseMapper<K,V>	A `Mapper` that swaps keys and values.
KeyFieldBasedComparator<K,V>	This comparator implementation provides a subset of the features provided by the Unix/GNU Sort.
KeyFieldBasedPartitioner<K2,V2>	Defines a way to partition keys based on certain key fields (also see `KeyFieldBasedComparator`).
LongSumReducer<K>	A `Reducer` that sums long values.
MultipleInputs	This class supports MapReduce jobs that have multiple input paths with a different `InputFormat` and `Mapper` for each path.
MultipleOutputFormat<K,V>	This abstract class extends the FileOutputFormat, allowing the output data to be written to different output files.
MultipleOutputs	The MultipleOutputs class simplifies writing to additional outputs other than the job default output via the `OutputCollector` passed to the `map()` and `reduce()` methods of the `Mapper` and `Reducer` implementations.
MultipleSequenceFileOutputFormat<K,V>	This class extends the MultipleOutputFormat, allowing the output data to be written to different output files in sequence file output format.
MultipleTextOutputFormat<K,V>	This class extends the MultipleOutputFormat, allowing the output data to be written to different output files in Text output format.
MultithreadedMapRunner<K1,V1,K2,V2>	Multithreaded implementation of `org.apache.hadoop.mapred.MapRunnable`.
NLineInputFormat	NLineInputFormat which splits N lines of input as one split.
NullOutputFormat<K,V>	Consume all outputs and put them in /dev/null.
RegexMapper<K>	A `Mapper` that extracts text matching a regular expression.
TokenCountMapper<K>	A `Mapper` that maps text values into <token,freq> pairs.
TotalOrderPartitioner<K extends WritableComparable,V>	Partitioner effecting a total order by reading split points from an externally generated source.
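
The `HashPartitioner` row above is easy to illustrate without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: it assumes the usual formula `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and the class name `HashPartitionSketch`, the sample keys, and the reducer count of 3 are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class HashPartitionSketch {
    // Assumed formula: clear the sign bit so the value is never negative,
    // then take the remainder by the number of reduce tasks.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Hypothetical keys and reducer count, chosen only for the demo.
        List<String> keys = Arrays.asList("hadoop", "map", "reduce", "partition");
        int numReduceTasks = 3;
        for (String key : keys) {
            System.out.println(key + " -> partition " + getPartition(key, numReduceTasks));
        }
    }
}
```

Equal keys always land in the same partition, which is what lets a single reducer see every value for a given key. Masking the sign bit matters because `hashCode()` may be negative, and a negative remainder would not be a valid partition index.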

So far I have used the following classes; the remaining classes and the interface will be studied later.

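1) ChainMapper and ChainReducer: run several mappers within a single map task, then a reducer, and optionally further mappers after the reducer inside the reduce task. Used together, the two classes cover cases that would otherwise require several MapReduce jobs run one after another. Because records are handed from one stage of the chain to the next instead of being written out as the output of a separate job, this approach can noticeably reduce disk I/O.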

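2) TokenCounterMapper: splits each input value into individual words (using Java's StringTokenizer) and emits every word together with a count of 1. This is the new-API class name; its old-API counterpart in the list above is TokenCountMapper.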

3) InverseMapper: a mapper that swaps keys and values.
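
To make the three descriptions above concrete, here is a plain-Java sketch that imitates a token-counting mapper, a summing reducer, and an inverse mapper run back to back, in the spirit of a ChainMapper/ChainReducer pipeline. It deliberately uses no Hadoop classes: `ChainSketch`, `tokenCount`, `sumByKey`, `inverse`, and `run` are hypothetical names, the in-memory lists stand in for the records passed between stages, and the `TreeMap` only approximates the sorted grouping that the shuffle performs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class ChainSketch {
    // Imitates a token-counting mapper: split a line with StringTokenizer
    // and emit one (token, 1) pair per token.
    static List<Map.Entry<String, Integer>> tokenCount(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return out;
    }

    // Stand-in for a summing reducer: group pairs by key (in sorted order)
    // and add up the values of each group.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Imitates an inverse mapper: turn (key, value) into (value, key).
    static <K, V> Map.Entry<V, K> inverse(Map.Entry<K, V> pair) {
        return new SimpleEntry<>(pair.getValue(), pair.getKey());
    }

    // Runs the three stages in sequence, passing records in memory.
    static List<Map.Entry<Integer, String>> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(tokenCount(line));
        }
        List<Map.Entry<Integer, String>> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : sumByKey(mapped).entrySet()) {
            result.add(inverse(entry));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, chosen only for the demo.
        List<String> lines = Arrays.asList("to be or not to be", "to do");
        for (Map.Entry<Integer, String> pair : run(lines)) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

In a real old-API job the equivalent wiring would go through `ChainMapper.addMapper`, `ChainReducer.setReducer`, and `ChainReducer.addMapper` on a `JobConf`; that configuration is not shown here. The point of the sketch is only that each stage consumes the previous stage's records directly, which is where the saving in disk I/O comes from.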

References:

1. Hadoop API documentation

2. Hadoop: The Definitive Guide