MapReduce中的InputFormat

InputFormat在hadoop源码中是一个抽象类 public abstract class InputFormat<K, V>

https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/InputFormat.java

可以参考文章

https://cloud.tencent.com/developer/article/1043622

其中有两个抽象方法

  public abstract 
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;

  public abstract 
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException, 
                                                 InterruptedException;

 getSplits方法负责将输入的文件做一个逻辑上的切分,切分成一个List<InputSplit>,InputSplit的源码在

https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/InputSplit.java

 在下文中提到 InputSplit是一个逻辑概念,并没有对实际文件进行切分,它只包含一些元数据信息,比如数据的起始位置,数据长度,数据所在的节点等

https://cloud.tencent.com/developer/article/1481777
原文地址:https://www.cnblogs.com/tonglin0325/p/13750952.html