Hadoop 1.1.2 Development Notes (Part 1)

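Following Hadoop: The Definitive Guide, this post builds a word-count program. The first step is to pull in the dependencies for the matching Hadoop version. I manage the project with Maven, so I added this dependency to pom.xml: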

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.1.2</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

Write the map class, which breaks the work down: each input line is split into words.

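import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Both objects are reused across calls instead of being allocated per token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for every token.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}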
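To make the mapper's behaviour concrete, here is a minimal plain-Java sketch of the same tokenizing loop, without any Hadoop classes. The class name MapSketch and the method mapLine are made up for this illustration; the pairs are collected in a list instead of being written to a Context.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for WordCountMapper.map(), for illustration only.
class MapSketch {

    /** Returns the (word, 1) pairs that the loop in map() would emit for one line. */
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        // StringTokenizer splits on whitespace only by default.
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emitted.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Punctuation and case are kept, so "Hello," and "hello" are different keys.
        System.out.println(mapLine("Hello, hello world"));
    }
}
```

Because the tokenizer only splits on whitespace, "Hello," and "hello" end up as separate words in the final counts. If that matters for your data, the line would need to be normalized before tokenizing.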

Write the Reduce class, which performs the reduction: it sums the counts for each word.

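import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up every count received for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}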
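The job definition in this post registers this same class as the combiner as well as the reducer. A small plain-Java sketch, assuming two hypothetical map tasks, suggests why that is safe for a sum: adding partial sums gives the same total as adding the raw counts. The class ReduceSketch and its sum method are illustrative names, not part of Hadoop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the summing loop in WordCountReducer.reduce().
class ReduceSketch {

    /** Adds up all counts received for one key. */
    static int sum(List<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Counts for the key "hello" from two assumed map tasks.
        List<Integer> fromTaskA = Arrays.asList(1, 1, 1);
        List<Integer> fromTaskB = Arrays.asList(1, 1);

        // Without a combiner, the reducer receives all five 1s.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner, each task pre-sums locally and the reducer adds the partial sums.
        int combined = sum(Arrays.asList(sum(fromTaskA), sum(fromTaskB)));

        System.out.println(direct + " " + combined);
    }
}
```

This only works because addition is associative and commutative. A reducer that computed an average, for example, could not be reused as a combiner unchanged.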

Write the WordCount class, which defines the job:

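import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        /** Create a job and give it a name so its execution can be tracked. **/
        Job job = new Job(conf, "word count");

        /**
         * To run on a Hadoop cluster, the code must be packaged into a jar file
         * (Hadoop distributes this file across the cluster). setJarByClass names
         * a class, and Hadoop uses it to locate the jar that contains it.
         **/
        job.setJarByClass(WordCount.class);

        /** Set the map, combiner, and reduce classes. **/
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        /**
         * Input types for the map and reduce functions: nothing is set here
         * because the default TextInputFormat is used. For text files it cuts
         * the file into InputSplits and uses LineRecordReader to parse each
         * InputSplit into <key, value> pairs, where key is the position of the
         * line in the file and value is the line itself.
         **/

        /** Set the output key and value types of the map and reduce functions. **/
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        /** Set the input and output paths; the output directory must not exist yet. **/
        FileInputFormat.addInputPath(job, new Path("D:\\JAVA\\workspacejee\\hadoop\\path1"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\JAVA\\workspacejee\\hadoop\\path2"));

        /** Submit the job and wait for it to finish. **/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}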
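The comment about TextInputFormat says each record's key is the position of the line in the file. The following plain-Java sketch illustrates that idea by computing the starting byte offset of every line. It is a simplification that assumes "\n" line endings and UTF-8 text and ignores how splits are cut; LineOffsetSketch and records are made-up names, not Hadoop API.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the (offset, line) records a line-based reader produces.
class LineOffsetSketch {

    /** Maps the starting byte offset of each line to the line's text. */
    static Map<Long, String> records(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus the one-byte '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(records("hello world\nhello hadoop\n"));
        // prints {0=hello world, 12=hello hadoop}
    }
}
```

WordCountMapper declares its input key as Object and never reads it, which is why the offset has no effect on the word counts.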

Put the files whose words you want to count in the input path and run the WordCount class above. The results then appear in the output path.
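As a rough way to predict what the output should contain, the whole job can be imitated in plain Java on a few lines of text. This is only a sketch under stated assumptions: it runs in a single process, it uses a TreeMap to imitate the sorted keys a reducer receives (which matches Hadoop's ordering for plain ASCII words), and it formats each row as word, tab, count, as the default text output does. LocalWordCountSketch and wordCount are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of the word-count job, for checking results.
class LocalWordCountSketch {

    /** Returns one "word<TAB>count" row per distinct word, ordered by word. */
    static List<String> wordCount(List<String> lines) {
        // TreeMap keeps keys sorted, standing in for the shuffle-and-sort step.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken();
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            output.add(e.getKey() + "\t" + e.getValue());
        }
        return output;
    }

    public static void main(String[] args) {
        for (String row : wordCount(Arrays.asList("hello world", "hello hadoop"))) {
            System.out.println(row);
        }
    }
}
```

For the two sample lines this yields hadoop 1, hello 2, and world 1, which is the kind of content to expect in the result file under the output path.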

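I ran this on Windows Server 2003, where the job fails with a directory permission check error. The workaround is to modify the FileUtil class in the org.apache.hadoop.fs package and comment out the body of its checkReturnValue method.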

---------------------------------------------------------------------------

This Hadoop 1.1.2 Development Notes series is my own original work.

If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/05/09/3068233.html