eclipse运行WordCount

1）

可以完全参考http://www.cnblogs.com/archimedes/p/4539751.html在eclipse下创建MapReduce工程，创建了MR工程，并完成WordCount.java的编写之后，运行WordCount.java，结果可能如图所示，原因是未设置MR读取文件的路径以及输出结果的路径，修改方法如下图所示

需要注意的就是，这里的in和out就是hdfs中的路径，in就是输入数据所在的路径，ou就是最后结果的输出路径。使用完全分布式运行MR程序，设置如下：

，其实Master:9000/user/input中只是存储了数据集的元数据(9000是hdfs-site.xml中配置的)，并没有存储真正的数据集。另外，第二次运行WordCounts时会提示output文件已存在，需要删除output才能正常运行。

以上在eclipse中点击run直接运行的方式只是在本地机器上运行mapreduce(单机模式)，可以在http://master:50030/jobtracker.jsp中看到Running Jobs是none，在Eclipse的控制台就是这种形式：

可以看到LocalJobRunner，就是使用本地主机运行MR，一直都是mapred.MapTask，即一直进行map操作，这就是因为没有把MR程序部署到集群上去。程序运行时间是54分钟。

2）

下图就是将MR部署到集群上之后，运行MR时候的情况：

可以看到，当map达到一定的比例时，map和reduce操作是并行运行的。

map运行完毕，reduce继续运行。

在http://master:50030/jobtracker.jsp中看到Running Jobs。

程序运行时间是17分9秒。集群中1个master，3个slave。

3）

如何是MR程序在集群上运行呢？

需要将eclipse中的MR程序打包，利用eclipse打包过程如下：

生成jar包之后，使用

bin/hadoop jar /home/hadoop/WordCount.jar org.apache.hadoop.examples.WordCount /user/input /user/output

其中： 1)/home/hadoop/WordCount.jar 指示jar包的位置

　　　2)org.apache.hadoop.examples.WordCount表示package org.apache.hadoop.examples（源程序中第一行生命了包）中的主类WordCount。

3)/user/input /user/output分别是hdfs中数据集的输入目录和运算结果的输出目录。

4）WordCount原码如下：

/**
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */


package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); 
    //JobConf conf=new JobConf();
    //
    //conf.setJar("org.apache.hadoop.examples.WordCount.jar");
   // conf.set("fs.default.name", "hdfs://Master:9000/");  
    //conf.set("hadoop.job.user","hadoop");    
    //指定jobtracker的ip和端口号，master在/etc/hosts中可以配置  
   // conf.set("mapred.job.tracker","Master:9001"); 
    /*
    FileSystem hdfs =FileSystem.get(conf);
    Path findf=new Path("/user/output");
    boolean isExists=hdfs.exists(findf);
    System.out.println("/user/output exit?"+isExists);
    if(isExists)
    {
        hdfs.delete(findf, true);
        System.out.println("delete /user/output");
        
    }
    */
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

这种代码可以直接在elipse中以单机模式运行，但是再次运行之前需要手动删除output目录，所以就想在程序中加入代码，检测output是否已经存在，是的话就删除，代码如下：

  1 /**
  2  *  Licensed under the Apache License, Version 2.0 (the "License");
  3  *  you may not use this file except in compliance with the License.
  4  *  You may obtain a copy of the License at
  5  *
  6  *      http://www.apache.org/licenses/LICENSE-2.0
  7  *
  8  *  Unless required by applicable law or agreed to in writing, software
  9  *  distributed under the License is distributed on an "AS IS" BASIS,
 10  *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 11  *  See the License for the specific language governing permissions and
 12  *  limitations under the License.
 13  */
 14 
 15 
 16 package org.apache.hadoop.examples;
 17 
 18 import java.io.IOException;
 19 import java.util.StringTokenizer;
 20 
 21 import org.apache.hadoop.conf.Configuration;
 22 import org.apache.hadoop.fs.Path;
 23 import org.apache.hadoop.io.IntWritable;
 24 import org.apache.hadoop.io.Text;
 25 import org.apache.hadoop.fs.FileSystem;
 26 import org.apache.hadoop.mapred.JobConf;
 27 import org.apache.hadoop.mapreduce.Job;
 28 import org.apache.hadoop.mapreduce.Mapper;
 29 import org.apache.hadoop.mapreduce.Reducer;
 30 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 31 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 32 import org.apache.hadoop.util.GenericOptionsParser;
 33 
 34 public class WordCount {
 35 
 36   public static class TokenizerMapper 
 37        extends Mapper<Object, Text, Text, IntWritable>{
 38     
 39     private final static IntWritable one = new IntWritable(1);
 40     private Text word = new Text();
 41       
 42     public void map(Object key, Text value, Context context
 43                     ) throws IOException, InterruptedException {
 44       StringTokenizer itr = new StringTokenizer(value.toString());
 45       while (itr.hasMoreTokens()) {
 46         word.set(itr.nextToken());
 47         context.write(word, one);
 48       }
 49     }
 50   }
 51   
 52   public static class IntSumReducer 
 53        extends Reducer<Text,IntWritable,Text,IntWritable> {
 54     private IntWritable result = new IntWritable();
 55 
 56     public void reduce(Text key, Iterable<IntWritable> values, 
 57                        Context context
 58                        ) throws IOException, InterruptedException {
 59       int sum = 0;
 60       for (IntWritable val : values) {
 61         sum += val.get();
 62       }
 63       result.set(sum);
 64       context.write(key, result);
 65     }
 66   }
 67 
 68   public static void main(String[] args) throws Exception {
 69     Configuration conf = new Configuration(); 
 70     //JobConf conf=new JobConf();
 71     //
 72     //conf.setJar("org.apache.hadoop.examples.WordCount.jar");
 73    // conf.set("fs.default.name", "hdfs://Master:9000/");  
 74     //conf.set("hadoop.job.user","hadoop");    
 75     //指定jobtracker的ip和端口号，master在/etc/hosts中可以配置  
 76    // conf.set("mapred.job.tracker","Master:9001"); 
 77     
 78     FileSystem hdfs =FileSystem.get(conf);
 79     Path findf=new Path("/eclipse-test5/output");
 80     boolean isExists=hdfs.exists(findf);
 81     System.out.println("/eclipse-test5/output exit?"+isExists);
 82     if(isExists)
 83     {
 84         hdfs.delete(findf, true);
 85         System.out.println("delete /eclipse-test5/output");
 86         
 87     }
 88     
 89     String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
 90     if (otherArgs.length != 2) {
 91       System.err.println("Usage: wordcount <in> <out>");
 92       System.exit(2);
 93     }
 94     Job job = new Job(conf, "word count");
 95     
 96     job.setJarByClass(WordCount.class);
 97     job.setMapperClass(TokenizerMapper.class);
 98     job.setCombinerClass(IntSumReducer.class);
 99     job.setReducerClass(IntSumReducer.class);
100     job.setOutputKeyClass(Text.class);
101     job.setOutputValueClass(IntWritable.class);
102     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
103     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
104     System.exit(job.waitForCompletion(true) ? 0 : 1);
105   }
106 }

78-88行代码实现检测output目录是否存在，存在的话就删除的功能。但是78-88行使用的hdfs的API却检测到output不存在，但是运行程序的时候却提示output已经存在，如图所示:

但是，如果将上述程序打成jar包再运行就不会出错。

5）

如果是单单使用HDFS提供的API对文件进行操作，又想直接在eclipse中直接运行，不想打jar包使用hadoop命令运行的话，可以在代码中加入以下三行代码：

conf.set("fs.default.name", "hdfs://Master:9000/");  
conf.set("hadoop.job.user","hadoop");    
//指定jobtracker的ip和端口号，master在/etc/hosts中可以配置  
conf.set("mapred.job.tracker","Master:9001");

这样可以实现不打jar包直接对hdfs进行操作的目的。

但是，将这三行代码加入WordCount中的话却会报错。

6)最后，需要搞清楚这三行代码到底做了什么？

conf.set("fs.default.name", "hdfs://Master:9000/");  
conf.set("hadoop.job.user","hadoop");    
//指定jobtracker的ip和端口号，master在/etc/hosts中可以配置  
conf.set("mapred.job.tracker","Master:9001");