「hadoop」Win7 + IDEA + Maven: running the Hadoop WordCount example

A simple end-to-end Hadoop example; I have tested it successfully myself.

Assume the following environment is already set up:

1. Win7 running three Ubuntu VMs, each with a working Hadoop 2.8.1 installation;

2. IDEA 2017 installed on Win7;

3. Hadoop 2.8.1 installed on Win7, with the related environment variables (e.g. HADOOP_HOME and PATH) configured;

4. The precompiled Windows hadoop.dll and winutils.exe copied over; they must be the 2.8.1 builds. See https://github.com/steveloughran/winutils. A placement sketch follows this list.
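A minimal placement sketch, assuming HADOOP_HOME points at the Win7 Hadoop directory used in the code below (C:\LearnTool\hadoop); adjust the paths to your own layout:

:: put the 2.8.1 native binaries where Hadoop's Windows shims look for them
copy winutils.exe %HADOOP_HOME%\bin
copy hadoop.dll %HADOOP_HOME%\bin
:: some setups additionally need hadoop.dll on the system path
copy hadoop.dll C:\Windows\System32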

[Steps]

1. Follow http://blog.csdn.net/u011654631/article/details/70037219, referred to below as the reference page;

2. Create a Maven Java project in IDEA;

3. Add the Hadoop jars to pom.xml as shown on the reference page: hadoop-mapreduce-client-core, hadoop-hdfs, hadoop-mapreduce-client-jobclient (be sure to remove its provided scope; see note 7 below), hadoop-mapreduce-client-common, and hadoop-common (a dependency sketch follows this list);

4. Finally, view the word-count results with $ hdfs dfs -cat /test/out/part-r-00000; the full set of shell commands is sketched below.
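For step 3, a sketch of what the dependency section of pom.xml roughly looks like; the version matches the 2.8.1 environment above, and the exact list should follow the reference page:

<properties>
    <hadoop.version>2.8.1</hadoop.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <!-- no <scope>provided</scope> here, per step 3 and note 7 -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>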
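For step 4, the HDFS side of things, assuming the paths used in the code below (/test/testvim.txt as input, /test/out as output); run these on the cluster:

$ hdfs dfs -mkdir -p /test
$ hdfs dfs -put testvim.txt /test/
$ # ... run the job from IDEA, then:
$ hdfs dfs -cat /test/out/part-r-00000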

The WordCount code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class WordCount extends Configured implements Tool {
    public int run(String[] strings) throws Exception {
        try {
            System.setProperty("hadoop.home.dir", "C:\LearnTool\hadoop");
            System.setProperty("HADOOP_USER_NAME", "chendajian");

            Configuration conf = getConf();
            conf.set("mapreduce.job.jar", "C:\Workspace\javaweb\hadoop\out\artifacts\hadoop_jar\hadoop.jar");
//            conf.set("yarn.resourcemanager.hostname", "10.0.10.231");
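            // required when submitting from a Windows client to a Linux cluster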
            conf.set("mapreduce.app-submission.cross-platform", "true");

            Job job = Job.getInstance(conf);
            job.setJarByClass(WordCount.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            job.setMapperClass(WcMapper.class);
            job.setReducerClass(WcReducer.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // delete the output directory if it already exists, otherwise the job fails at startup
            FileSystem fs = FileSystem.get(conf);
            String out = "hdfs://10.0.10.231:9000/test/out";
            Path outPath = new Path(out);
            if (fs.exists(outPath)) {
                fs.delete(outPath, true);
            }

            FileInputFormat.setInputPaths(job, "hdfs://master:9000/test/testvim.txt");
            FileOutputFormat.setOutputPath(job, new Path(out));

            // return an exit code that reflects whether the job succeeded
            return job.waitForCompletion(true) ? 0 : 1;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 1;
    }

    public static class WcMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // tokenize each input line on whitespace and emit (word, 1) for every word
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new LongWritable(1));
                }
            }
        }
    }

    public static class WcReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
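            // sum the 1s emitted by the mapper for this word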
            long sum = 0;
            for (LongWritable lVal : values) {
                sum += lVal.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCount(), args));
    }
}

A few additional notes:

1. Copy core-site.xml, mapred-site.xml, yarn-site.xml, etc. from the cluster into the project's resources directory (a minimal core-site.xml sketch follows these notes);

2. If connecting to hdfs://master:9000 is refused, try replacing master with the IP address, or map the hostname in the Windows hosts file (C:\Windows\System32\drivers\etc\hosts), e.g. 10.0.10.231 master;

3. The input file lives inside HDFS; on the Linux side it can only be accessed through hdfs dfs commands;

4. The 2.8.1 builds of hadoop.dll and winutils.exe must be downloaded separately; see https://github.com/steveloughran/winutils;

5. For user-permission errors, add a HADOOP_USER_NAME environment variable on Win7 whose value is the Hadoop user name (e.g. setx HADOOP_USER_NAME chendajian, matching the name set in the code above);

6. Add a log4j.xml logging configuration file to the project's resources directory; for the contents see http://www.cnblogs.com/ftrako/p/7570094.html (a minimal sketch follows these notes);

7. Removing the provided scope from the hadoop-mapreduce-client-jobclient dependency in pom.xml means the job runs in local mode rather than YARN mode; which mode is actually used is controlled by the mapreduce.framework.name property (see the last sketch below).
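For note 1, a minimal core-site.xml sketch, assuming the NameNode address used in the code (hdfs://master:9000); in practice, copy the real files from the cluster rather than writing them by hand:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>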
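For note 6, a minimal log4j.xml sketch that only logs to the console (Hadoop 2.x uses log4j 1.2); the page linked above has the author's full version:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
    <appender name="console" class="org.apache.log4j.ConsoleAppender">
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern" value="%d{HH:mm:ss} %-5p %c{1} - %m%n"/>
        </layout>
    </appender>
    <root>
        <priority value="info"/>
        <appender-ref ref="console"/>
    </root>
</log4j:configuration>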
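For note 7: mapreduce.framework.name selects the execution mode and defaults to local; to actually submit to the cluster, the mapred-site.xml copied into resources should contain:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>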

Original post: https://www.cnblogs.com/ftrako/p/7570072.html