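3. Getting Started with MapReduce (MR) Development

1. Prepare two files in advance, file1.txt and file2.txt. Their contents are excerpts copied from web pages; they are not representative and serve only as an example.

Contents of file1.txt: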

With this setup, whenever you change the content of ambari-web files on the host machine, brunch will pick up changes in the background and update. Because of the symbolic link, the changes are automatically picked up by Ambari Server. All you have to do is hit refresh on the browser to see the frontend code changes reflected.
Not seeing code changes as expected? If you have run the maven command to build Ambari previously, you will see files called app.js.gz and vendor.js.gz under the public folder. You need to delete these files for the frontend code changes to be effective, as the app.js.gz and vendor.js.gz files take precedence over app.js and vendor.js, respectively.

Contents of file2.txt:

Apache Eagle (incubating) is a highly extensible, scalable monitoring and alerting platform, designed with its flexible application framework and proven big data technologies, such as Kafka, Spark and Storm. It ships a rich set of applications for big data platform monitoring, service health check, JMX metrics, daemon logs, audit logs and yarn applications. External Eagle developers can define applications to monitoring their NoSQLs or Web Servers, and publish to Eagle application repository at your own discretion. It also provides the state-of-art alert engine to report security breaches, service failures, and application anomalies, highly customizable by the alert policy definition.

Then upload them to HDFS:

hadoop fs -put ~/file* /user/input/

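2. Create a Java project in Eclipse.

3. Import the Hadoop JAR files:

Download the JAR files under HADOOP_HOME/share/hadoop/ on the cluster, place them in a newly created lib folder in the project, and then add the JARs in the lib directory to the classpath.

4. Import the Hadoop configuration files:

Download core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml from HADOOP_HOME/etc/hadoop/ and place them in the src directory.

5. MR code implementation

1) WordMapper class implementation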

WordMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /**
     * Parses each line into key-value pairs and sends them to the reducer for counting.
     *
     * @param key     byte offset of the line within the file
     * @param value   content of the line
     * @param context the map context
     */
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for every token.
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, one);
        }
    }
}
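To see what the map step emits without a cluster, the tokenizing logic can be sketched in plain Java. This is only an illustration and not part of the Hadoop job: the class name MapSketch and the method emitPairs are hypothetical, and a list of entries stands in for the calls to context.write.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-alone sketch; it mimics WordMapper.map() without Hadoop.
class MapSketch {

    // Splits one line on whitespace and returns a (word, 1) pair per token,
    // in the order WordMapper would pass them to context.write(word, one).
    static List<Map.Entry<String, Integer>> emitPairs(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(st.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sample line loosely based on file1.txt.
        for (Map.Entry<String, Integer> pair : emitPairs("the files app.js.gz and vendor.js.gz, respectively.")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Because StringTokenizer splits only on whitespace by default, punctuation stays attached to the token, so "vendor.js.gz," and "vendor.js.gz" would be counted as different words.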

2) WordReducer class implementation

WordReducer.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    /**
     * Receives the key-value output of the map method; pairs with the same key
     * are sent to the same reduce call. Iterates over the values of the key,
     * adds them up, and writes the result to HDFS.
     *
     * @param key     key emitted by the map side
     * @param values  collection of values emitted by the map side for this key
     * @param context the reduce context
     */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
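The grouping that happens between map and reduce can be imitated in plain Java to make the data flow concrete. The following is a rough sketch under stated assumptions, not Hadoop code: ReduceSketch, shuffle, and reduce are hypothetical names, and a TreeMap stands in for the framework's sort-and-group step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch of shuffle + reduce; no Hadoop classes involved.
class ReduceSketch {

    // Groups one value of 1 per word under its key, roughly what the shuffle
    // phase does with map output. TreeMap keeps the keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Same summation as WordReducer.reduce(), on plain integers.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = shuffle(Arrays.asList("yarn", "logs", "yarn"));
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            // Prints "logs" with count 1, then "yarn" with count 2, tab-separated.
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```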

3) WordMain driver class implementation:

WordMain.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Driver class that launches an MR job.
 *
 * @author liudebin
 */
public class WordMain {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options (-D, -conf, ...) and keep the job's own arguments.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordCount <in> <out>");
            System.exit(2);
        }

        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordMain.class);          // main class
        job.setMapperClass(WordMapper.class);       // Mapper
        job.setCombinerClass(WordReducer.class);    // Combiner (map-side pre-aggregation)
        job.setReducerClass(WordReducer.class);     // Reducer
        job.setOutputKeyClass(Text.class);          // key class of the job output
        job.setOutputValueClass(IntWritable.class); // value class of the job output
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);            // wait for completion, then exit
    }
}
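The driver registers WordReducer as the combiner as well as the reducer. That works here because addition is associative and commutative, so summing partial sums gives the same total as summing every 1 directly. A small plain-Java sketch of that property follows; CombinerSketch and the two "map task" lists are hypothetical and only illustrate the arithmetic.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: shows why a summing reducer can double as a combiner.
class CombinerSketch {

    // The same summation WordReducer performs, on plain integers.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Values emitted for one word by two imaginary map tasks.
        List<Integer> mapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> mapTask2 = Arrays.asList(1, 1);

        // Without a combiner the reducer receives all five 1s.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner each map task pre-sums its values,
        // so the reducer receives only the partial sums 3 and 2.
        int combined = sum(Arrays.asList(sum(mapTask1), sum(mapTask2)));

        System.out.println(direct + " " + combined); // prints "5 5"
    }
}
```

An operation such as averaging would not survive this kind of reuse, since the average of partial averages generally differs from the overall average.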

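6. Package, deploy, and run

Deploy the packaged JAR file on the cluster's Master node; Hadoop distributes the tasks to the Slave nodes by itself.

1) Package the JAR file by exporting the project as a JAR. Note: do not export the files under the lib directory, because the cluster environment already has them.

2) Upload the packaged file to the cluster's Master node and run the following command: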

hadoop jar wordcount.jar com.upit.mr.op.WordMain /user/input/file* /user/output

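Note: the command refers to the main class by its fully qualified name, com.upit.mr.op.WordMain, so the classes must be declared in the package com.upit.mr.op (the listings above omit the package declaration). The output directory /user/output must not already exist, otherwise the job fails at startup.

Note: to run the program directly from Eclipse, you need to install and configure the Hadoop plugin in Eclipse.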