Hadoop with the Tool interface

Hadoop jobs are often launched from the command line, so each job has to
read, parse, and process command-line arguments. To save every developer
from rewriting this code, Hadoop provides the org.apache.hadoop.util.Tool interface.

Sample code:

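package chapter3;

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// WordCount, TokenizerMapper and IntSumReducer are assumed to be defined
// elsewhere in this package.
public class WordcountWithTools extends Configured implements Tool {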

	public int run(String[] args) throws Exception {
		if (args.length < 2) {
			System.out.println("Usage: chapter3.WordcountWithTools <inDir> <outDir>");
			ToolRunner.printGenericCommandUsage(System.out);
			System.out.println("");
			return -1;
		}

		System.out.println(Arrays.toString(args));
		// just for test
		System.out.println(getConf().get("test"));

		Job job = new Job(getConf(), "word count");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(TokenizerMapper.class);
		// Uncomment this to enable the combiner
		// job.setCombinerClass(IntSumReducer.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		// delete target if exists
		FileSystem.get(getConf()).delete(new Path(args[1]), true);
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Return a non-zero exit code if the job fails
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		int res = ToolRunner.run(new Configuration(), new WordcountWithTools(),
				args);
		System.exit(res);
	}

}

Generic options supported are:

-conf <configuration file>                    specify an application configuration file
-D <property=value>                           use value for given property
-fs <local|namenode:port>                     specify a namenode
-jt <local|jobtracker:port>                   specify a job tracker
-files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath
-archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines

The general command line syntax is:

bin/hadoop command [genericOptions] [commandOptions]

Pay close attention to the order here. I once got it wrong: I put -input and -output first, and the -D and -libjars options that followed had no effect.
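
To make the ordering pitfall concrete, here is a minimal plain-Java sketch. It is not Hadoop's GenericOptionsParser; the class name GenericOptionsSketch and its parse method are made up for illustration, and it only understands -D. It models one rule, which as far as I understand is how the real parser behaves: parsing stops at the first argument that is not a generic option, and everything from that point on is handed to run() untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration only: this is NOT Hadoop's GenericOptionsParser.
// It models a single rule: option parsing stops at the first argument
// that is not a recognized generic option (here, only -D is recognized).
final class GenericOptionsSketch {

    // Properties collected from the -D options that were actually parsed.
    final Map<String, String> properties = new LinkedHashMap<>();
    // Everything left over, which the tool's run() method would receive.
    final List<String> remaining = new ArrayList<>();

    static GenericOptionsSketch parse(String... args) {
        GenericOptionsSketch result = new GenericOptionsSketch();
        int i = 0;
        while (i < args.length) {
            String arg = args[i];
            String pair = null;
            int consumed = 0;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[i + 1];      // form: -D key=value
                consumed = 2;
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2); // form: -Dkey=value
                consumed = 1;
            }
            int eq = (pair == null) ? -1 : pair.indexOf('=');
            if (eq <= 0) {
                break; // first non-generic argument: stop parsing here
            }
            result.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            i += consumed;
        }
        // Whatever was not consumed is passed through unchanged.
        result.remaining.addAll(Arrays.asList(args).subList(i, args.length));
        return result;
    }

    public static void main(String[] args) {
        // Generic options first: the property is picked up.
        GenericOptionsSketch good = parse("-Dtest=lovejava", "/data/input/", "/data/output/");
        System.out.println(good.properties + " " + good.remaining);
        // prints: {test=lovejava} [/data/input/, /data/output/]

        // Generic options last: -D is treated as an ordinary argument.
        GenericOptionsSketch bad = parse("/data/input/", "/data/output/", "-Dtest=lovejava");
        System.out.println(bad.properties + " " + bad.remaining);
        // prints: {} [/data/input/, /data/output/, -Dtest=lovejava]
    }
}
```

In the second call the property never reaches the configuration, which matches the symptom described above: the option is silently ignored rather than rejected.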

Usage examples:

JAR_NAME=/home/hadoop/workspace/myhadoop/target/myhadoop-0.0.1-SNAPSHOT.jar
MAIN_CLASS=chapter3.WordcountWithTools
INPUT_DIR=/data/input/
OUTPUT_DIR=/data/output/
hadoop jar $JAR_NAME $MAIN_CLASS -Dtest=lovejava $INPUT_DIR $OUTPUT_DIR

This passes a test property with -D; the code reads it with getConf().get("test") and prints its value.

JAR_NAME=/home/hadoop/workspace/myhadoop/target/myhadoop-0.0.1-SNAPSHOT.jar
MAIN_CLASS=chapter3.WordcountWithTools
INPUT_DIR=/home/hadoop/data/test1.txt
OUTPUT_DIR=/home/hadoop/data/output/
hadoop jar $JAR_NAME $MAIN_CLASS -Dtest=lovejava -fs=file:/// -files=home/hadoop/data/test2.txt \
    $INPUT_DIR $OUTPUT_DIR

This tests processing files on the local file system.

JAR_NAME=/home/hadoop/workspace/myhadoop/target/myhadoop-0.0.1-SNAPSHOT.jar
MAIN_CLASS=chapter3.WordcountWithTools
INPUT_DIR=/home/hadoop/data/test1.txt
OUTPUT_DIR=/home/hadoop/data/output/
hadoop jar $JAR_NAME $MAIN_CLASS -conf=/home/hadoop/data/democonf.xml -fs=file:/// $INPUT_DIR $OUTPUT_DIR

This specifies a configuration file.

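-libjars uploads the third-party JARs that your MapReduce code references to HDFS. Each node then copies them to a local temporary directory when it runs the job, which avoids errors about referenced classes not being found.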
