Hadoop HelloWorld

Getting started with Hadoop:

1. Download the latest release and extract it to a path of your choice.

2. Configure. Hadoop's configuration files live in the ~/hadoop/conf/ directory. For now I have only configured core-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/Jack/dfs</value>
    </property>
</configuration>

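Above I specified where Hadoop's DFS lives: fs.default.name is the URI of the default file system, and hadoop.tmp.dir is the base directory under which the DFS data is stored by default.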
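To make the name/value structure of the file concrete, here is a minimal sketch that reads the same property pairs with the JDK's built-in XML parser. This is only an illustration: Hadoop itself loads the file through its own Configuration class, and the class name CoreSiteReader and its method are hypothetical names of my own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop: lists the properties in a core-site.xml-style document.
final class CoreSiteReader {

    /** Returns every <property> as a name -> value entry, in document order. */
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a valid configuration document", e);
        }
    }

    public static void main(String[] args) {
        // Same two properties as in the core-site.xml above.
        String xml = "<?xml version=\"1.0\"?>"
                + "<configuration>"
                + "<property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>"
                + "<property><name>hadoop.tmp.dir</name><value>/home/Jack/dfs</value></property>"
                + "</configuration>";
        readProperties(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Running it prints one "name = value" line per property, which is a quick way to spot a typo in a property name before starting the daemons.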

3. Format the DFS by running: > ./hadoop namenode -format

4. Start Hadoop by running: > ./start-all.sh

At this point Hadoop has started normally, and the cluster status can be viewed on ports 50070 (NameNode) and 50030 (JobTracker).
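If a browser is not at hand, a plain TCP connect is a rough way to see whether the daemons are listening. The sketch below only checks that something accepts connections on a port; it does not verify that the service behind it is Hadoop. PortCheck and isListening are hypothetical names, and port 9000 is taken from the fs.default.name value configured earlier.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: reports whether anything accepts TCP connections on a port.
final class PortCheck {

    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unknown host.
            return false;
        }
    }

    public static void main(String[] args) {
        // 9000: fs.default.name from core-site.xml; 50070 and 50030: the status pages.
        int[] ports = {9000, 50070, 50030};
        for (int port : ports) {
            String state = isListening("localhost", port, 1000) ? "listening" : "not reachable";
            System.out.println("localhost:" + port + " -> " + state);
        }
    }
}
```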

======================================================================

First program: HadoopHelloWorld

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class HadoopHelloWorld {

    // Mapper: splits each input line into tokens and emits (word, 1) for every token.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums all the counts collected for one word and emits (word, total).
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HadoopHelloWorld.class);
        conf.setJobName("wordcount");

        // Types of the keys and values written by the job.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        // Plain text in, plain text out.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] is the input directory, args[1] the output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job and wait for it to finish.
        JobClient.runJob(conf);
    }
}
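To see what the job computes without a cluster, the map, shuffle and reduce steps can be imitated in plain Java. This is only a single-process sketch of the data flow, not how Hadoop executes the job: LocalWordCount and its methods are hypothetical names, a TreeMap stands in for the sorted shuffle, and the two input lines are made-up samples because the post does not show the contents of file01 and file02.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of the word count data flow.
final class LocalWordCount {

    /** Map step: emit one (word, 1) pair per whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle step: group the emitted values by key, keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum all the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, one line per file.
        List<String> lines = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(wordCount(lines));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

With this sample input the map step emits 8 pairs and the reduce step produces 5 words, one output record per distinct key.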


Basic libraries that need to be on the build path:

JRE system Library

Hadoop-core.jar

commons-logging.jar

A note: other documents do not mention that commons-logging.jar is required, but without it I kept getting this error: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
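A quick way to find out up front whether a jar is missing is to ask the class loader for one class from each library. The sketch below assumes it is run with the same classpath as the project; ClasspathCheck is a hypothetical name, and the two class names are the LogFactory class from the error above and the JobConf class used by the program.

```java
// Hypothetical helper: checks whether required classes can be found on the classpath.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.commons.logging.LogFactory", // provided by commons-logging.jar
            "org.apache.hadoop.mapred.JobConf"       // provided by the Hadoop core jar
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```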

With the above in place, compile HadoopHelloWorld.java, put the generated class files into the folder ~/source/java2013/HadoopHelloWorld/, and package them into a jar.

[Jack@win bin]$ jar -cvf HadoopHelloWorld.jar -C ~/source/java2013/HadoopHelloWorld/ .

Upload two input files (file01, file02) as the program's input.

[Jack@win bin]$ ./hadoop fs -mkdir input

[Jack@win bin]$ ./hadoop dfs -put ~/source/java2012/FirstJar/input/file* input

Run the program:

[Jack@win bin]$ ./hadoop jar HadoopHelloWorld.jar HadoopHelloWorld input output

13/06/20 03:16:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/20 03:16:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/20 03:16:45 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/20 03:16:45 INFO mapred.FileInputFormat: Total input paths to process : 4
13/06/20 03:16:45 INFO mapred.JobClient: Running job: job_201306200226_0002
13/06/20 03:16:46 INFO mapred.JobClient: map 0% reduce 0%
13/06/20 03:16:59 INFO mapred.JobClient: map 40% reduce 0%
13/06/20 03:17:05 INFO mapred.JobClient: map 80% reduce 0%
13/06/20 03:17:08 INFO mapred.JobClient: map 80% reduce 26%
13/06/20 03:17:11 INFO mapred.JobClient: map 100% reduce 26%
13/06/20 03:17:23 INFO mapred.JobClient: map 100% reduce 100%
13/06/20 03:17:28 INFO mapred.JobClient: Job complete: job_201306200226_0002
13/06/20 03:17:28 INFO mapred.JobClient: Counters: 30
13/06/20 03:17:28 INFO mapred.JobClient: Job Counters 
13/06/20 03:17:28 INFO mapred.JobClient: Launched reduce tasks=1
13/06/20 03:17:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=32074
13/06/20 03:17:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/20 03:17:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/06/20 03:17:28 INFO mapred.JobClient: Launched map tasks=5
13/06/20 03:17:28 INFO mapred.JobClient: Data-local map tasks=3
13/06/20 03:17:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=23534
13/06/20 03:17:28 INFO mapred.JobClient: File Input Format Counters 
13/06/20 03:17:28 INFO mapred.JobClient: Bytes Read=54
13/06/20 03:17:28 INFO mapred.JobClient: File Output Format Counters 
13/06/20 03:17:28 INFO mapred.JobClient: Bytes Written=41
13/06/20 03:17:28 INFO mapred.JobClient: FileSystemCounters
13/06/20 03:17:28 INFO mapred.JobClient: FILE_BYTES_READ=104
13/06/20 03:17:28 INFO mapred.JobClient: HDFS_BYTES_READ=541
13/06/20 03:17:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=128481
13/06/20 03:17:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
13/06/20 03:17:28 INFO mapred.JobClient: Map-Reduce Framework
13/06/20 03:17:28 INFO mapred.JobClient: Map output materialized bytes=128
13/06/20 03:17:28 INFO mapred.JobClient: Map input records=2
13/06/20 03:17:28 INFO mapred.JobClient: Reduce shuffle bytes=122
13/06/20 03:17:28 INFO mapred.JobClient: Spilled Records=16
13/06/20 03:17:28 INFO mapred.JobClient: Map output bytes=82
13/06/20 03:17:28 INFO mapred.JobClient: Total committed heap usage (bytes)=912719872
13/06/20 03:17:28 INFO mapred.JobClient: CPU time spent (ms)=5190
13/06/20 03:17:28 INFO mapred.JobClient: Map input bytes=50
13/06/20 03:17:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=487
13/06/20 03:17:28 INFO mapred.JobClient: Combine input records=0
13/06/20 03:17:28 INFO mapred.JobClient: Reduce input records=8
13/06/20 03:17:28 INFO mapred.JobClient: Reduce input groups=5
13/06/20 03:17:28 INFO mapred.JobClient: Combine output records=0
13/06/20 03:17:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=932745216
13/06/20 03:17:28 INFO mapred.JobClient: Reduce output records=5
13/06/20 03:17:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2390478848
13/06/20 03:17:28 INFO mapred.JobClient: Map output records=8
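The counters in this log can be cross-checked against each other. As a rough sketch, the snippet below pulls the "name=value" pairs out of JobClient log lines and compares two pairs that should agree for this job: with no combiner configured, every map output record reaches the reducer, and the reducer writes one record per key group. JobCounterParser is a hypothetical name, and the regular expression is my own guess at the line format based only on the output shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts "name=value" counters from JobClient log lines.
final class JobCounterParser {

    // Matches "... mapred.JobClient: <counter name>=<number>" at the end of a line.
    private static final Pattern COUNTER =
            Pattern.compile("mapred\\.JobClient:\\s*(.+)=(\\d+)\\s*$");

    static Map<String, Long> parse(String log) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : log.split("\\R")) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Four lines copied from the job output above.
        String log = String.join("\n",
                "13/06/20 03:17:28 INFO mapred.JobClient: Reduce input records=8",
                "13/06/20 03:17:28 INFO mapred.JobClient: Reduce input groups=5",
                "13/06/20 03:17:28 INFO mapred.JobClient: Reduce output records=5",
                "13/06/20 03:17:28 INFO mapred.JobClient: Map output records=8");
        Map<String, Long> c = parse(log);
        System.out.println("map output == reduce input: "
                + c.get("Map output records").equals(c.get("Reduce input records")));
        System.out.println("reduce groups == reduce output: "
                + c.get("Reduce input groups").equals(c.get("Reduce output records")));
    }
}
```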

Result