MapReduce: Weather Dataset

Analyzing a weather dataset with a MapReduce program is a good way to understand how a MapReduce computation proceeds.

Environment: Hadoop 1.2.1 and CentOS 6.5 x64

1. Preparing the weather dataset

Download link: ftp://ftp3.ncdc.noaa.gov/pub/data

The complete dataset is very large; a subset is enough for day-to-day experiments.

2. Uploading the weather data to HDFS

[huser@master 1971]$ ls
034700-99999-1971.gz  273730-99999-1971.gz  338850-99999-1971.gz  943290-99999-1971.gz
035623-99999-1971.gz  273930-99999-1971.gz  338870-99999-1971.gz  943320-99999-1971.gz
035833-99999-1971.gz  274020-99999-1971.gz  338890-99999-1971.gz  943330-99999-1971.gz
035963-99999-1971.gz  274120-99999-1971.gz  338930-99999-1971.gz  943350-99999-1971.gz
036880-99999-1971.gz  274280-99999-1971.gz  338960-99999-1971.gz  943400-99999-1971.gz
040180-16201-1971.gz  274790-99999-1971.gz  338980-99999-1971.gz  943430-99999-1971.gz
041650-99999-1971.gz  274850-99999-1971.gz  339020-99999-1971.gz  943549-99999-1971.gz
041750-99999-1971.gz  275020-99999-1971.gz  339070-99999-1971.gz  943550-99999-1971.gz
042350-99999-1971.gz  275090-99999-1971.gz  339100-99999-1971.gz  943660-99999-1971.gz
061800-99999-1971.gz  275320-99999-1971.gz  339150-99999-1971.gz  943670-99999-1971.gz
[huser@master 1971]$ zcat *.gz > sample.txt
[huser@master hadoop-1.2.1]$ bin/hadoop fs -put /home/huser/hadoop/1971/sample.txt /user/huser/in/

3. Writing the MapReduce program

The program below is excerpted from Hadoop: The Definitive Guide. It computes the maximum air temperature for each year.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: emits one (year, air temperature) pair for every valid record.
public class MaxTemperatureMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {
    // Sentinel value the dataset uses for a missing temperature reading.
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Fixed-width record: the year occupies columns 15-18.
        String year = line.substring(15, 19);
        int airTemperature;
        // The signed temperature occupies columns 87-91.
        if (line.charAt(87) == '+') { // before Java 7, parseInt rejected a
                                        // leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        // Column 92 holds the quality code; only codes 0, 1, 4, 5 and 9 pass.
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

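
The mapper treats every input line as a fixed-width record and reads three fields by column offset. To try those parsing rules without a cluster, the same logic can be lifted into plain Java. The sketch below is only an illustration: RecordParser and syntheticRecord are hypothetical names, and the test record is synthetic filler with just the three fields the mapper reads, not a real NCDC line.

```java
import java.util.Arrays;
import java.util.OptionalInt;

// Hypothetical helper, not part of the original program: the mapper's
// parsing rules extracted into plain Java.
final class RecordParser {
    private static final int MISSING = 9999;

    // Returns the year held in columns 15-18 of a fixed-width record.
    static String year(String line) {
        return line.substring(15, 19);
    }

    // Returns the temperature (columns 87-91), or empty when the reading is
    // missing or its quality code (column 92) is not one of 0, 1, 4, 5, 9.
    static OptionalInt temperature(String line) {
        int start = line.charAt(87) == '+' ? 88 : 87; // skip a leading plus sign
        int value = Integer.parseInt(line.substring(start, 92));
        String quality = line.substring(92, 93);
        if (value != MISSING && quality.matches("[01459]")) {
            return OptionalInt.of(value);
        }
        return OptionalInt.empty();
    }

    // Builds a synthetic 93-character record: '0' filler everywhere except
    // the year, the signed temperature and the quality code.
    static String syntheticRecord(String year, String signedTemp, char quality) {
        char[] chars = new char[93];
        Arrays.fill(chars, '0');
        year.getChars(0, 4, chars, 15);
        signedTemp.getChars(0, 5, chars, 87);
        chars[92] = quality;
        return new String(chars);
    }

    public static void main(String[] args) {
        String valid = syntheticRecord("1971", "+0478", '1');
        String missing = syntheticRecord("1971", "+9999", '1');
        System.out.println(year(valid) + " " + temperature(valid));
        System.out.println(year(missing) + " " + temperature(missing));
    }
}
```

Returning an OptionalInt makes the "skip this record" case explicit; in the real mapper the same decision is expressed by simply not calling context.write.
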
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: receives all temperatures emitted for one year and keeps the largest.
public class MaxTemperatureReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Start below any possible reading so the first value always wins.
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}

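
Between the mapper and the reducer, the framework groups all emitted values by key, so reduce is called once per year with every temperature for that year. A minimal in-memory sketch of that grouping and of the reduce step follows. ShuffleAndReduce is a hypothetical class, a TreeMap stands in for the shuffle, and the sample values in main are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the shuffle and reduce phases.
final class ShuffleAndReduce {
    // Values grouped by key, with keys kept in sorted order.
    private final Map<String, List<Integer>> groups = new TreeMap<>();

    // Records one (year, temperature) pair as a mapper would emit it.
    void emit(String year, int temperature) {
        groups.computeIfAbsent(year, k -> new ArrayList<>()).add(temperature);
    }

    // Same logic as MaxTemperatureReducer.reduce: the largest value of one group.
    static int reduce(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    // Calls reduce once per key and returns year -> maximum temperature.
    Map<String, Integer> run() {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        ShuffleAndReduce job = new ShuffleAndReduce();
        job.emit("1971", 478);
        job.emit("1971", -128); // made-up value
        job.emit("1972", 311);  // made-up value
        System.out.println(job.run()); // prints {1971=478, 1972=311}
    }
}
```
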
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job and submits it to the cluster.
public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err
                    .println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        // Ship the JAR that contains this class to the task nodes.
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");
        // Input path and output directory; the output directory must not exist yet.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        // Key and value types of the job output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Wait for the job to finish; exit with 0 on success and 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. Compiling the program

[huser@master bin]$ javac -classpath ../hadoop-core-1.2.1.jar *.java

5. Running the program

[huser@master bin]$ ../bin/hadoop MaxTemperature ./in/sample.txt ./out6
Warning: $HADOOP_HOME is deprecated.

14/04/18 15:31:15 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/18 15:31:16 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/04/18 15:31:16 INFO input.FileInputFormat: Total input paths to process : 1
14/04/18 15:31:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/04/18 15:31:16 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/18 15:31:17 INFO mapred.JobClient: Running job: job_201404181009_0003
14/04/18 15:31:18 INFO mapred.JobClient:  map 0% reduce 0%
14/04/18 15:31:33 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000002_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
        ... 8 more

14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000002_0&filter=stdout
14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000002_0&filter=stderr
14/04/18 15:31:33 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000003_0, Status : FAILED
14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000003_0&filter=stdout
14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000003_0&filter=stderr
14/04/18 15:31:37 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
        ... 8 more

14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000000_0&filter=stdout
14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000000_0&filter=stderr
14/04/18 15:31:37 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
        ... 8 more

14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000001_0&filter=stdout
14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000001_0&filter=stderr
14/04/18 15:31:41 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000006_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
        ... 8 more

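The job fails because the program consists of three classes and the map tasks cannot find the class the driver refers to. As the "No job jar file set" warning at the top of the log indicates, no JAR was shipped with the job, so MaxTemperatureMapper never reaches the task nodes. The classes need to be packaged into a JAR.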
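
Each stack trace above bottoms out in Class.forName, called from Configuration.getClassByName: the task JVM looks the mapper up by name and throws ClassNotFoundException when no entry on its classpath contains it. The lookup can be reproduced in plain Java. This is only an illustration of the mechanism; ClassLookupDemo and isLoadable are hypothetical names.

```java
// Hypothetical demo of the by-name class lookup seen in the stack trace.
final class ClassLookupDemo {
    // Returns true when the named class can be found on the current classpath.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A JDK class is always on the classpath.
        System.out.println(isLoadable("java.lang.String"));
        // False unless a MaxTemperatureMapper.class happens to be on the classpath.
        System.out.println(isLoadable("MaxTemperatureMapper"));
    }
}
```
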

[huser@master bin]$ jar cvf MaxTemperature.jar *.class
added manifest
adding: MaxTemperature.class(in = 1418) (out= 801)(deflated 43%)
adding: MaxTemperatureMapper.class(in = 1876) (out= 804)(deflated 57%)
adding: MaxTemperatureReducer.class(in = 1664) (out= 707)(deflated 57%)

[huser@master bin]$ ls
hadoop                      MaxTemperatureMapper.java    start-jobhistoryserver.sh
hadoop-config.sh            MaxTemperatureReducer.class  start-mapred.sh
hadoop-daemon.sh            MaxTemperatureReducer.java   stop-all.sh
hadoop-daemons.sh           rcc                          stop-balancer.sh
MaxTemperature.class        slaves.sh                    stop-dfs.sh
MaxTemperature.jar          start-all.sh                 stop-jobhistoryserver.sh
MaxTemperature.java         start-balancer.sh            stop-mapred.sh
MaxTemperatureMapper.class  start-dfs.sh                 task-controller

[huser@master bin]$ rm -rf *.class

Run the program from the JAR:

[huser@master bin]$ ../bin/hadoop jar MaxTemperature.jar MaxTemperature ./in/sample.txt ./out7
Warning: $HADOOP_HOME is deprecated.

14/04/18 15:42:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/18 15:42:48 INFO input.FileInputFormat: Total input paths to process : 1
14/04/18 15:42:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/04/18 15:42:48 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/18 15:43:50 INFO mapred.JobClient: Running job: job_201404181009_0005
14/04/18 15:43:52 INFO mapred.JobClient: map 0% reduce 0%
14/04/18 15:51:04 INFO mapred.JobClient: map 1% reduce 0%
14/04/18 15:51:42 INFO mapred.JobClient: map 2% reduce 0%
14/04/18 15:51:43 INFO mapred.JobClient: map 10% reduce 0%
14/04/18 15:52:46 INFO mapred.JobClient: map 11% reduce 0%
14/04/18 15:53:03 INFO mapred.JobClient: map 12% reduce 0%
14/04/18 15:53:14 INFO mapred.JobClient: map 13% reduce 0%
14/04/18 15:53:16 INFO mapred.JobClient: map 14% reduce 0%
14/04/18 15:53:19 INFO mapred.JobClient: map 15% reduce 0%
14/04/18 15:53:22 INFO mapred.JobClient: map 16% reduce 0%
14/04/18 15:53:32 INFO mapred.JobClient: map 18% reduce 0%
14/04/18 15:54:09 INFO mapred.JobClient: map 19% reduce 0%
14/04/18 16:00:36 INFO mapred.JobClient: map 98% reduce 26%
14/04/18 16:00:41 INFO mapred.JobClient: map 98% reduce 30%
14/04/18 16:00:45 INFO mapred.JobClient: map 100% reduce 30%
14/04/18 16:00:56 INFO mapred.JobClient: map 100% reduce 33%
14/04/18 16:01:13 INFO mapred.JobClient: map 100% reduce 100%
14/04/18 16:01:25 INFO mapred.JobClient: Job complete: job_201404181009_0005
14/04/18 16:01:25 INFO mapred.JobClient: Counters: 30
14/04/18 16:01:25 INFO mapred.JobClient: Job Counters
14/04/18 16:01:25 INFO mapred.JobClient: Launched reduce tasks=1
14/04/18 16:01:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2001708
14/04/18 16:01:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/04/18 16:01:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/04/18 16:01:25 INFO mapred.JobClient: Rack-local map tasks=3
14/04/18 16:01:25 INFO mapred.JobClient: Launched map tasks=11
14/04/18 16:01:25 INFO mapred.JobClient: Data-local map tasks=8
14/04/18 16:01:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=638749
14/04/18 16:01:25 INFO mapred.JobClient: File Output Format Counters
14/04/18 16:01:25 INFO mapred.JobClient: Bytes Written=9
14/04/18 16:01:25 INFO mapred.JobClient: FileSystemCounters
14/04/18 16:01:25 INFO mapred.JobClient: FILE_BYTES_READ=111429430
14/04/18 16:01:25 INFO mapred.JobClient: HDFS_BYTES_READ=1311937676
14/04/18 16:01:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=167764543
14/04/18 16:01:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=9
14/04/18 16:01:25 INFO mapred.JobClient: File Input Format Counters
14/04/18 16:01:25 INFO mapred.JobClient: Bytes Read=1311936596
14/04/18 16:01:25 INFO mapred.JobClient: Map-Reduce Framework
14/04/18 16:01:25 INFO mapred.JobClient: Map output materialized bytes=55714697
14/04/18 16:01:25 INFO mapred.JobClient: Map input records=5140229
14/04/18 16:01:25 INFO mapred.JobClient: Reduce shuffle bytes=55714697
14/04/18 16:01:25 INFO mapred.JobClient: Spilled Records=15194901
14/04/18 16:01:25 INFO mapred.JobClient: Map output bytes=45584703
14/04/18 16:01:25 INFO mapred.JobClient: Total committed heap usage (bytes)=2127904768
14/04/18 16:01:25 INFO mapred.JobClient: CPU time spent (ms)=118580
14/04/18 16:01:25 INFO mapred.JobClient: Combine input records=0
14/04/18 16:01:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=1080
14/04/18 16:01:25 INFO mapred.JobClient: Reduce input records=5064967
14/04/18 16:01:25 INFO mapred.JobClient: Reduce input groups=1
14/04/18 16:01:25 INFO mapred.JobClient: Combine output records=0
14/04/18 16:01:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=1685221376
14/04/18 16:01:25 INFO mapred.JobClient: Reduce output records=1
14/04/18 16:01:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7951810560
14/04/18 16:01:25 INFO mapred.JobClient: Map output records=5064967
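
The counters above show Combine input records=0 and Reduce input records=5064967: no combiner ran, so every map output record was shuffled to the single reducer. Because max is associative and commutative, the maximum of per-split maxima equals the maximum over all values, which is the property that makes a max function usable as a combiner (in Hadoop a combiner is registered with job.setCombinerClass, which the program above does not call). The sketch below checks that property in plain Java; CombinerCheck is a hypothetical class and the values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical check that pre-aggregating with max does not change the result.
final class CombinerCheck {
    // Largest value in the list, using the same loop as the reducer.
    static int max(List<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    // Takes the max of each split first (the combiner step), then the max
    // of those partial results (the reduce step).
    static int maxWithCombiner(List<List<Integer>> splits) {
        List<Integer> partial = new ArrayList<>();
        for (List<Integer> split : splits) {
            partial.add(max(split));
        }
        return max(partial);
    }

    // Sends every value to one reduce call, as the job above does.
    static int maxWithoutCombiner(List<List<Integer>> splits) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> split : splits) {
            all.addAll(split);
        }
        return max(all);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(0, 200, -100),
                Arrays.asList(478, 250),
                Arrays.asList(311));
        System.out.println(maxWithCombiner(splits) == maxWithoutCombiner(splits));
    }
}
```

With a combiner, each map task would forward one record per year instead of one per reading, which shrinks the shuffle without changing the answer.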

View the result:

[huser@master bin]$ ../bin/hadoop fs -cat ./out7/part-r-00000
Warning: $HADOOP_HOME is deprecated.

1971    478
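
The value 478 is not 478 degrees: the NCDC records store air temperature in tenths of a degree Celsius, so the 1971 maximum in this sample is 47.8 degrees Celsius. A small conversion sketch follows; TemperatureFormat, toCelsius and describe are hypothetical names, not part of the original program.

```java
import java.util.Locale;

// Hypothetical helper for reading the job output.
final class TemperatureFormat {
    // Converts a reading in tenths of a degree Celsius to degrees Celsius.
    static double toCelsius(int tenths) {
        return tenths / 10.0;
    }

    // Turns an output line of the form "<year><whitespace><tenths>" into
    // "<year>: <degrees> C" with one decimal place.
    static String describe(String outputLine) {
        String[] fields = outputLine.trim().split("\\s+");
        double celsius = toCelsius(Integer.parseInt(fields[1]));
        return String.format(Locale.ROOT, "%s: %.1f C", fields[0], celsius);
    }

    public static void main(String[] args) {
        System.out.println(describe("1971\t478")); // prints 1971: 47.8 C
    }
}
```
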
Original article: https://www.cnblogs.com/guarder/p/3744766.html