转载请注明出处:http://www.cnblogs.com/zhengrunjian/p/4527269.html
1作为输入
当压缩文件做为mapreduce的输入时,mapreduce将自动通过扩展名找到相应的codec对其解压。
如果我们压缩的文件有相应压缩格式的扩展名(比如lzo,gz,bzip2等),hadoop就会根据扩展名去选择解码器解压。
hadoop对每个压缩格式的支持,详细见下表:
如果压缩的文件没有扩展名,则需 要在执行mapreduce任务的时候指定输入格式.
- hadoop jar /usr/home/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar
- -file /usr/home/hadoop/hello/mapper.py -mapper /usr/home/hadoop/hello/mapper.py
- -file /usr/home/hadoop/hello/reducer.py -reducer /usr/home/hadoop/hello/reducer.py
- -input lzotest -output result4
- -jobconf mapred.reduce.tasks=1
- -inputformat org.apache.hadoop.mapred.LzoTextInputFormat
2作为输出
当mapreduce的输出文件需要压缩时,可以更改mapred.output.compress为true,mapped.output.compression.codec为想要使用的codec的类名就
可以了,当然你可以在代码中指定,通过调用FileOutputFormat的静态方法去设置这两个属性,我们来看代码:
- package com.sweetop.styhadoop;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.compress.GzipCodec;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- import java.io.IOException;
- /**
- * Created with IntelliJ IDEA.
- * User: lastsweetop
- * Date: 13-6-27
- * Time: 下午7:48
- * To change this template use File | Settings | File Templates.
- */
- public class MaxTemperatureWithCompression {
- public static void main(String[] args) throws Exception {
- if (args.length!=2){
- System.out.println("Usage: MaxTemperature <input path> <out path>");
- System.exit(-1);
- }
- Job job=new Job();
- job.setJarByClass(MaxTemperature.class);
- job.setJobName("Max Temperature");
- FileInputFormat.addInputPath(job, new Path(args[0]));
- FileOutputFormat.setOutputPath(job, new Path(args[1]));
- job.setMapperClass(MaxTemperatrueMapper.class);
- job.setCombinerClass(MaxTemperatureReducer.class);
- job.setReducerClass(MaxTemperatureReducer.class);
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(IntWritable.class);
- FileOutputFormat.setCompressOutput(job, true);
- FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
- System.exit(job.waitForCompletion(true)?0:1);
- }
- }
- ~/hadoop/bin/hadoop com.sweetop.styhadoop.MaxTemperatureWithCompression input/data.gz output/
输出的每一个part都会被压缩,我们这里只有一个part,看下压缩了的输出
- [hadoop@namenode test]$hadoop fs -get output/part-r-00000.gz .
- [hadoop@namenode test]$ls
- 1901 1902 ch2 ch3 ch4 data.gz news.gz news.txt part-r-00000.gz
- [hadoop@namenode test]$gunzip -c part-r-00000.gz
- 1901<span style="white-space:pre"> </span>317
- 1902<span style="white-space:pre"> </span>244
当然代码里也可以设置,你只需调用SequenceFileOutputFormat的setOutputCompressionType方法进行设置。
- SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
3压缩map输出
即使你的mapreduce的输入输出都是未压缩的文件,你仍可以对map任务的中间输出作压缩,因为它要写在硬盘并且通过网络传输到reduce节点,对其压
缩可以提高很多性能,这些工作也是只要设置两个属性即可,我们看下代码里怎么设置:
- Configuration conf = new Configuration();
- conf.setBoolean("mapred.compress.map.output", true);
- conf.setClass("mapred.map.output.compression.codec",GzipCodec.class, CompressionCodec.class);
- Job job=new Job(conf);
- 转至:http://blog.csdn.net/lastsweetop/article/details/9187721