MR排序和输入输出格式

mapreduce作业提交流程：
    1、配置文件        //输入输出格式(TextInput(output)Format)
    2、job.waitforcompletion
    3、submit
    4、int map = split.size
        1)、看文件格式，textFile
            判断文件的压缩编解码器(文件名后缀)，如果是压缩，则判断是否是可切割的压缩
            如果不是压缩，则肯定可切割

        2)、getMaxSplitSize    //最大切片默认是long的最大值
            getMinSplitSize    //最小切片默认是1
            blockSize        //块大小
        
    5、开始作业真正提交过程
        ExecutorService mapService = createMapExecutor();
        runTasks(mapRunnables, mapService, "map");        //执行MapTaskRunnables对象的run函数

        //map中除了partition、combiner还有sort阶段

        使用maptask的runNewMapper方法开始正式的map阶段
            1、根据自定义map类名，获得自定义map对象
            2、调用Mapper的run函数来运行用户自定义的map方法
                //设置相关变量或者参数，一个map只调用一次
                setup(context);
                try {
                  while (context.nextKeyValue()) {
                //使用while循环调用自定义map的方法
                map(context.getCurrentKey(), context.getCurrentValue(), context);
                  }
                } finally {
                  //清理过程，包括清理一些没用的k-v
                  cleanup(context);
                }

        Spill：溢出    //当map中的数据超出内存空间的80%，超出的数据就会被本地化

        map端的输出称为IFile：key-len, value-len, key, value
        
        shuffle在调用的时候对ifile进行处理

    4、通过shuffle进行网络间分发，reduce的调用过程类似于map过程

    5、细节：FileInputFormat ====> RecordReader ==> 
         map ===> partition ===> sort ===> combiner ===> shuffle ====> 
         reduce ===> RecordWriter ====> FileOutputFormat

MR的数据倾斜处理：
    大量数据涌入单个节点，造成单个节点的负载量变大
    1、重新设计key
    2、随机分区

    均使用两个job


排序：
    部分排序：对每个分区中的key分别进行排序

    全排序：  对所有分区中的key进行排序
         1、使用一个reduce
         2、自定义分区函数    
         3、采样        //设置采样路径，要放在输入输出路径之后
                    //保证map的输入key和输出key一致！！！
                    
                    
            如果有n个reduce，那么对应的会产生n-1个key集

            1)随机采样                随机取得样本
                //freq ：每个key被选中的几率    样本数/总数< freq 则继续
                //numSamples：获得样本总数
                //maxSplitsSampled： 最大切片数
                //只要一个条件成立则停止采样
                RandomSampler(double freq, int numSamples, int maxSplitsSampled)
                
            2)间隔采样：适用场景(key有序的情况)    每隔一个相等的间隔对样本进行采样
                //freq ：每个key被选中的几率
                //maxSplitsSampled： 最大切片数
                //

            3)切片采样                对每个切片的数据头部取得数据



        采样过程代码顺序：保证全分区和采样设置在最后
                
            1、设置输入输出路径
            2、设置输出k-v
            3、设置输入输出文件类型
            4、指定map和reduce的类
            5、任务数设置
            6、设置全分区函数
            7、设置采样器


            Configuration conf = job.getConfiguration();
            Job job = Job.getInstance();
            job.setJarByClass(App.class);
            job.setJobName("word count");

            // 设置输入输出路径
            FileInputFormat.addInputPath(job, new Path("D:/Temp/seq/1.seq"));
            FileOutputFormat.setOutputPath(job, new Path("D:/Temp/out"));

            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(new Path("D:/Temp/out"))) {
                fs.delete(new Path("D:/Temp/out"), true);
            }
            
            job.setNumReduceTasks(6);

            // 设置输出kv
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            //设置输入类型
            job.setInputFormatClass(SequenceFileInputFormat.class);

            // 指定map和reduce的类
            job.setMapperClass(WCMapper.class);
            
            // 指定分区函数
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(conf, new Path("D:/Temp/par/par.seq"));
            
            /**
             * freq 一个key被选中的几率
             * numSamples 样本总数.
             * maxSplitsSampled 一个切片检查的最大数量. 
             */
            InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler(0.1, 500,10);
            //InputSampler.Sampler<Text, Text> sampler = new InputSampler.IntervalSampler(0.1, 500);
            //InputSampler.Sampler<Text, Text> sampler = new InputSampler.SplitSampler(500,10);
            InputSampler.writePartitionFile(job, sampler);
            
            
            System.exit((job.waitForCompletion(true) ? 0 : 1));


    二次排序：对key进行全排序的基础上，对value进行排序


    当Map和Reduce的输出K-V不同时，要进行分别配置

        //设置reduce的输出K-V
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //设置Map输出k-v
        job.setMapOutputKeyClass(KVPair.class);
        job.setMapOutputValueClass(NullWritable.class);


    分组对比器：
        public class GroupComparator extends WritableComparator {
            protected GroupComparator() {
            super(KVPair.class,true);
            }
            //将相同年份的KVPair识别为一组
            @Override
            public int compare(WritableComparable a, WritableComparable b) {
            KVPair kv1 = (KVPair)a;
            KVPair kv2 = (KVPair)b;

            String year1 = kv1.getYear();
            String year2 = kv2.getYear();
            return year1.compareTo(year2);
            }
        }
    
InputFormat OutputFormat：
=====================================
    TextInputFormat：    文本类型，以行为单位处理，k：LongWritable， v:Text
    KeyValueTextInputFormat    k-v文本类型，以制表符为k-v的分隔，默认k-v均为text格式

    SequenceFileInputFormat:k-v序列文件，k-v类型以sequenceFile为主
        1、创建源文件
        2、在App中指定文件格式
            job.setInputFormatclass()
        3、对Map的输入k-v类型进行修改
            

    NLineInputFormat：    文本类型，以n行为单位进行处理，一次切片按照指定的行号进行切分
                在处理一行文本数据量小的情况下可以使用，减少分发过程
                k：LongWritable， v:Text
        
    

    DBInputFormat：对SQL中的数据进行数据迁移或分布式计算、k:LongWritable    v:DBWritable

        DBWritable:

            public class MyWritable implements Writable, DBWritable {

                 private String line;
                 
                
                 public void write(DataOutput out) throws IOException {
                   out.writeUTF(line);
                   
                 }
                 
                 public void readFields(DataInput in) throws IOException {
                   line = in.readUTF();
                 }
                
                 //使用ppst对数据进行写入
                 public void write(PreparedStatement statement) throws SQLException {
                   statement.setString(1, line);
                 }
                
                 //jdbc读取数据返回的是结果集对象(ResultSet)
                 public void readFields(ResultSet resultSet) throws SQLException {
                   line = resultSet.getString(1);
                  
                 } 
               }