初学MapReduce离线计算（eclipse实现）

一、导入jia包

需要导入common,hdfs以及mapreduce下的所有jar包

二、代码实现诗词出现字数统计

先在桌面上创建一个文本文档（明月几时有.txt）,内容为一首诗词

在eclipse新建三个类：WordCountMapper、WordCountReducer、WordCountDriver。

在我们用MAPREDUCE编程的时候 MAPREDUCE有一套自己的数据类型
字符串 Text 提供Java的数据类型可以和MapReduce的类型做一个数据转换
整数 IntWritable ShortWritable LongWritbale
浮点数 FloatWritable DoubleWritable
字符串类型 Text Text.toString转换为字符串 new Text("") 把字符串转换为Text
整数类型 IntWritable get() 转换为int new IntWritable(1) 把Java类型转换为MapReduce的类型

WordCountMapper类：

Map类会输出成一个文件 temp.html
Map类规范必须得继承Mapper类并且重写mapper方法

Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
KEYIN ：表示我们当前读取一个文件[xxx.txt] 读到多少个字节了数量词
VALUEIN ：表示我们当前读的是文件的多少行逐行读取表示我们读取的一行文字
KEYOUT：我们执行MAPPER之后写入到文件中KEY的类型
VALUEOUT ：我们执行MAPPER之后写入到文件中VALUE的类型

package com.blb.core;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable,Text,Text,IntWritable>
{


    protected void map(LongWritable key,Text value,Mapper<LongWritable,Text,Text,IntWritable>.Context context) throws IOException, InterruptedException
    {
        String replace = value.toString().replace(" ", "");
        char[] array=replace.toCharArray();
        for (char c : array) {
            context.write(new Text(c+""), new IntWritable(1));
        }

    }



}

Mapper阶段产生一个临时文件
Reduce 读取Mapper生成的那个临时文件

WordCountReducer类：

Reduce类规范必须得继承Reducer类并且重写Reducer方法
Reducer会把我们Mapper执行后的那个临时文件作为他的输入，使用之后会把这个临时文件给删除掉

Reducer<Text,IntWritable, Text,IntWritable>：
KEYIN Text
VALUEIN IntWritbale
KEYOUT Text ：我们Reduce之后这个文件中内容的 Key是什么
VALUEOUT IntWritable ：这个文件中内容Value是什么

package com.blb.core;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

    @Override
    protected void reduce(Text key,Iterable<IntWritable> values,Reducer<Text,IntWritable,Text,IntWritable>.Context context) throws IOException, InterruptedException
    {
        int sum=0;
        for(IntWritable value:values)
        {
            sum+=value.get();
        }
        context.write(key, new IntWritable(sum));
    }

}

WordCountDriver类：

Driver这个类用来执行一个任务 Job
任务=Mapper+Reduce+HDFS把他们3者关联起来

package com.blb.core;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) {
        Configuration conf=new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.0.32:9000");
        try {
            Job job=Job.getInstance(conf);
            job.setJarByClass(WordCountDriver.class);　　//要给当前的任务取一个名称

            job.setMapperClass(WordCountMapper.class);　　  //我当前的任务的Mapper类是谁
            job.setMapOutputKeyClass(Text.class);　　　　　　//我们Mapper任务输出的文件的Key值类型
            job.setMapOutputValueClass(IntWritable.class); //我们Mapper任务输出的文件的Value值类型

            job.setReducerClass(WordCountReducer.class);　　//我们当前任务的Reducer类是谁
            job.setOutputKeyClass(Text.class);　　　　　　　　//我们Reducer任务输出的文件的Key值类型
            job.setOutputValueClass(IntWritable.class);　　//我们Reducer任务输出的文件的Value值类型

            FileInputFormat.setInputPaths(job,new Path("/words"));

　　　　　　　　//关联我们HDFS文件 HDFS文件的绝对路径
　　　　　　　　//输入的路径是文件夹把这个文件夹下面的所有文件都执行一遍

            FileOutputFormat.setOutputPath(job, new Path("/out"));

　　　　　　　　//最终要有一个结果我最终计算完成生成的结果存放在HDFS上的哪里
　　　　　　　　//Mapper执行的后的结果是一个临时文件这个文件存放在本地
　　　　　　　　//Reducer执行后的结果自动的上传到HDFS之上并且还会把Mapper执行后的结果给删除掉

            job.waitForCompletion(true);　　//我们关联完毕后  我们要执行了
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

三、执行

写完之后，将程序导出成jar包:WordCount.jar

1、在hdfs上新建文件夹words

hadoop fs -mkdir /words

2、将要计算的文件（明月几时有.txt）上传到hdfs上。

先rz上传到linux上，再用命令hadoop fs -put 明月几时有.txt /words

3、将jar包上传到linux上。

4、开启服务

start-all.sh

5、运行jar包（hadoop jar jar包名称主类[带main方法的那个类]）

hadoop jar WordCount.jar com.blb.core.WordCountDriver

最后成功后会在hdfs的/out目录下生成最终的结果的文件part-r-00000

可以将文件下载下来查看