day03

Today's learning process and summary

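I. Hadoop basics

1. Environment setup

1) Virtual machine: download and install the virtual machine software

2) Java environment and the Hadoop package

3) Create a hadoop user, configure SSH, and set up passwordless login

4) Set up Hadoop in standalone mode

5) Configure pseudo-distributed mode

6) Connect Eclipse to the Hadoop environment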

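Download the hadoop-eclipse plugin jar.
After downloading, copy hadoop-eclipse-kepler-plugin-2.6.0.jar from the release folder (2.2.0 and 2.4.1 versions are also provided) into the plugins folder of the Eclipse installation directory, then run eclipse -clean to restart Eclipse.
In Eclipse, open Window -> Preferences

Window -> Perspective -> Open Perspective -> Other... opens the Map/Reduce perspective.
After Eclipse starts, DFS Locations appears in the Project Explorer on the left.
Configure the Map/Reduce connection.

Once this is done, the contents of HDFS can be browsed from Eclipse.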

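2. WordCount example: count how many times each word occurs in a file

Idea: the Mapper reads the file line by line, splits each line on spaces, and emits every word as a <word, 1> pair. The pairs are sent to the Reducer, which sums the values for each key and outputs the total count for that key.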
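
The idea above can be tried without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class WordCountSketch and its methods map, shuffle, reduce and wordCount are hypothetical names chosen for illustration, and a TreeMap stands in for the sort-and-group step that the framework performs between the Mapper and the Reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of the word-count flow: map -> group by key -> reduce.
// All names here are illustrative, not part of the Hadoop API.
class WordCountSketch {

    // "Map" step: emit a <word, 1> pair for every non-empty word on the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" step: sort by key and collect all values that share a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    // Runs the three steps over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(wordCount(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

In a real job the three steps run on different machines, which is why the word itself is used as the key: identical words are routed to the same reduce task.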

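1) Write the Mapper

// 1. Convert the text passed in by the map task to a String
String line = value.toString();

// 2. Split the line into words on spaces
String[] words = line.split(" ");

// 3. Emit each word as <word, 1>
for (String word : words) {
    // The word is the key and the count 1 is the value. Data is distributed by key,
    // so identical words are sent to the same reduce task.
    context.write(new Text(word), new IntWritable(1));
}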
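
One detail worth checking in the Mapper is line.split(" "): when a line contains two consecutive spaces, the result includes an empty string, which would then be counted as a word. The sketch below shows the difference in plain Java; SplitSketch and tokenize are hypothetical names, and splitting on runs of whitespace is one possible alternative, not something the original notes prescribe.

```java
import java.util.Arrays;

// Compares split(" ") with a whitespace-tolerant tokenizer.
class SplitSketch {

    // Splits on runs of whitespace and never returns empty tokens.
    static String[] tokenize(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // a blank line contains no words
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String line = "hello  world"; // two spaces between the words
        System.out.println(Arrays.toString(line.split(" "))); // prints [hello, , world]
        System.out.println(Arrays.toString(tokenize(line)));  // prints [hello, world]
    }
}
```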

2) Write the Reducer

int count = 0;

// 1. Sum the values for this key
for (IntWritable value : values) {
    count += value.get();
}

// 2. Emit the total count for this key
context.write(key, new IntWritable(count));

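3) Write the Driver

1. Get the configuration and create the job instance

2. Specify the location of the jar that contains the job classes

3. Associate the Mapper and Reducer classes

4. Specify the Mapper output key and value types

5. Specify the final output key and value types

6. Specify the input and output file paths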

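3. Join example

Design: the map phase reads student_info.txt and student_class_info.txt. Records from student_info.txt form the left table and are tagged l; records from student_class_info.txt form the right table and are tagged r. Each record is emitted as a <join key, value plus tag> pair.

After the map phase, the (key/value) pairs emitted by the map method are sorted and the values that share a key are merged, so each reduce call receives one join key together with all of its tagged values.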
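
The tag-then-group design can be sketched in plain Java without Hadoop. Everything below is an illustration under assumptions: the class JoinSketch, its methods, and the sample records are hypothetical. The field order (left file "name id", right file "id className") is inferred from the array indexes used in the Mapper of these notes, and skipping keys with no left record is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of a reduce-side join: tag records, group by key, combine.
class JoinSketch {
    static final String LEFT_FLAG = "l";
    static final String RIGHT_FLAG = "r";

    // "Map" step for the left table (assumed layout: "name id").
    static void mapLeft(String line, Map<String, List<String>> grouped) {
        String[] fields = line.split(" ");
        grouped.computeIfAbsent(fields[1], k -> new ArrayList<>()).add(fields[0] + " " + LEFT_FLAG);
    }

    // "Map" step for the right table (assumed layout: "id className").
    static void mapRight(String line, Map<String, List<String>> grouped) {
        String[] fields = line.split(" ");
        grouped.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[1] + " " + RIGHT_FLAG);
    }

    // "Reduce" step: separate the tagged values of one key, then pair them up.
    static List<String> reduce(List<String> values) {
        String studentName = null;
        List<String> classNames = new ArrayList<>();
        for (String value : values) {
            String[] infos = value.split(" ");
            if (infos[1].equals(LEFT_FLAG)) {
                studentName = infos[0];
            } else if (infos[1].equals(RIGHT_FLAG)) {
                classNames.add(infos[0]);
            }
        }
        List<String> joined = new ArrayList<>();
        if (studentName != null) { // skip keys that have no left-table record
            for (String className : classNames) {
                joined.add(studentName + " " + className);
            }
        }
        return joined;
    }

    static List<String> join(List<String> left, List<String> right) {
        Map<String, List<String>> grouped = new TreeMap<>(); // stands in for sort and group
        for (String line : left) {
            mapLeft(line, grouped);
        }
        for (String line : right) {
            mapRight(line, grouped);
        }
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            out.addAll(reduce(values));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> students = Arrays.asList("Jenny 001", "Tom 002");                  // made-up data
        List<String> classes = Arrays.asList("001 Math", "001 English", "002 Physics"); // made-up data
        // prints [Jenny Math, Jenny English, Tom Physics]
        System.out.println(join(students, classes));
    }
}
```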

1) Write the Mapper

// Determine which file the record comes from
if (filePath.contains(LEFT_FILENAME)) {
    fileFlag = LEFT_FILENAME_FLAG;
    joinKey = value.toString().split(" ")[1];
    joinValue = value.toString().split(" ")[0];
    // System.out.println("l--" + joinKey + ":" + joinValue);
} else if (filePath.contains(RIGHT_FILENAME)) {
    fileFlag = RIGHT_FILENAME_FLAG;
    joinKey = value.toString().split(" ")[0];
    joinValue = value.toString().split(" ")[1];
    // System.out.println("r--" + joinKey + ":" + joinValue);
}

// Emit the key-value pair, tagging the value with the file it came from
context.write(new Text(joinKey), new Text(joinValue + " " + fileFlag));

2) Write the Reducer

while (iterator.hasNext()) {
    String[] infos = iterator.next().toString().split(" ");

    // Determine which file this record comes from
    if (infos[1].equals(LEFT_FILENAME_FLAG)) {
        studentName = infos[0];
        // System.out.println("reduce-l");
    } else if (infos[1].equals(RIGHT_FILENAME_FLAG)) {
        studentClassNames.add(infos[0]);
        // System.out.println("reduce-r");
    }
} // close the while loop before joining, so all values for the key have been read

// Form the Cartesian product of the student name and the class names
for (int i = 0; i < studentClassNames.size(); i++) {
    context.write(new Text(studentName), new Text(studentClassNames.get(i)));
}

3) Write the Driver

Configuration conf = new Configuration();

// Job.getInstance replaces the deprecated new Job(conf, name) constructor
Job job = Job.getInstance(conf, "MRjoin");
job.setJarByClass(JoinTest.class);

// Input directory holding both files, and the output directory
FileInputFormat.addInputPath(job, new Path("hdfs://127.0.0.1:9000/user/root/join"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://127.0.0.1:9000/user/joinResult"));

job.setMapperClass(JoinMap.class);
job.setReducerClass(JoinReduce.class);

job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

// Submit the job and wait; exit with 0 on success, 1 on failure
System.exit(job.waitForCompletion(true) ? 0 : 1);

Summary of problems encountered

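I. MapReduce error when connecting Eclipse to Hadoop

The Map/Reduce port was configured incorrectly. Following the modified configuration, the HDFS port in the connection settings should be set to 9000, while the Map/Reduce port setting stays unchanged.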

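II. Mismatched MapReduce key and value types cause a runtime error

value: must be readable and writable (Writable)

key: must be readable, writable, and sortable (WritableComparable)

key-value: both must be serializable

In MapReduce, the Mapper's input is data read from files and its output is passed to the Reducer, so the Reducer's input types must match the Mapper's output types. The Reducer's output is the result that gets stored.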
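
The matching rule can be illustrated with plain Java generics. This is only a sketch under assumptions: MapFn, ReduceFn and run are hypothetical stand-ins, not Hadoop's Mapper, Reducer, or Job classes, and Comparable plays the role that WritableComparable plays for Hadoop keys. Because the type parameters K2 and V2 appear both as the mapper's output and as the reducer's input, a mismatch here is rejected by the compiler, whereas in Hadoop it typically shows up only when the job runs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Sketch: generics that force reducer input types to equal mapper output types.
class TypeMatchSketch {

    // Hypothetical stand-in for a mapper: <KIN, VIN> in, <KOUT, VOUT> out.
    interface MapFn<KIN, VIN, KOUT, VOUT> {
        void map(KIN key, VIN value, BiConsumer<KOUT, VOUT> out);
    }

    // Hypothetical stand-in for a reducer: one key with all of its values in.
    interface ReduceFn<KIN, VIN, KOUT, VOUT> {
        void reduce(KIN key, List<VIN> values, BiConsumer<KOUT, VOUT> out);
    }

    // K2 and V2 are shared by mapper output and reducer input.
    // K2 must be Comparable because intermediate keys are sorted.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        Map<K2, List<V2>> grouped = new TreeMap<>(); // sorted by intermediate key
        input.forEach((k, v) -> mapper.map(k, v,
                (k2, v2) -> grouped.computeIfAbsent(k2, x -> new ArrayList<>()).add(v2)));
        Map<K3, V3> result = new LinkedHashMap<>();
        grouped.forEach((k2, values) -> reducer.reduce(k2, values, result::put));
        return result;
    }

    public static void main(String[] args) {
        Map<Long, String> lines = new LinkedHashMap<>();
        lines.put(0L, "a b a"); // key: line offset, value: line text

        MapFn<Long, String, String, Integer> mapper = (offset, line, out) -> {
            for (String word : line.split(" ")) {
                out.accept(word, 1);
            }
        };
        ReduceFn<String, Integer, String, Integer> reducer = (word, ones, out) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            out.accept(word, sum);
        };

        // prints {a=2, b=1}
        System.out.println(run(lines, mapper, reducer));
    }
}
```

Changing the reducer's declared input to, say, ReduceFn<String, String, String, Integer> makes the call to run fail to compile, which mirrors the type mismatch described above.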

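III. No Map/Reduce perspective is available when configuring MapReduce in Eclipse

The Eclipse build in use was the Scala edition rather than a Java development edition, so it had no Hadoop and MapReduce support. Eclipse needs to be replaced with the Java EE development edition.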

Mind map of skills learned

(to be extended)