Hadoop Elementary Course

Introduction
Two main questions:
How to store massive data - HDFS
How to analyze massive data - MapReduce

Hadoop here means the whole Hadoop project family,
which includes Common, Avro, MapReduce, HDFS, Pig, Hive, HBase, ZooKeeper, Sqoop, and Oozie.

HDFS suits very large files with streaming data access (write once, read many times) running on ordinary commodity hardware.
It does not suit workloads that need low-latency data access, multiple writers, or arbitrary file modifications, nor large numbers of small files.

HDFS Architecture
Block: 64 MB by default in 1.x (large, because HDFS targets massive files; 2.x raised the default to 128 MB)
NameNode: holds the file system's directory tree, file metadata, and the file-to-block mapping (crucial)
DataNode: stores the blocks themselves (see the client sketch after this list)
HA strategy: 1.x allows only a single NameNode; from 2.x on there is an active-standby pattern for the NameNode
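To make the division of labor concrete, here is a minimal client-side sketch using the Hadoop FileSystem Java API (the NameNode address hdfs://namenode:9000 and the path /demo/numbers.txt are illustrative assumptions, not from the original notes). The client asks the NameNode where blocks live; the bytes themselves stream to and from DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; adjust to your cluster.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/demo/numbers.txt");
            // Write once: HDFS files are written sequentially and are not
            // modified in place afterwards.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeBytes("3 9 1\n7 2 8\n5 6 4\n");
            }
            // Read many times: the NameNode resolves the path to block
            // locations, then data streams directly from the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, "UTF-8"));
            }
            fs.close();
        }
    }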

MapReduce
MapReduce is a programming model for the parallel processing of massive data.
For example, find the maximum of nine numbers, all stored in HDFS:
Step 1: call the map function to get the maximum of each group of three numbers.
Step 2: call the reduce function on those partial maxima to get the overall maximum (see the sketch below).
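Here is a minimal sketch of that example as a Hadoop MapReduce job in Java (org.apache.hadoop.mapreduce API; the class name MaxValue and all paths are illustrative). Each input line stands in for one group of three numbers: every mapper emits its line's local maximum under a single shared key, so one reducer sees all the partial maxima and keeps the largest.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxValue {
        // Map step: emit the maximum of the numbers on one input line.
        public static class MaxMapper
                extends Mapper<LongWritable, Text, NullWritable, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String line = value.toString().trim();
                if (line.isEmpty()) return;  // skip blank lines
                int max = Integer.MIN_VALUE;
                for (String s : line.split("\\s+")) {
                    max = Math.max(max, Integer.parseInt(s));
                }
                ctx.write(NullWritable.get(), new IntWritable(max));
            }
        }

        // Reduce step: all local maxima share one key, so a single call
        // sees them all and keeps the global maximum.
        public static class MaxReducer
                extends Reducer<NullWritable, IntWritable, NullWritable, IntWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<IntWritable> values,
                                  Context ctx) throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable v : values) {
                    max = Math.max(max, v.get());
                }
                ctx.write(NullWritable.get(), new IntWritable(max));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max value");
            job.setJarByClass(MaxValue.class);
            job.setMapperClass(MaxMapper.class);
            job.setReducerClass(MaxReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run it with something like "hadoop jar max-value.jar MaxValue /demo/numbers.txt /demo/out"; given the three lines written in the HDFS sketch above, the mappers emit 9, 8, and 6, and the reducer outputs 9.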

In conclusion, HDFS provides the means, and the supporting strategies, to store massive data across many hosts,
while MapReduce analyzes that data by divide and conquer.


Original post: https://www.cnblogs.com/chuanlong/p/2822933.html