Hadoop生态框架介绍

一、总览
- 1.1 参考论文
- 1.2 生态组件
2. 各个框架

一、总览

1.1 参考论文

引领大数据前进的三驾马车分别是：

年份	论文页面	直链下载
2003	The Google File System	GFS论文下载蓝奏云地址
2004	MapReduce: Simplified Data Processing on Large Clusters	MapReduce论文下载蓝奏云地址
2006	Bigtable: A Distributed Storage System for Structured Data	Bigtable论文下载蓝奏云地址

1.2 生态组件

Components of the Hadoop Ecosystem

num	item	img
1	HDFS
2	MapReduce
3	YARN
4	HBase
5	Pig
6	Hive
7	Sqoop
8	Flume
9	Kafka
10	Zookeeper
11	Spark

总体框架图：
生态框架图

2. 各个框架

1.1 HDFS (Hadoop Distributed File System)

item	content
主页	http://hadoop.apache.org/
文档	https://hadoop.apache.org/docs/current/
介绍	The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

1.2 MapReduce

item	content
1	2

1.3 YARN

item	content
主页	https://classic.yarnpkg.com/lang/en/docs/
介绍	Yarn is a package manager for your code. It allows you to use and share (e.g. JavaScript) code with other developers from around the world. Yarn does this quickly, securely, and reliably so you don’t ever have to worry. Yarn allows you to use other developers’ solutions to different problems, making it easier for you to develop your software. If you have problems, you can report issues or contribute back, and when the problem is fixed, you can use Yarn to keep it all up to date. Code is shared through something called a package (sometimes referred to as a module). A package contains all the code being shared as well as a package.json file which describes the package.

item

content

主页

https://classic.yarnpkg.com/lang/en/docs/

介绍

Yarn is a package manager for your code. It allows you to use and share (e.g. JavaScript) code with other developers from around the world. Yarn does this quickly, securely, and reliably so you don’t ever have to worry.

Yarn allows you to use other developers’ solutions to different problems, making it easier for you to develop your software. If you have problems, you can report issues or contribute back, and when the problem is fixed, you can use Yarn to keep it all up to date.

Code is shared through something called a package (sometimes referred to as a module). A package contains all the code being shared as well as a package.json file which describes the package.

1.4 HBase

item	content
官网	http://hbase.apache.org/
介绍	Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

item

content

官网

http://hbase.apache.org/

介绍

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

1.5 Pig

...

1.6 Hive

...

1.7 Sqoop

...

1.8 Flume

...

1.9 Kafka

...

1.10 Zookeeper

item	content
主页	http://zookeeper.apache.org/
介绍	Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination. What is ZooKeeper? ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.Learn more about ZooKeeper on the ZooKeeper Wiki.

item

content

主页

http://zookeeper.apache.org/

介绍

Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.

What is ZooKeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.Learn more about ZooKeeper on the ZooKeeper Wiki.

1.11 Spark

item	content
主页	http://spark.apache.org/
文档	http://spark.apache.org/docs/latest/
介绍	Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or cluster

有了计划记得推动，不要原地踏步。