Hadoop生态框架介绍

一、总览

1.1 参考论文

引领大数据前进的三驾马车分别是:

年份 论文页面 直链下载
2003 The Google File System GFS论文下载蓝奏云地址
2004 MapReduce: Simplified Data Processing on Large Clusters MapReduce论文下载蓝奏云地址
2006 Bigtable: A Distributed Storage System for Structured Data Bigtable论文下载蓝奏云地址

1.2 生态组件

Components of the Hadoop Ecosystem

num item img
1 HDFS
2 MapReduce
3 YARN
4 HBase
5 Pig
6 Hive
7 Sqoop
8 Flume
9 Kafka
10 Zookeeper
11 Spark

总体框架图:
生态框架图

2. 各个框架

1.1 HDFS (Hadoop Distributed File System)

item content
主页 http://hadoop.apache.org/
文档 https://hadoop.apache.org/docs/current/
介绍 The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

1.2 MapReduce

item content
1 2

1.3 YARN

item content
主页 https://classic.yarnpkg.com/lang/en/docs/
介绍 Yarn is a package manager for your code. It allows you to use and share (e.g. JavaScript) code with other developers from around the world. Yarn does this quickly, securely, and reliably so you don’t ever have to worry.

Yarn allows you to use other developers’ solutions to different problems, making it easier for you to develop your software. If you have problems, you can report issues or contribute back, and when the problem is fixed, you can use Yarn to keep it all up to date.

Code is shared through something called a package (sometimes referred to as a module). A package contains all the code being shared as well as a package.json file which describes the package.

1.4 HBase

item content
官网 http://hbase.apache.org/
介绍 Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

1.5 Pig

...

1.6 Hive

...

1.7 Sqoop

...

1.8 Flume

...

1.9 Kafka

...

1.10 Zookeeper

item content
主页 http://zookeeper.apache.org/
介绍 Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.

What is ZooKeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.Learn more about ZooKeeper on the ZooKeeper Wiki.

1.11 Spark

item content
主页 http://spark.apache.org/
文档 http://spark.apache.org/docs/latest/
介绍 Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or cluster
有了计划记得推动,不要原地踏步。
原文地址:https://www.cnblogs.com/amnotgcs/p/15623997.html