HDFS

An HDFS cluster consists primarily of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.

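The HDFS architecture describes the basic interactions among the NameNode, the DataNodes, and clients.
Clients contact the NameNode for file metadata or file modifications, and perform the actual file I/O directly with the DataNodes.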

Some salient features of Hadoop:
1)Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability for a large set of distributed applications, is an integral part of Hadoop.

2)HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.

3)Hadoop is written in Java and is supported on all major platforms.

4)Hadoop supports shell-like commands to interact with HDFS directly.

5)The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.

6)New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:
   a)File permissions and authentication.
   b)Rack awareness: to take a node’s physical location into account while scheduling tasks and allocating storage.
   c)Safemode: an administrative mode for maintenance.
   d)fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
   e)fetchdt: a utility to fetch a DelegationToken and store it in a file on the local system.
   f)Balancer: a tool to balance the cluster when the data is unevenly distributed among DataNodes.
   g)Upgrade and rollback: after a software upgrade, it is possible to roll back to the HDFS state before the upgrade in case of unexpected problems.
   h)Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
   i)Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to HDFS. Replaces the role previously filled by the Secondary NameNode, though it is not yet battle hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
   j)Backup node: an extension to the Checkpoint node. In addition to checkpointing, it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
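
Several of the utilities listed above (Safemode, fsck, Balancer) are run from the command line. The following is a minimal sketch of how they might be wrapped, not a definitive procedure. The variable HDFS_BIN and the function names are hypothetical helpers introduced here; the subcommands and flags are the standard ones of the hdfs client.

```shell
#!/bin/sh
# HDFS_BIN is an assumed variable; set it to bin/hdfs when hdfs is not on PATH.
HDFS_BIN="${HDFS_BIN:-hdfs}"

# Report whether the NameNode is currently in Safemode.
safemode_status() {
    "$HDFS_BIN" dfsadmin -safemode get
}

# Check the health of one HDFS path, listing its files and blocks.
# $1 is the path to check; it defaults to the root directory.
check_path() {
    "$HDFS_BIN" fsck "${1:-/}" -files -blocks
}

# Run the Balancer with a utilization threshold given in percent.
# $1 is the threshold; 10 is used here when none is given.
rebalance() {
    "$HDFS_BIN" balancer -threshold "${1:-10}"
}
```

Nothing runs until a function is called, for example `safemode_status` followed by `check_path /user`. These commands usually require HDFS administrator privileges.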

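Web Interface
Each NameNode and DataNode runs an internal web server.
With the default configuration, the NameNode front page is at http://namenode-name:50070/ (the default port in Hadoop 2.x; Hadoop 3.x uses 9870).
It can also be used to browse the HDFS file system (using the "Browse the file system" link).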
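
To check from a script that the NameNode web server is reachable, something like the sketch below may help. The function names are hypothetical, the host name namenode-name is the placeholder used above, and the port defaults to 50070 as stated in these notes; pass another port if your installation differs.

```shell
#!/bin/sh
# Build the NameNode web UI address.
# $1 = host name, $2 = port (optional, defaults to 50070).
namenode_ui_url() {
    host="$1"
    port="${2:-50070}"
    printf 'http://%s:%s/\n' "$host" "$port"
}

# Print the HTTP status code returned by the NameNode front page.
# Requires curl and network access to the NameNode.
namenode_ui_status() {
    curl -s -o /dev/null -w '%{http_code}\n' "$(namenode_ui_url "$@")"
}
```

For example, `namenode_ui_status namenode-name` prints the status code of the front page when the server answers.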

Shell commands:
bin/hdfs dfs -help #list the commands supported by the Hadoop shell
bin/hdfs dfs -help command-name #show detailed help for a single command
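
Beyond -help, a typical session copies a file into HDFS and reads it back. The sketch below strings together four standard dfs subcommands (-mkdir, -put, -ls, -cat). HDFS_BIN, the function name, and the example paths are assumptions made for illustration.

```shell
#!/bin/sh
# HDFS_BIN is an assumed variable; set it to bin/hdfs when hdfs is not on PATH.
HDFS_BIN="${HDFS_BIN:-hdfs}"

# Copy a local file into an HDFS directory, then list and print it.
# $1 = local file, $2 = HDFS target directory (both supplied by the caller).
hdfs_roundtrip() {
    local_file="$1"
    target_dir="$2"
    "$HDFS_BIN" dfs -mkdir -p "$target_dir" &&
    "$HDFS_BIN" dfs -put "$local_file" "$target_dir/" &&
    "$HDFS_BIN" dfs -ls "$target_dir" &&
    "$HDFS_BIN" dfs -cat "$target_dir/$(basename "$local_file")"
}
```

A call such as `hdfs_roundtrip notes.txt /user/demo` stops at the first failing step, because the commands are chained with `&&`.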

dfsadmin commands
bin/hdfs dfsadmin -help

hdfs dfsadmin -printTopology # print the cluster topology
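
The topology output can be post-processed with ordinary text tools. The sketch below counts racks, assuming each rack in the -printTopology output is introduced by a line beginning with "Rack:"; that layout is an assumption about the output format, and count_racks is a hypothetical helper.

```shell
#!/bin/sh
# Count the racks in `hdfs dfsadmin -printTopology` output read from stdin.
# Assumes each rack is introduced by a line starting with "Rack:".
count_racks() {
    grep -c '^Rack:'
}
```

Usage, on a machine with a configured client: `hdfs dfsadmin -printTopology | count_racks`.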

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
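
Because the mapper and reducer are plain executables that read stdin and write stdout, a streaming job can be imitated on one machine with a pipeline before it is submitted. The sketch below is only a local approximation of the map, sort, reduce flow, not how Hadoop schedules the work. The function name is hypothetical, and the jar path in the comment is a placeholder that depends on the installation.

```shell
#!/bin/sh
# Imitate a streaming job locally: stdin -> mapper -> sort -> reducer.
# $1 = mapper executable, $2 = reducer executable (no arguments supported).
simulate_streaming() {
    mapper="$1"
    reducer="$2"
    "$mapper" | LC_ALL=C sort | "$reducer"
}

# On a cluster, the corresponding job would be submitted roughly as follows
# (jar location and input/output directories are placeholders):
#   hadoop jar /path/to/hadoop-streaming.jar \
#       -input myInputDirs -output myOutputDir \
#       -mapper /bin/cat -reducer /usr/bin/wc
```

For example, `simulate_streaming /bin/cat /usr/bin/wc < input.txt` runs the same mapper and reducer as the commented job against a local file.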

Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI™ based).