HDFS

Introduction to HDFS

(Hadoop Distributed File System)

*****************************************

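fsimage: the NameNode's metadata image file, stored on disk

editlog: the NameNode's operation log

fstime: the time of the most recent checkpoint

metadata: information about which DataNodes, and which locations on them, hold the blocks of a file

NN: NameNode

SNN: SecondaryNameNode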

***************************************

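Main features:

1. Handles very large files

2. Streaming data access

Once a data set has been generated by its source, it is replicated and distributed to different storage nodes and then serves all kinds of analysis jobs. In most cases an analysis job involves most of the data in the data set, so HDFS is built to read a whole data set efficiently rather than to fetch a single record quickly.

3. Runs on clusters of commodity machines

This requires the HDFS design to take data reliability, safety, and high availability fully into account.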

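Limitations:

1. Not suited to low-latency data access

HDFS is built for analysis jobs over large data sets and is designed primarily for high data throughput, which may come at the cost of high latency. A higher-level data management project such as HBase can be used on top of HDFS to make up for this as far as possible.

2. Cannot store large numbers of small files efficiently

Hadoop uses the NameNode to manage the file system metadata and to answer client requests for file locations and the like, so the limit on the number of files is set by the NameNode. The more files there are, the more NameNode memory is consumed and the heavier its workload becomes, until the time needed to look up and process metadata is no longer acceptable.

3. No support for multiple writers or arbitrary file modification

A file in HDFS has only one writer at a time, and writes can only happen at the end of the file, that is, only appends are supported.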
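To make the small-files limitation (item 2 above) concrete, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of NameNode heap per namespace object (one per file plus one per block), which is a commonly quoted rule of thumb and not a measured value for any particular Hadoop release; the function name is made up for the example.

```python
# Rough NameNode heap estimate for a namespace full of small files.
# ASSUMPTION: ~150 bytes of heap per namespace object (file entry or block
# entry). This is a rule of thumb, not a figure taken from these notes.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate heap used by the file entries plus their block entries."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for n in (1_000_000, 100_000_000):
        gib = namenode_heap_bytes(n) / 1024 ** 3
        print(f"{n:,} small files -> roughly {gib:.1f} GiB of heap")
```

The point of the sketch is only the linear growth: the heap cost depends on the number of files and blocks, not on how many bytes they hold.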

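HDFS concepts

1. Block

In an operating system, files are stored on disk in blocks. A file system block is typically a few kilobytes, while a disk block is 512 bytes.

A block in HDFS is an abstract concept. When configuring Hadoop you will see the default block size (dfs.blocksize): 64 MB in Hadoop 1.x and 128 MB from Hadoop 2.x onward. A file is split into blocks for storage, and the block is the logical unit of file storage and processing.

Advantages of the block abstraction in HDFS:

a. Files of any size can be stored, not limited by the disk size of any single node in the network; a large file can be split into many blocks spread across the cluster.

b. Using the block as the unit of operation simplifies the storage subsystem: the block size is fixed, which simplifies storage management, and metadata can be stored separately from block contents. It also makes replication for fault tolerance (multiple replicas, 3 by default) easier to implement in a distributed file system.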
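As an illustration of the block abstraction, the following sketch shows how a file of a given length maps onto fixed-size blocks. The 128 MB block size and the function name are assumptions for the example; a real cluster uses whatever block size it is configured with.

```python
# Sketch: map a file length onto fixed-size blocks.
# BLOCK_SIZE is an assumed value (128 MB), not read from any cluster.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs; only the last block may be short."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    for offset, length in split_into_blocks(300 * 1024 * 1024):
        print(f"offset={offset} length={length}")
```

A 300 MB file yields two full blocks and one shorter final block; each of these blocks can then be placed and replicated independently.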

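HDFS architecture

HDFS uses a master/slave model: an HDFS cluster consists of one NameNode and a number of DataNodes.

The NameNode is the master server; it manages the file system namespace and client access to files. The DataNodes in the cluster manage the stored data.

HDFS lets users store data in the form of files. Internally, a file is split into blocks, and these blocks are stored on a set of DataNodes.

The NameNode executes namespace operations such as opening, closing, and renaming files or directories, and it also determines the mapping of blocks to specific DataNodes.

The NameNode mainly maintains two files: the fsimage and the editlog.

The fsimage holds the latest metadata checkpoint and contains information about every directory and file in the HDFS file system.

For a file, this includes the block descriptions, modification time, access time, and so on;

for a directory, it includes the modification time and access control information (owning user and group), and so on.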

[datalink@master current]$ ls -al
total 6276
drwx------. 2 datalink datalink 4096 Jan 28 13:34 .
drwxrwxr-x. 3 datalink datalink 40 Jan 27 12:33 ..
-rw-rw-r--. 1 datalink datalink 1048576 Jan 27 11:54 edits_0000000000000000001-0000000000000000001
-rw-rw-r--. 1 datalink datalink 1048576 Jan 27 12:17 edits_0000000000000000002-0000000000000000002
-rw-rw-r--. 1 datalink datalink 1048576 Jan 27 12:19 edits_0000000000000000003-0000000000000000003
-rw-rw-r--. 1 datalink datalink 1048576 Jan 27 12:28 edits_0000000000000000004-0000000000000000004
-rw-rw-r--. 1 datalink datalink 1048576 Jan 27 12:28 edits_0000000000000000005-0000000000000000005
-rw-rw-r--. 1 datalink datalink 42 Jan 27 12:34 edits_0000000000000000006-0000000000000000007
-rw-rw-r--. 1 datalink datalink 42 Jan 27 13:34 edits_0000000000000000008-0000000000000000009
-rw-rw-r--. 1 datalink datalink 42 Jan 27 14:34 edits_0000000000000000010-0000000000000000011
-rw-rw-r--. 1 datalink datalink 42 Jan 27 15:34 edits_0000000000000000012-0000000000000000013
-rw-rw-r--. 1 datalink datalink 42 Jan 27 16:34 edits_0000000000000000014-0000000000000000015
-rw-rw-r--. 1 datalink datalink 42 Jan 27 17:34 edits_0000000000000000016-0000000000000000017
-rw-rw-r--. 1 datalink datalink 42 Jan 27 18:34 edits_0000000000000000018-0000000000000000019
-rw-rw-r--. 1 datalink datalink 42 Jan 27 19:34 edits_0000000000000000020-0000000000000000021
-rw-rw-r--. 1 datalink datalink 42 Jan 27 20:34 edits_0000000000000000022-0000000000000000023
-rw-rw-r--. 1 datalink datalink 42 Jan 27 21:34 edits_0000000000000000024-0000000000000000025
-rw-rw-r--. 1 datalink datalink 42 Jan 27 22:34 edits_0000000000000000026-0000000000000000027
-rw-rw-r--. 1 datalink datalink 42 Jan 27 23:34 edits_0000000000000000028-0000000000000000029
-rw-rw-r--. 1 datalink datalink 42 Jan 28 00:34 edits_0000000000000000030-0000000000000000031
-rw-rw-r--. 1 datalink datalink 42 Jan 28 01:34 edits_0000000000000000032-0000000000000000033
-rw-rw-r--. 1 datalink datalink 42 Jan 28 02:34 edits_0000000000000000034-0000000000000000035
-rw-rw-r--. 1 datalink datalink 42 Jan 28 03:34 edits_0000000000000000036-0000000000000000037
-rw-rw-r--. 1 datalink datalink 42 Jan 28 04:34 edits_0000000000000000038-0000000000000000039
-rw-rw-r--. 1 datalink datalink 42 Jan 28 05:34 edits_0000000000000000040-0000000000000000041
-rw-rw-r--. 1 datalink datalink 42 Jan 28 06:34 edits_0000000000000000042-0000000000000000043
-rw-rw-r--. 1 datalink datalink 42 Jan 28 07:34 edits_0000000000000000044-0000000000000000045
-rw-rw-r--. 1 datalink datalink 42 Jan 28 08:34 edits_0000000000000000046-0000000000000000047
-rw-rw-r--. 1 datalink datalink 542 Jan 28 09:34 edits_0000000000000000048-0000000000000000055
-rw-rw-r--. 1 datalink datalink 42 Jan 28 10:34 edits_0000000000000000056-0000000000000000057
-rw-rw-r--. 1 datalink datalink 42 Jan 28 11:34 edits_0000000000000000058-0000000000000000059
-rw-rw-r--. 1 datalink datalink 42 Jan 28 12:34 edits_0000000000000000060-0000000000000000061
-rw-rw-r--. 1 datalink datalink 42 Jan 28 13:34 edits_0000000000000000062-0000000000000000063
-rw-rw-r--. 1 datalink datalink 1048576 Jan 28 13:34 edits_inprogress_0000000000000000064
-rw-rw-r--. 1 datalink datalink 506 Jan 28 12:34 fsimage_0000000000000000061
-rw-rw-r--. 1 datalink datalink 62 Jan 28 12:34 fsimage_0000000000000000061.md5
-rw-rw-r--. 1 datalink datalink 506 Jan 28 13:34 fsimage_0000000000000000063
-rw-rw-r--. 1 datalink datalink 62 Jan 28 13:34 fsimage_0000000000000000063.md5
-rw-rw-r--. 1 datalink datalink 3 Jan 28 13:34 seen_txid
-rw-rw-r--. 1 datalink datalink 217 Jan 27 11:52 VERSION
[datalink@master current]$ pwd
/opt/module/hadoop-3.3.0/data/tmp/dfs/name/current

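In the listing above, edits_inprogress_0000000000000000064 shows that the fsimage already covers everything up to the latest finalized segment, edits_0000000000000000062-0000000000000000063; only the in-progress edit log has not been merged yet. When HDFS starts, it only needs to read fsimage_0000000000000000063 and edits_inprogress_0000000000000000064 to restore the current state of the file system. HDFS uses a checkpoint mechanism to merge the edit log into the fsimage periodically, producing a new fsimage.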
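The naming convention in the listing can be read mechanically. The sketch below is a hypothetical helper, not part of Hadoop: given the file names in a NameNode current directory, it picks the newest fsimage and the edit segments that still contain transactions after it.

```python
import re

# Patterns for the file names shown in the listing above.
FSIMAGE_RE = re.compile(r"^fsimage_(\d+)$")
FINALIZED_RE = re.compile(r"^edits_(\d+)-(\d+)$")
INPROGRESS_RE = re.compile(r"^edits_inprogress_(\d+)$")


def files_needed_for_startup(names):
    """Return (newest fsimage name, edit segments with transactions after it)."""
    images, segments = [], []
    for name in names:
        image = FSIMAGE_RE.match(name)
        finalized = FINALIZED_RE.match(name)
        in_progress = INPROGRESS_RE.match(name)
        if image:
            images.append((int(image.group(1)), name))
        elif finalized:
            segments.append((int(finalized.group(1)), int(finalized.group(2)), name))
        elif in_progress:
            # The end of an in-progress segment is unknown, so treat it as open.
            segments.append((int(in_progress.group(1)), float("inf"), name))
    if not images:
        raise ValueError("no fsimage file found")
    last_txid, image_name = max(images)
    replay = [name for _first, last, name in sorted(segments) if last > last_txid]
    return image_name, replay
```

Applied to the listing above, this selects fsimage_0000000000000000063 and leaves only the in-progress segment to replay, which matches the reading given in the text.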

Inspecting the fsimage file

[datalink@master current]$ hdfs oiv -i fsimage_0000000000000000063

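Because merging is expensive in disk I/O, network I/O, and CPU, it is carried out on the SecondaryNameNode.

A checkpoint is triggered when either of these conditions is met:

-- dfs.namenode.checkpoint.period has elapsed (default: 1 hour)

-- or dfs.namenode.checkpoint.txns transactions have accumulated (default: 1,000,000 transaction ids)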
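The two trigger conditions combine with a logical OR. A minimal sketch, using the default values quoted above; the function and parameter names are illustrative and are not Hadoop APIs:

```python
# Defaults quoted in the text above.
CHECKPOINT_PERIOD_S = 3600      # dfs.namenode.checkpoint.period
CHECKPOINT_TXNS = 1_000_000     # dfs.namenode.checkpoint.txns


def should_checkpoint(seconds_since_last, txns_since_last,
                      period_s=CHECKPOINT_PERIOD_S, txn_limit=CHECKPOINT_TXNS):
    """True when either the time limit or the transaction limit is reached."""
    return seconds_since_last >= period_s or txns_since_last >= txn_limit
```

In the directory listing earlier, the finalized segments are one hour apart and contain only two transactions each, which suggests those checkpoints were triggered by the time condition.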

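Merge steps

1. The SNN asks the NN to roll the in-progress editlog file, so that new operations are written to a new editlog file; the NN also updates seen_txid.
2. The SNN fetches the latest fsimage and editlog from the NN via HTTP GET.
3. The SNN loads the fsimage into memory, reads each transaction from the editlog, and applies it, which produces a new fsimage.
4. The SNN sends the new fsimage to the NN via HTTP PUT; the NN saves it as a temporary fsimage.ckpt file.
5. The NN renames the fsimage.ckpt file, which completes the synchronization of fsimage and editlog.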
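The core of the merge (loading the image and replaying transactions on top of it) can be sketched with a toy model. The record layout and the operation names mkdir, create, and delete are hypothetical and chosen for the example; the real edit log is a binary format with many more operation types.

```python
# Toy model: replay edit-log transactions on top of a loaded image.
# image: {"last_txid": int, "paths": set of path strings}
# edits: list of (txid, op, path) tuples; op names are made up for this sketch.
def apply_edits(image, edits):
    paths = set(image["paths"])
    last_txid = image["last_txid"]
    for txid, op, path in sorted(edits, key=lambda e: e[0]):
        if txid <= last_txid:
            continue  # already reflected in the image
        if op in ("mkdir", "create"):
            paths.add(path)
        elif op == "delete":
            prefix = path.rstrip("/") + "/"
            paths = {p for p in paths if p != path and not p.startswith(prefix)}
        else:
            raise ValueError(f"unknown operation: {op}")
        last_txid = txid
    return {"last_txid": last_txid, "paths": paths}
```

The returned dictionary plays the role of the new fsimage: it carries the id of the last transaction it includes, just as the fsimage file names do.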

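The NameNode stores all information about the file system namespace in a file called the FsImage. This file, together with a file recording all transactions (the EditLog), is kept on the NameNode's local file system. The FsImage and EditLog files also need replicated copies, in case the files are corrupted or the NameNode machine is lost.

A single NameNode inevitably carries the risk of being a single point of failure (SPOF). An active/standby setup reduces this exposure but does not remove it entirely, since failover still takes time; solutions such as the Hadoop Non-stop NameNode aim for 100% uptime.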

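DataNodes serve read and write requests from file system clients, and create, delete, and replicate blocks under the direction of the NameNode.

The NameNode relies on periodic heartbeat messages from every DataNode. DataNodes also send block reports, which the NameNode uses to validate its block map and other file system metadata. If a DataNode stops sending heartbeats, the NameNode takes corrective action and re-replicates the blocks that were lost on that node.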
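A minimal sketch of this failure handling: detect nodes whose last heartbeat is too old, then work out which blocks have fallen below their target replica count. The data structures, the timeout value, and the function names are assumptions made for the example, not NameNode internals.

```python
def find_dead_nodes(last_heartbeat, now, timeout_s):
    """last_heartbeat: {node: timestamp}. Nodes silent for longer than timeout_s."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout_s}


def blocks_to_rereplicate(block_locations, dead_nodes, replication=3):
    """block_locations: {block_id: [node, ...]}. Return {block_id: missing replicas}."""
    plan = {}
    for block_id, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead_nodes]
        if len(live) < replication:
            plan[block_id] = replication - len(live)
    return plan
```

For example, if dn3 has been silent past the timeout, every block that had a replica on dn3 shows up in the plan with one missing replica.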

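HDFS is written in Java, so NameNodes and DataNodes can be deployed on any machine that supports Java.

Hadoop is made up of many components. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all storage nodes in a Hadoop cluster.

The layer above HDFS (for the purposes of this article) is the MapReduce engine, which consists of JobTrackers and TaskTrackers.

The core of the Hadoop distributed computing platform: the distributed file system HDFS, the MapReduce processing model, the data warehouse tool Hive, and the distributed database HBase.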

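First, HDFS stores each file's data in blocks, and each block has several replicas placed on different machines. This combination of block storage and replication is the key to HDFS reliability and performance, because:

1. Once a file is stored in blocks it is read block by block, which improves the efficiency of both random reads and concurrent reads.

2. Keeping several replicas of a block on different machines provides reliability and also improves concurrent reads of the same block.

3. Splitting data into blocks fits the MapReduce idea of splitting work into tasks. Here the replica placement policy is in turn the key to HDFS achieving both high reliability and high performance.

HDFS uses a policy called rack awareness to improve data reliability, availability, and network bandwidth utilization. Through a rack-awareness process, the NameNode determines the rack id that each DataNode belongs to. A simple but non-optimal policy is to place replicas on different racks. This prevents data loss when an entire rack fails and lets reads use the bandwidth of multiple racks. It spreads replicas evenly across the cluster, which helps balance load when a component fails, but every write has to transfer blocks to multiple racks, which increases the cost of writes.

In the common case the replication factor is 3, and the HDFS placement policy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. (Recent Hadoop releases document the default slightly differently: the first replica on the writer's node and the other two on two different nodes of one remote rack. Either way a block lands on two racks.) This policy cuts inter-rack write traffic and so improves write performance. Rack failures are far less common than node failures, so the policy does not compromise data reliability and availability. At the same time, because a block is placed on only two distinct racks rather than three, the policy reduces the aggregate network bandwidth used when reading data. Under this policy replicas are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining replicas are evenly distributed across the other racks. The policy improves write performance without compromising data reliability or read performance.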
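A minimal sketch of the three-replica placement as this section describes it (a node on the local rack, another node on the same rack, and a node on a different rack). It assumes the writer runs on a DataNode and uses the writer's own node as the first replica; the topology dictionary and the function name are hypothetical.

```python
import random


def place_replicas(writer, topology, rng=random):
    """topology: {rack_id: [node, ...]}; writer: a node listed in topology.

    Returns three distinct nodes: the writer's node, another node on the
    same rack, and a node on a different rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    same_rack = [n for n in topology[local_rack] if n != writer]
    other_racks = [r for r in topology if r != local_rack and topology[r]]
    if not same_rack or not other_racks:
        raise ValueError("need a second node on the local rack and a second rack")
    second = rng.choice(same_rack)
    third = rng.choice(topology[rng.choice(other_racks)])
    return [writer, second, third]
```

With this layout only one of the three copies crosses a rack boundary during a write, while the loss of either rack still leaves at least one replica.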

When reading, to reduce overall bandwidth consumption and read latency, HDFS tries to have the reader use the replica closest to the client.
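Choosing the closest replica can be sketched with a simple distance function. The distance values used here (0 for the same node, 2 for the same rack, 4 for a different rack) are an assumed convention for the example, and nodes are modeled as (rack, node) tuples.

```python
def distance(a, b):
    """a, b: (rack, node) tuples. Smaller means closer in the network."""
    if a == b:
        return 0      # same node
    if a[0] == b[0]:
        return 2      # same rack, different node
    return 4          # different rack


def closest_replica(client, replicas):
    """Pick the replica location with the smallest distance to the client."""
    return min(replicas, key=lambda replica: distance(client, replica))
```

A client on rack r1 therefore prefers a replica on r1 over one on r2, and a replica on its own node over both.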

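Hadoop file operations

HDFS mainly supports streaming access to large files that have been written to it.

When a client wants to write a file to HDFS, it first caches the file in local temporary storage. Once the cached data exceeds the HDFS block size, a file creation request is sent to the NameNode. The NameNode responds to the client with the DataNode identities and the target block.

It also notifies the DataNodes that will hold replicas of the block. When the client starts sending the temporary file to the first DataNode, the block contents are immediately forwarded to the replica DataNodes in a pipeline. The client is also responsible for creating the checksum files, which are saved in the same HDFS namespace.

After the last block has been sent, the NameNode commits the file creation to its persistent metadata store (the EditLog and FsImage files).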
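The pipelined forwarding can be illustrated with a small simulation. This is a sketch only: the packet size and names are hypothetical, and the hops run sequentially here whereas real DataNodes forward while they are still receiving.

```python
def pipeline_write(block, datanodes, packet_size=4):
    """Send a block through the DataNodes in order, packet by packet.

    Each node stores the packet and then passes it on to the next node.
    Returns {datanode: bytes stored}.
    """
    stored = {dn: bytearray() for dn in datanodes}
    for start in range(0, len(block), packet_size):
        packet = block[start:start + packet_size]
        for dn in datanodes:          # hop 1 -> hop 2 -> hop 3
            stored[dn] += packet
    return {dn: bytes(data) for dn, data in stored.items()}
```

The client sends each packet once, to the first node only; every further copy is produced by node-to-node forwarding.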

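A Hadoop cluster follows the master/slave model: the NameNode and JobTracker are masters, and the DataNodes and TaskTrackers are slaves. There is one master and there are many slaves.

·From the distributed application (MapReduce) point of view: the cluster consists of one JobTracker and multiple TaskTrackers. The JobTracker schedules tasks and the TaskTrackers execute them in parallel. TaskTrackers must run on DataNodes so that computation can happen local to the data, whereas the JobTracker and the NameNode do not have to be on the same machine.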

file: the local file system

[root@host ~]# hdfs dfs -ls file:/root/tmpdata
Found 14 items
-rw-r--r-- 1 root root 46 2017-10-24 17:03 file:///root/tmpdata/20171024.txt
-rw-r--r-- 1 root root 38 2017-10-24 17:04 file:///root/tmpdata/20171026.txt
........................

hdfs: the distributed file system

[root@host ~]# hdfs dfs -ls /
Found 4 items
drwxr-xr-x - root supergroup 0 2017-09-20 16:59 /scalashell
drwxr-xr-x - root supergroup 0 2017-09-15 16:35 /system
drwx-wx-wx - root supergroup 0 2017-10-24 16:07 /tmp
drwx-wx-wx - root supergroup 0 2017-09-06 18:48 /user
[root@host ~]# hdfs dfs -ls hdfs:/
Found 4 items
drwxr-xr-x - root supergroup 0 2017-09-20 16:59 hdfs:///scalashell
drwxr-xr-x - root supergroup 0 2017-09-15 16:35 hdfs:///system
drwx-wx-wx - root supergroup 0 2017-10-24 16:07 hdfs:///tmp
drwx-wx-wx - root supergroup 0 2017-09-06 18:48 hdfs:///user

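Thrift is a software framework developed by Facebook for scalable cross-language service development.

It addresses high-volume data transfer and communication between systems.

Creating directories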

[root@host ~]# hdfs dfs -mkdir /tianyongtao

[root@host ~]# hdfs dfs -mkdir /tianyongtao/20171206/

[root@host ~]# hdfs dfs -lsr /tianyongtao
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x - root supergroup 0 2017-12-06 15:37 /tianyongtao/20171206

[root@host ~]# hdfs dfs -ls -R /tianyongtao
drwxr-xr-x - root supergroup 0 2017-12-06 15:37 /tianyongtao/20171206

Deleting a directory that contains subdirectories

[root@host ~]# hdfs dfs -rm -R /tianyongtao
17/12/06 15:43:04 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tianyongtao
[root@host ~]# hdfs dfs -ls /
Found 4 items
drwxr-xr-x - root supergroup 0 2017-09-20 16:59 /scalashell
drwxr-xr-x - root supergroup 0 2017-09-15 16:35 /system
drwx-wx-wx - root supergroup 0 2017-10-24 16:07 /tmp
drwx-wx-wx - root supergroup 0 2017-09-06 18:48 /user

appendToFile

Usage: hadoop fs -appendToFile <localsrc> ... <dst>

Appends one or more files from the local file system to the destination file.

[root@host ~]# hdfs dfs -text /tmp/20171024/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50

[root@host ~]# hdfs dfs -appendToFile /root/tmpdata/20171026.txt /tmp/20171024/20171024.txt
[root@host ~]# hdfs dfs -text /tmp/20171024/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100

// /root/tmpdata contains two text files whose names start with 201710

[root@host ~]# hdfs dfs -appendToFile /root/tmpdata/201710* /tmp/20171024/20171024.txt
[root@host ~]# hdfs dfs -text /tmp/20171024/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100

cat

Usage: hadoop fs -cat [-ignoreCrc] URI [URI ...]

Copies source paths to stdout.

[root@host ~]# hdfs dfs -cat /tmp/20171024/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100

[root@host ~]# hdfs dfs -cat file:/root/tmpdata/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50

[root@host ~]# hdfs dfs -text file:/root/tmpdata/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50

checksum

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

[root@host ~]# hdfs dfs -checksum file:/root/tmpdata/20171024.txt
file:///root/tmpdata/20171024.txt NONE
[root@host ~]# hdfs dfs -checksum /tmp/20171024/20171024.txt
/tmp/20171024/20171024.txt MD5-of-0MD5-of-512CRC32C 000002000000000000000000b4e4ea708a2db35d3c8479d0a3362814

chgrp

Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]

Change group association of files. The user must be the owner of files, or else a super-user. Additional information is in the Permissions Guide.

chmod

Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.

chown

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

Change the owner of files. The user must be a super-user. Additional information is in the Permissions Guide.

Options:

The -R option will make the change recursively through the directory structure.

copyFromLocal

Usage: hadoop fs -copyFromLocal <localsrc> URI

Similar to the fs -put command, except that the source is restricted to a local file reference.

[root@host ~]# hdfs dfs -copyFromLocal /root/tmpdata/hello.txt /tmp/20171024/

[root@host ~]# hdfs dfs -ls -R /tmp/20171024/h*
-rw-r--r-- 1 root supergroup 292 2017-12-07 14:16 /tmp/20171024/hello.txt

copyToLocal

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to get command, except that the destination is restricted to a local file reference.

[root@host ~]# hdfs dfs -copyToLocal /tmp/20171024/avg.txt /root/tmpdata/avg

[root@host avg]# ls
part-00000 part-00001 _SUCCESS

count

Usage: hadoop fs -count [-q] [-h] [-v] <paths>

Count the number of directories, files and bytes under the paths that match the specified file pattern.

The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The output columns with -count -q are:

QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The -h option shows sizes in human readable format.

[root@host ~]# hdfs dfs -count /tmp/20171024
6 26 2602 /tmp/20171024
[root@host ~]# hdfs dfs -count -q /tmp/20171024
none inf none inf 6 26 2602 /tmp/20171024
[root@host ~]# hdfs dfs -count -q -h /tmp/20171024
none inf none inf 6 26 2.5 K /tmp/20171024

cp

Usage: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.

[root@host ~]# hdfs dfs -cp /tmp/20171024/hello.txt /tmp/
[root@host ~]# hdfs dfs -ls /tmp/
Found 4 items
drwxr-xr-x - root supergroup 0 2017-10-13 16:06 /tmp/20170915
drwxr-xr-x - root supergroup 0 2017-12-07 14:16 /tmp/20171024
-rw-r--r-- 1 root supergroup 292 2017-12-07 14:34 /tmp/hello.txt
drwx-wx-wx - root supergroup 0 2017-11-02 16:42 /tmp/hive

createSnapshot

See HDFS Snapshots Guide.

deleteSnapshot

See HDFS Snapshots Guide.

df

Usage: hadoop fs -df [-h] URI [URI ...]

Displays free space.

[root@host ~]# hdfs dfs -df /tmp/
Filesystem Size Used Available Use%
hdfs://localhost:9000 483794378752 262447 452775460864 0%
[root@host ~]# hdfs dfs -df -h /tmp/
Filesystem Size Used Available Use%
hdfs://localhost:9000 450.6 G 256.3 K 421.7 G 0%

du

Usage: hadoop fs -du [-s] [-h] URI [URI ...]

Displays sizes of files and directories contained in the given directory or the length of a file in case it's just a file.

[root@host ~]# hdfs dfs -du /tmp/
171 /tmp/20170915
2602 /tmp/20171024
292 /tmp/hello.txt
0 /tmp/hive
[root@host ~]# hdfs dfs -du -h /tmp/
171 /tmp/20170915
2.5 K /tmp/20171024
292 /tmp/hello.txt
0 /tmp/hive
[root@host ~]# hdfs dfs -du -s /tmp/
3065 /tmp
[root@host ~]# hdfs dfs -du -h -s /tmp/
3.0 K /tmp
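The -h figures in the df and du listings above are consistent with 1024-based units and one decimal place. The following sketch reproduces them under that assumption; it is an approximation written for these notes, not the formatter Hadoop itself uses.

```python
def human_size(num_bytes):
    """Format a byte count with 1024-based units and one decimal place."""
    units = ["", "K", "M", "G", "T", "P"]
    value = float(num_bytes)
    unit = 0
    while value >= 1024 and unit < len(units) - 1:
        value /= 1024
        unit += 1
    if unit == 0:
        return str(num_bytes)      # plain byte counts are printed as-is
    return f"{value:.1f} {units[unit]}"


if __name__ == "__main__":
    for size in (171, 2602, 3065, 262447, 483794378752):
        print(size, "->", human_size(size))
```

Checking it against the transcripts: 2602 bytes corresponds to the 2.5 K shown for /tmp/20171024, and 483794378752 bytes to the 450.6 G shown by df -h.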

dus (deprecated)

Usage: hadoop fs -dus <args>

Displays a summary of file lengths.

Note: This command is deprecated. Instead use hadoop fs -du -s.

find

Usage: hadoop fs -find <path> ... <expression> ...

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults

to the current working directory.

If no expression is specified then defaults to -print.

[root@host ~]# hdfs dfs -find /tmp/ -name hello.txt
/tmp/20171024/hello.txt
/tmp/hello.txt
[root@host ~]# hdfs dfs -find /tmp/20171024 -name hello.txt
/tmp/20171024/hello.txt

-----------------

The -h option shows sizes in human readable format.

The -v option displays a header line.

-p : Preserves access and modification times, ownership and the permissions.

-f : Overwrites the destination if it already exists.

---------------------------------------------

get

Usage: hadoop fs -get [-ignorecrc] [-crc] [-p] [-f] <src> <localdst>

Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option.

Files and CRCs may be copied using the -crc option.

[root@host ~]# hdfs dfs -get /tmp/20171024/hello.txt /root/tmpdata/
get: `/root/tmpdata/hello.txt': File exists
[root@host ~]# hdfs dfs -get -f /tmp/20171024/hello.txt /root/tmpdata/
-get: Illegal option -f
Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>

getfacl

Usage: hadoop fs -getfacl [-R] <path>

Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL,

then getfacl also displays the default ACL.

[root@host ~]# hdfs dfs -getfacl /tmp/
# file: /tmp
# owner: root
# group: supergroup
getfacl: The ACL operation has been rejected. Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.

getfattr

Usage: hadoop fs -getfattr [-R] -n name | -d [-e en] <path>

Displays the extended attribute names and values (if any) for a file or directory.

getmerge

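hadoop fs -getmerge <hdfs dir> <local file>

Merges all files under the given HDFS directory, in sorted order, into the specified local file. The local file is created if it does not exist and its contents are overwritten if it does.

hadoop fs -getmerge -nl <hdfs dir> <local file>

With -nl, a newline character is added at the end of each HDFS file as it is merged into the local file.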
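The merge behavior can be mimicked locally to see what -nl changes. This is a stand-in that works on an in-memory dictionary of file contents; it does not call HDFS, and the function name is made up for the example.

```python
def merge_parts(parts, add_newline=False):
    """parts: {file_name: bytes}. Concatenate contents in file-name order.

    With add_newline=True a newline is appended after each file,
    mirroring what the -nl option is described as doing.
    """
    merged = bytearray()
    for name in sorted(parts):
        merged += parts[name]
        if add_newline:
            merged += b"\n"
    return bytes(merged)
```

This matters mostly for part files that do not end in a newline: without the option, the last record of one file and the first record of the next end up on the same line.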

ls

Usage: hadoop fs -ls [-d] [-h] [-R] <args>

Options:

-d: Directories are listed as plain files.
-h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
-R: Recursively list subdirectories encountered.

mkdir

Usage: hadoop fs -mkdir [-p] <paths>

Takes path uri’s as argument and creates directories.

[root@host ~]# hdfs dfs -mkdir /tmp/20171207 /tmp/20171206
[root@host ~]# hdfs dfs -ls /tmp/
Found 6 items
drwxr-xr-x - root supergroup 0 2017-10-13 16:06 /tmp/20170915
drwxr-xr-x - root supergroup 0 2017-12-07 14:16 /tmp/20171024
drwxr-xr-x - root supergroup 0 2017-12-07 16:04 /tmp/20171206
drwxr-xr-x - root supergroup 0 2017-12-07 16:04 /tmp/20171207
-rw-r--r-- 1 root supergroup 292 2017-12-07 14:34 /tmp/hello.txt
drwx-wx-wx - root supergroup 0 2017-11-02 16:42 /tmp/hive

moveFromLocal

Usage: hadoop fs -moveFromLocal <localsrc> <dst>

Similar to put command, except that the source localsrc is deleted after it’s copied.

moveToLocal

Usage: hadoop fs -moveToLocal [-crc] <src> <dst>

Displays a “Not implemented yet” message.

mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination

needs to be a directory. Moving files across file systems is not permitted.

[root@host ~]# hdfs dfs -ls -R /tmp/ | grep hello
-rw-r--r-- 1 root supergroup 292 2017-12-07 14:16 /tmp/20171024/hello.txt
-rw-r--r-- 1 root supergroup 292 2017-12-07 14:34 /tmp/hello.txt
[root@host ~]# hdfs dfs -mv /tmp/hello.txt /tmp/20171207/
[root@host ~]# hdfs dfs -ls -R /tmp/ | grep hello
-rw-r--r-- 1 root supergroup 292 2017-12-07 14:16 /tmp/20171024/hello.txt
-rw-r--r-- 1 root supergroup 292 2017-12-07 14:34 /tmp/20171207/hello.txt

put

Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> .. ]. <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”

Copying fails if the file already exists, unless the -f flag is given.

Options:

-p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk, Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
-d : Skip creation of temporary file with the suffix ._COPYING_.

rm

Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI ...]

Delete files specified as args.

The -R option deletes the directory and any content under it recursively.
The -r option is equivalent to -R.

Delete a directory and everything under it:

hdfs dfs -rm -r <directory>

or: hdfs dfs -rmr <directory>

rmr

Usage: hadoop fs -rmr [-skipTrash] URI [URI ...]

Recursive version of delete.

Note: This command is deprecated. Instead use hadoop fs -rm -r.

tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

Options:

The -f option will output appended data as the file grows, as in Unix.

[root@host ~]# hdfs dfs -tail /tmp/20171024/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100

test

Usage: hadoop fs -test -[defsz] URI

Options:

-d: if the path is a directory, return 0.
-e: if the path exists, return 0.
-f: if the path is a file, return 0.
-s: if the path is not empty, return 0.
-z: if the file is zero length, return 0.

Example:

hadoop fs -test -e filename

text

Usage: hadoop fs -text <src>

Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

[root@host ~]# hdfs dfs -text /tmp/20171024/20171024.txt
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100
10 20 30 50 80 100 60 90 60 60 31 80 70 51 50
100 500 600 800 10 30 66 96 89 80 100

touchz

Usage: hadoop fs -touchz URI [URI ...]

Create a file of zero length. An error is returned if the file exists with non-zero length.

[root@host ~]# hdfs dfs -touchz /tmp/20171024/20171029.txt
[root@host ~]# hdfs dfs -ls -R /tmp | grep 20171029
-rw-r--r-- 1 root supergroup 0 2017-12-07 16:35 /tmp/20171024/20171029.txt

truncate

Usage: hadoop fs -truncate [-w] <length> <paths>

Truncate all files that match the specified file pattern to the specified length.

Options:

The -w flag requests that the command waits for block recovery to complete, if necessary. Without -w flag the file may remain unclosed for some time while the recovery is in progress. During this time file cannot be reopened for append.

[root@host ~]# hdfs dfs -cat /tmp/20171207/hello.txt
1#2#3#4#5#6#7#8#9#10#11#12#13#14#15#16#17#18#19#20#21#22#23#24#25#26#27#28#29#30#31#32#33#34#35#36#37#38#39#40#41#42#43#44#45#46#47#48#49#50#51#52#53#54#55#56#57#58#59#60#61#62#63#64#65#66#67#68#69#70#71#72#73#74#75#76#77#78#79#80#81#82#83#84#85#86#87#88#89#90#91#92#93#94#95#96#97#98#99#100#

[root@host ~]# hdfs dfs -truncate -w 50 /tmp/20171207/hello.txt
Waiting for /tmp/20171207/hello.txt ...
Truncated /tmp/20171207/hello.txt to length: 50
[root@host ~]# hdfs dfs -text /tmp/20171207/hello.txt
1#2#3#4#5#6#7#8#9#10#11#12#13#14#15#16#17#18#19#20 //50
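The truncate result above can be checked by hand: the file holds the numbers 1 to 100, each followed by #, which is 292 bytes (the size shown for hello.txt in the earlier listings), and cutting at byte 50 ends right after 20. A small local sketch, with no HDFS involved; the variable and function names are chosen for the example:

```python
# Rebuild the file content shown above: "1#2#...#100#".
content = "".join(f"{i}#" for i in range(1, 101)).encode("ascii")


def truncate(data, length):
    """Keep only the first `length` bytes; growing a file is not truncation."""
    if length > len(data):
        raise ValueError("new length must not exceed the current length")
    return data[:length]


if __name__ == "__main__":
    print(len(content))
    print(truncate(content, 50).decode("ascii"))
```

Because truncation works on bytes, a cut position that falls in the middle of a record leaves a partial record at the end of the file.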

fsck

Checks the file system for inconsistencies.

[root@host ~]# hdfs fsck /
Connecting to namenode via http://localhost:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /127.0.0.1 for path / at Thu Dec 07 16:52:35 CST 2017
....................................Status: HEALTHY
Total size: 3561 B
Total dirs: 69
Total files: 36
Total symlinks: 0
Total blocks (validated): 28 (avg. block size 127 B)
Minimally replicated blocks: 28 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Thu Dec 07 16:52:35 CST 2017 in 6 milliseconds

The filesystem under path '/' is HEALTHY

You can use * to match zero or more characters and ? to match a single character.

[root@host ~]# hdfs dfs -ls -R /tmp/20170915/t*
-rw-r--r-- 1 root supergroup 66 2017-09-29 09:59 /tmp/20170915/tian.txt
-rw-r--r-- 1 root supergroup 7 2017-09-15 16:02 /tmp/20170915/tyt.txt

[root@host ~]# hdfs dfs -ls -R /tmp/20170915/t?an.txt
-rw-r--r-- 1 root supergroup 66 2017-09-29 09:59 /tmp/20170915/tian.txt
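The same two wildcards can be tried out locally with Python's fnmatch module, used here as a stand-in: its * and ? behave as described above, although Hadoop's glob syntax has further features that fnmatch does not share. The file list is made up from the names in the transcripts.

```python
from fnmatch import fnmatchcase

# File names taken from the listings above, plus one that should not match.
names = ["tian.txt", "tyt.txt", "hello.txt"]


def match(pattern, candidates):
    """Return the candidates that match a * / ? wildcard pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    print(match("t*", names))
    print(match("t?an.txt", names))
```

The pattern t* keeps both files starting with t, while t?an.txt requires exactly one character between t and an.txt.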

Deleting a directory

[datalink@slave4 ~]$ hdfs dfs -mkdir /test/20210128

[datalink@slave4 ~]$ hdfs dfs -ls -R /
-rw-r--r-- 3 datalink supergroup 42 2021-01-28 09:28 /helloworld
drwxr-xr-x - datalink supergroup 0 2021-01-28 14:29 /test
drwxr-xr-x - datalink supergroup 0 2021-01-28 14:29 /test/20210128
[datalink@slave4 ~]$ hdfs dfs -rm -r /test
Deleted /test
[datalink@slave4 ~]$ hdfs dfs -ls -R /
-rw-r--r-- 3 datalink supergroup 42 2021-01-28 09:28 /helloworld