"OD Learns Hadoop" Week 2 (07/03)

HDFS web UI: http://beifeng-hadoop-01:50070/dfshealth.html#tab-overview

YARN web UI: http://beifeng-hadoop-01:8088/cluster

JobHistory server web UI: http://beifeng-hadoop-01:19888/

 

Start the daemons:

sbin/hadoop-daemon.sh start namenode

sbin/hadoop-daemon.sh start datanode

sbin/yarn-daemon.sh start resourcemanager

sbin/yarn-daemon.sh start nodemanager

sbin/mr-jobhistory-daemon.sh start historyserver

Stop the daemons:

sbin/hadoop-daemon.sh stop namenode

sbin/hadoop-daemon.sh stop datanode

sbin/yarn-daemon.sh stop resourcemanager

sbin/yarn-daemon.sh stop nodemanager

sbin/mr-jobhistory-daemon.sh stop historyserver

I. Replacing the Native Libraries

mv native/ bak_native

tar -zxf native-**.gz -C /opt/modules/hadoop-2.5.0/lib
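
To verify that the replacement took effect, Hadoop 2.x ships a checknative command that reports whether the native libraries (zlib, snappy, etc.) were loaded:

bin/hadoop checknative -a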

II. SecondaryNameNode

1. The namenode stores the metadata of the entire file system.

2. Formatting creates a directory.

3. Formatting also produces the initial file-system metadata:

bin/hdfs namenode -format

4. At runtime, the metadata is held in memory.

5. Before the namenode starts, the metadata lives in files on the local file system.

6. Formatting generates an fsimage file,

which is, strictly speaking, an image of the file system that stores the metadata.

7. Any operation on HDFS, such as an upload or a create, changes the metadata.

8. The operations performed on HDFS are recorded in an operation log:

the edits log (edit log files).

9. With the edit log in place, the next time the namenode starts it first reads the fsimage,

then replays the edits log, so no metadata is lost.

10. What if a service process periodically merged fsimage and edits?

11. The SecondaryNameNode reads fsimage and edits into memory,

writes the merged contents out to a new fsimage file (the two old files are then no longer needed), and a fresh edits file is created to continue recording.

Note: reading the fsimage is fast; replaying the edits log is slow.
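
Both files can be inspected offline with the viewers bundled in bin/hdfs (oiv for an fsimage, oev for an edits file; both appear in the command list in section X). A quick sketch, with placeholder input file names:

bin/hdfs oiv -p XML -i <fsimage file> -o fsimage.xml
bin/hdfs oev -i <edits file> -o edits.xml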

12. The SecondaryNameNode's roles:

(1) merging fsimage and edits;

(2) reducing the namenode's next startup time.

13. Configuration:

hdfs-site.xml

<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>beifeng-hadoop-01:50090</value>
</property>

14. Start command:

$ sbin/hadoop-daemon.sh start secondarynamenode

 http://beifeng-hadoop-01:50090/status.html

Checkpoint files (fsimage and edits) are kept under file:///opt/modules/hadoop-2.5.0/data/tmp/dfs/namesecondary

III. Configuring the HDFS Storage Directories
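
The storage locations derive from hadoop.tmp.dir (core-site.xml) and can be set explicitly in hdfs-site.xml. A hedged sketch, with paths matching the data/tmp directory used elsewhere in these notes:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/modules/hadoop-2.5.0/data/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/modules/hadoop-2.5.0/data/tmp/dfs/data</value>
</property>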

IV. Configuration Files, Client, and Server

1. Hadoop has two kinds of configuration files:

the defaults and the user-defined (site) files.

To improve cluster performance, you modify the configuration.

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Each module has a corresponding configuration file.

2. Files loaded at startup:

(1) First, the default configuration files are loaded.

(2) Then the user-defined site files are loaded:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml

3. User-defined configuration files take precedence over the defaults.

4. HDFS involves four configuration files:

  • core-default.xml
  • hdfs-default.xml
  • core-site.xml
  • hdfs-site.xml

5. Server side: the namenode and datanode read the configuration files when they start.

6. Client side: client commands likewise read the configuration files (from the client's classpath) when they run.
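
You can check which value the client actually resolves for a property with getconf (listed in the bin/hdfs usage in section X), for example:

bin/hdfs getconf -confKey fs.defaultFS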

VII. Passwordless SSH Login

Cluster commands are executed through scripts, and the scripts connect to the remote machines over the SSH protocol.

Generate a key pair: ssh-keygen -t rsa

id_rsa (the private key)

id_rsa.pub (the public key)

Copy the public key to the target host: ssh-copy-id beifeng-hadoop-01

known_hosts (hosts this machine has previously connected to)

authorized_keys (public keys allowed to log in to this machine)
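
To confirm passwordless login works, run a remote command; it should complete without a password prompt:

ssh beifeng-hadoop-01 date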

VIII. The Ecosystem Around Hadoop 2.x

Compute framework: MapReduce

Container for compute frameworks: YARN

Data storage: HDFS

Operating system: CentOS

Data sources: relational databases, log files ======> HDFS

Sqoop: table data in relational databases <==> HDFS

http://blog.csdn.net/yfkiss/article/details/8700480

Flume: extracts data from log files in real time, monitoring the log files ==> HDFS

Zookeeper: distributed coordination framework

Hive: HiveQL statements, compiled into MapReduce jobs

Pig: a dataflow programming language

Real-time queries: fast lookups in a table with hundreds of millions of rows; Bigtable -> HBase (a distributed database)

Oozie: a framework that lets you combine multiple Map/Reduce jobs into one logical unit of work

CM (Cloudera Manager): integrates the components listed above

IX. HDFS Architecture

1. namenode

2. datanode

File blocks are replicated to keep the data safe.

HDFS suits large data sets, at GB or TB scale.

Scenarios HDFS is not suited for:

handling large numbers of small files;

multiple writers, or arbitrary modification of existing files.

HDFS file and directory metadata is stored in the binary fsimage file; subsequent changes are recorded in the edits log.

What the namenode does with the fsimage at startup:

(1) reads every directory and file recorded in HDFS from the fsimage;

(2) initializes the metadata for each directory and file;

(3) rebuilds the entire namespace in memory from the directory and file paths;

(4) for a file, it additionally loads the file's block list (block locations are reported by the datanodes at runtime).
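
As a concrete illustration, the latest fsimage can be downloaded from the namenode with dfsadmin (the option appears in the dfsadmin usage in section X); the local target directory here is illustrative:

bin/hdfs dfsadmin -fetchImage /tmp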

X. HDFS Shell Commands

[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs dfs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] <localsrc> ... <dst>]
        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] <path> ...]
        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] <path> ...]
        [-expunge]
        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] <src> <localdst>]
        [-help [cmd ...]]
        [-ls [-d] [-h] [-R] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs
Usage: hdfs [--config confdir] COMMAND
       where COMMAND is one of:
  dfs                  run a filesystem command on the file systems supported in Hadoop.
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode.
  oiv                  apply the offline fsimage viewer to an fsimage
  oiv_legacy           apply the offline fsimage viewer to an legacy fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the
                       current directory contents with a snapshot
  lsSnapshottableDir   list all snapshottable dirs owned by the current user
                                                Use -help to see options
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway
  cacheadmin           configure the HDFS cache

Most commands print help when invoked w/o parameters.
[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs dfsadmin
Usage: java DFSAdmin
Note: Administrative commands can only be run as the HDFS superuser.
           [-report]
           [-safemode enter | leave | get | wait]
           [-allowSnapshot <snapshotDir>]
           [-disallowSnapshot <snapshotDir>]
           [-saveNamespace]
           [-rollEdits]
           [-restoreFailedStorage true|false|check]
           [-refreshNodes]
           [-finalizeUpgrade]
           [-rollingUpgrade [<query|prepare|finalize>]]
           [-metasave filename]
           [-refreshServiceAcl]
           [-refreshUserToGroupsMappings]
           [-refreshSuperUserGroupsConfiguration]
           [-refreshCallQueue]
           [-refresh]
           [-printTopology]
           [-refreshNamenodes datanodehost:port]
           [-deleteBlockPool datanode-host:port blockpoolId [force]]
           [-setQuota <quota> <dirname>...<dirname>]
           [-clrQuota <dirname>...<dirname>]
           [-setSpaceQuota <quota> <dirname>...<dirname>]
           [-clrSpaceQuota <dirname>...<dirname>]
           [-setBalancerBandwidth <bandwidth in bytes per second>]
           [-fetchImage <local directory>]
           [-shutdownDatanode <datanode_host:ipc_port> [upgrade]]
           [-getDatanodeInfo <datanode_host:ipc_port>]
           [-help [cmd]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]


XI. Safe Mode

[beifeng@beifeng-hadoop-01 hadoop-2.5.0]$ bin/hdfs dfsadmin -safemode
Usage: java DFSAdmin [-safemode enter | leave | get | wait]

In safe mode, the file system cannot be modified: writes, deletes, and renames are rejected (reads still work).

bin/hdfs dfsadmin -safemode get

bin/hdfs dfsadmin -safemode enter

bin/hdfs dfsadmin -safemode leave
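
The wait option blocks until the namenode leaves safe mode on its own, which is useful in startup scripts before submitting jobs:

bin/hdfs dfsadmin -safemode wait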

XII. Installing Eclipse and Maven
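
For the HDFS API examples in section XIV, a minimal Maven dependency sketch (assuming the Hadoop 2.5.0 artifacts from Maven Central, matching the cluster version used in these notes):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.5.0</version>
</dependency>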

XIII. YARN

A resource management system: resource allocation and resource isolation.
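
As a hedged sketch, the resources a nodemanager offers to the cluster are declared in yarn-site.xml; the property names are standard in Hadoop 2.x, but the values below are only illustrative:

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
</property>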

XIV. HDFS API

Useful links (Maven repositories; the Hadoop 2.5.2 API docs):

http://hadoop.apache.org/docs/r2.5.2/api/index.html

1. Getting the HDFS FileSystem

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Reads core-site.xml / hdfs-site.xml from the classpath.
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);
System.out.println(fileSystem);
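
A minimal runnable sketch building on the snippet above; the class name and the namenode address are assumptions, not from the original notes (take fs.defaultFS from your core-site.xml):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed namenode RPC address; use the fs.defaultFS value from your core-site.xml.
        conf.set("fs.defaultFS", "hdfs://beifeng-hadoop-01:8020");
        FileSystem fs = FileSystem.get(conf);
        // List the HDFS root directory and print each entry with its length.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
        fs.close();
    }
}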

XV. The Ecosystem Around YARN

1. Hortonworks: BATCH (MapReduce)

2. Services that run on YARN: long-running services and short-lived services.

3. Apache Slider

4. Solr

Original post: https://www.cnblogs.com/yeahwell/p/5636891.html