Notes on HDFS and YARN Cluster High-Availability Configuration

HDFS/YARN cluster high-availability configuration (ZooKeeper-based):

[hadoop@master01 hadoop]$ vi core-site.xml
Configuration:
----
<configuration>

<!-- Logical nameservice URI; clients use hdfs://ns1 instead of a fixed host -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>

<!-- ZooKeeper quorum used for automatic failover -->
<property>
<name>ha.zookeeper.quorum</name>
<value>slaver01:2181,slaver02:2181,slaver03:2181</value>
</property>

<!-- Working/temporary directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/software/hadoop-2.7.3/work</value>
</property>

</configuration>
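
To confirm the nameservice setting has taken effect, a quick check (assuming the Hadoop client scripts are on the PATH, as the prompts here suggest):

[hadoop@master01 hadoop]$ hdfs getconf -confKey fs.defaultFS   # should print hdfs://ns1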

[hadoop@master01 hadoop]$ vi hdfs-site.xml
Configuration (including the QJournal cluster settings):
------
<configuration>

<!-- Block replication factor -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>

<!-- Logical nameservice ID -->
<property>
<name>dfs.nameservices</name>
<value>ns1</value>
</property>

<!-- The two NameNodes that make up nameservice ns1 -->
<property>
<name>dfs.ha.namenodes.ns1</name>
<value>nn1,nn2</value>
</property>

<property>
<name>dfs.namenode.rpc-address.ns1.nn1</name>
<value>master01:9000</value>
</property>

<property>
<name>dfs.namenode.http-address.ns1.nn1</name>
<value>master01:50070</value>
</property>

<property>
<name>dfs.namenode.rpc-address.ns1.nn2</name>
<value>master02:9000</value>
</property>

<property>
<name>dfs.namenode.http-address.ns1.nn2</name>
<value>master02:50070</value>
</property>

<!-- Shared edits directory on the QJournal cluster -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://slaver01:8485;slaver02:8485;slaver03:8485/QJID</value>
</property>

<!-- Directory where each JournalNode stores its edits locally -->

<property>
<name>dfs.journalnode.edits.dir</name>
<value>/software/hadoop-2.7.3/QJMateData</value>
</property>

<!-- Enable automatic failover when the active NameNode process dies -->

<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

<!-- Failover proxy provider class used by HDFS clients -->

<property>
<name>dfs.client.failover.proxy.provider.ns1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- Fencing: try sshfence first; if it fails, fall back to the ensure.sh script -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>
sshfence
shell(/software/hadoop-2.7.3/ensure.sh)
</value>
</property>

<!-- Private key used by sshfence to log in to the other NameNode host -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>

<!-- SSH connect timeout for fencing, in milliseconds -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>

</configuration>
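
The fencing configuration above references /software/hadoop-2.7.3/ensure.sh, whose contents are not shown in the original post. A minimal sketch of the usual pattern (an assumption, not the author's actual script): sshfence fails when the dead master is completely unreachable over SSH, so a shell fallback that simply reports success lets the failover still proceed.

#!/bin/bash
# Hypothetical fallback fencing script (contents assumed, not from the post).
# If sshfence cannot reach the failed host at all, returning success here
# tells the ZKFC that fencing is done, so the standby can be promoted.
exit 0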

<!-- Configure YARN cluster high availability -->
[hadoop@master01 hadoop]$ vi yarn-site.xml
Configuration:
<!--HA Config-->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>

<!-- Cluster ID; any value can be chosen -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>RMHA</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>

<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>master01</value>
</property>

<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>master02</value>
</property>
<!-- ZooKeeper quorum used by the ResourceManagers -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>slaver01:2181,slaver02:2181,slaver03:2181</value>
</property>


<!-- Copy the configuration above to all nodes -->
[hadoop@master01 hadoop]$ scp -r core-site.xml hdfs-site.xml yarn-site.xml master02:/software/hadoop-2.7.3/etc/hadoop/
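
Since the comment says to copy the files to every node while the command only shows master02, a sketch of a loop over the remaining hosts (hostnames taken from earlier in this post):

[hadoop@master01 hadoop]$ for host in master02 slaver01 slaver02 slaver03; do
> scp core-site.xml hdfs-site.xml yarn-site.xml ${host}:/software/hadoop-2.7.3/etc/hadoop/
> done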


Start the ZooKeeper cluster on the slaver nodes:
--------
[hadoop@slaver01 hadoop-2.7.3]$ cd /software/zookeeper-3.4.10/bin/ && ./zkServer.sh start && cd - && jps
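
To confirm each node's role after starting, zkServer.sh status can be run on every slaver; one node should report itself as leader and the others as followers:

[hadoop@slaver01 bin]$ ./zkServer.sh status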


Start the QJournal cluster on the slaver nodes:
----------
1. Start the JournalNode process on all slaver nodes:
[hadoop@slaver01 hadoop-2.7.3]$ hadoop-daemon.sh start journalnode
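
jps on each slaver should now show a JournalNode process next to the QuorumPeerMain started in the previous step:

[hadoop@slaver01 hadoop-2.7.3]$ jps | grep -E 'JournalNode|QuorumPeerMain'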

2. Format HDFS (only once, on master01):
[hadoop@master01 software]$ hdfs namenode -format

3. Copy the work directory (the NameNode metadata) to the matching path on master02 (an alternative is sketched below):
[hadoop@master01 software]$ scp -r hadoop-2.7.3/work/ master02:/software/hadoop-2.7.3/
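
As an alternative to copying the metadata by hand, Hadoop 2.x can also initialize the standby NameNode from the active one; shown here as an option, not what the original post does:

[hadoop@master02 ~]$ hdfs namenode -bootstrapStandby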

4. Initialize the ZKFC state in ZooKeeper:
[hadoop@master01 software]$ hdfs zkfc -formatZK
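
hdfs zkfc -formatZK creates a /hadoop-ha znode in ZooKeeper; it can be verified from any slaver with the ZooKeeper CLI:

[hadoop@slaver01 bin]$ ./zkCli.sh -server slaver01:2181
[zk: slaver01:2181(CONNECTED) 0] ls /hadoop-ha   # should list [ns1]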

5. Start the HDFS cluster:
[hadoop@master01 software]$ start-dfs.sh
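
After start-dfs.sh, jps is a quick sanity check; on this layout the masters should show NameNode and DFSZKFailoverController, and the slavers DataNode alongside the JournalNode and QuorumPeerMain started earlier:

[hadoop@master01 software]$ jps   # expect: NameNode, DFSZKFailoverController
[hadoop@slaver01 ~]$ jps          # expect: DataNode, JournalNode, QuorumPeerMain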

6. Start the YARN cluster (the RM process on master02 must be started by hand):
[hadoop@master01 hadoop]$ start-yarn.sh
[hadoop@master02 ~]$ yarn-daemon.sh start resourcemanager

7. Check which master is currently serving as active (or use the web UI on port 50070):
[hadoop@master01 software]$ hdfs haadmin -getServiceState nn1
[hadoop@master01 hadoop]$ yarn rmadmin -getServiceState rm1
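
To check both sides of each pair at once (IDs nn1/nn2 and rm1/rm2 as configured above), a small loop works; exactly one of each pair should report active and the other standby:

[hadoop@master01 software]$ for nn in nn1 nn2; do hdfs haadmin -getServiceState $nn; done
[hadoop@master01 software]$ for rm in rm1 rm2; do yarn rmadmin -getServiceState $rm; done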

Summary:
HDFS cluster high availability:
---------------
Under ZooKeeper's supervision, a client talks to the NameNode process on the master node that is currently in the active state. If that master goes down, the standby master takes over automatically and the client is informed, so data access carries on normally.
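
Because fs.defaultFS points at the logical nameservice ns1 rather than a physical host, client commands are unchanged by a failover; for example:

[hadoop@master01 ~]$ hdfs dfs -ls hdfs://ns1/   # works no matter which NameNode is active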
YARN cluster high availability:
---------------
A client submitting a job talks to the ResourceManager process on the master node in the active state. If that node goes down, the standby master can only take over jobs submitted afterwards; the job that was running at the time ends in failure.

<!-- Configuration for when, for whatever reason, HA is not wanted -->
<!-- Change automatic failover after NameNode failure to false -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>false</value>
</property>
If a master node comes up stuck in the standby state, it can be forced into the active state. The drawback: automatic failover degrades to manual failover, and every transition afterwards has to be done by hand:
hdfs haadmin -transitionToActive --forceactive nn1
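
Once automatic failover has been disabled, a controlled switch between the two NameNodes can be done with the standard manual failover command (shown here as the usual counterpart, not from the original post):

hdfs haadmin -failover nn1 nn2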

<!--Split-brain-->
----- both master nodes are in the active (or both in the standby) state at the same time!

Original post: https://www.cnblogs.com/pandazhao/p/8074993.html