Apache Hadoop YARN High-Availability Deployment: A Hands-On Example

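                                        Author: Yin Zhengjie (尹正杰)

Copyright notice: this is an original work and may not be reproduced. Violations will be pursued legally.

   This post gives an overview of YARN ResourceManager high availability and explains in detail how to configure and use the feature.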

I. How YARN HA Works

 

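  The ResourceManager (RM) tracks the resources in a cluster and schedules applications (for example, MapReduce jobs). Before Hadoop 2.4, the ResourceManager was a single point of failure in a YARN cluster. The high-availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this single point of failure.

  Manual transitions and failover
    When automatic failover is not enabled, an administrator must manually transition one of the RMs to Active. To fail over from one RM to the other, first transition the Active RM to Standby, then transition the Standby RM to Active. Both steps can be done with the "yarn rmadmin" command.

  Automatic failover
    The RMs can optionally embed the ZooKeeper-based ActiveStandbyElector to decide which RM should be Active. When the Active RM goes down or becomes unresponsive, another RM is automatically elected Active and takes over.
    Note that, unlike HDFS, there is no need to run a separate ZKFC daemon: the ActiveStandbyElector embedded in the RM acts as both failure detector and leader elector.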
    
  Recommended reading:
    http://hadoop.apache.org/docs/r2.10.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
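
The manual failover sequence described above can be sketched as a small shell helper. This is a minimal sketch, not a procedure taken from the original post: the function name `manual_failover` and the `DRY_RUN` switch are hypothetical, and the RM ids in the usage note (`rm101`, `rm105`) are the ones this article configures later. Only `yarn rmadmin -transitionToStandby` and `yarn rmadmin -transitionToActive` are real subcommands. By default the helper only prints the commands it would run.

```shell
#!/bin/sh
# manual_failover FROM_ID TO_ID
# Demote the currently Active RM first, then promote the other one.
# Hypothetical helper: prints the commands unless DRY_RUN=0 is set.
manual_failover() {
  from_id=$1
  to_id=$2
  if [ "${DRY_RUN:-1}" = "0" ]; then
    runner=""        # really execute the yarn commands
  else
    runner="echo"    # dry run: only print them
  fi
  # Stop if the demotion fails, so two RMs are never promoted back to back.
  $runner yarn rmadmin -transitionToStandby "$from_id" || return 1
  $runner yarn rmadmin -transitionToActive "$to_id"
}
```

Calling `manual_failover rm101 rm105` prints the two commands in the order they must run. When automatic failover is enabled, `yarn rmadmin` refuses manual transitions unless `--forcemanual` is passed, and that flag should be used with caution.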

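II. Configuring a YARN HA Cluster: Hands-On Example

1>. Lab environment

  This lab builds on an existing HDFS HA deployment. For details, see my earlier note "Apache Hadoop HDFS High-Availability Deployment: A Hands-On Example".

  Recommended reading: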
    https://www.cnblogs.com/yinzhengjie2020/p/12508145.html

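2>. Role assignment

  The Hadoop roles are assigned to the nodes as listed below. You can merge these roles onto fewer hosts, although I do not recommend it. If your physical machine is short on memory, keep at least 2 virtual machines and spread all of the roles below across those 2 nodes, but expect a noticeably worse experience.
    hadoop101.yinzhengjie.org.cn:
      Runs the NameNode, ResourceManager, zookeeper and DFSZKFailoverController roles

    hadoop102.yinzhengjie.org.cn:
      Runs the DataNode, NodeManager, zookeeper and JournalNode roles

    hadoop103.yinzhengjie.org.cn:
      Runs the DataNode, NodeManager, zookeeper and JournalNode roles

    hadoop104.yinzhengjie.org.cn:
      Runs the DataNode, NodeManager and JournalNode roles

    hadoop105.yinzhengjie.org.cn:
      Runs the ResourceManager role

    hadoop106.yinzhengjie.org.cn:
      Runs the NameNode and DFSZKFailoverController roles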

3>. Configure yarn-site.xml

[root@hadoop101.yinzhengjie.org.cn ~]# vim /yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml 
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# cat /yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml 
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<configuration>

<!-- Site specific YARN configuration properties -->

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
        <description>Auxiliary service run by each NodeManager; mapreduce_shuffle lets reducers fetch map output.</description>
    </property>

    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
        <description>Enable ResourceManager HA.</description>
    </property>


    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yinzhengjie-yarn-ha</value>
        <description>Identifies the cluster, so that an RM does not take over as Active for a different cluster.</description>
    </property>


    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm101,rm105</value>
        <description>List of logical IDs of the ResourceManagers.</description>
    </property>


    <property>
        <name>yarn.resourcemanager.hostname.rm101</name>
        <value>hadoop101.yinzhengjie.org.cn</value>
        <description>Hostname of the server behind the logical ID rm101, i.e. the mapping from logical ID to real host.</description>
    </property>

    <property>
        <name>yarn.resourcemanager.hostname.rm105</name>
        <value>hadoop105.yinzhengjie.org.cn</value>
        <description>Hostname of the server behind the logical ID rm105, i.e. the mapping from logical ID to real host.</description>
    </property>

    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>hadoop101.yinzhengjie.org.cn:2181,hadoop102.yinzhengjie.org.cn:2181,hadoop103.yinzhengjie.org.cn:2181</value>
        <description>Addresses of the ZooKeeper ensemble.</description>
    </property>

    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
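        <description>Enable RM state recovery, so that a restarted or newly Active RM reloads application state from the state store. Automatic failover is controlled separately by yarn.resourcemanager.ha.automatic-failover.enabled, which defaults to true once HA is enabled.</description>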
    </property>

    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
        <description>Store the ResourceManager state in the ZooKeeper ensemble.</description>
    </property>

    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
        <description>Whether log aggregation is enabled. The default is false (disabled); true turns it on.</description>
    </property>

    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
        <description>How long to keep aggregated logs before deleting them, in seconds. The default of -1 disables deletion. Setting this too small will flood the NameNode with requests.</description>
    </property>


    <property>
        <name>yarn.log-aggregation.retain-check-interval-seconds</name>
        <value>3600</value>
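        <description>Interval between aggregated-log retention checks, in seconds. If set to 0 or a negative value, it is computed as one tenth of the aggregated-log retention time. Setting this too small will flood the NameNode with requests.</description>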
    </property>

</configuration>
[root@hadoop101.yinzhengjie.org.cn ~]# 
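
Before syncing the file, it can help to confirm that every RM id has a matching hostname entry. The following is a minimal sketch, assuming the one-tag-per-line layout of the file above; `get_prop` and `check_rm_ha` are hypothetical helper names, not part of Hadoop, and the awk-based lookup is not a general XML parser.

```shell
#!/bin/sh
# get_prop FILE NAME
# Print the <value> that follows <name>NAME</name>. Assumes one tag per line,
# as in the yarn-site.xml above.
get_prop() {
  awk -v want="$2" '
    /<name>/ {
      n = $0
      sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
      hit = (n == want)
    }
    /<value>/ && hit {
      v = $0
      sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
      print v
      exit
    }
  ' "$1"
}

# check_rm_ha FILE
# Print "id -> host" for every RM id; fail if HA is off or a hostname is missing.
check_rm_ha() {
  conf=$1
  if [ "$(get_prop "$conf" yarn.resourcemanager.ha.enabled)" != "true" ]; then
    echo "yarn.resourcemanager.ha.enabled is not true" >&2
    return 1
  fi
  for rm_id in $(get_prop "$conf" yarn.resourcemanager.ha.rm-ids | tr ',' ' '); do
    rm_host=$(get_prop "$conf" "yarn.resourcemanager.hostname.$rm_id")
    if [ -z "$rm_host" ]; then
      echo "no hostname configured for $rm_id" >&2
      return 1
    fi
    echo "$rm_id -> $rm_host"
  done
}
```

For the file shown above, `check_rm_ha /yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml` should list one line per RM id, mapping `rm101` and `rm105` to their hosts.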

4>. Sync the updated configuration to the other nodes

[root@hadoop101.yinzhengjie.org.cn ~]# rsync-hadoop.sh /yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml 
******* [hadoop102.yinzhengjie.org.cn] node starts synchronizing [/yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml] *******
Command executed successfully
******* [hadoop103.yinzhengjie.org.cn] node starts synchronizing [/yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml] *******
Command executed successfully
******* [hadoop104.yinzhengjie.org.cn] node starts synchronizing [/yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml] *******
Command executed successfully
******* [hadoop105.yinzhengjie.org.cn] node starts synchronizing [/yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml] *******
Command executed successfully
******* [hadoop106.yinzhengjie.org.cn] node starts synchronizing [/yinzhengjie/softwares/ha/etc/hadoop/yarn-site.xml] *******
Command executed successfully
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# 

5>. Start the YARN cluster from the hadoop101.yinzhengjie.org.cn node

[root@hadoop101.yinzhengjie.org.cn ~]# ansible all -m shell -a 'jps'
hadoop104.yinzhengjie.org.cn | SUCCESS | rc=0 >>
7794 JournalNode
9846 Jps
7693 DataNode

hadoop103.yinzhengjie.org.cn | SUCCESS | rc=0 >>
23717 Jps
21751 DataNode
20650 QuorumPeerMain
21852 JournalNode

hadoop102.yinzhengjie.org.cn | SUCCESS | rc=0 >>
20800 QuorumPeerMain
24369 Jps
22178 JournalNode
22451 ZooKeeperMain
22077 DataNode

hadoop106.yinzhengjie.org.cn | SUCCESS | rc=0 >>
12800 Jps
6679 NameNode
6381 DFSZKFailoverController

hadoop101.yinzhengjie.org.cn | SUCCESS | rc=0 >>
17362 DFSZKFailoverController
5463 NameNode
17736 Jps

[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /yinzhengjie/softwares/ha/logs/yarn-root-resourcemanager-hadoop101.yinzhengjie.org.cn.out
hadoop103.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/ha/logs/yarn-root-nodemanager-hadoop103.yinzhengjie.org.cn.out
hadoop104.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/ha/logs/yarn-root-nodemanager-hadoop104.yinzhengjie.org.cn.out
hadoop102.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/ha/logs/yarn-root-nodemanager-hadoop102.yinzhengjie.org.cn.out
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# 
[root@hadoop101.yinzhengjie.org.cn ~]# ansible all -m shell -a 'jps'
hadoop106.yinzhengjie.org.cn | SUCCESS | rc=0 >>
6679 NameNode
12855 Jps
6381 DFSZKFailoverController

hadoop101.yinzhengjie.org.cn | SUCCESS | rc=0 >>
17362 DFSZKFailoverController
18164 Jps
5463 NameNode
17801 ResourceManager

hadoop104.yinzhengjie.org.cn | SUCCESS | rc=0 >>
7794 JournalNode
10042 Jps
7693 DataNode
9885 NodeManager

hadoop103.yinzhengjie.org.cn | SUCCESS | rc=0 >>
21751 DataNode
23913 Jps
20650 QuorumPeerMain
21852 JournalNode
23756 NodeManager

hadoop102.yinzhengjie.org.cn | SUCCESS | rc=0 >>
20800 QuorumPeerMain
22178 JournalNode
22451 ZooKeeperMain
24565 Jps
24408 NodeManager
22077 DataNode

[root@hadoop101.yinzhengjie.org.cn ~]# 
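
Reading the jps listing by eye gets tedious on larger clusters. As a rough sketch, the filter below reads the ansible output on stdin and prints the hosts that run a given daemon. `hosts_running` is a hypothetical helper name, and the parsing assumes the "host | SUCCESS | rc=0 >>" header format shown above, which may differ in other ansible versions.

```shell
#!/bin/sh
# hosts_running DAEMON
# Read `ansible all -m shell -a 'jps'` output on stdin and print every host
# whose jps listing contains DAEMON.
hosts_running() {
  awk -v daemon="$1" '
    # Header line such as "hadoop101... | SUCCESS | rc=0 >>": remember the host.
    $2 == "|" && $3 == "SUCCESS" { host = $1; next }
    # jps line such as "17801 ResourceManager": second field is the class name.
    NF == 2 && $2 == daemon      { print host }
  '
}
```

A possible use is `ansible all -m shell -a 'jps' | hosts_running NodeManager`, which at this point should name hadoop102, hadoop103 and hadoop104.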

6>. Start the ResourceManager separately on hadoop105.yinzhengjie.org.cn

[root@hadoop105.yinzhengjie.org.cn ~]# ss -ntl
State       Recv-Q Send-Q                                     Local Address:Port                                                    Peer Address:Port              
LISTEN      0      128                                                    *:22                                                                 *:*                  
LISTEN      0      128                                                   :::22                                                                :::*                  
[root@hadoop105.yinzhengjie.org.cn ~]# 
[root@hadoop105.yinzhengjie.org.cn ~]# yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /yinzhengjie/softwares/ha/logs/yarn-root-resourcemanager-hadoop105.yinzhengjie.org.cn.out
[root@hadoop105.yinzhengjie.org.cn ~]# 
[root@hadoop105.yinzhengjie.org.cn ~]# ss -ntl
State       Recv-Q Send-Q                                     Local Address:Port                                                    Peer Address:Port              
LISTEN      0      128                                                    *:22                                                                 *:*                  
LISTEN      0      128                                        172.200.4.105:8088                                                               *:*                  
LISTEN      0      128                                                   :::22                                                                :::*                  
[root@hadoop105.yinzhengjie.org.cn ~]# 
[root@hadoop105.yinzhengjie.org.cn ~]# 

7>. Check the YARN HA service state

[root@hadoop105.yinzhengjie.org.cn ~]# yarn rmadmin -getServiceState rm101
active
[root@hadoop105.yinzhengjie.org.cn ~]# 
[root@hadoop105.yinzhengjie.org.cn ~]# 
[root@hadoop105.yinzhengjie.org.cn ~]# yarn rmadmin -getServiceState rm105
standby
[root@hadoop105.yinzhengjie.org.cn ~]# 
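
The two queries above can be wrapped in a loop that reports which RM is currently Active. This is a minimal sketch: `find_active_rm` is a hypothetical helper name, and it relies only on `yarn rmadmin -getServiceState` printing "active" or "standby" as shown above.

```shell
#!/bin/sh
# find_active_rm RM_ID...
# Print the id of the first RM that reports "active"; return 1 if none does.
find_active_rm() {
  for rm_id in "$@"; do
    # Errors (for example an unreachable RM) are treated as "not active".
    state=$(yarn rmadmin -getServiceState "$rm_id" 2>/dev/null)
    if [ "$state" = "active" ]; then
      echo "$rm_id"
      return 0
    fi
  done
  return 1
}
```

With the state shown above, `find_active_rm rm101 rm105` would print `rm101`.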

8>. Kill the RM process on the Active node

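  Differences between YARN HA and HDFS HA:
    (1) YARN HA makes the ResourceManager highly available, whereas HDFS HA makes the NameNode highly available;
    (2) HDFS HA requires a fencing method to be configured, whereas YARN HA does not need a separate one, because ZKRMStateStore allows only one RM to write to the store at any time;
    (3) In HDFS 2.x, a power loss on the Active NameNode host can leave the whole HDFS cluster unavailable (for example when sshfence is the only fencing method, since fencing cannot complete against a host that is down), whereas a power loss on either ResourceManager host does not make the YARN cluster unavailable.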
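
The failover test named in this step's heading can be sketched as follows. This is an illustration under assumptions, not output captured from the original cluster: `rm_pid` and `wait_until_active` are hypothetical helper names, the retry count and sleep interval are arbitrary defaults, and only `jps`, `kill` and `yarn rmadmin -getServiceState` are real commands.

```shell
#!/bin/sh
# rm_pid
# Print the PID of the local ResourceManager as reported by jps.
rm_pid() {
  jps | awk '$2 == "ResourceManager" { print $1 }'
}

# wait_until_active RM_ID [TRIES] [SLEEP_SECONDS]
# Poll the given RM until it reports "active"; return 1 if it never does.
wait_until_active() {
  wait_id=$1
  tries=${2:-30}
  pause=${3:-2}
  while [ "$tries" -gt 0 ]; do
    if [ "$(yarn rmadmin -getServiceState "$wait_id" 2>/dev/null)" = "active" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$pause"
  done
  return 1
}

# Example (run by hand, not executed here):
#   on the Active RM host:  kill -9 "$(rm_pid)"
#   on any other node:      wait_until_active rm105 && echo "rm105 took over"
```

If the standby takes over, restarting the killed RM with `yarn-daemon.sh start resourcemanager` should bring it back in the standby state.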

Original article: https://www.cnblogs.com/yinzhengjie2020/p/12853495.html