Summary of problems encountered in Hadoop cluster operations

1. ZooKeeper error

2017-12-13 16:47:55,968 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-12-13 16:47:55,968 [myid:] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Cause: the ZooKeeper node was down; starting it again resolves the error.
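Before restarting, you can verify that the node is actually down with ZooKeeper's `ruok` four-letter command (assuming it is not disabled by `4lw.commands.whitelist` on newer versions). A minimal liveness-check sketch; host and port are illustrative:

```python
import socket

def zk_is_alive(host="localhost", port=2181, timeout=2.0):
    """Send ZooKeeper's 'ruok' four-letter command; a live server answers 'imok'.
    A refused connection (as in the log above) means the node is down."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        return False
```

If this returns False, start the node with `zkServer.sh start` and check again.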

2. Kafka consumer error: Job aborted due to stage failure: kafka.common.OffsetOutOfRangeException

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): kafka.common.OffsetOutOfRangeException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Kafka's message retention period is log.retention.hours=168 (7 days).

Fix: the cause is that the consumer group's committed offset is older than the earliest message Kafka still retains. The blog posts in the references give a more detailed explanation.
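The recovery amounts to clamping the committed offset back into the broker's retained range, whose bounds the commands below retrieve. A sketch of that logic (function name is illustrative):

```python
def clamp_offset(saved, earliest, latest):
    """Bring a committed offset back into the broker's retained range.

    OffsetOutOfRangeException fires when `saved` falls outside
    [earliest, latest]; resetting it to `earliest` (older messages were
    expired by retention) or `latest` clears the error, at the cost of
    skipping or replaying some messages."""
    if saved < earliest:
        return earliest  # messages before `earliest` were deleted by retention
    if saved > latest:
        return latest    # offset is ahead of the log end
    return saved
```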

Get the minimum offset for topic mysqlslowlog:

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2

Get the maximum offset for topic mysqlslowlog:

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -1

Update the topic partition's offset in ZooKeeper:

# Check the current offset for partition 0

get /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0

# Set partition 0's offset to the minimum value

set /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0 3546232

Alternatively, batch-update the group's offsets to the earliest value with:

./kafka-run-class.sh kafka.tools.UpdateOffsetsInZK earliest 

References:

http://blog.csdn.net/xueba207/article/details/51135423
http://blog.csdn.net/xueba207/article/details/51174818

3. Restarting an HBase regionserver fails with:

Server ...,1514436003346 has been rejected; Reported time is too far out of sync with master.  Time difference of 136758ms > max allowed of 30000ms

This is usually caused by a clock difference between the HMaster node and the regionserver node. Sync the clocks and restart the node.
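The check HBase applies is a plain absolute-difference comparison against the allowed skew (30000 ms in the error above, the default for hbase.master.maxclockskew); roughly:

```python
def clock_skew_ok(master_ms, regionserver_ms, max_skew_ms=30000):
    """The HBase master rejects a regionserver whose reported time differs
    from the master's by more than the allowed skew (30000 ms here,
    matching the 'max allowed of 30000ms' in the error above)."""
    return abs(master_ms - regionserver_ms) <= max_skew_ms
```

The 136758 ms difference from the log clearly fails this check; syncing both hosts against the same NTP server brings it back under the limit.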

4. When decommissioning an HDFS datanode, the node stays in the "Decommission In Progress" state

Check progress in the web UI:

# Blocks below the required replication factor
Under replicated blocks: 2979
# Blocks with no remaining live replica
Blocks with no live replicas: 0
# Under-replicated blocks in files still being written
Under Replicated Blocks In files under construction: 1

Or check the datanode's status with ../bin/hadoop dfsadmin -report.

The replication factor here is 2. "Under replicated blocks" keeps dropping; once it reaches 0, the decommission should complete.
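To watch the counter without the web UI, you can pull it out of the report text. A small parsing sketch; the label format follows the summary shown above:

```python
import re

def under_replicated(report_text):
    """Extract the 'Under replicated blocks' counter from text such as the
    web UI summary or `hadoop dfsadmin -report` output.
    Returns None if the counter is absent."""
    m = re.search(r"Under replicated blocks\s*:\s*(\d+)", report_text, re.IGNORECASE)
    return int(m.group(1)) if m else None
```

Polling this until it returns 0 tells you when the decommission has finished.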

Also, since datanodes in the same rack usually hold one of the replicas, you can take a datanode offline faster by lowering the replication factor:

# Check cluster status

./bin/hadoop fsck / -blocks -locations -files

# Lower the replication factor (only do this once "Blocks with no live replicas" is 0)

./bin/hadoop fs -setrep -R 1 /

# Stop the datanode

./sbin/hadoop-daemon.sh stop datanode

# Remove the node from the slaves file and the rack mapping

# Refresh the node lists (or restart the namenodes one at a time)

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

5. Decommissioning an HDFS datanode fails with:

Failed to add xxxxxxxx:50010: You cannot have a rack and a non-rack node at the same level of the network topology.

Fix:

List the rack mapping with ./bin/hdfs dfsadmin -printTopology

Then refresh:

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

This did not work:
(1) the web UI still showed the datanode with state dead, and
(2) the "You cannot have a rack and a non-rack node at the same level of the network topology." error persisted.

Restarting the namenodes one at a time made the change take effect:

./sbin/hadoop-daemon.sh stop namenode
./sbin/hadoop-daemon.sh start namenode

Afterwards,

./bin/hdfs dfsadmin -printTopology

shows the rack information without the removed node.
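The rule behind the error is that every datanode must sit at the same depth of the network topology: a rack-mapped node like /rack1/hostA cannot coexist with an unmapped bare /hostB at the same level. A sketch of that invariant (function name is illustrative):

```python
def topology_consistent(node_paths):
    """HDFS requires all datanodes to be at the same depth in the network
    topology; mixing e.g. '/rack1/hostA' with a bare '/hostB' produces the
    'rack and a non-rack node at the same level' error."""
    depths = {path.strip("/").count("/") for path in node_paths}
    return len(depths) <= 1
```

Checking the paths reported by -printTopology against this invariant shows which entry is inconsistent with the rest.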

Original article: https://www.cnblogs.com/wyett/p/8146044.html