Summary of problems encountered in Hadoop cluster operations

1. ZooKeeper error

2017-12-13 16:47:55,968 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2017-12-13 16:47:55,968 [myid:] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Cause: the ZooKeeper node was down; starting it again resolves the error.
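Before restarting, you can verify that the node is actually down with ZooKeeper's `ruok` four-letter command (assuming it is not disabled by `4lw.commands.whitelist` on newer versions). A minimal liveness-check sketch; host and port are illustrative:

```python
import socket

def zk_is_alive(host="localhost", port=2181, timeout=2.0):
    """Send ZooKeeper's 'ruok' four-letter command; a live server answers 'imok'.
    A refused connection (as in the log above) means the node is down."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return s.recv(4) == b"imok"
    except OSError:
        return False
```

If this returns False, start the node with `zkServer.sh start` and check again.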

2. Kafka consumer error: Job aborted due to stage failure: kafka.common.OffsetOutOfRangeException

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): kafka.common.OffsetOutOfRangeException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Kafka's message retention period is log.retention.hours=168 (7 days).

Fix: the cause is that the consumer group's committed offset is older than the earliest message Kafka still retains. The blog posts in the references give a more detailed explanation.
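The recovery amounts to clamping the committed offset back into the broker's retained range, whose bounds the commands below retrieve. A sketch of that logic (function name is illustrative):

```python
def clamp_offset(saved, earliest, latest):
    """Bring a committed offset back into the broker's retained range.

    OffsetOutOfRangeException fires when `saved` falls outside
    [earliest, latest]; resetting it to `earliest` (older messages were
    expired by retention) or `latest` clears the error, at the cost of
    skipping or replaying some messages."""
    if saved < earliest:
        return earliest  # messages before `earliest` were deleted by retention
    if saved > latest:
        return latest    # offset is ahead of the log end
    return saved
```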

Get the minimum offset for topic mysqlslowlog:

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2

Get the maximum offset for topic mysqlslowlog:

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -1

Update the topic partition's offset in ZooKeeper:

# Check the current offset for partition 0

get /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0

# Set partition 0's offset to the minimum value

set /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0 3546232

Alternatively, batch-update the group's offsets to the earliest value with:

./kafka-run-class.sh kafka.tools.UpdateOffsetsInZK earliest 

References:

http://blog.csdn.net/xueba207/article/details/51135423
http://blog.csdn.net/xueba207/article/details/51174818

3. Restarting an HBase regionserver fails with:

Server ...,1514436003346 has been rejected; Reported time is too far out of sync with master.  Time difference of 136758ms > max allowed of 30000ms

This is usually caused by a clock difference between the HMaster node and the regionserver node. Sync the clocks and restart the node.
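The check HBase applies is a plain absolute-difference comparison against the allowed skew (30000 ms in the error above, the default for hbase.master.maxclockskew); roughly:

```python
def clock_skew_ok(master_ms, regionserver_ms, max_skew_ms=30000):
    """The HBase master rejects a regionserver whose reported time differs
    from the master's by more than the allowed skew (30000 ms here,
    matching the 'max allowed of 30000ms' in the error above)."""
    return abs(master_ms - regionserver_ms) <= max_skew_ms
```

The 136758 ms difference from the log clearly fails this check; syncing both hosts against the same NTP server brings it back under the limit.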

4. When decommissioning an HDFS datanode, the node stays in the "Decommission In Progress" state

Check progress in the web UI:

# Blocks below the required replication factor
Under replicated blocks: 2979
# Blocks with no remaining live replica
Blocks with no live replicas: 0
# Under-replicated blocks in files still being written
Under Replicated Blocks In files under construction: 1

Or check the datanode's status with ../bin/hadoop dfsadmin -report.

The replication factor here is 2. "Under replicated blocks" keeps dropping; once it reaches 0, the decommission should complete.
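To watch the counter without the web UI, you can pull it out of the report text. A small parsing sketch; the label format follows the summary shown above:

```python
import re

def under_replicated(report_text):
    """Extract the 'Under replicated blocks' counter from text such as the
    web UI summary or `hadoop dfsadmin -report` output.
    Returns None if the counter is absent."""
    m = re.search(r"Under replicated blocks\s*:\s*(\d+)", report_text, re.IGNORECASE)
    return int(m.group(1)) if m else None
```

Polling this until it returns 0 tells you when the decommission has finished.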

Also, since datanodes in the same rack usually hold one of the replicas, you can take a datanode offline faster by lowering the replication factor:

# Check cluster status

./bin/hadoop fsck / -blocks -locations -files

# Lower the replication factor (only do this once "Blocks with no live replicas" is 0)

./bin/hadoop fs -setrep -R 1 /

# Stop the datanode

./sbin/hadoop-daemon.sh stop datanode

# Remove the node from the slaves file and the rack mapping

# Refresh the node lists (or restart the namenodes one at a time)

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

5. Decommissioning an HDFS datanode fails with:

Failed to add xxxxxxxx:50010: You cannot have a rack and a non-rack node at the same level of the network topology.

Fix:

List the rack mapping with ./bin/hdfs dfsadmin -printTopology

Then refresh:

./bin/hdfs dfsadmin -refreshNodes
./bin/yarn rmadmin -refreshNodes

This did not work:
(1) the web UI still showed the datanode with state dead, and
(2) the "You cannot have a rack and a non-rack node at the same level of the network topology." error persisted.

Restarting the namenodes one at a time made the change take effect:

./sbin/hadoop-daemon.sh stop namenode
./sbin/hadoop-daemon.sh start namenode

Afterwards,

./bin/hdfs dfsadmin -printTopology

shows the rack information without the removed node.
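The rule behind the error is that every datanode must sit at the same depth of the network topology: a rack-mapped node like /rack1/hostA cannot coexist with an unmapped bare /hostB at the same level. A sketch of that invariant (function name is illustrative):

```python
def topology_consistent(node_paths):
    """HDFS requires all datanodes to be at the same depth in the network
    topology; mixing e.g. '/rack1/hostA' with a bare '/hostB' produces the
    'rack and a non-rack node at the same level' error."""
    depths = {path.strip("/").count("/") for path in node_paths}
    return len(depths) <= 1
```

Checking the paths reported by -printTopology against this invariant shows which entry is inconsistent with the rest.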

Original article: https://www.cnblogs.com/wyett/p/8146044.html