HDFS: a corrupt block prevents a DataNode from reporting its blocks normally

On a production cluster, disk usage on one DataNode shot up: five of its disks had already reached 100% capacity and the rest were above 90%. Running the balancer had no effect and the data kept growing; from the balancer log it looked like the balancer was barely moving any blocks at all.
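For context, the disk usage check and the balancer run were along the following lines. This is only a sketch; the grep pattern, threshold and bandwidth values are illustrative, not the exact ones used at the time:

# Per-volume usage on the affected DataNode (run on that host; mount pattern is illustrative)
df -h | grep -E '/data|/dfs'

# Cluster-wide view of per-DataNode usage as reported to the NameNode
hdfs dfsadmin -report

# Let the balancer move data faster (100 MB/s here, illustrative), then run it with a 10% threshold
hdfs dfsadmin -setBalancerBandwidth 104857600
hdfs balancer -threshold 10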

Symptoms:

2020-05-25 22:13:47,459 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x1f57d7768e5e5fbf,  containing 11 storage report(s), of which we sent 2. The reports had 2790431 total blocks and used 2 RPC(s). This took 511 msec to generate and 511 msecs for RPC and NN processing. Got back no commands
2020-05-25 22:13:47,459 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService
org.apache.hadoop.ipc.RemoteException(java.io.IOException): java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.runBlockOp(BlockManager.java:3935)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1423)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:179)
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28423)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1481)
        at org.apache.hadoop.ipc.Client.call(Client.java:1427)
        at org.apache.hadoop.ipc.Client.call(Client.java:1337)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy15.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:203)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:371)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:629)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:771)
        at java.lang.Thread.run(Thread.java:748)
2020-05-25 22:13:48,829 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /xxx:41340, dest: /xxx:50010, bytes: 41295660, op: HDFS_WRITE, cliID: libhdfs3_client_random_1943875928_count_329741_pid_31778_tid_139650573580032, offset: 0, srvID: 66666ee2-f0b1-472f-ae97-adb2418b61b7, blockid: BP-106388200-xxx-1508315348381:blk_1104911380_31178890, duration: 8417209028
2020-05-25 22:13:48,829 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-106388200-xxx-1508315348381 (Datanode Uuid 66666ee2-f0b1-472f-ae97-adb2418b61b7) service to xxx1/xxx:9000
java.util.ConcurrentModificationException: modification=2962864 != iterModification = 2962863
        at org.apache.hadoop.util.LightWeightGSet$SetIterator.ensureNext(LightWeightGSet.java:305)
        at org.apache.hadoop.util.LightWeightGSet$SetIterator.hasNext(LightWeightGSet.java:322)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1813)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:335)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:629)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:771)
        at java.lang.Thread.run(Thread.java:748)
2020-05-25 22:13:48,829 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-106388200-xxx-1508315348381:blk_1104911380_31178890, type=LAST_IN_PIPELINE terminating

From the messages above you can see that a block-report storm had set in. The DataNode periodically sends its block reports to the NameNode, but because of the storm there was no telling when this node (holding a little under 50 TB of data) would ever finish reporting all of its blocks. Tuning the dfs.blockreport.split.threshold parameter did not seem to help either, so the problem was stuck, as shown in the sketch below.
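For reference, dfs.blockreport.split.threshold is the block count above which a DataNode splits its report into one message per storage (disk) instead of a single combined report. A minimal sketch of how it was inspected and lowered, with an illustrative value rather than the one actually tried:

# Value currently in effect (the stock default is 1000000)
hdfs getconf -confKey dfs.blockreport.split.threshold

# Then, in hdfs-site.xml on the DataNode (illustrative value), followed by a DataNode restart:
#   <property>
#     <name>dfs.blockreport.split.threshold</name>
#     <value>100000</value>
#   </property>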

Continuing with the NameNode log:

        at java.lang.Thread.run(Thread.java:748)
2020-05-27 17:19:45,424 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: libhdfs3_client_random_1124367707_count_1513119_pid_195169_tid_140150889772800, pending creates: 1] has expired hard limit
2020-05-27 17:19:45,424 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: libhdfs3_client_random_1124367707_count_1513119_pid_195169_tid_140150889772800, pending creates: 1], src=/user/xxx/9.log.ok.ok.ok.ok.gz

The NameNode log was flooded with lease hard-limit expirations for this one file; almost the entire log file consisted of these few lines. Running fsck against the block failed and came back with a NullPointerException, which was genuinely exasperating. To guard against data loss I then tried to copy the block's data to local disk, and that too mercilessly returned a NullPointerException.
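The checks were along the following lines; the file path is taken from the lease messages above and the options are the standard ones, so treat this as a sketch rather than the exact commands run at the time:

# Inspect the file's blocks and their replica locations
hdfs fsck /user/xxx/9.log.ok.ok.ok.ok.gz -files -blocks -locations

# List corrupt file blocks cluster-wide
hdfs fsck / -list-corruptfileblocks

# Try to pull a copy of the file off the cluster before touching it
hdfs dfs -get /user/xxx/9.log.ok.ok.ok.ok.gz /tmp/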

Analysis:

My read is that while the NameNode was trying to assign replica locations for this corrupt block, it tied up a large number of handler threads, so the DataNode's block reports could not be processed normally and the block-report storm followed. Because the NameNode could not obtain the node's block information, many blocks that should have been deleted were never removed in time, while the NameNode kept placing new replicas on this node, so its disk usage kept climbing.
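One way to check whether the NameNode's RPC handlers really are backed up is to look at its RPC metrics over JMX. A minimal sketch, assuming the NameNode serves RPC on port 9000 (as in the logs above) and its HTTP UI on the Hadoop 2.x default port 50070; adjust host and ports for your deployment:

# Call queue length and average queue/processing times for the port-9000 RPC server
curl -s 'http://<namenode-host>:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort9000' \
  | grep -E 'CallQueueLength|RpcQueueTimeAvgTime|RpcProcessingTimeAvgTime'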

Solution:

In the end I deleted the data for that bad block. Watching the DataNode's log afterwards, it was executing a large number of delete operations, and together with the balancer the cluster eventually came back into balance.
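The cleanup amounted to roughly the following; the path comes from the lease messages above, the log location varies by deployment, and the balancer threshold is illustrative:

# Remove the file containing the unrecoverable block, bypassing the trash
hdfs dfs -rm -skipTrash /user/xxx/9.log.ok.ok.ok.ok.gz

# Watch the DataNode clean up on that host (log path is deployment-specific)
#   tail -f <datanode-log-file> | grep -i delete

# Then rebalance the cluster
hdfs balancer -threshold 10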

Original post: https://www.cnblogs.com/yjt1993/p/12975316.html