Error (unresolved): NoReplicaOnlineException: No replica in ISR for partition __consumer_offsets-8 is alive. Live brokers are: [Set(50, 51, 52)], ISR brokers are: [68]

Background:

 After integrating the Kafka plugin into CDH, this error was reported as soon as Kafka was started.

Symptom:

2019-05-17 08:18:06,428 ERROR state.change.logger: [Controller id=50 epoch=4447617] Initiated state change for partition __consumer_offsets-8 from OfflinePartition to OnlinePartition failed
kafka.common.NoReplicaOnlineException: No replica in ISR for partition __consumer_offsets-8 is alive. Live brokers are: [Set(50, 51, 52)], ISR brokers are: [68]
        at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:65)
        at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:303)
        at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:163)
        at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:84)
        at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:81)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:81)
        at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:58)
        at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:298)
        at kafka.controller.KafkaController.elect(KafkaController.scala:1681)
        at kafka.controller.KafkaController$Reelect$.process(KafkaController.scala:1610)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:53)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:53)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:53)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
        at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:52)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)

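Cause:

Key message: No replica in ISR for partition __consumer_offsets-8 is alive

What it means: none of the replicas in the ISR of partition __consumer_offsets-8 is alive. The log shows that the ISR contains only broker 68, while the live brokers are 50, 51 and 52.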
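To make the mismatch in the message explicit, here is a minimal sketch that pulls the two broker lists out of the exception text and reports which ISR brokers are not alive. The function name dead_isr_brokers is hypothetical, and the regular expressions only assume the message format shown in the log above.

```python
import re


def dead_isr_brokers(message):
    """Return the ISR broker ids that are missing from the live broker set."""
    live = re.search(r"Live brokers are: \[Set\(([^)]*)\)\]", message)
    isr = re.search(r"ISR brokers are: \[([^\]]*)\]", message)
    if live is None or isr is None:
        raise ValueError("unrecognized message format")

    def to_ids(text):
        # "50, 51, 52" -> {50, 51, 52}; an empty list gives an empty set
        return {int(part) for part in text.split(",") if part.strip()}

    return to_ids(isr.group(1)) - to_ids(live.group(1))


msg = ("No replica in ISR for partition __consumer_offsets-8 is alive. "
       "Live brokers are: [Set(50, 51, 52)], ISR brokers are: [68]")
print(dead_isr_brokers(msg))  # {68}
```

For the message in this post the result is broker 68, an id that does not belong to any broker currently running in the cluster.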

Based on material found online, my preliminary analysis is that the leader election for this partition failed.

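The four leader election implementations and their trigger conditions are as follows:

Implementation                             Trigger condition
OfflinePartitionLeaderSelector             Triggered when the leader goes offline
ReassignedPartitionLeaderSelector          Triggered after a partition's replica reassignment has finished syncing data
PreferredReplicaPartitionLeaderSelector    Preferred leader election, triggered manually or by automatic leader rebalancing
ControlledShutdownLeaderSelector           Triggered when a broker sends a ShutDown request to shut itself down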

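The partition leader election logic of OfflinePartitionLeaderSelector is:

  1. If at least one replica in the ISR is alive, the first live replica in the partition's ISR becomes the new leader, and the live ISR replicas become the new ISR;
  2. Otherwise, if unclean leader election is disabled, a NoReplicaOnlineException is thrown;
  3. Otherwise, that is, when unclean leader election is allowed, one replica is chosen from the live assigned replicas (replicas that are not in the ISR) as the new leader, and it also forms the new ISR;
  4. Otherwise, that is, when none of the partition's assigned replicas is alive, a NoReplicaOnlineException is thrown;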
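The four steps above can be sketched roughly as follows. This is a simplified Python model for illustration, not Kafka's actual Scala code: the names select_offline_leader and NoReplicaOnlineError are hypothetical, picking the first live assigned replica in step 3 is an assumption, and the assigned replica list [68] in the example is also assumed, since the log only shows the ISR.

```python
class NoReplicaOnlineError(Exception):
    """Stand-in for kafka.common.NoReplicaOnlineException (hypothetical name)."""


def select_offline_leader(assigned, isr, live, unclean_allowed):
    """Return (new_leader, new_isr), following the four steps listed above."""
    # Step 1: prefer the first ISR replica that is still alive.
    live_isr = [broker for broker in isr if broker in live]
    if live_isr:
        return live_isr[0], live_isr
    # Step 2: no live ISR replica and unclean election is disabled.
    if not unclean_allowed:
        raise NoReplicaOnlineError("no replica in ISR is alive")
    # Step 3: fall back to a live assigned replica outside the ISR
    # (taking the first one is an assumption of this sketch).
    live_assigned = [broker for broker in assigned if broker in live]
    if live_assigned:
        return live_assigned[0], [live_assigned[0]]
    # Step 4: none of the assigned replicas is alive.
    raise NoReplicaOnlineError("no assigned replica is alive")


# Situation from the log: ISR is [68], live brokers are 50, 51 and 52.
try:
    select_offline_leader(assigned=[68], isr=[68], live={50, 51, 52},
                          unclean_allowed=False)
except NoReplicaOnlineError as err:
    print("election failed:", err)
```

With the values from the log the function fails at step 2, and it would still fail at step 4 with unclean election enabled if broker 68 were the only assigned replica.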

From the above it follows that a Kafka replica has gone down, but I was not able to pin down the specific cause.

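Fix:

 If the error shows up in CDH, my approach was to delete all of Kafka's topics.

1. Delete the topic with the command:
kafka-topics.sh --delete --zookeeper localhost:2181 --topic AlarmHis
On its own this does not actually remove the topic.
2. Go into the /tmp/kafka-logs directory and delete the folder named test (that is, the folder of the topic being deleted).
3. Go to the ZooKeeper installation directory, then into its bin directory,
start the ZooKeeper client with the command zookeeper-client,
list the existing topics with the command ls /brokers/topics,
and remove the topic with the command rmr /brokers/topics/test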

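After the deletion, shut down all services, reboot the machine, and start the cluster again.

At this point CDH no longer showed the error, but I later found that the Kafka log files on the cloud host were still reporting it. This is not resolved yet.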

Reference: https://www.colabug.com/3174494.html

Original article: https://www.cnblogs.com/chuijingjing/p/10880761.html