Cassandra node cannot rejoin the cluster after a restart and reports "received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397"

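Our environment has a Cassandra cluster of 6 nodes across 2 datacenters, running version 2.1.9.

Today we restarted one machine in the cluster, 10.168.12.3, and found that the node could not rejoin the cluster. The symptoms and analysis follow.

Checking the cluster status from the restarted node, everything looks normal: every node, including the restarted one, is reported as UN.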

$ nodetool status
Datacenter: DC-SGM-DR
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.168.50.205  822.91 MB  256     ?       bea84e24-76c8-4070-9c41-d0051d8aba63  RAC-1B
UN  10.168.50.212  825.43 MB  256     ?       97e92d11-028a-44f6-b6ea-be3992985506  RAC-1B
UN  10.168.50.213  14.37 GB   256     ?       de47960c-54ab-4ed3-99e7-e3abcb66c014  RAC-1B
Datacenter: DC-SGM-SH
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.168.11.11   10.17 GB   256     ?       9d016b9f-5655-4899-8652-607bdc24eda3  RAC-1A
UN  10.168.12.3    831.42 MB  256     ?       57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f  RAC-1A
UN  10.168.11.6    828.2 MB   256     ?       9cf69121-4dbc-419c-b3a8-e166d83b4177  RAC-1A

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

We then log in to another node in the cluster and check the cluster status from there:

$ nodetool status
Datacenter: DC-SGM-DR
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.168.50.205  828.16 MB  256     ?       bea84e24-76c8-4070-9c41-d0051d8aba63  RAC-1B
UN  10.168.50.212  825.43 MB  256     ?       97e92d11-028a-44f6-b6ea-be3992985506  RAC-1B
UN  10.168.50.213  14.37 GB   256     ?       de47960c-54ab-4ed3-99e7-e3abcb66c014  RAC-1B
Datacenter: DC-SGM-SH
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.168.11.11   834.48 MB  256     ?       9d016b9f-5655-4899-8652-607bdc24eda3  RAC-1A
DN  10.168.12.3    831.31 MB  256     ?       57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f  RAC-1A
UN  10.168.11.6    828.17 MB  256     ?       9cf69121-4dbc-419c-b3a8-e166d83b4177  RAC-1A

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

The other nodes show the restarted node as DN, and the Cassandra system.log on each of them keeps reporting the following warning:

..................................................
WARN  [GossipStage:1] 2020-01-02 10:07:45,831 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:47,680 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:49,682 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:50,690 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:50,833 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:51,681 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:51,833 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:52,833 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:54,684 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:55,683 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:55,834 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:57,683 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:07:58,684 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:00,684 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:01,688 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:05,686 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:06,686 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:08,838 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:09,839 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:11,688 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:11,839 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:12,840 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:13,688 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:17,841 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:20,690 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:21,691 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:21,843 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:22,691 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN  [GossipStage:1] 2020-01-02 10:08:22,843 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
..................................................
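
Every warning carries the same two numbers, so it can help to pull them out of the log programmatically. The following is a minimal sketch, assuming the log line format shown above; the function name parse_generation_warning is hypothetical and is not part of Cassandra or nodetool.

```python
import re

# Matches the Gossiper warning exactly as it appears in the log excerpt above.
WARNING_RE = re.compile(
    r"received an invalid gossip generation for peer /(?P<peer>[\d.]+); "
    r"local generation = (?P<local>\d+), received generation = (?P<received>\d+)"
)


def parse_generation_warning(line):
    """Return (peer, local_generation, received_generation), or None if the line does not match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return match.group("peer"), int(match.group("local")), int(match.group("received"))


sample = (
    "WARN  [GossipStage:1] 2020-01-02 10:07:45,831 Gossiper.java:1105 - "
    "received an invalid gossip generation for peer /10.168.12.3; "
    "local generation = 1527840276, received generation = 1577928397"
)
peer, local_gen, received_gen = parse_generation_warning(sample)
# Print the peer, both generations, and the gap between them in seconds.
print(peer, local_gen, received_gen, received_gen - local_gen)
```

For the sample line this gives a gap of 50088121 seconds between the generation the other nodes hold and the one the restarted node is announcing.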

Next we use nodetool gossipinfo to look at the gossip state recorded for the restarted node:

$ nodetool gossipinfo
.................................
/10.168.12.3
  generation:1527840276
  heartbeat:22488596
  HOST_ID:57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f
  SCHEMA:54b29ca7-5a9c-345b-be73-437504faf71b
  SEVERITY:0.0
  NET_VERSION:8
  RACK:RAC-1A
  DC:DC-SGM-SH
  RELEASE_VERSION:2.1.9
  STATUS:NORMAL,-101651619030947983
  RPC_ADDRESS:10.168.12.3
  LOAD:8.72963151E8
.................................

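The other nodes have recorded generation 1527840276 for the restarted node. Converted to a readable time, that epoch is Friday, June 1, 2018, 08:04 AM (UTC), which is when we had last started Cassandra on that node. We then log in to the restarted node and query gossip_generation in the system.local table: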

$ cqlsh `hostname` -u cassandra
cassandra@cqlsh> use system;
cassandra@cqlsh:system> select key , gossip_generation from local ;

 key   | gossip_generation
-------+-------------------
 local |        1577928397

(1 rows)
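
Both generation values are Unix epoch timestamps in seconds. A minimal sketch for converting them to readable UTC times and measuring the gap between them (the helper name epoch_to_utc is just for illustration):

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


stored = epoch_to_utc(1527840276)     # generation held by the other nodes
announced = epoch_to_utc(1577928397)  # gossip_generation on the restarted node

print(stored.isoformat())         # 2018-06-01T08:04:36+00:00
print(announced.isoformat())      # 2020-01-02T01:26:37+00:00
print((announced - stored).days)  # 579
```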

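1577928397 converts to Thursday, January 2, 2020, 01:26 AM (UTC). The two timestamps are more than a year and a half apart (about 580 days); in other words, before this restart Cassandra had last been started on this node on Friday, June 1, 2018, at 08:04 AM. This restart therefore triggered a known Cassandra bug. As we understand the ticket below, affected versions reject a gossip generation that is more than one year newer than the generation already stored for that peer: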

https://issues.apache.org/jira/browse/CASSANDRA-10969

https://support.datastax.com/hc/en-us/articles/115001096783-Nodes-showing-DN-in-nodetool-status-with-invalid-gossip-generation-warning-in-logs
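
To make the failure concrete, here is a rough Python paraphrase of the generation check as we read it from CASSANDRA-10969. It is a sketch under assumptions, not the actual Java source: the one-year constant, the function names, and the post-fix comparison against the receiver's clock are our reading of the ticket.

```python
# One year in seconds; ASSUMED to correspond to MAX_GENERATION_DIFFERENCE in Gossiper.java.
MAX_GENERATION_DIFFERENCE = 86400 * 365


def rejected_before_fix(local_generation, remote_generation):
    """Assumed pre-fix rule: compare against the generation stored for the peer."""
    return (
        local_generation != 0
        and remote_generation > local_generation + MAX_GENERATION_DIFFERENCE
    )


def rejected_after_fix(remote_generation, now_seconds):
    """Assumed post-fix rule: compare against the receiving node's current time."""
    return remote_generation > now_seconds + MAX_GENERATION_DIFFERENCE


local_generation = 1527840276   # what the other nodes hold for 10.168.12.3
remote_generation = 1577928397  # what the restarted node announces

# The gap is 50088121 seconds, more than one year, so the old rule rejects it.
print(rejected_before_fix(local_generation, remote_generation))  # True
# Measured against a clock shortly after the restart, the same value is accepted.
print(rejected_after_fix(remote_generation, now_seconds=remote_generation + 60))  # False
```

Under these assumptions, any node that has been up for more than a year without a restart would hit the warning the next time it restarts, which matches what we observed.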

A blog post by an expert on the same problem is also worth reading:

https://mash213.wordpress.com/2019/07/05/scylla-received-an-invalid-gossip-generation-for-peer-how-to-resolve/

As the fix, we restarted the nodes of the cluster one at a time.

Original article: https://www.cnblogs.com/ilifeilong/p/12133529.html