Redis生产环境节点宕机问题报错及恢复排错

Redis故障发现

主观下线

当cluster-node-timeout时间内某节点无法与另一个节点顺利完成ping消息通信时，则将该节点标记为主观下线状态。

客观下线

当某个节点判断另一个节点主观下线后，该节点的下线报告会通过Gossip消息传播。当接收节点发现消息体中含有主观下线的节点，其会尝试对该节点进行客观下线，依据下线报告是否在有效期内(如果在cluster-node-timeout*2时间内无法收集到一半以上槽节点的下线报告，那么之前的下线报告会过期)，且数量大于槽节点总数的一半。若是，则将该节点更新为客观下线，并向集群广播下线节点的fail消息。

Redis故障恢复

故障节点变为客观下线后，如果下线节点是持有槽的主节点，则需要在它的从节点中选出一个替换它，从而保证集群的高可用，过程如下：

资格检查

每个从节点都要检查最后与主节点断线时间，判断是否有资格替换故障的主节点。如果从节点与主节点断线时间超过(cluster-node-timeout*cluster-slave-validity-factor)，则当前从节点不具备故障转移资格。

准备选举时间

从节点符合故障转移资格后，更新触发故障选举时间，只有到达该时间才能执行后续流程。采用延迟触发机制，主要是对多个从节点使用不同的延迟选举时间来支持优先级。复制偏移量越大说明从节点延迟越低，那么它应该具有更高的优先级。

发起选举

当从节点到达故障选举时间后，会触发选举流程：
1. 更新配置周期
  
  配置周期是一个只增不减的整数，每个主节点自身维护一个配置周期，标示当前主节点的版本，所有主节点的配置周期都不相等，从节点会复制主节点的配置周期。整个集群又维护一个全局的配置周期，用于记录集群内所有主节点配置周期的最大版本。每次集群发生重大事件，如新加入主节点或由从节点转换而来，从节点竞争选举，都会递增集群全局配置周期并赋值给相关主节点，用于记录这一关键事件。
2. 广播选举消息
  
  在集群内广播选举消息，并记录已发送过消息的状态，保证该从节点在一个配置周期内只能发起一次选举。
选举投票

只有持有槽的主节点才会处理故障选举消息，每个持有槽的节点在一个配置周期内都有唯一的一张选票，当接到第一个请求投票的从节点消息，回复消息作为投票，之后相同配置周期内其它从节点的选举消息将忽略。投票过程其实是一个领导者选举的过程。

每个配置周期代表了一次选举周期，如果在开始投票后的cluster-node-timeout*2时间内从节点没有获取足够数量的投票，则本次选举作废。从节点对配置周期自增并发起下一轮投票，直到选举成功为止。

替换主节点

当前从节点取消复制变为主节点，撤销故障主节点负责的槽，把这些槽委派给自己，并向集群广播告知所有节点当前从节点变为主节点。

Redis故障转移时间(failover-time)

主观下线识别时间等于 cluster-node-timeout。
主观下线状态消息传播时间 小于等于 cluster-node-timeout/2(消息通信机制会优先选取下线状态节点通信)。
从节点转移时间 小于等于 1000毫秒(偏移量最大的从节点最多延迟1秒发起选举，通常一次就会成功)。

所以，failover-time(毫秒) 小于等于 cluster-node-timeout + cluster-node-timeout/2 + 1000

Redis生产环境节点宕机问题报错及恢复

问题背景和分析

当前文档记录一次Redis生产环境的问题，当时是发现生产环境的业务，用户账号无法进行登录和注册，但是其他的一些业务操作是可以实现。

这里首先查看一些Java日志，发现如下报错：

Java日志报错

{"jv_time":"2020-10-01 08:02:24.222","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-33","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Redis server response timeout (10000 ms) occured for command: (HSET) with params: [redisson_spring_session:0e4db6b7-bb7a-4050-949f-d6b182e8ccd0, PooledUnsafeDirectByteBuf(ridx: 0, widx: 26, cap: 256), PooledUnsafeDirectByteBuf(ridx: 0, widx: 32, cap: 256)] channel: [id: 0xbfc0a52c, L:/172.20.20.9:22454 - R:/172.20.20.9:7002] at org.redisson.command.CommandAsyncService$11.run(CommandAsyncService.java:682) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:12.935","jv_level":"ERROR","jv_thread":"redisson-netty-1-4","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host testsrv1982/172.20.20.9:7002 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for testsrv1982/172.20.20.9:7002 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:50.563","jv_level":"ERROR","jv_thread":"redisson-netty-1-14","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host testsrv1982/172.20.20.9:7002 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for testsrv1982/172.20.20.9:7002 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:59.953","jv_level":"ERROR","jv_thread":"redisson-netty-1-1","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host /172.20.20.9:7003 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for /172.20.20.9:7003 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-608","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-124","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-379","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

通过查看Java错误日志分析，主要有两条错误内容：

org.redisson.client.RedisTimeoutException: Redis server response timeout (10000 ms) occured for command: (HSET) with params: [redisson_spring_session:0e4db6b7-bb7a-4050-949f-d6b182e8ccd0, PooledUnsafeDirectByteBuf(ridx: 0, widx: 26, cap: 256), PooledUnsafeDirectByteBuf(ridx: 0, widx: 32, cap: 256)] channel: [id: 0xbfc0a52c, L:/172.20.20.9:22454 - R:/172.20.20.9:7002]

第一条日志是因为redis 服务器地址链接超时，导致HSET写入数据失败。

org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts

第二条日志是也是因为链接超时，导致HGET查询失败，并且重试了3次还是查询失败问题。

总结：通过上述错误的Java日志分析出问题出现在Redis集群中，然后又找了Redis的日志进行查看。发现集群日志中有 redis cluster fail 问题，节点出现异常，集群启动自动恢复的机制，导致有些redis 客户端链接无法正常写入和查询redis 数据，所以出现了用户注册和登录异常，我们来分析一些redis 集群自动恢复的日志。

Redis 错误日志

生产环境redis 集群出现fail 问题，自动恢复时，导致业务出现繁忙问题，分析可能导致集群节点问题的原因：

业务接口被刷，导致大量查询命令堆积，形成redis慢查询，导致出现网络超时

通过redis 慢查询命令 SHOWLOG GET 来分析哪个查询命令导致集群崩溃了

http://redis.cn/commands/slowlog
cluster-node-timeout 时间内网络超时，redis节点自动重新下线进行选举
rdb 备份同步大量数据到Redis 服务器中，服务器性能下降，出现网络超时

下面我们查询下Redis 日志，同时看下redis 集群恢复过程：

redis 集群节点

redissrv1:7001> cluster nodes
191e4f15c0de39efe1ab9248bd413c9fdb2dddec 172.20.20.10:7005@17005 master - 0 1601964809000 11 connected 10923-16383
00fced34f12176c1a229211be9a3d579a51091df 172.20.20.9:7003@17003 slave 9a28b0084c1486ceb0bca3c45f2554d79041e57d 0 1601964810090 9 connected
d1fbc46714f89985ce45445b07355869085b2e7e 172.20.20.9:7002@17002 slave 191e4f15c0de39efe1ab9248bd413c9fdb2dddec 0 1601964809090 11 connected
9a28b0084c1486ceb0bca3c45f2554d79041e57d 172.20.20.10:7004@17004 master - 0 1601964807087 9 connected 5461-10922
599f153057aae0b5ef2f1d895a8e4ac7d0474cec 172.20.20.10:7006@17006 master - 0 1601964809000 10 connected 0-5460
219e9a5cb2898bf656ff4bf18eeadc1467ae8dd8 172.20.20.9:7001@17001 myself,slave 599f153057aae0b5ef2f1d895a8e4ac7d0474cec 0 1601964808000 1 connected

当前这份日志只分析了出现错误的节点，但是当redis 集群一个节点出现问题之后，其他集群节点也会打印出相关的集群错误日志，这里不做过多赘述。

4233:C 01 Oct 07:58:09.005 * RDB: 8 MB of memory used by copy-on-write
8971:M 01 Oct 07:58:09.072 * Background saving terminated with success
8971:M 01 Oct 07:59:50.030 * 1000 changes in 100 seconds. Saving...
8971:M 01 Oct 07:59:50.182 * Background saving started by pid 4914
4914:C 01 Oct 07:59:53.778 * DB saved on disk
4914:C 01 Oct 07:59:53.796 * RDB: 7 MB of memory used by copy-on-write
8971:M 01 Oct 07:59:53.909 * Background saving terminated with success
8971:M 01 Oct 08:00:54.791 # Cluster state changed: fail 
# 在一次rdb 数据保存之后，redis 集群状态突然变成fail，这时说明集群已经处于不稳定状态
8971:M 01 Oct 08:00:59.888 # Cluster state changed: ok
8971:M 01 Oct 08:01:33.895 * Marking node d1fbc46714f89985ce45445b07355869085b2e7e as failing (quorum reached).
# d1fbc46714f89985ce45445b07355869085b2e7e 节点被标记为主观下线
8971:M 01 Oct 08:01:38.483 * 1000 changes in 100 seconds. Saving...
8971:M 01 Oct 08:02:23.076 * Background saving started by pid 5614
8971:M 01 Oct 08:02:23.852 # Cluster state changed: fail
8971:M 01 Oct 08:02:24.282 * Marking node 00fced34f12176c1a229211be9a3d579a51091df as failing (quorum reached).
# 00fced34f12176c1a229211be9a3d579a51091df 节点被标记为主观下线
8971:M 01 Oct 08:02:24.370 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: is reachable again and nobody is serving its slots after some time.
# 集群内其它节点接收到d1fbc46714f89985ce45445b07355869085b2e7e发来的ping消息，清空客观下线状态。
8971:M 01 Oct 08:02:25.158 # Failover auth denied to 191e4f15c0de39efe1ab9248bd413c9fdb2dddec: its master is up
# 191e4f15c0de39efe1ab9248bd413c9fdb2dddec 被提升成为主节点
8971:M 01 Oct 08:02:25.301 # Failover auth granted to 9a28b0084c1486ceb0bca3c45f2554d79041e57d for epoch 9
# 当前节点为 9a28b0084c1486ceb0bca3c45f2554d79041e57d 进行投票
8971:M 01 Oct 08:02:27.454 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: master without slots is reachable again.
# 集群内其它节点接收到00fced34f12176c1a229211be9a3d579a51091df 发来的ping消息，清空客观下线状态。
5614:C 01 Oct 08:02:28.802 * DB saved on disk
5614:C 01 Oct 08:02:28.822 * RDB: 8 MB of memory used by copy-on-write
8971:M 01 Oct 08:02:28.914 * Background saving terminated with success
8971:M 01 Oct 08:02:29.903 # Cluster state changed: ok
# 经过从节点下线和提升新的主节点步骤，集群状态由fail 变成 ok 状态。
8971:M 01 Oct 08:03:50.110 * 5000 changes in 80 seconds. Saving...
8971:M 01 Oct 08:07:29.150 * Background saving started by pid 5708
8971:M 01 Oct 08:07:29.663 # Connection with slave client id #4 lost.
8971:M 01 Oct 08:07:29.752 * Slave 172.20.20.10:7006 asks for synchronization
8971:M 01 Oct 08:07:29.782 * Partial resynchronization request from 172.20.20.10:7006 accepted. Sending 40417 bytes of backlog starting from offset 372035256846.
# 该节点与7006 节点进行增量同步
5708:C 01 Oct 08:07:34.672 * DB saved on disk
5708:C 01 Oct 08:07:34.692 * RDB: 12 MB of memory used by copy-on-write
8971:M 01 Oct 08:07:34.815 * Background saving terminated with success
8971:M 01 Oct 08:08:55.159 * 5000 changes in 80 seconds. Saving...
8971:M 01 Oct 08:08:55.468 * Background saving started by pid 6485
6485:C 01 Oct 08:09:00.333 * DB saved on disk
6485:C 01 Oct 08:09:00.349 * RDB: 15 MB of memory used by copy-on-write
8971:M 01 Oct 08:09:00.439 * Background saving terminated with success
8971:M 01 Oct 08:09:37.073 * Marking node 00fced34f12176c1a229211be9a3d579a51091df as failing (quorum reached).
# 00fced34f12176c1a229211be9a3d579a51091df 节点被标记为主观下线
8971:M 01 Oct 08:10:25.863 * 5000 changes in 80 seconds. Saving...
8971:M 01 Oct 08:12:02.845 * Background saving started by pid 6751
8971:M 01 Oct 08:12:04.741 # Connection with slave client id #8319 lost.
8971:M 01 Oct 08:12:04.777 # Configuration change detected. Reconfiguring myself as a replica of 599f153057aae0b5ef2f1d895a8e4ac7d0474cec
# 219e9a5cb2898bf656ff4bf18eeadc1467ae8dd8 升级成为 599f153057aae0b5ef2f1d895a8e4ac7d0474cec 的从节点
8971:S 01 Oct 08:12:04.787 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
8971:S 01 Oct 08:12:05.007 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:12:05.895 * Connecting to MASTER 172.20.20.10:7006
# 该节点与master 节点进行连接
8971:S 01 Oct 08:12:06.350 * MASTER <-> SLAVE sync started
# 该节点与master 节点进行同步备份
8971:S 01 Oct 08:12:06.543 * Non blocking connect for SYNC fired the event.
8971:S 01 Oct 08:12:06.543 * Master replied to PING, replication can continue...
8971:S 01 Oct 08:12:06.544 * Trying a partial resynchronization (request 57b5cf8f66406836b49f26855b77a6f61e7653fc:372038922930).
8971:S 01 Oct 08:12:06.559 * Full resync from master: 1a6c129f8d1de27cb06714b27c46a420eae014e2:372038873893
8971:S 01 Oct 08:12:06.559 * Discarding previously cached master state.
8971:S 01 Oct 08:12:10.182 * MASTER <-> SLAVE sync: receiving 165760616 bytes from master
6751:C 01 Oct 08:12:10.661 * DB saved on disk
6751:C 01 Oct 08:12:14.532 * RDB: 11 MB of memory used by copy-on-write
8971:S 01 Oct 08:12:29.561 * Background saving terminated with success
8971:S 01 Oct 08:12:37.365 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:12:37.764 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:12:49.951 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:12:50.674 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:14:08.346 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:14:13.072 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:14:40.230 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:14:42.439 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:16:35.122 * FAIL message received from 191e4f15c0de39efe1ab9248bd413c9fdb2dddec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:16:38.671 * FAIL message received from 191e4f15c0de39efe1ab9248bd413c9fdb2dddec about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:16:42.318 * 50 changes in 250 seconds. Saving...
8971:S 01 Oct 08:19:59.146 * Background saving started by pid 7514
8971:S 01 Oct 08:20:00.886 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:20:01.319 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
7514:C 01 Oct 08:20:04.454 * DB saved on disk
7514:C 01 Oct 08:20:04.514 * RDB: 11 MB of memory used by copy-on-write
8971:S 01 Oct 08:20:05.327 * Background saving terminated with success
8971:S 01 Oct 08:22:20.042 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:24:08.344 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:25:27.485 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:25:45.951 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:25:54.469 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:26:18.242 # Cluster state changed: fail
8971:S 01 Oct 08:26:20.579 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:26:28.343 # Cluster state changed: ok
8971:S 01 Oct 08:27:11.280 # Cluster state changed: fail
8971:S 01 Oct 08:27:18.984 # Cluster state changed: ok
8971:S 01 Oct 08:27:25.077 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:27:25.191 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:27:28.897 * MASTER <-> SLAVE sync: Flushing old data
8971:S 01 Oct 08:27:34.787 * MASTER <-> SLAVE sync: Loading DB in memory
# 上述节点日志中，全部是在进行节点的上下线和同步过程，到这步开始进行加载内存
8971:S 01 Oct 08:28:28.832 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:28:28.859 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:28:37.881 * MASTER <-> SLAVE sync: Finished with success
# 这里表示主节点和slave 备份数据同步成功。
8971:S 01 Oct 08:28:37.898 * Background append only file rewriting started by pid 10524
8971:S 01 Oct 08:28:38.733 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:28:39.330 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:28:43.029 * AOF rewrite child asks to stop sending diffs.
10524:C 01 Oct 08:28:43.040 * Parent agreed to stop sending diffs. Finalizing AOF...
10524:C 01 Oct 08:28:43.040 * Concatenating 6.43 MB of AOF diff received from parent.
10524:C 01 Oct 08:28:43.087 * SYNC append only file rewrite performed
10524:C 01 Oct 08:28:43.104 * AOF rewrite: 112 MB of memory used by copy-on-write
8971:S 01 Oct 08:28:43.332 * Background AOF rewrite terminated with success
8971:S 01 Oct 08:28:43.333 * Residual parent diff successfully flushed to the rewritten AOF (0.02 MB)
# 这里是针对aof 备份文件的重新对比校验数据和写入的步骤。
8971:S 01 Oct 08:28:43.333 * Background AOF rewrite finished successfully
8971:S 01 Oct 08:28:43.434 * 10 changes in 300 seconds. Saving...
8971:S 01 Oct 08:28:43.451 * Background saving started by pid 11643
11643:C 01 Oct 08:28:48.797 * DB saved on disk
11643:C 01 Oct 08:28:48.837 * RDB: 14 MB of memory used by copy-on-write
8971:S 01 Oct 08:28:48.916 * Background saving terminated with success
8971:S 01 Oct 08:30:09.007 * 5000 changes in 80 seconds. Saving...
8971:S 01 Oct 08:30:09.024 * Background saving started by pid 12605
12605:C 01 Oct 08:30:16.128 * DB saved on disk

总结：

Redis集群自身实现了高可用，当集群内少量节点出现故障时通过自动故障转移保证集群可以正常对外提供服务。

日志中Redis集群自动恢复机制，经过节点下线和提升新的主节点投票，上线等步骤，完成新的集群槽点同步，数据校验，全部是由redis 本身自动操作的。

错误分析汇总

问题总结：

经过Java日志和Redis日志的分析和汇总，结合生产业务的注册和登录进行判断，是因为Redis集群节点在进行同步和提升新的节点成为主节点时，更换了业务槽点，以及在同步数据时，redis 集群占用系统资源过多，造成用户工程访问redis集群节点超时，同时查询连接槽点失败，导致数据查询错误。

问题解决：

关于Redis集群节点恢复之后，生产业务依然超时，不能正常工作，考虑是连接旧的槽点分配，并且用户工程一直向旧的槽点发送请求，导致redis一直接受空的查询和写入，造成系统资源过高问题。

这里重新启动用户工程，使工程连接新的Redis节点状态，让数据写入正确的槽点位置，解决该问题。