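Redis Sentinel auto-failover: a complete test walkthrough with one master and two slaves

Environment: Red Hat Enterprise Linux Server release 6.5 (Santiago)

Master instance port: 38001

Slave instance ports: 38002, 38003

Sentinel instance ports: 39001, 39002, 39003

I. Start the instances

The initial Sentinel configuration file: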

### 10.10.100.76 38001
sentinel monitor master1 10.10.100.76 38001 2
sentinel down-after-milliseconds master1 30000
sentinel failover-timeout master1 60000
sentinel parallel-syncs master1 1
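
To make the four directives easier to reason about, here is a minimal Python sketch that parses the configuration above into a dictionary. The function name parse_sentinel_conf and the dictionary layout are illustrative assumptions and not part of Redis; only the directive names and values come from the file.

```python
def parse_sentinel_conf(text):
    """Collect `sentinel <directive> <master-name> <args...>` lines per master.

    Hypothetical helper for illustration; it only understands the simple
    per-master directives used in this post.
    """
    masters = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0] != "sentinel" or len(parts) < 4:
            continue  # not a per-master sentinel directive
        directive, name, args = parts[1], parts[2], parts[3:]
        entry = masters.setdefault(name, {})
        if directive == "monitor":
            # sentinel monitor <name> <host> <port> <quorum>
            entry["host"] = args[0]
            entry["port"] = int(args[1])
            entry["quorum"] = int(args[2])
        else:
            # other directives: keep the first argument, as int when numeric
            entry[directive] = int(args[0]) if args[0].isdigit() else args[0]
    return masters


CONF = """\
### 10.10.100.76 38001
sentinel monitor master1 10.10.100.76 38001 2
sentinel down-after-milliseconds master1 30000
sentinel failover-timeout master1 60000
sentinel parallel-syncs master1 1
"""

conf = parse_sentinel_conf(CONF)
for key, value in conf["master1"].items():
    print(key, "=", value)
```

Read this way, the file says: monitor 10.10.100.76:38001 under the name master1 with a quorum of 2 (two of the three Sentinels must agree that the master is down), flag an unreachable instance as subjectively down after 30000 ms, use a failover timeout of 60000 ms, and reconfigure one slave at a time after a failover.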

Start the Redis master and slave instances:

[boss@localhost src]$ ./redis-server ../conf/redis38001.conf >> ../conf/logs/redis38001.log &
[1] 8820
[boss@localhost src]$ ./redis-server ../conf/redis38002.conf >> ../conf/logs/redis38002.log &
[2] 8825
[boss@localhost src]$ ./redis-server ../conf/redis38003.conf >> ../conf/logs/redis38003.log &
[3] 8829
[boss@localhost src]$

Start the Sentinel instances:

[boss@localhost src]$ ./redis-sentinel ../conf/sentinel39001.conf >> ../conf/logs/sentinel39001.log &
[1] 8869
[boss@localhost src]$ ./redis-sentinel ../conf/sentinel39002.conf >> ../conf/logs/sentinel39002.log &
[2] 8872
[boss@localhost src]$ ./redis-sentinel ../conf/sentinel39003.conf >> ../conf/logs/sentinel39003.log &
[3] 8875
[boss@localhost src]$

Check the running Redis processes:

[boss@localhost src]$ ps -ef | grep redis
boss      8820  8795  0 17:42 pts/2    00:00:00 ./redis-server *:38001                
boss      8825  8795  0 17:43 pts/2    00:00:00 ./redis-server *:38002                
boss      8829  8795  0 17:43 pts/2    00:00:00 ./redis-server *:38003                
boss      8869  8842  0 17:46 pts/3    00:00:00 ./redis-sentinel *:39001 [sentinel]        
boss      8872  8842  0 17:46 pts/3    00:00:00 ./redis-sentinel *:39002 [sentinel]        
boss      8875  8842  0 17:46 pts/3    00:00:00 ./redis-sentinel *:39003 [sentinel]        
boss      8879  8683  0 17:46 pts/1    00:00:00 grep redis
[boss@localhost src]$

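The Sentinel configuration file after startup (this copy lists 39001 and 39002 as known Sentinels, so it belongs to Sentinel 39003). Sentinel has rewritten the file and appended the epochs plus the slaves and Sentinels it discovered. The down-after-milliseconds and parallel-syncs lines are gone, presumably because their values (30000 and 1) equal the Sentinel defaults, which the rewrite does not write out: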

### 10.10.100.76 38001
sentinel monitor master1 10.10.100.76 38001 2
sentinel failover-timeout master1 60000
sentinel config-epoch master1 0
sentinel leader-epoch master1 0
# Generated by CONFIG REWRITE
maxclients 4064
sentinel known-slave master1 10.10.100.76 38003
sentinel known-slave master1 10.10.100.76 38002
sentinel known-sentinel master1 10.10.100.76 39002 78596f9d15311475e841904788784851c961e145
sentinel known-sentinel master1 10.10.100.76 39001 20363efd6f67c1e51364205884e8d6fcdb1cc96d
sentinel current-epoch 0

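Sentinel startup log (all three are nearly identical):

On startup the Sentinel generates its runid, starts monitoring the master, and then discovers the master's slaves and the other Sentinels monitoring the same master.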

8869:X 26 Jul 17:46:10.854 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
8869:X 26 Jul 17:46:10.854 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
8869:X 26 Jul 17:46:10.854 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.0.7 (00000000/0) 64 bit
  .-`` .-```.  ```/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in sentinel mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 39001
 |    `-._   `._    /     _.-'    |     PID: 8869
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               
 
8869:X 26 Jul 17:46:10.857 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8869:X 26 Jul 17:46:10.857 # Sentinel runid is 20363efd6f67c1e51364205884e8d6fcdb1cc96d
8869:X 26 Jul 17:46:10.857 # +monitor master master1 10.10.100.76 38001 quorum 2
8869:X 26 Jul 17:46:11.858 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:11.868 * +slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:19.608 * +sentinel sentinel 10.10.100.76:39002 10.10.100.76 39002 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:24.791 * +sentinel sentinel 10.10.100.76:39003 10.10.100.76 39003 @ master1 10.10.100.76 38001
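
The +slave and +sentinel events in this log can be turned into a small topology summary. The following is a rough sketch that assumes the Redis 3.0 log layout shown in this post; the regular expression and the topology helper are hypothetical and are not a Sentinel API.

```python
import re

# Assumed layout: <pid>:X <day> <month> <time> <#|*> <event> <details...>
EVENT_RE = re.compile(
    r"^\d+:X \d+ \w+ (?P<time>[\d:.]+) [#*] (?P<event>[+-][\w-]+) (?P<rest>.*)$"
)

LOG = """\
8869:X 26 Jul 17:46:10.857 # Sentinel runid is 20363efd6f67c1e51364205884e8d6fcdb1cc96d
8869:X 26 Jul 17:46:10.857 # +monitor master master1 10.10.100.76 38001 quorum 2
8869:X 26 Jul 17:46:11.858 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:11.868 * +slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:19.608 * +sentinel sentinel 10.10.100.76:39002 10.10.100.76 39002 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:24.791 * +sentinel sentinel 10.10.100.76:39003 10.10.100.76 39003 @ master1 10.10.100.76 38001
"""


def topology(log_text):
    """Return the slaves and peer Sentinels announced in a Sentinel log."""
    found = {"slave": [], "sentinel": []}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m or m.group("event") not in ("+slave", "+sentinel"):
            continue
        # details: <type> <name> <ip> <port> @ <master-name> <ip> <port>
        fields = m.group("rest").split()
        found[fields[0]].append((fields[2], int(fields[3])))
    return found


topo = topology(LOG)
print("slaves:   ", topo["slave"])
print("sentinels:", topo["sentinel"])
```

For the log of Sentinel 39001 this yields the two slaves 38002 and 38003 and the two peer Sentinels 39002 and 39003, which matches the known-slave and known-sentinel lines written to the configuration file.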

Redis master instance log (port 38001)

Startup, followed by a full resync with each slave.

...... (earlier lines omitted)
8820:M 26 Jul 17:42:55.889 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8820:M 26 Jul 17:42:55.889 # Server started, Redis version 3.0.7
8820:M 26 Jul 17:42:55.890 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
8820:M 26 Jul 17:42:55.890 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
8820:M 26 Jul 17:42:55.890 * DB loaded from disk: 0.000 seconds
8820:M 26 Jul 17:42:55.890 * The server is now ready to accept connections on port 38001
8820:M 26 Jul 17:43:08.360 * Slave 10.10.100.76:38002 asks for synchronization
8820:M 26 Jul 17:43:08.360 * Full resync requested by slave 10.10.100.76:38002
8820:M 26 Jul 17:43:08.360 * Starting BGSAVE for SYNC with target: disk
8820:M 26 Jul 17:43:08.361 * Background saving started by pid 8828
8828:C 26 Jul 17:43:08.371 * DB saved on disk
8828:C 26 Jul 17:43:08.372 * RDB: 4 MB of memory used by copy-on-write
8820:M 26 Jul 17:43:08.418 * Background saving terminated with success
8820:M 26 Jul 17:43:08.418 * Synchronization with slave 10.10.100.76:38002 succeeded
8820:M 26 Jul 17:43:13.867 * Slave 10.10.100.76:38003 asks for synchronization
8820:M 26 Jul 17:43:13.867 * Full resync requested by slave 10.10.100.76:38003
8820:M 26 Jul 17:43:13.867 * Starting BGSAVE for SYNC with target: disk
8820:M 26 Jul 17:43:13.868 * Background saving started by pid 8832
8832:C 26 Jul 17:43:13.878 * DB saved on disk
8832:C 26 Jul 17:43:13.878 * RDB: 4 MB of memory used by copy-on-write
8820:M 26 Jul 17:43:13.930 * Background saving terminated with success
8820:M 26 Jul 17:43:13.930 * Synchronization with slave 10.10.100.76:38003 succeeded

Redis slave instance log (port 38002; the log for 38003 is equivalent and is omitted):

Startup, followed by one full resync with the master.

...... (earlier lines omitted)
8825:S 26 Jul 17:43:08.359 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8825:S 26 Jul 17:43:08.359 # Server started, Redis version 3.0.7
8825:S 26 Jul 17:43:08.359 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
8825:S 26 Jul 17:43:08.359 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
8825:S 26 Jul 17:43:08.359 * DB loaded from disk: 0.000 seconds
8825:S 26 Jul 17:43:08.359 * The server is now ready to accept connections on port 38002
8825:S 26 Jul 17:43:08.359 * Connecting to MASTER 10.10.100.76:38001
8825:S 26 Jul 17:43:08.360 * MASTER <-> SLAVE sync started
8825:S 26 Jul 17:43:08.360 * Non blocking connect for SYNC fired the event.
8825:S 26 Jul 17:43:08.360 * Master replied to PING, replication can continue...
8825:S 26 Jul 17:43:08.360 * Partial resynchronization not possible (no cached master)
8825:S 26 Jul 17:43:08.361 * Full resync from master: 49d5d828d5c8f87a3d5ee910e6b92a271398f368:1
8825:S 26 Jul 17:43:08.418 * MASTER <-> SLAVE sync: receiving 40 bytes from master
8825:S 26 Jul 17:43:08.418 * MASTER <-> SLAVE sync: Flushing old data
8825:S 26 Jul 17:43:08.418 * MASTER <-> SLAVE sync: Loading DB in memory
8825:S 26 Jul 17:43:08.419 * MASTER <-> SLAVE sync: Finished with success

II. Simulate a master failure and automatic failover

Shut down the master (port 38001); kill would also work:

[boss@localhost src]$ ./redis-cli -p 38001 -a ai2016boss shutdown
[boss@localhost src]$

Sentinel log (port 39001)

...... (earlier lines omitted)
8869:X 26 Jul 17:46:10.857 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8869:X 26 Jul 17:46:10.857 # Sentinel runid is 20363efd6f67c1e51364205884e8d6fcdb1cc96d
8869:X 26 Jul 17:46:10.857 # +monitor master master1 10.10.100.76 38001 quorum 2
8869:X 26 Jul 17:46:11.858 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:11.868 * +slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:19.608 * +sentinel sentinel 10.10.100.76:39002 10.10.100.76 39002 @ master1 10.10.100.76 38001
8869:X 26 Jul 17:46:24.791 * +sentinel sentinel 10.10.100.76:39003 10.10.100.76 39003 @ master1 10.10.100.76 38001
8869:X 26 Jul 18:00:47.667 # +sdown master master1 10.10.100.76 38001
8869:X 26 Jul 18:00:47.759 # +new-epoch 1
8869:X 26 Jul 18:00:47.761 # +vote-for-leader 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
8869:X 26 Jul 18:00:48.110 # +config-update-from sentinel 10.10.100.76:39003 10.10.100.76 39003 @ master1 10.10.100.76 38001
8869:X 26 Jul 18:00:48.110 # +switch-master master1 10.10.100.76 38001 10.10.100.76 38003
8869:X 26 Jul 18:00:48.110 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38003
8869:X 26 Jul 18:00:48.110 * +slave slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003
8869:X 26 Jul 18:01:18.145 # +sdown slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003

Sentinel log (port 39002)

...... (earlier lines omitted)
8872:X 26 Jul 17:46:17.568 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8872:X 26 Jul 17:46:17.568 # Sentinel runid is 78596f9d15311475e841904788784851c961e145
8872:X 26 Jul 17:46:17.568 # +monitor master master1 10.10.100.76 38001 quorum 2
8872:X 26 Jul 17:46:17.568 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8872:X 26 Jul 17:46:17.570 * +slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8872:X 26 Jul 17:46:17.957 * +sentinel sentinel 10.10.100.76:39001 10.10.100.76 39001 @ master1 10.10.100.76 38001
8872:X 26 Jul 17:46:24.791 * +sentinel sentinel 10.10.100.76:39003 10.10.100.76 39003 @ master1 10.10.100.76 38001
8872:X 26 Jul 18:00:47.715 # +sdown master master1 10.10.100.76 38001
8872:X 26 Jul 18:00:47.760 # +new-epoch 1
8872:X 26 Jul 18:00:47.761 # +vote-for-leader 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
8872:X 26 Jul 18:00:47.773 # +odown master master1 10.10.100.76 38001 #quorum 3/2
8872:X 26 Jul 18:00:47.774 # Next failover delay: I will not start a failover before Tue Jul 26 18:02:47 2016
8872:X 26 Jul 18:00:48.111 # +config-update-from sentinel 10.10.100.76:39003 10.10.100.76 39003 @ master1 10.10.100.76 38001
8872:X 26 Jul 18:00:48.111 # +switch-master master1 10.10.100.76 38001 10.10.100.76 38003
8872:X 26 Jul 18:00:48.111 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38003
8872:X 26 Jul 18:00:48.111 * +slave slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003
8872:X 26 Jul 18:01:18.160 # +sdown slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003

Sentinel log (port 39003)

...... (earlier lines omitted)
8875:X 26 Jul 17:46:22.721 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8875:X 26 Jul 17:46:22.721 # Sentinel runid is 5c24343d83dd1e0da6e1e511dc5dd690ee804065
8875:X 26 Jul 17:46:22.721 # +monitor master master1 10.10.100.76 38001 quorum 2
8875:X 26 Jul 17:46:23.722 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8875:X 26 Jul 17:46:23.731 * +slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8875:X 26 Jul 17:46:24.110 * +sentinel sentinel 10.10.100.76:39001 10.10.100.76 39001 @ master1 10.10.100.76 38001
8875:X 26 Jul 17:46:25.754 * +sentinel sentinel 10.10.100.76:39002 10.10.100.76 39002 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.684 # +sdown master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.755 # +odown master master1 10.10.100.76 38001 #quorum 2/2
8875:X 26 Jul 18:00:47.755 # +new-epoch 1
8875:X 26 Jul 18:00:47.755 # +try-failover master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.758 # +vote-for-leader 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
8875:X 26 Jul 18:00:47.761 # 10.10.100.76:39001 voted for 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
8875:X 26 Jul 18:00:47.761 # 10.10.100.76:39002 voted for 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
8875:X 26 Jul 18:00:47.859 # +elected-leader master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.859 # +failover-state-select-slave master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.911 # +selected-slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.911 * +failover-state-send-slaveof-noone slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.995 * +failover-state-wait-promotion slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:48.053 # +promoted-slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:48.053 # +failover-state-reconf-slaves master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:48.108 * +slave-reconf-sent slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:48.881 # -odown master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:49.143 * +slave-reconf-inprog slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:49.143 * +slave-reconf-done slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:49.208 # +failover-end master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:49.208 # +switch-master master1 10.10.100.76 38001 10.10.100.76 38003
8875:X 26 Jul 18:00:49.209 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38003
8875:X 26 Jul 18:00:49.209 * +slave slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003
8875:X 26 Jul 18:01:19.294 # +sdown slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003

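The Sentinel configuration file after the failover (this copy lists 39003 and 39002 as known Sentinels, so it belongs to Sentinel 39001). The monitored address now points to 38003 and the epochs have advanced to 1: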

### 10.10.100.76 38001
sentinel monitor master1 10.10.100.76 38003 2
sentinel failover-timeout master1 60000
sentinel config-epoch master1 1
sentinel leader-epoch master1 1
# Generated by CONFIG REWRITE
maxclients 4064
sentinel known-slave master1 10.10.100.76 38002
sentinel known-slave master1 10.10.100.76 38001
sentinel known-sentinel master1 10.10.100.76 39003 5c24343d83dd1e0da6e1e511dc5dd690ee804065
sentinel known-sentinel master1 10.10.100.76 39002 78596f9d15311475e841904788784851c961e145
sentinel current-epoch 1

Log of the original master instance (port 38001)

...... (earlier lines omitted)
8820:M 26 Jul 17:58:14.080 * 1 changes in 900 seconds. Saving...
8820:M 26 Jul 17:58:14.081 * Background saving started by pid 8971
8971:C 26 Jul 17:58:14.083 * DB saved on disk
8971:C 26 Jul 17:58:14.084 * RDB: 4 MB of memory used by copy-on-write
8820:M 26 Jul 17:58:14.182 * Background saving terminated with success
8820:M 26 Jul 18:00:17.627 # User requested shutdown...
8820:M 26 Jul 18:00:17.627 * Saving the final RDB snapshot before exiting.
8820:M 26 Jul 18:00:17.629 * DB saved on disk
8820:M 26 Jul 18:00:17.629 # Redis is now ready to exit, bye bye...

Log of the original slave instance (port 38002)

8825:S 26 Jul 17:58:09.087 * 1 changes in 900 seconds. Saving...
8825:S 26 Jul 17:58:09.088 * Background saving started by pid 8968
8968:C 26 Jul 17:58:09.093 * DB saved on disk
8968:C 26 Jul 17:58:09.093 * RDB: 4 MB of memory used by copy-on-write
8825:S 26 Jul 17:58:09.188 * Background saving terminated with success
8825:S 26 Jul 18:00:17.629 # Connection with master lost.
8825:S 26 Jul 18:00:17.629 * Caching the disconnected master state.
8825:S 26 Jul 18:00:18.331 * Connecting to MASTER 10.10.100.76:38001
8825:S 26 Jul 18:00:18.331 * MASTER <-> SLAVE sync started
8825:S 26 Jul 18:00:18.331 # Error condition on socket for SYNC: Connection refused
8825:S 26 Jul 18:00:19.332 * Connecting to MASTER 10.10.100.76:38001
8825:S 26 Jul 18:00:19.333 * MASTER <-> SLAVE sync started
8825:S 26 Jul 18:00:19.333 # Error condition on socket for SYNC: Connection refused
8825:S 26 Jul 18:00:20.334 * Connecting to MASTER 10.10.100.76:38001
8825:S 26 Jul 18:00:20.334 * MASTER <-> SLAVE sync started
8825:S 26 Jul 18:00:20.334 # Error condition on socket for SYNC: Connection refused
........ (many more attempts to reconnect to the master omitted here)
8825:S 26 Jul 18:00:48.108 * Discarding previously cached master state.
8825:S 26 Jul 18:00:48.108 * SLAVE OF 10.10.100.76:38003 enabled (user request from 'id=8 addr=10.10.100.76:59327 fd=11 name=sentinel-5c24343d-cmd age=865 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=139 qbuf-free=32629 obl=36 oll=0 omem=0 events=rw cmd=exec')
8825:S 26 Jul 18:00:48.110 # CONFIG REWRITE executed with success.
8825:S 26 Jul 18:00:48.392 * Connecting to MASTER 10.10.100.76:38003
8825:S 26 Jul 18:00:48.392 * MASTER <-> SLAVE sync started
8825:S 26 Jul 18:00:48.392 * Non blocking connect for SYNC fired the event.
8825:S 26 Jul 18:00:48.392 * Master replied to PING, replication can continue...
8825:S 26 Jul 18:00:48.392 * Partial resynchronization not possible (no cached master)
8825:S 26 Jul 18:00:48.394 * Full resync from master: 0ca88bef97ff1f9dddb3985fb31db97b77f70ad0:1
8825:S 26 Jul 18:00:48.492 * MASTER <-> SLAVE sync: receiving 51 bytes from master
8825:S 26 Jul 18:00:48.492 * MASTER <-> SLAVE sync: Flushing old data
8825:S 26 Jul 18:00:48.492 * MASTER <-> SLAVE sync: Loading DB in memory
8825:S 26 Jul 18:00:48.492 * MASTER <-> SLAVE sync: Finished with success
8825:S 26 Jul 18:13:10.079 * 1 changes in 900 seconds. Saving...
8825:S 26 Jul 18:13:10.080 * Background saving started by pid 9035
9035:C 26 Jul 18:13:10.083 * DB saved on disk
9035:C 26 Jul 18:13:10.084 * RDB: 4 MB of memory used by copy-on-write
8825:S 26 Jul 18:13:10.180 * Background saving terminated with success

Log of the original slave instance (port 38003)

8829:S 26 Jul 17:58:14.001 * 1 changes in 900 seconds. Saving...
8829:S 26 Jul 17:58:14.002 * Background saving started by pid 8970
8970:C 26 Jul 17:58:14.005 * DB saved on disk
8970:C 26 Jul 17:58:14.006 * RDB: 4 MB of memory used by copy-on-write
8829:S 26 Jul 17:58:14.103 * Background saving terminated with success
8829:S 26 Jul 18:00:17.629 # Connection with master lost.
8829:S 26 Jul 18:00:17.629 * Caching the disconnected master state.
8829:S 26 Jul 18:00:17.838 * Connecting to MASTER 10.10.100.76:38001
8829:S 26 Jul 18:00:17.838 * MASTER <-> SLAVE sync started
8829:S 26 Jul 18:00:17.838 # Error condition on socket for SYNC: Connection refused
8829:S 26 Jul 18:00:18.840 * Connecting to MASTER 10.10.100.76:38001
8829:S 26 Jul 18:00:18.840 * MASTER <-> SLAVE sync started
8829:S 26 Jul 18:00:18.840 # Error condition on socket for SYNC: Connection refused
........ (many more attempts to reconnect to the master omitted here)
8829:M 26 Jul 18:00:47.995 * Discarding previously cached master state.
8829:M 26 Jul 18:00:47.995 * MASTER MODE enabled (user request from 'id=8 addr=10.10.100.76:59874 fd=11 name=sentinel-5c24343d-cmd age=864 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=rw cmd=exec')
8829:M 26 Jul 18:00:47.997 # CONFIG REWRITE executed with success.
8829:M 26 Jul 18:00:48.392 * Slave 10.10.100.76:38002 asks for synchronization
8829:M 26 Jul 18:00:48.392 * Full resync requested by slave 10.10.100.76:38002
8829:M 26 Jul 18:00:48.392 * Starting BGSAVE for SYNC with target: disk
8829:M 26 Jul 18:00:48.393 * Background saving started by pid 8984
8984:C 26 Jul 18:00:48.404 * DB saved on disk
8984:C 26 Jul 18:00:48.405 * RDB: 4 MB of memory used by copy-on-write
8829:M 26 Jul 18:00:48.492 * Background saving terminated with success
8829:M 26 Jul 18:00:48.492 * Synchronization with slave 10.10.100.76:38002 succeeded

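Changes to the configuration files of the Redis master and slave instances (the files are too long to paste in full):

Port 38001: no change.

Port 38002: the master address in slaveof was updated and the CONFIG REWRITE section was appended: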

slaveof 10.10.100.76 38003
......
# Generated by CONFIG REWRITE
maxclients 4064

Port 38003: the slaveof line was removed and the CONFIG REWRITE section was appended:

# Generated by CONFIG REWRITE
maxclients 4064

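Conclusions:

1. After the original master went down, the Sentinels flagged the failure once down-after-milliseconds (30 seconds here) had elapsed, then failed over automatically and promoted the former slave 38003 to master.

2. According to the logs, Sentinel 39001 was the first to mark 38001 as sdown, but the failover was carried out by Sentinel 39003.

3. After the failover, 38003 is the master and 38001 and 38002 are registered as its slaves. Because 38001 was still down, each Sentinel finally logged: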

+sdown slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003
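
Conclusions 1 and 2 can be checked against the timestamps in the logs. The following small sketch uses times copied by hand from the master's shutdown line and from the +sdown master line of each Sentinel log; the variable names are illustrative only.

```python
from datetime import datetime


def ts(value):
    """Parse a log timestamp such as 18:00:47.667 (the date is not needed here)."""
    return datetime.strptime(value, "%H:%M:%S.%f")


# "User requested shutdown..." in the log of master 38001
SHUTDOWN = ts("18:00:17.627")

# "+sdown master master1 ..." in each Sentinel log, keyed by Sentinel port
SDOWN = {
    39001: ts("18:00:47.667"),
    39002: ts("18:00:47.715"),
    39003: ts("18:00:47.684"),
}

# Seconds between the shutdown and each Sentinel's sdown decision
delays = {
    port: round((t - SHUTDOWN).total_seconds(), 3) for port, t in SDOWN.items()
}
first = min(SDOWN, key=SDOWN.get)

for port in sorted(delays):
    print(port, "flagged sdown after", delays[port], "s")
print("first to detect:", first)
```

Every Sentinel flags the master slightly more than 30 seconds after the shutdown, which is consistent with down-after-milliseconds 30000, and 39001 is the earliest even though 39003 is the one that logs +try-failover and +elected-leader.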

III. Restart the downed instance 38001

Start 38001:

[boss@localhost src]$ ./redis-server ../conf/redis38001.conf >> ../conf/logs/redis38001.log &
[4] 9086
[1]   Done                    ./redis-server ../conf/redis38001.conf >> ../conf/logs/redis38001.log
[boss@localhost src]$

Changes to the configuration file of instance 38001; the following lines were appended:

# Generated by CONFIG REWRITE
slaveof 10.10.100.76 38003
maxclients 4064

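Sentinel log changes (the same on all three):

The sdown state of the restarted instance is cleared (the instance itself stays registered as a slave):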

8869:X 26 Jul 18:38:37.414 # -sdown slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003

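Redis instance log (port 38001, now a slave). The instance first comes up as a master (role marker M); about 10 seconds later the Sentinel sends SLAVEOF and it becomes a slave (role marker S).

...... (earlier lines omitted)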
9086:M 26 Jul 18:38:37.256 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
9086:M 26 Jul 18:38:37.256 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
9086:M 26 Jul 18:38:37.256 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.0.7 (00000000/0) 64 bit
  .-`` .-```.  ```/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 38001
 |    `-._   `._    /     _.-'    |     PID: 9086
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               
 
9086:M 26 Jul 18:38:37.257 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
9086:M 26 Jul 18:38:37.257 # Server started, Redis version 3.0.7
9086:M 26 Jul 18:38:37.258 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
9086:M 26 Jul 18:38:37.258 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
9086:M 26 Jul 18:38:37.258 * DB loaded from disk: 0.000 seconds
9086:M 26 Jul 18:38:37.258 * The server is now ready to accept connections on port 38001
9086:S 26 Jul 18:38:47.334 * SLAVE OF 10.10.100.76:38003 enabled (user request from 'id=2 addr=10.10.100.76:54160 fd=6 name=sentinel-5c24343d-cmd age=10 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=rw cmd=exec')
9086:S 26 Jul 18:38:47.335 # CONFIG REWRITE executed with success.
9086:S 26 Jul 18:38:48.309 * Connecting to MASTER 10.10.100.76:38003
9086:S 26 Jul 18:38:48.309 * MASTER <-> SLAVE sync started
9086:S 26 Jul 18:38:48.309 * Non blocking connect for SYNC fired the event.
9086:S 26 Jul 18:38:48.309 * Master replied to PING, replication can continue...
9086:S 26 Jul 18:38:48.310 * Partial resynchronization not possible (no cached master)
9086:S 26 Jul 18:38:48.311 * Full resync from master: 0ca88bef97ff1f9dddb3985fb31db97b77f70ad0:464126
9086:S 26 Jul 18:38:48.398 * MASTER <-> SLAVE sync: receiving 51 bytes from master
9086:S 26 Jul 18:38:48.398 * MASTER <-> SLAVE sync: Flushing old data
9086:S 26 Jul 18:38:48.398 * MASTER <-> SLAVE sync: Loading DB in memory
9086:S 26 Jul 18:38:48.398 * MASTER <-> SLAVE sync: Finished with success

Redis instance (port 38002, now a slave)

No notable changes.

Redis instance log (port 38003, now the master)

...... (earlier lines omitted)
8829:M 26 Jul 18:38:48.310 * Slave 10.10.100.76:38001 asks for synchronization
8829:M 26 Jul 18:38:48.310 * Full resync requested by slave 10.10.100.76:38001
8829:M 26 Jul 18:38:48.310 * Starting BGSAVE for SYNC with target: disk
8829:M 26 Jul 18:38:48.311 * Background saving started by pid 9090
9090:C 26 Jul 18:38:48.314 * DB saved on disk
9090:C 26 Jul 18:38:48.314 * RDB: 4 MB of memory used by copy-on-write
8829:M 26 Jul 18:38:48.398 * Background saving terminated with success
8829:M 26 Jul 18:38:48.398 * Synchronization with slave 10.10.100.76:38001 succeeded

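Summary:

1. After the master and slave instances start, each slave requests one full resync from the master.

2. On startup each Sentinel generates a unique runid and starts monitoring the master. It learns the master's slaves from the master's INFO output, and finds the other Sentinels monitoring the same master through Pub/Sub (the __sentinel__:hello channel).

3. After the master goes down, each Sentinel first marks it sdown (subjectively down) on its own once down-after-milliseconds has elapsed. When at least quorum Sentinels agree, the master is marked odown (objectively down), and the Sentinels elect a leader to carry out the failover.

(Note: the Sentinel that first detects sdown is not necessarily the one that performs the failover.)

4. The leader promotes one of the slaves to master.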
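
The sdown, odown, and election steps in the summary can be sketched as two checks. This is a simplified model of the documented Sentinel behavior (odown needs at least quorum Sentinels reporting the master down; the leader needs votes from a majority of the Sentinels and at least quorum votes). It is not Sentinel's actual code, and the function names are made up for illustration.

```python
def is_odown(sentinels_reporting_down, quorum):
    """The master is objectively down once enough Sentinels report it down."""
    return sentinels_reporting_down >= quorum


def wins_election(votes, total_sentinels, quorum):
    """A candidate leads the failover only with a majority AND at least quorum votes."""
    majority = total_sentinels // 2 + 1
    return votes >= max(majority, quorum)


TOTAL_SENTINELS = 3  # 39001, 39002, 39003
QUORUM = 2           # from "sentinel monitor master1 10.10.100.76 38001 2"

# 39003 logged "+odown ... #quorum 2/2": two reports were enough.
print("odown with 2 reports:", is_odown(2, QUORUM))

# 39003 voted for itself and received the votes of 39001 and 39002.
print("39003 elected with 3 votes:", wins_election(3, TOTAL_SENTINELS, QUORUM))

# A single Sentinel on its own can neither declare odown nor win the election.
print("odown with 1 report:", is_odown(1, QUORUM))
print("elected with 1 vote:", wins_election(1, TOTAL_SENTINELS, QUORUM))
```

The majority requirement is why an odd number of Sentinels (three here) is the usual choice: the election can still succeed when one Sentinel is unavailable.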

The log of Sentinel 39003 is annotated below:

8875:X 26 Jul 17:46:22.721 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
--Generate the Sentinel's runid
8875:X 26 Jul 17:46:22.721 # Sentinel runid is 5c24343d83dd1e0da6e1e511dc5dd690ee804065
--Start monitoring the master named master1; host, port, and quorum all come from the configuration file
8875:X 26 Jul 17:46:22.721 # +monitor master master1 10.10.100.76 38001 quorum 2
--Discover and register the master's slaves (the Sentinel learns them from the master's INFO output)
8875:X 26 Jul 17:46:23.722 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
8875:X 26 Jul 17:46:23.731 * +slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001

--Discover and register the other Sentinels monitoring this master, via Pub/Sub on the master
8875:X 26 Jul 17:46:24.110 * +sentinel sentinel 10.10.100.76:39001 10.10.100.76 39001 @ master1 10.10.100.76 38001
8875:X 26 Jul 17:46:25.754 * +sentinel sentinel 10.10.100.76:39002 10.10.100.76 39002 @ master1 10.10.100.76 38001
--Enter the subjectively down state: this Sentinel considers the master sdown
8875:X 26 Jul 18:00:47.684 # +sdown master master1 10.10.100.76 38001
--Enter the objectively down state: enough Sentinels (quorum 2) consider the master down
8875:X 26 Jul 18:00:47.755 # +odown master master1 10.10.100.76 38001 #quorum 2/2
--Move to a new epoch
8875:X 26 Jul 18:00:47.755 # +new-epoch 1
--Attempt a failover of this master
8875:X 26 Jul 18:00:47.755 # +try-failover master master1 10.10.100.76 38001
--Start the leader election; this Sentinel votes for itself (the runid in the vote is its own runid, not a slave's)
8875:X 26 Jul 18:00:47.758 # +vote-for-leader 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
--39001 casts its vote
8875:X 26 Jul 18:00:47.761 # 10.10.100.76:39001 voted for 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
--39002 casts its vote
8875:X 26 Jul 18:00:47.761 # 10.10.100.76:39002 voted for 5c24343d83dd1e0da6e1e511dc5dd690ee804065 1
--Won the election for this epoch; the failover can now proceed
8875:X 26 Jul 18:00:47.859 # +elected-leader master master1 10.10.100.76 38001
--The failover is in the select-slave state: the Sentinel is looking for a slave suitable for promotion
8875:X 26 Jul 18:00:47.859 # +failover-state-select-slave master master1 10.10.100.76 38001
--The Sentinel found a slave suitable for promotion
8875:X 26 Jul 18:00:47.911 # +selected-slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
--Send SLAVEOF NO ONE to the selected slave to turn it into a master
8875:X 26 Jul 18:00:47.911 * +failover-state-send-slaveof-noone slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
--Waiting for the promotion to take effect
8875:X 26 Jul 18:00:47.995 * +failover-state-wait-promotion slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
--The slave has been promoted
8875:X 26 Jul 18:00:48.053 # +promoted-slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
--The failover moves to the reconf-slaves state (the remaining slaves are reconfigured to follow the new master)
8875:X 26 Jul 18:00:48.053 # +failover-state-reconf-slaves master master1 10.10.100.76 38001
--SLAVEOF was sent to the instance to point it at the new master
8875:X 26 Jul 18:00:48.108 * +slave-reconf-sent slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
--Leave the objectively down state
8875:X 26 Jul 18:00:48.881 # -odown master master1 10.10.100.76 38001
--The slave is being reconfigured, but its synchronization with the new master is not yet complete
8875:X 26 Jul 18:00:49.143 * +slave-reconf-inprog slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
--The slave has finished synchronizing with the new master
8875:X 26 Jul 18:00:49.143 * +slave-reconf-done slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38001
--The failover completed successfully; all slaves have been reconfigured to replicate from the new master
8875:X 26 Jul 18:00:49.208 # +failover-end master master1 10.10.100.76 38001
--Switch the master from 38001 to 38003
8875:X 26 Jul 18:00:49.208 # +switch-master master1 10.10.100.76 38001 10.10.100.76 38003
--Register the slave under the new master
8875:X 26 Jul 18:00:49.209 * +slave slave 10.10.100.76:38002 10.10.100.76 38002 @ master1 10.10.100.76 38003
--Register the old master as a slave of the new master
8875:X 26 Jul 18:00:49.209 * +slave slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003
--38001 is still down, so 30 seconds later it is marked subjectively down
8875:X 26 Jul 18:01:19.294 # +sdown slave 10.10.100.76:38001 10.10.100.76 38001 @ master1 10.10.100.76 38003
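
To see how long each phase of the failover took, the annotated log can be reduced to a timeline. This is a minimal sketch that assumes the Redis 3.0 log layout shown above; the regular expression and the timeline helper are hypothetical, and the sample lines are a hand-picked subset of the 39003 log.

```python
import re
from datetime import datetime

LOG = """\
8875:X 26 Jul 18:00:47.684 # +sdown master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.755 # +try-failover master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:47.859 # +elected-leader master master1 10.10.100.76 38001
8875:X 26 Jul 18:00:48.053 # +promoted-slave slave 10.10.100.76:38003 10.10.100.76 38003 @ master1 10.10.100.76 38001
8875:X 26 Jul 18:00:49.208 # +failover-end master master1 10.10.100.76 38001
"""

# Assumed layout: <pid>:X <day> <month> <time> <#|*> <event> ...
LINE_RE = re.compile(r"^\d+:X \d+ \w+ ([\d:.]+) [#*] ([+-][\w-]+)")


def timeline(log_text):
    """Return (event, seconds since the first event) for each Sentinel event line."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            events.append((m.group(2), when))
    start = events[0][1]
    return [(name, round((when - start).total_seconds(), 3)) for name, when in events]


steps = timeline(LOG)
for name, offset in steps:
    print(f"{offset:6.3f}s  {name}")
```

In this run the whole failover, from the leader's own +sdown to +failover-end, took about 1.5 seconds; most of the outage seen by clients is therefore the 30-second detection delay, not the failover itself.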


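Notes:

Password authentication was not enabled for this test. If it is required, set the password in both the Redis and the Sentinel configuration. Use the same password on every Redis instance, since any of them may become the master:

masterauth <master-password>

requirepass <master-password>

sentinel auth-pass master1 <master-password>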


Source: WeChat official account 【刍荛采葑菲】. Please credit the source when reposting.