MySQL高可用方案 MHA之四 keepalived 半同步复制

主从架构（开启5.7的增强半同步模式）
master:
10.150.20.90 　　 ed3jrdba90
slave:
10.150.20.97 　　 ed3jrdba97
10.150.20.132 　　 ed3jrdba132
manager:
10.150.20.95 　　 ed3jrdba95
vip:10.150.20.200

MySQL:5.7

os:CentOS 7.3
网卡名：ens3

这里，ed3jrdba90、ed3jrdba97部署keepalived,vip绑定在ed3jrdba90，并且ed3jrdba90节点为MHA master主库，ed3jrdba97为candidate_master。

1:配置文件candidate_masterkeepalived.conf如下：
# cat /etc/keepalived/keepalived.conf

! Configuration File for keepalived

global_defs {
notification_email {
eric@gmail.cn
}
notification_email_from eric@gmail.cn
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id mysql_mha
}
vrrp_instance VI_1 {
state master
interface ens3
nopreempt
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
10.150.20.200
}
}

virtual_server 10.150.20.200 33061 {
delay_loop 6
lb_algo wrr
lb_kind DR
nat_mask 255.255.255.0
persistence_timeout 50
protocol TCP
real_server 10.150.20.90 33061 {
weight 1
notify_down /etc/keepalived/shutdown_keepalived.sh
TCP_CHECK {
connect_timeout 10
nb_get_retry 3
connect_port 33061
}
}
}

# cat /etc/keepalived/shutdown_keepalived.sh

systemctl stop keepalived
检查vip绑定在ed3jrdba90的ens3:1
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 54:52:00:01:61:2b brd ff:ff:ff:ff:ff:ff
inet 10.150.20.90/24 brd 10.150.20.255 scope global ens3
valid_lft forever preferred_lft forever
inet 10.150.20.200/32 scope global ens3
valid_lft forever preferred_lft forever

ed3jrdba95节点：
MHA manager配置文件
# vi /etc/mysql_mha/app1.cnf
[server default]
manager_workdir=/data/mysql_mha/app1
manager_log=/data/mysql_mha/app1-manager.log

master_binlog_dir=/data/mysql_33061/logs

user=mha_monitor
password=mha_monitor
ping_interval=5
remote_workdir=/data/mysql_mha/app1

master_ip_failover_script= /usr/local/bin/master_ip_failover
#master_ip_online_change_script= /usr/local/bin/master_ip_online_change

secondary_check_script=/usr/local/bin/masterha_secondary_check -s 10.150.20.97 -s 10.150.20.90

repl_user=replicator
repl_password=replicator

report_script=/usr/local/bin/send_report
shutdown_script=""
ssh_user=root

[server1]
hostname=10.150.20.90
port=33061

[server2]
hostname=10.150.20.97
port=33061
candidate_master=1
check_repl_delay=0

[server3]
hostname=10.150.20.132
port=33061

master_ip_failover脚本:

# cat /usr/local/bin/master_ip_failover

#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Getopt::Long;
my (
$command,$ssh_user,$orig_master_host,$orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip,$new_master_port
);


my $vip = '10.150.20.200';
my $ssh_start_vip = "systemctl start keepalived";
my $ssh_stop_vip = "systemctl stop keepalived";

GetOptions(
'command=s' => $command,
'ssh_user=s' => $ssh_user,
'orig_master_host=s' => $orig_master_host,
'orig_master_ip=s'=>$orig_master_ip,
'orig_master_port=i'=>$orig_master_port,
'new_master_host=s'=>$new_master_host,
'new_master_ip=s'=>$new_master_ip,
'new_master_port=i'=> $new_master_port,
);
exit &main();
sub main {
print "

IN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===

";
if ( $command eq "stop" || $command eq "stopssh" ) {
my $exit_code = 1;
eval {
print "Disabling the VIP on old master: $orig_master_host 
";
&stop_vip();
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@
";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {


my $exit_code = 10;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host 
";
&start_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK 
";
#`ssh $ssh_user@cluster1 " $ssh_start_vip "`;
exit 0;
}
else {
&usage();
exit 1;
}
}


# A simple system call that enable the VIP on the new master
sub start_vip() {
`ssh $ssh_user@$new_master_host " $ssh_start_vip "`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
return 0 unless ($ssh_user);
`ssh $ssh_user@$orig_master_host " $ssh_stop_vip "`;
}
sub usage {
print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port
";
}

master_ip_failover

检测mha复制环境：

# masterha_check_repl --conf=/etc/mysql_mha/app1.cnf

Thu Dec 13 15:53:37 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Thu Dec 13 15:53:37 2018 - [info] Reading application default configuration from /etc/mysql_mha/app1.cnf..
Thu Dec 13 15:53:37 2018 - [info] Reading server configuration from /etc/mysql_mha/app1.cnf..
Thu Dec 13 15:53:37 2018 - [info] MHA::MasterMonitor version 0.58.
Thu Dec 13 15:53:39 2018 - [info] GTID failover mode = 0
Thu Dec 13 15:53:39 2018 - [info] Dead Servers:
Thu Dec 13 15:53:39 2018 - [info] Alive Servers:
Thu Dec 13 15:53:39 2018 - [info] 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 15:53:39 2018 - [info] 10.150.20.97(10.150.20.97:33061)
Thu Dec 13 15:53:39 2018 - [info] 10.150.20.132(10.150.20.132:33061)
Thu Dec 13 15:53:39 2018 - [info] Alive Slaves:
Thu Dec 13 15:53:39 2018 - [info] 10.150.20.97(10.150.20.97:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 15:53:39 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 15:53:39 2018 - [info] 10.150.20.132(10.150.20.132:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 15:53:39 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 15:53:39 2018 - [info] Current Alive Master: 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 15:53:39 2018 - [info] Checking slave configurations..
Thu Dec 13 15:53:39 2018 - [info] read_only=1 is not set on slave 10.150.20.97(10.150.20.97:33061).
Thu Dec 13 15:53:39 2018 - [warning] relay_log_purge=0 is not set on slave 10.150.20.97(10.150.20.97:33061).
Thu Dec 13 15:53:39 2018 - [info] read_only=1 is not set on slave 10.150.20.132(10.150.20.132:33061).
Thu Dec 13 15:53:39 2018 - [info] Checking replication filtering settings..
Thu Dec 13 15:53:39 2018 - [info] binlog_do_db= , binlog_ignore_db= 
Thu Dec 13 15:53:39 2018 - [info] Replication filtering check ok.
Thu Dec 13 15:53:39 2018 - [info] GTID (with auto-pos) is not supported
Thu Dec 13 15:53:39 2018 - [info] Starting SSH connection tests..
Thu Dec 13 15:53:41 2018 - [info] All SSH connection tests passed successfully.
Thu Dec 13 15:53:41 2018 - [info] Checking MHA Node version..
Thu Dec 13 15:53:42 2018 - [info] Version check ok.
Thu Dec 13 15:53:42 2018 - [info] Checking SSH publickey authentication settings on the current master..
Thu Dec 13 15:53:42 2018 - [info] HealthCheck: SSH to 10.150.20.90 is reachable.
Thu Dec 13 15:53:42 2018 - [info] Master MHA Node version is 0.58.
Thu Dec 13 15:53:42 2018 - [info] Checking recovery script configurations on 10.150.20.90(10.150.20.90:33061)..
Thu Dec 13 15:53:42 2018 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql_33061/logs --output_file=/data/mysql_mha/app1/save_binary_logs_test --manager_version=0.58 --start_file=mysql-bin.000009 
Thu Dec 13 15:53:42 2018 - [info] Connecting to root@10.150.20.90(10.150.20.90:22).. 
Creating /data/mysql_mha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data/mysql_33061/logs, up to mysql-bin.000009
Thu Dec 13 15:53:42 2018 - [info] Binlog setting check done.
Thu Dec 13 15:53:42 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Dec 13 15:53:42 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha_monitor' --slave_host=10.150.20.97 --slave_ip=10.150.20.97 --slave_port=33061 --workdir=/data/mysql_mha/app1 --target_version=5.7.21-log --manager_version=0.58 --relay_log_info=/data/mysql_33061/logs/relay-log.info --relay_dir=/data/mysql_33061/data/ --slave_pass=xxx
Thu Dec 13 15:53:42 2018 - [info] Connecting to root@10.150.20.97(10.150.20.97:22).. 
Checking slave recovery environment settings..
Opening /data/mysql_33061/logs/relay-log.info ... ok.
Relay log found at /data/mysql_33061/logs, up to relaylog.000002
Temporary relay log file is /data/mysql_33061/logs/relaylog.000002
Checking if super_read_only is defined and turned on.. not present or turned off, ignoring.
Testing mysql connection and privileges..
mysql: [Warning] Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Dec 13 15:53:43 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha_monitor' --slave_host=10.150.20.132 --slave_ip=10.150.20.132 --slave_port=33061 --workdir=/data/mysql_mha/app1 --target_version=5.7.21-log --manager_version=0.58 --relay_log_info=/data/mysql_33061/logs/relay-log.info --relay_dir=/data/mysql_33061/data/ --slave_pass=xxx
Thu Dec 13 15:53:43 2018 - [info] Connecting to root@10.150.20.132(10.150.20.132:22).. 
Checking slave recovery environment settings..
Opening /data/mysql_33061/logs/relay-log.info ... ok.
Relay log found at /data/mysql_33061/data, up to cgdb-relay-bin.000002
Temporary relay log file is /data/mysql_33061/data/cgdb-relay-bin.000002
Checking if super_read_only is defined and turned on.. not present or turned off, ignoring.
Testing mysql connection and privileges..
mysql: [Warning] Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Dec 13 15:53:43 2018 - [info] Slaves settings check done.
Thu Dec 13 15:53:43 2018 - [info] 
10.150.20.90(10.150.20.90:33061) (current master)
+--10.150.20.97(10.150.20.97:33061)
+--10.150.20.132(10.150.20.132:33061)

Thu Dec 13 15:53:43 2018 - [info] Checking replication health on 10.150.20.97..
Thu Dec 13 15:53:43 2018 - [info] ok.
Thu Dec 13 15:53:43 2018 - [info] Checking replication health on 10.150.20.132..
Thu Dec 13 15:53:43 2018 - [info] ok.
Thu Dec 13 15:53:43 2018 - [info] Checking master_ip_failover_script status:
Thu Dec 13 15:53:43 2018 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=10.150.20.90 --orig_master_ip=10.150.20.90 --orig_master_port=33061


IN SCRIPT TEST====systemctl stop keepalived==systemctl start keepalived===

Checking the Status of the script.. OK 
Thu Dec 13 15:53:43 2018 - [info] OK.
Thu Dec 13 15:53:43 2018 - [warning] shutdown_script is not defined.
Thu Dec 13 15:53:43 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

输出

检测MHA manager是否开启：

# masterha_check_status --conf=/etc/mysql_mha/app1.cnf
app1 (pid:19724) is running(0:PING_OK), master:10.150.20.90

开启日志监控
# tail -f /data/mysql_mha/app1-manager.log

故障切换测试：

节点ed3jrdba90：关闭主库MySQL:
# ps -ef |grep mysql
root 18908 1 0 12月12 ? 00:00:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --datadir=/data/mysql_33061/data --pid-file=/data/mysql_33061/run/mysql.pid
mysql 20401 18908 0 12月12 ? 00:03:02 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/data/mysql_33061/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=/data/mysql_33061/logs/mysqlerror.log --open-files-limit=65535 --pid-file=/data/mysql_33061/run/mysql.pid --socket=/data/mysql_33061/run/mysql.sock --port=33061
root 31104 29976 0 15:57 pts/0 00:00:00 grep --color=auto mysql
# kill -9 20401 18908

节点ed3jrdba97：主库MySQL宕掉之后，vip飘到节点ed3jrdba97

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 54:52:00:49:48:92 brd ff:ff:ff:ff:ff:ff
inet 10.150.20.97/24 brd 10.150.20.255 scope global ens3
valid_lft forever preferred_lft forever
inet 10.150.20.200/32 scope global ens3
valid_lft forever preferred_lft forever

节点ed3jrdba95：在MHA manager节点的日志显示了MHA切换过程

Thu Dec 13 16:37:54 2018 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Thu Dec 13 16:37:54 2018 - [info] Executing secondary network check script: /usr/local/bin/masterha_secondary_check -s 10.150.20.97 -s 10.150.20.90 --user=root --master_host=10.150.20.90 --master_ip=10.150.20.90 --master_port=33061 --master_user=mha_monitor --master_password=mha_monitor --ping_type=SELECT
Thu Dec 13 16:37:54 2018 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql_33061/logs --output_file=/data/mysql_mha/app1/save_binary_logs_test --manager_version=0.58 --binlog_prefix=mysql-bin
Thu Dec 13 16:37:55 2018 - [info] HealthCheck: SSH to 10.150.20.90 is reachable.
Monitoring server 10.150.20.97 is reachable, Master is not reachable from 10.150.20.97. OK.
Monitoring server 10.150.20.90 is reachable, Master is not reachable from 10.150.20.90. OK.
Thu Dec 13 16:37:55 2018 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Thu Dec 13 16:37:59 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.150.20.90' (111))
Thu Dec 13 16:37:59 2018 - [warning] Connection failed 2 time(s)..
Thu Dec 13 16:38:04 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.150.20.90' (111))
Thu Dec 13 16:38:04 2018 - [warning] Connection failed 3 time(s)..
Thu Dec 13 16:38:09 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.150.20.90' (111))
Thu Dec 13 16:38:09 2018 - [warning] Connection failed 4 time(s)..
Thu Dec 13 16:38:09 2018 - [warning] Master is not reachable from health checker!
Thu Dec 13 16:38:09 2018 - [warning] Master 10.150.20.90(10.150.20.90:33061) is not reachable!
Thu Dec 13 16:38:09 2018 - [warning] SSH is reachable.
Thu Dec 13 16:38:09 2018 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/mysql_mha/app1.cnf again, and trying to connect to all servers to check server status..
Thu Dec 13 16:38:09 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Thu Dec 13 16:38:09 2018 - [info] Reading application default configuration from /etc/mysql_mha/app1.cnf..
Thu Dec 13 16:38:09 2018 - [info] Reading server configuration from /etc/mysql_mha/app1.cnf..
Thu Dec 13 16:38:10 2018 - [info] GTID failover mode = 0
Thu Dec 13 16:38:10 2018 - [info] Dead Servers:
Thu Dec 13 16:38:10 2018 - [info] 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:10 2018 - [info] Alive Servers:
Thu Dec 13 16:38:10 2018 - [info] 10.150.20.97(10.150.20.97:33061)
Thu Dec 13 16:38:10 2018 - [info] 10.150.20.132(10.150.20.132:33061)
Thu Dec 13 16:38:10 2018 - [info] Alive Slaves:
Thu Dec 13 16:38:10 2018 - [info] 10.150.20.97(10.150.20.97:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:10 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:10 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Dec 13 16:38:10 2018 - [info] 10.150.20.132(10.150.20.132:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:10 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:10 2018 - [info] Checking slave configurations..
Thu Dec 13 16:38:10 2018 - [info] read_only=1 is not set on slave 10.150.20.97(10.150.20.97:33061).
Thu Dec 13 16:38:10 2018 - [warning] relay_log_purge=0 is not set on slave 10.150.20.97(10.150.20.97:33061).
Thu Dec 13 16:38:10 2018 - [info] read_only=1 is not set on slave 10.150.20.132(10.150.20.132:33061).
Thu Dec 13 16:38:10 2018 - [info] Checking replication filtering settings..
Thu Dec 13 16:38:10 2018 - [info] Replication filtering check ok.
Thu Dec 13 16:38:10 2018 - [info] Master is down!
Thu Dec 13 16:38:10 2018 - [info] Terminating monitoring script.
Thu Dec 13 16:38:10 2018 - [info] Got exit code 20 (Master dead).
Thu Dec 13 16:38:10 2018 - [info] MHA::MasterFailover version 0.58.
Thu Dec 13 16:38:10 2018 - [info] Starting master failover.
Thu Dec 13 16:38:10 2018 - [info] 
Thu Dec 13 16:38:10 2018 - [info] * Phase 1: Configuration Check Phase..
Thu Dec 13 16:38:10 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] GTID failover mode = 0
Thu Dec 13 16:38:12 2018 - [info] Dead Servers:
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:12 2018 - [info] Checking master reachability via MySQL(double check)...
Thu Dec 13 16:38:12 2018 - [info] ok.
Thu Dec 13 16:38:12 2018 - [info] Alive Servers:
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.97(10.150.20.97:33061)
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.132(10.150.20.132:33061)
Thu Dec 13 16:38:12 2018 - [info] Alive Slaves:
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.97(10.150.20.97:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:12 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:12 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.132(10.150.20.132:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:12 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:12 2018 - [info] Starting Non-GTID based failover.
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Dec 13 16:38:12 2018 - [info] Executing master IP deactivation script:
Thu Dec 13 16:38:12 2018 - [info] /usr/local/bin/master_ip_failover --orig_master_host=10.150.20.90 --orig_master_ip=10.150.20.90 --orig_master_port=33061 --command=stopssh --ssh_user=root


IN SCRIPT TEST====systemctl stop keepalived==systemctl start keepalived===

Disabling the VIP on old master: 10.150.20.90 
Thu Dec 13 16:38:12 2018 - [info] done.
Thu Dec 13 16:38:12 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Dec 13 16:38:12 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] * Phase 3: Master Recovery Phase..
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] The latest binary log file/position on all slaves is mysql-bin.000011:154
Thu Dec 13 16:38:12 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.97(10.150.20.97:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:12 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:12 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.132(10.150.20.132:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:12 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:12 2018 - [info] The oldest binary log file/position on all slaves is mysql-bin.000011:154
Thu Dec 13 16:38:12 2018 - [info] Oldest slaves:
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.97(10.150.20.97:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:12 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:12 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Dec 13 16:38:12 2018 - [info] 10.150.20.132(10.150.20.132:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:12 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Dec 13 16:38:12 2018 - [info] 
Thu Dec 13 16:38:12 2018 - [info] Fetching dead master's binary logs..
Thu Dec 13 16:38:12 2018 - [info] Executing command on the dead master 10.150.20.90(10.150.20.90:33061): save_binary_logs --command=save --start_file=mysql-bin.000011 --start_pos=154 --binlog_dir=/data/mysql_33061/logs --output_file=/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.58
Creating /data/mysql_mha/app1 if not exists.. ok.
Concat binary/relay logs from mysql-bin.000011 pos 154 to mysql-bin.000011 EOF into /data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
Dumping effective binlog data from /data/mysql_33061/logs/mysql-bin.000011 position 154 to tail(177).. ok.
Binlog Checksum enabled
Concat succeeded.
Thu Dec 13 16:38:12 2018 - [info] scp from root@10.150.20.90:/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog to local:/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog succeeded.
Thu Dec 13 16:38:13 2018 - [info] HealthCheck: SSH to 10.150.20.97 is reachable.
Thu Dec 13 16:38:13 2018 - [info] HealthCheck: SSH to 10.150.20.132 is reachable.
Thu Dec 13 16:38:13 2018 - [info] 
Thu Dec 13 16:38:13 2018 - [info] * Phase 3.3: Determining New Master Phase..
Thu Dec 13 16:38:13 2018 - [info] 
Thu Dec 13 16:38:13 2018 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Dec 13 16:38:13 2018 - [info] All slaves received relay logs to the same position. No need to resync each other.
Thu Dec 13 16:38:13 2018 - [info] Searching new master from slaves..
Thu Dec 13 16:38:13 2018 - [info] Candidate masters from the configuration file:
Thu Dec 13 16:38:13 2018 - [info] 10.150.20.97(10.150.20.97:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:38:13 2018 - [info] Replicating from 10.150.20.90(10.150.20.90:33061)
Thu Dec 13 16:38:13 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Dec 13 16:38:13 2018 - [info] Non-candidate masters:
Thu Dec 13 16:38:13 2018 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Dec 13 16:38:13 2018 - [info] New master is 10.150.20.97(10.150.20.97:33061)
Thu Dec 13 16:38:13 2018 - [info] Starting master failover..
Thu Dec 13 16:38:13 2018 - [info] 
From:
10.150.20.90(10.150.20.90:33061) (current master)
+--10.150.20.97(10.150.20.97:33061)
+--10.150.20.132(10.150.20.132:33061)

To:
10.150.20.97(10.150.20.97:33061) (new master)
+--10.150.20.132(10.150.20.132:33061)
Thu Dec 13 16:38:13 2018 - [info] 
Thu Dec 13 16:38:13 2018 - [info] * Phase 3.4: New Master Diff Log Generation Phase..
Thu Dec 13 16:38:13 2018 - [info] 
Thu Dec 13 16:38:13 2018 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Dec 13 16:38:13 2018 - [info] Sending binlog..
Thu Dec 13 16:38:14 2018 - [info] scp from local:/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog to root@10.150.20.97:/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog succeeded.
Thu Dec 13 16:38:14 2018 - [info] 
Thu Dec 13 16:38:14 2018 - [info] * Phase 3.5: Master Log Apply Phase..
Thu Dec 13 16:38:14 2018 - [info] 
Thu Dec 13 16:38:14 2018 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Dec 13 16:38:14 2018 - [info] Starting recovery on 10.150.20.97(10.150.20.97:33061)..
Thu Dec 13 16:38:14 2018 - [info] Generating diffs succeeded.
Thu Dec 13 16:38:14 2018 - [info] Waiting until all relay logs are applied.
Thu Dec 13 16:38:14 2018 - [info] done.
Thu Dec 13 16:38:14 2018 - [info] Getting slave status..
Thu Dec 13 16:38:14 2018 - [info] This slave(10.150.20.97)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000011:154). No need to recover from Exec_Master_Log_Pos.
Thu Dec 13 16:38:14 2018 - [info] Connecting to the target slave host 10.150.20.97, running recover script..
Thu Dec 13 16:38:14 2018 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha_monitor' --slave_host=10.150.20.97 --slave_ip=10.150.20.97 --slave_port=33061 --apply_files=/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog --workdir=/data/mysql_mha/app1 --target_version=5.7.21-log --timestamp=20181213163810 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.58 --slave_pass=xxx
Thu Dec 13 16:38:14 2018 - [info] 
MySQL client version is 5.7.21. Using --binary-mode.
Applying differential binary/relay log files /data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog on 10.150.20.97:33061. This may take long time...
Applying log files succeeded.
Thu Dec 13 16:38:14 2018 - [info] All relay logs were successfully applied.
Thu Dec 13 16:38:14 2018 - [info] Getting new master's binlog name and position..
Thu Dec 13 16:38:14 2018 - [info] mysql-bin.000018:154
Thu Dec 13 16:38:14 2018 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.150.20.97', MASTER_PORT=33061, MASTER_LOG_FILE='mysql-bin.000018', MASTER_LOG_POS=154, MASTER_USER='replicator', MASTER_PASSWORD='xxx';
Thu Dec 13 16:38:14 2018 - [info] Executing master IP activate script:
Thu Dec 13 16:38:14 2018 - [info] /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=10.150.20.90 --orig_master_ip=10.150.20.90 --orig_master_port=33061 --new_master_host=10.150.20.97 --new_master_ip=10.150.20.97 --new_master_port=33061 --new_master_user='mha_monitor' --new_master_password=xxx
Unknown option: new_master_user
Unknown option: new_master_password


IN SCRIPT TEST====systemctl stop keepalived==systemctl start keepalived===

Enabling the VIP - 10.150.20.200 on the new master - 10.150.20.97 
Thu Dec 13 16:38:14 2018 - [info] OK.
Thu Dec 13 16:38:14 2018 - [info] ** Finished master recovery successfully.
Thu Dec 13 16:38:14 2018 - [info] * Phase 3: Master Recovery Phase completed.
Thu Dec 13 16:38:14 2018 - [info] 
Thu Dec 13 16:38:14 2018 - [info] * Phase 4: Slaves Recovery Phase..
Thu Dec 13 16:38:14 2018 - [info] 
Thu Dec 13 16:38:14 2018 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Dec 13 16:38:14 2018 - [info] 
Thu Dec 13 16:38:14 2018 - [info] -- Slave diff file generation on host 10.150.20.132(10.150.20.132:33061) started, pid: 21368. Check tmp log /data/mysql_mha/app1/10.150.20.132_33061_20181213163810.log if it takes time..
Thu Dec 13 16:38:15 2018 - [info] 
Thu Dec 13 16:38:15 2018 - [info] Log messages from 10.150.20.132 ...
Thu Dec 13 16:38:15 2018 - [info] 
Thu Dec 13 16:38:14 2018 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Dec 13 16:38:15 2018 - [info] End of log messages from 10.150.20.132.
Thu Dec 13 16:38:15 2018 - [info] -- 10.150.20.132(10.150.20.132:33061) has the latest relay log events.
Thu Dec 13 16:38:15 2018 - [info] Generating relay diff files from the latest slave succeeded.
Thu Dec 13 16:38:15 2018 - [info] 
Thu Dec 13 16:38:15 2018 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Dec 13 16:38:15 2018 - [info] 
Thu Dec 13 16:38:15 2018 - [info] -- Slave recovery on host 10.150.20.132(10.150.20.132:33061) started, pid: 21370. Check tmp log /data/mysql_mha/app1/10.150.20.132_33061_20181213163810.log if it takes time..
Thu Dec 13 16:38:17 2018 - [info] 
Thu Dec 13 16:38:17 2018 - [info] Log messages from 10.150.20.132 ...
Thu Dec 13 16:38:17 2018 - [info] 
Thu Dec 13 16:38:15 2018 - [info] Sending binlog..
Thu Dec 13 16:38:16 2018 - [info] scp from local:/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog to root@10.150.20.132:/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog succeeded.
Thu Dec 13 16:38:16 2018 - [info] Starting recovery on 10.150.20.132(10.150.20.132:33061)..
Thu Dec 13 16:38:16 2018 - [info] Generating diffs succeeded.
Thu Dec 13 16:38:16 2018 - [info] Waiting until all relay logs are applied.
Thu Dec 13 16:38:16 2018 - [info] done.
Thu Dec 13 16:38:16 2018 - [info] Getting slave status..
Thu Dec 13 16:38:16 2018 - [info] This slave(10.150.20.132)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000011:154). No need to recover from Exec_Master_Log_Pos.
Thu Dec 13 16:38:16 2018 - [info] Connecting to the target slave host 10.150.20.132, running recover script..
Thu Dec 13 16:38:16 2018 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha_monitor' --slave_host=10.150.20.132 --slave_ip=10.150.20.132 --slave_port=33061 --apply_files=/data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog --workdir=/data/mysql_mha/app1 --target_version=5.7.21-log --timestamp=20181213163810 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.58 --slave_pass=xxx
Thu Dec 13 16:38:16 2018 - [info] 
MySQL client version is 5.7.21. Using --binary-mode.
Applying differential binary/relay log files /data/mysql_mha/app1/saved_master_binlog_from_10.150.20.90_33061_20181213163810.binlog on 10.150.20.132:33061. This may take long time...
Applying log files succeeded.
Thu Dec 13 16:38:16 2018 - [info] All relay logs were successfully applied.
Thu Dec 13 16:38:16 2018 - [info] Resetting slave 10.150.20.132(10.150.20.132:33061) and starting replication from the new master 10.150.20.97(10.150.20.97:33061)..
Thu Dec 13 16:38:16 2018 - [info] Executed CHANGE MASTER.
Thu Dec 13 16:38:16 2018 - [info] Slave started.
Thu Dec 13 16:38:17 2018 - [info] End of log messages from 10.150.20.132.
Thu Dec 13 16:38:17 2018 - [info] -- Slave recovery on host 10.150.20.132(10.150.20.132:33061) succeeded.
Thu Dec 13 16:38:17 2018 - [info] All new slave servers recovered successfully.
Thu Dec 13 16:38:17 2018 - [info] 
Thu Dec 13 16:38:17 2018 - [info] * Phase 5: New master cleanup phase..
Thu Dec 13 16:38:17 2018 - [info] 
Thu Dec 13 16:38:17 2018 - [info] Resetting slave info on the new master..
Thu Dec 13 16:38:17 2018 - [info] 10.150.20.97: Resetting slave info succeeded.
Thu Dec 13 16:38:17 2018 - [info] Master failover to 10.150.20.97(10.150.20.97:33061) completed successfully.
Thu Dec 13 16:38:17 2018 - [info] Deleted server1 entry from /etc/mysql_mha/app1.cnf .
Thu Dec 13 16:38:17 2018 - [info]

----- Failover Report -----

app1: MySQL Master failover 10.150.20.90(10.150.20.90:33061) to 10.150.20.97(10.150.20.97:33061) succeeded

Master 10.150.20.90(10.150.20.90:33061) is down!

Check MHA Manager logs at ed3jrdba95:/data/mysql_mha/app1-manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 10.150.20.90(10.150.20.90:33061)
The latest slave 10.150.20.97(10.150.20.97:33061) has all relay logs for recovery.
Selected 10.150.20.97(10.150.20.97:33061) as a new master.
10.150.20.97(10.150.20.97:33061): OK: Applying all logs succeeded.
10.150.20.97(10.150.20.97:33061): OK: Activated master IP address.
10.150.20.132(10.150.20.132:33061): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
10.150.20.132(10.150.20.132:33061): OK: Applying all logs succeeded. Slave started, replicating from 10.150.20.97(10.150.20.97:33061)
10.150.20.97(10.150.20.97:33061): Resetting slave info succeeded.
Master failover to 10.150.20.97(10.150.20.97:33061) completed successfully.

failover过程

MHA manager节点的masterha_manager已经停掉
# masterha_check_status --conf=/etc/mysql_mha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
app1.cnf文件被修改为：
# cat /etc/mysql_mha/app1.cnf
[server default]
manager_log=/data/mysql_mha/app1-manager.log
manager_workdir=/data/mysql_mha/app1
master_binlog_dir=/data/mysql_33061/logs
master_ip_online_change_script=/usr/local/bin/master_ip_online_change
password=mha_monitor
ping_interval=5
remote_workdir=/data/mysql_mha/app1
repl_password=replicator
repl_user=replicator
shutdown_script=""
ssh_user=root
user=mha_monitor

[server2]
hostname=10.150.20.97
port=33061

[server3]
hostname=10.150.20.132
port=33061

此时的复制关系为：
新主：10.150.20.97
从库：10.150.20.132

# masterha_check_repl --conf=/etc/mysql_mha/app1.cnf

Thu Dec 13 16:03:35 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Thu Dec 13 16:03:35 2018 - [info] Reading application default configuration from /etc/mysql_mha/app1.cnf..
Thu Dec 13 16:03:35 2018 - [info] Reading server configuration from /etc/mysql_mha/app1.cnf..
Thu Dec 13 16:03:35 2018 - [info] MHA::MasterMonitor version 0.58.
Thu Dec 13 16:03:36 2018 - [info] GTID failover mode = 0
Thu Dec 13 16:03:36 2018 - [info] Dead Servers:
Thu Dec 13 16:03:36 2018 - [info] Alive Servers:
Thu Dec 13 16:03:36 2018 - [info] 10.150.20.97(10.150.20.97:33061)
Thu Dec 13 16:03:36 2018 - [info] 10.150.20.132(10.150.20.132:33061)
Thu Dec 13 16:03:36 2018 - [info] Alive Slaves:
Thu Dec 13 16:03:36 2018 - [info] 10.150.20.132(10.150.20.132:33061) Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Thu Dec 13 16:03:36 2018 - [info] Replicating from 10.150.20.97(10.150.20.97:33061)
Thu Dec 13 16:03:36 2018 - [info] Current Alive Master: 10.150.20.97(10.150.20.97:33061)
Thu Dec 13 16:03:36 2018 - [info] Checking slave configurations..
Thu Dec 13 16:03:36 2018 - [info] read_only=1 is not set on slave 10.150.20.132(10.150.20.132:33061).
Thu Dec 13 16:03:36 2018 - [info] Checking replication filtering settings..
Thu Dec 13 16:03:36 2018 - [info] binlog_do_db= , binlog_ignore_db= 
Thu Dec 13 16:03:36 2018 - [info] Replication filtering check ok.
Thu Dec 13 16:03:36 2018 - [info] GTID (with auto-pos) is not supported
Thu Dec 13 16:03:36 2018 - [info] Starting SSH connection tests..
Thu Dec 13 16:03:37 2018 - [info] All SSH connection tests passed successfully.
Thu Dec 13 16:03:37 2018 - [info] Checking MHA Node version..
Thu Dec 13 16:03:38 2018 - [info] Version check ok.
Thu Dec 13 16:03:38 2018 - [info] Checking SSH publickey authentication settings on the current master..
Thu Dec 13 16:03:38 2018 - [info] HealthCheck: SSH to 10.150.20.97 is reachable.
Thu Dec 13 16:03:38 2018 - [info] Master MHA Node version is 0.58.
Thu Dec 13 16:03:38 2018 - [info] Checking recovery script configurations on 10.150.20.97(10.150.20.97:33061)..
Thu Dec 13 16:03:38 2018 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql_33061/logs --output_file=/data/mysql_mha/app1/save_binary_logs_test --manager_version=0.58 --start_file=mysql-bin.000015 
Thu Dec 13 16:03:38 2018 - [info] Connecting to root@10.150.20.97(10.150.20.97:22).. 
Creating /data/mysql_mha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data/mysql_33061/logs, up to mysql-bin.000015
Thu Dec 13 16:03:38 2018 - [info] Binlog setting check done.
Thu Dec 13 16:03:38 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Dec 13 16:03:38 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha_monitor' --slave_host=10.150.20.132 --slave_ip=10.150.20.132 --slave_port=33061 --workdir=/data/mysql_mha/app1 --target_version=5.7.21-log --manager_version=0.58 --relay_log_info=/data/mysql_33061/logs/relay-log.info --relay_dir=/data/mysql_33061/data/ --slave_pass=xxx
Thu Dec 13 16:03:38 2018 - [info] Connecting to root@10.150.20.132(10.150.20.132:22).. 
Checking slave recovery environment settings..
Opening /data/mysql_33061/logs/relay-log.info ... ok.
Relay log found at /data/mysql_33061/data, up to cgdb-relay-bin.000002
Temporary relay log file is /data/mysql_33061/data/cgdb-relay-bin.000002
Checking if super_read_only is defined and turned on.. not present or turned off, ignoring.
Testing mysql connection and privileges..
mysql: [Warning] Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Dec 13 16:03:39 2018 - [info] Slaves settings check done.
Thu Dec 13 16:03:39 2018 - [info] 
10.150.20.97(10.150.20.97:33061) (current master)
+--10.150.20.132(10.150.20.132:33061)

Thu Dec 13 16:03:39 2018 - [info] Checking replication health on 10.150.20.132..
Thu Dec 13 16:03:39 2018 - [info] ok.
Thu Dec 13 16:03:39 2018 - [info] Checking master_ip_failover_script status:
Thu Dec 13 16:03:39 2018 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=10.150.20.97 --orig_master_ip=10.150.20.97 --orig_master_port=33061


IN SCRIPT TEST====systemctl stop keepalived==systemctl start keepalived===

Checking the Status of the script.. OK 
Thu Dec 13 16:03:39 2018 - [info] OK.
Thu Dec 13 16:03:39 2018 - [warning] shutdown_script is not defined.
Thu Dec 13 16:03:39 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

主从发生变化

实战：https://yq.aliyun.com/articles/58920