GTID环境中手动修复主从故障一例(Error 1146)

 
Preface
 
    In my last test of pt-heartbeat,both of master and slave were out of disk.And the mysql client was hang.In order to resolve the issue,I've tryed to fix the replicaiton environment without using mysqldump to reconfigure the slave.Let's see the details.
 
Procedure
 
I dropped test tables in database "sysbench" to release the disk space on master.
 1 [root@zlm2 08:56:51 /data/mysql/mysql3306/logs]
 2 #sysbench oltp_read_write.lua --mysql-host=192.168.1.101 --mysql-port=3306 --mysql-user=zlm --mysql-password=zlmzlm --mysql-db=sysbench --tables=1 --table-size=10000000 --mysql-storage-engine=innodb cleanup
 3 sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
 4 
 5 Dropping table 'sbtest1'...
 6 
 7 (zlm@192.168.1.101 3306)[(none)]>use sysbench;
 8 Reading table information for completion of table and column names
 9 You can turn off this feature to get a quicker startup with -A
10 
11 Database changed
12 (zlm@192.168.1.101 3306)[sysbench]>show tables;
13 +--------------------+
14 | Tables_in_sysbench |
15 +--------------------+
16 | hb                 |
17 | sbtest2            |
18 | sbtest3            |
19 | sbtest4            |
20 | sbtest5            |
21 +--------------------+
22 5 rows in set (0.00 sec)
23 
24 //Only sbtest1 was deleted.It's not enough.
25 
26 [root@zlm2 08:59:04 ~/sysbench-1.0/src/lua]
27 #sysbench oltp_read_write.lua --mysql-host=192.168.1.101 --mysql-port=3306 --mysql-user=zlm --mysql-password=zlmzlm --mysql-db=sysbench --tables=5 --table-size=10000000 --mysql-storage-engine=innodb cleanup
28 sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2)
29 
30 Dropping table 'sbtest1'...
31 Dropping table 'sbtest2'...
32 Dropping table 'sbtest3'...
33 Dropping table 'sbtest4'...
34 Dropping table 'sbtest5'...
35 
36 [root@zlm2 08:59:09 ~/sysbench-1.0/src/lua]
37 #df -h
38 Filesystem               Size  Used Avail Use% Mounted on
39 /dev/mapper/centos-root  8.4G  6.9G  1.5G  83% / //I'd got 27% free space.
40 devtmpfs                 488M     0  488M   0% /dev
41 tmpfs                    497M     0  497M   0% /dev/shm
42 tmpfs                    497M  6.6M  491M   2% /run
43 tmpfs                    497M     0  497M   0% /sys/fs/cgroup
44 /dev/sda1                497M  118M  379M  24% /boot
45 none                      87G   80G  6.6G  93% /vagrant
46 
47 (zlm@192.168.1.101 3306)[(none)]>drop database sysbench;
48 Query OK, 1 row affected (0.04 sec)
49 
50 //Further more,I dropped the "sysbench".
The slave hung still and disk space was full.
 1 [root@zlm3 08:55:17 ~]
 2 #df -h
 3 Filesystem               Size  Used Avail Use% Mounted on
 4 /dev/mapper/centos-root  8.4G  8.4G   20K 100% /
 5 devtmpfs                 488M     0  488M   0% /dev
 6 tmpfs                    497M     0  497M   0% /dev/shm
 7 tmpfs                    497M  6.5M  491M   2% /run
 8 tmpfs                    497M     0  497M   0% /sys/fs/cgroup
 9 /dev/sda1                497M  118M  379M  24% /boot
10 none                      87G   80G  6.6G  93% /vagrant
11 
12 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
13 ^C^C -- query aborted
14 
15 
16 ^Z
17 [6]+  Stopped                 mysql
18 
19 [root@zlm3 08:55:58 ~]
20 #pkill mysqld
21 
22 [root@zlm3 08:56:04 ~]
23 #./mysqld.sh 
24 
25 [root@zlm3 08:56:07 ~]
26 #mysql
27 ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.1.102' (111)
28 
29 [root@zlm3 09:00:35 ~]
30 #cd /data/mysql/mysql3306/data
31 
32 [root@zlm3 09:00:46 /data/mysql/mysql3306/data]
33 #cat error.log |tail -n 30
34 2018-07-19T08:57:02.581937+01:00 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 17039436409
35 2018-07-19T08:57:02.581958+01:00 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 17039436418
36 2018-07-19T08:57:02.581963+01:00 0 [Note] InnoDB: Database was not shutdown normally!
37 2018-07-19T08:57:02.581965+01:00 0 [Note] InnoDB: Starting crash recovery.
38 2018-07-19T08:57:02.696292+01:00 0 [Note] InnoDB: Transaction 31700139 was in the XA prepared state.
39 2018-07-19T08:57:02.700688+01:00 0 [Note] InnoDB: Transaction 31700139 was in the XA prepared state.
40 2018-07-19T08:57:02.700814+01:00 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
41 2018-07-19T08:57:02.700821+01:00 0 [Note] InnoDB: Trx id counter is 31700480
42 2018-07-19T08:57:02.701719+01:00 0 [Note] InnoDB: Last MySQL binlog file position 0 99139927, file name mysql-bin.000016
43 2018-07-19T08:57:02.805965+01:00 0 [Note] InnoDB: Ignoring tablespace `zlm`.`sbtest2` because the DISCARD flag is set .
44 2018-07-19T08:57:02.806462+01:00 0 [Note] InnoDB: Creating shared tablespace for temporary tables
45 2018-07-19T08:57:02.807316+01:00 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
46 2018-07-19T08:57:02.807568+01:00 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
47 2018-07-19T08:57:02.807594+01:00 0 [Note] InnoDB: Rollback of non-prepared transactions completed
48 2018-07-19T08:57:02.871396+01:00 0 [Warning] InnoDB: Retry attempts for writing partial data failed.
49 2018-07-19T08:57:02.871423+01:00 0 [ERROR] InnoDB: Write to file ./ibtmp1failed at offset 11534336, 1048576 bytes should have been written, only 0 were written. Operating system error number 28. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
50 2018-07-19T08:57:02.871441+01:00 0 [ERROR] InnoDB: Error number 28 means 'No space left on device'
51 2018-07-19T08:57:02.871446+01:00 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
52 2018-07-19T08:57:02.871451+01:00 0 [ERROR] InnoDB: Could not set the file size of './ibtmp1'. Probably out of disk space
53 2018-07-19T08:57:02.871456+01:00 0 [ERROR] InnoDB: Unable to create the shared innodb_temporary
54 2018-07-19T08:57:02.871459+01:00 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
55 2018-07-19T08:57:03.273011+01:00 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
56 2018-07-19T08:57:03.273029+01:00 0 [ERROR] Plugin 'InnoDB' init function returned error.
57 2018-07-19T08:57:03.273033+01:00 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
58 2018-07-19T08:57:03.273037+01:00 0 [ERROR] Failed to initialize builtin plugins.
59 2018-07-19T08:57:03.273040+01:00 0 [ERROR] Aborting
60 
61 2018-07-19T08:57:03.273046+01:00 0 [Note] Binlog end
62 2018-07-19T08:57:03.273389+01:00 0 [Note] mysqld: Shutdown complete
63 
64 //The mysqld process could not run again because of no free disk space.
I decided to drop all the binlogs on slave to release the disk space.
 1 [root@zlm3 09:01:34 /data/mysql/mysql3306]
 2 #cd logs
 3 
 4 [root@zlm3 09:01:35 /data/mysql/mysql3306/logs]
 5 #ls -l
 6 total 2691436
 7 -rw-r----- 1 mysql mysql   4333745 Jul 13 11:07 mysql-bin.000001
 8 -rw-r----- 1 mysql mysql       653 Jul 16 09:28 mysql-bin.000002
 9 -rw-r----- 1 mysql mysql 268435663 Jul 16 10:03 mysql-bin.000003
10 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:22 mysql-bin.000004
11 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:32 mysql-bin.000005
12 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:48 mysql-bin.000006
13 -rw-r----- 1 mysql mysql 268435798 Jul 16 11:29 mysql-bin.000007
14 -rw-r----- 1 mysql mysql  35214498 Jul 16 11:59 mysql-bin.000008
15 -rw-r----- 1 mysql mysql       249 Jul 16 11:59 mysql-bin.000009
16 -rw-r----- 1 mysql mysql 199908536 Jul 18 11:27 mysql-bin.000010
17 -rw-r----- 1 mysql mysql       249 Jul 18 11:27 mysql-bin.000011
18 -rw-r----- 1 mysql mysql 268474318 Jul 19 04:20 mysql-bin.000012
19 -rw-r----- 1 mysql mysql 268817026 Jul 19 04:24 mysql-bin.000013
20 -rw-r----- 1 mysql mysql 268802986 Jul 19 04:28 mysql-bin.000014
21 -rw-r----- 1 mysql mysql 268826152 Jul 19 04:32 mysql-bin.000015
22 -rw-r----- 1 mysql mysql  99422208 Jul 19 08:56 mysql-bin.000016
23 -rw-r----- 1 mysql mysql       704 Jul 19 04:32 mysql-bin.index
24 
25 [root@zlm3 09:01:36 /data/mysql/mysql3306/logs]
26 #rm -f *
27 
28 [root@zlm3 09:01:38 /data/mysql/mysql3306/logs]
29 #ls -l
30 total 0
31 
32 [root@zlm3 09:01:48 ~]
33 #df -h
34 Filesystem               Size  Used Avail Use% Mounted on
35 /dev/mapper/centos-root  8.4G  4.5G  4.0G  53% / //The free disk space had been reduced to 47%.
36 devtmpfs                 488M     0  488M   0% /dev
37 tmpfs                    497M     0  497M   0% /dev/shm
38 tmpfs                    497M  6.5M  491M   2% /run
39 tmpfs                    497M     0  497M   0% /sys/fs/cgroup
40 /dev/sda1                497M  118M  379M  24% /boot
41 none                      87G   80G  6.6G  93% /vagrant
Ran the mysqld again and dropped the database "sysbench" on slave.
 1 [root@zlm3 09:01:42 /data/mysql/mysql3306/logs]
 2 #sh /root/mysqld.sh
 3 
 4 [root@zlm3 09:01:51 /data/mysql/mysql3306/logs]
 5 #ps aux|grep mysqld
 6 mysql     4346  7.0 17.8 1109984 181576 pts/0  Sl   09:01   0:00 mysqld --defaults-file=/data/mysql/mysql3306/my.cnf
 7 root      4380  0.0  0.0 112640   960 pts/0    R+   09:01   0:00 grep --color=auto mysqld
 8 
 9 [root@zlm3 09:01:55 /data/mysql/mysql3306/logs]
10 #mysql
11 Welcome to the MySQL monitor.  Commands end with ; or g.
12 Your MySQL connection id is 2
13 Server version: 5.7.21-log MySQL Community Server (GPL)
14 
15 Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
16 
17 Oracle is a registered trademark of Oracle Corporation and/or its
18 affiliates. Other names may be trademarks of their respective
19 owners.
20 
21 Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
22 
23 (zlm@192.168.1.102 3306)[(none)]>drop database sysbench;
24 Query OK, 6 rows affected (0.11 sec)

Started the replication threads of slave.

 1 (zlm@192.168.1.102 3306)[(none)]>start slave;
 2 Query OK, 0 rows affected (0.00 sec)
 3 
 4 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
 5 *************************** 1. row ***************************
 6                Slave_IO_State: Waiting for master to send event
 7                   Master_Host: 192.168.1.101
 8                   Master_User: repl
 9                   Master_Port: 3306
10                 Connect_Retry: 60
11               Master_Log_File: mysql-bin.000019
12           Read_Master_Log_Pos: 155765801
13                Relay_Log_File: relay-bin.000068
14                 Relay_Log_Pos: 398
15         Relay_Master_Log_File: mysql-bin.000019
16              Slave_IO_Running: Yes
17             Slave_SQL_Running: No
18               Replicate_Do_DB: 
19           Replicate_Ignore_DB: 
20            Replicate_Do_Table: 
21        Replicate_Ignore_Table: 
22       Replicate_Wild_Do_Table: 
23   Replicate_Wild_Ignore_Table: 
24                    Last_Errno: 1146
25                    Last_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist'
26                  Skip_Counter: 0
27           Exec_Master_Log_Pos: 99148246
28               Relay_Log_Space: 56618146
29               Until_Condition: None
30                Until_Log_File: 
31                 Until_Log_Pos: 0
32            Master_SSL_Allowed: No
33            Master_SSL_CA_File: 
34            Master_SSL_CA_Path: 
35               Master_SSL_Cert: 
36             Master_SSL_Cipher: 
37                Master_SSL_Key: 
38         Seconds_Behind_Master: NULL
39 Master_SSL_Verify_Server_Cert: No
40                 Last_IO_Errno: 0
41                 Last_IO_Error: 
42                Last_SQL_Errno: 1146
43                Last_SQL_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist' //Since the database had been droppted.This error was notable.
44   Replicate_Ignore_Server_Ids: 
45              Master_Server_Id: 1013306
46                   Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e
47              Master_Info_File: mysql.slave_master_info
48                     SQL_Delay: 0
49           SQL_Remaining_Delay: NULL
50       Slave_SQL_Running_State: 
51            Master_Retry_Count: 86400
52                   Master_Bind: 
53       Last_IO_Error_Timestamp: 
54      Last_SQL_Error_Timestamp: 180719 09:02:07
55                Master_SSL_Crl: 
56            Master_SSL_Crlpath: 
57            Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021 //It was stuck on transaction 3714549(which contained error).
58             Executed_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3714548,
59 5c77c31b-4add-11e8-81e2-080027de0e0e:1-168
60                 Auto_Position: 1
61          Replicate_Rewrite_DB: 
62                  Channel_Name: 
63            Master_TLS_Version: 
64 1 row in set (0.00 sec)
65 
66 [root@zlm3 09:10:42 ~]
67 #perror 1146
68 MySQL error code 1146 (ER_NO_SUCH_TABLE): Table '%-.192s.%-.192s' doesn't exist
69 
70 //Error 1146 indicated the absence of table "sbtest1" in "sysbench" database.
71 //Obviously,the slave was replaying the operations relevant to this table on master.The table even the database had been dropped.
72 //How could I do next step?Do I have to generate a new mysqldump file and reconfigure the slave again?
73 //There's One thing I'm rather sure that there were no other transactions generated in the whole course except the operations on "sysbench" database.
74 //Since I'd drop "sysbentch" database on both master and slave.Maybe I can fix the issue easily.

Checked the Executed_Gtid_Set on master.

1 (zlm@192.168.1.101 3306)[(none)]>show master status;
2 +------------------+-----------+--------------+------------------+------------------------------------------------+
3 | File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                              |
4 +------------------+-----------+--------------+------------------+------------------------------------------------+
5 | mysql-bin.000019 | 155765801 |              |                  | 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021 |
6 +------------------+-----------+--------------+------------------+------------------------------------------------+
7 1 row in set (0.00 sec)
8 
9 //The executed gtid was upto "3730021".

Tryed to fix the replica of master.

  1 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
  2 *************************** 1. row ***************************
  3                Slave_IO_State: Waiting for master to send event
  4                   Master_Host: 192.168.1.101
  5                   Master_User: repl
  6                   Master_Port: 3306
  7                 Connect_Retry: 60
  8               Master_Log_File: mysql-bin.000019
  9           Read_Master_Log_Pos: 155765801
 10                Relay_Log_File: relay-bin.000068
 11                 Relay_Log_Pos: 398
 12         Relay_Master_Log_File: mysql-bin.000019
 13              Slave_IO_Running: Yes
 14             Slave_SQL_Running: No
 15               Replicate_Do_DB: 
 16           Replicate_Ignore_DB: 
 17            Replicate_Do_Table: 
 18        Replicate_Ignore_Table: 
 19       Replicate_Wild_Do_Table: 
 20   Replicate_Wild_Ignore_Table: 
 21                    Last_Errno: 1146
 22                    Last_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist'
 23                  Skip_Counter: 0
 24           Exec_Master_Log_Pos: 99148246
 25               Relay_Log_Space: 56618627
 26               Until_Condition: None
 27                Until_Log_File: 
 28                 Until_Log_Pos: 0
 29            Master_SSL_Allowed: No
 30            Master_SSL_CA_File: 
 31            Master_SSL_CA_Path: 
 32               Master_SSL_Cert: 
 33             Master_SSL_Cipher: 
 34                Master_SSL_Key: 
 35         Seconds_Behind_Master: NULL
 36 Master_SSL_Verify_Server_Cert: No
 37                 Last_IO_Errno: 0
 38                 Last_IO_Error: 
 39                Last_SQL_Errno: 1146
 40                Last_SQL_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist'
 41   Replicate_Ignore_Server_Ids: 
 42              Master_Server_Id: 1013306
 43                   Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e
 44              Master_Info_File: mysql.slave_master_info
 45                     SQL_Delay: 0
 46           SQL_Remaining_Delay: NULL
 47       Slave_SQL_Running_State: 
 48            Master_Retry_Count: 86400
 49                   Master_Bind: 
 50       Last_IO_Error_Timestamp: 
 51      Last_SQL_Error_Timestamp: 180719 09:02:57
 52                Master_SSL_Crl: 
 53            Master_SSL_Crlpath: 
 54            Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021
 55             Executed_Gtid_Set: 
 56                 Auto_Position: 1
 57          Replicate_Rewrite_DB: 
 58                  Channel_Name: 
 59            Master_TLS_Version: 
 60 1 row in set (0.00 sec)
 61 
 62 (zlm@192.168.1.102 3306)[(none)]>reset master;
 63 Query OK, 0 rows affected (0.02 sec)
 64 
 65 (zlm@192.168.1.102 3306)[(none)]>set @@global.gtid_purged='1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021';
 66 Query OK, 0 rows affected (0.00 sec)
 67 
 68 //On account of surely knowing there were no other transactions at all.I set the "gtid_purged" variable to the value of "gtid_executed" on master.
 69 //It means I guised that all the transactions generated on master had been replayed on slave already.The slave could retrieve new GTID at the moment.
 70 
 71 (zlm@192.168.1.102 3306)[(none)]>start slave sql_thread;
 72 Query OK, 0 rows affected (0.02 sec)
 73 
 74 (zlm@192.168.1.102 3306)[(none)]>show slave statusG
 75 *************************** 1. row ***************************
 76                Slave_IO_State: Waiting for master to send event
 77                   Master_Host: 192.168.1.101
 78                   Master_User: repl
 79                   Master_Port: 3306
 80                 Connect_Retry: 60
 81               Master_Log_File: mysql-bin.000019
 82           Read_Master_Log_Pos: 155765801
 83                Relay_Log_File: relay-bin.000069
 84                 Relay_Log_Pos: 438
 85         Relay_Master_Log_File: mysql-bin.000019
 86              Slave_IO_Running: Yes
 87             Slave_SQL_Running: Yes //The sql_thread became "Yes".
 88               Replicate_Do_DB: 
 89           Replicate_Ignore_DB: 
 90            Replicate_Do_Table: 
 91        Replicate_Ignore_Table: 
 92       Replicate_Wild_Do_Table: 
 93   Replicate_Wild_Ignore_Table: 
 94                    Last_Errno: 0
 95                    Last_Error: 
 96                  Skip_Counter: 0
 97           Exec_Master_Log_Pos: 155765801
 98               Relay_Log_Space: 56618434
 99               Until_Condition: None
100                Until_Log_File: 
101                 Until_Log_Pos: 0
102            Master_SSL_Allowed: No
103            Master_SSL_CA_File: 
104            Master_SSL_CA_Path: 
105               Master_SSL_Cert: 
106             Master_SSL_Cipher: 
107                Master_SSL_Key: 
108         Seconds_Behind_Master: 0
109 Master_SSL_Verify_Server_Cert: No
110                 Last_IO_Errno: 0
111                 Last_IO_Error: 
112                Last_SQL_Errno: 0
113                Last_SQL_Error: 
114   Replicate_Ignore_Server_Ids: 
115              Master_Server_Id: 1013306
116                   Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e
117              Master_Info_File: mysql.slave_master_info
118                     SQL_Delay: 0
119           SQL_Remaining_Delay: NULL
120       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
121            Master_Retry_Count: 86400
122                   Master_Bind: 
123       Last_IO_Error_Timestamp: 
124      Last_SQL_Error_Timestamp: 
125                Master_SSL_Crl: 
126            Master_SSL_Crlpath: 
127            Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021
128             Executed_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021 //The slave had skipped those GTID(which contained error 1146) of master and waited for newer GTID.The replica had been fixed up.
129                 Auto_Position: 1
130          Replicate_Rewrite_DB: 
131                  Channel_Name: 
132            Master_TLS_Version: 
133 1 row in set (0.00 sec)

 Summary

  • The variable "gtid_purged" cannot be set if "gtid_executed" is not empty.
  • Caution,"reset master" can only be used on slave.Keep in mind that don't do it on master anytime.
  • This case can be followed only in test environment 'cause you cannot guarantee whether all the transactions are really replayed on slave.
原文地址:https://www.cnblogs.com/aaron8219/p/9326557.html