一次线上MySQL主从延迟排查

今天早上来上班,发现zabbix一直告警主从延迟,mysql slave Seconds_Behind_Master (mysql.slave_status[Seconds_Behind_Master]): 69
登录MySQL从库查看 slave状态,Seconds_Behind_Master: 10562,确实存在延迟
mysql> show slave statusG
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
......
Master_Log_File: mysql-bin.002582
Read_Master_Log_Pos: 60231375
Relay_Log_File: relay-log.000998
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.002580
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1070638020
Relay_Log_Space: 29799422300
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_Key:
Seconds_Behind_Master: 10562
Master_SSL_Verify_Server_Cert: No

1 row in set (0.00 sec)


于是登录MySQL主库排查可能出现的原因
1、可能的原因如下
(1)主从服务器处于不同的网络之中,由于网络延迟导致;
(2)主从服务器的硬件配置不同,从服务器的硬件配置(包括内存,CPU,网卡等)远低于主服务器;
(3)主库上有大量的写入操作,导致从库无法实时重放主库上的binlog;
(4)主库上存在着大事务操作或者慢SQL,导致从库在应用主库binlog的过程过慢,形成延迟;
(5)数据库实例的参数配置问题导致,如:从库开启了binlog,或者配置了每次事务都去做刷盘操作;

经过排查,确定不是1、2、5的原因,查看慢日志文件,也没发现主库慢日志,所以判断主库上可能有大量的写入操作,或者主库上存在着大事务操作

2、查看主库的binlog文件,看到在1号有一个28G的mysql-bin文件
# ls -rtlsh
1.1G -rw-rw---- 1 mysql mysql 1.1G Mar 31 19:06 mysql-bin.002579
28G -rw-rw---- 1 mysql mysql 28G Apr 1 08:02 mysql-bin.002580
1.1G -rw-rw---- 1 mysql mysql 1.1G Apr 1 10:22 mysql-bin.002581

3、结合上述条件,用binlog_rollback分析MySQL binlog,判断是主库都执行了什么大的dml语句

binlog_rollback简介
binlog_rollback实现了基于row格式binlog的回滚闪回功能,让误删除或者误更新数据,可以不停机不使用备份而快速回滚误操作。
binlog_rollback也可以解释binlog(支持非row格式binlog)生成易读的SQL,让查找问题如什么时个某个表的某个字段的值被更新成了1,或者找出某个时间内某个表的所有删除操作等问题变得简单。
binlog_rollback可以按配置输出各个表的update/insert/delete统计报表, 也会输出大事务与长事务的分析, 应用是否干了坏事一目了然, 也会输出所有DDL。
binlog_rollback通过解释mysql/mariadb binlog/relaylog实现以下三大功能:
1)flashback/闪回/回滚, DML回滚到任意时间或者位置。
生成的文件名为rollback.xxx.sql或者db.tb.rollback.xxx.sql
生成的SQL形式如下
```sql
begin
DELETE FROM `danny`.`emp` WHERE `id`=1
# datetime=2017-10-23_00:14:28 database=danny table=emp binlog=mysql-bin.000012 startpos=417 stoppos=575
commit
```
2)前滚,把binlog/relaylog的DML解释成易读的SQL语句。
*支持非row格式的binlog, 默认不解释非row格式的DML, 需要指定参数-stsql
生成的文件名为forward.xxx.sql或者db.tb.forward.xxx.sql
生成的SQL形式如下
```sql
begin
# datetime=2017-10-23_00:14:28 database=danny table=emp binlog=mysql-bin.000012 startpos=417 stoppos=575
INSERT INTO `danny`.`emp` (`id`,`name`,`sr`,`icon`,`points`,`sa`,`sex`) VALUES (1,'张三1','华南理工大学&SCUT',X'89504e47',1.1,1.1,1)
commit
```
3)统计分析, 输出各个表的DML统计, 输出大事务与长事务, 输出所有的DDL
DML统计结果文件:binlog_stats.txt
大事务与长事务结果文件:big_long_trx.txt
DDL结果文件:ddl_info.txt

具体可以产考 https://github.com/GoDannyLai/binlog_rollback

# /root/dba/mysql_dml/bin/binlog_rollback -m file -w stats -M mysql -i 20 -b 100 -l 10 -o /data/dml_binlog/log /data/mysql/var/mysql-bin.002580

# cat binlog_stats.txt |awk '{print $8}' |sort -nr|head -10
417936881
9718
7183
7183
7165
7162
7152
7150
7137
7119


mysql-bin.002580 2020-03-31_18:33:00 2020-03-31_18:33:00 1070638106 3966168092 0 0 417936881 content testing

可以看到在 2020-03-31_18:33:00的时候content.testing删除了 417936881条数据,是由于delete造成数据库主从延迟的主要原因。

最后和相关的开发建讨论群追踪问题,发现是一个研发直接delete  form  testing;删了4亿多条数据的一个大事务,从而造成主从延迟。

解决办法

(一)

由于这组实例不是核心业务,qps和tps都不是很高,零时把读写都切到master上,等主从延迟追平了,再把读切换到从库上。

(二)

也可以在主库上备份一个全备,再从库上恢复2个新的slave,恢复完了重新做主从。

 
为了下次方便,写了一个脚本。
 
建表语句
CREATE TABLE `t_dml_count` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `insert_totals` int(11) DEFAULT NULL,
  `update_totals` int(11) DEFAULT NULL,
  `delete_totals` int(11) DEFAULT NULL,
  `db_name` varchar(30) DEFAULT NULL,
  `table_name` varchar(50) DEFAULT NULL,
  `binlog_name` varchar(30) DEFAULT NULL,
  `starttime` varchar(50) DEFAULT NULL,
  `stoptime` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;
 
 
同步主库binlog脚本
cat binlog_server.sh
#!/bin/sh
case "$1" in
    start)
        /usr/local/mysql/bin/mysqlbinlog -h$2 -root -pxxx -R --raw --stop-never $3  --stop-never-slave-server-id=20003306  -r /data/dml_binlog/binlog/  &  
        ;;
    stop)
        ps -ef|grep  mysqlbinlog|grep -v grep|awk '{print $2}'|xargs kill -9
        ;;
        *)
        echo "Usage: ls {start|stop}"
        exit 1
        ;;
esac
 
dml启动和停止脚本
cat mysql_dml_count.sh
#!/bin/sh
case "$1" in
    start)
        /root/dba/mysql_dml/mysql_dml_count.py   &>/dev/null &
        ;;
    stop)
        ps -ef|grep  mysql_dml_count.py|grep -v grep|awk '{print $2}'|xargs kill -9
        /bin/rm -rf /data/dml_binlog/log/*
        ;;
        *)
        echo "Usage: ls {start|stop}"
        exit 1
        ;;
esac
 
分析和统计dml脚本
cat mysql_dml_count.py
 
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import sys, os, MySQLdb
 
dirpath = "/data/dml_binlog/binlog/"
 
while True:
    file_name = os.listdir(dirpath)
    for file in file_name:
        print(file)
        os.system(
            "/root/dba/mysql_dml/bin/binlog_rollback -m file -w stats -M mysql -i 20 -b 100 -l 10 -o /data/dml_binlog/log   /data/dml_binlog/binlog/%s" % file)
        os.remove("/data/dml_binlog/binlog/%s" % file)
 
        log_info = open("/data/dml_binlog/log/binlog_stats.txt", mode="r")
        for info in log_info.readlines():
            info_array = info.split()
            arg1 = info_array[0]  # binlog_name
            arg2 = info_array[1]  # starttime
            arg3 = info_array[2]  # stoptime
            arg4 = info_array[5]  # insert_totals
            arg5 = info_array[6]  # update_totals
            arg6 = info_array[7]  # delete_totals
            arg7 = info_array[8]  # db_name
            arg8 = info_array[9]  # table_name
 
            # 打开数据库连接
            conn = MySQLdb.connect(host='10.20.30.21', port=3306, user='root', passwd='xxx', db='test')
            # 使用cursor()方法获取操作游标
            cursor = conn.cursor()
            # SQL 插入语句
            sql = "INSERT INTO  t_dml_count(`insert_totals`,`update_totals`,`delete_totals`,`db_name`,`table_name`,`binlog_name`,`starttime`,`stoptime`) values('%s','%s','%s','%s','%s','%s','%s','%s')" % (
                arg4, arg5, arg6, arg7, arg8, arg1, arg2, arg3)
            try:
                # 执行sql语句
                cursor.execute(sql)
                # 提交到数据库执行
                conn.commit()
            except Exception as error:
                print(error)
                # 发生错误时回滚
                conn.rollback()
            # 关闭数据库连接
            conn.close()
 
        log_info.close()
 

查看哪些MySQL数据库有哪些大的dml操作
mysql> select delete_totals,db_name,table_name,starttime,stoptime from test.t_dml_count order by delete_totals desc limit 10;
+---------------+--------------------+------------------+---------------------+---------------------+
| delete_totals | db_name | table_name | starttime | stoptime |
+---------------+--------------------+------------------+---------------------+---------------------+
| 417936881 | content | testing | 2020-03-31_18:33:00 | 2020-03-31_18:33:00 |
| 9718 | content | data_count | 2020-04-01_00:15:01 | 2020-04-01_00:15:01 |
| 4831 | Log | soLog | 2020-04-01_05:10:40 | 2020-04-01_05:10:59 |
| 4813 | Log | soLog | 2020-04-01_05:21:20 | 2020-04-01_05:21:39 |
| 4811 | Log | soLog | 2020-04-01_05:34:00 | 2020-04-01_05:34:19 |
| 4806 | Log | soLog | 2020-04-01_05:11:40 | 2020-04-01_05:11:59 |
| 4802 | Log | soLog | 2020-04-01_05:33:40 | 2020-04-01_05:33:59 |
| 4800 | Log | soLog | 2020-04-01_05:34:40 | 2020-04-01_05:34:59 |
| 4779 | Log | soLog | 2020-04-01_05:35:20 | 2020-04-01_05:35:39 |
| 4776 | Log | soLog | 2020-04-01_05:19:40 | 2020-04-01_05:19:59 |
+---------------+--------------------+------------------+---------------------+---------------------+
10 rows in set (0.01 sec)

mysql> select insert_totals,db_name,table_name,starttime,stoptime from test.t_dml_count order by insert_totals desc limit 10; 

+---------------+--------------------+-----------------------------+---------------------+---------------------+
| insert_totals | db_name | table_name | starttime | stoptime |
+---------------+--------------------+-----------------------------+---------------------+---------------------+
| 9722 | content | data_count | 2020-04-01_00:15:01 | 2020-04-01_00:15:01 |
| 4000 | Log | LogBackUp | 2020-04-01_05:05:40 | 2020-04-01_05:05:59 |
| 3500 | Log | LogBackUp | 2020-04-01_05:03:01 | 2020-04-01_05:03:19 |
| 3500 | Log | LogBackUp | 2020-04-01_05:02:40 | 2020-04-01_05:02:58 |
| 3500 | Log | LogBackUp | 2020-04-01_05:03:40 | 2020-04-01_05:03:57 |
| 3500 | Log | LogBackUp | 2020-04-01_05:01:40 | 2020-04-01_05:01:58 |
| 3500 | Log | LogBackUp | 2020-04-01_05:00:40 | 2020-04-01_05:00:58 |
| 3500 | Log | LogBackUp | 2020-04-01_05:04:20 | 2020-04-01_05:04:37 |
| 3500 | Log | LogBackUp | 2020-04-01_05:01:01 | 2020-04-01_05:01:19 |
| 3500 | Log | LogBackUp | 2020-04-01_05:02:01 | 2020-04-01_05:02:19 |
+---------------+--------------------+-----------------------------+---------------------+---------------------+
10 rows in set (0.01 sec)

mysql> select update_totals ,db_name,table_name,starttime,stoptime from test.t_dml_count order by update_totals desc limit 10;
+---------------+--------------------+--------------------+---------------------+---------------------+
| update_totals | db_name | table_name | starttime | stoptime |
+---------------+--------------------+--------------------+---------------------+---------------------+
| 2629 | dataanalytics | Code | 2020-04-01_04:10:56 | 2020-04-01_04:11:15 |
| 2493 | dataanalytics | Code | 2020-04-01_04:10:16 | 2020-04-01_04:10:35 |
| 2430 | dataanalytics | Code | 2020-04-01_04:11:16 | 2020-04-01_04:11:35 |
| 2377 | dataanalytics | Code | 2020-04-01_04:10:36 | 2020-04-01_04:10:55 |
| 1778 | dataanalytics | Code | 2020-04-01_04:10:01 | 2020-04-01_04:10:15 |
| 959 | dataanalytics | Code | 2020-04-01_04:11:36 | 2020-04-01_04:11:43 |
| 503 | Log | PointLog | 2020-04-01_08:03:09 | 2020-04-01_08:03:28 |
| 431 | content | push | 2020-04-01_15:15:13 | 2020-04-01_15:15:13 |
| 351 | Log | PointLog | 2020-04-01_11:43:09 | 2020-04-01_11:43:28 |
| 349 | Log | PointLog | 2020-04-01_11:51:09 | 2020-04-01_11:51:28 |
+---------------+--------------------+--------------------+---------------------+---------------------+
10 rows in set (0.02 sec)

原文地址:https://www.cnblogs.com/manger/p/12613515.html