【MySQL】你以为设置了并行复制就降低延迟了?这个你绝对想不到!

在MySQL官方版本中,为了保证其的高可用性,一般情况我们会采用主从复制的方式来解决。当然,方法很多。而我们今天所要处理的是采用GTID方式并且开了多线程复制后,仍然延迟的情况,糟糕的是,延迟还在不断扩大!

  • 环境概要
  • 序号 清单 说明
    1 系统 Redhat 6.x(4c,32g)
    2 数据库 MySQL-5.7.25
    3 同步方式 基于GTID主从同步

  • 环境排查
  • 1)已经配置的重要参数:

    
    #relay for slave
    slave_parallel_type             = LOGICAL_CLOCK
    slave_parallel_workers          = 6
    master_info_repository          = TABLE
    relay_log_info_repository       = TABLE 
    relay_log_recovery              = on
    sync_relay_log                  = 10000
    

    注:此时没有设置slave_preserve_commit_order参数。

    2)从库延迟状态查询

    
    mysql> show slave statusG
    *************************** 1. row ***************************
                   Slave_IO_State: Queueing master event to the relay log
                      Master_Host: xxx.xxx.xxx.xxx
                      Master_User: repl
                      Master_Port: 3306
                    Connect_Retry: 60
                  Master_Log_File: mysql-bin.008978
              Read_Master_Log_Pos: 696914605
                   Relay_Log_File: DB41-relay-bin.001259
                    Relay_Log_Pos: 207377582
            Relay_Master_Log_File: mysql-bin.008970
                 Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
                  Replicate_Do_DB: neteagle3
              Replicate_Ignore_DB: mysql
               Replicate_Do_Table: 
           Replicate_Ignore_Table: 
          Replicate_Wild_Do_Table: 
      Replicate_Wild_Ignore_Table: 
                       Last_Errno: 0
                       Last_Error: 
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 1068770059
                  Relay_Log_Space: 8425484286
                  Until_Condition: None
                   Until_Log_File: 
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File: 
               Master_SSL_CA_Path: 
                  Master_SSL_Cert: 
                Master_SSL_Cipher: 
                   Master_SSL_Key: 
            Seconds_Behind_Master: 187358
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error: 
                   Last_SQL_Errno: 0
                   Last_SQL_Error: 
      Replicate_Ignore_Server_Ids: 
                 Master_Server_Id: 42
                      Master_UUID: eab7fcac-3cda-11e6-ada8-fa163e648db2
                 Master_Info_File: mysql.slave_master_info
                        SQL_Delay: 0
              SQL_Remaining_Delay: NULL
          Slave_SQL_Running_State: Waiting for dependent transaction to commit
               Master_Retry_Count: 86400
                      Master_Bind: 
          Last_IO_Error_Timestamp: 
         Last_SQL_Error_Timestamp: 
                   Master_SSL_Crl: 
               Master_SSL_Crlpath: 
               Retrieved_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:58031191-59927276
                Executed_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:1-58080239:58080241
                    Auto_Position: 1
             Replicate_Rewrite_DB: 
                     Channel_Name: 
               Master_TLS_Version: 
    1 row in set (0.00 sec)
    
    简单介绍一下几个指标信息:

    1. Master_Log_File
    2. Read_Master_Log_Pos
    3. Seconds_Behind_Master
    4. Relay_Log_File
    5. Relay_Log_Pos
    6. Relay_Master_Log_File
    7. Exec_Master_Log_Pos

    Master_Log_File, Read_Master_Log_Pos
    这两个参数是成对的,表示的是从库IO thread传输主库的binlog日志号及具体位置。

    Relay_Log_File, Relay_Log_Pos
    这两个参数也是成对的,表示的是从库sql thread应用中继日志(relay log)号及具体位置。

    Relay_Master_Log_File, Relay_Log_Pos
    这两个参数也是成对的,表示的是上一项中的中继日志对应的主库binlog日志及具体位置(有点绕)。

    Seconds_Behind_Master
    此参数可简单理解为主从延迟时间,单位为秒。

    从上面这段MySQL从状态信息中,我们可以看到,Seconds_Behind_Master: 187358这是从库sql应用延迟主库的时间为187358秒,转换成天,大概两天多。这说明,我们从库复制的数据是两天前的。

    3)验证并行复制

    
    mysql> show full processlist;
     ---- ------------- -------------- ----------- ------------------ -------- --------------------------------------------------------------- ----------------------- 
    | Id | User        | Host         | db        | Command          | Time   | State                                                         | Info                  |
     ---- ------------- -------------- ----------- ------------------ -------- --------------------------------------------------------------- ----------------------- 
    |  1 | system user |              | NULL      | Connect          |  18204 | Waiting for master to send event                              | NULL                  |
    |  2 | system user |              | NULL      | Connect          |      0 | Waiting for dependent transaction to commit                   | NULL                  |
    |  3 | system user |              | NULL      | Connect          | 154914 | System lock                                                   | NULL                  |
    |  4 | system user |              | NULL      | Connect          | 154914 | Waiting for an event from Coordinator                         | NULL                  |
    |  5 | system user |              | NULL      | Connect          | 154918 | Waiting for an event from Coordinator                         | NULL                  |
    |  6 | system user |              | NULL      | Connect          | 155525 | Waiting for an event from Coordinator                         | NULL                  |
    |  7 | system user |              | NULL      | Connect          | 180427 | Waiting for an event from Coordinator                         | NULL                  |
    |  8 | system user |              | NULL      | Connect          |  18204 | Waiting for an event from Coordinator                         | NULL                  |
    | 10 | root        | localhost    | neteagle3 | Query            |      0 | starting                                                      | show full processlist |
    | 11 | repl        | DBSlave:9683 | NULL      | Binlog Dump GTID |  18156 | Master has sent all binlog to slave; waiting for more updates | NULL                  |
    | 13 | root        | localhost    | neteagle3 | Sleep            |   4962 |                                                               | NULL                  |
     ---- ------------- -------------- ----------- ------------------ -------- --------------------------------------------------------------- ----------------------- 
    
    
    mysql> select * from  performance_schema.replication_applier_status_by_worker ;
     -------------- ----------- ----------- --------------- ----------------------------------------------- ------------------- -------------------- ---------------------- 
    | CHANNEL_NAME | WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION                         | LAST_ERROR_NUMBER | LAST_ERROR_MESSAGE | LAST_ERROR_TIMESTAMP |
     -------------- ----------- ----------- --------------- ----------------------------------------------- ------------------- -------------------- ---------------------- 
    |              |         1 |        51 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240805 |                 0 |                    | 0000-00-00 00:00:00  |
    |              |         2 |        52 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240210 |                 0 |                    | 0000-00-00 00:00:00  |
    |              |         3 |        53 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80235089 |                 0 |                    | 0000-00-00 00:00:00  |
    |              |         4 |        54 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80191268 |                 0 |                    | 0000-00-00 00:00:00  |
    |              |         5 |        55 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:75296683 |                 0 |                    | 0000-00-00 00:00:00  |
    |              |         6 |        56 | ON            |                                               |                 0 |                    | 0000-00-00 00:00:00  |
     -------------- ----------- ----------- --------------- ----------------------------------------------- ------------------- -------------------- ---------------------- 
    6 rows in set (0.00 sec)
    通过本条查询,可以看到开启了6个并行进行复制。
    

  • 细找瓶颈?
  • 通过以上来看,所有一切似乎都正常,并行复制开了,并且查看到CPU,IO,内存均没有达到瓶颈地步。主库写binlog日志大概为2MB/s,这样的日质量并不是非常高。

    从库也同样查看了是否存在锁的情况,也没有发现。

    因此反复的在查看slave状态,看能否发现一些细节,结果还真看到了一些异常现象。Relay_Log_Pos这个参数在频繁的刷slave状态时,发现时常会卡着不动(此时已确认没有看到锁)。或许问题真正的原因正在这里!

  • 顺藤摸瓜!
  • 分析binlog或者relay log日志,看有啥线索:
    
    [mysql@xxx data]$ mysqlbinlog --no-defaults -v -v --base64-output=DECODE-ROWS  relay-bin.001384 --start-position=420090430|more
    
    /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
    /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
    DELIMITER /*!*/;
    # at 420090430
    #190923  9:24:28 server id 42  end_log_pos 420090282 CRC32 0xd9097eaf   GTID    last_committed=57148    sequence_number=57149   rbr_only=yes
    /*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
    SET @@SESSION.GTID_NEXT= 'eab7fcac-3cda-11e6-ada8-fa163e648db2:69415610'/*!*/;
    # at 420090495
    #190923  9:24:28 server id 42  end_log_pos 420090364 CRC32 0x82b57dfd   Query   thread_id=95    exec_time=0     error_code=0
    SET TIMESTAMP=1569201868/*!*/;
    SET @@session.pseudo_thread_id=95/*!*/;
    SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
    SET @@session.sql_mode=1075838976/*!*/;
    SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=1/*!*/;
    /*!C gbk *//*!*/;
    SET @@session.character_set_client=28,@@session.collation_connection=28,@@session.collation_server=8/*!*/;
    SET @@session.lc_time_names=0/*!*/;
    SET @@session.collation_database=DEFAULT/*!*/;
    BEGIN
    /*!*/;
    # at 420090577
    #190923  9:24:28 server id 42  end_log_pos 420090585 CRC32 0x752e27cf   Table_map: `net`.`f_event` mapped to number 108
    # at 420090798
    #190923  9:24:28 server id 42  end_log_pos 420090812 CRC32 0x72b8e10d   Table_map: `net`.`f_eventstorage` mapped to number 245
    # at 420091025
    #190923  9:24:28 server id 42  end_log_pos 420091039 CRC32 0x1797f9d8   Table_map: `net`.`f_eventstorage` mapped to number 245
    # at 420091252
    #190923  9:24:28 server id 42  end_log_pos 420091106 CRC32 0x8af14ad2   Table_map: `net`.`f_eventdetail` mapped to number 243
    # at 420091319
    #190923  9:24:28 server id 42  end_log_pos 420091177 CRC32 0xf1ce87c8   Table_map: `net`.`f_eventoperation` mapped to number 244
    # at 420091390
    #190923  9:24:28 server id 42  end_log_pos 420091244 CRC32 0x586c0b9d   Table_map: `net`.`f_eventaudit` mapped to number 242
    # at 420091457
    #190923  9:24:28 server id 42  end_log_pos 420093382 CRC32 0x505e5408   Update_rows: table id 108
    # at 420093595
    #190923  9:24:28 server id 42  end_log_pos 420098858 CRC32 0x0f404509   Update_rows: table id 245
    # at 420099071
    #190923  9:24:28 server id 42  end_log_pos 420098910 CRC32 0xb8d9ed15   Write_rows: table id 243
    # at 420099123
    #190923  9:24:28 server id 42  end_log_pos 420098966 CRC32 0x3c489a7f   Write_rows: table id 244 flags: STMT_END_F
    

    我们查看了中继日志relay-bin.001384卡住的位置号420090430,为设置GTID_NEXT,信息没什么用。

    继续看在卡住时刻,数据库open的是什么表?

    
    mysql>   show open tables where In_use=1;
     ----------- --------------------- -------- ------------- 
    | Database  | Table               | In_use | Name_locked |
     ----------- --------------------- -------- ------------- 
    | net       | f_currentxxx        |      1 |           0 |
     ----------- --------------------- -------- ------------- 
    

    这张表有什么特殊的么?查看其表结构

    
    mysql>         show create table net.f_currentxxxG
    *************************** 1. row ***************************
           Table: f_currentxxx
    Create Table: CREATE TABLE `f_currentxxx` (
      `serial` int(20) NOT NULL COMMENT 'xxx',
      `audittime` bigint(20) NOT NULL COMMENT 'xxx',
      `type` int(11) DEFAULT NULL COMMENT 'xxx',
      `severity` int(11) DEFAULT NULL COMMENT 'xxx,
      KEY `audittime` (`audittime`)
    ) ENGINE=MEMORY DEFAULT CHARSET=gbk COMMENT='xxx'
    1 row in set (0.00 sec)
    

    有没有看到什么与众不同?

    没错,就是表的存储引擎ENGINE=MEMORY。MEMORY的表进行主从复制,首先来说如果从库不做查询,一点意义没有,另外对Memory表做复制,性能是非常堪忧的。如果必须进行同步,考虑将表的存储引擎改为InnoDB

    
    mysql> select table_name from information_schema.tables where TABLE_SCHEMA='net' and ENGINE='memory';
    
     ---------------------- 
    | table_name           |
     ---------------------- 
    | f_currentxxx         |
     ---------------------- 
    1 row in set, 6 warnings (0.01 sec)
    出于严谨,我们将要复制的数据库进行彻底排查,确实只有这一张表是Memory存储引擎。
    

  • 措施
  • 停止复制进程,将选项中添加```replicate-ignore-table=net.f_currentxxx```,重新启动复制进程,观察slave状态。
    
    mysql> show slave statusG
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
                      Master_Host: xxx.xxx.xxx.xxx
                      Master_User: repl
                      Master_Port: 3306
                    Connect_Retry: 60
                  Master_Log_File: mysql-bin.009194
              Read_Master_Log_Pos: 939698255
                   Relay_Log_File: relay-bin.001964
                    Relay_Log_Pos: 444060572
            Relay_Master_Log_File: mysql-bin.009027
                 Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
                  Replicate_Do_DB: net
              Replicate_Ignore_DB: mysql
               Replicate_Do_Table: 
           Replicate_Ignore_Table: net.f_currentxxx
          Replicate_Wild_Do_Table: 
      Replicate_Wild_Ignore_Table: 
                       Last_Errno: 0
                       Last_Error: 
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 444060359
                  Relay_Log_Space: 180287882098
                  Until_Condition: None
                   Until_Log_File: 
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File: 
               Master_SSL_CA_Path: 
                  Master_SSL_Cert: 
                Master_SSL_Cipher: 
                   Master_SSL_Key: 
            Seconds_Behind_Master: 179221
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error: 
                   Last_SQL_Errno: 0
                   Last_SQL_Error: 
      Replicate_Ignore_Server_Ids: 
                 Master_Server_Id: 42
                      Master_UUID: eab7fcac-3cda-11e6-ada8-fa163e648db2
                 Master_Info_File: mysql.slave_master_info
                        SQL_Delay: 0
              SQL_Remaining_Delay: NULL
          Slave_SQL_Running_State: Waiting for dependent transaction to commit
               Master_Retry_Count: 86400
                      Master_Bind: 
          Last_IO_Error_Timestamp: 
         Last_SQL_Error_Timestamp: 
                   Master_SSL_Crl: 
               Master_SSL_Crlpath: 
               Retrieved_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:69497322-107886661
                Executed_Gtid_Set: 1264a536-da12-11e9-81ea-005056856ba5:1,
    eab7fcac-3cda-11e6-ada8-fa163e648db2:1-71980857
                    Auto_Position: 1
             Replicate_Rewrite_DB: 
                     Channel_Name: 
               Master_TLS_Version: 
    1 row in set (0.00 sec)
    

    我们可以看到net.f_currentxxx表已经被忽略复制。持续观察一段时间后,Seconds_Behind_Master在逐渐缩小。
    中继日志应用速度大约5分钟一个(每个中继日志为1GB大小),而主库binlog日志大约为10分钟一个(每个binlog日志为1GB大小)。

  • 总结
  • 在梳理了整个处理过程后,其实难度不高,主要是要细心,细心去排查每一个想到的点。 在非轻量级的数据库中问题发生的概率也会随着量级的增加而增多。而这恰恰是能够磨练个人的成长。
    同时,知识的储备也要充足,这是进阶高手的必要前提!

    原文地址:https://www.cnblogs.com/linuxsql/p/11586845.html