一次线上问题定位步骤及解决复盘demo

前段时间由于公司rabbitmq集群网络出现故障导致Rabbitmq响应延迟,引发楼主团队的一个核心服务出现连锁问题,现用这个demo复盘一下当时情景,及演示如何快速定位问题,以及总结经验教训思考如何避免类似情况发生
项目地址: https://gitee.com/kenwar/debug_online_demo

场景再现

"十万火急群"有大量反馈系统xx/xx/xxx功能不可用,同时告警系统发出大量告警:"xxx服务获取数据库连接失败异常"
登录日志平台查看该服务:error级别日志中存在大量异常 [com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 1500, active 30, maxActive 30]
登录Grafana平台查看对应MySQL实例,CPU,内存,带宽,连接数等指标水平非常低,基本没有负载,排除长查询/数据库并发高/数据库死锁等原因
Grafana平台查看对应tomcat实例,发现CPU占比高,线程数高,初步怀疑某些耗时操作导致连接未释放(目标定位到耗时代码块,耗时原因)
登录K8s管理平台,exec进入对应容器中
jps找到容器中运行的java项目进程号18(其实我们所有docker中的java项目都是指定了进程号18)
top -Hp 18 查看该进程中线程CPU前十,所有线程cpu占比较均匀,排除某些线程执行死循环/大量cpu运算等原因
jstack -l 18 打印该进程中所有线程堆栈(可用jstack -l 18 > fileName的方式将结果写入指定文件中,便于后续分析 )

vi打开文件,搜索关键词"wait"发现大量请求waiting on condition [0x00007f6a32167000],大量请求在等待某个condition.其实是在获取连接时等待notEmpty的condition(并无卵用)
线程堆栈信息
avatar

"http-nio-8093-exec-7" #20 daemon prio=5 os_prio=0 tid=0x00007f6a58b8c000 nid=0x15a8 waiting on condition [0x00007f6a32167000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000ed03f468> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at com.alibaba.druid.pool.DruidDataSource.pollLast(DruidDataSource.java:1487)
        at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1086)
        at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:953)
        at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:933)
        at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:923)
        at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:100)
        at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:264)

DruidDataSource.java:1487

    
    // druid连接池获取连接时会通过该condition进行限时等待,限时时间为连接池中配置的maxWAit值
    estimate = notEmpty.awaitNanos(estimate); // signal by

另有几十个线程状态为BLOCKED waiting for monitor entry [0x00007f7243af6000],都在wait同一把锁,通过堆栈信息是RabbitTemplate中使用CachingConnectionFactory进行createConnection()时在同步代码块中创建连接阻塞了,就是他了,定位到具体业务代码
阻塞的线程堆栈信息
avatar

"http-nio-8093-exec-6" #19 daemon prio=5 os_prio=0 tid=0x00007f7274b0d000 nid=0x2930 waiting for monitor entry [0x00007f7243af6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.ken.debugonline.sender.MsgSender.sender(MsgSender.java:29)
        - waiting to lock <0x00000000ed1202f8> (a java.lang.Object)
        at com.ken.debugonline.sender.MsgSender.sendMsg(MsgSender.java:20)
        at com.ken.debugonline.service.impl.TestServiceImpl.sendMsg(TestServiceImpl.java:23)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.springframework.transaction.interceptor.TransactionInterceptor$$Lambda$508/65382615.proceedWithInvocation(Unknown Source)

模拟阻塞的代码

private void sender(Integer msg){
    // 该锁模拟rabbitmq获取connection是的逻辑 CachingConnectionFactory.java-> createConnection
    synchronized (connectionMonitor){
        try {
            // 模拟mq建立连接时的网络阻塞
            Thread.sleep(20000);
            log.info("send msg:{} to mq",msg);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

CachingConnectionFactory.java->createConnection()实际rabbitmq代码片段

    public final Connection createConnection() throws AmqpException {
            Assert.state(!this.stopped, "The ApplicationContext is closed and the ConnectionFactory can no longer create connections.");
            Object var1 = this.connectionMonitor;
            synchronized(this.connectionMonitor) {
            ...

模拟业务代码TestServiceImpl.java

    // 声明事务
    @Transactional(rollbackFor = Exception.class)
    public void sendMsg() {
        int id = testMapper.insert();
        事务中进行发送
        testMapper.select(1L);
        // 模拟发送消息
        msgSender.sendMsg(id);

    }

分析:阅读该位置代码发现是在事务中调用了发送消息到mq的方法,该方法本会将消息offer到一个队列中,队列容量2w,再由其他线程异步消费发送至mq,讲道理是不会阻塞的,但是有一段神奇的逻辑是,当队列满后,将有当前线程进行同步发送

// 示例代码com.ken.debugonline.sender.MsgSender
/**
 * 将消息放入队列中,使用线程异步发送,若队列已满则同步发送
 * @param msg
 */
public void sendMsg(Integer msg){
    boolean offer = needSend.offer(msg);
    // 若队列已满则同步发送
    if (!offer){
        this.sender(msg);
    }
}

private void sender(Integer msg){
    // 该锁模拟rabbitmq获取connection是的逻辑 CachingConnectionFactory.java-> createConnection
    synchronized (connectionMonitor){
        try {
            // 模拟mq建立连接时的网络阻塞
            Thread.sleep(20000);
            log.info("send msg:{} to mq",msg);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

分析出原因,第一mq网络出现问题导致发送阻塞(沟通运维得知确实出了故障),第二在事务中调用了同步发送逻辑导致出现长事务,连接得不到释放,导致其他所有依赖该连接池的接口都获取不到连接,第三:mq未设置超时时间,导致阻塞在mq的时间过长
解决方案,设置mq超时时间,使发送逻辑快速失败,队列满时不再转同步发送,直接丢弃
重新打包发版
发版失败,由于项目启动时强依赖Rabbitmq,Rabbitmq连接超时就启动不了
~~怎么办,不搞了等运维把Rabbitmq恢复吧(反正不是我背锅)~~....咳咳,新项目启动不了没关系,可以进行热替换
下载arthas:curl -O https://arthas.aliyun.com/arthas-boot.jar
启动arthas: java -jar arthas-boot.jar 根据进程号选择要注入的java项目
将新的/MsgSender.class上传进容器指定路径(运维帮忙)
redefine /MsgSender.class
等待30s(原mq超时时间),观察到获取连接失败的报错不再有,服务恢复正常(该替换只是替换内存中的class,重启后失效,不过我们服务不存在重启,只会新开容器部署)
mq恢复后再次发布修改好的服务
坐等丢锅大会(复盘)

从该次事件中发现的问题

所有有可能出现延迟的网络中间件的使用都要设置好超时时间,考虑好如何在不可用时保证核心服务可用
开发同学不要在事务中进行耗时操作,长事务将导致连接迟迟无法释放,会快速耗完连接池(还会导致其他晚于该事务的回滚段不会被清理,占用大量存储空间:详情可了解MVCC)
DBA增加对长事务的监听[查找超过60s的事务的代码: select * from information_schema.innodb_trx where TIME_TO_SEC(timediff(now(),trx_started))>60 ]
中间件出现故障时,运维团队应及时跟各部门同步,还需尽量保证高可用
消息发送的sdk不应进行异步转同步操作:如果是可靠消息,那么肯定先落表再发送,如果是非可靠消息,那么丢了也无所谓,所以理论上无需转同步