Tracking down the cause of an automatic reboot on a CentOS 7 host

1 Background
We recently brought a physical server online. The operating system installed by the IT team reports the following version:
CentOS Linux release 7.3.1611 (Core)

Kernel version:
3.10.0-514.el7.x86_64 

The host runs Docker, and the Docker version is:
Docker version 19.03.6
While in service, the host crashed unexpectedly and rebooted.

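2 Fault analysis
2.1 Approach
(1) Start with the system log /var/log/messages and look for clues.
(2) Suspect a hardware compatibility problem, and ask the hardware vendor to check firmware and compatibility.
(3) Suspect a bug in the operating system. Check whether kdump is enabled; if it is, look for a kernel crash dump (vmcore) written at the time of the crash.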

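2.2 Analysis in practice
(1) Check the system log /var/log/messages

The log shows that the system went down at 18:19:01 on 2020-04-01 and came back up at 18:23:19. Beyond that, it contains nothing useful for the analysis.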
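When the log says little, it can still help to pin down the crash window by finding each boot banner and the line logged just before it. The following is a minimal sketch, not the method used above: the function name `last_line_before_boot` is made up for illustration, and it assumes the kernel writes a line containing "kernel: Linux version" early in each boot and that the file is in chronological order. The line it reports as "before" may itself belong to the new boot if other early messages precede the banner, so treat the result as an approximation.

```shell
# Hypothetical helper: for every boot banner in a syslog-style file, print the
# line logged immediately before it, followed by the banner itself.
# Assumption: each boot logs a line containing "kernel: Linux version".
last_line_before_boot() {
    awk '
        /kernel: Linux version/ {
            if (prev != "") print "before: " prev
            print "boot:   " $0
        }
        { prev = $0 }
    ' "$1"
}

# Example (log path as given in this article):
#   last_line_before_boot /var/log/messages
```

A large jump between the "before" timestamp and the "boot" timestamp, with no shutdown messages in between, is consistent with a crash rather than a clean restart.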

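(2) Check for hardware compatibility problems
In parallel, send the hardware information collected from iDRAC to the hardware vendor for review.
(3) Analyze with kdump
a. Check whether kdump is installed and running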

# systemctl status kdump.service 
● kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
Active: active (exited) since Thu 2020-04-02 09:01:47 CST; 4h 0min ago
Main PID: 284294 (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CGroup: /system.slice/kdump.service

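Note: see section 3 for installing the kdump-related tools.
b. Analyze with the crash command
Once the tools are installed as described in section 3, analyze the vmcore with the command below (kdump was already enabled by default on my host, so a dump was available).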

# crash /var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
crash 7.2.3-10.el7
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore [PARTIAL DUMP]
CPUS: 72
DATE: Wed Apr 1 18:19:27 2020
UPTIME: 19 days, 08:32:38
LOAD AVERAGE: 0.29, 0.32, 0.29
TASKS: 4177
NODENAME: 
RELEASE: 3.10.0-514.el7.x86_64
VERSION: #1 SMP Tue Nov 22 16:42:41 UTC 2016
MACHINE: x86_64 (2600 Mhz)
MEMORY: 127.5 GB
PANIC: "kernel BUG at fs/xfs/xfs_aops.c:1062!"
PID: 92639
COMMAND: "kworker/u898:3"
TASK: ffff8810f827bec0 [THREAD_INFO: ffff880106fa4000]
CPU: 1
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 92639 TASK: ffff8810f827bec0 CPU: 1 COMMAND: "kworker/u898:3"
#0 [ffff880106fa75f0] machine_kexec at ffffffff81059cdb
#1 [ffff880106fa7650] __crash_kexec at ffffffff81105182
#2 [ffff880106fa7720] crash_kexec at ffffffff81105270
#3 [ffff880106fa7738] oops_end at ffffffff8168ee88
#4 [ffff880106fa7760] die at ffffffff8102e93b
#5 [ffff880106fa7790] do_trap at ffffffff8168e540
#6 [ffff880106fa77e0] do_invalid_op at ffffffff8102b144
#7 [ffff880106fa7890] invalid_op at ffffffff81697e5e
[exception RIP: xfs_vm_writepage+1419]
RIP: ffffffffa052b2fb RSP: ffff880106fa7948 RFLAGS: 00010246
RAX: 006fffff00040009 RBX: ffff8813abed8fc8 RCX: 000000000000000c
RDX: 0000000000000008 RSI: ffff880106fa7c40 RDI: ffffea006be56c00
RBP: ffff880106fa79f0 R8: ffffffffffffffd8 R9: 000000000001a100
R10: ffff88207ffd7000 R11: 0000000000000000 R12: ffff8813abed8fc8
R13: ffff880106fa7c40 R14: ffff8813abed8e78 R15: ffffea006be56c00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff880106fa7990] find_get_pages_tag at ffffffff81180981
#9 [ffff880106fa79f8] __writepage at ffffffff8118b3b3
#10 [ffff880106fa7a10] write_cache_pages at ffffffff8118bed1
#11 [ffff880106fa7b28] generic_writepages at ffffffff8118c19d
#12 [ffff880106fa7b88] xfs_vm_writepages at ffffffffa052a063 [xfs]
#13 [ffff880106fa7bb8] do_writepages at ffffffff8118d24e
#14 [ffff880106fa7bc8] __writeback_single_inode at ffffffff81228730
#15 [ffff880106fa7c08] writeback_sb_inodes at ffffffff8122941e
#16 [ffff880106fa7cb0] __writeback_inodes_wb at ffffffff8122967f
#17 [ffff880106fa7cf8] wb_writeback at ffffffff81229ec3
#18 [ffff880106fa7d70] bdi_writeback_workfn at ffffffff8122bd05
#19 [ffff880106fa7e20] process_one_work at ffffffff810a7f3b
#20 [ffff880106fa7e68] worker_thread at ffffffff810a8d76
#21 [ffff880106fa7ec8] kthread at ffffffff810b052f
#22 [ffff880106fa7f50] ret_from_fork at ffffffff81696518
crash>

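c. The backtrace shows exception RIP: xfs_vm_writepage+1419. Search for it on Google.

This Red Hat solution looks very similar to what I am seeing:
https://access.redhat.com/solutions/2779111
Since the symptoms appear to match, the plan is to schedule downtime, upgrade the kernel as the article recommends, and then watch whether the crash happens again.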
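The panic string is often the most useful search term, so it can be handy to pull it out of a saved text file instead of an interactive crash session. The sketch below is an illustration only: `panic_summary` is a hypothetical name, and it assumes a text file such as the vmcore-dmesg.txt that kdump normally saves next to the vmcore (whether that file exists on a given host is an assumption here).

```shell
# Hypothetical helper: print each distinct "kernel BUG at <file>:<line>" string
# found in a text file, without the trailing "!" or quote character.
panic_summary() {
    grep -o 'kernel BUG at [^!"]*' "$1" | sort -u
}

# Example (the vmcore-dmesg.txt file name is an assumption, the directory is
# the one used in this article):
#   panic_summary "/var/crash/127.0.0.1-2020-04-01-18:19:32/vmcore-dmesg.txt"
```

The resulting string, together with the function name from the exception RIP line, is what was searched for above.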


3 Installing the kdump-related tools
3.1 Install kexec-tools

yum install kexec-tools
yum install crash

3.2 Configure the kdump service

vim /etc/kdump.conf
# Set the directory where the core file is saved
path /var/crash
systemctl start kdump
systemctl enable kdump.service

Reference: https://www.linuxtechi.com/how-to-enable-kdump-on-rhel-7-and-centos-7/
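kdump can only capture a dump if memory was reserved for the crash kernel at boot, which is done with the crashkernel= kernel parameter. A minimal sketch for checking that, assuming the command line is passed in as a string; `has_crashkernel` is a hypothetical helper name, not part of kexec-tools:

```shell
# Hypothetical helper: succeed if the given kernel command line contains a
# crashkernel= parameter, fail otherwise.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example on a live system:
#   has_crashkernel "$(cat /proc/cmdline)" && echo "crashkernel is set"
#   cat /sys/kernel/kexec_crash_loaded    # 1 means a crash kernel is loaded
```

If the parameter is missing, the kdump service may be enabled without being able to save a vmcore, so this is worth checking before relying on it.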
3.3 Install kernel-debuginfo
(1) Download the packages
On http://debuginfo.centos.org/7/x86_64/, find the RPM packages that match the running kernel version:

kernel-debuginfo-3.10.0-514.el7.x86_64.rpm 
kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm

(2) Install

rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-514.el7.x86_64.rpm
rpm -ivh kernel-debuginfo-3.10.0-514.el7.x86_64.rpm
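The debuginfo packages must match the kernel release exactly, otherwise crash cannot pair the vmlinux with the vmcore. The sketch below derives the two package file names and the vmlinux path from a release string, following the naming pattern visible in this article; the function names `debuginfo_rpms` and `vmlinux_path` are made up for illustration, and the pattern is assumed to hold for other CentOS 7 kernels as well.

```shell
# Hypothetical helper: print the two debuginfo RPM file names for a kernel
# release such as "3.10.0-514.el7.x86_64", in install order (common first).
debuginfo_rpms() {
    release=$1
    arch=${release##*.}    # text after the last dot, e.g. x86_64
    printf 'kernel-debuginfo-common-%s-%s.rpm\n' "$arch" "$release"
    printf 'kernel-debuginfo-%s.rpm\n' "$release"
}

# Hypothetical helper: print where the matching vmlinux is expected after install.
vmlinux_path() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# Example on the affected host:
#   debuginfo_rpms "$(uname -r)"
#   ls -l "$(vmlinux_path "$(uname -r)")"
```

Note that `uname -r` reports the running kernel; after a kernel upgrade, a vmcore from the old kernel still needs the old release string.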
Original article: https://www.cnblogs.com/doctormo/p/12619485.html