Web server hang

The web server hung, so I logged in to the backend to investigate. Running ps aux hung as well.

After logging in again over ssh, strace ps showed the following:

Debugging with gdb hung too!

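Listing all processes with top -b showed that the earlier ps process was in D state (uninterruptible sleep), and that some of the web server's threads/processes were in D state as well.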
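Since ps itself hangs in this situation, the D-state tasks can also be found by reading /proc/<pid>/task/<tid>/stat directly. The following is only a minimal sketch: the function names and the proc_root parameter are hypothetical, and it assumes that reading stat does not block on this system (top -b still worked here, which suggests so).

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """List (tid, comm) for every thread in uninterruptible sleep (state D)."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            stat_path = os.path.join(task_dir, tid, "stat")
            try:
                with open(stat_path, errors="replace") as f:
                    _, comm, state = parse_stat(f.read())
            except (OSError, ValueError, IndexError):
                continue  # thread exited or unexpected file contents
            if state == "D":
                found.append((int(tid), comm))
    return sorted(found)


if __name__ == "__main__":
    for tid, comm in d_state_tasks():
        print(tid, comm)
```

The sketch deliberately never opens /proc/<pid>/cmdline, which is the file ps was stuck reading according to the analysis further down.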

dmesg shows:

[20761.085669] INFO: task apache2:7135 blocked for more than 120 seconds.
[20761.085675]       Tainted: G        W  O    #4
[20761.085677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20761.085679] apache2         D ffffffc000086ef8     0  7135   4035 0x00000000
[20761.085683] Call trace:
[20761.085736] [<ffffffc000086ef8>] __switch_to+0xa0/0xb8
[20761.085767] [<ffffffc0009ff10c>] __schedule+0x24c/0x704
[20761.085769] [<ffffffc0009ff5fc>] schedule+0x38/0x90
[20761.085781] [<ffffffc000a0260c>] rwsem_down_write_failed+0x1d8/0x310
[20761.085783] [<ffffffc000a00928>] down_write+0x5c/0x74
[20761.085795] [<ffffffc0002188d4>] SyS_mprotect+0xb0/0x204
[20761.085797] [<ffffffc000085c74>] el0_svc_naked+0x24/0x28
[20761.085799] INFO: task apache2:7138 blocked for more than 120 seconds.
[20761.085800]       Tainted: G        W  O    YUN #4
[20761.085801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20761.085802] apache2         D ffffffc000086ef8     0  7138   4035 0x00000000
[20761.085805] Call trace:
[20761.085807] [<ffffffc000086ef8>] __switch_to+0xa0/0xb8
[20761.085809] [<ffffffc0009ff10c>] __schedule+0x24c/0x704
[20761.085811] [<ffffffc0009ff5fc>] schedule+0x38/0x90
[20761.085813] [<ffffffc000a0260c>] rwsem_down_write_failed+0x1d8/0x310
[20761.085815] [<ffffffc000a00928>] down_write+0x5c/0x74
[20761.085818] [<ffffffc000249344>] split_huge_page_to_list+0x64/0x7e4
[20761.085819] [<ffffffc00024a570>] __split_huge_page_pmd+0x120/0x354
[20761.085821] [<ffffffc00020dc88>] unmap_single_vma+0x178/0x644
[20761.085823] [<ffffffc00020ec48>] zap_page_range+0xa8/0x114
[20761.085825] [<ffffffc000221150>] SyS_madvise+0x2f4/0x520
[20761.085827] [<ffffffc000085c74>] el0_svc_naked+0x24/0x28
[20761.085828] INFO: task apache2:7158 blocked for more than 120 seconds.
[20761.085830]       Tainted: G        W  O    server.YUN #4
[20761.085831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20761.085832] apache2         D ffffffc000086ef8     0  7158   4035 0x00000008
[20761.085834] Call trace:
[20761.085836] [<ffffffc000086ef8>] __switch_to+0xa0/0xb8
[20761.085838] [<ffffffc0009ff10c>] __schedule+0x24c/0x704
[20761.085840] [<ffffffc0009ff5fc>] schedule+0x38/0x90
[20761.085842] [<ffffffc000a0260c>] rwsem_down_write_failed+0x1d8/0x310
[20761.085843] [<ffffffc000a00928>] down_write+0x5c/0x74
[20761.085845] [<ffffffc000249344>] split_huge_page_to_list+0x64/0x7e4
[20761.085847] [<ffffffc00024a570>] __split_huge_page_pmd+0x120/0x354
[20761.085849] [<ffffffc00020dc88>] unmap_single_vma+0x178/0x644
[20761.085850] [<ffffffc00020ec48>] zap_page_range+0xa8/0x114
[20761.085852] [<ffffffc000221150>] SyS_madvise+0x2f4/0x520
[20761.085854] [<ffffffc000085c74>] el0_svc_naked+0x24/0x28
[20761.085856] INFO: task ps:17403 blocked for more than 120 seconds.
[20761.085857]       Tainted: G        W  O     #4
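With many hung-task reports in dmesg, it helps to reduce each one to the task and the system call it is stuck in. Here is a rough sketch that parses log text in the format shown above; the function names are hypothetical, and the regular expressions only assume the line layout visible in this log.

```python
import re

# "INFO: task apache2:7135 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")
# "[<ffffffc000a00928>] down_write+0x5c/0x74"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_hung_tasks(log_text):
    """Return one dict {'comm', 'pid', 'frames'} per hung-task report."""
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group(1),
                "pid": int(header.group(2)),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            # Frames are listed innermost first, as in the kernel log.
            current["frames"].append(frame.group(1))
    return reports


def blocking_syscall(report):
    """Return the outermost syscall handler frame (e.g. SyS_mprotect), if any."""
    for name in reversed(report["frames"]):
        if name.startswith(("SyS_", "sys_")):
            return name
    return None
```

Applied to the log above, this gives apache2:7135 stuck in SyS_mprotect and apache2:7138 and apache2:7158 stuck in SyS_madvise, all sleeping in rwsem_down_write_failed. Note that the madvise traces reach down_write from split_huge_page_to_list, which may be a different rwsem than mmap_sem (on kernels of this era it is probably the anon_vma lock), so the reports do not necessarily all wait on the same lock.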

Looking at the kernel code, the direct cause is that proc_pid_cmdline_read in fs/proc/base.c fails to acquire the semaphore in the code below and therefore goes to sleep:

    down_read(&mm->mmap_sem);
    arg_start = mm->arg_start;
    arg_end = mm->arg_end;
    env_start = mm->env_start;
    env_end = mm->env_end;
    up_read(&mm->mmap_sem);

void __sched down_read(struct rw_semaphore *sem)
{
    might_sleep();
    rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);

    LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
}

/*
 * lock for reading
 */
static inline void __down_read(struct rw_semaphore *sem)
{
    if (unlikely(atomic_long_inc_return_acquire((atomic_long_t *)&sem->count) <= 0))
        rwsem_down_read_failed(sem);
}
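To see why the `<= 0` check in __down_read sends a reader to the slow path even when no writer currently holds the lock, here is a toy model of the counter arithmetic. It is only a sketch: the bias constants are assumed from the generic 64-bit rwsem layout of kernels from this period and may differ on the kernel in question, and the function name is hypothetical.

```python
# Assumed constants (generic 64-bit rwsem layout; not verified on this kernel).
RWSEM_UNLOCKED_VALUE = 0
RWSEM_ACTIVE_BIAS = 1                        # added once per active locker
RWSEM_ACTIVE_MASK = 0xFFFFFFFF
RWSEM_WAITING_BIAS = -RWSEM_ACTIVE_MASK - 1  # added while anyone is queued
RWSEM_ACTIVE_WRITE_BIAS = RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS


def down_read_fast_path(count):
    """Model of __down_read: return (new_count, acquired_without_sleeping)."""
    # Corresponds to atomic_long_inc_return_acquire(&sem->count).
    new_count = count + RWSEM_ACTIVE_BIAS
    # A result <= 0 is the case where rwsem_down_read_failed() is called.
    return new_count, new_count > 0


if __name__ == "__main__":
    cases = {
        "unlocked": RWSEM_UNLOCKED_VALUE,
        "one reader": RWSEM_ACTIVE_BIAS,
        "writer holds": RWSEM_ACTIVE_WRITE_BIAS,
        "reader holds, writer queued": RWSEM_ACTIVE_BIAS + RWSEM_WAITING_BIAS,
    }
    for label, count in cases.items():
        print(label, down_read_fast_path(count)[1])
```

Under these assumptions, the last case is the interesting one: once a writer is queued, the count is negative, so every new reader (such as ps reading cmdline) falls into rwsem_down_read_failed and sleeps behind that writer, even though the lock is only read-held.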

So which process acquired this sem and never released it?

How can that be checked? The first step is to capture the kernel stacks.
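One possible way to capture those stacks without ps or gdb is to read /proc/<pid>/stack for each suspicious task, or to ask the kernel to log all blocked tasks through SysRq 'w'. The sketch below assumes root privileges and a kernel that exposes /proc/<pid>/stack; the function names and the path parameters are hypothetical and exist mainly so the code can be tried against a scratch directory.

```python
import os


def read_kernel_stack(pid, proc_root="/proc"):
    """Return the kernel stack text of a task, or None if it cannot be read."""
    path = os.path.join(proc_root, str(pid), "stack")
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        # Not root, task already gone, or the file is not provided by this kernel.
        return None


def dump_blocked_tasks(trigger="/proc/sysrq-trigger"):
    """Request a dump of uninterruptible (D state) tasks; the output goes to dmesg."""
    with open(trigger, "w") as f:
        f.write("w")
```

For example, read_kernel_stack(7135) would be tried for each apache2 thread from the log, looking for the one task whose stack is not inside rwsem_down_write_failed or rwsem_down_read_failed; that task is the candidate lock holder.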

A Google search also shows that the kernel has a related patch that changes this code; see the kernel patch.

Original article: https://www.cnblogs.com/codestack/p/15155899.html