有同事分析一个crash,我参与了分析,记录如下,供遇到相同crash的兄弟参考:
crash> bt PID: 24632 TASK: ffff881723ce8080 CPU: 14 COMMAND: "30-00-SS" #0 [ffff881723cef6d0] machine_kexec at ffffffff8103244b #1 [ffff881723cef730] crash_kexec at ffffffff810bb0e2 #2 [ffff881723cef800] oops_end at ffffffff814fe120 #3 [ffff881723cef830] no_context at ffffffff810425db #4 [ffff881723cef880] __bad_area_nosemaphore at ffffffff81042865 #5 [ffff881723cef8d0] bad_area at ffffffff8104298e #6 [ffff881723cef900] __do_page_fault at ffffffff810430c0 #7 [ffff881723cefa20] do_page_fault at ffffffff8150014e #8 [ffff881723cefa50] page_fault at ffffffff814fd485 [exception RIP: udp_send_skb+744] RIP: ffffffff81497358 RSP: ffff881723cefb08 RFLAGS: 00010202 RAX: 000000008cfa0d02 RBX: ffff881439a092c0 RCX: 0000000000000000 RDX: 000000000000005c RSI: 0000000000000014 RDI: ffff881061d4b800 RBP: ffff881723cefb48 R8: 000000004a00300a R9: 0000000039a092c0 R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000005c R13: ffff8811f8e77100 R14: 0000000000000000 R15: ffff881061d4b824 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff881723cefb50] udp_sendmsg at ffffffff8149781d #10 [ffff881723cefc50] inet_sendmsg at ffffffff8149fc5a #11 [ffff881723cefc90] sock_sendmsg at ffffffff814261ca #12 [ffff881723cefe40] sys_sendto at ffffffff81426ae9 #13 [ffff881723ceff80] system_call_fastpath at ffffffff8100b0d2 RIP: 0000003e1580ee83 RSP: 00007ffb0c9339b8 RFLAGS: 00000202 RAX: 000000000000002c RBX: ffffffff8100b0d2 RCX: 0000000000000002 RDX: 0000000000000054 RSI: 00007ffb0c934380 RDI: 0000000000000b0d RBP: 00007ffaf4175c80 R8: 00007ffb0c933fb0 R9: 000000000000001c R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffb0c934258 R13: 00007ffb0c93432c R14: 000000000000e2ab R15: 00007ffaf4d3a878 ORIG_RAX: 000000000000002c CS: 0033 SS: 002b crash>
BUG: unable to handle kernel paging request at 000000008cfa0d76 IP: [<ffffffff81497358>] udp_send_skb+0x2e8/0x3d0 PGD be47ab067 PUD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/online CPU 14 Modules linked in: ***,省略 Pid: 24632, comm: 30-00-SS Tainted: G W ---------------- 2.6.32-220.el6.x86_64 #1 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M. RIP: 0010:[<ffffffff81497358>] [<ffffffff81497358>] udp_send_skb+0x2e8/0x3d0 RSP: 0018:ffff881723cefb08 EFLAGS: 00010202 RAX: 000000008cfa0d02 RBX: ffff881439a092c0 RCX: 0000000000000000 RDX: 000000000000005c RSI: 0000000000000014 RDI: ffff881061d4b800 RBP: ffff881723cefb48 R08: 000000004a00300a R09: 0000000039a092c0 R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000005c R13: ffff8811f8e77100 R14: 0000000000000000 R15: ffff881061d4b824 FS: 00007ffb0c935700(0000) GS:ffff880c5a780000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000008cfa0d76 CR3: 0000000c10733000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process 30-00-SS (pid: 24632, threadinfo ffff881723cee000, task ffff881723ce8080) Stack: 0000000000000000 5e2d320a2654dacc ffff880c94c6acc0 ffff8811f8e77100 <0> 0000000000000000 000000004a00300a ffff881723cefe58 ffffffff81471f40 <0> ffff881723cefc48 ffffffff8149781d ffff881723cefbc0 ffffffff00000040 Call Trace: [<ffffffff81471f40>] ? ip_generic_getfrag+0x0/0xb0 [<ffffffff8149781d>] udp_sendmsg+0x2ed/0x8f0 [<ffffffff8149fc5a>] inet_sendmsg+0x4a/0xb0 [<ffffffff814261ca>] sock_sendmsg+0x11a/0x150 [<ffffffff810925e0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff810a4b2e>] ? futex_wake+0x10e/0x120 [<ffffffff8100bdee>] ? reschedule_interrupt+0xe/0x20 [<ffffffff8117f90a>] ? fget_light+0x7a/0x90 [<ffffffff81426ae9>] sys_sendto+0x139/0x190 [<ffffffff810d6a12>] ? audit_syscall_entry+0xc2/0x2b0 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b Code: 30 01 00 00 4e 8d 3c 3f 48 8b 44 07 18 44 89 fe 2b b3 d8 00 00 00 41 29 f4 48 85 c0 44 89 e2 0f 84 9e 00 00 00 66 0f 1f 44 00 00 <03> 48 74 40 0f 92 c7 2b 50 68 48 8b 00 40 0f b6 ff 8d 0c 0f 48 RIP [<ffffffff81497358>] udp_send_skb+0x2e8/0x3d0 RSP <ffff881723cefb08> CR2: 000000008cfa0d76
从堆栈看,用户通过sendto系统调用进入内核,在内核态调用 udp_send_skb 时出现异常,执行的代码为 :udp_send_skb+0x2e8/0x3d0
/usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/include/net/checksum.h: 64 0xffffffff81497358 <udp_send_skb+744>: add 0x74(%rax),%ecx----------------------------------------异常
根据代码行,确定 checksum.h:64所在的函数是:csum_add,根据调用链
udp4_hwcsum ->csum_add会调用csum_add,udp_csum 也会调用csum_add,所以需要具体分析一下哪个调用csum_add导致的异常。
skb的参数是 udp_sendmsg 中根据 ip_make_skb 调用返回的,
0xffffffff814977eb <udp_sendmsg+699>: callq 0xffffffff81474820 <ip_make_skb> /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/net/ipv4/udp.c: 764 0xffffffff814977f0 <udp_sendmsg+704>: test %rax,%rax /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/net/ipv4/udp.c: 763 0xffffffff814977f3 <udp_sendmsg+707>: mov %eax,%r9d /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/net/ipv4/udp.c: 764 0xffffffff814977f6 <udp_sendmsg+710>: je 0xffffffff81497a6d <udp_sendmsg+1341> 0xffffffff814977fc <udp_sendmsg+716>: cmp $0xfffffffffffff000,%rax 0xffffffff81497802 <udp_sendmsg+722>: ja 0xffffffff81497a6d <udp_sendmsg+1341> /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/net/ipv4/udp.c: 765 0xffffffff81497808 <udp_sendmsg+728>: movzwl -0xb2(%rbp),%edx 0xffffffff8149780f <udp_sendmsg+735>: mov -0xb0(%rbp),%esi 0xffffffff81497815 <udp_sendmsg+741>: mov %rax,%rdi 0xffffffff81497818 <udp_sendmsg+744>: callq 0xffffffff81497070 <udp_send_skb>
rax就是skb,没有使用栈来保存,在调用udp_send_skb之前,又赋值给了rdi,所以rdi就是skb,进入udp_send_skb的堆栈之后:
crash> dis -l udp_send_skb /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/net/ipv4/udp.c: 519 0xffffffff81497070 <udp_send_skb>: push %rbp 0xffffffff81497071 <udp_send_skb+1>: mov %rsp,%rbp 0xffffffff81497074 <udp_send_skb+4>: push %r15 0xffffffff81497076 <udp_send_skb+6>: push %r14 0xffffffff81497078 <udp_send_skb+8>: push %r13 0xffffffff8149707a <udp_send_skb+10>: push %r12 0xffffffff8149707c <udp_send_skb+12>: push %rbx 0xffffffff8149707d <udp_send_skb+13>: sub $0x18,%rsp 0xffffffff81497081 <udp_send_skb+17>: nopl 0x0(%rax,%rax,1) 0xffffffff81497086 <udp_send_skb+22>: mov %esi,-0x34(%rbp) /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/include/linux/skbuff.h: 1227 0xffffffff81497089 <udp_send_skb+25>: mov 0xbc(%rdi),%r12d-------------------------------------此时rdi就是skb,r12d 就是skb->transport_header /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/net/ipv4/udp.c: 519 0xffffffff81497090 <udp_send_skb+32>: mov %rdi,%rbx
我们可以看到,rdi赋值给了rbx,且rbx在最后遇到异常指令之前,都没有被修改过,所以rbx中就是skb指针。也就是:ffff881439a092c0 ,有了skb,我们就好分析udp_send_skb的执行流了。
crash> struct -x sk_buff.sk ffff881439a092c0 sk = 0xffff8811f8e77100 crash> struct udp_sock.pcflag 0xffff8811f8e77100 pcflag = 0 '