Tracking a kernel bug: THP huge page

Kernel 2.6.32-279 (el6), with KSM and THP enabled.

LOG:

<4>------------[ cut here ]------------

<2>kernel BUG at mm/huge_memory.c:1194!

<4>invalid opcode: 0000 [#1] SMP 

<4>last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:06:00.0/host7/scsi_host/host7/proc_name

<4>CPU 1 

<4>Modules linked in: nf_conntrack bonding 8021q garp stp llc ipv6 openvswitch(U) libcrc32c iptable_filter ip_tables sg igb dca microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ext4 mbcache jbd2 sd_mod crc_t10dif isci libsas scsi_transport_sas ahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

<4>

<4>Pid: 3084, comm: redis-server Tainted: G           --------------- H  2.6.32-279.19.5.el6.ucloud.x86_64 #1 Huawei Technologies Co., Ltd. Tecal RH2288H V2-12L/BC11SRSG1

<4>RIP: 0010:[<ffffffff81180f22>]  [<ffffffff81180f22>] split_huge_page+0x802/0x850

<4>RSP: 0018:ffff880c5edddbd8  EFLAGS: 00010086

<4>RAX: 00000000ffffffff RBX: ffffea0029333970 RCX: ffffc9000c06b000

<4>RDX: 0000000000000002 RSI: ffff880c61fe9c00 RDI: 0000000000000004

<4>RBP: ffff880c5edddca8 R08: 0000000000000064 R09: 0000000000000001

<4>R10: 0000000000000106 R11: ffff880c5941f300 R12: ffffea0029334000

<4>R13: 0000000000000000 R14: ffffea002932d000 R15: 00000007fcd17be1

<4>FS:  00007fcd25858720(0000) GS:ffff880053820000(0000) knlGS:0000000000000000

<4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

<4>CR2: 00007f7086a8a740 CR3: 0000000c57daf000 CR4: 00000000001406e0

<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

<4>Process redis-server (pid: 3084, threadinfo ffff880c5eddc000, task ffff880c5941e940)

<4>Stack:

<4> 0000000000000002 0000000000000000 0000000000000000 ffff880c00000001

<4><d> ffff880c5edddc78 ffff88184155d5f0 0000000000000000 ffffffff81180f9b

<4><d> ffff88184155d630 ffffffff810a9f0d ffffea0052938e78 0000000180010f40

<4>Call Trace:

<4> [<ffffffff81180f9b>] ? __split_huge_page_pmd+0x2b/0xc0

<4> [<ffffffff810a9f0d>] ? trace_hardirqs_on_caller+0x14d/0x190

<4> [<ffffffff81180ff0>] __split_huge_page_pmd+0x80/0xc0

<4> [<ffffffff81147ff4>] unmap_vmas+0xa34/0xc00

<4> [<ffffffff81148b9d>] zap_page_range+0x7d/0xe0

<4> [<ffffffff81144b4d>] sys_madvise+0x54d/0x770

<4> [<ffffffff815073bb>] ? thread_return+0x4e/0x7d3

<4> [<ffffffff810a9f0d>] ? trace_hardirqs_on_caller+0x14d/0x190

<4> [<ffffffff81509ee2>] ? trace_hardirqs_on_thunk+0x3a/0x3f

<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

<4>Code: 83 c1 01 e8 69 5e 38 00 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 0f 

<1>RIP  [<ffffffff81180f22>] split_huge_page+0x802/0x850

<4> RSP <ffff880c5edddbd8>

Combining the back trace above with the source code, the call path is:

	unmap_vmas
		--unmap_page_range
			--zap_pud_range
				--zap_pmd_range
					--split_huge_page_pmd
						--__split_huge_page_pmd
							--split_huge_page
								--__split_huge_page
									--__split_huge_page_refcount
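
For reference, below is a sketch of zap_pmd_range() as it looks in a 2.6.32-era tree with the THP backport (reconstructed from memory; details such as the zap_work accounting may differ from the exact 2.6.32-279 source). This is where the unmap path decides between zapping the whole huge pmd, splitting it first (the branch taken in the back trace, because madvise() covered only part of the huge page), and the pmd_none_or_clear_bad() check that the race analysis further down revolves around:

static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
                                struct vm_area_struct *vma, pud_t *pud,
                                unsigned long addr, unsigned long end,
                                long *zap_work, struct zap_details *details)
{
        pmd_t *pmd;
        unsigned long next;

        pmd = pmd_offset(pud, addr);
        do {
                next = pmd_addr_end(addr, end);
                if (pmd_trans_huge(*pmd)) {
                        if (next - addr != HPAGE_PMD_SIZE) {
                                /* Unmapping only part of the huge page:
                                 * split it first.  This is the branch seen
                                 * in the back trace above. */
                                split_huge_page_pmd(vma->vm_mm, pmd);
                        } else if (zap_huge_pmd(tlb, vma, pmd)) {
                                (*zap_work)--;
                                continue;
                        }
                        /* fall through */
                }
                /* Reads *pmd with only mmap_sem held for read, not the
                 * page_table_lock -- this is where the race described
                 * further down bites. */
                if (pmd_none_or_clear_bad(pmd)) {
                        (*zap_work)--;
                        continue;
                }
                next = zap_pte_range(tlb, vma, pmd, addr, next,
                                     zap_work, details);
        } while (pmd++, addr = next, (addr != end && *zap_work > 0));

        return addr;
}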
									

[The original post embeds a screenshot of mm/huge_memory.c around line 1194 here.]

Line 1194 is the BUG_ON in the tail-page loop of __split_huge_page_refcount().
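
In kernels that carry the "thp: tail page refcounting fix" (which the presence of this check implies), that loop looks roughly like the following; this is a reconstruction from memory rather than a verbatim copy of the 2.6.32-279 source:

/* Inside __split_huge_page_refcount(): distribute the head page's pins
 * over the tail pages. */
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
        struct page *page_tail = page + i;

        /* tail_page->_mapcount cannot change from under us; with the
         * tail-page refcounting scheme it also acts as the tail pin
         * count.  This is the BUG_ON reported at line 1194. */
        BUG_ON(page_mapcount(page_tail) < 0);
        tail_count += page_mapcount(page_tail);
        /* check for overflow */
        BUG_ON(tail_count < 0);
        BUG_ON(atomic_read(&page_tail->_count) != 0);
        /* Transfer the pins into the tail page's own _count. */
        atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
                   &page_tail->_count);

        /* ... flags, ->mapping, ->index and LRU handling omitted ... */
}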

[Assembly] After disassembling vmlinux with objdump: because many of these functions are static inline, the symbol address from System.map is not enough; you have to go into the calling function's disassembly to locate the faulting instruction.

[The original post embeds a screenshot of the disassembly around the faulting instruction here.]

RAX: 00000000ffffffff 

RBX: ffffea0029333970 

R12: ffffea0029334000

Trigger: the value in eax is -1.

eax holds the result of "add $0x1, %eax", which corresponds to atomic_read(&(page)->_mapcount) + 1.

That is, _mapcount was -2, and adding 1 gave -1.

So _mapcount has taken on a value that, in theory, should never and can never occur.
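
The correspondence follows directly from page_mapcount() in include/linux/mm.h; the "+ 1" is the "add $0x1, %eax" seen in the disassembly:

static inline int page_mapcount(struct page *page)
{
        return atomic_read(&(page)->_mapcount) + 1;
}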

_mapcount is documented in struct page as: "Count of ptes mapped in mms, to show when page is mapped & limit reverse map searches."

It starts from -1.

From the analysis, the _mapcount value at this point should in theory be -1:

with BUG_ON(page_mapcount(page_tail) < 0), the checked value would then be 0 and the BUG would not fire.

And the corrupted value is not random garbage; it looks like a correct value that was decremented once too often.

[The original post embeds a screenshot here, apparently the search results for the _mapcount decrement sites.]

All of those decrements are in put_compound_page()... (In this kernel a tail page's _mapcount also serves as the tail-page pin count: get_page() on a tail page increments it, and put_page()/put_compound_page() decrements it again, which is presumably why the hits all land there.)

A Google search turned up the following record; it appears to be the changelog of the upstream fix for exactly this race ("mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode"):

====== start quote =======
  mapcount 0 page_mapcount 1
  kernel BUG at mm/huge_memory.c:1384!

At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:

  mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
->  145         pmd_ERROR(*pmd);
    146         pmd_clear(pmd);
    147 }

After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.

   1381         if (mapcount != page_mapcount(page))
   1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
   1383                        mapcount, page_mapcount(page));
-> 1384         BUG_ON(mapcount != page_mapcount(page));

The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.

           virtual address space
          .---------------------.
          |                     |
          |                     |
        .-|---------------------|
        | |                     |
        | |                     |<-- B(fault)
        | |                     |
  2 MB  | |/////////////////////|-.
  huge <  |/////////////////////|  > A(range)
  page  | |/////////////////////|-'
        | |                     |
        | |                     |
        '-|---------------------|
          |                     |
          |                     |
          '---------------------'

- Thread A is executing an madvise(..., MADV_DONTNEED) system call
  on the virtual address range "A(range)" shown in the picture.

sys_madvise
  // Acquire the semaphore in shared mode.
  down_read(&current->mm->mmap_sem)
  ...
  madvise_vma
    switch (behavior)
    case MADV_DONTNEED:
         madvise_dontneed
           zap_page_range
             unmap_vmas
               unmap_page_range
                 zap_pud_range
                   zap_pmd_range
                     //
                     // Assume that this huge page has never been accessed.
                     // I.e. content of the PMD entry is zero (not mapped).
                     //
                     if (pmd_trans_huge(*pmd)) {
                         // We don't get here due to the above assumption.
                     }
                     //
                     // Assume that Thread B incurred a page fault and
         .---------> // sneaks in here as shown below.
         |           //
         |           if (pmd_none_or_clear_bad(pmd))
         |               {
         |                 if (unlikely(pmd_bad(*pmd)))
         |                     pmd_clear_bad
         |                     {
         |                       pmd_ERROR
         |                         // Log "bad pmd ..." message here.
         |                       pmd_clear
         |                         // Clear the page's PMD entry.
         |                         // Thread B incremented the map count
         |                         // in page_add_new_anon_rmap(), but
         |                         // now the page is no longer mapped
         |                         // by a PMD entry (-> inconsistency).
         |                     }
         |               }
         |
         v
- Thread B is handling a page fault on virtual address "B(fault)" shown
  in the picture.

...
do_page_fault
  __do_page_fault
    // Acquire the semaphore in shared mode.
    down_read_trylock(&mm->mmap_sem)
    ...
    handle_mm_fault
      if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
          // We get here due to the above assumption (PMD entry is zero).
          do_huge_pmd_anonymous_page
            alloc_hugepage_vma
              // Allocate a new transparent huge page here.
            ...
            __do_huge_pmd_anonymous_page
              ...
              spin_lock(&mm->page_table_lock)
              ...
              page_add_new_anon_rmap
                // Here we increment the page's map count (starts at -1).
                atomic_set(&page->_mapcount, 0)
              set_pmd_at
                // Here we set the page's PMD entry which will be cleared
                // when Thread A calls pmd_clear_bad().
              ...
              spin_unlock(&mm->page_table_lock)

The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read).  Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated.  However, Thread A
does not synchronize on that lock.

====== end quote =======

Given the analysis above and the situation we are seeing, there is good reason to conclude that this is the same kind of multi-threaded race (the call path nests several loops, and the two threads only interleave the wrong way under specific conditions). The key point of the quote bears repeating:

The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read).  Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated.  However, Thread A
does not synchronize on that lock.

What still needs to be pinned down is exactly where the race is triggered in our case, and how to construct a test environment that reproduces it.
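
As a starting point for such a test environment, here is a minimal, untested sketch that merely replays the interleaving described in the quote: every round maps a fresh, never-touched region, then one thread madvise(MADV_DONTNEED)s the lower half of the 2 MB range ("A(range)") while the other takes the first fault on the upper part ("B(fault)"). The alignment trick, the barrier choreography and all names are choices made here for illustration, not taken from the original post, and it is not guaranteed to reproduce the crash:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

#ifndef MADV_HUGEPAGE          /* older glibc headers may not define it */
#define MADV_HUGEPAGE 14
#endif

static char *region;
static pthread_barrier_t start_b, done_b;

/* Thread A: madvise(MADV_DONTNEED) on part of the huge page range. */
static void *thread_a(void *arg)
{
        for (;;) {
                pthread_barrier_wait(&start_b);
                madvise(region, HPAGE_SIZE / 2, MADV_DONTNEED);
                pthread_barrier_wait(&done_b);
        }
        return NULL;
}

/* Thread B: take the first fault on the same huge page range. */
static void *thread_b(void *arg)
{
        for (;;) {
                pthread_barrier_wait(&start_b);
                region[HPAGE_SIZE - 1] = 1;
                pthread_barrier_wait(&done_b);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_barrier_init(&start_b, NULL, 3);
        pthread_barrier_init(&done_b, NULL, 3);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);

        for (;;) {
                /* Map a fresh region every round so the PMD entry starts out empty. */
                char *buf = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (buf == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                /* Align to a 2 MB boundary so a THP can back it. */
                region = (char *)(((unsigned long)buf + HPAGE_SIZE - 1) &
                                  ~(HPAGE_SIZE - 1));
                madvise(region, HPAGE_SIZE, MADV_HUGEPAGE);

                pthread_barrier_wait(&start_b);   /* let A and B race */
                pthread_barrier_wait(&done_b);

                munmap(buf, 2 * HPAGE_SIZE);
        }
        return 0;
}

Build with -pthread and run on a machine with THP enabled; whether the window is actually hit depends on timing and still needs the further investigation mentioned above.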

Original post: https://www.cnblogs.com/chen-blog/p/4147146.html