Oops or bad page in page_alloc.c

Hi

I encountered an oops in isolate_pcp_pages() and a bad page in
get_page_from_freelist().

linux: 3.12.37-rt51  (CONFIG_PREEMPT_RT_BASE not enabled)
arch: PowerPC (e500)

The appmon.sh below is a shell script that periodically checks whether
other applications are still running; if one is not, it prints some
info into a unique log file under /tmp and restarts that application.
Normally the other applications are running and nothing needs to be
restarted. But because of a bug, there is one application that can
never be restarted successfully (no such application exists; failing
to start it has no impact on the system other than periodically
printing some info into the log file).
The problem is hard to reproduce. It was reported in the field after
the system had been running for more than 217 days (about 5233 ~ 5238
hours).
I tried to reproduce it with a small test application but failed.

From the oops below, it's really strange: the page being deleted from
the pcp free list appears to have already been deleted earlier. From
the 'Bad page' issue, it seems that we can get a page that is still in
use.
To me, the issue seems related to some race condition (maybe between
the parent and its child processes), but I have no clue yet.
Any suggestions will be appreciated!

[18857088.953420] Unable to handle kernel paging request for data at
address 0x00100104
[18857089.046143] Faulting instruction address: 0xc0075624
[18857089.108654] Oops: Kernel access of bad area, sig: 11 [#1]
[18857089.176366] SMP NR_CPUS=8 CoreNet Generic
[18857089.227419] Modules linked in: napt(O)
[18857089.275357] CPU: 1 PID: 10357 Comm: appmon.sh Tainted: G
  O 3.12.37-rt51 #1
[18857089.371202] task: caba75b0 ti: cab2c000 task.ti: cab2c000
[18857089.438917] NIP: c0075624 LR: c0078f24 CTR: 00000007
[18857089.501427] REGS: cab2dbc0 TRAP: 0300   Tainted: G           O
(3.12.37-rt51)
[18857089.591014] MSR: 00021002 <CE,ME>  CR: 44448888  XER: 20000000
[18857089.663967] DEAR: 00100104, ESR: 00800000
[18857089.715017]
[18857089.715017] GPR00: 00100100 cab2dc70 caba75b0 00000006 c0728054
cab2dc88 c0728070 00000002
[18857089.715017] GPR08: c0728064 c0641814 00000002 00200200 00100100
100f9890 100f1d2c 100f0000
[18857089.715017] GPR16: 100f0000 100f0000 100bd61c c04b8d80 00029002
00000000 00200200 00100100
[18857089.715017] GPR24: cab8b00c 00000007 c04b8d80 00289000 00029002
00000000 cab2dc88 00200200
[18857090.073578] NIP [c0075624] isolate_pcp_pages+0x84/0xc4
[18857090.138173] LR [c0078f24] free_hot_cold_page+0x124/0x174
[18857090.204849] Call Trace:
[18857090.237156] [cab2dc70] [00080008] 0x80008 (unreliable)
[18857090.301762] [cab2dc80] [c0078e34] free_hot_cold_page+0x34/0x174
[18857090.375736] [cab2dcc0] [c0079300] free_hot_cold_page_list+0x44/0x54
[18857090.453876] [cab2dce0] [c007c588] release_pages+0x74/0x1c8
[18857090.522645] [cab2dd30] [c008d500] tlb_flush_mmu+0x60/0x70
[18857090.590370] [cab2dd50] [c008d528] tlb_finish_mmu+0x18/0x44
[18857090.659137] [cab2dd60] [c0093cb8] exit_mmap+0xb8/0x11c
[18857090.723741] [cab2ddd0] [c0019514] mmput+0x3c/0xf4
[18857090.783133] [cab2ddf0] [c00a8878] flush_old_exec+0x514/0x58c
[18857090.853986] [cab2de20] [c00d2208] load_elf_binary+0x1f0/0xfa4
[18857090.925875] [cab2dea0] [c00a8308] search_binary_handler+0x16c/0x1c8
[18857091.004015] [cab2ded0] [c00a8fcc] do_execve+0x2f0/0x4f8
[18857091.069655] [cab2df20] [c00a93d4] SyS_execve+0x40/0x58
[18857091.134257] [cab2df40] [c000cb38] ret_from_syscall+0x0/0x3c
[18857091.204067] --- Exception: c01 at 0xfdb75b4
[18857091.204067]     LR = 0x10032c24
[18857091.297826] Instruction dump:
[18857091.336385] 8128000c 7cc43214 7f864800 41feffd4 2f8a0003
40fe0008 7c6a1b78 7c6903a6
[18857091.432277] 81280010 3863ffff 81690004 81890000 <916c0004>
918b0000 90090000 93e90004
[18857091.530255] ---[ end trace ea47a50e65f9635c ]---
[18857091.588595]
[18857091.609453] Unable to handle kernel paging request for data at
address 0x00100104
[18857091.702170] Faulting instruction address: 0xc0075624
[18857091.764680] Oops: Kernel access of bad area, sig: 11 [#2]
[18857091.832394] SMP NR_CPUS=8 CoreNet Generic
[18857091.883446] Modules linked in: napt(O)
[18857091.931383] CPU: 1 PID: 10357 Comm: appmon.sh Tainted: G      D
  O 3.12.37-rt51 #1
[18857092.027222] task: caba75b0 ti: cab2c000 task.ti: cab2c000
[18857092.094938] NIP: c0075624 LR: c0078f24 CTR: 00000007
[18857092.157448] REGS: cab2d940 TRAP: 0300   Tainted: G      D    O
(3.12.37-rt51)
[18857092.247036] MSR: 00021002 <CE,ME>  CR: 24442288  XER: 20000000
[18857092.319989] DEAR: 00100104, ESR: 00800000
[18857092.371039]
[18857092.371039] GPR00: 00100100 cab2d9f0 caba75b0 00000006 c0728054
cab2da08 c0728070 00000002
[18857092.371039] GPR08: c0728064 c0641814 00000002 00200200 00100100
100f9890 100f1d2c 100f0000
[18857092.371039] GPR16: 100f0000 100f0000 100bd61c c04b8d80 00029002
00000000 c0000000 cabb57fc
[18857092.371039] GPR24: c0000000 00000007 c04b8d80 00289000 00021002
00000000 cab2da08 00200200
[18857092.729594] NIP [c0075624] isolate_pcp_pages+0x84/0xc4
[18857092.794187] LR [c0078f24] free_hot_cold_page+0x124/0x174
[18857092.860857] Call Trace:
[18857092.893165] [cab2da00] [c0078e34] free_hot_cold_page+0x34/0x174
[18857092.967139] [cab2da40] [c008d790] free_pgd_range+0x148/0x15c
[18857093.037987] [cab2da70] [c008d81c] free_pgtables+0x78/0xa4
[18857093.105710] [cab2daa0] [c0093ca4] exit_mmap+0xa4/0x11c
[18857093.170308] [cab2db10] [c0019514] mmput+0x3c/0xf4
[18857093.229700] [cab2db30] [c001cbb4] do_exit+0x2d0/0x790
[18857093.293261] [cab2db80] [c0008fbc] die+0x23c/0x244
[18857093.352654] [cab2dbb0] [c000d060] handle_page_fault+0x7c/0x80
[18857093.424547] --- Exception: 300 at isolate_pcp_pages+0x84/0xc4
[18857093.424547]     LR = free_hot_cold_page+0x124/0x174
[18857093.557893] [cab2dc70] [00080008] 0x80008 (unreliable)
[18857093.622500] [cab2dc80] [c0078e34] free_hot_cold_page+0x34/0x174
[18857093.696474] [cab2dcc0] [c0079300] free_hot_cold_page_list+0x44/0x54
[18857093.774613] [cab2dce0] [c007c588] release_pages+0x74/0x1c8
[18857093.843378] [cab2dd30] [c008d500] tlb_flush_mmu+0x60/0x70
[18857093.911102] [cab2dd50] [c008d528] tlb_finish_mmu+0x18/0x44
[18857093.979866] [cab2dd60] [c0093cb8] exit_mmap+0xb8/0x11c
[18857094.044464] [cab2ddd0] [c0019514] mmput+0x3c/0xf4
[18857094.103855] [cab2ddf0] [c00a8878] flush_old_exec+0x514/0x58c
[18857094.174705] [cab2de20] [c00d2208] load_elf_binary+0x1f0/0xfa4
[18857094.246594] [cab2dea0] [c00a8308] search_binary_handler+0x16c/0x1c8
[18857094.324732] [cab2ded0] [c00a8fcc] do_execve+0x2f0/0x4f8
[18857094.390373] [cab2df20] [c00a93d4] SyS_execve+0x40/0x58
[18857094.454973] [cab2df40] [c000cb38] ret_from_syscall+0x0/0x3c
[18857094.524779] --- Exception: c01 at 0xfdb75b4
[18857094.524779]     LR = 0x10032c24
[18857094.618538] Instruction dump:
[18857094.657091] 8128000c 7cc43214 7f864800 41feffd4 2f8a0003
40fe0008 7c6a1b78 7c6903a6
[18857094.752982] 81280010 3863ffff 81690004 81890000 <916c0004>
918b0000 90090000 93e90004
[18857094.850954] ---[ end trace ea47a50e65f9635d ]---
[18857094.909294]
[18857094.930140] Fixing recursive fault but reboot is needed!

static void isolate_pcp_pages(int to_free, struct per_cpu_pages *src,
			      struct list_head *dst)
{
	int migratetype = 0, batch_free = 0;

	while (to_free) {
		struct page *page;
		struct list_head *list;

		/*
		 * Remove pages from lists in a round-robin fashion. A
		 * batch_free count is maintained that is incremented when an
		 * empty list is encountered.  This is so more pages are freed
		 * off fuller lists instead of spinning excessively around empty
		 * lists
		 */
		do {
			batch_free++;
			if (++migratetype == MIGRATE_PCPTYPES)
				migratetype = 0;
			list = &src->lists[migratetype];
		} while (list_empty(list));

		/* This is the only non-empty list. Free them all. */
		if (batch_free == MIGRATE_PCPTYPES)
			batch_free = to_free;

		do {
			page = list_last_entry(list, struct page, lru);
			list_del(&page->lru);
			list_add(&page->lru, dst);
		} while (--to_free && --batch_free && !list_empty(list));
	}
}

(gdb) disas isolate_pcp_pages
Dump of assembler code for function isolate_pcp_pages:
   0xc00755a0 <+0>: stwu    r1,-16(r1)
   0xc00755a4 <+4>: lis     r0,16
   0xc00755a8 <+8>: li      r10,0
   0xc00755ac <+12>: li      r7,0
   0xc00755b0 <+16>: ori     r0,r0,256
   0xc00755b4 <+20>: stw     r31,12(r1)
   0xc00755b8 <+24>: lis     r31,32
   0xc00755bc <+28>: ori     r31,r31,512
   0xc00755c0 <+32>: cmpwi   cr7,r3,0
   0xc00755c4 <+36>: bne+    cr7,0xc00755d4 <isolate_pcp_pages+52>
   0xc00755c8 <+40>: lwz     r31,12(r1)
   0xc00755cc <+44>: addi    r1,r1,16
   0xc00755d0 <+48>: blr
   0xc00755d4 <+52>: cmpwi   cr7,r7,2
   0xc00755d8 <+56>: addi    r10,r10,1
   0xc00755dc <+60>: addi    r7,r7,1
   0xc00755e0 <+64>: bne+    cr7,0xc00755e8 <isolate_pcp_pages+72>
   0xc00755e4 <+68>: li      r7,0
   0xc00755e8 <+72>: rlwinm  r8,r7,3,0,28
   0xc00755ec <+76>: addi    r6,r8,12
   0xc00755f0 <+80>: add     r8,r4,r8
   0xc00755f4 <+84>: lwz     r9,12(r8)
   0xc00755f8 <+88>: add     r6,r4,r6
   0xc00755fc <+92>: cmpw    cr7,r6,r9
   0xc0075600 <+96>: beq+    cr7,0xc00755d4 <isolate_pcp_pages+52>
   0xc0075604 <+100>: cmpwi   cr7,r10,3
   0xc0075608 <+104>: bne+    cr7,0xc0075610 <isolate_pcp_pages+112>
   0xc007560c <+108>: mr      r10,r3
   0xc0075610 <+112>: mtctr   r3
   0xc0075614 <+116>: lwz     r9,16(r8)
   0xc0075618 <+120>: addi    r3,r3,-1
   0xc007561c <+124>: lwz     r11,4(r9)
   0xc0075620 <+128>: lwz     r12,0(r9)
   0xc0075624 <+132>: stw     r11,4(r12)
   0xc0075628 <+136>: stw     r12,0(r11)
   0xc007562c <+140>: stw     r0,0(r9)
   0xc0075630 <+144>: stw     r31,4(r9)
   0xc0075634 <+148>: lwz     r11,0(r5)
   0xc0075638 <+152>: stw     r9,4(r11)
   0xc007563c <+156>: stw     r11,0(r9)
   0xc0075640 <+160>: stw     r5,4(r9)
   0xc0075644 <+164>: stw     r9,0(r5)
   0xc0075648 <+168>: bdz     0xc00755c0 <isolate_pcp_pages+32>
   0xc007564c <+172>: addic.  r10,r10,-1
   0xc0075650 <+176>: beq-    0xc00755c0 <isolate_pcp_pages+32>
   0xc0075654 <+180>: lwz     r9,12(r8)
   0xc0075658 <+184>: cmpw    cr7,r6,r9
   0xc007565c <+188>: bne+    cr7,0xc0075614 <isolate_pcp_pages+116>
   0xc0075660 <+192>: b       0xc00755c0 <isolate_pcp_pages+32>
End of assembler dump.

Below is another occurrence:
[18855563.899808] BUG: Bad page state in process appmon.sh  pfn:08349
[18855563.973857] page:c063e920 count:1 mapcount:1 mapping:ca8ab541 index:0xfc73
[18855564.059306] page flags: 0x80068(uptodate|lru|active|swapbacked)
[18855564.133354] Modules linked in: napt(O)
[18855564.181334] CPU: 1 PID: 259 Comm: appmon.sh Tainted: G
O 3.12.37-rt51 #1
[18855564.275116] Call Trace:
[18855564.307444] [ca39bce0] [c0005cd0] show_stack+0x54/0x13c (unreliable)
[18855564.386697] [ca39bd20] [c0365d90] dump_stack+0x74/0x94
[18855564.451332] [ca39bd30] [c007779c] bad_page+0xec/0xf0
[18855564.513884] [ca39bd40] [c0077d00] get_page_from_freelist+0x438/0x4f8
[18855564.593103] [ca39bde0] [c0078800] __alloc_pages_nodemask+0xf4/0x6a4
[18855564.671281] [ca39bea0] [c008fc10] handle_mm_fault+0x9cc/0xc1c
[18855564.743205] [ca39bf10] [c000f6a0] do_page_fault+0x304/0x468
[18855564.813141] [ca39bf40] [c000cff0] handle_page_fault+0xc/0x80
[18855564.884050] --- Exception: 301 at 0xfd812c8
[18855564.884050]     LR = 0xfea31f4
[18855564.976803] Disabling lock debugging due to kernel taint


B.R.
Yimin


