PROBLEM: Kernel oops -- IP: [<ffffffff800cddfa>] kfree+0x5a/0x200

Fanhenglong <fanhenglong@xxxxxxxxxx> · Tue, 9 Apr 2013 09:04:49 +0000

Hi,
Full description of the problem:
Kernel version: 2.6.32.36
Oops information:
[9638271.695663] BUG: unable to handle kernel paging request at 0000000000a3ad90

[9638271.695685] IP: [<ffffffff800cddfa>] kfree+0x5a/0x200

[9638271.695701] PGD f94ff067 PUD fd652067 PMD 0 

[9638271.695707] Oops: 0000 [#1] SMP

[9638271.695712] last sysfs file: /sys/devices/xen-backend/vbd-415-51776/statistics/wr_sect
Trap number:14, message:Oops

Error num: 0

Sigal Num:11_SIGSEGV

Event ID:DIE_OOPS

RIP: e030:[<ffffffff800cddfa>]

<ffffffff800cddfa>{kfree+0x5a}

RSP: e02b:ffff88001ce65da8  EFLAGS: 00010006

RAX: 0000000000a3ad90 RBX: 0000000000000000 RCX: 00000000000002eb

RDX: 00000000001761f0 RSI: 00000000000002eb RDI: ffff88002ec3e3e0

RBP: fffffffffffffffe R08: 0000000000000000 R09: ffff88002ec3e3e0

R10: ffffffffffffffff R11: ffffffff801b0e50 R12: 0000000000008001

R13: 0000000000000024 R14: 00000000ffffff9c R15: ffff88001ce65e48

FS:  00007fbe05e71700(0000) GS:ffff880002008000(0000) knlGS:0000000000000000

CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 0000000000a3ad90 CR3: 00000000f9009000 CR4: 0000000000002620

DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

<kernel_trace>

       <ffffffff80009b05>{dump_trace+0x65}

       <ffffffff8037d897>{notifier_call_chain+0x37}

       <ffffffff8005a1ed>{notify_die+0x2d}

       <ffffffff8037bd0b>{__die+0x8b}

       <ffffffff8001bed1>{no_context+0xd1}

       <ffffffff8001c1f5>{__bad_area_nosemaphore+0x175}

       <ffffffff8037b298>{page_fault+0x28}

       <ffffffff800cddfa>{kfree+0x5a}

       <ffffffff800da03d>{put_filp+0x1d}

       <ffffffff800e7133>{do_filp_open+0x723}

       <ffffffff800d62b7>{do_sys_open+0x97}

       <ffffffff80007378>{system_call_fastpath+0x16}

       [<00007fbe059c8040>]

</kernel_trace>

The following is my own preliminary analysis:
crash> dis kfree

0xffffffff800cdda0 <kfree>:     push   %r15

0xffffffff800cdda2 <kfree+2>:   push   %r14

0xffffffff800cdda4 <kfree+4>:   push   %r13

0xffffffff800cdda6 <kfree+6>:   push   %r12

0xffffffff800cdda8 <kfree+8>:   push   %rbp

0xffffffff800cdda9 <kfree+9>:   push   %rbx

0xffffffff800cddaa <kfree+10>:  sub    $0x18,%rsp

0xffffffff800cddae <kfree+14>:  cmp    $0x10,%rdi

0xffffffff800cddb2 <kfree+18>:  mov    %rdi,0x8(%rsp)

0xffffffff800cddb7 <kfree+23>:  jbe    0xffffffff800cde7c <kfree+220>

0xffffffff800cddbd <kfree+29>:  mov    %gs:0x67c1,%al

0xffffffff800cddc5 <kfree+37>:  movb   $0x1,%gs:0x67c1

0xffffffff800cddce <kfree+46>:  mov    %al,0x17(%rsp)

0xffffffff800cddd2 <kfree+50>:  mov    0x8(%rsp),%rdi

0xffffffff800cddd7 <kfree+55>:  mov    0x758872(%rip),%rbx        # 0xffffffff80826650

0xffffffff800cddde <kfree+62>:  callq  0xffffffff800228e0 <__phys_addr>

0xffffffff800cdde3 <kfree+67>:  shr    $0xc,%rax

0xffffffff800cdde7 <kfree+71>:  lea    0x0(,%rax,8),%rdx

0xffffffff800cddef <kfree+79>:  shl    $0x6,%rax

0xffffffff800cddf3 <kfree+83>:  sub    %rdx,%rax

0xffffffff800cddf6 <kfree+86>:  lea    (%rbx,%rax,1),%rax

0xffffffff800cddfa <kfree+90>:  mov    (%rax),%rdx

0xffffffff800cddfd <kfree+93>:  test   $0x20000,%edx

0xffffffff800cde03 <kfree+99>:  je     0xffffffff800cde1b <kfree+123>

0xffffffff800cde05 <kfree+101>: mov    0x10(%rax),%rax

0xffffffff800cde09 <kfree+105>: mov    (%rax),%rdx

0xffffffff800cde0c <kfree+108>: test   $0x20000,%edx

0xffffffff800cde12 <kfree+114>: je     0xffffffff800cde1b <kfree+123>

......
Normally %rbx should hold the value of mem_map, which is fixed on my system: the mem_map variable is located at 0xffffffff80826650 and its value is 0xffff880004802000.
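The address computation at kfree+67 through kfree+86 can be replayed with the register values from the oops. This is only a rough sketch: PAGE_OFFSET is an assumed direct-map base for this kernel, the helper name page_struct_addr is made up for illustration, and it assumes RDI still held the pointer passed to kfree at the time of the fault.

```python
# Replay of the struct page lookup in the kfree disassembly.
PAGE_SHIFT = 12                       # shr $0xc,%rax
PAGE_OFFSET = 0xFFFF880000000000      # ASSUMPTION: direct-map base of this kernel
STRUCT_PAGE_SIZE = 64 - 8             # pfn*64 - pfn*8, i.e. 56 bytes per struct page
MASK64 = 0xFFFFFFFFFFFFFFFF


def page_struct_addr(vaddr, mem_map):
    """Hypothetical helper: mem_map + pfn * sizeof(struct page)."""
    pfn = (vaddr - PAGE_OFFSET) >> PAGE_SHIFT
    return (mem_map + pfn * STRUCT_PAGE_SIZE) & MASK64


# Values copied from the register dump and from the text.
rdi = 0xFFFF88002EC3E3E0              # pointer being freed
rdx = 0x00000000001761F0              # pfn*8 left over from kfree+71
cr2 = 0x0000000000A3AD90              # faulting address
mem_map = 0xFFFF880004802000          # value reported for this system

pfn = (rdi - PAGE_OFFSET) >> PAGE_SHIFT
with_zero_base = page_struct_addr(rdi, 0)
with_mem_map = page_struct_addr(rdi, mem_map)

print("pfn                 =", hex(pfn))
print("pfn*8 matches RDX   =", pfn * 8 == rdx)
print("address, base 0     =", hex(with_zero_base))
print("address, mem_map    =", hex(with_mem_map))
print("base 0 matches CR2  =", with_zero_base == cr2)
```

Under these assumptions the offset pfn*56 alone equals the faulting address in CR2, which agrees with RBX being zero in the register dump.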
But here %rbx was 0x0000000000000000. In my opinion, the possible causes are:
1. mem_map was changed for an unknown reason, so the value loaded into %rbx was wrong.

2. mem_map was correct, but %rip was wrong, so the wrong value was loaded into %rbx.

3. mem_map and %rip were both correct, but %rbx was changed after the load.
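For cause 2, one small cross-check is that the RIP-relative load at kfree+55 is encoded to read from the mem_map variable. On x86-64 the effective address is the address of the next instruction plus the displacement. The sketch below uses only addresses from the disassembly; it shows that the instruction as disassembled is consistent, not what the CPU actually executed when the oops happened.

```python
# RIP-relative target of: mov 0x758872(%rip),%rbx  at kfree+55
MASK64 = 0xFFFFFFFFFFFFFFFF

insn_addr = 0xFFFFFFFF800CDDD7        # kfree+55, the mov itself
next_insn = 0xFFFFFFFF800CDDDE        # kfree+62, the following callq
disp = 0x758872                       # displacement in the mov
mem_map_symbol = 0xFFFFFFFF80826650   # address of mem_map given in the text

insn_len = next_insn - insn_addr
target = (next_insn + disp) & MASK64

print("instruction length  =", insn_len)
print("load target         =", hex(target))
print("target is mem_map   =", target == mem_map_symbol)
```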
As a test, I set mem_map to 0x0000000000000000: the kernel panicked immediately but could not produce a vmcore, whereas the original problem did produce one (unfortunately, that vmcore has since been lost through carelessness).
So cause 1 can be excluded. That leaves causes 2 and 3, but I don't know how either of them could happen.
I was not doing anything on the system before the panic, and I can't reproduce the problem.