On Thu, Feb 19, 2009 at 8:14 AM, Peter Teoh <htmldeveloper@xxxxxxxxx> wrote:
> On Thu, Feb 19, 2009 at 3:19 AM, Sukanto Ghosh <sukanto.cse.iitb@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> I am running into a peculiar problem here.
>>
>> I am getting stack traces from two sources: i) the spin_lock BUG_ON "cpu
>> recursion" check, and ii) a remote gdb running on my host and connected
>> to the guest kernel (the one I am debugging).
>>
>> The spin_lock BUG_ON stack trace says:
>> BUG: spinlock recursion on CPU#0, qsbench/2169
>>  lock: c6783618, .magic: dead4ead, .owner: qsbench/2169, .owner_cpu: 0
>> Pid: 2169, comm: qsbench Not tainted 2.6.27.4-sg-kgdb #2
>>  [<c0644bb7>] ? printk+0xf/0x18
>>  [<c05096ae>] spin_bug+0x7c/0x87
>>  [<c0509766>] _raw_spin_lock+0x35/0xfa
>>  [<c0647509>] _spin_lock+0x3d/0x4b
>>  [<c048189e>] grab_swap_token+0x20c/0x246    <===
>>  [<c0477873>] handle_mm_fault+0x35b/0x70a
>>  [<c0649352>] ? do_page_fault+0x2a6/0x722
>>  [<c0649425>] do_page_fault+0x379/0x722
>>  [<c0445d82>] ? __lock_acquire+0x7bd/0x814
>>  [<c0445d82>] ? __lock_acquire+0x7bd/0x814
>>  [<c043f9c4>] ? getnstimeofday+0x3c/0xd6
>>  ....
>>
>> And gdb shows:
>> #0  0xc0505f1b in delay_tsc (loops=1) at arch/x86/lib/delay.c:85
>> #1  0xc0505f77 in __udelay (usecs=2394097143) at arch/x86/lib/delay.c:118
>> #2  0xc6866d98 in ?? ()
>> #3  0xc05097ba in _raw_spin_lock (lock=0xc6783618) at lib/spinlock_debug.c:116
>> #4  0xc0647509 in _spin_lock_bh (lock=0xc6783628) at kernel/spinlock.c:113
>> #5  0xc048189e in dmam_pool_match (dev=<value optimized out>, res=0x2d5,
>>     match_data=0x0) at mm/dmapool.c:457    <===
>> #6  0x00000001 in ?? ()
>>
>> If you look at the lines I have marked with arrows, the address
>> 0xc048189e is attributed to a different function in each trace.
>> What is more surprising, System.map says:
>> c0481692 T grab_swap_token
>> c04818d8 t dmam_pool_match
>> (consistent with the spin_lock BUG_ON trace).
>>
>> But going through the output of 'objdump -s -l --source' I found that
>> 'dmam_pool_match' starts at address 0xc0481898
>> (consistent with the gdb back-trace output).
>
> I must say I am quite surprised by this feature of --source ... quite
> useful, but I am not sure how to use it yet. It really takes too LONG;
> normally I do "objdump -rd", which is much faster. But for finding an
> address, both approaches should give the same result.
>
> And the address does indeed have some problem. Are you sure you did not
> use an old copy of these files? (As you said, these are from two
> different physical sources; compiled code may differ between machines.)
> Normally, System.map's addresses should match the addresses from
> "objdump -rd". In your case it is c04818d8 vs c0481898, which is weird.
> Mine are identical.

Actually, the addresses produced by 'objdump -rd' do match those in
System.map. It was 'objdump -s -l --source' that was producing different
addresses. (It prints source fragments along with the disassembled code,
but I don't understand why it reports different addresses.)

I am compiling on the test virtual machine (which runs my test kernel) and
sending the vmlinux generated there to the host, for use with gdb.

Another finding: when I set a breakpoint at the 'grab_swap_token'
function, the address shown by gdb is different from the one in
System.map.

>>
>> I am confused about the source of the 'spinlock cpu recursion'. I have
>> checked the code I have written (the grab_swap_token modifications etc.
>> in mm/thrash.c), and it doesn't appear to be the source of the problem.
>> I haven't touched the function dmam_pool_match(), and it doesn't appear
>> in mm/thrash.c at all.
>>
>> >>> But now I am getting a "BUG: spinlock recursion on CPU#0" error.
>> >>> What does this error mean?
>> >>
>> >> Yes, it actually means that the previous lock has still not been
>> >> released, and now you are calling a function that acquires the
>> >> spinlock again. Before actually acquiring the spinlock, a check is
>> >> made:
>> >>
>> >> lib/spinlock_debug.c:
>> >>
>> >> static inline void
>> >> debug_spin_lock_before(spinlock_t *lock)
>> >> {
>> >>         SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
>> >>         SPIN_BUG_ON(lock->owner == current, lock, "recursion");
>> >>         SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(), lock,
>> >>                     "cpu recursion");
>> >> }
>> >>
>> >> So the "recursion" above was detected because the spinlock is being
>> >> reacquired on the same CPU that already holds it. Instead of going
>> >> into a tight spin, the kernel prints the stack trace beforehand.
>>
>> Does it mean that it's illegal for two processes (on the same CPU) to
>> spin on a lock that is held by someone else (again on the same CPU)?
>
> Yes, but it is very confusing. Normally, when you are talking about a
> spinlock on one CPU, the process will not be switched off the CPU until
> it has finished and released the spinlock. The rule is that a spinlock is
> meant for multi-CPU synchronization, NOT for multiple processes on one
> CPU. In fact, if another process were switched in and found the spinlock
> held by another process, the CPU would (in the past) go into a deadly
> spin and loop until the spinlock was unlocked; but the unlock can never
> happen, because the CPU is busy spinning. Now I think the code is wiser:
> it throws an error instead. If you find this happening, I think you have
> found a bug :-)

It was my fault again, something similar to what I had done earlier. And
this one doesn't hang... so no bug found ;-)  (I have appended a minimal
sketch of the kind of double-lock I mean at the end of this mail.)

BTW, I am using earlyprintk to redirect the printk output to the serial
port and log it, but it is also printing the output to the console. What
could be the reason? (I want the printk output to go to the usual logs +
ttyS0, and not to the console.)

--
Regards,
Sukanto Ghosh
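
P.S. For anyone who finds this thread in the archives: below is a minimal
sketch of the kind of double-acquire that debug_spin_lock_before() reports
as "recursion" when CONFIG_DEBUG_SPINLOCK is enabled. It is not my actual
patch; the lock and function names (my_lock, inner_helper, outer_path) are
made up for illustration.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);        /* hypothetical lock */

static void inner_helper(void)
{
        /*
         * Second acquire: the caller already holds my_lock on this CPU,
         * so lock->owner == current and the debug code prints
         * "BUG: spinlock recursion" instead of spinning forever.
         */
        spin_lock(&my_lock);
        /* ... */
        spin_unlock(&my_lock);
}

static void outer_path(void)
{
        spin_lock(&my_lock);            /* first acquire */
        inner_helper();                 /* re-enters spin_lock() on the same CPU */
        spin_unlock(&my_lock);
}

Without spinlock debugging the second spin_lock() would simply deadlock
that CPU; with it, the owner/owner_cpu checks fire and a trace like the
one above is printed.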