Re: debugging kernel hang (can't type anything)

Hi,

I am running into a peculiar problem here.

I am getting stack-traces from two sources: i) the spin_lock BUG_ON for "cpu
recursion", and ii) a remote gdb running on my host and connected to the
guest kernel (the one I am debugging).

spin_lock BUG_ON stack-trace says:
BUG: spinlock recursion on CPU#0, qsbench/2169
 lock: c6783618, .magic: dead4ead, .owner: qsbench/2169, .owner_cpu: 0
Pid: 2169, comm: qsbench Not tainted 2.6.27.4-sg-kgdb #2
 [<c0644bb7>] ? printk+0xf/0x18
 [<c05096ae>] spin_bug+0x7c/0x87
 [<c0509766>] _raw_spin_lock+0x35/0xfa
 [<c0647509>] _spin_lock+0x3d/0x4b
 [<c048189e>] grab_swap_token+0x20c/0x246   <===
 [<c0477873>] handle_mm_fault+0x35b/0x70a
 [<c0649352>] ? do_page_fault+0x2a6/0x722
 [<c0649425>] do_page_fault+0x379/0x722
 [<c0445d82>] ? __lock_acquire+0x7bd/0x814
 [<c0445d82>] ? __lock_acquire+0x7bd/0x814
 [<c043f9c4>] ? getnstimeofday+0x3c/0xd6
....


And gdb shows:
#0  0xc0505f1b in delay_tsc (loops=1) at arch/x86/lib/delay.c:85
#1  0xc0505f77 in __udelay (usecs=2394097143) at arch/x86/lib/delay.c:118
#2  0xc6866d98 in ?? ()
#3  0xc05097ba in _raw_spin_lock (lock=0xc6783618) at lib/spinlock_debug.c:116
#4  0xc0647509 in _spin_lock_bh (lock=0xc6783628) at kernel/spinlock.c:113
#5  0xc048189e in dmam_pool_match (dev=<value optimized out>, res=0x2d5,
    match_data=0x0) at mm/dmapool.c:457        <===
#6  0x00000001 in ?? ()
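
(Side note: the delay_tsc()/__udelay() frames at the top seem to come
from the debug spin-loop in lib/spinlock_debug.c -- in my 2.6.27.4 tree,
when the trylock fails, _raw_spin_lock() falls back to something roughly
like the following (trimmed), which would explain why gdb keeps finding
the CPU inside arch/x86/lib/delay.c:)

static void __spin_lock_debug(spinlock_t *lock)
{
        u64 i;
        u64 loops = loops_per_jiffy * HZ;

        for (;;) {
                for (i = 0; i < loops; i++) {
                        if (__raw_spin_trylock(&lock->raw_lock))
                                return;
                        __delay(1);     /* ends up in delay_tsc() on x86 */
                }
                /* after spinning for about a second without getting the
                   lock it prints "BUG: spinlock lockup on CPU#..." and
                   dumps the stack, then keeps spinning */
        }
}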


If you look at the lines I have marked with arrows, the same address
0xc048189e is attributed to a different function in each case.
What is more surprising, System.map says
c0481692 T grab_swap_token
c04818d8 t dmam_pool_match
(consistent with the spin_lock BUG_ON trace: 0xc048189e falls between
these two symbols, so the nearest preceding symbol is grab_swap_token)


But on going through the output of 'objdump -s -l --source' I found
that 'dmam_pool_match' actually starts at address 0xc0481898, which is
below 0xc048189e (consistent with the gdb back-trace output).


I am confused about the source of the 'spinlock cpu recursion'. I have
checked the code I have written (the grab_swap_token modifications etc.
in mm/thrash.c), and it doesn't appear to be the source of the problem.
I haven't touched the function dmam_pool_match(), and it doesn't appear
anywhere in mm/thrash.c.




>>> But now I am getting a "BUG: spinlock recursion on CPU#0" error.  What
>>> does this error mean ?
>>
>> yes, it actually means that the previous lock has still not been released, and
>> now you are calling a function to acquire the spinlock again... and before
>> actually acquiring the spinlock, a check is made:
>>
>> lib/spinlock_debug.c:
>>
>> static inline void
>> debug_spin_lock_before(spinlock_t *lock)
>> {
>>         SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
>>         SPIN_BUG_ON(lock->owner == current, lock, "recursion");
>>         SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
>>                                                 lock, "cpu recursion");
>> }
>>
>> so the above "recursion" was detected, as the spinlock is still held
>> on the same CPU while being reacquired.  so instead of going into a tight
>> spin, it prints out the stack trace beforehand.
>>
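
For reference, the fields those checks compare are filled in only after
the lock has actually been taken; in the same file my tree has, roughly:

static inline void debug_spin_lock_after(spinlock_t *lock)
{
        lock->owner_cpu = raw_smp_processor_id();
        lock->owner = current;
}

So "recursion" should fire when the task that already owns the lock tries
to take it again, and "cpu recursion" when the owning CPU does.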

Does it mean that it's illegal for two processes (on the same cpu) to
spin on a lock that is held by someone else (again on the same cpu)?


-- 
Regards,
Sukanto Ghosh


