Re: debugging kernel hang (can't type anything)

On Thu, Feb 19, 2009 at 3:19 AM, Sukanto Ghosh <sukanto.cse.iitb@xxxxxxxxx> wrote:
Hi,

I am running into a peculiar problem here.

I am getting stack-traces from two sources, i) spin_lock BUG_ON "cpu
recursion", and ii) remote gdb running on my host and connected to the
guest kernel (the one I am debugging).

spin_lock BUG_ON stack-trace says:
BUG: spinlock recursion on CPU#0, qsbench/2169
 lock: c6783618, .magic: dead4ead, .owner: qsbench/2169, .owner_cpu: 0
Pid: 2169, comm: qsbench Not tainted 2.6.27.4-sg-kgdb #2
 [<c0644bb7>] ? printk+0xf/0x18
 [<c05096ae>] spin_bug+0x7c/0x87
 [<c0509766>] _raw_spin_lock+0x35/0xfa
 [<c0647509>] _spin_lock+0x3d/0x4b
 [<c048189e>] grab_swap_token+0x20c/0x246   <===
 [<c0477873>] handle_mm_fault+0x35b/0x70a
 [<c0649352>] ? do_page_fault+0x2a6/0x722
 [<c0649425>] do_page_fault+0x379/0x722
 [<c0445d82>] ? __lock_acquire+0x7bd/0x814
 [<c0445d82>] ? __lock_acquire+0x7bd/0x814
 [<c043f9c4>] ? getnstimeofday+0x3c/0xd6
....


And gdb shows:
#0  0xc0505f1b in delay_tsc (loops=1) at arch/x86/lib/delay.c:85
#1  0xc0505f77 in __udelay (usecs=2394097143) at arch/x86/lib/delay.c:118
#2  0xc6866d98 in ?? ()
#3  0xc05097ba in _raw_spin_lock (lock=0xc6783618) at lib/spinlock_debug.c:116
#4  0xc0647509 in _spin_lock_bh (lock=0xc6783628) at kernel/spinlock.c:113
#5  0xc048189e in dmam_pool_match (dev=<value optimized out>, res=0x2d5,
   match_data=0x0) at mm/dmapool.c:457        <===
#6  0x00000001 in ?? ()


If you notice the lines I have marked with arrows, the address
0xc048189e is attributed to a different function in each case.
What is more surprising, System.map says
c0481692 T grab_swap_token
c04818d8 t dmam_pool_match
(consistent with the spin_lock BUG_ON trace)


But on going through the output of 'objdump -s -l --source' I found
that 'dmam_pool_match' starts at address 0xc0481898.
(consistent with the gdb back-trace output)

I must say I am quite surprised by this feature of --source: quite useful, though I am not yet sure how best to use it. It is really slow, though; normally I do "objdump -rd", which is much faster. But for finding the address, both approaches give the same result.

And the addresses do indeed have a problem. Are you sure you did not use an old copy of one of these files? (As you said, these come from two different sources, and compiled code may differ between machines.) Normally, System.map's addresses should match "objdump -rd"'s addresses. In your case it is c04818d8 vs c0481898, which is weird. Mine are identical.




I am confused about the source of the 'spinlock cpu recursion'. I have
checked the code I have written (grab_swap_token modifications etc. in
mm/thrash.c); it doesn't appear to be the source of the problem.  I
haven't touched the function dmam_pool_match() and it doesn't appear
in mm/thrash.c.




>>> But now I am getting a "BUG: spinlock recursion on CPU#0" error.  What
>>> does this error mean ?
>>
>> Yes, it actually means that the previous lock has still not been released,
>> and now you are calling a function to acquire the spinlock again. Before
>> actually acquiring the spinlock, a check is made:
>>
>> lib/spinlock_debug.c:
>>
>> static inline void
>> debug_spin_lock_before(spinlock_t *lock)
>> {
>>         SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
>>         SPIN_BUG_ON(lock->owner == current, lock, "recursion");
>>         SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
>>                     lock, "cpu recursion");
>> }
>>
>> So the "cpu recursion" above was detected because the spinlock is still
>> held on the same CPU while being reacquired.  Instead of going into a
>> tight spin, the code prints out the stack trace beforehand.
>>
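The three checks quoted above can be mocked in userspace to show which message fires in which situation. This is only an illustrative sketch (the struct and function names here are made up; SPINLOCK_MAGIC really is 0xdead4ead in the kernel, matching the ".magic: dead4ead" in the BUG output):

```c
#include <string.h>

/* Userspace mock of the kernel's pre-acquire debug checks; the field
 * names mirror spinlock_t's debug members. */
#define SPINLOCK_MAGIC 0xdead4eadU

struct dbg_lock {
    unsigned int magic;
    const void *owner;   /* task currently holding the lock, or NULL */
    int owner_cpu;       /* CPU it was taken on, or -1 if unlocked */
};

/* Returns the message the kernel would print, or NULL if the checks
 * pass and the acquire may proceed. */
static const char *spin_bug_check(const struct dbg_lock *lock,
                                  const void *current_task, int this_cpu)
{
    if (lock->magic != SPINLOCK_MAGIC)
        return "bad magic";
    if (lock->owner == current_task)
        return "recursion";          /* same task re-takes its own lock */
    if (lock->owner_cpu == this_cpu)
        return "cpu recursion";      /* same CPU, different task */
    return NULL;
}
```

With a lock whose owner_cpu is 0, a second task trying to take it on CPU 0 hits "cpu recursion", which is precisely the message in the BUG report above (owner qsbench/2169, owner_cpu 0, faulting on CPU#0).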

Does it mean that it's illegal for a process to spin on a lock that is
held by another process on the same CPU?

Yes, but it is very confusing.  Normally, when you take a spinlock on one CPU, the process will not be switched off that CPU until it has finished and released the spinlock.  The rule is that spinlocks are meant for multi-CPU synchronization, NOT for multiple processes on one CPU.  In fact, if another process were switched in and found the spinlock held, the CPU would (in the past) go into a deadly spin, looping until the spinlock was unlocked; but the unlock can never happen, because that one CPU is busy spinning.  The code is wiser now: it throws an error instead.  If you find this happening, I think you have found a bug. :-)
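The single-CPU scenario described above can be reduced to a toy simulation (purely illustrative userspace C, not kernel code): the holder is "switched out" and can never run again on the one CPU, so the waiter's spin is bounded here only by an artificial iteration cap.

```c
#include <stdbool.h>

struct toy_lock { bool locked; };

/* Simulates a waiter spinning on one CPU while the lock holder is
 * switched out.  Nothing in this loop can clear 'locked', so the spin
 * runs until the cap; on real hardware it would spin forever. */
static long spin_until_cap(const struct toy_lock *lock, long cap)
{
    long spins = 0;
    while (lock->locked && spins < cap)
        spins++;   /* the holder never gets CPU time to unlock */
    return spins;
}
```

The waiter always exhausts the cap, which is the deadlock the debug check above turns into a "cpu recursion" report instead of a silent hang.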

 


--
Regards,
Sukanto Ghosh



--
Regards,
Peter Teoh
