On Wed, 5 Dec 2012, Sven-Thorsten Dietrich wrote:
> This is the softlockup I am seeing on one of our HP blades.
>
> I haven't fully ruled out bad hardware, trying to reproduce on
> another machine.
>
> Sven
>
> [  128.371195] BUG: soft lockup - CPU#9 stuck for 22s! [git:6333]
> [  132.387637] BUG: soft lockup - CPU#10 stuck for 23s! [agetty:674]
> [  144.398987] BUG: soft lockup - CPU#11 stuck for 22s! [flush-8:0:336]
> [  156.353376] BUG: soft lockup - CPU#9 stuck for 22s! [git:6333]
> [  160.369814] BUG: soft lockup - CPU#10 stuck for 22s! [agetty:674]
> [  192.330459] BUG: soft lockup - CPU#9 stuck for 23s! [git:6333]
> [  192.349444] BUG: soft lockup - CPU#10 stuck for 23s! [agetty:674]
> [  192.368428] BUG: soft lockup - CPU#11 stuck for 23s! [flush-8:0:336]
>
> [  195.632116] BUG: spinlock lockup suspected on CPU#9, git/6333
> [  195.632122] general protection fault: 0000 [#1] PREEMPT SMP

So we fault in spin_dump. Which is not surprising when we decode the
faulting instruction:

    44 8b 83 e4 02 00 00    mov    0x2e4(%rbx),%r8d

> [  195.632138] RIP: 0010:[<ffffffff816438c1>]  [<ffffffff816438c1>] spin_dump+0x56/0x91
> [  195.632138] RSP: 0000:ffff880be0077818  EFLAGS: 00010206
> [  195.632139] RAX: 0000000000000031 RBX: 1067a77cb2247fcc RCX: 0000000000000871

RBX contains a random number. Ditto in the next dump on CPU10:

> [  200.084385] BUG: spinlock lockup suspected on CPU#10, agetty/674
> [  200.084388] general protection fault: 0000 [#2] PREEMPT SMP
> [  200.084403] RIP: 0010:[<ffffffff816438c1>]  [<ffffffff816438c1>] spin_dump+0x56/0x91
> [  200.084403] RSP: 0018:ffff8805e03877a8  EFLAGS: 00010286
> [  200.084404] RAX: 0000000000000034 RBX: cdc5c4fabb8bf87b RCX: 00000000000008d5

0000000000000000 <spin_dump>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   41 54                   push   %r12
   6:   49 89 fc                mov    %rdi,%r12
   9:   53                      push   %rbx
   a:   48 8b 5f 10             mov    0x10(%rdi),%rbx

RBX is initialized with lock->owner (offset 0x10 of the lock).

   e:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
  15:   48 8d 43 ff             lea    -0x1(%rbx),%rax
  19:   48 83 f8 fe             cmp    $0xfffffffffffffffe,%rax
  1d:   b8 00 00 00 00          mov    $0x0,%eax
  22:   48 0f 43 d8             cmovae %rax,%rbx
  26:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
  2d:   00 00
  2f:   44 8b 80 e4 02 00 00    mov    0x2e4(%rax),%r8d
  36:   48 8d 88 90 04 00 00    lea    0x490(%rax),%rcx
  3d:   31 c0                   xor    %eax,%eax
  3f:   65 8b 14 25 00 00 00    mov    %gs:0x0,%edx
  46:   00
  47:   e8 00 00 00 00          callq  4c <spin_dump+0x4c>
  4c:   48 85 db                test   %rbx,%rbx
  4f:   45 8b 4c 24 08          mov    0x8(%r12),%r9d

Here we read lock->owner_cpu into R9. Random numbers as well:

    R09: 000000004642dad1
    R09: 0000000017f07438

  54:   74 10                   je     66 <spin_dump+0x66>
  56:   44 8b 83 e4 02 00 00    mov    0x2e4(%rbx),%r8d

And of course here we crash.
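For reference, spin_dump() in lib/spinlock_debug.c is roughly the
following (paraphrased from the 3.x sources, so details may differ
slightly):

    static void spin_dump(raw_spinlock_t *lock, const char *msg)
    {
            struct task_struct *owner = NULL;

            /* mov 0x10(%rdi),%rbx: lock->owner lives at offset 0x10 */
            if (lock->owner && lock->owner != SPINLOCK_OWNER_INIT)
                    owner = lock->owner;

            printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
                    msg, raw_smp_processor_id(),
                    current->comm, task_pid_nr(current));

            /*
             * task_pid_nr(owner) dereferences the scribbled owner
             * pointer: that's the mov 0x2e4(%rbx),%r8d which takes
             * the general protection fault.
             */
            printk(KERN_EMERG " lock: %pS, .magic: %08x, .owner: %s/%d, "
                            ".owner_cpu: %d\n",
                    lock, lock->magic,
                    owner ? owner->comm : "<none>",
                    owner ? task_pid_nr(owner) : -1,
                    lock->owner_cpu);
    }

So any scribble which leaves a non-NULL, non SPINLOCK_OWNER_INIT value
in lock->owner blows up exactly there, which fits the random RBX
values above.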
Let's look at the call chain:

> [  200.084416]  [<ffffffff81343189>] do_raw_spin_lock+0xf9/0x140
> [  200.084417]  [<ffffffff81649f44>] _raw_spin_lock+0x44/0x50
> [  200.084418]  [<ffffffff81648d63>] ? rt_spin_lock_slowlock+0x43/0x380
> [  200.084420]  [<ffffffff81648d63>] ? rt_spin_lock_slowlock+0x43/0x380
> [  200.084421]  [<ffffffff81648d63>] rt_spin_lock_slowlock+0x43/0x380
> [  200.084422]  [<ffffffff81649817>] rt_spin_lock+0x27/0x60
> [  200.084424]  [<ffffffff8113f4bd>] __lru_cache_add+0x5d/0x1f0

That's the per cpu local lock swap_lock protecting the pagevec
operations. So something is corrupting the per cpu locks really badly.

The lock addresses look reasonable:

    CPU9:  R12: ffff880bc1867c00
    CPU10: R12: ffff880bc1887c00
    CPU11: R12: ffff880bc18a7c00

That's a spacing of 0x20000 per cpu.

I really have no idea what scribbles over those locks. Can you check
what is next to those locks in the per_cpu area?

Thanks,

	tglx