Re: new (to us) kernel panic nfsv4 linux 3.0.12

"Myklebust, Trond" <Trond.Myklebust@xxxxxxxxxx> · Wed, 7 Mar 2012 21:11:27 +0000

On Wed, 2012-03-07 at 15:53 -0500, Chuck Lever wrote:
> On Mar 7, 2012, at 3:49 PM, Myklebust, Trond wrote:
> 
> > On Wed, 2012-03-07 at 14:41 -0500, Paul Anderson wrote:
> >> The following kernel panic occurred on at least 4 compute nodes nearly
> >> simultaneously.  It was during unattended operation, so no clue as to
> >> what the server was doing.
> >> 
> >> The client node was under very heavy CPU load (12 core plus HT with
> >> 50-100 jobs running).  No swapping, unknown I/O but probably low,
> >> except for the set of slurm jobs that stopped in D state probably due
> >> to the kernel panic.
> >> 
> >> uname -> Linux c09 3.0.12 #1 SMP Wed Nov 30 19:42:40 EST 2011 x86_64 GNU/Linux
> >> 
> >> Please let me know what additional information I can provide - thanks!
> >> 
> >> Paul Anderson
> >> University of Michigan
> >> 
> >> [1411404.724301] nfs4_reclaim_open_state: Lock reclaim failed!
> >> [1412738.175791] nfs4_reclaim_open_state: Lock reclaim failed!
> >> [1412738.175805] general protection fault: 0000 [#1] SMP
> >> [1412738.176036] CPU 3
> >> [1412738.176112] Modules linked in: binfmt_misc ipmi_msghandler
> >> ipt_ULOG x_tables autofs4 mptctl mptbase dlm configfs dm_crypt nfsd
> >> nfs lockd xfs auth_rpcgss n
> >> [1412738.177205]
> >> [1412738.177297] Pid: 10473, comm: 192.168.1.16-ma Not tainted 3.0.12
> >> #1 Dell     C6100       /0D61XP
> >> [1412738.177683] RIP: 0010:[<ffffffffa02a8e00>]  [<ffffffffa02a8e00>]
> >> nfs4_do_reclaim+0x1c0/0x560 [nfs]
> >> [1412738.178074] RSP: 0018:ffff88100e651e00  EFLAGS: 00010287
> >> [1412738.178296] RAX: 0000000000000042 RBX: ffff88080dff5380 RCX:
> >> 000000000003ffff
> >> [1412738.178606] RDX: ffff88080dff53a0 RSI: 0000000000000082 RDI:
> >> 0000000000000246
> >> [1412738.178917] RBP: ffff88100e651e80 R08: 0000000000000000 R09:
> >> 0000000000000000
> >> [1412738.179227] R10: 0000000000000006 R11: 0000000000000000 R12:
> >> ffffffffa02b9c00
> >> [1412738.179537] R13: dead000000100100 R14: ffff88100e762a58 R15:
> >> ffff88100e762a00
> >> [1412738.179848] FS:  0000000000000000(0000) GS:ffff88083fc60000(0000)
> >> knlGS:0000000000000000
> >> [1412738.180192] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> [1412738.180428] CR2: 0000000001c89068 CR3: 000000100534f000 CR4:
> >> 00000000000006e0
> >> [1412738.180739] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >> 0000000000000000
> >> [1412738.181049] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >> 0000000000000400
> >> [1412738.181360] Process 192.168.1.16-ma (pid: 10473, threadinfo
> >> ffff88100e650000, task ffff8809a7ca8000)
> >> [1412738.181739] Stack:
> >> [1412738.181847]  ffff88080dff53a0 ffff88080dff53c0 ffff8808055cf4b0
> >> ffff8808055cf400
> >> [1412738.182192]  ffff88100e762a50 ffff88054ab0b2b0 ffff8808055cf4f8
> >> ffff88100e762a48
> >> [1412738.182538]  ffffffffa02b9ec8 ffff880ac2296008 ffff88100e651e80
> >> ffff8808055cf4f0
> >> [1412738.182882] Call Trace:
> >> [1412738.183015]  [<ffffffffa02a9424>] nfs4_run_state_manager+0x284/0x420 [nfs]
> >> [1412738.183298]  [<ffffffffa02a91a0>] ? nfs4_do_reclaim+0x560/0x560 [nfs]
> >> [1412738.183562]  [<ffffffff81080a96>] kthread+0x96/0xa0
> >> [1412738.183771]  [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
> >> [1412738.184927]  [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
> >> [1412738.185177]  [<ffffffff815ac120>] ? gs_change+0x13/0x13
> >> [1412738.185395] Code: 48 74 50 4d 8b 6d 00 4d 85 ed 75 df e8 2a a5 ee
> >> e0 48 8b 7d a8 e8 41 cf dd e0 4c 8b 6b 20 48 8d 53 20 49 39 d5 74 18
> >> 0f 1f 40 00
> >> [1412738.186187]  f6 45 18 01 0f 84 6a 03 00 00 4d 8b 6d 00 49 39 d5 75 ec 48
> >> [1412738.186646] RIP  [<ffffffffa02a8e00>] nfs4_do_reclaim+0x1c0/0x560 [nfs]
> >> [1412738.186926]  RSP <ffff88100e651e00>
> >> [1412738.187353] ---[ end trace 4dbb732d1756f6b1 ]---
> > 
> > 3.0 kernels are no longer supported as part of the stable kernel series,
> 
> I thought I just saw Greg KH post an e-mail calling for everyone to move to 3.0.

Oops, you are right. The bug I suspect is being hit above was addressed
by a patch that never went through stable:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=4b44b40e04a758e2242ff4a3f7c15982801ec8bc

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com
