My tests ran for 50 hours! That is a new record, and this run includes my up_write() before queue_ast() patch. It hit an error during a 2-node test (GFS on cl030 and cl031; cl032 was a member of the cluster but had no GFS file system mounted).

On the cl030 console:

SM: 00000001 sm_stop: SG still joined
SM: 01000410 sm_stop: SG still joined

/proc/cluster/status shows cl030 is not in the cluster.

On the cl031 console:

CMAN: node cl030a is not responding - removing from the cluster
dlm: stripefs: recover event 6388
CMAN: node cl030a is not responding - removing from the cluster
dlm: stripefs: recover event 6388
name " 5 54bdb0" flags 2 nodeid 0 ref 1
 G 00240122 gr 3 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,0
[60,000 lines of this]
------------[ cut here ]------------
kernel BUG at /Views/redhat-cluster/cluster/dlm-kernel/src/reccomms.c:128!
invalid operand: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod
CPU:    1
EIP:    0060:[<f8b3e243>]    Not tainted VLI
EFLAGS: 00010286   (2.6.9)
EIP is at rcom_send_message+0x193/0x250 [dlm]
eax: 00000001   ebx: c27813cc   ecx: c0456c0c   edx: 00000286
esi: da046eb4   edi: c27812d8   ebp: da046e90   esp: da046e6c
ds: 007b   es: 007b   ss: 0068
Process dlm_recoverd (pid: 28108, threadinfo=da046000 task=f6d656f0)
Stack: f8b44904 ffffff97 f8b46c60 f8b448ed 0af345bb ffffff97 c27812d8 da046000
       da046eb4 da046ee0 f8b3eff1 c27812d8 00000001 00000001 da046eb4 00000001
       c181f040 00000001 00150014 00000000 01000410 00000008 01000001 c7062300
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0106bd7>] do_invalid_op+0x107/0x110
 [<c0105e59>] error_code+0x2d/0x38
 [<f8b3eff1>] dlm_wait_status_low+0x71/0xa0 [dlm]
 [<f8b38e19>] nodes_reconfig_wait+0x29/0x80 [dlm]
 [<f8b39051>] ls_nodes_reconfig+0x161/0x350 [dlm]
 [<f8b4077b>] ls_reconfig+0x6b/0x250 [dlm]
 [<f8b41685>] do_ls_recovery+0x195/0x4a0 [dlm]
 [<f8b41a88>] dlm_recoverd+0xf8/0x100 [dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10
Code: 44 24 04 80 00 00 00 e8 dc 1c 5e c7 8b 45 f0 c7 04 24 f8 48 b4 f8 89 44 24 04 e8 c9 1c 5e c7 c7 04 24 04 49 b4 f8 e8 bd 1c 5e c7 <0f> 0b 80 00 60 6c b4 f8 c7 04 24 a0 6c b4 f8 e8 59 14 5e c7 89
<1>Unable to handle kernel paging request at virtual address 6b6b6b7b
 printing eip:
c011967a
*pde = 00000000
Oops: 0000 [#2]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod
CPU:    0
EIP:    0060:[<c011967a>]    Not tainted VLI
EFLAGS: 00010086   (2.6.9)
EIP is at task_rq_lock+0x2a/0x70
eax: 6b6b6b6b   ebx: c052e000   ecx: c2781350   edx: f6d656f0
esi: c0533020   edi: c052e000   ebp: eb6b8e9c   esp: eb6b8e90
ds: 007b   es: 007b   ss: 0068
Process cman_comms (pid: 3628, threadinfo=eb6b8000 task=eb9e0910)
Stack: c2781350 c27812d8 00000002 eb6b8ee4 c0119d92 f6d656f0 eb6b8ed4 0af34b37
       c0456ac8 00100100 00200200 0af34b37 00000001 dead4ead 00000000 c0129790
       eb9e0910 00000286 c2781350 c27812d8 00000002 eb6b8ef8 c011a02e f6d656f0
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0118683>] do_page_fault+0x293/0x7c1
 [<c0105e59>] error_code+0x2d/0x38
 [<c0119d92>] try_to_wake_up+0x22/0x2a0
 [<c011a02e>] wake_up_process+0x1e/0x30
 [<f8b41c28>] dlm_recoverd_stop+0x48/0x6b [dlm]
 [<f8b350c8>] release_lockspace+0x38/0x2f0 [dlm]
 [<f8b3541c>] dlm_emergency_shutdown+0x4c/0x70 [dlm]
 [<f8a8057a>] notify_kernel_listeners+0x5a/0x90 [cman]
 [<f8a8440e>] node_shutdown+0x5e/0x3c0 [cman]
 [<f8a8047a>] cluster_kthread+0x2aa/0x350 [cman]
 [<c0103325>] kernel_thread_helper+0x5/0x10
Code: 00 55 89 e5 83 ec 0c 89 1c 24 89 74 24 04 89 7c 24 08 8b 45 0c 9c 8f 00 fa be 20 30 53 c0 bb 00 e0 52 c0 8b 55 08 89 df 8b 42 04 <8b> 40 10 8b 0c 86 01 cf 89 f8 e8 e7 2c 2c 00 8b 55 08 8b 42 04

The cl032 console shows:

SM: process_reply invalid id=7783 nodeid=2
CMAN: quorum lost, blocking activity

The test was unmounting the GFS file system on cl030 when this occurred; according to /proc/mounts, the GFS file system is still mounted on cl031. The stack trace for umount on cl030 shows:

umount        D 00000008     0 10862  10856               (NOTLB)
e383de00 00000082 e383ddf0 00000008 00000002 e0b661e7 00000008 0000007d
f71b37f8 00000001 e383dde8 c011b77b e383dde0 c0119881 eb59d8b0 e0ba257b
c1716f60 00000001 00053db9 0fb9cfc1 0000a65f d678b790 d678b8f8 c1716f60
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8a92aee>] kcl_leave_service+0xfe/0x180 [cman]
 [<f8b35366>] release_lockspace+0x2d6/0x2f0 [dlm]
 [<f8b5215c>] release_gdlm+0x1c/0x30 [lock_dlm]
 [<f8b52464>] lm_dlm_unmount+0x24/0x50 [lock_dlm]
 [<f8964496>] lm_unmount+0x46/0xac [lock_harness]
 [<f8b0eb2f>] gfs_put_super+0x30f/0x3c0 [gfs]
 [<c0167f07>] generic_shutdown_super+0x1b7/0x1d0
 [<c0168c0d>] kill_block_super+0x1d/0x40
 [<c0167c10>] deactivate_super+0xa0/0xd0
 [<c017f6ac>] sys_umount+0x3c/0xa0
 [<c017f729>] sys_oldumount+0x19/0x20
 [<c010537d>] sysenter_past_esp+0x52/0x71

So my guess is that the umount on cl030 triggered the assert on cl031, and both nodes got kicked out of the cluster.

All the data is available here:

http://developer.osdl.org/daniel/GFS/panic.16dec2004/

I also included /proc/cluster/dlm_debug and sm_debug (not sure how to interpret the data in those).

Thoughts?

Daniel
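P.S. In case the up_write() before queue_ast() patch mentioned at the top isn't familiar: the change is just a reordering, so the write semaphore is dropped before the completion AST is queued instead of after. The little userspace sketch below shows the ordering I mean; the names (res_lock, queue_ast, completion_ast, grant_lock) and the pthread locking are stand-ins made up for this note, not the actual dlm-kernel code or the literal diff.

/*
 * Userspace sketch of the ordering only: made-up names, a pthread
 * rwlock instead of a kernel rw_semaphore, and a toy queue_ast()
 * that just invokes the callback directly.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t res_lock = PTHREAD_RWLOCK_INITIALIZER;

/* stand-in for AST delivery; assume it may run the callback at once */
static void queue_ast(void (*ast)(void))
{
	ast();
}

/* the completion AST wants the same lock the caller was just holding */
static void completion_ast(void)
{
	pthread_rwlock_rdlock(&res_lock);
	printf("completion AST delivered\n");
	pthread_rwlock_unlock(&res_lock);
}

static void grant_lock(void)
{
	pthread_rwlock_wrlock(&res_lock);
	/* ... update the resource while holding the write lock ... */
	pthread_rwlock_unlock(&res_lock);	/* "up_write()" first ...  */
	queue_ast(completion_ast);		/* ... then queue the AST  */
}

int main(void)
{
	grant_lock();
	return 0;
}

In the sketch, queueing the AST before the unlock would leave completion_ast() trying to take res_lock while grant_lock() still holds it for write; doing the unlock first avoids that ordering hazard. (Builds with cc -pthread.)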