My tests ran for 50 hours! That is a new record, and this run includes my up_write() before queue_ast() patch. It hit an error during a 2-node test (GFS on cl030 and cl031; cl032 was a member of the cluster but had no GFS file system mounted).

On the cl030 console:

SM: 00000001 sm_stop: SG still joined
SM: 01000410 sm_stop: SG still joined

/proc/cluster/status shows cl030 is not in the cluster.

On the cl031 console:

CMAN: node cl030a is not responding - removing from the cluster
dlm: stripefs: recover event 6388
CMAN: node cl030a is not responding - removing from the cluster
dlm: stripefs: recover event 6388
name " 5 54bdb0" flags 2 nodeid 0 ref 1
 G 00240122 gr 3 rq -1 flg 0 sts 2 node 0 remid 0 lq 0,0
[60,000 lines of this]
------------[ cut here ]------------
kernel BUG at /Views/redhat-cluster/cluster/dlm-kernel/src/reccomms.c:128!
invalid operand: 0000 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod
CPU:    1
EIP:    0060:[<f8b3e243>]    Not tainted VLI
EFLAGS: 00010286   (2.6.9)
EIP is at rcom_send_message+0x193/0x250 [dlm]
eax: 00000001   ebx: c27813cc   ecx: c0456c0c   edx: 00000286
esi: da046eb4   edi: c27812d8   ebp: da046e90   esp: da046e6c
ds: 007b   es: 007b   ss: 0068
Process dlm_recoverd (pid: 28108, threadinfo=da046000 task=f6d656f0)
Stack: f8b44904 ffffff97 f8b46c60 f8b448ed 0af345bb ffffff97 c27812d8 da046000
       da046eb4 da046ee0 f8b3eff1 c27812d8 00000001 00000001 da046eb4 00000001
       c181f040 00000001 00150014 00000000 01000410 00000008 01000001 c7062300
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0106bd7>] do_invalid_op+0x107/0x110
 [<c0105e59>] error_code+0x2d/0x38
 [<f8b3eff1>] dlm_wait_status_low+0x71/0xa0 [dlm]
 [<f8b38e19>] nodes_reconfig_wait+0x29/0x80 [dlm]
 [<f8b39051>] ls_nodes_reconfig+0x161/0x350 [dlm]
 [<f8b4077b>] ls_reconfig+0x6b/0x250 [dlm]
 [<f8b41685>] do_ls_recovery+0x195/0x4a0 [dlm]
 [<f8b41a88>] dlm_recoverd+0xf8/0x100 [dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10
Code: 44 24 04 80 00 00 00 e8 dc 1c 5e c7 8b 45 f0 c7 04 24 f8 48 b4 f8 89 44 24 04 e8 c9 1c 5e c7 c7 04 24 04 49 b4 f8 e8 bd 1c 5e c7 <0f> 0b 80 00 60 6c b4 f8 c7 04 24 a0 6c b4 f8 e8 59 14 5e c7 89
<1>Unable to handle kernel paging request at virtual address 6b6b6b7b
 printing eip:
c011967a
*pde = 00000000
Oops: 0000 [#2]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod
CPU:    0
EIP:    0060:[<c011967a>]    Not tainted VLI
EFLAGS: 00010086   (2.6.9)
EIP is at task_rq_lock+0x2a/0x70
eax: 6b6b6b6b   ebx: c052e000   ecx: c2781350   edx: f6d656f0
esi: c0533020   edi: c052e000   ebp: eb6b8e9c   esp: eb6b8e90
ds: 007b   es: 007b   ss: 0068
Process cman_comms (pid: 3628, threadinfo=eb6b8000 task=eb9e0910)
Stack: c2781350 c27812d8 00000002 eb6b8ee4 c0119d92 f6d656f0 eb6b8ed4 0af34b37
       c0456ac8 00100100 00200200 0af34b37 00000001 dead4ead 00000000 c0129790
       eb9e0910 00000286 c2781350 c27812d8 00000002 eb6b8ef8 c011a02e f6d656f0
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0118683>] do_page_fault+0x293/0x7c1
 [<c0105e59>] error_code+0x2d/0x38
 [<c0119d92>] try_to_wake_up+0x22/0x2a0
 [<c011a02e>] wake_up_process+0x1e/0x30
 [<f8b41c28>] dlm_recoverd_stop+0x48/0x6b [dlm]
 [<f8b350c8>] release_lockspace+0x38/0x2f0 [dlm]
 [<f8b3541c>] dlm_emergency_shutdown+0x4c/0x70 [dlm]
 [<f8a8057a>] notify_kernel_listeners+0x5a/0x90 [cman]
 [<f8a8440e>] node_shutdown+0x5e/0x3c0 [cman]
 [<f8a8047a>] cluster_kthread+0x2aa/0x350 [cman]
 [<c0103325>] kernel_thread_helper+0x5/0x10
Code: 00 55 89 e5 83 ec 0c 89 1c 24 89 74 24 04 89 7c 24 08 8b 45 0c 9c 8f 00 fa be 20 30 53 c0 bb 00 e0 52 c0 8b 55 08 89 df 8b 42 04 <8b> 40 10 8b 0c 86 01 cf 89 f8 e8 e7 2c 2c 00 8b 55 08 8b 42 04

The cl032 console shows:

SM: process_reply invalid id=7783 nodeid=2
CMAN: quorum lost, blocking activity

The test was unmounting the GFS file system on cl030 when this occurred; according to /proc/mounts, the GFS file system is still mounted on cl031. The stack trace for umount on cl030 shows:

umount        D 00000008     0 10862  10856               (NOTLB)
e383de00 00000082 e383ddf0 00000008 00000002 e0b661e7 00000008 0000007d
f71b37f8 00000001 e383dde8 c011b77b e383dde0 c0119881 eb59d8b0 e0ba257b
c1716f60 00000001 00053db9 0fb9cfc1 0000a65f d678b790 d678b8f8 c1716f60
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8a92aee>] kcl_leave_service+0xfe/0x180 [cman]
 [<f8b35366>] release_lockspace+0x2d6/0x2f0 [dlm]
 [<f8b5215c>] release_gdlm+0x1c/0x30 [lock_dlm]
 [<f8b52464>] lm_dlm_unmount+0x24/0x50 [lock_dlm]
 [<f8964496>] lm_unmount+0x46/0xac [lock_harness]
 [<f8b0eb2f>] gfs_put_super+0x30f/0x3c0 [gfs]
 [<c0167f07>] generic_shutdown_super+0x1b7/0x1d0
 [<c0168c0d>] kill_block_super+0x1d/0x40
 [<c0167c10>] deactivate_super+0xa0/0xd0
 [<c017f6ac>] sys_umount+0x3c/0xa0
 [<c017f729>] sys_oldumount+0x19/0x20
 [<c010537d>] sysenter_past_esp+0x52/0x71

So my guess is that the umount on cl030 triggered the assert on cl031, and both nodes got kicked out of the cluster.

All the data is available here:

http://developer.osdl.org/daniel/GFS/panic.16dec2004/

I also included /proc/cluster/dlm_debug and sm_debug (not sure how to interpret the data in those).

Thoughts?

Daniel
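P.S. In case the up_write() before queue_ast() patch mentioned at the top isn't familiar: the change is just a reordering, so the write semaphore is dropped before the completion AST is queued instead of after. The little userspace sketch below shows the ordering I mean; the names (res_lock, queue_ast, completion_ast, grant_lock) and the pthread locking are stand-ins made up for this note, not the actual dlm-kernel code or the literal diff.

/*
 * Userspace sketch of the ordering only: made-up names, a pthread
 * rwlock instead of a kernel rw_semaphore, and a toy queue_ast()
 * that just invokes the callback directly.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t res_lock = PTHREAD_RWLOCK_INITIALIZER;

/* stand-in for AST delivery; assume it may run the callback at once */
static void queue_ast(void (*ast)(void))
{
	ast();
}

/* the completion AST wants the same lock the caller was just holding */
static void completion_ast(void)
{
	pthread_rwlock_rdlock(&res_lock);
	printf("completion AST delivered\n");
	pthread_rwlock_unlock(&res_lock);
}

static void grant_lock(void)
{
	pthread_rwlock_wrlock(&res_lock);
	/* ... update the resource while holding the write lock ... */
	pthread_rwlock_unlock(&res_lock);	/* "up_write()" first ...  */
	queue_ast(completion_ast);		/* ... then queue the AST  */
}

int main(void)
{
	grant_lock();
	return 0;
}

In the sketch, queueing the AST before the unlock would leave completion_ast() trying to take res_lock while grant_lock() still holds it for write; doing the unlock first avoids that ordering hazard. (Builds with cc -pthread.)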