I got the following oops messages on my cluster nodes, at two different times.

The first was on node A: I was running clustat and hit ctrl-4 to kill it (it was taking a long while to run and seemed to be blocked by something). The second time I did that, OOPS#1 showed up.

The second oops showed up on node B. The cluster was running, and I wasn't doing anything beyond watching a tcpdump to see some data flow by. I went away for about 10 minutes, and when I came back node B had blocked up and had been fenced by A. The oops was in the messages file.

These events were separated by about a week. In between, I had updated everything to RHEL4 U1 and recompiled the cluster code, checked out from the RHEL4 branch, for the new kernel.

Yes, both nodes have VMware loaded. I can move the virtual machines off to another host, disable VMware, and try to replicate the problem if you think VMware might be the cause. (It may take a week or so, since the problem seems to be intermittent.)

The cluster has two nodes, shared ext3 partitions, and a few services (apache, postgresql, a VMware virtual machine). All nodes run Red Hat Enterprise Linux 4 on identical HP DL380 G4 dual Xeon boxes with hyperthreading enabled. A Memtest86 run on node B, started soon after the oops, went through two successful passes.

Any help would be appreciated, including a step in the right direction to debug this problem.
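In case it is useful for comparing the two traces below, here is a rough Python sketch (not part of any cluster tooling) that pulls the symbol+offset frames out of a "Call Trace:" section, so two oopses can be checked for the same code path. The regex, the `parse_frames` name, and the "kernel" label for frames without a module tag are my own assumptions for illustration; the sample frames are copied from the traces in this message.

```python
import re

# Matches frames such as "[<f89993be>] add_to_astqueue+0x79/0xc7 [dlm]".
# The trailing "[module]" tag is optional (core kernel frames have none).
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)


def parse_frames(trace_text):
    """Return a list of (symbol, offset, module) tuples, one per frame."""
    frames = []
    for _addr, symbol, offset, _size, module in FRAME_RE.findall(trace_text):
        # Load addresses differ between boots, so only symbol+offset is kept.
        frames.append((symbol, int(offset, 16), module or "kernel"))
    return frames


OOPS1_TRACE = """
[<f89993be>] add_to_astqueue+0x79/0xc7 [dlm]
[<f8999493>] ast_routine+0x70/0x130 [dlm]
[<f8999423>] ast_routine+0x0/0x130 [dlm]
[<f89984c7>] process_asts+0x15c/0x1c2 [dlm]
[<f8998b61>] dlm_astd+0x0/0x1a9 [dlm]
[<f8998cb2>] dlm_astd+0x151/0x1a9 [dlm]
[<c0131d3d>] kthread+0x73/0x9b
[<c0131cca>] kthread+0x0/0x9b
[<c01041f1>] kernel_thread_helper+0x5/0xb
"""

OOPS2_TRACE = """
[<f8c73446>] add_to_astqueue+0x79/0xc7 [dlm]
[<f8c7351b>] ast_routine+0x70/0x130 [dlm]
[<f8c734ab>] ast_routine+0x0/0x130 [dlm]
[<f8c724c7>] process_asts+0x15c/0x1c2 [dlm]
[<f8c72b61>] dlm_astd+0x0/0x1a9 [dlm]
[<f8c72cb2>] dlm_astd+0x151/0x1a9 [dlm]
[<c0132e31>] kthread+0x73/0x9b
[<c0132dbe>] kthread+0x0/0x9b
[<c01041f1>] kernel_thread_helper+0x5/0xb
"""

if __name__ == "__main__":
    first = parse_frames(OOPS1_TRACE)
    second = parse_frames(OOPS2_TRACE)
    print("same code path:", first == second)
    for symbol, offset, module in first:
        print(f"  {module}: {symbol}+{offset:#x}")
```

Comparing on symbol and offset rather than raw addresses matters here because the dlm module was loaded at different addresses on the two nodes (and the kernels differ), so the bracketed addresses alone would never match.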
Eric Kerin

OOPS#1: Node A - ctrl-4ing clustat from a root shell

Unable to handle kernel NULL pointer dereference at virtual address 0000001c
 printing eip:
c02c4f92
*pde = 34a2c001
Oops: 0000 [#1]
SMP
Modules linked in: nfsd exportfs lockd nls_utf8 vmnet(U) vmmon(U) parport_pc lp parport autofs4 i2c_dev i2c_core dlm(U) cman(U) sunrpc button battery ac md5 ipv6 uhci_hcd ehci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
CPU: 2
EIP: 0060:[<c02c4f92>] Tainted: PF VLI
EFLAGS: 00010206 (2.6.9-5.0.5.ELsmp)
EIP is at _spin_lock+0x3/0x34
eax: 00000018 ebx: 00000018 ecx: f466ae00 edx: f466ae00
esi: f466ae00 edi: 00000000 ebp: 00000000 esp: f50eff70
ds: 007b es: 007b ss: 0068
Process dlm_astd (pid: 2440, threadinfo=f50ef000 task=f515c130)
Stack: f466ae00 f89993be f466ae00 00000000 0011ab26 00000000 f466ae00 00000005
       f466ae00 f8999493 00000000 f587e2e8 f8999423 f89984c7 00000000 e22d518c
       f7e24600 f89b36a8 00000000 00000000 f8998b61 f8998cb2 f50ef000 f53caeac
Call Trace:
 [<f89993be>] add_to_astqueue+0x79/0xc7 [dlm]
 [<f8999493>] ast_routine+0x70/0x130 [dlm]
 [<f8999423>] ast_routine+0x0/0x130 [dlm]
 [<f89984c7>] process_asts+0x15c/0x1c2 [dlm]
 [<f8998b61>] dlm_astd+0x0/0x1a9 [dlm]
 [<f8998cb2>] dlm_astd+0x151/0x1a9 [dlm]
 [<c0131d3d>] kthread+0x73/0x9b
 [<c0131cca>] kthread+0x0/0x9b
 [<c01041f1>] kernel_thread_helper+0x5/0xb
Code: c0 84 d2 0f 9f c0 c3 89 c2 f0 81 28 00 00 00 01 0f 94 c0 84 c0 b9 01 00 00 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3 53 89 c3 <81>78 04 ad 4e ad de 74 18 ff 74 24 04 68 4d 83 2d c0 e8 61 bb

OOPS#2: Node B - Nothing out of the ordinary, just watching a tcpdump

Unable to handle kernel NULL pointer dereference at virtual address 0000001c
 printing eip:
c02c5ee4
*pde = 3509b001
Oops: 0000 [#1]
SMP
Modules linked in: dlm(U) cman(U) vmnet(U) parport_pc vmmon(U) lp parport autofs4 i2c_dev i2c_core sunrpc button battery ac md5 ipv6 uhci_hcd ehci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
CPU: 2
EIP: 0060:[<c02c5ee4>] Tainted: PF VLI
EFLAGS: 00010206 (2.6.9-11.ELsmp)
EIP is at _spin_lock+0x3/0x34
eax: 00000018 ebx: 00000018 ecx: c2b67b80 edx: c2b67b80
esi: c2b67b80 edi: 00000000 ebp: 00000000 esp: f5083f70
ds: 007b es: 007b ss: 0068
Process dlm_astd (pid: 4733, threadinfo=f5083000 task=f7601730)
Stack: c2b67b80 f8c73446 c2b67b80 00000000 008ecb26 00000000 c2b67b80 00000005
       c2b67b80 f8c7351b 00000000 f5a853f0 f8c734ab f8c724c7 00000000 d95e6eac
       f74fa400 f8c8d7a8 00000000 00000000 f8c72b61 f8c72cb2 f5083000 f519feac
Call Trace:
 [<f8c73446>] add_to_astqueue+0x79/0xc7 [dlm]
 [<f8c7351b>] ast_routine+0x70/0x130 [dlm]
 [<f8c734ab>] ast_routine+0x0/0x130 [dlm]
 [<f8c724c7>] process_asts+0x15c/0x1c2 [dlm]
 [<f8c72b61>] dlm_astd+0x0/0x1a9 [dlm]
 [<f8c72cb2>] dlm_astd+0x151/0x1a9 [dlm]
 [<c0132e31>] kthread+0x73/0x9b
 [<c0132dbe>] kthread+0x0/0x9b
 [<c01041f1>] kernel_thread_helper+0x5/0xb
Code: c0 84 d2 0f 9f c0 c3 89 c2 f0 81 28 00 00 00 01 0f 94 c0 84 c0 b9 01 00 00 00 75 09 f0 81 02 00 00 00 01 30 c9 89 c8 c3 53 89 c3 <81>78 04 ad 4e ad de 74 18 ff 74 24 04 68 2a 97 2d c0 e8 db ba
<0>Fatal exception: panic in 5 seconds

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster