Ok, that's sort of what I thought was going on, but I wanted to get some feedback. There is another bug in Bugzilla that looks like it might be related:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212055

Anyway, thanks.

Corey

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Patrick Caulfield
Sent: Monday, November 13, 2006 9:14 AM
To: linux clustering
Subject: Re: dlm_recvd + bnx2 oops

Kovacs, Corey J. wrote:
> Morning all. We've been experiencing regular cluster crashes on RHEL4u4.
> This system has 5 nodes and a few dozen nodes mounting shares via NFS.
> Periodically, nodes will panic, get fenced, and all continues on. This
> system does have some of the HP Product Support Pack installed (not
> the HP bnx2 driver). Below is the section from the logs. It is hand
> typed, but I am fairly sure it's accurate.
>
> The machines are HP DL360 G5s. The NICs are Broadcom NetXtreme II 5708s.
>
> Anyone else seeing this?
>
> Corey
>
> ===========================================================
>
> Unable to handle kernel NULL pointer dereference at virtual address 000000ac
>  printing eip:
> f8f339ae
> *pde = 37038001
> Oops: 0000 [#1]
> SMP
> Modules linked in: ipt_multiport iptable_nat ip_conntrack ip_tables
> ip_vs_rr ip_vs cpqci(U) ipmi_devintf ipmi_si ipmi_msghandler xp(U)
> mptctl mptbase sg autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U)
> lock_harness(U) dlm(U) cman(U) md5 ipv6 nfsd exportfs lockd nfs_acl
> sunrpc joydev dm_mirror button battery ac ehci_hcd uhci_hcd bnx2 ext3
> jbd dm_mod qla6312(U) qla2400(U) qla2300(U) qla2xxx_conf(U) qla2xxx(U)
> cciss sd_mod scsi_mod
> CPU: 0
> EIP: 0060:[<f8f339ae>] Tainted: P VLI
> EFLAGS: 00010202 (2.6.9-42.0.2.ELsmp)
> EIP is at bnx2_tx_int+0x48/0x1d1 [bnx2]
> eax: f70620dc ebx: 00000ad7 ecx: 00000002 edx: 00000037
> esi: 00000a37 edi: 00000000 ebp: f6a0b200 esp: c03cefa0
> ds: 007b es: 007b ss: 0068
> Process dlm_recvd (pid: 3973, threadinfo=c03ce000 task=f71652f0)
> Stack: f70620dc 00000037 f5c19000 00000000 f6a0b200 f6a0afc0 c03cefd4 f8f3431d
>        00000000 f6a0afc0 c201fd80 15a3182b c0280e24 000493dc 00000001 c0392c18
>        0000000a 00000000 c01269b8 f59d4dc4 00000046 c038b900 f59d4000 c010819f
> Call Trace:
>  [<f8f3431d>] bnx2_poll+0x4f/0x142 [bnx2]
>  [<c0280e24>] net_rx_action+0xae/0x160
>  [<c01269b8>] __do_softirq+0x4c/0xb1
>  [<c010819f>] do_softirq+0x4f/0x56

That looks like a driver crash to me. The fact that it's in dlm_recvd is probably just because it's a busy process doing lots of network I/O. There's no DLM code in the stack trace at all.

--
patrick

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster