Seen this a lot on server boards. Check your Node Memory Interleave
setting in the BIOS.

Regards,
GG

On 11/28/05, Allen Smith <lazlor@xxxxxxxxxxxxxxxxxx> wrote:
> On Monday 28 November 2005 03:27 pm, Marcelino Mata wrote:
> >
> > Running RHEL 3.0 x86_64 U6 (2.4.21-37.ELsmp)
> >
> > I have searched, logged a call with HP and Red Hat support and have
> > turned up nothing. HP says I have memory problems, Red Hat says it's
> > a known non-critical error.
> >
> > I am not sure if I am chasing after the correct problem, but all six
> > of my AMD64 HP XW9300 (based on Tyan Thunder K8WE?) with anywhere
> > between 4 and 16 GB RAM and two Opteron CPUs get the following
> > errors:
> >
> > Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE
> > Nov 10 17:18:46 node4 kernel: Northbridge status 94044100:ac080a13
> > Nov 10 17:18:46 node4 kernel: Error chipkill ecc error
> > Nov 10 17:18:46 node4 kernel: ECC error syndrome ac08
> > Nov 10 17:18:46 node4 kernel: bus error local node response, request didn't time out
> > Nov 10 17:18:46 node4 kernel: generic read
> > Nov 10 17:18:46 node4 kernel: memory access, level generic
> > Nov 10 17:18:46 node4 kernel: link number 0
> > Nov 10 17:18:46 node4 kernel: dram scrub error
> > Nov 10 17:18:46 node4 kernel: corrected ecc error
> > Nov 10 17:18:46 node4 kernel: previous error lost
> > Nov 10 17:18:46 node4 kernel: NB error address 000000000126dd40
> >
> >
> > Nov 14 19:14:16 node4 kernel: CPU 0: Silent Northbridge MCE
> > Nov 14 19:14:16 node4 kernel: Northbridge status a6000001:0005001b
> > Nov 14 19:14:16 node4 kernel: Error gart error
> > Nov 14 19:14:16 node4 kernel: GART TLB error generic level generic
> > Nov 14 19:14:16 node4 kernel: err cpu1
> > Nov 14 19:14:16 node4 kernel: processor context corrupt
> > Nov 14 19:14:16 node4 kernel: error uncorrected
> > Nov 14 19:14:16 node4 kernel: previous error lost
> > Nov 14 19:14:16 node4 kernel: NB error address 00000000dffe0038
> >
> > Five of the computers have between 1 and 30 references to these
> > error messages in the past 3 weeks. One computer has over 30,000
> > instances of these error messages. I am getting the majority of
> > these messages on computers with more than 4 GB RAM, but I have also
> > had the messages on computers with only 4 GB RAM.
> >
> > The main reason I am focusing on these messages is that the
> > computers have crashed numerous times since being put online. The
> > computer with 30K instances of the error message has crashed about
> > 1-2 times per week. I am running the latest BIOS.
> >
> > I cannot turn on diskdump since they have Nvidia SATA controllers
> > (not supported by diskdump), and netdump has not produced anything,
> > since no data was written during the kernel crash (network driver
> > went down?).
> >
> > Has anyone else seen these messages or have any idea how to identify
> > the problem? Could my crashes be due to Northbridge errors, or am I
> > barking up the wrong tree?
> >
> > Marcelino
> >
> > Reference Information below
> >
> > lspci information
> > -----------------
> >
> > 00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> > 00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
> > 00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
> > 00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
> > 00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
> > 00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
> > 00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
> > 00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> > 00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> > 00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
> > 00:0a.0 Ethernet controller: nVidia Corporation CK804 Ethernet Controller (rev a3)
> > 00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> > 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
> > 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
> > 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
> > 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
> > 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
> > 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
> > 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
> > 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
> > 05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
> > 0a:00.0 VGA compatible controller: nVidia Corporation NV41GL [Quadro FX 1400] (rev a2)
> > 40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
> > 40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> > 40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
> > 40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> > 61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
> > 61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
> > 61:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5782 Gigabit Ethernet (rev 03)
> > 80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> > 80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> >
> > lsmod
> > -----
> > Module         Size  Used by    Tainted: P
> > nfs           95984    7 (autoclean)
> > audit        127208    2 (autoclean)
> > nfsd          86096    8 (autoclean)
> > lockd         60528    1 (autoclean) [nfs nfsd]
> > sunrpc        91944    1 (autoclean) [nfs nfsd lockd]
> > netconsole    19208    0 (unused)
> > autofs4       16912    2 (autoclean)
> > tg3           69936    1
> > nvnet         71168    1
> > sg            37880    0 (autoclean)
> > sr_mod        17676    0 (autoclean)
> > ide-scsi      12832    0
> > ide-cd        34408    0
> > cdrom         33096    0 [sr_mod ide-cd]
> > keybdev        3104    0 (unused)
> > mousedev       6728    0 (unused)
> > hid           21992    0 (unused)
> > input          7520    0 [keybdev mousedev hid]
> > ehci-hcd      21200    0 (unused)
> > usb-ohci      22864    0 (unused)
> > usbcore       85152    1 [hid ehci-hcd usb-ohci]
> > ext3          87856    2
> > jbd           57088    2 [ext3]
> > raid0          4368    1
> > sata_nv        5116    5
> > libata        49352    0 [sata_nv]
> > mptscsih      43792    0 (unused)
> > mptbase       50472    3 [mptscsih]
> > diskdumplib    6548    0 [mptscsih mptbase]
> > sd_mod        14964   10
> > scsi_mod     130124    6 [sg sr_mod ide-scsi sata_nv libata mptscsih sd_mod]
> >
> > --
> > redhat-list mailing list
> > unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
> > https://www.redhat.com/mailman/listinfo/redhat-list
> >
>
> I have seen this on 3 similar setups. We swapped out memory and that
> resolved it for 2 of them. On the third we had to do a complete swap
> (memory/mb/ps/cpu) to make the errors go away.
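
The per-machine counts in the original report (a handful of events on most nodes, over 30,000 on one) can be tallied straight from syslog. Below is a minimal sketch in Python, assuming the log lines look exactly like the two samples quoted above: a "Northbridge MCE" header line followed by an "Error ..." line naming the error type. The function name tally_mce and the "unknown" bucket are made up for illustration and are not part of any Red Hat tool.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the quoted report, e.g.
# "Nov 10 17:18:46 node4 kernel: Error chipkill ecc error"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)


def tally_mce(lines):
    """Count Northbridge MCE events per (host, error type).

    Hypothetical helper: an event starts at a "Northbridge MCE" line and
    is classified by the next "Error ..." line from the same host.
    """
    counts = Counter()
    pending = set()  # hosts with an MCE header but no "Error ..." line yet
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        host, msg = m.group("host"), m.group("msg")
        if "Northbridge MCE" in msg:
            if host in pending:
                # Previous event never got an "Error ..." line.
                counts[(host, "unknown")] += 1
            pending.add(host)
        elif msg.startswith("Error ") and host in pending:
            counts[(host, msg[len("Error "):])] += 1
            pending.discard(host)
    for host in pending:
        counts[(host, "unknown")] += 1
    return counts


# Sample input taken from the log excerpts quoted in this thread.
sample = """\
Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE
Nov 10 17:18:46 node4 kernel: Northbridge status 94044100:ac080a13
Nov 10 17:18:46 node4 kernel: Error chipkill ecc error
Nov 14 19:14:16 node4 kernel: CPU 0: Silent Northbridge MCE
Nov 14 19:14:16 node4 kernel: Northbridge status a6000001:0005001b
Nov 14 19:14:16 node4 kernel: Error gart error
""".splitlines()

if __name__ == "__main__":
    for (host, kind), n in sorted(tally_mce(sample).items()):
        print(f"{host}  {kind}: {n}")
```

To run it against a real machine, pass an open log file instead of the sample, for instance the file object for /var/log/messages (the usual syslog location on RHEL, assumed here). Splitting the tally by error type keeps corrected ECC events separate from GART errors, which may help decide whether memory is the first thing to swap.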