Re: AMD64 Northbridge errors

Seen this a lot on server boards. Check your Node Memory Interleave
setting in the BIOS.

Regards,
GG

On 11/28/05, Allen Smith <lazlor@xxxxxxxxxxxxxxxxxx> wrote:
> On Monday 28 November 2005 03:27 pm, Marcelino Mata wrote:
> >
> > Running RHEL 3.0 x86_64 U6 (2.4.21-37.Elsmp)
> >
> > I have searched, logged a call with HP and Redhat support and have
> > turned up nothing.  HP says I have memory problems, Redhat says it's a
> > known non-critical error.
> >
> > I am not sure if I am chasing after the correct problem but all six of
> > my AMD64 HP XW9300 (based off Tyan Thunder K8WE?) with anywhere between
> > 4-16GB RAM and two Opteron CPUs get the following errors:
> >
> > Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE
> > Nov 10 17:18:46 node4 kernel: Northbridge status 94044100:ac080a13
> > Nov 10 17:18:46 node4 kernel:     Error chipkill ecc error
> > Nov 10 17:18:46 node4 kernel:     ECC error syndrome ac08
> > Nov 10 17:18:46 node4 kernel:     bus error local node response, request didn't time out
> > Nov 10 17:18:46 node4 kernel:     generic read
> > Nov 10 17:18:46 node4 kernel:     memory access, level generic
> > Nov 10 17:18:46 node4 kernel:     link number 0
> > Nov 10 17:18:46 node4 kernel:     dram scrub error
> > Nov 10 17:18:46 node4 kernel:     corrected ecc error
> > Nov 10 17:18:46 node4 kernel:     previous error lost
> > Nov 10 17:18:46 node4 kernel:     NB error address 000000000126dd40
> >
> >
> > Nov 14 19:14:16 node4 kernel: CPU 0: Silent Northbridge MCE
> > Nov 14 19:14:16 node4 kernel: Northbridge status a6000001:0005001b
> > Nov 14 19:14:16 node4 kernel:     Error gart error
> > Nov 14 19:14:16 node4 kernel:     GART TLB error generic level generic
> > Nov 14 19:14:16 node4 kernel:     err cpu1
> > Nov 14 19:14:16 node4 kernel:     processor context corrupt
> > Nov 14 19:14:16 node4 kernel:     error uncorrected
> > Nov 14 19:14:16 node4 kernel:     previous error lost
> > Nov 14 19:14:16 node4 kernel:     NB error address 00000000dffe0038
> >
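
The two "Northbridge status" words in the log above can be unpacked by hand. The following is a minimal Python sketch, not the kernel's own decoder: the bit positions are assumptions, chosen because they reproduce the decoded lines printed in this post, and the labels are copied from those lines. The function name decode_nb_status is hypothetical.

```python
# Sketch only: bit positions are ASSUMPTIONS that happen to reproduce the
# decoded lines quoted in this thread. Labels are copied from those lines.
EXT_ERROR = {0x5: "gart error", 0x8: "chipkill ecc error"}

# Flag bits in the first (high) status word, as observed in the two reports.
HIGH_FLAGS = {
    0: "err cpu1",
    8: "dram scrub error",
    14: "corrected ecc error",
    25: "processor context corrupt",
    29: "error uncorrected",
    31: "previous error lost",
}


def decode_nb_status(text):
    """Decode a 'HIGH:LOW' hex pair as printed after 'Northbridge status'."""
    high_s, low_s = text.split(":")
    high, low = int(high_s, 16), int(low_s, 16)

    ext = (low >> 16) & 0xF  # assumed: extended error code in low bits 19:16
    result = {
        "error": EXT_ERROR.get(ext, "unknown (code %#x)" % ext),
        "flags": [name for bit, name in sorted(HIGH_FLAGS.items())
                  if high & (1 << bit)],
    }
    if ext == 0x8:
        # assumed: syndrome = low bits 31:24 (upper byte) and
        # high bits 22:15 (lower byte)
        result["syndrome"] = (((low >> 24) & 0xFF) << 8) | ((high >> 15) & 0xFF)
    return result


if __name__ == "__main__":
    for status in ("94044100:ac080a13", "a6000001:0005001b"):
        print(status, decode_nb_status(status))
```

Both status words have the top bit of the first word set. In the generic x86 machine-check status layout that bit is normally the "valid" flag, so the "previous error lost" line may be a quirk of this kernel's message table rather than evidence of lost errors. Treat that reading as unverified.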
> > Five of the computers have between 1-30 references to these error
> > messages in the past 3 weeks.  One computer has over 30,000 instances of
> > these error messages.  I am getting the majority of these messages on
> > computers with >4GB RAM but I have had the messages on computers with
> > only 4GB RAM.
> >
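
To compare machines on more than a rough count, the reports can be tallied per host and per error type. This is a minimal Python sketch that assumes the syslog line format shown in this post; the function name tally and the log path in the usage note are assumptions, not something given in the thread.

```python
import re
from collections import Counter

# Header line that opens each report, e.g.
# "Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE"
HEADER = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:\s+CPU \d+: Silent Northbridge MCE"
)
# The "Error <type>" line, e.g. "kernel:     Error chipkill ecc error".
# Case-sensitive, so "error uncorrected" and "ECC error ..." do not match.
KIND = re.compile(r"kernel:\s+Error (.+?)\s*$")


def tally(lines):
    """Return (reports per host, reports per error type) for syslog lines."""
    per_host, per_kind = Counter(), Counter()
    for line in lines:
        header = HEADER.match(line)
        if header:
            per_host[header.group(1)] += 1
            continue
        kind = KIND.search(line)
        if kind:
            per_kind[kind.group(1)] += 1
    return per_host, per_kind
```

Usage, assuming the messages end up in /var/log/messages: open the file, pass the file object to tally(), and print the two counters. Running it on each node shows whether one error type dominates on the machine with 30K reports.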
> > The main reason I am focusing on these messages is that the computers
> > have crashed numerous times since being put online.  The computer with
> > 30K instances of the error message has crashed about 1-2 times per week.
> > I am running the latest BIOS.
> >
> > I cannot turn on diskdump since they have Nvidia SATA controllers (not
> > supported by diskdump) and netdump has not produced anything since during
> > the kernel crash no data was written (network driver went down?).
> >
> > Has anyone else seen these messages or have any idea how to identify the
> > problem?  Could my crashes be due to Northbridge errors or am I barking
> > up the wrong tree?
> >
> > Marcelino
> >
> > Reference Information below
> >
> > lspci information
> > -----------------
> >
> >  00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller
> > (rev a3)
> > 00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
> > 00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
> > 00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
> > 00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
> > 00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97
> > Audio Controller (rev a2)
> > 00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
> > 00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller
> > (rev f3)
> > 00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller
> > (rev f3)
> > 00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
> > 00:0a.0 Ethernet controller: nVidia Corporation CK804 Ethernet
> > Controller (rev a3)
> > 00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> > 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > HyperTransport Technology Configuration
> > 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > Address Map
> > 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > DRAM Controller
> > 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > Miscellaneous Control
> > 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > HyperTransport Technology Configuration
> > 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > Address Map
> > 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > DRAM Controller
> > 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
> > Miscellaneous Control
> > 05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A
> > IEEE-1394a-2000 Controller (PHY/Link)
> > 0a:00.0 VGA compatible controller: nVidia Corporation NV41GL [Quadro FX
> > 1400] (rev a2)
> > 40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge
> > (rev 12)
> > 40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> > 40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge
> > (rev 12)
> > 40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> > 61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
> > Fusion-MPT Dual Ultra320 SCSI (rev 07)
> > 61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
> > Fusion-MPT Dual Ultra320 SCSI (rev 07)
> > 61:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5782
> > Gigabit Ethernet (rev 03)
> > 80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller
> > (rev a3)
> > 80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller
> > (rev a3)
> >
> > lsmod
> > -----
> > Module                  Size  Used by    Tainted: P
> > nfs                    95984   7  (autoclean)
> > audit                 127208   2  (autoclean)
> > nfsd                   86096   8  (autoclean)
> > lockd                  60528   1  (autoclean) [nfs nfsd]
> > sunrpc                 91944   1  (autoclean) [nfs nfsd lockd]
> > netconsole             19208   0  (unused)
> > autofs4                16912   2  (autoclean)
> > tg3                    69936   1
> > nvnet                  71168   1
> > sg                     37880   0  (autoclean)
> > sr_mod                 17676   0  (autoclean)
> > ide-scsi               12832   0
> > ide-cd                 34408   0
> > cdrom                  33096   0  [sr_mod ide-cd]
> > keybdev                 3104   0  (unused)
> > mousedev                6728   0  (unused)
> > hid                    21992   0  (unused)
> > input                   7520   0  [keybdev mousedev hid]
> > ehci-hcd               21200   0  (unused)
> > usb-ohci               22864   0  (unused)
> > usbcore                85152   1  [hid ehci-hcd usb-ohci]
> > ext3                   87856   2
> > jbd                    57088   2  [ext3]
> > raid0                   4368   1
> > sata_nv                 5116   5
> > libata                 49352   0  [sata_nv]
> > mptscsih               43792   0  (unused)
> > mptbase                50472   3  [mptscsih]
> > diskdumplib             6548   0  [mptscsih mptbase]
> > sd_mod                 14964  10
> > scsi_mod              130124   6  [sg sr_mod ide-scsi sata_nv libata
> > mptscsih sd_mod]
> >
> > --
> > redhat-list mailing list
> > unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
> > https://www.redhat.com/mailman/listinfo/redhat-list
> >
>
> I have seen this on 3 similar setups. We swapped out memory and that
> resolved it for 2 of them. On the third we had to do a complete swap
> (memory/mb/ps/cpu) to make the errors go away.
>
