New kernel causes hardware error?




I recently upgraded to kernel 2.6.18-194.3.1.el5, and within a few days 
the machine crashed with the following error (repeated in mcelog):

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 MISC 41
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
Memory address parity error
Memory corrected error count (CORE_ERR_CNT): 911
Memory transaction Tracker ID (RTId): 41
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 0
STATUS ea10e3c0008000b0 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 MISC 41
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
Memory address parity error
Memory corrected error count (CORE_ERR_CNT): 7970
Memory transaction Tracker ID (RTId): 41
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 0
STATUS ea17c880008000b0 MCGSTATUS 0

Every time the error occurs, the only fields that change are 
CORE_ERR_CNT and STATUS.
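
For anyone who wants to check the decoding by hand, here is a minimal Python 
sketch that unpacks the two STATUS values above. It assumes the architectural 
IA32_MCi_STATUS layout from the Intel manuals (flag bits 63 down to 57, MCA 
error code in bits 15:0, model-specific code in bits 31:16) and the memory 
controller error code format 0000_0000_1MMM_CCCC. The function name 
decode_status and the lookup tables are made up for illustration and are not 
part of mcelog. The 14-bit width used for the corrected error count is an 
assumption, chosen only because it reproduces the CORE_ERR_CNT values mcelog 
printed here.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit position -> meaning).
FLAGS = {
    63: "Valid",
    62: "Error overflow",
    61: "Uncorrected error",
    60: "Error enabled",
    59: "MCi_MISC register valid",
    58: "MCi_ADDR register valid",
    57: "Processor context corrupt",
}

# MMM field of a memory controller error code (0000_0000_1MMM_CCCC).
TRANSACTIONS = {
    0: "Generic undefined request",
    1: "Memory read error",
    2: "Memory write error",
    3: "Address/Command error",
    4: "Memory scrubbing error",
}

# Assumption: count field is 14 bits wide starting at bit 38. This width was
# picked because it matches the CORE_ERR_CNT values in the log above.
COUNT_SHIFT = 38
COUNT_MASK = 0x3FFF


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": mca_code,
        "model_specific": (status >> 16) & 0xFFFF,
        "corrected_count": (status >> COUNT_SHIFT) & COUNT_MASK,
    }
    # Memory controller errors have bit 7 set and bits 15:8 clear.
    if (mca_code & 0xFF80) == 0x0080:
        result["transaction"] = TRANSACTIONS.get((mca_code >> 4) & 0x7, "Reserved")
        result["channel"] = mca_code & 0xF
    return result


first = decode_status(0xEA10E3C0008000B0)
second = decode_status(0xEA17C880008000B0)
for record in (first, second):
    print(record)
```

Under these assumptions the two STATUS values differ only inside the count 
field, so the changing STATUS and the changing CORE_ERR_CNT would be the same 
piece of information; everything else about the error is identical between 
the two records.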

Since this appears to be a memory error, I have run memtest86+ many 
times. However, it has not reported any errors.

After reverting to each of the older kernels below and testing, the error 
above was logged only once (after boot) and then not reported again, and it 
definitely did not cause a kernel panic or crash the machine:
CentOS-5.4 (2.6.18-164.15.1.el5)
CentOS (2.6.18-164.9.1.el5)
CentOS (2.6.18-164.el5)

Would this error indicate a motherboard or CPU problem? How can I 
diagnose it? Or is there something funny with the kernel?

Hardware:
Supermicro X8DTL-iF motherboard.
Intel Server Xeon E5502 1.86GHz Nehalem
8 GB Kingston DDR3-1333 RAM w/ Parity w/ Thermal Sensor

I have read a note on Bugzilla about mcelog not supporting Nehalem 
processors when decoding errors. I think this is fixed in CentOS 5.5, 
but maybe there is still a bug?
https://bugzilla.redhat.com/show_bug.cgi?id=473392





