Re: FC4 crashes repeatedly on Supermicro AS1020A-T dual-core Opterons, SMP

On Fri, May 05, 2006 at 08:23:44AM -0700, cerise@xxxxxxxxxx wrote:
> > Michal Szymanski wrote:
> >
> > >All systems crash (either hang with some "machine check exception"
> > >kernel messages or reset) when loaded with repeating runs of 1.3gb, CPU
> > >intensive with some I/O. I run 2 or 4 jobs simultaneously and they had
> > >never survived more than a few hours.
> 
> Let's try the easy stuff first -- if it's crashing with a machine check
> exception, then let's disable machine check exceptions, and see if things
> still break.
> 
> Try booting with the parameter "nomce".  Be aware that mce is a mechanism
> for the processor to inform the kernel of thermal issues or component 
> failure.  You'll only want to disable this mechanism if you aren't having
> thermal problems.  

I tried "nomce". The machine no longer halts with MCE kernel panic
messages on screen, but it still resets after 3-4 hours of running 2 or
more jobs.

As I wrote in a response to Robert's message, it seems to be a memory
issue, as there are no crashes with Kingston 1GB memory modules.
One of the machines and the memory went back to the dealer for tests.

> P.S.  I came a little late to this party -- I didn't see the original message.
> Did you include the text of the kernel crash?

Below is the kernel message, OCR-ed from a digital photo of the screen :)
Plus the decoded message, as advised by the first message:

Fedora Core release 4 (Stentz)
kernel 2.6.16-1.2069_FC4smp on an x86_64

red10 login:
HARDWARE ERROR
        CPU 0: Machine Check Exception: 4 Bank 4: f604a00200000813
TSC 1504205a42ba ADDR 115e47828
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check

Call Trace: <#MC> 
     <ffffffff80134e6a>{panic+133} (ffffffff801129eb){mcheck_timer+0}
     <ffffffff801131fc>{do_machine_check+753} 
     <ffffffff8010be43>{machine_check+127} <EOE>

------------------

mcelog --ascii  output:

HARDWARE ERROR
CPU 0 BANK 4 TSC 1504205a42ba 
MCG status:MCIP 
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
Processor context corrupt
MCA:BUS Generic Originated-request Read Memory-access Request-timeout Error
Model:
STATUS f604a00200000813 MCGSTATUS 4
------------------
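
For reference, the flag lines in the mcelog output can be reproduced by
hand from the raw STATUS and MCGSTATUS values. What follows is only a
minimal Python sketch, not what mcelog itself runs. The function and table
names are made up for illustration. The bit positions (63 down to 57 for
the status flags, bit 2 of MCGSTATUS for MCIP) are the architectural x86
machine-check layout, and the low 16 bits are merely split into the raw
fields of the bus-error pattern 0000 1PPT RRRR IILL, without translating
them to text.

```python
# Illustrative sketch: decode the architectural flag bits of a machine-check
# status word. Names below are hypothetical, not taken from mcelog.

# (bit position, description) for the top bits of MCi_STATUS
STATUS_FLAGS = [
    (63, "Valid"),
    (62, "Error overflow"),
    (61, "Uncorrected error"),
    (60, "Error enabled"),
    (59, "MCi_MISC register valid"),
    (58, "MCi_ADDR register valid"),
    (57, "Processor context corrupt"),
]

# (bit position, name) for MCG_STATUS
MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]


def decode_bits(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def split_bus_error(code):
    """Split a 16-bit MCA error code of the form 0000 1PPT RRRR IILL.

    Returns the raw field values, or None if the code is not a bus error.
    """
    if (code & 0xF800) != 0x0800:
        return None
    return {
        "PP": (code >> 9) & 0x3,    # participation
        "T": (code >> 8) & 0x1,     # timeout bit, left raw in this sketch
        "RRRR": (code >> 4) & 0xF,  # request type
        "II": (code >> 2) & 0x3,    # memory or I/O
        "LL": code & 0x3,           # cache level
    }


if __name__ == "__main__":
    status = 0xF604A00200000813  # STATUS from the panic message
    mcgstatus = 0x4              # MCGSTATUS from the mcelog output

    print("MCG status:", decode_bits(mcgstatus, MCG_FLAGS))
    for line in decode_bits(status, STATUS_FLAGS):
        print(line)
    print("MCA error code fields:", split_bus_error(status & 0xFFFF))
```

Run on the values above, this lists the same five flags that mcelog
printed, plus the Valid bit, which mcelog does not print separately, and
MCIP for MCGSTATUS. The names for the individual PP/RRRR/II/LL values
should be taken from the CPU vendor's documentation.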

regards, Michal.

-- 
  Michal Szymanski (msz at astrouw dot edu dot pl)
  Warsaw University Observatory, Warszawa, POLAND