On Fri, May 05, 2006 at 08:23:44AM -0700, cerise@xxxxxxxxxx wrote:
>
> Michal Szymanski wrote:
>
> > >All systems crash (either hang with some "machine check exception"
> > >kernel messages or reset) when loaded with repeating runs of 1.3gb, CPU
> > >intensive with some I/O. I run 2 or 4 jobs simultaneously and they had
> > >never survived more than a few hours.
>
> Let's try the easy stuff first -- if it's crashing with a machine check
> exception, then let's disable machine check exceptions, and see if things
> still break.
>
> Try booting with the parameter "nomce". Be aware that mce is a mechanism
> for the processor to inform the kernel of thermal issues or component
> failure. You'll only want to disable this mechanism if you aren't having
> thermal problems.

I tried "nomce". The machine no longer halts with MCE kernel panic
messages on screen, but it still resets after 3-4 hours of work under
2 or more jobs.

As I wrote in my response to Robert's message, it seems to be a memory
issue, as there are no crashes with Kingston 1GB memory modules. One of
the machines and the memory went back to the dealer for tests.

> P.S. I came a little late to this party -- I didn't see the original message.
> Did you include the text of the kernel crash?

Below is the kernel message, OCR-ed from a digital photo of the screen :)
plus the decoded message, as advised by the first message:

Fedora Core release 4 (Stentz)
kernel 2.6.16-1.2069_FC4smp on an x86_64

red10 login: HARDWARE ERROR
CPU 0: Machine Check Exception: 4 Bank 4: f604a00200000813
TSC 1504205a42ba ADDR 115e47828
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check

Call Trace: <#MC> <ffffffff80134e6a>{panic+133} (ffffffff801129eb){mcheck_timer+0}
       <ffffffff801131fc>{do_machine_check+753} <ffffffff8010be43>{machine_check+127}
       <EOE>

------------------
mcelog --ascii output:

HARDWARE ERROR
CPU 0 BANK 4 TSC 1504205a42ba
MCG status:MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
Processor context corrupt
MCA:BUS Generic Originated-request Read Memory-access Request-timeout Error
Model:
STATUS f604a00200000813 MCGSTATUS 4
------------------

regards,
Michal.

-- 
  Michal Szymanski (msz at astrouw dot edu dot pl)
  Warsaw University Observatory, Warszawa, POLAND
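
For anyone who wants to check the decoding by hand: the flag lines mcelog prints can be read straight off the raw STATUS and MCGSTATUS values quoted above. Below is a minimal sketch in Python, assuming the standard x86 machine-check register layout (bits 63-57 of MCi_STATUS, bits 0-2 of MCG_STATUS). The table and function names are made up for illustration, and this is not how mcelog itself is implemented.

```python
# Sketch only: decode the architectural flag bits of a machine-check
# status word. Names below (MCI_STATUS_FLAGS, decode_flags, ...) are
# hypothetical helpers, not part of mcelog or the kernel.

# Upper flag bits of MCi_STATUS (assumed standard x86 MCA layout).
MCI_STATUS_FLAGS = [
    (63, "Valid"),
    (62, "Error overflow"),
    (61, "Uncorrected error"),
    (60, "Error enabled"),
    (59, "MCi_MISC register valid"),
    (58, "MCi_ADDR register valid"),
    (57, "Processor context corrupt"),
]

# Low bits of MCG_STATUS (assumed standard x86 MCA layout).
MCG_STATUS_FLAGS = [
    (0, "RIPV"),
    (1, "EIPV"),
    (2, "MCIP"),
]


def decode_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


if __name__ == "__main__":
    status = 0xF604A00200000813   # STATUS from the report above
    mcg_status = 0x4              # MCGSTATUS from the report above

    print("MCG status:", " ".join(decode_flags(mcg_status, MCG_STATUS_FLAGS)))
    print("MCi status:")
    for name in decode_flags(status, MCI_STATUS_FLAGS):
        print(" ", name)
```

Run on the values from the panic, this should list the same flags mcelog reported (plus the "Valid" bit, which mcelog does not print). The low 16 bits (0x0813, the MCA error code) are deliberately left undecoded here.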