Hi,

On Sun, Jul 31, 2011 at 04:22:41AM -0400, F. P. Beekhof wrote:
> I've used the hooks to call a script, the value is 100008 after
> resume, and I'm booting the system by going onto 'recovery console',
> running the script to set msr 0xc001001f to 100008, then completing
> the normal boot procedure.

Hmm, there has to be a way to automate that. Maybe push
/etc/init.d/rc.local up in the call priority so that it gets run as
early as possible?

> So far, it seems to have fixed the issue, in the sense that there have
> been no MCEs yet. There was some call trace after a suspend/resume
> (see below), but that's it.

Yeah, it's on resume. This warning fires because it took the system
17880 msecs to resume and the test was expecting something under 10000.
It could be an unstable RTC clock or something. You could disable it by
turning CONFIG_PM_TEST_SUSPEND off for your kernel if there are no other
issues with suspend/resume besides that warning firing.

> I found that one can enable ECC on ram in the bios, which I did. As
> far as I know, this is non-ECC ram, so frankly I'm at a loss about

Maybe the BIOS is not properly detecting whether the DRAM is ECC or not.
Normally, if it is not, it should simply remove the option to enable ECC
from the menu. To check what the hw says, do

$ setpci -s 18.3 0x44.l

as root and send me the result please.

> To provoke MCEs, I've added a firewire card, that I had pulled out
> before. Removing that thing had reduced the number of MCEs, but not
> eliminated them. With a regular boot sequence (no msr setting), the
> radeon driver complained of something and the system froze within 5
> minutes. I then rebooted and followed your instructions, so far the
> system is working perfectly fine.

Good.

> I've also switched two eSATA on and off a few times, they are detected
> fine now with no crash, and let banshee run. That has frequently
> proven to be too much, but now it is fine.

Good.
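Regarding the setpci readout requested above, here is a minimal sketch
of how the returned value could be checked by hand. It assumes that the
register at function 3, offset 0x44 is the northbridge MCA configuration
register and that bit 22 is its DRAM ECC enable bit; please verify that
against the BKDG for the CPU family in question. The helper names
ecc_bit_set and describe_ecc are made up for illustration.

```shell
#!/bin/sh
# Sketch only: decode the 32-bit hex value printed by
#   setpci -s 18.3 0x44.l
# ASSUMPTION: bit 22 of this register is the DRAM ECC enable bit.

# ecc_bit_set HEXVALUE
# Exit status 0 if bit 22 is set, 1 otherwise. Accepts the value with
# or without a leading "0x".
ecc_bit_set() {
    _val=$(( 0x${1#0x} ))
    [ $(( (_val >> 22) & 1 )) -eq 1 ]
}

# describe_ecc HEXVALUE
# Prints a one-line summary of the bit's state.
describe_ecc() {
    if ecc_bit_set "$1"; then
        echo "ECC enable bit (22) is set"
    else
        echo "ECC enable bit (22) is clear"
    fi
}
```

As root, something like describe_ecc "$(setpci -s 18.3 0x44.l)" would
then print the state. The bit only reflects what the BIOS programmed,
not whether the DIMMs actually are ECC parts.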
> All of this is no definite proof that all is well, but it certainly
> seems more stable.

I'd suggest you run your system at full swing and watch it for signs of
trouble a couple of days longer, just in case.

> Are there any conclusions that can be drawn from this experiment ?

Yeah, it means that your BIOS doesn't seem to have the fix for erratum
#131: http://support.amd.com/us/Processor_TechDocs/25759.pdf, page 83.

I don't know whether there is a BIOS with the fix for your ancient
CPU :-) and, if there were, whether upgrading to it wouldn't break
something else. If I were you, I'd run the automated script hooks and
not care about the upgrade... provided we don't see any other hiccups,
that is, and provided we manage to automate them so that you don't have
to boot into the recovery console every time.

Let me know how it all plays out.

HTH.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632
WEEE Registernr: 129 19551
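On automating the MSR write discussed above, a rough sketch of what such
a script might look like. It assumes that msr-tools (rdmsr/wrmsr) is
installed and that the msr kernel module is available. The function
names and the DRY_RUN switch are hypothetical, and the value to write is
deliberately left as an argument: pass it exactly as your existing,
working script does.

```shell
#!/bin/sh
# Sketch only: re-apply an MSR value on all CPUs, e.g. at boot or
# after resume.
# ASSUMPTIONS: msr-tools is installed; the msr module can be loaded.

MSR_REG="0xc001001f"

# build_wrmsr_cmd VALUE
# Prints the wrmsr command line that apply_msr would run.
build_wrmsr_cmd() {
    echo "wrmsr -a $MSR_REG $1"
}

# apply_msr VALUE
# Loads the msr driver, writes VALUE to MSR_REG on all CPUs, then reads
# the register back so the result shows up in the boot/resume output.
# With DRY_RUN=1 it only prints the command and touches nothing.
apply_msr() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        build_wrmsr_cmd "$1"
        return 0
    fi
    modprobe msr || return 1
    wrmsr -a "$MSR_REG" "$1" || return 1
    rdmsr -a "$MSR_REG"
}
```

Calling apply_msr from /etc/rc.local would cover boot. For resume, one
option is a hook under /etc/pm/sleep.d that calls it when invoked with
"resume" or "thaw", assuming the system uses pm-utils.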