On Mon, Oct 25, 2010 at 04:35:43PM -0700, Tony Luck wrote:
> On Mon, Oct 25, 2010 at 2:51 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> > Concerning fatal errors, take a look at drivers/edac/mce_amd.(c|h) -
> > this is not in arch/x86/ and still decodes MCEs in the kernel. And it
> > works fine - it even helped in several cases where people simply read
> > their serial console/dmesg and didn't have to collect it first and run
> > it through some tool to understand which functional unit in the CPU is
> > mchecking.
>
> That looks neat ... but end-users seem to have some conflicting requirements
> here. Your users seem to like it but the LLNL folks at the S.F. meeting said
> that solutions that involved looking at console logs from thousands
> of machines in a cluster were not acceptable.
>
> I doubt very much if any end-user cares which unit *within* a cpu
> failed (their replaceable unit is the whole of the cpu). So much of
> your driver could be replaced with:
>
>	printk("CPU%d is bad\n", cpu);

Yeah, nobody said this is finished. The next step is using perf
infrastructure to convey those decoded errors to userspace, say, to a
ras daemon or similar which can do all sorts of reporting, statistics,
policy decisions, injection, paint graphs, whatever...

I sent out two patchsets as an RFC already and am working on the 3rd one
so we're getting there. Here's the last one:

http://kerneltrap.org/mailarchive/linux-kernel/2010/8/6/4603847

Also, I'm open to all suggestions on how to make it more usable and
user-friendly.

Thanks.

-- 
Regards/Gruss,
Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html