On Fri, Jul 21, 2017 at 02:01:31PM -0300, Mauro Carvalho Chehab wrote:
> I see the value of having a threshold in BIOS, provided that it is
> well documented, and whose value can be adjusted, if needed.
>
> One of the things I wanted to implement in ras-daemon was an
> algorithm that would be doing such threshold in software.

We have that now in the kernel:

drivers/ras/cec.c

We did it exactly for that purpose - not upsetting users unnecessarily.

> The thing with a BIOS threshold is that the user has no way to
> audit the algorithm. So, when BIOS starts reporting such errors,
> it may be already too late: the systems may be on the verge of
> losing data (or some data was already lost).

Not only that: thresholds depend on the DIMM types, which means BIOS must
know what DIMM types are in there, which I doubt. So exposing that to
configuration instead of "deciding" for people would be better.

> That's critical on cluster systems with thousands of machines:
> while the impact of disabling a cluster node to do some maintenance
> is marginal, the impact of an uncorrected error on a single
> machine may compromise weeks of expensive processing.
>
> That's why some users prefer to monitor every single corrected
> error, and compare it with the probability distribution for which
> they know that the risk of uncorrected errors is acceptable.

Yap, you need to have stuff like that configurable - BIOS can't predict
all possible use cases.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
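
To make the "configurable threshold in software" idea concrete, here is a
minimal userspace sketch. It is not the code in drivers/ras/cec.c and does
not claim to mirror it; the class name, the parameters (threshold,
decay_interval) and the halving decay policy are all assumptions made up
for illustration. It counts corrected errors per page, lets old counts
fade out over time, and only reports once a user-chosen threshold is
reached.

```python
class CorrectedErrorThreshold:
    """Hypothetical per-page corrected-error counter (illustration only)."""

    def __init__(self, threshold=4, decay_interval=3600.0):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        if decay_interval <= 0:
            raise ValueError("decay_interval must be positive")
        self.threshold = threshold            # errors on one page before reporting
        self.decay_interval = decay_interval  # seconds between halvings of all counts
        self._counts = {}                     # page -> current error count
        self._last_decay = 0.0                # time of the last applied decay step

    def _decay(self, now):
        # Halve every counter once per fully elapsed decay interval,
        # so that old, isolated errors do not accumulate forever.
        steps = int((now - self._last_decay) // self.decay_interval)
        if steps <= 0:
            return
        self._last_decay += steps * self.decay_interval
        decayed = {page: count >> steps for page, count in self._counts.items()}
        self._counts = {page: count for page, count in decayed.items() if count > 0}

    def record(self, page, now):
        """Count one corrected error on `page` at time `now` (seconds).

        Returns True when the page has reached the threshold and should be
        reported to the user; the page's counter is reset in that case.
        """
        self._decay(now)
        count = self._counts.get(page, 0) + 1
        if count >= self.threshold:
            self._counts.pop(page, None)
            return True
        self._counts[page] = count
        return False
```

Setting threshold=1 gives the "monitor every single corrected error"
behaviour some cluster operators want, while a larger value keeps
occasional errors quiet. The point is only that the policy lives where
the user can see it and change it.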