On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.

All that we have is a message from the BIOS that this is a "fatal" error.
When did we start trusting the BIOS to give us accurate information?

PCIe fatal means that the link or the device is broken. But that seems
a poor reason to take down a large server that may have dozens of
devices (some of them set up specifically to handle errors ... e.g.
mirrored disks on separate controllers, or NIC devices that have
been "bonded" together).

So, as long as the action for a "fatal" error is to mark a link down
and offline the device, that seems a pretty reasonable course of
action.

The argument gets a lot more marginal if you simply reset the link
and re-enable the device to "fix" it. That might be enough, but I don't
think the OS has enough data to make the call.

-Tony

P.S. I deliberately put "fatal" in quotes above because to quote
"The Princess Bride" -- "that word, I do not think it means what you
think it means". :-)