On Thu, 2018-08-16 at 16:51 +1000, Benjamin Herrenschmidt wrote:
> No, this is wrong and not the intent of the error handling.
>
> You seem to be applying PCIe specific concepts brain-farted at Intel
> that are way way away from what we care about in practice and in Linux.
>
> > e.g. some driver handle errors ERR_NONFATAL or FATAL in similar ways
> > e.g.
> > ioat_pcie_error_detected(); calls ioat_shutdown(); in case of
> > ERR_NONFATAL
> > otherwise ioat_shutdown() in case of ERR_FATAL.
>
> Since when the error handling callbacks even have the concept of FATAL
> vs. non-fatal ? This doesn't appear anywhere in the prototype of the
> struct pci_error_handlers and shouldn't.

Ugh... I just saw the changes you did to
Documentation/PCI/pci-error-recovery.txt and I would very much like to
revert those!

Bjorn, you shouldn't let changes to the PCI error handling through
without acks from us. It looks like we didn't notice (we can't possibly
monitor all lists). We wrote that in the first place and our EEH
infrastructure relies heavily on it.

Poza, you seem to have not understood the intent of the code and are
now changing the rules in ways that are broken in our opinion. This is
bad.

Bjorn, please revert all of those changes. There was NEVER an intent to
separate fatal from non-fatal at that level. We could pass the
information to the driver if we wished, but the recovery sequence is NOT
intended to be different.

In particular, we specifically do NOT want to unplug and replug the
device for fatal errors at all. This is not going to work with drivers
that cannot re-link with their various kernel services, such as storage
devices re-connecting with mounted file systems etc... Those changes
are utterly broken.

The basic premise of the design is that we would do that unplug/replug
trick if and ONLY IF the driver doesn't have the appropriate callbacks.
We are also now looking at replacing this with an unbind/re-bind,
because in practice the unplugging is causing us all sorts of problems.
Sam (CC) can elaborate.

Bjorn, we are the main authors of that spec (Linas wrote it under my
supervision) and created those callbacks for EEH. AER picked them up
only later. Those changes must be at the very least acked by us before
going upstream.

Ben.