Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 2018-08-16 at 16:51 +1000, Benjamin Herrenschmidt wrote:
> No, this is wrong and not the intent of the error handling.
> 
> You seem to be applying PCIe specific concepts brain-farted at Intel
> that are way way away from what we care about in practice and in Linux.
> 
> > e.g.  some driver handle errors ERR_NONFATAL or FATAL in similar ways
> > e.g.
> > ioat_pcie_error_detected();  calls  ioat_shutdown(); in case of 
> > ERR_NONFATAL
> > otherwise ioat_shutdown() in case of ERR_FATAL.
> 
> Since when the error handling callbacks even have the concept of FATAL
> vs. non-fatal ? This doesn't appear anyhwhere in the prototype of the
> struct pci_error_handlers and shouldn't.

Ugh... I just saw the changes you did to Documentation/PCI/pci-error-
recovery.txt and I would very much like to revert those !

Bjorn, you shouldn't let changes to the PCI error handling through
without acks from us, it looks like we didn't notice (we can't possibly
monitor all lists).

We wrote that in the firsat place and our EEH infrastructure rely on it
heavily on it.

Poza, you seem to have not understood the intent of the code and are
now changing the rules in ways that are broken in our opinion. This is
bad.

Bjorn, please revert all of those changes.

There was NEVER an intent to separate fatal from non-fatal at that
level. We could pass the information to the driver if we wished but the
recovery sequence is NOT intended to be different.

Especially we specifically do NOT want to unplug and replug the device
for fatal errors at all. This is not going to work with drivers that
cannot re-link with their various kernel services, such as storage
devices re-connecting with mounted file systems etc...

Those changes are utterly broken.

The basic premise of the design that we woudl do that unplug/replug
trick if and ONLY IF the driver doesn't have the appropriate callbacks.

We are also now looking at replacing this with an ubind/re-bind because
in practice, the unplugging is causing us all sort of problems. Sam
(CC) can elaborate.

Bjorn, we are the main authors of that spec (Linas wrote it under my
supervision) and created those callbacks for EEH. AER picked them up
only later. Those changes must be at the very least acked by us before
going upstream.

Ben.





[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux