Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed

poza@xxxxxxxxxxxxxx · Thu, 16 Aug 2018 13:35:43 +0530

On 2018-08-16 12:35, Benjamin Herrenschmidt wrote:
On Thu, 2018-08-16 at 16:51 +1000, Benjamin Herrenschmidt wrote:
No, this is wrong and not the intent of the error handling.

You seem to be applying PCIe specific concepts brain-farted at Intel
that are way way away from what we care about in practice and in 
Linux.

> e.g.  some driver handle errors ERR_NONFATAL or FATAL in similar ways
> e.g.
> ioat_pcie_error_detected();  calls  ioat_shutdown(); in case of
> ERR_NONFATAL
> otherwise ioat_shutdown() in case of ERR_FATAL.

Since when the error handling callbacks even have the concept of FATAL
vs. non-fatal ? This doesn't appear anyhwhere in the prototype of the
struct pci_error_handlers and shouldn't.

Ugh... I just saw the changes you did to Documentation/PCI/pci-error-
recovery.txt and I would very much like to revert those !

Bjorn, you shouldn't let changes to the PCI error handling through
without acks from us, it looks like we didn't notice (we can't possibly
monitor all lists).

We wrote that in the firsat place and our EEH infrastructure rely on it
heavily on it.

Poza, you seem to have not understood the intent of the code and are
now changing the rules in ways that are broken in our opinion. This is
bad.

Bjorn, please revert all of those changes.

There was NEVER an intent to separate fatal from non-fatal at that
level. We could pass the information to the driver if we wished but the
recovery sequence is NOT intended to be different.

Especially we specifically do NOT want to unplug and replug the device
for fatal errors at all. This is not going to work with drivers that
cannot re-link with their various kernel services, such as storage
devices re-connecting with mounted file systems etc...

Those changes are utterly broken.

The basic premise of the design that we woudl do that unplug/replug
trick if and ONLY IF the driver doesn't have the appropriate callbacks.

We are also now looking at replacing this with an ubind/re-bind because
in practice, the unplugging is causing us all sort of problems. Sam
(CC) can elaborate.

Bjorn, we are the main authors of that spec (Linas wrote it under my
supervision) and created those callbacks for EEH. AER picked them up
only later. Those changes must be at the very least acked by us before
going upstream.

Ben.

+ Sinan

This patch set was there in mailing list for nearly 17 to 18 revisions 
for 7 months.
besides the intent was to bring DPC and AER into the same well defined 
way of error handling.
The way DPC used to behave in 2016, is still the same; which involved 
removing and re-enumerating the devices.

Regards,
Oza.