Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed

poza@xxxxxxxxxxxxxx · Mon, 20 Aug 2018 13:26:01 +0530

On 2018-08-20 11:03, Benjamin Herrenschmidt wrote:
On Mon, 2018-08-20 at 10:49 +0530, poza@xxxxxxxxxxxxxx wrote:

Reverting spec/Documentation which is fine by me.

But the good thing has happened now is; we can have very clear
definition for the framework to go forward.
e.g. how the errors have to be handled.

Because of those patches, the whole error framework is under common 
code
base and now has become independent of service e.g. AER, DPC etc..

Well, EEH isn't yet :-) But then the EEH code is a real mess buried in
arch/powerpc. Sam (CC) is trying to improve that situation and I might
step in as well to help if we think we can make things more common, it
would definitely help.

That enables us to define or extend policies in more clearly defined 
way
irrespective of what services are running.

Now it is just that we have to change in err.c and walk away with the
policies what we want to enforce.

let me know how this sounds Ben.

So for now, I've sent a revert patch for the Documentation/ bit to
Bjorn, and I have no (not yet at least) beef in what you do in
drivers/pci/* ... however, that said, I think it would be great to move
EEH toward having a bulk of the policy use common code as well.

In my opinion, I think revert patch can wait, because again we might 
need to change it according to improvements we bring.

It will be long road, in part due to the historical crappyness of our
EEH code, so my thinking is we should:

 - First agree on what we want the policy to be. I need to read a bit
more about DPC since that's new to me, it seems to be similar to what
our EEH does, with slighty less granularity (we can freeze access to
individual functions for example).

 - Rework err.c to implement that policy with the existing AER and DPC
code.

I went through eeh-pci-error-recovery.txt
If I pick up folowing

"
If the OS or device driver suspects that a PCI slot has been 
EEH-isolated, there is a firmware call it can make to determine if this 
is the case. If so, then the device driver should put itself into a 
consistent state (given that it won't be able to complete any pending 
work) and start recovery of the card.  Recovery normally would consist 
of resetting the PCI device (holding the PCI #RST line high for two 
seconds), followed by setting up the device config space (the base 
address registers (BAR's), latency timer, cache line size, interrupt 
line, and so on).  This is followed by a reinitialization of the device 
driver.  In a worst-case scenario, the power to the card can be toggled, 
at least on hot-plug-capable slots.  In principle, layers far above the 
device driver probably do not need to know that the PCI card has been 
"rebooted" in this way; ideally, there should be at most a pause in 
Ethernet/disk/USB I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the 
kernel/device driver should assume the worst-case scenario, that the 
card has died completely, and report this error to the sysadmin. In 
addition, error messages are reported through RTAS and also through 
syslogd (/var/log/messages) to alert the sysadmin of PCI resets. The 
correct way to deal with failed adapters is to use the standard PCI 
hotplug tools to remove and replace the dead card.
"

some differences: I find is:
Current framework does not attempt recovery 3-4 times.

Besides If I grasp eeh-pci-error-recovery.txt correctly, (although I 
could be easily wrong),
mainly the difference is coming how we reset the device.

Linux uses SBR, while EEH seems to be using #PERST.
PERST signal implementation might be board specific, e.g. we have 
io-port-expander sitting on i2c to drive #PERST.
we rely on SBR. and SBR also should be a good way to bring hotreset of 
device.
"All Lanes in the configured Link transmit TS1 Ordered Sets with the Hot 
Reset bit 15 asserted and the configured Link and Lane numbers."
although I am not sure between #PERST and SBR is there is any subtle 
differences such as some sticky bits might not be affected by SBR.
but so far SBR has served well for us.

Also I see that eeh-pci-error-recovery.txt does not seem to distinguish 
between fatal and nonfatal errors, is the behavior same for both type of 
errors in EEH ?

I would like to mention is that there should nto be any need to reset 
the card in case of NONFATAL because by definition link is functional, 
only the particular transaction
had a problem. so not sure how EEH deals with ERR_NONFATAL.

Also I am very much interested in knowing original intention of DPC 
driver to unplug/plug devices,
all I remember in some conversation was:
hotplug capable bridge might have see devices changed, so it is safer to 
remove/unplug the devices and during which .shutdown methods of driver 
is called, in case of ERR_FATAL.
although DPC is HW recovery while AER is sw recovery both should 
fundamentally act in the same way as far as device drivers callbacks are 
concerned.
(again I really dont know real motivation behind this)

Regards,
Oza.

 - Figure out what hooks might be needed to be able to plumb EEH into
it, possibly removing a bunch of crap in arch/powerpc (yay !)

I don't think having a webex will be that practical with the timezones
involved. I'm trying to get approval to go to Plumbers in which case we
could setup a BOF but I have no guarantee at this point that I can make
it happen.

So let's try using email as much possible for now.

Cheers,
Ben.