Re: [PATCH 0/5] s390/pci: automatic error recovery

"Oliver O'Halloran" <oohall@xxxxxxxxx> · Wed, 8 Sep 2021 11:37:11 +1000

On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:
>
> On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:
> > > > Patch 3 I already sent separately resulting in the discussion below but without
> > > > a final conclusion.
> > > >
> > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schnelle@xxxxxxxxxxxxx/
> > > >
> > > > I believe even though there were some doubts about the use of
> > > > pci_dev_is_added() by arch code the existing uses as well as the use in the
> > > > final patch of this series warrant this export.
> > >
> > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > fixed a while ago so It should be safe to remove those calls now.
> >
> > Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
> > I can certainly sent a patch for that. This would then leave only the
> > existing use in s390 which I added because of a dead lock prevention
> > and explained here:
> > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.camel@xxxxxxxxxxxxx/
> >
> > Plus the need to use it in the recovery code of this series. I think in
> > the EEH code the need for a similar check is alleviated by the checks
> > in the beginning of
> > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > I guess we could use our own state tracking in a similar way but felt
> > like pci_dev_is_added() is the more logical choice.

The slot check is mainly there to prevent attempts to "recover"
devices that have been surprise removed (i.e NVMe hot-unplug). The
actual recovery process operates off the eeh_pe tree which is frozen
in place when an error is detected. If a pci_dev is added or removed
it's not really a problem since those are only ever looked at when
notifying drivers which is done with the rescan_remove lock held. That
said, I wouldn't really encourage anyone to follow the EEH model since
it's pretty byzantine.

> Looking into this again, I think we actually can't easily track this
> state ourselves outside struct pci_dev. The reason for this is that
> when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> pci_dev and scans it again the new struct pci_dev re-uses the same
> struct zpci_dev because from a platform point of view the PCI device
> was never removed but only disabled and re-enabled. Thus we can only
> distinguish the stale struct pci_dev by looking at things stored in
> struct pci_dev itself.

IMO the real problem is removing and re-adding the pci_dev. I think
it's something that's done largely because the PCI core doesn't really
provide any better mechanism for getting a device back into a
known-good state so it's abused to implement error recovery. This is
something that's always annoyed me since it conflates recovery with
hotplug. After a hot-(un)plug we might have a different device or no
device. In the recovery case we expect to start and end with the same
device. Why not apply the same logic to the pci_dev?

Something I was tinkering with before I left IBM was re-working the
way EEH handles recovering devices that don't have a driver with error
handling callbacks to something like:

1. unbind the driver
2. pci_save_state()
3. do the reset
4. pci_restore_state()
5. re-bind the driver

That would allow keeping the pci_dev around and let me delete a pile
of confusing code which handles binding the eeh_dev to the new
pci_dev. The obvious problem with that approach is the assumption the
device is functional enough to allow saving the config space, but I
don't think that's a deal breaker. We could stash a copy of the device
state before we allow drivers to attach and use that to restore the
device after the reset. The end result would be the same known-good
state that we'd get after a re-scan.

> That said, I think for the recovery case we might be able to drop the
> pci_dev_is_added() and rely on pdev->driver != NULL which we check
> anyway and that should catch any PCI device that was already removed.

Would that work if there was an error on a device without a driver
bound? If you're just trying to stop races between recovery and device
removal then pci_dev_is_added() is probably the right tool for the
job. Trying to substitute it with a proxy seems like a bad idea.