On Wed, 2021-09-08 at 11:37 +1000, Oliver O'Halloran wrote:
> On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:
> > On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:
> > > > > Patch 3 I already sent separately resulting in the discussion below but without
> > > > > a final conclusion.
> > > > >
> > > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schnelle@xxxxxxxxxxxxx/
> > > > >
> > > > > I believe even though there were some doubts about the use of
> > > > > pci_dev_is_added() by arch code the existing uses as well as the use in the
> > > > > final patch of this series warrant this export.
> > > >
> > > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > > fixed a while ago so It should be safe to remove those calls now.
> > >
> > > Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
> > > I can certainly sent a patch for that. This would then leave only the
> > > existing use in s390 which I added because of a dead lock prevention
> > > and explained here:
> > > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.camel@xxxxxxxxxxxxx/
> > >
> > > Plus the need to use it in the recovery code of this series. I think in
> > > the EEH code the need for a similar check is alleviated by the checks
> > > in the beginning of
> > > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > > I guess we could use our own state tracking in a similar way but felt
> > > like pci_dev_is_added() is the more logical choice.
>
> The slot check is mainly there to prevent attempts to "recover"
> devices that have been surprise removed (i.e NVMe hot-unplug). The
> actual recovery process operates off the eeh_pe tree which is frozen
> in place when an error is detected. If a pci_dev is added or removed
> it's not really a problem since those are only ever looked at when
> notifying drivers which is done with the rescan_remove lock held.

Thanks for the explanation.

> That
> said, I wouldn't really encourage anyone to follow the EEH model since
> it's pretty byzantine.
>
> > Looking into this again, I think we actually can't easily track this
> > state ourselves outside struct pci_dev. The reason for this is that
> > when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> > pci_dev and scans it again the new struct pci_dev re-uses the same
> > struct zpci_dev because from a platform point of view the PCI device
> > was never removed but only disabled and re-enabled. Thus we can only
> > distinguish the stale struct pci_dev by looking at things stored in
> > struct pci_dev itself.
>
> IMO the real problem is removing and re-adding the pci_dev. I think
> it's something that's done largely because the PCI core doesn't really
> provide any better mechanism for getting a device back into a
> known-good state so it's abused to implement error recovery. This is
> something that's always annoyed me since it conflates recovery with
> hotplug. After a hot-(un)plug we might have a different device or no
> device. In the recovery case we expect to start and end with the same
> device. Why not apply the same logic to the pci_dev?

For us there are two cases. First, the existing
/sys/bus/pci/devices/<dev>/recover attribute. This does the pci_dev
remove and re-add that you mention, and thus we end up with a new
pci_dev afterwards. I agree that is kind of a dumb way to recover which
(too?) closely resembles unplug/re-plug. Secondly, the automatic error
recovery added in this series.
Here we only attempt recovery if we have a driver bound that supports
the error callbacks, thus always keeping the same pci_dev. If there is
no driver we give up on automatic recovery and are back at the
situation without this series.

>
> Something I was tinkering with before I left IBM was re-working the
> way EEH handles recovering devices that don't have a driver with error
> handling callbacks to something like:
>
> 1. unbind the driver
> 2. pci_save_state()
> 3. do the reset
> 4. pci_restore_state()
> 5. re-bind the driver
>
> That would allow keeping the pci_dev around and let me delete a pile
> of confusing code which handles binding the eeh_dev to the new
> pci_dev.

This sounds like an interesting future approach for us too. Thankfully
our binding of the zpci_dev to the new pci_dev is pretty simple by now.
The main trouble with removing and re-adding a pci_dev is then that
upper layers like block devices are also re-created, which really only
happens if we have a driver bound.

> The obvious problem with that approach is the assumption the
> device is functional enough to allow saving the config space, but I
> don't think that's a deal breaker. We could stash a copy of the device
> state before we allow drivers to attach and use that to restore the
> device after the reset. The end result would be the same known-good
> state that we'd get after a re-scan.
>
> > That said, I think for the recovery case we might be able to drop the
> > pci_dev_is_added() and rely on pdev->driver != NULL which we check
> > anyway and that should catch any PCI device that was already removed.
>
> Would that work if there was an error on a device without a driver
> bound?

For the automatic recovery flow introduced by this series we only
recover if such a driver is bound anyway, so that is already a
requirement. Luckily all physical PCI devices we support on our
platform have drivers with that support.
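
To make the unbind/save/reset/restore/re-bind sequence quoted earlier in
this mail a bit more concrete, here is a rough userspace model in C. It
is not kernel code and does not call pci_save_state() or
pci_restore_state(); struct model_dev and all model_* helpers are
hypothetical names made up for this sketch. It follows the variant where
a copy of the state is stashed before a driver attaches, so recovery
does not depend on reading a broken device.

```c
/* Userspace model only: none of these types or helpers are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_CFG_WORDS 16

struct model_dev {
	unsigned int config[MODEL_CFG_WORDS]; /* stand-in for config space */
	unsigned int saved[MODEL_CFG_WORDS];  /* known-good copy */
	bool has_saved;                       /* true once a copy was stashed */
	bool driver_bound;                    /* stand-in for a bound driver */
};

/* Stash the current state as the known-good copy. */
static void model_stash_state(struct model_dev *dev)
{
	assert(dev != NULL);
	memcpy(dev->saved, dev->config, sizeof(dev->saved));
	dev->has_saved = true;
}

/* Bind a driver; the state is stashed first if that has not happened yet. */
static void model_bind(struct model_dev *dev)
{
	assert(dev != NULL);
	if (!dev->has_saved)
		model_stash_state(dev);
	dev->driver_bound = true;
}

/* A reset wipes whatever is currently in the modelled config space. */
static void model_reset(struct model_dev *dev)
{
	assert(dev != NULL);
	memset(dev->config, 0, sizeof(dev->config));
}

/*
 * Recover the same device object in place.
 * Returns 0 on success, -1 if no known-good state was ever stashed.
 */
static int model_recover(struct model_dev *dev)
{
	bool was_bound;

	assert(dev != NULL);
	if (!dev->has_saved)
		return -1;

	was_bound = dev->driver_bound;
	dev->driver_bound = false;                            /* 1. unbind   */
	/* 2. save: skipped here, the copy was stashed before binding */
	model_reset(dev);                                     /* 3. reset    */
	memcpy(dev->config, dev->saved, sizeof(dev->config)); /* 4. restore  */
	dev->driver_bound = was_bound;                        /* 5. re-bind  */
	return 0;
}
```

The point of the model is only that the same device object survives the
whole sequence, which is what would avoid re-creating upper layers such
as block devices.
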
> If you're just trying to stop races between recovery and device
> removal then pci_dev_is_added() is probably the right tool for the
> job. Trying to substitute it with a proxy seems like a bad idea.

Yes, I believe that at least for the existing recover attribute, which
does not require a bound driver, we still need pci_dev_is_added(). For
the automatic recovery flow I think it would be okay to rely on the
fact that removed devices don't have a driver bound, since the recovery
requires a bound driver anyway. But yes, an explicit pci_dev_is_added()
check as in this patch does feel cleaner.
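
As a rough illustration of why the proxy only works for the automatic
flow, here is a small userspace model in C. It is not kernel code:
struct model_pci_dev and the model_* helpers are hypothetical, with
is_added standing in for what pci_dev_is_added() reports and
driver_bound for pdev->driver != NULL. The model assumes that removal
both unbinds the driver and clears the added state.

```c
/* Userspace model only: nothing here is a kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_pci_dev {
	bool is_added;     /* models what pci_dev_is_added() would report */
	bool driver_bound; /* models pdev->driver != NULL */
};

/* Assumption of this model: removal unbinds the driver and clears is_added. */
static void model_remove(struct model_pci_dev *pdev)
{
	assert(pdev != NULL);
	pdev->driver_bound = false;
	pdev->is_added = false;
}

/* Explicit check: gives the right answer with or without a bound driver. */
static bool model_live_explicit(const struct model_pci_dev *pdev)
{
	return pdev != NULL && pdev->is_added;
}

/* Proxy check: only equivalent for flows that require a bound driver. */
static bool model_live_proxy(const struct model_pci_dev *pdev)
{
	return pdev != NULL && pdev->driver_bound;
}
```

In this model the two checks agree for a bound device and for a removed
one, but differ for a live device without a driver, which is the case
the recover attribute has to handle.
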