On 9/24/2020 12:06 AM, Kuppuswamy, Sathyanarayanan wrote: > For problem description, please check the following details > > Current pcie_do_recovery() implementation has following two issues: > > 1. Fatal (DPC) error recovery is currently broken for non-hotplug > capable devices. Current fatal error recovery implementation relies > on PCIe hotplug (pciehp) handler for detaching and re-enumerating > the affected devices/drivers. pciehp handler listens for DLLSC state > changes and handles device/driver detachment on DLLSC_LINK_DOWN event > and re-enumeration on DLLSC_LINK_UP event. So when dealing with > non-hotplug capable devices, recovery code does not restore the state > of the affected devices correctly. Correct implementation should > restore the device state and call report_slot_reset() function after > resetting the link to restore the state of the device/driver. > So, this is a matter of moving the save/restore logic from the hotplug driver into common code so that DPC slot reset takes advantage of it? If that's direction we are going, this is fine change IMO. > You can find fatal non-hotplug related issues reported in following links: > > https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@xxxxxxx/ > > https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx/ > > > 2. For non-fatal errors if report_error_detected() or > report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then > current pcie_do_recovery() implementation does not do the requested > explicit device reset, instead just calls the report_slot_reset() on all > affected devices. Notifying about the reset via report_slot_reset() > without doing the actual device reset is incorrect. > This makes sens too. There seems to be an ordering issue per your description. > To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after > successful reset_link() operation. This will ensure ->slot_reset() be > called after reset_link() operation for fatal errors. You lost me here. Why do we want to do secondary bus reset on top of DPC reset? > Also call > pci_bus_reset() to do slot/bus reset() before triggering device specific > ->slot_reset() callback. Also, using pci_bus_reset() will restore the state > of the devices after performing the reset operation. > > Even though using pci_bus_reset() will do redundant reset operation after > ->reset_link() for fatal errors, it should should affect the functional > behavior.