Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Bjorn,

On 7/24/20 12:07 PM, sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx wrote:
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>

Current pcie_do_recovery() implementation has following two issues:

1. Fatal (DPC) error recovery is currently broken for non-hotplug
capable devices. Current fatal error recovery implementation relies
on PCIe hotplug (pciehp) handler for detaching and re-enumerating
the affected devices/drivers. pciehp handler listens for DLLSC state
changes and handles device/driver detachment on DLLSC_LINK_DOWN event
and re-enumeration on DLLSC_LINK_UP event. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. Correct implementation should
restore the device state and call report_slot_reset() function after
resetting the link to restore the state of the device/driver.

You can find fatal non-hotplug related issues reported in following links:

https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@xxxxxxx/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx/

2. For non-fatal errors if report_error_detected() or
report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then
current pcie_do_recovery() implementation does not do the requested
explicit device reset, instead just calls the report_slot_reset() on all
affected devices. Notifying about the reset via report_slot_reset()
without doing the actual device reset is incorrect.

To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
successful reset_link() operation. This will ensure ->slot_reset() be
called after reset_link() operation for fatal errors. Also call
pci_bus_reset() to do slot/bus reset() before triggering device specific
->slot_reset() callback. Also, using pci_bus_reset() will restore the state
of the devices after performing the reset operation.

Even though using pci_bus_reset() will do redundant reset operation after
->reset_link() for fatal errors, it should should affect the functional
behavior.
Any comment on this patch?

[original patch is from jay.vosburgh@xxxxxxxxxxxxx]
[original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
Signed-off-by: Jay Vosburgh <jay.vosburgh@xxxxxxxxxxxxx>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
---

Changes since v2:
  * Changed the subject of patch to "PCI/ERR: Fix reset logic in
    pcie_do_recovery() call". v2 patch link is,
    https://lore.kernel.org/linux-pci/ce417fbf81a8a46a89535f44b9224ee9fbb55a29.1591307288.git.sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx/
  * Squashed "PCI/ERR: Add reset support for non fatal errors" patch.

  drivers/pci/pcie/err.c | 41 +++++++++++++++++++++++++++++++++++++----
  1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..b5eb6ba65be1 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,8 +165,29 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
  	pci_dbg(dev, "broadcast error_detected message\n");
  	if (state == pci_channel_io_frozen) {
  		pci_walk_bus(bus, report_frozen_detected, &status);
+		/*
+		 * After resetting the link using reset_link() call, the
+		 * possible value of error status is either
+		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
+		 * PCI_ERS_RESULT_NEED_RESET (success case).
+		 * So ignore the return value of report_error_detected()
+		 * call for fatal errors.
+		 *
+		 * In EDR mode, since AER and DPC Capabilities are owned by
+		 * firmware, reported_error_detected() will return error
+		 * status PCI_ERS_RESULT_NO_AER_DRIVER. Continuing
+		 * pcie_do_recovery() with error status as
+		 * PCI_ERS_RESULT_NO_AER_DRIVER will report recovery failure
+		 * irrespective of recovery status. But successful reset_link()
+		 * call usually recovers all fatal errors. So ignoring the
+		 * status result of report_error_detected() also helps EDR based
+		 * error recovery.
+		 */
  		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		if (status == PCI_ERS_RESULT_RECOVERED) {
+			status = PCI_ERS_RESULT_NEED_RESET;
+		} else {
+			status = PCI_ERS_RESULT_DISCONNECT;
  			pci_warn(dev, "link reset failed\n");
  			goto failed;
  		}
@@ -182,10 +203,22 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
if (status == PCI_ERS_RESULT_NEED_RESET) {
  		/*
-		 * TODO: Should call platform-specific
-		 * functions to reset slot before calling
-		 * drivers' slot_reset callbacks?
+		 * TODO: Optimize the call to pci_reset_bus()
+		 *
+		 * There are two components to pci_reset_bus().
+		 *
+		 * 1. Do platform specific slot/bus reset.
+		 * 2. Save/Restore all devices in the bus.
+		 *
+		 * For hotplug capable devices and fatal errors,
+		 * device is already in reset state due to link
+		 * reset. So repeating platform specific slot/bus
+		 * reset via pci_reset_bus() call is redundant. So
+		 * can optimize this logic and conditionally call
+		 * pci_reset_bus().
  		 */
+		pci_reset_bus(dev);
+
  		status = PCI_ERS_RESULT_RECOVERED;
  		pci_dbg(dev, "broadcast slot_reset message\n");
  		pci_walk_bus(bus, report_slot_reset, &status);


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux