Re: [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Thu, 7 Oct 2021 18:00:20 -0500

On Sat, Jul 31, 2021 at 02:39:01PM +0200, Lukas Wunner wrote:
> Stuart Hayes reports that an error handled by DPC at a Root Port results
> in pciehp gratuitously bringing down a subordinate hotplug port:
> 
>   RP -- UP -- DP -- UP -- DP (hotplug) -- EP
> 
> pciehp brings the slot down because the Link to the Endpoint goes down.
> That is caused by a Hot Reset being propagated as a result of DPC.
> Per PCIe Base Spec 5.0, section 6.6.1 "Conventional Reset":
> 
>   For a Switch, the following must cause a hot reset to be sent on all
>   Downstream Ports: [...]
> 
>   * The Data Link Layer of the Upstream Port reporting DL_Down status.
>     In Switches that support Link speeds greater than 5.0 GT/s, the
>     Upstream Port must direct the LTSSM of each Downstream Port to the
>     Hot Reset state, but not hold the LTSSMs in that state. This permits
>     each Downstream Port to begin Link training immediately after its
>     hot reset completes. This behavior is recommended for all Switches.
> 
>   * Receiving a hot reset on the Upstream Port.
> 
> Once DPC recovers, pcie_do_recovery() walks down the hierarchy and
> invokes pcie_portdrv_slot_reset() to restore each port's config space.
> At that point, a hotplug interrupt is signaled per PCIe Base Spec r5.0,
> section 6.7.3.4 "Software Notification of Hot-Plug Events":
> 
>   If the Port is enabled for edge-triggered interrupt signaling using
>   MSI or MSI-X, an interrupt message must be sent every time the logical
>   AND of the following conditions transitions from FALSE to TRUE: [...]
> 
>   * The Hot-Plug Interrupt Enable bit in the Slot Control register is
>     set to 1b.
> 
>   * At least one hot-plug event status bit in the Slot Status register
>     and its associated enable bit in the Slot Control register are both
>     set to 1b.
> 
> Prevent pciehp from gratuitously bringing down the slot by clearing the
> error-induced Data Link Layer State Changed event before restoring
> config space.  Afterwards, check whether the link has unexpectedly
> failed to retrain and synthesize a DLLSC event if so.
> 
> Allow each pcie_port_service_driver (one of them being pciehp) to define
> a slot_reset callback and re-use the existing pm_iter() function to
> iterate over the callbacks.
> 
> Thereby, the Endpoint driver remains bound throughout error recovery and
> may restore the device to working state.
> 
> Surprise removal during error recovery is detected through a Presence
> Detect Changed event.  The hotplug port is expected to not signal that
> event as a result of a Hot Reset.
> 
> The issue isn't DPC-specific, it also occurs when an error is handled by
> AER through aer_root_reset().  So while the issue was noticed only now,
> it's been around since 2006 when AER support was first introduced.
> 
> Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
> Link: https://lore.kernel.org/linux-pci/08c046b0-c9f2-3489-eeef-7e7aca435bb9@xxxxxxxxx/
> Reported-by: Stuart Hayes <stuart.w.hayes@xxxxxxxxx>
> Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx # v2.6.19+: ba952824e6c1: PCI/portdrv: Report reset for frozen channel

Have you tried backporting this to v2.6.19?  I bet it's tough.  Not
sure we should suggest that stable pick this up unless there's a
reasonable path to do that.

> Cc: Keith Busch <kbusch@xxxxxxxxxx>
> ---
>  drivers/pci/Kconfig               |  3 +++
>  drivers/pci/hotplug/pciehp.h      |  2 ++
>  drivers/pci/hotplug/pciehp_core.c |  4 ++++
>  drivers/pci/hotplug/pciehp_hpc.c  | 28 ++++++++++++++++++++++++++++
>  drivers/pci/pcie/portdrv.h        |  2 ++
>  drivers/pci/pcie/portdrv_core.c   | 20 ++++++++++----------
>  drivers/pci/pcie/portdrv_pci.c    |  3 +++
>  7 files changed, 52 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 0c473d75e625..a295d3c48927 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -182,6 +182,9 @@ config PCI_LABEL
>  	def_bool y if (DMI || ACPI)
>  	select NLS
>  
> +config PCI_ERROR_RECOVERY
> +	def_bool PCIEAER || EEH

I'm having a hard time connecting this to the code.  

pcie_portdrv_slot_reset() is the .slot_reset() method for
pcie_portdriver.  When CONFIG_PCIEPORTBUS=y, portdrv is bound to every
PCIe port and RCEC.

The callers of pci_driver->err_handler->slot_reset() that I see are:

  eeh_report_reset		# arch/powerpc/kernel/eeh_driver.c
  report_slot_reset		# drivers/pci/pcie/err.c
  pci_error_handlers		# drivers/misc/cxl/guest.c
  cxl_pci_slot_reset		# drivers/misc/cxl/pci.c
  pcifront_common_process	# drivers/pci/xen-pcifront.c

I guess the cxl and xen cases probably do not involve PCIe ports;
they're probably only concerned with endpoints, so maybe we can
exclude those.

But this still seems like it's kind of dangling.  It's not obvious to
me why pciehp_slot_reset() should be inside that #ifdef.  Do we need
the #ifdef for a functional reason, or is it there because we know it
will never be called?  If the latter, I don't think the savings are
worth it.

>  config PCI_HYPERV
>  	tristate "Hyper-V PCI Frontend"
>  	depends on X86_64 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && SYSFS
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index d4a930881054..7f24fe30a898 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -189,6 +189,8 @@ int pciehp_get_attention_status(struct hotplug_slot *hotplug_slot, u8 *status);
>  int pciehp_set_raw_indicator_status(struct hotplug_slot *h_slot, u8 status);
>  int pciehp_get_raw_indicator_status(struct hotplug_slot *h_slot, u8 *status);
>  
> +int pciehp_slot_reset(struct pcie_device *dev);
> +
>  static inline const char *slot_name(struct controller *ctrl)
>  {
>  	return hotplug_slot_name(&ctrl->hotplug_slot);
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index ad3393930ecb..46a62237dcc8 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -351,6 +351,10 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
>  	.runtime_suspend = pciehp_runtime_suspend,
>  	.runtime_resume	= pciehp_runtime_resume,
>  #endif	/* PM */
> +
> +#ifdef	CONFIG_PCI_ERROR_RECOVERY
> +	.slot_reset	= pciehp_slot_reset,
> +#endif
>  };
>  
>  int __init pcie_hp_init(void)
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 9d06939736c0..72ef22d0c2c9 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -862,6 +862,34 @@ void pcie_disable_interrupt(struct controller *ctrl)
>  	pcie_write_cmd(ctrl, 0, mask);
>  }
>  
> +#ifdef CONFIG_PCI_ERROR_RECOVERY
> +/**
> + * pciehp_slot_reset() - ignore link event caused by error-induced hot reset
> + * @dev: PCI Express port service device
> + *
> + * Called from pcie_portdrv_slot_reset() after AER or DPC initiated a reset
> + * further up in the hierarchy to recover from an error.  The reset was
> + * propagated down to this hotplug port.  Ignore the resulting link flap.
> + * If the link failed to retrain successfully, synthesize the ignored event.
> + * Surprise removal during reset is detected through Presence Detect Changed.
> + */
> +int pciehp_slot_reset(struct pcie_device *dev)
> +{
> +	struct controller *ctrl = get_service_data(dev);
> +
> +	if (ctrl->state != ON_STATE)
> +		return 0;
> +
> +	pcie_capability_write_word(dev->port, PCI_EXP_SLTSTA,
> +				   PCI_EXP_SLTSTA_DLLSC);
> +
> +	if (!pciehp_check_link_active(ctrl))
> +		pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
> +
> +	return 0;
> +}
> +#endif
> +
>  /*
>   * pciehp has a 1:1 bus:slot relationship so we ultimately want a secondary
>   * bus reset of the bridge, but at the same time we want to ensure that it is
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index 2ff5724b8f13..92a776d211ec 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -85,6 +85,7 @@ struct pcie_port_service_driver {
>  	int (*runtime_suspend)(struct pcie_device *dev);
>  	int (*runtime_resume)(struct pcie_device *dev);
>  
> +	int (*slot_reset)(struct pcie_device *dev);
>  	/* Device driver may resume normal operations */
>  	void (*error_resume)(struct pci_dev *dev);
>  
> @@ -110,6 +111,7 @@ void pcie_port_service_unregister(struct pcie_port_service_driver *new);
>  
>  extern struct bus_type pcie_port_bus_type;
>  int pcie_port_device_register(struct pci_dev *dev);
> +int pcie_port_device_iter(struct device *dev, void *data);
>  #ifdef CONFIG_PM
>  int pcie_port_device_suspend(struct device *dev);
>  int pcie_port_device_resume_noirq(struct device *dev);
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index e1fed6649c41..ebcec7daa245 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -362,24 +362,24 @@ int pcie_port_device_register(struct pci_dev *dev)
>  	return status;
>  }
>  
> -#ifdef CONFIG_PM
> -typedef int (*pcie_pm_callback_t)(struct pcie_device *);
> +typedef int (*pcie_callback_t)(struct pcie_device *);
>  
> -static int pm_iter(struct device *dev, void *data)
> +int pcie_port_device_iter(struct device *dev, void *data)

I like this change, and it seems like it's basically a rename that
could be split off from rest to make the slot_reset part a little more
focused.

>  {
>  	struct pcie_port_service_driver *service_driver;
>  	size_t offset = *(size_t *)data;
> -	pcie_pm_callback_t cb;
> +	pcie_callback_t cb;
>  
>  	if ((dev->bus == &pcie_port_bus_type) && dev->driver) {
>  		service_driver = to_service_driver(dev->driver);
> -		cb = *(pcie_pm_callback_t *)((void *)service_driver + offset);
> +		cb = *(pcie_callback_t *)((void *)service_driver + offset);
>  		if (cb)
>  			return cb(to_pcie_device(dev));
>  	}
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PM
>  /**
>   * pcie_port_device_suspend - suspend port services associated with a PCIe port
>   * @dev: PCI Express port to handle
> @@ -387,13 +387,13 @@ static int pm_iter(struct device *dev, void *data)
>  int pcie_port_device_suspend(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, suspend);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  int pcie_port_device_resume_noirq(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  /**
> @@ -403,7 +403,7 @@ int pcie_port_device_resume_noirq(struct device *dev)
>  int pcie_port_device_resume(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, resume);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  /**
> @@ -413,7 +413,7 @@ int pcie_port_device_resume(struct device *dev)
>  int pcie_port_device_runtime_suspend(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, runtime_suspend);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  /**
> @@ -423,7 +423,7 @@ int pcie_port_device_runtime_suspend(struct device *dev)
>  int pcie_port_device_runtime_resume(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  #endif /* PM */
>  
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index c7ff1eea225a..1af74c3d9d5d 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -160,6 +160,9 @@ static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>  
>  static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
>  {
> +	size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
> +	device_for_each_child(&dev->dev, &off, pcie_port_device_iter);
> +
>  	pci_restore_state(dev);
>  	pci_save_state(dev);
>  	return PCI_ERS_RESULT_RECOVERED;
> -- 
> 2.31.1
>