Hi All, Just wanted to touch base on the status of this patch. It resolves a deadlock I am seeing between AER and Hotplug when a controller is reset. I have a machine that can consistently recreate this deadlock, so if there is any further investigation needed I'd be willing to help in any way I can. Thanks, Ian On 2020-07-21 13:17:50 , Lukas Wunner wrote: > Back in 2013, commits > > 2e35afaefe64 ("PCI: pciehp: Add reset_slot() method") > 608c388122c7 ("PCI: Add slot reset option to pci_dev_reset()") > > introduced the callback pciehp_reset_slot() to the PCIe hotplug driver > and amended __pci_dev_reset() (today __pci_reset_function_locked()) to > invoke it when resetting a hotplug port's child. The callback performs > a Secondary Bus Reset and ensures that an ensuing link or presence flap > is ignored by pciehp. > > However the commits did not perform any locking, in particular: > > * No precautions were taken to prevent concurrent execution of the new > callback with pciehp's IRQ handler or a sysfs request to bring the > slot up or down. These code paths may see flapping link or presence > bits during a slot reset. > > * pciehp is not prevented from unbinding while the new callback accesses > its struct controller. Commit 608c388122c7 did take a reference on > pciehp's module, but that's not sufficient. It only keeps pciehp's > code in memory, but doesn't prevent unbinding. > > * In pci_dev_reset_slot_function(), commit 608c388122c7 iterates over > the devices on a bus without holding pci_bus_sem. > > In 2018, commit > > 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset") > > sought to address the first of these locking issues: It introduced a > reset_lock which serializes a slot reset with other parts of pciehp. > > But Michael Haeuptle reports that deadlocks now occur on simultaneous > hot-removal and reset of vfio devices because pciehp acquires the > reset_lock and the device_lock in a different order than > pci_try_reset_function(): > > pciehp_ist() # down_read(reset_lock) > pciehp_handle_presence_or_link_change() > pciehp_disable_slot() > __pciehp_disable_slot() > remove_board() > pciehp_unconfigure_device() > pci_stop_and_remove_bus_device() > pci_stop_bus_device() > pci_stop_dev() > device_release_driver() > device_release_driver_internal() > __device_driver_lock() # device_lock() > > SYS_munmap() > vfio_device_fops_release() > vfio_pci_release() > vfio_pci_disable() > pci_try_reset_function() # device_lock() > __pci_reset_function_locked() > pci_reset_hotplug_slot() > pciehp_reset_slot() # down_write(reset_lock) > > Ian May reports the same deadlock on simultaneous hot-removal and AER > reset: > > aer_recover_work_func() > pcie_do_recovery() > aer_root_reset() > pci_bus_error_reset() > pci_slot_reset() > pci_slot_lock() # device_lock() > pci_reset_hotplug_slot() > pciehp_reset_slot() # down_write(reset_lock) > > Fix by pushing the reset_lock out of pciehp's struct controller and into > struct pci_slot such that it can be taken by the PCI core before taking > the device lock. > > There's a catch though: Some drivers call __pci_reset_function_locked() > directly and the function expects that all necessary locks, including > the reset_lock, have been acquired by the caller. There are callers > which already hold the device_lock, so they can't acquire the reset_lock > without re-introducing the AB-BA deadlock: > > * drivers/net/ethernet/cavium/liquidio/lio_main.c: octeon_pci_flr() > * drivers/xen/xen-pciback/pci_stub.c: pcistub_device_release() > * drivers/xen/xen-pciback/pci_stub.c: pcistub_init_device() (if called > from pcistub_seize()) > > In the case of octeon_pci_flr(), the device is reset on driver unbind, > which is why the device_lock is already held. A possible solution might > be to add a flag in struct pci_dev with which drivers tell the PCI core > that the device is handed back in an unclean state and needs a reset. > The PCI core would then perform the reset on behalf of the driver after > it has unbound and before any new driver is bound. > > As for xen, this patch (which was never applied) explains that a reset > is performed on bind, unbind and on un-assigning a device from a guest: > > https://lore.kernel.org/patchwork/patch/848180/ > > The unbind code path could be solved by the same solution as for octeon > and it may also be possible to make it work on bind, though it's unclear > why a reset on bind is at all necessary. The un-assigning code path is > fixed by the present commit I think. > > For now, the three functions do not acquire the reset_lock. I'm > inserting a lockdep_assert_held_write() so that a lockdep splat is shown > as a reminder that liquidio and xen require fixing. > > Fixes: 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset") > Link: https://lore.kernel.org/linux-pci/CS1PR8401MB0728FC6FDAB8A35C22BD90EC95F10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx > Link: https://lore.kernel.org/linux-pci/20200615143250.438252-1-ian.may@xxxxxxxxxxxxx > Reported-and-tested-by: Michael Haeuptle <michael.haeuptle@xxxxxxx> > Reported-and-tested-by: Ian May <ian.may@xxxxxxxxxxxxx> > Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx # v4.19+ > Cc: Alex Williamson <alex.williamson@xxxxxxxxxx> > --- > drivers/pci/hotplug/pciehp.h | 5 ----- > drivers/pci/hotplug/pciehp_core.c | 4 ++-- > drivers/pci/hotplug/pciehp_hpc.c | 12 ++++++------ > drivers/pci/pci.c | 17 +++++++++++++++++ > drivers/pci/slot.c | 2 ++ > drivers/vfio/pci/vfio_pci.c | 19 +++++++++++++------ > drivers/xen/xen-pciback/passthrough.c | 14 ++++++++++++-- > drivers/xen/xen-pciback/pci_stub.c | 6 ++++++ > drivers/xen/xen-pciback/vpci.c | 14 ++++++++++++-- > include/linux/pci.h | 8 +++++++- > 10 files changed, 77 insertions(+), 24 deletions(-) > > diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h > index 4fd200d..676e579 100644 > --- a/drivers/pci/hotplug/pciehp.h > +++ b/drivers/pci/hotplug/pciehp.h > @@ -20,7 +20,6 @@ > #include <linux/pci_hotplug.h> > #include <linux/delay.h> > #include <linux/mutex.h> > -#include <linux/rwsem.h> > #include <linux/workqueue.h> > > #include "../pcie/portdrv.h" > @@ -69,9 +68,6 @@ > * @button_work: work item to turn the slot on or off after 5 seconds > * in response to an Attention Button press > * @hotplug_slot: structure registered with the PCI hotplug core > - * @reset_lock: prevents access to the Data Link Layer Link Active bit in the > - * Link Status register and to the Presence Detect State bit in the Slot > - * Status register during a slot reset which may cause them to flap > * @ist_running: flag to keep user request waiting while IRQ thread is running > * @request_result: result of last user request submitted to the IRQ thread > * @requester: wait queue to wake up on completion of user request, > @@ -102,7 +98,6 @@ struct controller { > struct delayed_work button_work; > > struct hotplug_slot hotplug_slot; /* hotplug core interface */ > - struct rw_semaphore reset_lock; > unsigned int ist_running; > int request_result; > wait_queue_head_t requester; > diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c > index bf779f2..cdb241b 100644 > --- a/drivers/pci/hotplug/pciehp_core.c > +++ b/drivers/pci/hotplug/pciehp_core.c > @@ -165,7 +165,7 @@ static void pciehp_check_presence(struct controller *ctrl) > { > int occupied; > > - down_read(&ctrl->reset_lock); > + down_read(&ctrl->hotplug_slot.pci_slot->reset_lock); > mutex_lock(&ctrl->state_lock); > > occupied = pciehp_card_present_or_link_active(ctrl); > @@ -176,7 +176,7 @@ static void pciehp_check_presence(struct controller *ctrl) > pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC); > > mutex_unlock(&ctrl->state_lock); > - up_read(&ctrl->reset_lock); > + up_read(&ctrl->hotplug_slot.pci_slot->reset_lock); > } > > static int pciehp_probe(struct pcie_device *dev) > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 53433b3..a1c9072 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -706,13 +706,17 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) > /* > * Disable requests have higher priority than Presence Detect Changed > * or Data Link Layer State Changed events. > + * > + * A slot reset may cause flaps of the Presence Detect State bit in the > + * Slot Status register and the Data Link Layer Link Active bit in the > + * Link Status register. Prevent by holding the reset lock. > */ > - down_read(&ctrl->reset_lock); > + down_read(&ctrl->hotplug_slot.pci_slot->reset_lock); > if (events & DISABLE_SLOT) > pciehp_handle_disable_request(ctrl); > else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) > pciehp_handle_presence_or_link_change(ctrl, events); > - up_read(&ctrl->reset_lock); > + up_read(&ctrl->hotplug_slot.pci_slot->reset_lock); > > ret = IRQ_HANDLED; > out: > @@ -841,8 +845,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe) > if (probe) > return 0; > > - down_write(&ctrl->reset_lock); > - > if (!ATTN_BUTTN(ctrl)) { > ctrl_mask |= PCI_EXP_SLTCTL_PDCE; > stat_mask |= PCI_EXP_SLTSTA_PDC; > @@ -861,7 +863,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe) > ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__, > pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask); > > - up_write(&ctrl->reset_lock); > return rc; > } > > @@ -925,7 +926,6 @@ struct controller *pcie_init(struct pcie_device *dev) > ctrl->slot_cap = slot_cap; > mutex_init(&ctrl->ctrl_lock); > mutex_init(&ctrl->state_lock); > - init_rwsem(&ctrl->reset_lock); > init_waitqueue_head(&ctrl->requester); > init_waitqueue_head(&ctrl->queue); > INIT_DELAYED_WORK(&ctrl->button_work, pciehp_queue_pushbutton_work); > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 45c51af..455da72 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -4902,6 +4902,8 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, int probe) > if (!hotplug || !try_module_get(hotplug->owner)) > return rc; > > + lockdep_assert_held_write(&hotplug->pci_slot->reset_lock); > + > if (hotplug->ops->reset_slot) > rc = hotplug->ops->reset_slot(hotplug, probe); > > @@ -5110,6 +5112,8 @@ int pci_reset_function(struct pci_dev *dev) > if (!dev->reset_fn) > return -ENOTTY; > > + if (dev->slot) > + down_write(&dev->slot->reset_lock); > pci_dev_lock(dev); > pci_dev_save_and_disable(dev); > > @@ -5117,6 +5121,8 @@ int pci_reset_function(struct pci_dev *dev) > > pci_dev_restore(dev); > pci_dev_unlock(dev); > + if (dev->slot) > + up_write(&dev->slot->reset_lock); > > return rc; > } > @@ -5169,6 +5175,9 @@ int pci_try_reset_function(struct pci_dev *dev) > if (!dev->reset_fn) > return -ENOTTY; > > + if (dev->slot && !down_write_trylock(&dev->slot->reset_lock)) > + return -EAGAIN; > + > if (!pci_dev_trylock(dev)) > return -EAGAIN; > > @@ -5176,6 +5185,8 @@ int pci_try_reset_function(struct pci_dev *dev) > rc = __pci_reset_function_locked(dev); > pci_dev_restore(dev); > pci_dev_unlock(dev); > + if (dev->slot) > + up_write(&dev->slot->reset_lock); > > return rc; > } > @@ -5274,6 +5285,7 @@ static void pci_slot_lock(struct pci_slot *slot) > { > struct pci_dev *dev; > > + down_write(&slot->reset_lock); > list_for_each_entry(dev, &slot->bus->devices, bus_list) { > if (!dev->slot || dev->slot != slot) > continue; > @@ -5295,6 +5307,7 @@ static void pci_slot_unlock(struct pci_slot *slot) > pci_bus_unlock(dev->subordinate); > pci_dev_unlock(dev); > } > + up_write(&slot->reset_lock); > } > > /* Return 1 on successful lock, 0 on contention */ > @@ -5302,6 +5315,9 @@ static int pci_slot_trylock(struct pci_slot *slot) > { > struct pci_dev *dev; > > + if (!down_write_trylock(&slot->reset_lock)) > + return 0; > + > list_for_each_entry(dev, &slot->bus->devices, bus_list) { > if (!dev->slot || dev->slot != slot) > continue; > @@ -5325,6 +5341,7 @@ static int pci_slot_trylock(struct pci_slot *slot) > pci_bus_unlock(dev->subordinate); > pci_dev_unlock(dev); > } > + up_write(&slot->reset_lock); > return 0; > } > > diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c > index cc386ef..e8e7d09 100644 > --- a/drivers/pci/slot.c > +++ b/drivers/pci/slot.c > @@ -279,6 +279,8 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr, > INIT_LIST_HEAD(&slot->list); > list_add(&slot->list, &parent->slots); > > + init_rwsem(&slot->reset_lock); > + > down_read(&pci_bus_sem); > list_for_each_entry(dev, &parent->devices, bus_list) > if (PCI_SLOT(dev->devfn) == slot_nr) > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > index f634c81..260650e 100644 > --- a/drivers/vfio/pci/vfio_pci.c > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -454,13 +454,20 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev) > * We can not use the "try" reset interface here, which will > * overwrite the previously restored configuration information. > */ > - if (vdev->reset_works && pci_cfg_access_trylock(pdev)) { > - if (device_trylock(&pdev->dev)) { > - if (!__pci_reset_function_locked(pdev)) > - vdev->needs_reset = false; > - device_unlock(&pdev->dev); > + if (vdev->reset_works) { > + if (!pdev->slot || > + down_write_trylock(&pdev->slot->reset_lock)) { > + if (pci_cfg_access_trylock(pdev)) { > + if (device_trylock(&pdev->dev)) { > + if (!__pci_reset_function_locked(pdev)) > + vdev->needs_reset = false; > + device_unlock(&pdev->dev); > + } > + pci_cfg_access_unlock(pdev); > + } > + if (pdev->slot) > + up_write(&pdev->slot->reset_lock); > } > - pci_cfg_access_unlock(pdev); > } > > pci_restore_state(pdev); > diff --git a/drivers/xen/xen-pciback/passthrough.c b/drivers/xen/xen-pciback/passthrough.c > index 66e9b81..98a9ec8 100644 > --- a/drivers/xen/xen-pciback/passthrough.c > +++ b/drivers/xen/xen-pciback/passthrough.c > @@ -89,11 +89,17 @@ static void __xen_pcibk_release_pci_dev(struct xen_pcibk_device *pdev, > mutex_unlock(&dev_data->lock); > > if (found_dev) { > - if (lock) > + if (lock) { > + if (found_dev->slot) > + down_write(&found_dev->slot->reset_lock); > device_lock(&found_dev->dev); > + } > pcistub_put_pci_dev(found_dev); > - if (lock) > + if (lock) { > device_unlock(&found_dev->dev); > + if (found_dev->slot) > + up_write(&found_dev->slot->reset_lock); > + } > } > } > > @@ -164,9 +170,13 @@ static void __xen_pcibk_release_devices(struct xen_pcibk_device *pdev) > list_for_each_entry_safe(dev_entry, t, &dev_data->dev_list, list) { > struct pci_dev *dev = dev_entry->dev; > list_del(&dev_entry->list); > + if (dev->slot) > + down_write(&dev->slot->reset_lock); > device_lock(&dev->dev); > pcistub_put_pci_dev(dev); > device_unlock(&dev->dev); > + if (dev->slot) > + up_write(&dev->slot->reset_lock); > kfree(dev_entry); > } > > diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c > index e876c3d..91779a2 100644 > --- a/drivers/xen/xen-pciback/pci_stub.c > +++ b/drivers/xen/xen-pciback/pci_stub.c > @@ -463,6 +463,9 @@ static int __init pcistub_init_devices_late(void) > > spin_unlock_irqrestore(&pcistub_devices_lock, flags); > > + if (psdev->dev->slot) > + down_write(&psdev->dev->slot->reset_lock); > + device_lock(&psdev->dev->dev); > err = pcistub_init_device(psdev->dev); > if (err) { > dev_err(&psdev->dev->dev, > @@ -470,6 +473,9 @@ static int __init pcistub_init_devices_late(void) > kfree(psdev); > psdev = NULL; > } > + device_unlock(&psdev->dev->dev); > + if (psdev->dev->slot) > + up_write(&psdev->dev->slot->reset_lock); > > spin_lock_irqsave(&pcistub_devices_lock, flags); > > diff --git a/drivers/xen/xen-pciback/vpci.c b/drivers/xen/xen-pciback/vpci.c > index 5447b5a..d157b1d 100644 > --- a/drivers/xen/xen-pciback/vpci.c > +++ b/drivers/xen/xen-pciback/vpci.c > @@ -171,11 +171,17 @@ static void __xen_pcibk_release_pci_dev(struct xen_pcibk_device *pdev, > mutex_unlock(&vpci_dev->lock); > > if (found_dev) { > - if (lock) > + if (lock) { > + if (found_dev->slot) > + down_write(&found_dev->slot->reset_lock); > device_lock(&found_dev->dev); > + } > pcistub_put_pci_dev(found_dev); > - if (lock) > + if (lock) { > device_unlock(&found_dev->dev); > + if (found_dev->slot) > + up_write(&found_dev->slot->reset_lock); > + } > } > } > > @@ -216,9 +222,13 @@ static void __xen_pcibk_release_devices(struct xen_pcibk_device *pdev) > list) { > struct pci_dev *dev = e->dev; > list_del(&e->list); > + if (dev->slot) > + down_write(&dev->slot->reset_lock); > device_lock(&dev->dev); > pcistub_put_pci_dev(dev); > device_unlock(&dev->dev); > + if (dev->slot) > + up_write(&dev->slot->reset_lock); > kfree(e); > } > } > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 2a2d00e..12869bd 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -38,6 +38,7 @@ > #include <linux/interrupt.h> > #include <linux/io.h> > #include <linux/resource_ext.h> > +#include <linux/rwsem.h> > #include <uapi/linux/pci.h> > > #include <linux/pci_ids.h> > @@ -65,11 +66,16 @@ > /* return bus from PCI devid = ((u16)bus_number) << 8) | devfn */ > #define PCI_BUS_NUM(x) (((x) >> 8) & 0xff) > > -/* pci_slot represents a physical slot */ > +/** > + * struct pci_slot - represents a physical slot > + * @reset_lock: held for writing during a slot reset; acquire for reading to > + * protect access to register bits which may flap upon a reset > + */ > struct pci_slot { > struct pci_bus *bus; /* Bus this slot is on */ > struct list_head list; /* Node in list of slots */ > struct hotplug_slot *hotplug; /* Hotplug info (move here) */ > + struct rw_semaphore reset_lock; > unsigned char number; /* PCI_SLOT(pci_dev->devfn) */ > struct kobject kobj; > }; > -- > 2.27.0 >