Back in 2013, commits 2e35afaefe64 ("PCI: pciehp: Add reset_slot() method") 608c388122c7 ("PCI: Add slot reset option to pci_dev_reset()") introduced the callback pciehp_reset_slot() to the PCIe hotplug driver and amended __pci_dev_reset() (today __pci_reset_function_locked()) to invoke it when resetting a hotplug port's child. The callback performs a Secondary Bus Reset and ensures that an ensuing link or presence flap is ignored by pciehp. However the commits did not perform any locking, in particular: * No precautions were taken to prevent concurrent execution of the new callback with pciehp's IRQ handler or a sysfs request to bring the slot up or down. These code paths may see flapping link or presence bits during a slot reset. * pciehp is not prevented from unbinding while the new callback accesses its struct controller. Commit 608c388122c7 did take a reference on pciehp's module, but that's not sufficient. It only keeps pciehp's code in memory, but doesn't prevent unbinding. * In pci_dev_reset_slot_function(), commit 608c388122c7 iterates over the devices on a bus without holding pci_bus_sem. In 2018, commit 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset") sought to address the first of these locking issues: It introduced a reset_lock which serializes a slot reset with other parts of pciehp. But Michael Haeuptle reports that deadlocks now occur on simultaneous hot-removal and reset of vfio devices because pciehp acquires the reset_lock and the device_lock in a different order than pci_try_reset_function(): pciehp_ist() # down_read(reset_lock) pciehp_handle_presence_or_link_change() pciehp_disable_slot() __pciehp_disable_slot() remove_board() pciehp_unconfigure_device() pci_stop_and_remove_bus_device() pci_stop_bus_device() pci_stop_dev() device_release_driver() device_release_driver_internal() __device_driver_lock() # device_lock() SYS_munmap() vfio_device_fops_release() vfio_pci_release() vfio_pci_disable() pci_try_reset_function() # device_lock() __pci_reset_function_locked() pci_reset_hotplug_slot() pciehp_reset_slot() # down_write(reset_lock) Ian May reports the same deadlock on simultaneous hot-removal and AER reset: aer_recover_work_func() pcie_do_recovery() aer_root_reset() pci_bus_error_reset() pci_slot_reset() pci_slot_lock() # device_lock() pci_reset_hotplug_slot() pciehp_reset_slot() # down_write(reset_lock) Fix by pushing the reset_lock out of pciehp's struct controller and into struct pci_slot such that it can be taken by the PCI core before taking the device lock. There's a catch though: Some drivers call __pci_reset_function_locked() directly and the function expects that all necessary locks, including the reset_lock, have been acquired by the caller. There are callers which already hold the device_lock, so they can't acquire the reset_lock without re-introducing the AB-BA deadlock: * drivers/net/ethernet/cavium/liquidio/lio_main.c: octeon_pci_flr() * drivers/xen/xen-pciback/pci_stub.c: pcistub_device_release() * drivers/xen/xen-pciback/pci_stub.c: pcistub_init_device() (if called from pcistub_seize()) In the case of octeon_pci_flr(), the device is reset on driver unbind, which is why the device_lock is already held. A possible solution might be to add a flag in struct pci_dev with which drivers tell the PCI core that the device is handed back in an unclean state and needs a reset. The PCI core would then perform the reset on behalf of the driver after it has unbound and before any new driver is bound. As for xen, this patch (which was never applied) explains that a reset is performed on bind, unbind and on un-assigning a device from a guest: https://lore.kernel.org/patchwork/patch/848180/ The unbind code path could be solved by the same solution as for octeon and it may also be possible to make it work on bind, though it's unclear why a reset on bind is at all necessary. The un-assigning code path is fixed by the present commit I think. For now, the three functions do not acquire the reset_lock. I'm inserting a lockdep_assert_held_write() so that a lockdep splat is shown as a reminder that liquidio and xen require fixing. Fixes: 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset") Link: https://lore.kernel.org/linux-pci/CS1PR8401MB0728FC6FDAB8A35C22BD90EC95F10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Link: https://lore.kernel.org/linux-pci/20200615143250.438252-1-ian.may@xxxxxxxxxxxxx Reported-and-tested-by: Michael Haeuptle <michael.haeuptle@xxxxxxx> Reported-and-tested-by: Ian May <ian.may@xxxxxxxxxxxxx> Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx> Cc: stable@xxxxxxxxxxxxxxx # v4.19+ Cc: Alex Williamson <alex.williamson@xxxxxxxxxx> --- drivers/pci/hotplug/pciehp.h | 5 ----- drivers/pci/hotplug/pciehp_core.c | 4 ++-- drivers/pci/hotplug/pciehp_hpc.c | 12 ++++++------ drivers/pci/pci.c | 17 +++++++++++++++++ drivers/pci/slot.c | 2 ++ drivers/vfio/pci/vfio_pci.c | 19 +++++++++++++------ drivers/xen/xen-pciback/passthrough.c | 14 ++++++++++++-- drivers/xen/xen-pciback/pci_stub.c | 6 ++++++ drivers/xen/xen-pciback/vpci.c | 14 ++++++++++++-- include/linux/pci.h | 8 +++++++- 10 files changed, 77 insertions(+), 24 deletions(-) diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h index 4fd200d..676e579 100644 --- a/drivers/pci/hotplug/pciehp.h +++ b/drivers/pci/hotplug/pciehp.h @@ -20,7 +20,6 @@ #include <linux/pci_hotplug.h> #include <linux/delay.h> #include <linux/mutex.h> -#include <linux/rwsem.h> #include <linux/workqueue.h> #include "../pcie/portdrv.h" @@ -69,9 +68,6 @@ * @button_work: work item to turn the slot on or off after 5 seconds * in response to an Attention Button press * @hotplug_slot: structure registered with the PCI hotplug core - * @reset_lock: prevents access to the Data Link Layer Link Active bit in the - * Link Status register and to the Presence Detect State bit in the Slot - * Status register during a slot reset which may cause them to flap * @ist_running: flag to keep user request waiting while IRQ thread is running * @request_result: result of last user request submitted to the IRQ thread * @requester: wait queue to wake up on completion of user request, @@ -102,7 +98,6 @@ struct controller { struct delayed_work button_work; struct hotplug_slot hotplug_slot; /* hotplug core interface */ - struct rw_semaphore reset_lock; unsigned int ist_running; int request_result; wait_queue_head_t requester; diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c index bf779f2..cdb241b 100644 --- a/drivers/pci/hotplug/pciehp_core.c +++ b/drivers/pci/hotplug/pciehp_core.c @@ -165,7 +165,7 @@ static void pciehp_check_presence(struct controller *ctrl) { int occupied; - down_read(&ctrl->reset_lock); + down_read(&ctrl->hotplug_slot.pci_slot->reset_lock); mutex_lock(&ctrl->state_lock); occupied = pciehp_card_present_or_link_active(ctrl); @@ -176,7 +176,7 @@ static void pciehp_check_presence(struct controller *ctrl) pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC); mutex_unlock(&ctrl->state_lock); - up_read(&ctrl->reset_lock); + up_read(&ctrl->hotplug_slot.pci_slot->reset_lock); } static int pciehp_probe(struct pcie_device *dev) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b3..a1c9072 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -706,13 +706,17 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) /* * Disable requests have higher priority than Presence Detect Changed * or Data Link Layer State Changed events. + * + * A slot reset may cause flaps of the Presence Detect State bit in the + * Slot Status register and the Data Link Layer Link Active bit in the + * Link Status register. Prevent by holding the reset lock. */ - down_read(&ctrl->reset_lock); + down_read(&ctrl->hotplug_slot.pci_slot->reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) pciehp_handle_presence_or_link_change(ctrl, events); - up_read(&ctrl->reset_lock); + up_read(&ctrl->hotplug_slot.pci_slot->reset_lock); ret = IRQ_HANDLED; out: @@ -841,8 +845,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe) if (probe) return 0; - down_write(&ctrl->reset_lock); - if (!ATTN_BUTTN(ctrl)) { ctrl_mask |= PCI_EXP_SLTCTL_PDCE; stat_mask |= PCI_EXP_SLTSTA_PDC; @@ -861,7 +863,6 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe) ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__, pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, ctrl_mask); - up_write(&ctrl->reset_lock); return rc; } @@ -925,7 +926,6 @@ struct controller *pcie_init(struct pcie_device *dev) ctrl->slot_cap = slot_cap; mutex_init(&ctrl->ctrl_lock); mutex_init(&ctrl->state_lock); - init_rwsem(&ctrl->reset_lock); init_waitqueue_head(&ctrl->requester); init_waitqueue_head(&ctrl->queue); INIT_DELAYED_WORK(&ctrl->button_work, pciehp_queue_pushbutton_work); diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 45c51af..455da72 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4902,6 +4902,8 @@ static int pci_reset_hotplug_slot(struct hotplug_slot *hotplug, int probe) if (!hotplug || !try_module_get(hotplug->owner)) return rc; + lockdep_assert_held_write(&hotplug->pci_slot->reset_lock); + if (hotplug->ops->reset_slot) rc = hotplug->ops->reset_slot(hotplug, probe); @@ -5110,6 +5112,8 @@ int pci_reset_function(struct pci_dev *dev) if (!dev->reset_fn) return -ENOTTY; + if (dev->slot) + down_write(&dev->slot->reset_lock); pci_dev_lock(dev); pci_dev_save_and_disable(dev); @@ -5117,6 +5121,8 @@ int pci_reset_function(struct pci_dev *dev) pci_dev_restore(dev); pci_dev_unlock(dev); + if (dev->slot) + up_write(&dev->slot->reset_lock); return rc; } @@ -5169,6 +5175,9 @@ int pci_try_reset_function(struct pci_dev *dev) if (!dev->reset_fn) return -ENOTTY; + if (dev->slot && !down_write_trylock(&dev->slot->reset_lock)) + return -EAGAIN; + if (!pci_dev_trylock(dev)) return -EAGAIN; @@ -5176,6 +5185,8 @@ int pci_try_reset_function(struct pci_dev *dev) rc = __pci_reset_function_locked(dev); pci_dev_restore(dev); pci_dev_unlock(dev); + if (dev->slot) + up_write(&dev->slot->reset_lock); return rc; } @@ -5274,6 +5285,7 @@ static void pci_slot_lock(struct pci_slot *slot) { struct pci_dev *dev; + down_write(&slot->reset_lock); list_for_each_entry(dev, &slot->bus->devices, bus_list) { if (!dev->slot || dev->slot != slot) continue; @@ -5295,6 +5307,7 @@ static void pci_slot_unlock(struct pci_slot *slot) pci_bus_unlock(dev->subordinate); pci_dev_unlock(dev); } + up_write(&slot->reset_lock); } /* Return 1 on successful lock, 0 on contention */ @@ -5302,6 +5315,9 @@ static int pci_slot_trylock(struct pci_slot *slot) { struct pci_dev *dev; + if (!down_write_trylock(&slot->reset_lock)) + return 0; + list_for_each_entry(dev, &slot->bus->devices, bus_list) { if (!dev->slot || dev->slot != slot) continue; @@ -5325,6 +5341,7 @@ static int pci_slot_trylock(struct pci_slot *slot) pci_bus_unlock(dev->subordinate); pci_dev_unlock(dev); } + up_write(&slot->reset_lock); return 0; } diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c index cc386ef..e8e7d09 100644 --- a/drivers/pci/slot.c +++ b/drivers/pci/slot.c @@ -279,6 +279,8 @@ struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr, INIT_LIST_HEAD(&slot->list); list_add(&slot->list, &parent->slots); + init_rwsem(&slot->reset_lock); + down_read(&pci_bus_sem); list_for_each_entry(dev, &parent->devices, bus_list) if (PCI_SLOT(dev->devfn) == slot_nr) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index f634c81..260650e 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -454,13 +454,20 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev) * We can not use the "try" reset interface here, which will * overwrite the previously restored configuration information. */ - if (vdev->reset_works && pci_cfg_access_trylock(pdev)) { - if (device_trylock(&pdev->dev)) { - if (!__pci_reset_function_locked(pdev)) - vdev->needs_reset = false; - device_unlock(&pdev->dev); + if (vdev->reset_works) { + if (!pdev->slot || + down_write_trylock(&pdev->slot->reset_lock)) { + if (pci_cfg_access_trylock(pdev)) { + if (device_trylock(&pdev->dev)) { + if (!__pci_reset_function_locked(pdev)) + vdev->needs_reset = false; + device_unlock(&pdev->dev); + } + pci_cfg_access_unlock(pdev); + } + if (pdev->slot) + up_write(&pdev->slot->reset_lock); } - pci_cfg_access_unlock(pdev); } pci_restore_state(pdev); diff --git a/drivers/xen/xen-pciback/passthrough.c b/drivers/xen/xen-pciback/passthrough.c index 66e9b81..98a9ec8 100644 --- a/drivers/xen/xen-pciback/passthrough.c +++ b/drivers/xen/xen-pciback/passthrough.c @@ -89,11 +89,17 @@ static void __xen_pcibk_release_pci_dev(struct xen_pcibk_device *pdev, mutex_unlock(&dev_data->lock); if (found_dev) { - if (lock) + if (lock) { + if (found_dev->slot) + down_write(&found_dev->slot->reset_lock); device_lock(&found_dev->dev); + } pcistub_put_pci_dev(found_dev); - if (lock) + if (lock) { device_unlock(&found_dev->dev); + if (found_dev->slot) + up_write(&found_dev->slot->reset_lock); + } } } @@ -164,9 +170,13 @@ static void __xen_pcibk_release_devices(struct xen_pcibk_device *pdev) list_for_each_entry_safe(dev_entry, t, &dev_data->dev_list, list) { struct pci_dev *dev = dev_entry->dev; list_del(&dev_entry->list); + if (dev->slot) + down_write(&dev->slot->reset_lock); device_lock(&dev->dev); pcistub_put_pci_dev(dev); device_unlock(&dev->dev); + if (dev->slot) + up_write(&dev->slot->reset_lock); kfree(dev_entry); } diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c index e876c3d..91779a2 100644 --- a/drivers/xen/xen-pciback/pci_stub.c +++ b/drivers/xen/xen-pciback/pci_stub.c @@ -463,6 +463,9 @@ static int __init pcistub_init_devices_late(void) spin_unlock_irqrestore(&pcistub_devices_lock, flags); + if (psdev->dev->slot) + down_write(&psdev->dev->slot->reset_lock); + device_lock(&psdev->dev->dev); err = pcistub_init_device(psdev->dev); if (err) { dev_err(&psdev->dev->dev, @@ -470,6 +473,9 @@ static int __init pcistub_init_devices_late(void) kfree(psdev); psdev = NULL; } + device_unlock(&psdev->dev->dev); + if (psdev->dev->slot) + up_write(&psdev->dev->slot->reset_lock); spin_lock_irqsave(&pcistub_devices_lock, flags); diff --git a/drivers/xen/xen-pciback/vpci.c b/drivers/xen/xen-pciback/vpci.c index 5447b5a..d157b1d 100644 --- a/drivers/xen/xen-pciback/vpci.c +++ b/drivers/xen/xen-pciback/vpci.c @@ -171,11 +171,17 @@ static void __xen_pcibk_release_pci_dev(struct xen_pcibk_device *pdev, mutex_unlock(&vpci_dev->lock); if (found_dev) { - if (lock) + if (lock) { + if (found_dev->slot) + down_write(&found_dev->slot->reset_lock); device_lock(&found_dev->dev); + } pcistub_put_pci_dev(found_dev); - if (lock) + if (lock) { device_unlock(&found_dev->dev); + if (found_dev->slot) + up_write(&found_dev->slot->reset_lock); + } } } @@ -216,9 +222,13 @@ static void __xen_pcibk_release_devices(struct xen_pcibk_device *pdev) list) { struct pci_dev *dev = e->dev; list_del(&e->list); + if (dev->slot) + down_write(&dev->slot->reset_lock); device_lock(&dev->dev); pcistub_put_pci_dev(dev); device_unlock(&dev->dev); + if (dev->slot) + up_write(&dev->slot->reset_lock); kfree(e); } } diff --git a/include/linux/pci.h b/include/linux/pci.h index 2a2d00e..12869bd 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -38,6 +38,7 @@ #include <linux/interrupt.h> #include <linux/io.h> #include <linux/resource_ext.h> +#include <linux/rwsem.h> #include <uapi/linux/pci.h> #include <linux/pci_ids.h> @@ -65,11 +66,16 @@ /* return bus from PCI devid = ((u16)bus_number) << 8) | devfn */ #define PCI_BUS_NUM(x) (((x) >> 8) & 0xff) -/* pci_slot represents a physical slot */ +/** + * struct pci_slot - represents a physical slot + * @reset_lock: held for writing during a slot reset; acquire for reading to + * protect access to register bits which may flap upon a reset + */ struct pci_slot { struct pci_bus *bus; /* Bus this slot is on */ struct list_head list; /* Node in list of slots */ struct hotplug_slot *hotplug; /* Hotplug info (move here) */ + struct rw_semaphore reset_lock; unsigned char number; /* PCI_SLOT(pci_dev->devfn) */ struct kobject kobj; }; -- 2.27.0