On Mon, Aug 12, 2019 at 05:31:33PM +0300, Mika Westerberg wrote: > If there are more than one PCIe switch with hotplug downstream ports > hot-removing them leads to a following deadlock: > > INFO: task irq/126-pciehp:198 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > irq/126-pciehp D 0 198 2 0x80000000 > Call Trace: > __schedule+0x2a2/0x880 > schedule+0x2c/0x80 > schedule_timeout+0x246/0x350 > ? ttwu_do_activate+0x67/0x90 > wait_for_completion+0xb7/0x140 > ? wake_up_q+0x80/0x80 > kthread_stop+0x49/0x110 > __free_irq+0x15c/0x2a0 > free_irq+0x32/0x70 > pcie_shutdown_notification+0x2f/0x50 > pciehp_remove+0x27/0x50 > pcie_port_remove_service+0x36/0x50 > device_release_driver_internal+0x18c/0x250 > device_release_driver+0x12/0x20 > bus_remove_device+0xec/0x160 > device_del+0x13b/0x350 > ? pcie_port_find_device+0x60/0x60 > device_unregister+0x1a/0x60 > remove_iter+0x1e/0x30 > device_for_each_child+0x56/0x90 > pcie_port_device_remove+0x22/0x40 > pcie_portdrv_remove+0x20/0x60 > pci_device_remove+0x3e/0xc0 > device_release_driver_internal+0x18c/0x250 > device_release_driver+0x12/0x20 > pci_stop_bus_device+0x6f/0x90 > pci_stop_bus_device+0x31/0x90 > pci_stop_and_remove_bus_device+0x12/0x20 > pciehp_unconfigure_device+0x88/0x140 > pciehp_disable_slot+0x6a/0x110 > pciehp_handle_presence_or_link_change+0x263/0x400 > pciehp_ist+0x1c9/0x1d0 > ? irq_forced_thread_fn+0x80/0x80 > irq_thread_fn+0x24/0x60 > irq_thread+0xeb/0x190 > ? irq_thread_fn+0x60/0x60 > kthread+0x120/0x140 > ? irq_thread_check_affinity+0xf0/0xf0 > ? kthread_park+0x90/0x90 > ret_from_fork+0x35/0x40 > INFO: task irq/190-pciehp:2288 blocked for more than 120 seconds. > irq/190-pciehp D 0 2288 2 0x80000000 > Call Trace: > __schedule+0x2a2/0x880 > schedule+0x2c/0x80 > schedule_preempt_disabled+0xe/0x10 > __mutex_lock.isra.9+0x2e0/0x4d0 > ? __mutex_lock_slowpath+0x13/0x20 > __mutex_lock_slowpath+0x13/0x20 > mutex_lock+0x2c/0x30 > pci_lock_rescan_remove+0x15/0x20 > pciehp_unconfigure_device+0x4d/0x140 > pciehp_disable_slot+0x6a/0x110 > pciehp_handle_presence_or_link_change+0x263/0x400 > pciehp_ist+0x1c9/0x1d0 > ? irq_forced_thread_fn+0x80/0x80 > irq_thread_fn+0x24/0x60 > irq_thread+0xeb/0x190 > ? irq_thread_fn+0x60/0x60 > kthread+0x120/0x140 > ? irq_thread_check_affinity+0xf0/0xf0 > ? kthread_park+0x90/0x90 > ret_from_fork+0x35/0x40 > > What happens here is that the whole hierarchy is runtime resumed and the > parent PCIe downstream port, who got the hot-remove event, starts > removing devices below it taking pci_lock_rescan_remove() lock. When the > child PCIe port is runtime resumed it calls pciehp_check_presence() > which ends up calling pciehp_card_present() and pciehp_check_link_active(). > Both of these read their parts of PCIe config space by calling helper > function pcie_capability_read_word(). Now, this function notices that > the underlying device is already gone and returns PCIBIOS_DEVICE_NOT_FOUND > with the capability value set to 0. When pciehp gets this value it > thinks that its child device is also hot-removed and schedules its IRQ > thread to handle the event. I can't remember if there was a reason why 8c0d3a02c130 ("PCI: Add accessors for PCI Express Capability") reset *val to 0 if pci_read_config_word() fails. It doesn't seem like the right thing; it seems like it would be better for it to be consistent with a plain pci_read_config_word(). > The deadlock happens when the child's IRQ thread runs and tries to > acquire pci_lock_rescan_remove() which is already taken by the parent > and the parent waits for the child's IRQ thread to finish. > > We can prevent this from happening by checking the return value of > pcie_capability_read_word() and if it is PCIBIOS_DEVICE_NOT_FOUND stop > performing any hot-removal activities. > > Signed-off-by: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx> > --- > No changes from the previous version. > > drivers/pci/hotplug/pciehp.h | 6 +++--- > drivers/pci/hotplug/pciehp_core.c | 11 ++++++++--- > drivers/pci/hotplug/pciehp_ctrl.c | 4 ++-- > drivers/pci/hotplug/pciehp_hpc.c | 32 +++++++++++++++++++++++-------- > 4 files changed, 37 insertions(+), 16 deletions(-) > > diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h > index 8c51a04b8083..81c514ab9518 100644 > --- a/drivers/pci/hotplug/pciehp.h > +++ b/drivers/pci/hotplug/pciehp.h > @@ -173,10 +173,10 @@ int pciehp_query_power_fault(struct controller *ctrl); > void pciehp_green_led_on(struct controller *ctrl); > void pciehp_green_led_off(struct controller *ctrl); > void pciehp_green_led_blink(struct controller *ctrl); > -bool pciehp_card_present(struct controller *ctrl); > -bool pciehp_card_present_or_link_active(struct controller *ctrl); > +int pciehp_card_present(struct controller *ctrl); > +int pciehp_card_present_or_link_active(struct controller *ctrl); > int pciehp_check_link_status(struct controller *ctrl); > -bool pciehp_check_link_active(struct controller *ctrl); > +int pciehp_check_link_active(struct controller *ctrl); > void pciehp_release_ctrl(struct controller *ctrl); > > int pciehp_sysfs_enable_slot(struct hotplug_slot *hotplug_slot); > diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c > index e9f82afa3773..4c032d75c874 100644 > --- a/drivers/pci/hotplug/pciehp_core.c > +++ b/drivers/pci/hotplug/pciehp_core.c > @@ -134,10 +134,15 @@ static int get_adapter_status(struct hotplug_slot *hotplug_slot, u8 *value) > { > struct controller *ctrl = to_ctrl(hotplug_slot); > struct pci_dev *pdev = ctrl->pcie->port; > + int ret; > > pci_config_pm_runtime_get(pdev); > - *value = pciehp_card_present_or_link_active(ctrl); > + ret = pciehp_card_present_or_link_active(ctrl); > pci_config_pm_runtime_put(pdev); > + if (ret < 0) > + return ret; > + > + *value = ret; > return 0; > } > > @@ -153,13 +158,13 @@ static int get_adapter_status(struct hotplug_slot *hotplug_slot, u8 *value) > */ > static void pciehp_check_presence(struct controller *ctrl) > { > - bool occupied; > + int occupied; > > down_read(&ctrl->reset_lock); > mutex_lock(&ctrl->state_lock); > > occupied = pciehp_card_present_or_link_active(ctrl); > - if ((occupied && (ctrl->state == OFF_STATE || > + if ((occupied > 0 && (ctrl->state == OFF_STATE || > ctrl->state == BLINKINGON_STATE)) || > (!occupied && (ctrl->state == ON_STATE || > ctrl->state == BLINKINGOFF_STATE))) > diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c > index 631ced0ab28a..5a433cc8621f 100644 > --- a/drivers/pci/hotplug/pciehp_ctrl.c > +++ b/drivers/pci/hotplug/pciehp_ctrl.c > @@ -221,7 +221,7 @@ void pciehp_handle_disable_request(struct controller *ctrl) > > void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events) > { > - bool present, link_active; > + int present, link_active; > > /* > * If the slot is on and presence or link has changed, turn it off. > @@ -252,7 +252,7 @@ void pciehp_handle_presence_or_link_change(struct controller *ctrl, u32 events) > mutex_lock(&ctrl->state_lock); > present = pciehp_card_present(ctrl); > link_active = pciehp_check_link_active(ctrl); > - if (!present && !link_active) { > + if (present <= 0 && link_active <= 0) { > mutex_unlock(&ctrl->state_lock); > return; > } > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index bd990e3371e3..1f918b043adb 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -201,13 +201,16 @@ static void pcie_write_cmd_nowait(struct controller *ctrl, u16 cmd, u16 mask) > pcie_do_write_cmd(ctrl, cmd, mask, false); > } > > -bool pciehp_check_link_active(struct controller *ctrl) > +int pciehp_check_link_active(struct controller *ctrl) > { > struct pci_dev *pdev = ctrl_dev(ctrl); > u16 lnk_status; > - bool ret; > + int ret; > + > + ret = pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status); > + if (ret == PCIBIOS_DEVICE_NOT_FOUND) > + return -ENODEV; > > - pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status); > ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA); > > if (ret) > @@ -373,13 +376,17 @@ void pciehp_get_latch_status(struct controller *ctrl, u8 *status) > *status = !!(slot_status & PCI_EXP_SLTSTA_MRLSS); > } > > -bool pciehp_card_present(struct controller *ctrl) > +int pciehp_card_present(struct controller *ctrl) > { > struct pci_dev *pdev = ctrl_dev(ctrl); > u16 slot_status; > + int ret; > > - pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status); > - return slot_status & PCI_EXP_SLTSTA_PDS; > + ret = pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status); > + if (ret == PCIBIOS_DEVICE_NOT_FOUND) > + return -ENODEV; Isn't this racy? # pdev is present pci_read_config_word if (pci_dev_is_disconnected(pdev)) # currently false # pdev is removed pci_bus_read_config_word # fails, returns ~0 slot_status = ~0 I think pci_read_config_word() checks pci_dev_is_disconnected() merely as an optimization. Obviously it can't guarantee that the subsequent config access will succeed. If pci_dev_is_disconnected() was false but the config read fails, I think we'll get ~0 data and return 1, i.e., "PDS was set". Shouldn't we check for slot_status being an error response (~0) instead of looking for PCIBIOS_DEVICE_NOT_FOUND? There are 7 RsvdP bits in Slot Status, so ~0 is not a valid value for the register. All 16 bits of Link Status are defined, but ~0 is still an invalid value because the Current Link Speed and Negotiated Link Width fields only define a few valid encodings. > + return !!(slot_status & PCI_EXP_SLTSTA_PDS); > } > > /** > @@ -390,10 +397,19 @@ bool pciehp_card_present(struct controller *ctrl) > * Presence Detect State bit, this helper also returns true if the Link Active > * bit is set. This is a concession to broken hotplug ports which hardwire > * Presence Detect State to zero, such as Wilocity's [1ae9:0200]. > + * > + * Returns: %1 if the slot is occupied and %0 if it is not. If the hotplug > + * port is not present anymore returns %-ENODEV. > */ > -bool pciehp_card_present_or_link_active(struct controller *ctrl) > +int pciehp_card_present_or_link_active(struct controller *ctrl) > { > - return pciehp_card_present(ctrl) || pciehp_check_link_active(ctrl); > + int ret; > + > + ret = pciehp_card_present(ctrl); > + if (ret) > + return ret; > + > + return pciehp_check_link_active(ctrl); The names of these functions seem misleading to me: all they can really tell us is "the card *was* present" or "the link *was* active" at some time in the past. But the names make it so tempting to pretend that "the card *is* present" or "the link *is* active", and that may no longer be true. I think names like "pciehp_card_absent()" and "pciehp_link_down()" would make it easier to think about these situations. > } > > int pciehp_query_power_fault(struct controller *ctrl) > -- > 2.20.1 >