On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote: > PCIe hotplug events modify the topology in their IRQ thread once it can > acquire the global pci_rescan_remove_lock. > > If a different removal event happens to acquire that lock first, and > that removal event is for the parent device of the bridge processing the > other hotplug event, then we are deadlocked: the parent removal will > wait indefinitely on the child's IRQ thread because the parent is > holding the global lock the child thread needs to make forward progress. Yes, that's a known problem. I submitted a fix for it in 2018: https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@xxxxxxxxx/ The patch I proposed was similar to yours, but was smaller and confined to pciehp_pci.c. It was part of a larger series and when respinning that series I dropped the patch, which is the reason it never got applied. I explained the rationale for dropping it in this message: https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@xxxxxxxxx/ Basically all these proposals (both mine and yours) are not great because they add another layer of duct tape without tackling the underlying problem -- that pci_lock_rescan_remove() is way too coarse-grained and needs to be replaced by finer-grained locking. That however is a complex task that we haven't made significant forward progress on in the last couple of years. Something else always seemed more important. So I don't really know what to say. I recognize it's a problem but I'm hesitant to endorse a duct tape fix. :( In the second link above I mentioned that my approach doesn't solve another race discovered by Xiongfeng Wang. pciehp has been refactored considerably since then, so I'm not sure if that particular issue still exists... Thanks, Lukas