Re: [PATCHv2 1/2] PCI: pciehp: fix concurrent sub-tree removal deadlock

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jun 27, 2024 at 09:47:05AM +0200, Lukas Wunner wrote:
> On Wed, Jun 12, 2024 at 11:16:24AM -0700, Keith Busch wrote:
> > PCIe hotplug events modify the topology in their IRQ thread once it can
> > acquire the global pci_rescan_remove_lock.
> > 
> > If a different removal event happens to acquire that lock first, and
> > that removal event is for the parent device of the bridge processing the
> > other hotplug event, then we are deadlocked: the parent removal will
> > wait indefinitely on the child's IRQ thread because the parent is
> > holding the global lock the child thread needs to make forward progress.
> 
> Yes, that's a known problem.  I submitted a fix for it in 2018:
> 
> https://lore.kernel.org/all/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@xxxxxxxxx/
> 
> The patch I proposed was similar to yours, but was smaller and
> confined to pciehp_pci.c.  It was part of a larger series and
> when respinning that series I dropped the patch, which is the
> reason it never got applied.  I explained the rationale for
> dropping it in this message:
> 
> https://lore.kernel.org/all/20180906162634.ylyp3ydwujf5wuxx@xxxxxxxxx/
> 
> Basically all these proposals (both mine and yours) are not great
> because they add another layer of duct tape without tackling the
> underlying problem -- that pci_lock_rescan_remove() is way too
> coarse-grained and needs to be replaced by finer-grained locking.
> That however is a complex task that we haven't made significant
> forward progress on in the last couple of years.  Something else
> always seemed more important.
> 
> So I don't really know what to say.  I recognize it's a problem
> but I'm hesitant to endorse a duct tape fix. :(
> 
> In the second link above I mentioned that my approach doesn't solve
> another race discovered by Xiongfeng Wang.  pciehp has been refactored
> considerably since then, so I'm not sure if that particular issue
> still exists...

Thanks for comments. I agree the current locking scheme is the real
problem here. But I would still selfishly take a duct tape solution at
this point! :) There's so many direct pci_lock_rescan_remove() users
with hardware I don't have, it may be difficult to properly test a
significant change to the locking.

The sysfs race appears to still exist today. My patch wouldn't help with
that either.




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux