Re: PCI: hotplug: Erroneous removal of hotplug PCI devices

On Wed, Jan 23, 2019 at 08:28:29PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
> > On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
> > > On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> > > > So I don't see a perfect solution.  What device are we talking about
> > > > anyway?  400 ms is a *long* time.
> > > 
> > > Also, how exactly does this issue manifest itself:  Is it just an
> > > annoyance that the slot is brought up/down/up or does it not work
> > > at all?
> > 
> > Yeah, there is an nvme driver bug that hits a deadlock if you trigger
> > a very quick add-remove sequence. The nvme remove tries to delete IO
> > resources before the async probe side has set them up, so the driver
> > doesn't actually see that they're invalid. I have a proposed fix, but am
> > waiting to hear if it is successful.
> > 
> > bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
> 
> Hm, there's no full dmesg output attached, so it's not possible to
> tell what the topology looks like and what the vendor/device ID of
> 0000:b0:04.0 is.
> 
> Also, there's only a card present / link up sequence visible in the
> abridged dmesg output which has a 4 usec delay, but no link up / card
> present sequence with a 400 msec delay?

Yeah, not easy to follow, and some discussion was off the bz.

Link Change:

  [  838.784541] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up

Presence Detect Change, ~400 msec later:

  [  839.183506] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
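The gap between the two events can be read straight off the dmesg
timestamps. A minimal sketch, assuming the bracketed value is seconds
since boot; the helper names dmesg_seconds and delta_msec are made up
for illustration:

```python
import re

# The two log lines quoted above, verbatim.
LINES = [
    "[  838.784541] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up",
    "[  839.183506] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present",
]


def dmesg_seconds(line):
    """Return the leading dmesg timestamp of a log line, in seconds."""
    m = re.match(r"\[\s*(\d+\.\d+)\]", line)
    if m is None:
        raise ValueError("no dmesg timestamp in: %r" % line)
    return float(m.group(1))


def delta_msec(first, second):
    """Milliseconds elapsed between two dmesg lines."""
    return (dmesg_seconds(second) - dmesg_seconds(first)) * 1000.0


gap = delta_msec(LINES[0], LINES[1])
print("Link Up -> Card not present: %.3f msec" % gap)
```

That works out to just under 400 msec, which is the delay being
discussed in this thread.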

In between these two entries, nvme starts setting up the controller it
detected on the link up. The "not present" side tries to remove the same
nvme device, but fails to invalidate the IO resources because it races
with probe before probe has even set them up. That leaves probe unable to
complete IO a moment later, because its IRQ resources were disabled.

Meanwhile, the blk-mq timeout handler can't do anything: the device
state is disconnected, so the handler believes the removal side is
handling things. What a mess...
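
The interleaving described above can be replayed as a toy model. This is
only a sketch of the ordering problem: Ctrl, its fields and the step
functions are all hypothetical names invented for illustration, none of
it is nvme or blk-mq code, and it does not represent the proposed fix.

```python
from dataclasses import dataclass, field


@dataclass
class Ctrl:
    """Toy stand-in for a controller (hypothetical, not driver state)."""
    queues: list = field(default_factory=list)  # IO resources made by probe
    irq_enabled: bool = True
    disconnected: bool = False
    pending_io: int = 0
    log: list = field(default_factory=list)


def probe_setup(c):
    # Async probe creates the IO resources.
    c.queues.append("ioq")
    c.log.append("probe: IO resources set up")


def probe_io(c):
    # IO can only complete if an interrupt can still be delivered.
    if c.irq_enabled:
        c.log.append("probe: IO completed")
    else:
        c.pending_io += 1
        c.log.append("probe: IO submitted, never completes")


def remove(c):
    # Removal marks the device gone, disables IRQs and invalidates
    # whatever IO resources exist at that moment.
    c.disconnected = True
    c.irq_enabled = False
    c.log.append("remove: invalidated %d IO resource(s)" % len(c.queues))
    c.queues.clear()


def timeout(c):
    # The timeout handler defers to removal once the device is disconnected.
    if c.disconnected:
        c.log.append("timeout: disconnected, leaving it to remove")
    else:
        c.pending_io = 0
        c.log.append("timeout: recovered pending IO")


def run(steps):
    c = Ctrl()
    for step in steps:
        step(c)
    return c


# Removal runs before probe has set anything up: the stuck case.
racy = run([remove, probe_setup, probe_io, timeout])
# Probe finishes first, then removal: nothing is left behind.
ordered = run([probe_setup, probe_io, remove, timeout])

for name, c in (("racy", racy), ("ordered", ordered)):
    print(name, "pending_io=%d" % c.pending_io, "queues=%r" % c.queues)
    for entry in c.log:
        print("   ", entry)
```

In the racy ordering the remove step finds nothing to invalidate, so the
IO submitted afterwards stays pending and nobody is left to clean it up.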

We can fix it; I just want to hear whether Alex can confirm the proposal
is successful.


