On 1/23/19 1:53 PM, Keith Busch wrote:
> On Wed, Jan 23, 2019 at 08:28:29PM +0100, Lukas Wunner wrote:
>> On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
>>> On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
>>>> On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
>>>>> So I don't see a perfect solution.  What device are we talking about
>>>>> anyway?  400 ms is a *long* time.
>>>>
>>>> Also, how exactly does this issue manifest itself:  Is it just an
>>>> annoyance that the slot is brought up/down/up or does it not work
>>>> at all?
>>>
>>> Yeah, there is an nvme driver bug that hits a deadlock if you trigger
>>> a very quick add-remove sequence. The nvme remove tries to delete IO
>>> resources before the async probe side has set them up, so the driver
>>> doesn't actually see that they're invalid. I have a proposed fix, but
>>> am waiting to hear if it is successful.
>>>
>>>   bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
>>
>> Hm, there's no full dmesg output attached, so it's not possible to
>> tell what the topology looks like and what the vendor/device ID of
>> 0000:b0:04.0 is.
>>
>> Also, there's only a card present / link up sequence visible in the
>> abridged dmesg output which has a 4 usec delay, but no link up / card
>> present sequence with a 400 msec delay?
>
> Yeah, not easy to follow, and some discussion was off the bz.
>
> Link Change:
>
>   [ 838.784541] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
>
> Presence Detect Change +400msec:
>
>   [ 839.183506] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
>
> In between these two entries, nvme starts setting up the controller it
> detected on the link up. The "not present" side tries to remove the same
> nvme device, but fails to invalidate the IO resources because it's racing
> with probe before probe has even set them up, leaving probe unable to
> complete IO a moment later because its IRQ resources were disabled.
>
> Meanwhile, the blk-mq timeout handler can't do anything because the
> device state is disconnected and it believes the removal side is
> handling things. What a mess...
>
> We can fix it, just want to hear if Alex can confirm the proposal is
> successful.

OOPS! Totally missed there was a patch on bz. Will update bz once
testing is done.

Alex
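
For anyone trying to follow the ordering problem described above, here is a toy Python model of it. This is only a sketch of the race as described in this thread: every name in it (ToyController, remove, probe_setup_queues, and so on) is hypothetical, and it is neither the nvme driver code nor the proposed fix on the bz.

```python
from dataclasses import dataclass


@dataclass
class ToyController:
    """Hypothetical stand-in for controller state; not a real driver struct."""
    queues_live: bool = False    # IO resources exist and are usable
    irq_enabled: bool = True     # completions can be delivered
    disconnected: bool = False   # hot-removal has been observed


def remove(ctrl):
    """Removal side: invalidate whatever IO resources exist right now."""
    saw_queues = ctrl.queues_live
    ctrl.queues_live = False     # a no-op if probe has not set them up yet
    ctrl.irq_enabled = False     # interrupts are torn down either way
    ctrl.disconnected = True
    return saw_queues


def probe_setup_queues(ctrl):
    """Async probe side: creates IO resources without noticing the removal."""
    ctrl.queues_live = True


def probe_wait_for_io(ctrl):
    """What happens to the IO that probe issues after setting up its queues."""
    if not ctrl.queues_live:
        return "failed fast"     # resources were invalidated, IO errors out
    if not ctrl.irq_enabled:
        return "hung"            # queue looks valid, completion never arrives
    return "completed"


def timeout_handler(ctrl):
    """Timeout side: steps back when the device is already disconnected."""
    return "deferred to removal" if ctrl.disconnected else "reset controller"


def simulate(remove_before_setup):
    """Run both sides in a fixed order and report (io_result, timeout_action)."""
    ctrl = ToyController()
    if remove_before_setup:
        remove(ctrl)
        probe_setup_queues(ctrl)
    else:
        probe_setup_queues(ctrl)
        remove(ctrl)
    return probe_wait_for_io(ctrl), timeout_handler(ctrl)


print(simulate(remove_before_setup=True))
print(simulate(remove_before_setup=False))
```

In this model, when removal runs before probe has set up its queues, the IO hangs and the timeout handler defers to a removal that has already done its invalidation, so nothing is left to unblock probe. When removal runs after setup, it sees the queues and the IO fails fast instead.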