On Wed, May 16, 2018 at 09:18:26AM -0600, Keith Busch wrote:
> On Wed, May 16, 2018 at 12:31:28PM +0800, Ming Lei wrote:
> > Hi Keith,
> >
> > This issue may well be fixed by Jianchao's patch 'nvme: pci: set
> > nvmeq->cq_vector after alloc cq/sq'[1] together with my other patch
> > 'nvme: pci: unquiesce admin queue after controller is shutdown'[2];
> > both have been included in the posted V6.
>
> No, it's definitely not related to that patch. The link is down in this
> test, I can assure you we're bailing out long before we ever even try to
> create an IO queue. The failing condition is detected by nvme_pci_enable's
> check for all 1's completions at the very beginning.

OK, this kind of failure during reset can easily be triggered in my test,
and nvme_remove_dead_ctrl() is called there too, but I don't see an IO hang
in the remove path. As we discussed, the hang shouldn't happen: since the
queues are unquiesced and killed, all IO should fail immediately. Also,
the controller has been shut down and the queues are frozen, so
blk_mq_freeze_queue_wait() won't wait on an unfrozen queue.

So could you post the debugfs log from when the hang happens, so that we
may find some clue?

Also, I don't think your issue is caused by this patchset, since
nvme_remove_dead_ctrl_work() and nvme_remove() aren't touched by it. That
means the issue can probably be triggered without this patchset too, so
could we start reviewing the patchset in the meantime?

Thanks,
Ming