Re: [PATCH 3/6] nvme: Move all IO out of controller reset

Keith Busch <keith.busch@xxxxxxxxxxxxxxx> · Mon, 21 May 2018 10:23:55 -0600

On Tue, May 22, 2018 at 12:04:53AM +0800, Ming Lei wrote:
> On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote:
> > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote:
> > > nvme_dev_disable() quiesces queues first before killing queues.
> > > 
> > > If queues are quiesced during or before nvme_wait_freeze() is run
> > > from the 2nd part of reset, the 2nd part can't move on, and IO hang
> > > is caused. Finally no reset can be scheduled at all.
> > 
> > But this patch moves nvme_wait_freeze outside the reset path, so I'm
> > afraid I'm unable to follow how you've concluded the wait freeze is
> > somehow part of the reset.
> 
> For example:
> 
> 1) the 1st timeout event:
> 
> - nvme_dev_disable()
> - reset
> - scan_work
> 
> 2) the 2nd timeout event:
> 
> nvme_dev_disable() may come just after nvme_start_queues() in
> the above reset of the 1st timeout. And nvme_timeout() won't
> schedule a new reset since the controller state is NVME_CTRL_CONNECTING.

Let me get this straight -- you're saying nvme_start_queues is going
to somehow immediately trigger timeout work? I can't see how that could
possibly happen in real life, but we can just remove it and use the existing
nvme_start_ctrl to handle that in the LIVE state.