Re: [PATCH V3 7/8] nvme: pci: recover controller reliably

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 4 May 2018 12:24:52 +0800

On Thu, May 03, 2018 at 11:46:56PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kindly response.
> 
> On 05/03/2018 06:08 PM, Ming Lei wrote:
> > nvme_eh_reset() can move on, if controller state is either CONNECTING or
> > RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
> > be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
> > directly, then follows scheduling of the eh_reset_work.
> 
> We have to consider another context, nvme_reset_work.
> There are two contexts that will invoke nvme_pre_reset_dev.
> nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or CONNECTING.
> 
> nvme_error_handler                nvme_reset_ctrl
>                                     -> change state to RESETTING
>                                     -> queue reset work
>                                   nvme_reset_work          
>   -> nvme_dev_disable                                     
>   -> nvme_eh_reset
>     -> nvme_pre_reset_dev          -> nvme_pre_reset_dev
>       -> change state to CONNECTING   -> change state fails
>                                    -> nvme_remove_dead_ctrl
> There seems to be other race scenarios between them.

IMO it is fine to change the change state in nvme_pre_reset_dev()
as the following way:

if (dev->ctrl.state != NVME_CTRL_CONNECTING)
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
		goto out;

Also dev->reset_lock has been introduced, and we may hold it for
nvme_pre_reset_dev() in the path of nvme_reset_ctrl() too, then
there isn't the above race any more.

> 
> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
> to nvme_reset_work as the v2 patch series seems clearer.

That way may not fix the current race: nvme_dev_disable() will
quiesce/freeze queue again when resetting is in-progress, then
nothing can move on.

One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
in the EH thread, and make sure queues are started once
nvme_pre_reset_dev() returns. But as you pointed, it may not work well
enough if admin requests are timed-out in nvme_pre_reset_dev().

> The primary target of nvme_error_handler is to drain IOs and disable controller.
> reset_work could take over the left things of the recovery work

Draining IOs isn't necessary, we can remove freezing queues & wait_freeze()
from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be
easy to fix all these races since the .eh_reset_work isn't needed any more,
because quiescing is enough to prevent IOs from entering driver/device.

The only issue is that more requests may be failed when reset failed,
so I don't want to try that, and just follows the current behaviour.

IMO the primary goal of nvme EH is to recover controller, that is
what nvme_dev_disable() and resetting should do, if IO isn't drained,
timeout still may be triggered.

The big trouble is about percpu_ref, once the q_usage_counter is killed
to start freezing for preventing IO from entering block layer, we have
to run blk_mq_freeze_queue_wait() before unfreezing the usage couner. If
blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us.

> 
> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
> 

Right, that is one goal of this patchset.

> > There are actually two issues here, both are covered in this patchset:
> > 
> > 1) when nvme_wait_freeze() is for draining IO, controller error may
> > happen again, this patch can shutdown & reset controller again to
> > recover it.
> >
> 
> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
> Otherwise, the code looks really complicated.

Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and
.eh_post_reset_work, and run the following things in .eh_pre_reset_work:

	- nvme_pre_reset_dev()
	- schedule .eh_post_reset_work

and run draining IO & updating controller state in .eh_post_reset_work.
.eh_pre_reset_work will be scheduled in nvme_error_handler().

This way may address the issue of admin req timeout in nvme_pre_reset_dev(),
what do you think of this approach?

Thanks,
Ming