Hi ming On 05/04/2018 12:24 PM, Ming Lei wrote: >> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things >> to nvme_reset_work as the v2 patch series seems clearer. > That way may not fix the current race: nvme_dev_disable() will > quiesce/freeze queue again when resetting is in-progress, then > nothing can move on. With timeout quiesce and nvme EH kthread, the nvme_dev_disable could ensure all the in-flight requests to be handled. When we queue the reset_work in nvme_error_handler, there will be no timeout anymore. If there is timeout due to the admin io in reset_work, this is expected, reset_work will be interrupted and fail. If the io timeout when reset_work wait freeze, because the reset_work is waiting, things cannot move on. we have multiple method to handle this: 1. as Keith's solution, change ctrl state to DELETING and fail the reset_work. 2. as current patch set, try to recovery it. but we have to split reset_work into two contexts. Actually, there are many other places where the nvme_dev_disable will be invoked. > > One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev() > in the EH thread, and make sure queues are started once > nvme_pre_reset_dev() returns. But as you pointed, it may not work well > enough if admin requests are timed-out in nvme_pre_reset_dev(). > >> The primary target of nvme_error_handler is to drain IOs and disable controller. >> reset_work could take over the left things of the recovery work > Draining IOs isn't necessary, we can remove freezing queues & wait_freeze() > from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be > easy to fix all these races since the .eh_reset_work isn't needed any more, > because quiescing is enough to prevent IOs from entering driver/device. > > The only issue is that more requests may be failed when reset failed, > so I don't want to try that, and just follows the current behaviour. > > IMO the primary goal of nvme EH is to recover controller, that is > what nvme_dev_disable() and resetting should do, if IO isn't drained, > timeout still may be triggered. > > The big trouble is about percpu_ref, once the q_usage_counter is killed > to start freezing for preventing IO from entering block layer, we have > to run blk_mq_freeze_queue_wait() before unfreezing the usage couner. If > blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us. Sorry for my bad description. I mean nvme_dev_disable will drain all the in-flight requests and requeue or fail them, then we will not get any timeout again. In fact, I used to talk with Keith about remove freezing queues & wait_freeze for non-shutdown. There is an obstacle, nvme_dev_add -> blk_mq_update_nr_hw_queues. When the ctrl comes back, the number of hw queues maybe changed. Even if we move the wait_freeze, blk_mq_update_nr_hw_queues will do that. On the other hand, if not freeze the queue, the requests that are submitted to the dying hw queue during the resetting, will be sacrificed. > >> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok. >> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them. >> > Right, that is one goal of this patchset. > >>> There are actually two issues here, both are covered in this patchset: >>> >>> 1) when nvme_wait_freeze() is for draining IO, controller error may >>> happen again, this patch can shutdown & reset controller again to >>> recover it. >>> >> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset. >> Otherwise, the code looks really complicated. > Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and > .eh_post_reset_work, and run the following things in .eh_pre_reset_work: > > - nvme_pre_reset_dev() > - schedule .eh_post_reset_work > > and run draining IO & updating controller state in .eh_post_reset_work. > .eh_pre_reset_work will be scheduled in nvme_error_handler(). > > This way may address the issue of admin req timeout in nvme_pre_reset_dev(), > what do you think of this approach? nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface. Then it is more convenient to ensure that there will be only one resetting instance running. Thanks Jianchao