On Fri, Apr 27, 2018 at 11:51:57AM -0600, Keith Busch wrote: > On Thu, Apr 26, 2018 at 08:39:55PM +0800, Ming Lei wrote: > > +/* > > + * This one is called after queues are quiesced, and no in-fligh timeout > > + * and nvme interrupt handling. > > + */ > > +static void nvme_pci_cancel_request(struct request *req, void *data, > > + bool reserved) > > +{ > > + /* make sure timed-out requests are covered too */ > > + if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) { > > + req->aborted_gstate = 0; > > + req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED; > > + } > > + > > + nvme_cancel_request(req, data, reserved); > > +} > > I don't know about this. I feel like blk-mq owns these flags and LLD's > shouldn't require such knowledge of their implementation details. If driver returns BLK_EH_NOT_HANDLED from .timeout(), the driver may need to deal with this flag. But it won't be necessary to use this flag here, and it can be done by clearing 'req->aborted_gstate' always. > > I understand how the problems are happening a bit better now. It used > to be that blk-mq would lock an expired command one at a time, so when > we had a batch of IO timeouts, the driver was able to complete all of > them inside a single IO timeout handler. > > That's not the case anymore, so the driver is called for every IO > timeout even though if it reaped all the commands at once. Actually there isn't the case before, even for legacy path, one .timeout() handles one request only. > > IMO, the block layer should allow the driver to complete all timed out > IOs at once rather than go through ->timeout() for each of them and > without knowing about the request state details. Much of the problems > would go away in that case. That need to change every driver, looks not easy to do, also not sure if other drivers need this way. > > But I think much of this can be fixed by just syncing the queues in the > probe work, and may be a more reasonable bug-fix for 4.17 rather than > rewrite the entire timeout handler. What do you think of this patch? It's > an update of one I posted a few months ago, but uses Jianchao's safe > namespace list locking. > > --- > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index a3771c5729f5..198b4469c3e2 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -3562,12 +3562,25 @@ void nvme_start_queues(struct nvme_ctrl *ctrl) > struct nvme_ns *ns; > > down_read(&ctrl->namespaces_rwsem); > - list_for_each_entry(ns, &ctrl->namespaces, list) > + list_for_each_entry(ns, &ctrl->namespaces, list) { > blk_mq_unquiesce_queue(ns->queue); > + blk_mq_kick_requeue_list(ns->queue); > + } > up_read(&ctrl->namespaces_rwsem); > } > EXPORT_SYMBOL_GPL(nvme_start_queues); > > +void nvme_sync_queues(struct nvme_ctrl *ctrl) > +{ > + struct nvme_ns *ns; > + > + down_read(&ctrl->namespaces_rwsem); > + list_for_each_entry(ns, &ctrl->namespaces, list) > + blk_sync_queue(ns->queue); > + up_read(&ctrl->namespaces_rwsem); > +} > +EXPORT_SYMBOL_GPL(nvme_sync_queues); > + > int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set) > { > if (!ctrl->ops->reinit_request) > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 7ded7a51c430..e62273198725 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -402,6 +402,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len, > void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, > union nvme_result *res); > > +void nvme_sync_queues(struct nvme_ctrl *ctrl); > void nvme_stop_queues(struct nvme_ctrl *ctrl); > void nvme_start_queues(struct nvme_ctrl *ctrl); > void nvme_kill_queues(struct nvme_ctrl *ctrl); > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index fbc71fac6f1e..a2f3ad105620 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -2303,6 +2303,7 @@ static void nvme_reset_work(struct work_struct *work) > */ > if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) > nvme_dev_disable(dev, false); > + nvme_sync_queues(&dev->ctrl); This sync may be raced with one timed-out request, which may be handled as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't work reliably. There are more issues except for the above one if disabling controller and resetting it are run in different context: 1) inside resetting, nvme_dev_disable() may be called, but this way may cause double completion on the previous timed-out request. 2) the timed-out request can't be covered by nvme_dev_disable(), and this way is too tricky and easy to cause trouble. IMO, it is much easier to avoid the above issues by handling controller recover in one single thread. I will address all your comments and post V3 for review. Thanks, Ming