On Tue, May 15, 2018 at 8:33 AM, Keith Busch <keith.busch@xxxxxxxxxxxxxxx> wrote:
> On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
>> > > > [ 760.727485] nvme nvme1: EH 0: after recovery -19
>> > > > [ 760.727488] nvme nvme1: EH: fail controller
>> > >
>> > > The above issue (hang in nvme_remove()) is still an old issue, which
>> > > happens because the queues are kept quiesced during remove, so could
>> > > you please test the following change?
>> > >
>> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> > > index 1dec353388be..c78e5a0cde06 100644
>> > > --- a/drivers/nvme/host/core.c
>> > > +++ b/drivers/nvme/host/core.c
>> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>> > >  	 */
>> > >  	if (ctrl->state == NVME_CTRL_DEAD)
>> > >  		nvme_kill_queues(ctrl);
>> > > +	else {
>> > > +		if (ctrl->admin_q)
>> > > +			blk_mq_unquiesce_queue(ctrl->admin_q);
>> > > +		nvme_start_queues(ctrl);
>> > > +	}
>> > >
>> > >  	down_write(&ctrl->namespaces_rwsem);
>> > >  	list_splice_init(&ctrl->namespaces, &ns_list);
>> >
>> > The above won't actually do anything here since the broken link puts the
>> > controller in the DEAD state, so we've killed the queues, which also
>> > unquiesces them.
>>
>> I suggest you double-check whether the controller is really set to DEAD
>> in nvme_remove(), since no log is dumped when that happens.
>
> Yes, it's dead. pci_device_is_present() returns false when the link is
> broken.
>
> Also, the logs showed the capacity was set to 0, which only happens when
> we kill the namespace queues, which supposedly restarts the queues too.

Right, nvme_kill_queues() may trigger that. However, in my 019 test I
don't see pci_device_is_present() return false; nvme_kill_queues() has
been called from nvme_remove_dead_ctrl_work(), and I still haven't
reproduced the hang in blk_cleanup_queue(). Looks a bit weird, but
debugfs may show some clue, :-)
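
To make the failure mode concrete, below is a toy user-space model of
the hang (a minimal sketch, not kernel code: the "queue", the request
counter, and all names in it are made up for illustration).
blk_cleanup_queue() freezes the queue and waits for every in-flight
request to complete, but a quiesced queue never dispatches its pending
requests, so nothing can complete and the wait blocks forever;
unquiescing first lets the requests drain:

/* toy_quiesce.c - toy model of the quiesce-vs-freeze deadlock.
 * Illustration only; none of this is the real block-layer code.
 * Build: cc -pthread -o toy_quiesce toy_quiesce.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int inflight = 4;	/* requests already queued at remove time */
static bool quiesced = true;	/* queue left quiesced, as in the bug */

/* Stands in for the dispatch path: a quiesced queue completes nothing. */
static void *dispatch(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		if (inflight == 0) {
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		if (!quiesced && --inflight == 0)
			pthread_cond_signal(&drained);
		pthread_mutex_unlock(&lock);
		usleep(1000);
	}
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, dispatch, NULL);

	pthread_mutex_lock(&lock);
	quiesced = false;	/* the proposed fix: unquiesce before the
				 * wait; remove this line and the loop
				 * below never finishes, i.e. the hang */
	while (inflight > 0)	/* stands in for blk_mq_freeze_queue_wait() */
		pthread_cond_wait(&drained, &lock);
	pthread_mutex_unlock(&lock);

	pthread_join(t, NULL);
	printf("queue drained, cleanup can proceed\n");
	return 0;
}

Thanks,
Ming Lei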