Re: Hang at NVME Host caused by Controller reset

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




This time, with "nvme-fabrics: allow to queue requests for live queues"
patch applied, I see hang only at blk_queue_enter():

Interesting, does the reset loop hang? or is it able to make forward
progress?

Looks like the freeze depth is messed up with the timeout handler.
We shouldn't call nvme_tcp_teardown_io_queues in the timeout handler
because it messes with the freeze depth, causing the unfreeze to not
wake the waiter (blk_queue_enter). We should simply stop the queue
and complete the I/O, and the condition was wrong too, because we
need to do it only for the connect command (which cannot reset the
timer). So we should check for reserved in the timeout handler.

Can you please try this patch?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 62fbaecdc960..c3288dd2c92f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -464,6 +464,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
        if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
                return;

+       dev_warn(ctrl->device, "starting error recovery\n");
        queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
 }

@@ -2156,33 +2157,37 @@ nvme_tcp_timeout(struct request *rq, bool reserved)
        struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
        struct nvme_tcp_cmd_pdu *pdu = req->pdu;

-       /*
-        * Restart the timer if a controller reset is already scheduled. Any
- * timed out commands would be handled before entering the connecting
-        * state.
-        */
-       if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
-               return BLK_EH_RESET_TIMER;
-
        dev_warn(ctrl->ctrl.device,
                "queue %d: timeout request %#x type %d\n",
                nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);

-       if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+       switch (ctrl->ctrl.state) {
+       case NVME_CTRL_RESETTING:
                /*
- * Teardown immediately if controller times out while starting
-                * or we are already started error recovery. all outstanding
- * requests are completed on shutdown, so we return BLK_EH_DONE. + * Restart the timer if a controller reset is already scheduled. + * Any timed out commands would be handled before entering the
+                * connecting state.
                 */
-               flush_work(&ctrl->err_work);
-               nvme_tcp_teardown_io_queues(&ctrl->ctrl, false);
-               nvme_tcp_teardown_admin_queue(&ctrl->ctrl, false);
-               return BLK_EH_DONE;
+               return BLK_EH_RESET_TIMER;
+       case NVME_CTRL_CONNECTING:
+               if (reserved) {
+                       /*
+ * stop queue immediately if controller times out while connecting + * or we are already started error recovery. all outstanding + * requests are completed on shutdown, so we return BLK_EH_DONE.
+                        */
+ nvme_tcp_stop_queue(&ctrl->ctrl, nvme_tcp_queue_id(req->queue));
+                       nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+                       nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+                       blk_mq_complete_request(rq);
+                       return BLK_EH_DONE;
+               }
+               /* fallthru */
+       default:
+       case NVME_CTRL_LIVE:
+               nvme_tcp_error_recovery(&ctrl->ctrl);
        }

-       dev_warn(ctrl->ctrl.device, "starting error recovery\n");
-       nvme_tcp_error_recovery(&ctrl->ctrl);
-
        return BLK_EH_RESET_TIMER;
 }
--



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux