On Wed, Mar 30, 2022 at 02:12:18PM +0000, Belanger, Martin wrote:
> I know this patch is mainly for PCI devices, however, NVMe over Fabrics
> devices can suffer even longer shutdowns. Last September, I reported
> that shutting down an NVMe-oF TCP connection while the network is down
> will result in a 1-minute deadlock. That's because the driver tries to
> perform a proper shutdown by sending commands to the remote target and
> the timeout for unanswered commands is 1 minute. If one needs to shut
> down several NVMe-oF connections, each connection will be shut down
> sequentially, each taking 1 minute. Try running "nvme disconnect-all"
> while the network is down and you'll see what I mean. Of course, the
> KATO is supposed to detect when connectivity is lost, but if you have a
> long KATO (e.g. 2 minutes) you will most likely hit this condition.

I've been debugging something similar:

[44888.710527] nvme nvme0: Removing ctrl: NQN "xxx"
[44898.981684] nvme nvme0: failed to send request -32
[44960.982977] nvme nvme0: queue 0: timeout request 0x18 type 4
[44960.983099] nvme nvme0: Property Set error: 881, offset 0x14

Currently testing this patch:

+++ b/drivers/nvme/host/tcp.c
@@ -1103,9 +1103,12 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	if (ret == -EAGAIN) {
 		ret = 0;
 	} else if (ret < 0) {
+		struct request *rq = blk_mq_rq_from_pdu(queue->request);
+
 		dev_err(queue->ctrl->ctrl.device,
 			"failed to send request %d\n", ret);
-		if (ret != -EPIPE && ret != -ECONNRESET)
+		if ((ret != -EPIPE && ret != -ECONNRESET) ||
+		    rq->cmd_flags & REQ_FAILFAST_DRIVER)
 			nvme_tcp_fail_request(queue->request);
 		nvme_tcp_done_send_req(queue);
 	}