On Wed, Mar 30, 2022 at 02:12:18PM +0000, Belanger, Martin wrote:
> I know this patch is mainly for PCI devices, however, NVMe over Fabrics
> devices can suffer even longer shutdowns. Last September, I reported
> that shutting down an NVMe-oF TCP connection while the network is down
> will result in a 1-minute deadlock. That's because the driver tries to
> perform a proper shutdown by sending commands to the remote target and
> the timeout for unanswered commands is 1 minute. If one needs to shut
> down several NVMe-oF connections, each connection will be shut down
> sequentially, each taking 1 minute. Try running "nvme disconnect-all"
> while the network is down and you'll see what I mean. Of course, the
> KATO is supposed to detect when connectivity is lost, but if you have a
> long KATO (e.g. 2 minutes) you will most likely hit this condition.

I've been debugging something similar:

[44888.710527] nvme nvme0: Removing ctrl: NQN "xxx"
[44898.981684] nvme nvme0: failed to send request -32
[44960.982977] nvme nvme0: queue 0: timeout request 0x18 type 4
[44960.983099] nvme nvme0: Property Set error: 881, offset 0x14

Currently testing this patch:

+++ b/drivers/nvme/host/tcp.c
@@ -1103,9 +1103,12 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	if (ret == -EAGAIN) {
 		ret = 0;
 	} else if (ret < 0) {
+		struct request *rq = blk_mq_rq_from_pdu(queue->request);
+
 		dev_err(queue->ctrl->ctrl.device,
 			"failed to send request %d\n", ret);
-		if (ret != -EPIPE && ret != -ECONNRESET)
+		if ((ret != -EPIPE && ret != -ECONNRESET) ||
+		    rq->cmd_flags & REQ_FAILFAST_DRIVER)
 			nvme_tcp_fail_request(queue->request);
 		nvme_tcp_done_send_req(queue);
 	}