Hi Sagi/Max,

Here are more findings from the bisect:
The time for the reset operation changed from 3s[1] to 12s[2] after commit[3], and after commit[4] the reset operation times out at the second reset[5]. Let me know if you need any testing for it, thanks.
Does this at least eliminate the timeout?
--
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a162f6c6da6e..60e415078893 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -25,7 +25,7 @@ extern unsigned int nvme_io_timeout;
 extern unsigned int admin_timeout;
 #define NVME_ADMIN_TIMEOUT (admin_timeout * HZ)
 
-#define NVME_DEFAULT_KATO 5
+#define NVME_DEFAULT_KATO 10
 
 #ifdef CONFIG_ARCH_NO_SG_CHAIN
 #define NVME_INLINE_SG_CNT 0
--
[1]
# time nvme reset /dev/nvme0

real 0m3.049s
user 0m0.000s
sys 0m0.006s

[2]
# time nvme reset /dev/nvme0

real 0m12.498s
user 0m0.000s
sys 0m0.006s

[3]
commit 5ec5d3bddc6b912b7de9e3eb6c1f2397faeca2bc (HEAD)
Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
Date: Tue May 19 17:05:56 2020 +0300

    nvme-rdma: add metadata/T10-PI support

[4]
commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (HEAD)
Author: Hannes Reinecke <hare@xxxxxxx>
Date: Fri Apr 16 13:46:20 2021 +0200

    nvme: sanitize KATO setting-
This change effectively reduced the keep-alive timeout from 15 to 5 seconds and made the host send keep-alives every 2.5 seconds instead of every 5. My guess is that, combined with the fact that it now takes longer to create and delete RDMA resources (either QPs or MRs), this starts to time out in setups with a lot of queues.
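To put rough numbers on that, here is a small sketch of the arithmetic. It assumes that after the change the host sends keep-alives at half the timeout (which matches the 5 s / 2.5 s figures above) and that the proposed NVME_DEFAULT_KATO of 10 would be handled the same way. The struct and helper names are made up for illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model of one keep-alive configuration (not a kernel struct). */
struct kato_config {
	const char *label;
	unsigned int timeout_ms;   /* keep-alive timeout on the controller side */
	unsigned int interval_ms;  /* how often the host sends a keep-alive */
};

/*
 * Slack: how long a single keep-alive may be delayed (for example while
 * queues are being torn down and re-created) before the timeout expires.
 */
unsigned int kato_slack_ms(const struct kato_config *cfg)
{
	return cfg->timeout_ms - cfg->interval_ms;
}

/* Assumption: the host sends keep-alives at half the timeout. */
struct kato_config kato_half_interval(const char *label, unsigned int kato_s)
{
	struct kato_config cfg = { label, kato_s * 1000, kato_s * 1000 / 2 };
	return cfg;
}

/* Print one configuration together with its slack. */
void kato_print(const struct kato_config *cfg)
{
	printf("%-8s timeout=%ums interval=%ums slack=%ums\n",
	       cfg->label, cfg->timeout_ms, cfg->interval_ms,
	       kato_slack_ms(cfg));
}
```

Under these assumptions, the old behaviour (15 s timeout, keep-alive every 5 s) leaves 10 s of slack, the current one (5 s / 2.5 s) leaves 2.5 s, and a default of 10 would leave 5 s. Whether that is enough for a setup with many queues is exactly what the test above would show.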