Hi Sagi/Max

Changing the value to 10 or 15 fixed the timeout issue.
However, the reset operation still takes more than 12s in my
environment. I also tried disabling pi_enable, and the reset time
went back to 3s, so it seems the added 9s comes from the PI-enabled
code path.

On Mon, Feb 14, 2022 at 8:12 PM Max Gurtovoy <mgurtovoy@xxxxxxxxxx> wrote:
>
>
> On 2/14/2022 1:32 PM, Sagi Grimberg wrote:
> >
> >> Hi Sagi/Max
> >> Here are more findings with the bisect:
> >>
> >> The time for reset operation changed from 3s[1] to 12s[2] after
> >> commit[3], and after commit[4], the reset operation timeout at the
> >> second reset[5], let me know if you need any testing for it, thanks.
> >
> > Does this at least eliminate the timeout?
> > --
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index a162f6c6da6e..60e415078893 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -25,7 +25,7 @@ extern unsigned int nvme_io_timeout;
> >   extern unsigned int admin_timeout;
> >   #define NVME_ADMIN_TIMEOUT     (admin_timeout * HZ)
> >
> > -#define NVME_DEFAULT_KATO      5
> > +#define NVME_DEFAULT_KATO      10
> >
> >   #ifdef CONFIG_ARCH_NO_SG_CHAIN
> >   #define  NVME_INLINE_SG_CNT  0
> > --
> >
> or for the initial test you can use --keep-alive-tmo=<10 or 15> flag in
> the connect command
>
> >>
> >> [1]
> >> # time nvme reset /dev/nvme0
> >>
> >> real 0m3.049s
> >> user 0m0.000s
> >> sys 0m0.006s
> >> [2]
> >> # time nvme reset /dev/nvme0
> >>
> >> real 0m12.498s
> >> user 0m0.000s
> >> sys 0m0.006s
> >> [3]
> >> commit 5ec5d3bddc6b912b7de9e3eb6c1f2397faeca2bc (HEAD)
> >> Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> >> Date:   Tue May 19 17:05:56 2020 +0300
> >>
> >>     nvme-rdma: add metadata/T10-PI support
> >>
> >> [4]
> >> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (HEAD)
> >> Author: Hannes Reinecke <hare@xxxxxxx>
> >> Date:   Fri Apr 16 13:46:20 2021 +0200
> >>
> >>     nvme: sanitize KATO setting
> >
> > This change effectively changed the keep-alive timeout
> > from 15 to 5 and modified the host to send keepalives every
> > 2.5 seconds instead of 5.
> >
> > I guess that in combination that now it takes longer to
> > create and delete rdma resources (either qps or mrs)
> > it starts to timeout in setups where there are a lot of
> > queues.
>

--
Best Regards,
  Yi Zhang