Hi Max

The patch fixed the timeout issue on a non-debug kernel, but when I
tested a debug kernel with your patches, the timeout can still be
triggered by "nvme reset/nvme disconnect-all" operations.

On Tue, Feb 15, 2022 at 10:31 PM Max Gurtovoy <mgurtovoy@xxxxxxxxxx> wrote:
>
> Thanks Yi Zhang.
>
> A few years ago I sent some patches that were supposed to fix the KA
> mechanism, but eventually they weren't accepted.
>
> I haven't tested it since, but maybe you can run some tests with it.
>
> The attached patches are partial and include only the rdma transport for
> your testing.
>
> If it works for you we can work on it again and argue for correctness.
>
> Please don't use the workaround we suggested earlier with these patches.
>
> -Max.
>
> On 2/15/2022 3:52 PM, Yi Zhang wrote:
> > Hi Sagi/Max
> >
> > Changing the value to 10 or 15 fixed the timeout issue.
> > The reset operation still takes more than 12s in my environment. I
> > also tried disabling pi_enable, and the reset operation went back
> > to 3s, so the extra 9s seems to come from the PI-enabled code path.
> >
> > On Mon, Feb 14, 2022 at 8:12 PM Max Gurtovoy <mgurtovoy@xxxxxxxxxx> wrote:
> >>
> >> On 2/14/2022 1:32 PM, Sagi Grimberg wrote:
> >>>> Hi Sagi/Max
> >>>> Here are more findings from the bisect:
> >>>>
> >>>> The time for the reset operation changed from 3s [1] to 12s [2] after
> >>>> commit [3], and after commit [4] the reset operation times out on the
> >>>> second reset [5]. Let me know if you need any testing for it, thanks.
> >>> Does this at least eliminate the timeout?
> >>> --
> >>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> >>> index a162f6c6da6e..60e415078893 100644
> >>> --- a/drivers/nvme/host/nvme.h
> >>> +++ b/drivers/nvme/host/nvme.h
> >>> @@ -25,7 +25,7 @@ extern unsigned int nvme_io_timeout;
> >>>  extern unsigned int admin_timeout;
> >>>  #define NVME_ADMIN_TIMEOUT	(admin_timeout * HZ)
> >>>
> >>> -#define NVME_DEFAULT_KATO	5
> >>> +#define NVME_DEFAULT_KATO	10
> >>>
> >>>  #ifdef CONFIG_ARCH_NO_SG_CHAIN
> >>>  #define NVME_INLINE_SG_CNT	0
> >>> --
> >>>
> >> or for the initial test you can use the --keep-alive-tmo=<10 or 15> flag in
> >> the connect command
> >>
> >>>> [1]
> >>>> # time nvme reset /dev/nvme0
> >>>>
> >>>> real    0m3.049s
> >>>> user    0m0.000s
> >>>> sys     0m0.006s
> >>>> [2]
> >>>> # time nvme reset /dev/nvme0
> >>>>
> >>>> real    0m12.498s
> >>>> user    0m0.000s
> >>>> sys     0m0.006s
> >>>> [3]
> >>>> commit 5ec5d3bddc6b912b7de9e3eb6c1f2397faeca2bc (HEAD)
> >>>> Author: Max Gurtovoy <maxg@xxxxxxxxxxxx>
> >>>> Date:   Tue May 19 17:05:56 2020 +0300
> >>>>
> >>>>     nvme-rdma: add metadata/T10-PI support
> >>>>
> >>>> [4]
> >>>> commit a70b81bd4d9d2d6c05cfe6ef2a10bccc2e04357a (HEAD)
> >>>> Author: Hannes Reinecke <hare@xxxxxxx>
> >>>> Date:   Fri Apr 16 13:46:20 2021 +0200
> >>>>
> >>>>     nvme: sanitize KATO setting
> >>>>
> >>> This change effectively changed the keep-alive timeout
> >>> from 15 to 5 and modified the host to send keepalives every
> >>> 2.5 seconds instead of 5.
> >>>
> >>> I guess that, in combination with the fact that it now takes longer to
> >>> create and delete rdma resources (either qps or mrs),
> >>> it starts to time out in setups where there are a lot of
> >>> queues.


--
Best Regards,
  Yi Zhang
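
For the connect-time workaround Max mentions above, the keep-alive timeout
can be overridden per controller without rebuilding the kernel. A minimal
sketch, assuming an RDMA target; the transport address, port, and subsystem
NQN below are placeholders, not values from this thread:

    # placeholder traddr/trsvcid/NQN -- substitute your target's values
    nvme connect -t rdma -a 192.168.1.10 -s 4420 \
        -n nqn.2014-08.org.example:testsubsys --keep-alive-tmo=15

The --keep-alive-tmo value is in seconds and applies only to the controller
created by that connect; controllers connected without the flag still use
the NVME_DEFAULT_KATO default discussed above.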