On Sun, Dec 12, 2021 at 5:45 PM Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>
>
>
> On 12/11/21 5:01 AM, Yi Zhang wrote:
> > On Fri, Jun 25, 2021 at 12:14 AM Yi Zhang <yi.zhang@xxxxxxxxxx> wrote:
> >>
> >> On Thu, Jun 24, 2021 at 5:32 AM Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
> >>>
> >>>
> >>>> Hello
> >>>>
> >>>> Gentle ping here, this issue still exists on latest 5.13-rc7
> >>>>
> >>>> # time nvme reset /dev/nvme0
> >>>>
> >>>> real 0m12.636s
> >>>> user 0m0.002s
> >>>> sys 0m0.005s
> >>>> # time nvme reset /dev/nvme0
> >>>>
> >>>> real 0m12.641s
> >>>> user 0m0.000s
> >>>> sys 0m0.007s
> >>>
> >>> Strange that even normal resets take so long...
> >>> What device are you using?
> >>
> >> Hi Sagi
> >>
> >> Here is the device info:
> >> Mellanox Technologies MT27700 Family [ConnectX-4]
> >>
> >>>
> >>>> # time nvme reset /dev/nvme0
> >>>>
> >>>> real 1m16.133s
> >>>> user 0m0.000s
> >>>> sys 0m0.007s
> >>>
> >>> There seems to be a spurious command timeout here, but maybe this
> >>> is due to the fact that the queues take so long to connect and
> >>> the target expires the keep-alive timer.
> >>>
> >>> Does this patch help?
> >>
> >> The issue still exists, let me know if you need more testing for it. :)
> >
> > Hi Sagi
> > ping, this issue still can be reproduced on the latest
> > linux-block/for-next, do you have a chance to recheck it, thanks.
>
> Can you check if it happens with the below patch:

Hi Sagi

It is still reproducible with the change; here is the log:

# time nvme reset /dev/nvme0

real    0m12.973s
user    0m0.000s
sys     0m0.006s
# time nvme reset /dev/nvme0

real    1m15.606s
user    0m0.000s
sys     0m0.007s
# dmesg | grep nvme
[  900.634877] nvme nvme0: resetting controller
[  909.026958] nvme nvme0: creating 40 I/O queues.
[  913.604297] nvme nvme0: mapped 40/0/0 default/read/poll queues.
[  917.600993] nvme nvme0: resetting controller
[  988.562230] nvme nvme0: I/O 2 QID 0 timeout
[  988.567607] nvme nvme0: Property Set error: 881, offset 0x14
[  988.608181] nvme nvme0: creating 40 I/O queues.
[  993.203495] nvme nvme0: mapped 40/0/0 default/read/poll queues.

BTW, this issue cannot be reproduced in my NVMe/RoCE environment.

> --
> diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
> index f91a56180d3d..6e5aadfb07a0 100644
> --- a/drivers/nvme/target/fabrics-cmd.c
> +++ b/drivers/nvme/target/fabrics-cmd.c
> @@ -191,6 +191,14 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
>                 }
>         }
>
> +       /*
> +        * Controller establishment flow may take some time, and the host may not
> +        * send us keep-alive during this period, hence reset the
> +        * traffic based keep-alive timer so we don't trigger a
> +        * controller teardown as a result of a keep-alive expiration.
> +        */
> +       ctrl->reset_tbkas = true;
> +
>         return 0;
>
> err:
> --

--
Best Regards,
Yi Zhang
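
For readers following the thread: the reset_tbkas flag set by the patch above is the flag the target's traffic-based keep-alive timer consults when it fires; if the flag is set, the timer re-arms itself for another KATO period instead of treating the missing keep-alive as fatal. Below is a rough sketch of that consumer, paraphrased from drivers/nvme/target/core.c (exact names and details may differ between kernel versions):

static void nvmet_keep_alive_timer(struct work_struct *work)
{
	struct nvmet_ctrl *ctrl = container_of(to_delayed_work(work),
			struct nvmet_ctrl, ka_work);
	bool reset_tbkas = ctrl->reset_tbkas;

	/* Consume the flag so each arming grants a single grace period. */
	ctrl->reset_tbkas = false;
	if (reset_tbkas) {
		/*
		 * Traffic (or, with the patch above, queue establishment)
		 * was seen recently: re-arm rather than expire.
		 */
		schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
		return;
	}

	pr_err("ctrl %d keep-alive timer (%d seconds) expired!\n",
	       ctrl->cntlid, ctrl->kato);
	nvmet_ctrl_fatal_error(ctrl);
}

With the patch, each nvmet_install_queue() call during a slow connect re-arms that grace period, so a keep-alive expiration alone should no longer tear the controller down mid-reset; per the log above, however, the host-side admin command timeout ("I/O 2 QID 0 timeout") still occurs.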