On Thu, Feb 22, 2018 at 09:10:00AM +0000, Kalderon, Michal wrote:
> Hi Leon, Sagi,
>
> We're trying to run a simple nvmf connect over ConnectX-4 on kernel 4.15.4, and we're hitting the following
> kernel panic on the initiator side.
> Are there any known issues on this kernel?
>
> Server configuration:
> [root@lbtlvb-pcie157 linux-4.15.4]# nvmetcli
> /> ls
> o- / .......................................................................... [...]
>   o- hosts .................................................................... [...]
>   o- ports .................................................................... [...]
>   | o- 1 ...................................................................... [...]
>   |   o- referrals ............................................................ [...]
>   |   o- subsystems ........................................................... [...]
>   |     o- nvme-subsystem-tmp ................................................. [...]
>   o- subsystems ............................................................... [...]
>     o- nvme-subsystem-tmp ..................................................... [...]
>       o- allowed_hosts ........................................................ [...]
>       o- namespaces ........................................................... [...]
>         o- 1 .................................................................. [...]
>
> Discovery is successful with the following command:
> nvme discover -t rdma -a 192.168.20.157 -s 1023
>
> Discovery Log Number of Records 1, Generation counter 1
> =====Discovery Log Entry 0======
> trtype:  rdma
> adrfam:  ipv4
> subtype: nvme subsystem
> treq:    not specified
> portid:  1
> trsvcid: 1023
> subnqn:  nvme-subsystem-tmp
> traddr:  192.168.20.157
>
> rdma_prtype: not specified
> rdma_qptype: connected
> rdma_cms:    rdma-cm
> rdma_pkey:   0x0000
>
> When running connect as follows, we get the kernel panic:
> nvme connect -t rdma -n nvme-subsystem-tmp -a 192.168.20.157 -s 1023
>
> Please advise how to proceed.
>
> Thanks,
> Michal
>
> [  663.010545] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.20.157:1023
> [  663.052781] nvme nvme0: creating 24 I/O queues.
> [  663.408093] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
> [  663.409116] nvme nvme0: failed to connect queue: 3 ret=-18

I'm not an NVMe-oF expert, but error code -18 is EXDEV, and not many places in the code can return it. It is also negative, so it is a Linux errno and not an NVMe status code.

Based on the 4.16-rc1 code, the flow is:

nvme_rdma_start_queue
  -> nvmf_connect_io_queue
    -> __nvme_submit_sync_cmd
      -> nvme_alloc_request
        -> blk_mq_alloc_request_hctx

437         /*
438          * Check if the hardware context is actually mapped to anything.
439          * If not tell the caller that it should skip this queue.
440          */
441         alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
442         if (!blk_mq_hw_queue_mapped(alloc_data.hctx)) {
443                 blk_queue_exit(q);
444                 return ERR_PTR(-EXDEV);
445         }

Hope it helps.

Thanks