Re: kernel panic during nvmf connect over mlx5

This issue was fixed by commit:

"mlx5: fix mlx5_get_vector_affinity to start from completion vector 0"
and should land in an upcoming stable release (probably 4.15.5).

please try it :)

-Max.

On 2/22/2018 12:10 PM, Leon Romanovsky wrote:
On Thu, Feb 22, 2018 at 09:10:00AM +0000, Kalderon, Michal wrote:
Hi Leon, Sagi,

We're trying to run a simple nvmf connect over ConnectX-4 on kernel 4.15.4, and we're hitting the following
kernel panic on the initiator side.
Are there any known issues on this kernel?

Server configuration
[root@lbtlvb-pcie157 linux-4.15.4]# nvmetcli
/> ls
o- / ......................................................................................................................... [...]
   o- hosts ................................................................................................................... [...]
   o- ports ................................................................................................................... [...]
   | o- 1 ..................................................................................................................... [...]
   |   o- referrals ........................................................................................................... [...]
   |   o- subsystems .......................................................................................................... [...]
   |     o- nvme-subsystem-tmp ............................................................................................... [...]
   o- subsystems .............................................................................................................. [...]
     o- nvme-subsystem-tmp ................................................................................................... [...]
       o- allowed_hosts ....................................................................................................... [...]
       o- namespaces .......................................................................................................... [...]
         o- 1 ................................................................................................................. [...]

Discovery is successful with the following command:
nvme discover  -t rdma -a 192.168.20.157 -s 1023
Discovery Log Number of Records 1, Generation counter 1
=====Discovery Log Entry 0======
trtype:  rdma
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  1
trsvcid: 1023

subnqn:  nvme-subsystem-tmp
traddr:  192.168.20.157

rdma_prtype: not specified
rdma_qptype: connected
rdma_cms:    rdma-cm
rdma_pkey: 0x0000


When we run connect as follows, we get the kernel panic:
nvme connect -t rdma -n nvme-subsystem-tmp -a 192.168.20.157 -s 1023

Please advise how to proceed.

Thanks,
Michal

[  663.010545] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.20.157:1023
[  663.052781] nvme nvme0: creating 24 I/O queues.
[  663.408093] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[  663.409116] nvme nvme0: failed to connect queue: 3 ret=-18

I'm not an NVMe-oF expert, but error code -18 means EXDEV, and not many
places in the code can return this error. Also, since it is negative, it
is a Linux errno rather than an NVMe status code.

So based on 4.16-rc1 code, the flow is:
  nvme_rdma_start_queue ->
     nvmf_connect_io_queue ->
       __nvme_submit_sync_cmd ->
          nvme_alloc_request ->
            blk_mq_alloc_request_hctx ->

        /*
         * Check if the hardware context is actually mapped to anything.
         * If not tell the caller that it should skip this queue.
         */
        alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
        if (!blk_mq_hw_queue_mapped(alloc_data.hctx)) {
                blk_queue_exit(q);
                return ERR_PTR(-EXDEV);
        }

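So -EXDEV here means the hardware context for that queue has no CPUs
mapped to it. For nvme-rdma the CPU-to-queue map is built from the
device's per-completion-vector affinity: blk_mq_rdma_map_queues() asks
ib_get_vector_affinity() for each vector's CPU mask, and on mlx5 that is
backed by mlx5_get_vector_affinity(). Below is a simplified sketch of
that mapping loop (a paraphrase from memory, not the exact kernel source;
the function name rdma_map_queues_sketch is made up):

#include <linux/blk-mq.h>
#include <linux/cpumask.h>
#include <rdma/ib_verbs.h>

/*
 * Sketch of how the CPU -> hw queue map is derived from the device's
 * completion vector affinity (roughly what blk_mq_rdma_map_queues()
 * does in this kernel generation).
 */
static int rdma_map_queues_sketch(struct blk_mq_tag_set *set,
				  struct ib_device *dev, int first_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		/* On mlx5 this ends up in mlx5_get_vector_affinity(). */
		mask = ib_get_vector_affinity(dev, first_vec + queue);
		if (!mask)
			return blk_mq_map_queues(set); /* fall back to default map */

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}

	return 0;
}

If the affinity masks the driver hands back are shifted (i.e. they do not
start from completion vector 0), some hardware queues never get a CPU
assigned, blk_mq_hw_queue_mapped() is false for them, and the request
allocation for the connect command fails with -EXDEV, which nvme-rdma
logs as "failed to connect queue: N ret=-18".
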
Hope it helps.

Thanks



