RE: kernel panic during nvmf connect over mlx5

> From: Max Gurtovoy [mailto:maxg@xxxxxxxxxxxx]
> Sent: Thursday, February 22, 2018 1:04 PM
> 
> This issue was fixed by the commit
> "mlx5: fix mlx5_get_vector_affinity to start from completion vector 0"
> and will be included in a stable release (probably 4.15.5).
> 
> Please try it :)
> 
That fixed the issue. Thanks for the quick response.
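
For context on why that particular commit matters, here is a minimal sketch of the mechanism as I understand it: the RDMA transport builds its CPU-to-queue map from the IRQ affinity the driver reports for each completion vector, so if the affinity lookup starts from the wrong base vector the map is shifted and at least one I/O queue ends up with no CPUs assigned to it. The sketch is plain userspace C written for this thread; it is not the mlx5 or blk-mq source, and every name in it is illustrative.

/*
 * Illustration only -- not kernel code.  Models how a driver's
 * per-completion-vector CPU affinity is turned into a queue map,
 * and what goes wrong when the lookup starts from the wrong vector.
 */
#include <stdio.h>

#define NR_CPUS    8
#define NR_QUEUES  8            /* one I/O queue per completion vector */

/* Pretend IRQ affinity: completion vector v is pinned to CPU v. */
static int vector_to_cpu(int vector)
{
	return vector < NR_CPUS ? vector : -1;
}

int main(void)
{
	for (int base = 0; base <= 1; base++) {
		printf("affinity lookup starting at completion vector %d:\n", base);
		for (int q = 0; q < NR_QUEUES; q++) {
			/* Queue q asks for the affinity of vector base + q. */
			int cpu = vector_to_cpu(base + q);

			if (cpu < 0)
				printf("  queue %d: no CPU mapped!\n", q);
			else
				printf("  queue %d: cpu %d\n", q, cpu);
		}
	}
	return 0;
}

With the lookup starting at vector 0 every queue gets a CPU; starting one vector too late leaves the last queue without any, and a request targeted at that queue can then only fail (see the blk-mq excerpt Leon quotes further down in the thread).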

> -Max.
> 
> On 2/22/2018 12:10 PM, Leon Romanovsky wrote:
> > On Thu, Feb 22, 2018 at 09:10:00AM +0000, Kalderon, Michal wrote:
> >> Hi Leon, Sagi,
> >>
> >> We're trying to run a simple NVMe-oF connect over a ConnectX-4 on kernel
> >> 4.15.4, and we're hitting the following kernel panic on the initiator side.
> >> Are there any known issues on this kernel?
> >>
> >> Server configuration:
> >> [root@lbtlvb-pcie157 linux-4.15.4]# nvmetcli
> >> /> ls
> >> o- / ..................................................... [...]
> >>   o- hosts ............................................... [...]
> >>   o- ports ............................................... [...]
> >>   | o- 1 ................................................. [...]
> >>   |   o- referrals ....................................... [...]
> >>   |   o- subsystems ...................................... [...]
> >>   |     o- nvme-subsystem-tmp ............................ [...]
> >>   o- subsystems .......................................... [...]
> >>     o- nvme-subsystem-tmp ................................ [...]
> >>       o- allowed_hosts ................................... [...]
> >>       o- namespaces ...................................... [...]
> >>         o- 1 ............................................. [...]
> >>
> >> Discovery is successful with the following command:
> >> nvme discover -t rdma -a 192.168.20.157 -s 1023
> >>
> >> Discovery Log Number of Records 1, Generation counter 1
> >> =====Discovery Log Entry 0======
> >> trtype:  rdma
> >> adrfam:  ipv4
> >> subtype: nvme subsystem
> >> treq:    not specified
> >> portid:  1
> >> trsvcid: 1023
> >>
> >> subnqn:  nvme-subsystem-tmp
> >> traddr:  192.168.20.157
> >>
> >> rdma_prtype: not specified
> >> rdma_qptype: connected
> >> rdma_cms:    rdma-cm
> >> rdma_pkey: 0x0000
> >>
> >>
> >> When running connect as follows, we get the kernel panic:
> >>
> >> nvme connect -t rdma -n nvme-subsystem-tmp -a 192.168.20.157 -s 1023
> >>
> >> Please advise how to proceed.
> >>
> >> Thanks,
> >> Michal
> >>
> >> [  663.010545] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 192.168.20.157:1023
> >> [  663.052781] nvme nvme0: creating 24 I/O queues.
> >> [  663.408093] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
> >> [  663.409116] nvme nvme0: failed to connect queue: 3 ret=-18
> >
> > I'm not an NVMe-oF expert, but error code -18 is EXDEV, and not many
> > places in the code can return this error. Since it is negative, it is a
> > Linux errno rather than an NVMe status code.
> >
> > So based on the 4.16-rc1 code, the flow is:
> >   nvme_rdma_start_queue ->
> >     nvmf_connect_io_queue ->
> >       __nvme_submit_sync_cmd ->
> >         nvme_alloc_request ->
> >           blk_mq_alloc_request_hctx ->
> >
> >         /*
> >          * Check if the hardware context is actually mapped to anything.
> >          * If not tell the caller that it should skip this queue.
> >          */
> >         alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
> >         if (!blk_mq_hw_queue_mapped(alloc_data.hctx)) {
> >                 blk_queue_exit(q);
> >                 return ERR_PTR(-EXDEV);
> >         }
> >
> > Hope it helps.
> >
> > Thanks
> >
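
To make the -EXDEV path quoted above a little more concrete, here is a toy model of the check: a hardware context that no CPU is mapped to is treated as unusable, and trying to allocate a request on it fails with -EXDEV, i.e. the ret=-18 in the connect log. This is plain userspace C written for this thread, not the kernel's blk-mq code; the struct and function names are made up for illustration.

/*
 * Toy model of the check quoted above -- not the real blk-mq code.
 * A "hardware context" here is just a count of mapped CPUs; a context
 * with none is unmapped, and allocating a request on it fails with
 * -EXDEV, which is the ret=-18 seen in the connect log.
 */
#include <errno.h>
#include <stdio.h>

#define NR_HCTX 4

struct toy_hctx {
	int nr_ctx;             /* number of CPUs mapped to this context */
};

/* Rough stand-in for blk_mq_hw_queue_mapped(). */
static int toy_hw_queue_mapped(const struct toy_hctx *hctx)
{
	return hctx->nr_ctx != 0;
}

/* Rough stand-in for blk_mq_alloc_request_hctx(): refuse to allocate
 * a request on a context that nothing is mapped to. */
static int toy_alloc_request_hctx(const struct toy_hctx *table, int idx)
{
	if (!toy_hw_queue_mapped(&table[idx]))
		return -EXDEV;  /* -18 on Linux */
	return 0;
}

int main(void)
{
	/* Context 2 got no CPUs, e.g. because the reported completion
	 * vector affinity was shifted by one; the others got some. */
	struct toy_hctx hctx[NR_HCTX] = {
		{ .nr_ctx = 2 }, { .nr_ctx = 1 }, { .nr_ctx = 0 }, { .nr_ctx = 1 },
	};

	for (int i = 0; i < NR_HCTX; i++)
		printf("queue %d: alloc -> %d\n", i, toy_alloc_request_hctx(hctx, i));
	return 0;
}

In the reported failure the unmapped context appears to be a consequence of the shifted completion-vector affinity on the mlx5 side: one of the 24 I/O queues gets no CPUs in its map, so connecting that queue (queue 3 in the log) is refused with -18.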



