update the subject to better describe the issue:

So I tried this on one nvme/rdma environment, and it was also
reproducible. Here are the steps:

# echo 0 > /sys/devices/system/cpu/cpu0/online
# dmesg | tail -10
[  781.577235] smpboot: CPU 0 is now offline
# nvme connect -t rdma -a 172.31.45.202 -s 4420 -n testnqn
Failed to write to /dev/nvme-fabrics: Invalid cross-device link
no controller found: failed to write to nvme-fabrics device
# dmesg
[  781.577235] smpboot: CPU 0 is now offline
[  799.471627] nvme nvme0: creating 39 I/O queues.
[  801.053782] nvme nvme0: mapped 39/0/0 default/read/poll queues.
[  801.064149] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[  801.073059] nvme nvme0: failed to connect queue: 1 ret=-18
This is because of blk_mq_alloc_request_hctx(), and it was raised before.
IIRC there was reluctance to make it allocate a request for an hctx even
if its associated mapped CPU is offline. The latest attempt was from Ming:

[PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx

Don't know where that went, though...