Re: [bug report] nvme/rdma: nvme connect failed after offline one cpu on host side

On 7/26/22 05:05, Ming Lei wrote:
On Thu, Jul 07, 2022 at 10:28:22AM +0300, Sagi Grimberg wrote:

update the subject to better describe the issue:

So I tried to reproduce this issue on an nvme/rdma environment, and it
is reproducible there too; here are the steps:

# echo 0 >/sys/devices/system/cpu/cpu0/online
# dmesg | tail -10
[  781.577235] smpboot: CPU 0 is now offline
# nvme connect -t rdma -a 172.31.45.202 -s 4420 -n testnqn
Failed to write to /dev/nvme-fabrics: Invalid cross-device link
no controller found: failed to write to nvme-fabrics device

# dmesg
[  781.577235] smpboot: CPU 0 is now offline
[  799.471627] nvme nvme0: creating 39 I/O queues.
[  801.053782] nvme nvme0: mapped 39/0/0 default/read/poll queues.
[  801.064149] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[  801.073059] nvme nvme0: failed to connect queue: 1 ret=-18

This is because of blk_mq_alloc_request_hctx() and was raised before.

IIRC there was reluctance to make it allocate a request for an hctx even
if its associated mapped cpu is offline.
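
For reference, the relevant check is roughly the following (a
paraphrased sketch, not verbatim kernel source); -EXDEV is errno 18,
which matches the "Invalid cross-device link" and "ret=-18" seen in the
log above:

/*
 * Paraphrased sketch of the check in blk_mq_alloc_request_hctx():
 * if no CPU in the target hctx's cpumask is online, refuse the
 * allocation instead of running the request on an offline ctx.
 */
cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
if (cpu >= nr_cpu_ids)
	return ERR_PTR(-EXDEV);	/* errno 18: "Invalid cross-device link" */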

The latest attempt was from Ming:
[PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx

Don't know where that went tho...

That attempt relies on the queue used for connecting io queues having a
non-managed irq; unfortunately that isn't true for all drivers, so that
approach can't go forward.

The only consumer is nvme-fabrics, so others don't matter.
Maybe we need a different interface that allows this relaxation.

For now, I'd suggest fixing nvme_*_connect_io_queues() to ignore a
failed io queue, so the nvme host can still be set up with fewer io
queues.
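
Roughly something like this in the transport's start-io-queues loop (a
hypothetical sketch only, shown against the rdma transport's helpers;
not a tested patch):

/*
 * Hypothetical sketch: skip a queue whose connect failed instead of
 * failing the whole controller setup; the queue stays non-LIVE and
 * no io will be issued to it.
 */
for (i = 1; i < ctrl->ctrl.queue_count; i++) {
	ret = nvme_rdma_start_queue(ctrl, i);
	if (ret)
		continue;	/* leave io queue i non-LIVE */
}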

What happens when the CPU comes back? Not sure we can simply ignore it.

Anyway, it is not a good choice to fail the whole controller if only
one queue can't be connected.

That is irrelevant.

I meant the queue can be kept non-LIVE, and that should work since no
io can be issued to this queue while it is non-LIVE.

The way that nvme-pci behaves is to create all the queues, leave them
idle when their mapped cpu is offline, and have the queue there and
ready when the cpu comes back. It is the simpler approach and I would
like to have it for fabrics too, but to establish a fabrics queue we
need to send a request (connect) to the controller. The fact that we
cannot simply get a reference to a request for a given hw queue is
baffling to me.
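
For context, the io queue connect path boils down to roughly this (a
simplified, version-dependent sketch of __nvme_submit_sync_cmd(), not
verbatim code): the Connect command for io queue qid has to be
allocated on that queue's own hctx, which is why it ends up in
blk_mq_alloc_request_hctx():

/* Simplified sketch of how the fabrics Connect command gets its request: */
if (qid == NVME_QID_ANY)
	req = blk_mq_alloc_request(q, op, flags);
else
	req = blk_mq_alloc_request_hctx(q, op, flags, qid - 1);
if (IS_ERR(req))
	return PTR_ERR(req);	/* -EXDEV when the hctx has no online CPU */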

Just wondering why we can't re-connect the io queue and set it LIVE
after any CPU in this hctx->cpumask becomes online? blk-mq could add a
pair of callbacks for the driver to handle this queue change.

Certainly possible, but you are creating yet another interface solely
for nvme-fabrics that covers up for the existing interface that does not
satisfy what nvme-fabrics (the only consumer of it) would like it to do.
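
(Purely to illustrate what such an interface might look like; these
callbacks do not exist in blk_mq_ops today and the names are made up:)

/*
 * Hypothetical only: a callback pair a transport could implement so
 * blk-mq can tell it when an hctx gains or loses its last online CPU.
 */
struct blk_mq_ops {
	/* ... existing callbacks ... */

	/* first CPU in hctx->cpumask came online: (re)connect the queue */
	int (*hctx_online)(struct blk_mq_hw_ctx *hctx);

	/* last CPU in hctx->cpumask went offline: quiesce / mark non-LIVE */
	void (*hctx_offline)(struct blk_mq_hw_ctx *hctx);
};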

I guess you meant that the others (rdma and tcp) use non-managed queue
irqs, so they don't need such a change?

But that actually isn't true; blk-mq/nvme still can't handle it well.
From blk-mq's viewpoint, if all CPUs in hctx->cpumask are offline, it
treats the hctx as inactive and not workable, and refuses to allocate a
request from this hctx, no matter whether the underlying queue irq is
managed or not.

Now, after 14dc7a18abbe ("block: Fix handling of offline queues in
blk_mq_alloc_request_hctx()"), it can easily break controller setup if
any CPU is offline.

I'd suggest fixing the issue in a unified way, since nvme-fabrics needs
to be covered; then nvme's user experience can be improved.

That is exactly what I want, but unlike pcie, nvmf creates the queue
using a connect request that is not driven from a user context. Hence
it would be nice to have an interface to get it done.

The alternative would be to make nvmf connect not use blk-mq, but that
is not a good alternative in my mind. Having a callback interface for
cpu hotplug is just another interface that every transport will need
to implement, and it makes nvmf different from pci.

BTW, I guess rdma/tcp/fc queues may take extra or bigger resources than
nvme-pci; if resources are only allocated once the queue is active,
queue resource utilization may be improved.

That is not a concern whatsoever. Queue resources are cheap enough
that we shouldn't have to care about it at this scale.


