On Thu, Jul 07, 2022 at 10:28:22AM +0300, Sagi Grimberg wrote:
> > > > > > > update the subject to better describe the issue:
> > > > > >
> > > > > > So I tried this issue on one nvme/rdma environment, and it was also
> > > > > > reproducible, here are the steps:
> > > > > >
> > > > > > # echo 0 >/sys/devices/system/cpu/cpu0/online
> > > > > > # dmesg | tail -10
> > > > > > [ 781.577235] smpboot: CPU 0 is now offline
> > > > > > # nvme connect -t rdma -a 172.31.45.202 -s 4420 -n testnqn
> > > > > > Failed to write to /dev/nvme-fabrics: Invalid cross-device link
> > > > > > no controller found: failed to write to nvme-fabrics device
> > > > > >
> > > > > > # dmesg
> > > > > > [ 781.577235] smpboot: CPU 0 is now offline
> > > > > > [ 799.471627] nvme nvme0: creating 39 I/O queues.
> > > > > > [ 801.053782] nvme nvme0: mapped 39/0/0 default/read/poll queues.
> > > > > > [ 801.064149] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
> > > > > > [ 801.073059] nvme nvme0: failed to connect queue: 1 ret=-18
> > > > >
> > > > > This is because of blk_mq_alloc_request_hctx() and was raised before.
> > > > >
> > > > > IIRC there was reluctance to make it allocate a request for an hctx even
> > > > > if its associated mapped cpu is offline.
> > > > >
> > > > > The latest attempt was from Ming:
> > > > > [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
> > > > >
> > > > > Don't know where that went tho...
> > > >
> > > > The attempt relies on that the queue for connecting io queue uses
> > > > non-admined irq, unfortunately that can't be true for all drivers,
> > > > so that way can't go.
> > >
> > > The only consumer is nvme-fabrics, so others don't matter.
> > > Maybe we need a different interface that allows this relaxation.
> > >
> > > > So far, I'd suggest to fix nvme_*_connect_io_queues() to ignore failed
> > > > io queue, then the nvme host still can be setup with less io queues.
> > >
> > > What happens when the CPU comes back?
> > > Not sure we can simply ignore it.
> >
> > Anyway, it is a not good choice to fail the whole controller if only one
> > queue can't be connected.
>
> That is irrelevant.

Isn't that exactly the issue reported by Yi? If one cpu is offline, the
controller may fail to be set up in the case of 1:1 mapping. Do you
think that is reasonable?

>
> > I meant the queue can be kept as non-LIVE, and
> > it should work since no any io can be issued to this queue when it is
> > non-LIVE.
>
> The way that nvme-pci behaves is to create all the queues and either
> have them idle when their mapped cpu is offline, and have the queue
> there and ready when the cpu comes back. It is the simpler approach and
> I would like to have it for fabrics too, but to establish a fabrics
> queue we need to send a request (connect) to the controller. The fact
> that we cannot simply get a reference to a request for a given hw queue
> is baffling to me.

That is because the connect needs a request from one specified hctx,
which goes against the blk-mq queue design. Previously this caused a
kernel panic; now the controller can't be set up if any io queue can't
be connected.

>
> > Just wondering why we can't re-connect the io queue and set LIVE after
> > any CPU in the this hctx->cpumask becomes online? blk-mq could add one
> > pair of callbacks for driver for handing this queue change.
> Certainly possible, but you are creating yet another interface solely
> for nvme-fabrics that covers up for the existing interface that does not
> satisfy what nvme-fabrics (the only consumer of it) would like it to do.

The interface can be well defined, and it may have generic uses, such
as delaying allocation of the request pool until the queue becomes
active (any cpu in its mapping becomes online) to save memory.

Thanks,
Ming
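
For illustration only, here is a rough userspace model of the callback
pair idea above: a queue stays non-LIVE while no cpu in its mask is
online, and is marked LIVE once one comes online. This is not kernel
code and not an existing blk-mq interface. Every name in it
(model_queue, queue_activate, queue_deactivate, cpu_set_online,
model_setup) is made up for the sketch, and a 1:1 cpu-to-queue mapping
is assumed.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   4
#define NR_QUEUES 4

/* Hypothetical model of one I/O queue; not a kernel structure. */
struct model_queue {
	unsigned int cpumask;	/* bit n set: cpu n is mapped to this queue */
	bool live;		/* true once the queue counts as connected */
};

static unsigned int online_mask;	/* bit n set: cpu n is online */
static struct model_queue queues[NR_QUEUES];

/* Hypothetical callback: the queue gained its first online cpu. */
static void queue_activate(struct model_queue *q)
{
	q->live = true;		/* a real driver would send connect here */
}

/* Hypothetical callback: the queue lost its last online cpu. */
static void queue_deactivate(struct model_queue *q)
{
	q->live = false;
}

/* Re-evaluate every queue against the current online mask. */
static void update_queues(void)
{
	for (int i = 0; i < NR_QUEUES; i++) {
		bool active = (queues[i].cpumask & online_mask) != 0;

		if (active && !queues[i].live)
			queue_activate(&queues[i]);
		else if (!active && queues[i].live)
			queue_deactivate(&queues[i]);
	}
}

/* Model of a cpu hotplug event. */
static void cpu_set_online(int cpu, bool online)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (online)
		online_mask |= 1u << cpu;
	else
		online_mask &= ~(1u << cpu);
	update_queues();
}

static int live_count(void)
{
	int n = 0;

	for (int i = 0; i < NR_QUEUES; i++)
		n += queues[i].live ? 1 : 0;
	return n;
}

/*
 * Model of controller setup with a 1:1 cpu-to-queue mapping. Setup
 * succeeds if at least one queue is LIVE, instead of failing the
 * whole controller when a single queue can't be connected.
 */
static int model_setup(unsigned int initial_online)
{
	online_mask = initial_online;
	for (int i = 0; i < NR_QUEUES; i++) {
		queues[i].cpumask = 1u << i;
		queues[i].live = false;
	}
	update_queues();
	return live_count() > 0 ? 0 : -1;
}
```

With cpu0 offline at setup time (model_setup(0xE)), the model comes up
with three LIVE queues rather than failing, and a later
cpu_set_online(0, true) makes the fourth queue LIVE.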