Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx

Hannes Reinecke <hare@xxxxxxx> · Thu, 1 Jul 2021 10:00:27 +0200

On 7/1/21 1:59 AM, Ming Lei wrote:
On Wed, Jun 30, 2021 at 09:46:35PM +0200, Hannes Reinecke wrote:
On 6/30/21 8:59 PM, Sagi Grimberg wrote:

Shouldn't we rather modify the tagset to only refer to
the current online
CPUs _only_, thereby never submit a connect request for hctx with only
offline CPUs?

Then you may setup very less io queues, and performance may suffer even
though lots of CPUs become online later.
;
Only if we stay with the reduced number of I/O queues. Which is
not what I'm
proposing; I'd rather prefer to connect and disconnect queues
from the cpu
hotplug handler. For starters we could even trigger a reset once
the first
cpu within a hctx is onlined.

Yeah, that need one big/complicated patchset, but not see any advantages
over this simple approach.

I tend to agree with Ming here.

Actually, Daniel and me came to a slightly different idea: use cpu hotplug
notifier.
Thing is, blk-mq already has cpu hotplug notifier, which should ensure that
no I/O is pending during cpu hotplug.

Why should we ensure that for non-managed irq?

While not strictly necessary, it does align the hctx layout with the 
current CPU topology.
As such we won't have any 'stale' CPUs or queues in the hctx layout, and 
with that avoid any issues we'll be having due to inactive CPUs in the 
cpumask.

If we now add a nvme cpu hotplug notifier which essentially kicks off a
reset once all cpu in a hctx are offline the reset logic will rearrange the
queues to match the current cpu layout.
And when the cpus are getting onlined we'll do another reset.

Daniel is currently preparing a patch; let's see how it goes.

What is the advantage of that big change over this simple way?

With the simple way we might (and, as the results show, do) run the 
nvme_ctrl_reset() in parallel to CPU hotplug.
This leads to quite some complexity, and as we've seen is triggering 
quite some race conditions.

Hence I do think we need to synchronize nvme_ctrl_reset() with CPU 
hotplug, to ensure that the reset handler is completed before the cpu is 
completely torn down.

But synchronizing with nvme_ctrl_reset() means to issue a flush_work(); 
and if we do that we might as well go full circle and call 
nvme_ctrl_reset_sync(). And as this will rearrange the hctx to match the 
current CPU topology we don't need to fiddle with it in the hotplug 
handler, and can leave everything to nvme_ctrl_reset().

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer