Re: blk-mq: improvement CPU hotplug (simplified version) v4

On 27/05/2020 21:07, Bart Van Assche wrote:
On 2020-05-27 11:06, Christoph Hellwig wrote:
this series ensures I/O is quiesced before a CPU, and thus the managed
interrupt handler, is shut down.

This patchset tries to address the issue by the following approach:

  - before the last CPU in hctx->cpumask goes offline, mark this
    hctx as inactive

  - disable preemption while allocating a tag for a request, and after the
    tag is allocated, check whether this hctx is inactive. If it is, give up
    the allocation and try remote allocation from online CPUs (sketched
    after this list)

  - before hctx becomes inactive, drain all allocated requests on this
    hctx
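
As an aside, here is a rough C sketch of the allocation-side check and the
drain step described above. It is not taken from the actual patches:
blk_mq_get_tag_this_cpu(), blk_mq_put_tag_this_cpu() and
blk_mq_hctx_has_requests() are hypothetical stand-ins for the real tag and
accounting paths, while BLK_MQ_S_INACTIVE, hctx->state and the
preemption/bitop helpers are real kernel primitives.

#include <linux/blk-mq.h>
#include <linux/delay.h>
#include <linux/preempt.h>

/* Step 2 (sketch): notice at tag-allocation time that the hctx went inactive. */
static struct request *sketch_alloc_request(struct blk_mq_hw_ctx *hctx)
{
	struct request *rq;

	/* Stay on this CPU so the hctx we picked cannot change under us. */
	preempt_disable();

	rq = blk_mq_get_tag_this_cpu(hctx);		/* hypothetical helper */

	if (rq && unlikely(test_bit(BLK_MQ_S_INACTIVE, &hctx->state))) {
		/*
		 * The last CPU in hctx->cpumask is going offline: return the
		 * tag and let the caller retry on a hctx whose CPUs are
		 * still online.
		 */
		blk_mq_put_tag_this_cpu(hctx, rq);	/* hypothetical helper */
		rq = NULL;
	}

	preempt_enable();
	return rq;
}

/* Steps 1 and 3 (sketch): run before the last CPU in hctx->cpumask goes away. */
static int sketch_hctx_notify_offline(struct blk_mq_hw_ctx *hctx)
{
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);

	/* Wait until every request that already owns a tag has completed. */
	while (blk_mq_hctx_has_requests(hctx))		/* hypothetical helper */
		msleep(5);

	return 0;
}

The ordering is the interesting part: the hotplug side sets the flag before
it starts waiting for outstanding tags, and the allocation side takes the
tag before it checks the flag, so (with suitable memory barriers in the real
code) a request can never be allocated on an inactive hctx without being
seen by the drain loop.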

The guts of the changes are from Ming Lei; I just did a bunch of prep
cleanups so that they fit in more nicely.  The series also depends
on my "avoid a few q_usage_counter roundtrips v3" series.

Thanks to John Garry for running lots of tests on arm64 with previous
versions of these patches and for helping to investigate all kinds of issues.

Hi Christoph,

Thanks for preparing and posting this new patch series. Between the posting
of v3 and v4 I had a closer look at the IRQ core.
My conclusions (which may be incorrect) are as follows:
* The only function that sets the 'is_managed' member of struct
   irq_affinity_desc to 1 is irq_create_affinity_masks().
* There are two ways to cause that function to be called: setting the
   PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity() or
   passing the 'affd' argument. pci_alloc_irq_vectors() calls
   pci_alloc_irq_vectors_affinity().
* The following drivers pass an affinity descriptor ('affd') argument when
   allocating interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas,
   megaraid, mpt3sas, qla2xxx, virtio_scsi.
* The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
   affinity descriptor: aacraid, hpsa, lpfc, smartpqi, virtio_pci_common
   (see the example after this list).
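
To make those two call paths concrete, here is a hedged example, not lifted
from any of the drivers named above; the function name, device pointer and
vector counts are invented. A driver either passes an explicit struct
irq_affinity to pci_alloc_irq_vectors_affinity(), or just sets
PCI_IRQ_AFFINITY, in which case pci_alloc_irq_vectors() forwards to
pci_alloc_irq_vectors_affinity() with a default descriptor. Both paths end
up in irq_create_affinity_masks(), which is what marks the vectors as
managed.

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical driver fragment; 'pdev' and the vector counts are made up. */
static int example_setup_irqs(struct pci_dev *pdev)
{
	/*
	 * Path 1: explicit affinity descriptor (the nvme/hisi_sas style).
	 * One pre-vector (e.g. an admin queue) is excluded from spreading;
	 * the remaining vectors become managed interrupts.
	 */
	struct irq_affinity affd = {
		.pre_vectors	= 1,
	};
	int nr_vecs;

	nr_vecs = pci_alloc_irq_vectors_affinity(pdev, 2, 17,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
	if (nr_vecs > 0)
		return nr_vecs;

	/*
	 * Path 2: PCI_IRQ_AFFINITY without a descriptor (the aacraid/hpsa
	 * style).  pci_alloc_irq_vectors() calls
	 * pci_alloc_irq_vectors_affinity() with a default descriptor, so
	 * these vectors are managed as well.
	 */
	return pci_alloc_irq_vectors(pdev, 1, 16,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
}

The practical difference, and the reason for this thread, is that managed
vectors keep the affinity the kernel computed for them: when the last CPU in
that mask goes offline they are shut down rather than migrated, which is why
the block layer has to drain the corresponding hctx first.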

What is not clear to me is why managed interrupts are shut down when the
last CPU in their affinity mask is shut down. Has it been considered to
modify the IRQ core such that managed PCIe interrupts are instead assigned
to another CPU in that case?

I think Thomas answered that here already:
https://lore.kernel.org/lkml/alpine.DEB.2.21.1901291717370.1513@xxxxxxxxxxxxxxxxxxxxxxx/

(vector space exhaustion)

Would
that make it unnecessary to drain hardware queues during CPU
hotplugging? Or is there perhaps something in the PCI or PCIe
specifications or in one of the architectures supported by Linux that
prevents doing this?

Is this the commit that introduced shutdown of managed interrupts:
c5cb83bb337c ("genirq/cpuhotplug: Handle managed IRQs on CPU hotplug")?

Some of my knowledge about non-managed and managed interrupts comes from
https://lore.kernel.org/lkml/alpine.DEB.2.20.1710162106400.2037@nanos/

Thanks,

Bart.