On 2020-05-27 11:06, Christoph Hellwig wrote:
> this series ensures I/O is quiesced before a cpu and thus the managed
> interrupt handler is shut down.
>
> This patchset tries to address the issue by the following approach:
>
> - before the last cpu in hctx->cpumask is going to offline, mark this
>   hctx as inactive
>
> - disable preempt during allocating tag for request, and after tag is
>   allocated, check if this hctx is inactive. If yes, give up the
>   allocation and try remote allocation from online CPUs
>
> - before hctx becomes inactive, drain all allocated requests on this
>   hctx
>
> The guts of the changes are from Ming Lei, I just did a bunch of prep
> cleanups so that they can fit in more nicely. The series also depends
> on my "avoid a few q_usage_counter roundtrips v3" series.
>
> Thanks John Garry for running lots of tests on arm64 with this previous
> version patches and co-working on investigating all kinds of issues.

Hi Christoph,

Thanks for having prepared and posted this new patch series. After v3 was
posted and before v4 was posted I had a closer look at the IRQ core. My
conclusions (which may be incorrect) are as follows:
* The only function that sets the 'is_managed' member of struct
  irq_affinity_desc to 1 is irq_create_affinity_masks().
* There are two ways to cause that function to be called: setting the
  PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity(), or
  passing the 'affd' argument. pci_alloc_irq_vectors() calls
  pci_alloc_irq_vectors_affinity().
* The following drivers pass an affinity domain argument when allocating
  interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas, megaraid,
  mpt3sas, qla2xxx, virtio_scsi.
* The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
  affinity domain: aacraid, hpsa, lpfc, smartpqi, virtio_pci_common.

What is not clear to me is why managed interrupts are shut down when the
last CPU in their affinity mask goes offline. Has it been considered to
modify the IRQ core such that managed PCIe interrupts are assigned to
another CPU if the last CPU in their affinity mask is offlined? Would that
make it unnecessary to drain hardware queues during CPU hotplugging? Or is
there perhaps something in the PCI or PCIe specifications, or in one of
the architectures supported by Linux, that prevents doing this?

Is this the commit that introduced shutdown of managed interrupts:
c5cb83bb337c ("genirq/cpuhotplug: Handle managed IRQs on CPU hotplug")?

Some of my knowledge about non-managed and managed interrupts comes from
https://lore.kernel.org/lkml/alpine.DEB.2.20.1710162106400.2037@nanos/

Thanks,

Bart.
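
P.S. For completeness, here is a rough sketch of the two allocation paths
mentioned above, as I understand them. The helper name, the vector counts
and the pre_vectors value are made up for illustration only;
pci_alloc_irq_vectors(), pci_alloc_irq_vectors_affinity(),
pci_free_irq_vectors() and struct irq_affinity are the real interfaces.

#include <linux/pci.h>
#include <linux/interrupt.h>

/* Hypothetical helper; 'pdev' is assumed to be an already enabled PCI device. */
static int example_alloc_vectors(struct pci_dev *pdev)
{
	struct irq_affinity affd = { .pre_vectors = 1 };
	int ret;

	/*
	 * Path 1: set PCI_IRQ_AFFINITY without passing an explicit
	 * affinity descriptor. The PCI core then falls back to a default
	 * struct irq_affinity, so irq_create_affinity_masks() still runs
	 * and the allocated vectors end up with is_managed = 1.
	 */
	ret = pci_alloc_irq_vectors(pdev, 1, 16,
				    PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (ret < 0)
		return ret;
	pci_free_irq_vectors(pdev);

	/*
	 * Path 2: pass an explicit 'affd', e.g. to keep one pre_vector
	 * out of the managed spread (nvme does this for its admin queue
	 * interrupt).
	 */
	return pci_alloc_irq_vectors_affinity(pdev, 2, 17,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}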