On Fri, Nov 11, 2022 at 10:56:28AM +0100, Lorenzo Pieralisi wrote:
> On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> >
> > It turns out that the third patch (b4b77778ecc5) causes a performance
> > regression because all the interrupts now happen on 1 physical CPU (or two
> > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> > devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@xxxxxxxxxxxxx/
> >
> > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> > hv_irq_unmask() -> hv_arch_irq_unmask() ->
> > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> > virtual CPU rather than the physical CPU; with b4b77778ecc5, the pCPU is
> > determined only once, in hv_compose_msi_msg(), where only vCPU0 is specified;
> > consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> >
> > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> > is determined the second time, the vCPU in the effective affinity mask is
> > used (i.e., it isn't always vCPU0), so the hypervisor chooses a different
> > pCPU for each interrupt.
> >
> > The hypercall will be fixed in the future to update the pCPU as well, but
> > that will take quite a while, so let's restore the old behavior in
> > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> > manner for each PCI device, so the interrupts of different devices can
> > happen on different pCPUs, though the interrupts of each device happen on
> > some single pCPU.
> >
> > The hypercall fix may not be backported to all old versions of Hyper-V, so
> > we want to keep this guest-side change forever (or at least until we're sure
> > the old affected versions of Hyper-V are no longer supported).
> >
> > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > Co-developed-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
> > Signed-off-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
> > Co-developed-by: Carl Vanderlip <quic_carlv@xxxxxxxxxxx>
> > Signed-off-by: Carl Vanderlip <quic_carlv@xxxxxxxxxxx>
> > Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> >
> > ---
> >
> > v1 is here:
> > https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@xxxxxxxxxxxxx/
> >
> > Changes in v2:
> >   Round-robin the vCPU for multi-MSI.
> >   The commit message is reworked.
> >   Added Jeff's and Carl's Co-developed-by and Signed-off-by.
> >
> > Changes in v3:
> >   Michael Kelley kindly helped to write a great comment, and I added the
> >   comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
> >
> >   Rebased to the Hyper-V tree's "hyperv-fixes" branch:
> >   https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
> >
> > Bjorn, Lorenzo, it would be great to have your Ack.
> > The patch needs to go through the Hyper-V tree because it's rebased to
> > another hv_pci patch (which only exists in the Hyper-V tree for now):
> > e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()")
> >
> > BTW, Michael has some other hv_pci patches, which would also need to go
> > through the Hyper-V tree:
> > https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
> >
> >
> >  drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
> >  1 file changed, 75 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index ba64284eaf9f..fa5a1ba35a82 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
> >  }
> >
> >  static u32 hv_compose_msi_req_v1(
> > -	struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> > +	struct pci_create_interrupt *int_pkt,
> >  	u32 slot, u8 vector, u16 vector_count)
> >  {
> >  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> > @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
> >  	return sizeof(*int_pkt);
> >  }
> >
> > +/*
> > + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> > + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> > + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> > + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> > + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> > + * interrupts based on the vCPU specified in the message sent to the vPCI VSP in
> > + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> > + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> > + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> > + * to spread out the pCPUs that it selects.
> > + *
> > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
>
> Why? Can't you fix _that_? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU?

Dexuan, any comment on this?

> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(); that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
>
> "current" Hyper-V means nothing; remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.

And this?

Wei.
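
The v3 diff quoted above is trimmed before the round-robin helper that the
new comment refers to. As a minimal sketch, assuming only standard kernel
primitives (DEFINE_SPINLOCK, cpumask_next_wrap, cpu_online_mask, nr_cpu_ids)
and not the literal hunk from the patch, a round-robin
hv_compose_multi_msi_req_get_cpu() can look like this:

	#include <linux/cpumask.h>
	#include <linux/spinlock.h>

	/*
	 * Pick the "dummy" vCPU for a multi-MSI device: walk the online
	 * CPUs in a round-robin manner, serialized by a spinlock so that
	 * concurrent callers get distinct CPUs.
	 */
	static int hv_compose_multi_msi_req_get_cpu(void)
	{
		static DEFINE_SPINLOCK(multi_msi_cpu_lock);

		/* -1 means the search starts from CPU 0 */
		static int cpu_next = -1;

		unsigned long flags;
		int cpu;

		spin_lock_irqsave(&multi_msi_cpu_lock, flags);

		/* Advance to the next online CPU, wrapping past the end */
		cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask,
					     nr_cpu_ids, false);
		cpu = cpu_next;

		spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);

		return cpu;
	}

With a helper like this, the first hv_compose_msi_msg() call for each
multi-MSI device supplies a different dummy vCPU, which is what influences
Hyper-V to spread its pCPU choices across devices, even though all the
interrupts of any one device still land on a single pCPU.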