On Fri, Nov 11, 2022 at 10:56:28AM +0100, Lorenzo Pieralisi wrote:
> On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote:
> > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> >
> > It turns out that the third patch (b4b77778ecc5) causes a performance
> > regression because all the interrupts now happen on 1 physical CPU (or two
> > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> > devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@xxxxxxxxxxxxx/
> >
> > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> > hv_irq_unmask() -> hv_arch_irq_unmask() ->
> > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> > virtual CPU rather than the physical CPU; with b4b77778ecc5, the pCPU is
> > determined only once, in hv_compose_msi_msg(), where only vCPU0 is specified;
> > consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> >
> > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> > is determined the second time, the vCPU in the effective affinity mask is
> > used (i.e., it isn't always vCPU0), so the hypervisor chooses a different
> > pCPU for each interrupt.
> >
> > The hypercall will be fixed in the future to update the pCPU as well, but
> > that will take quite a while, so let's restore the old behavior in
> > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> > manner for each PCI device, so the interrupts of different devices can
> > happen on different pCPUs, though the interrupts of each device happen on
> > some single pCPU.
> >
> > The hypercall fix may not be backported to all old versions of Hyper-V, so
> > we want to keep this guest-side change forever (or at least until we're sure
> > the old affected versions of Hyper-V are no longer supported).
> >
> > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> > Co-developed-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
> > Signed-off-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
> > Co-developed-by: Carl Vanderlip <quic_carlv@xxxxxxxxxxx>
> > Signed-off-by: Carl Vanderlip <quic_carlv@xxxxxxxxxxx>
> > Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> >
> > ---
> >
> > v1 is here:
> > https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@xxxxxxxxxxxxx/
> >
> > Changes in v2:
> >   Round-robin the vCPU for multi-MSI.
> >   The commit message is reworked.
> >   Added Jeff's and Carl's Co-developed-by and Signed-off-by.
> >
> > Changes in v3:
> >   Michael Kelley kindly helped to write a great comment, and I added the
> >   comment before hv_compose_msi_req_get_cpu(). Thank you, Michael!
> >
> >   Rebased to the Hyper-V tree's "hyperv-fixes" branch:
> >   https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
> >
> > Bjorn, Lorenzo, it would be great to have your Ack.
> > The patch needs to go through the Hyper-V tree because it's rebased to
> > another hv_pci patch (which only exists in the Hyper-V tree for now):
> > e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()")
> >
> > BTW, Michael has some other hv_pci patches, which would also need to go
> > through the Hyper-V tree:
> > https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/
> >
> >
> >  drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++-----
> >  1 file changed, 75 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index ba64284eaf9f..fa5a1ba35a82 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
> >  }
> >
> >  static u32 hv_compose_msi_req_v1(
> > -	struct pci_create_interrupt *int_pkt, const struct cpumask *affinity,
> > +	struct pci_create_interrupt *int_pkt,
> >  	u32 slot, u8 vector, u16 vector_count)
> >  {
> >  	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> > @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1(
> >  	return sizeof(*int_pkt);
> >  }
> >
> > +/*
> > + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and
> > + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be
> > + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V
> > + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> > + * not irrelevant because Hyper-V chooses the physical CPU to handle the
> > + * interrupts based on the vCPU specified in the message sent to the vPCI VSP in
> > + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> > + * but assigning too many vPCI device interrupts to the same pCPU can cause a
> > + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> > + * to spread out the pCPUs that it selects.
> > + *
> > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> > + * to always return the same dummy vCPU, because a second call to
> > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
>
> Why? Can't you fix _that_? Why can't the initial call to
> hv_compose_msi_msg() determine the _real_ target vCPU?

Dexuan, any comment on this?

> > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> > + * the same pCPU, even though the vCPUs will be spread out by later calls
> > + * to hv_irq_unmask(); that is the best we can do now.
> > + *
> > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
>
> "current" Hyper-V means nothing; remove it or provide versioning
> information. Imagine yourself reading this comment some time
> in the future.

And this?

Wei.
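
The v3 diff quoted above is trimmed before the round-robin helper that the
new comment refers to. As a minimal sketch, assuming only standard kernel
primitives (DEFINE_SPINLOCK, cpumask_next_wrap, cpu_online_mask, nr_cpu_ids)
and not the literal hunk from the patch, a round-robin
hv_compose_multi_msi_req_get_cpu() can look like this:

	#include <linux/cpumask.h>
	#include <linux/spinlock.h>

	/*
	 * Pick the "dummy" vCPU for a multi-MSI device: walk the online
	 * CPUs in a round-robin manner, serialized by a spinlock so that
	 * concurrent callers get distinct CPUs.
	 */
	static int hv_compose_multi_msi_req_get_cpu(void)
	{
		static DEFINE_SPINLOCK(multi_msi_cpu_lock);

		/* -1 means the search starts from CPU 0 */
		static int cpu_next = -1;

		unsigned long flags;
		int cpu;

		spin_lock_irqsave(&multi_msi_cpu_lock, flags);

		/* Advance to the next online CPU, wrapping past the end */
		cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask,
					     nr_cpu_ids, false);
		cpu = cpu_next;

		spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);

		return cpu;
	}

With a helper like this, the first hv_compose_msi_msg() call for each
multi-MSI device supplies a different dummy vCPU, which is what influences
Hyper-V to spread its pCPU choices across devices, even though all the
interrupts of any one device still land on a single pCPU.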