On Mon, 2020-11-02 at 17:11 -0800, Dexuan Cui wrote:
> When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs,
> the CPUs can't be the destination of IOAPIC interrupts, because the
> IOAPIC RTE's Dest Field has only 8 bits. Currently the hackery driver
> drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are
> only routed to CPUs that don't have >255 APIC IDs. However, there is
> an issue with kdump, because the kdump kernel can run on any CPU, and
> hence IOAPIC interrupts can't work if the kdump kernel run on a CPU
> with a >255 APIC ID.
>
> The kdump issue can be fixed by the Extended Dest ID, which is introduced
> recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in
> struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the
> support of the underlying hypervisor. The latest Hyper-V has added the
> support recently: with this commit, on such a Hyper-V host, Linux VM
> does not use hyperv-iommu.c because hyperv_prepare_irq_remapping()
> returns -ENODEV; instead, Linux kernel's generic support of Extended Dest
> ID from David is used, meaning that Linux VM is able to support up to
> 32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.
>
> On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
> changes with this commit: Linux VM is still able to bring up the CPUs with
> >255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
> can not go to such CPUs, and the kdump kernel still can not work properly
> on such CPUs.
>
> Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>

Acked-by: David Woodhouse <dwmw@xxxxxxxxxxxx>

> +/*
> + * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
> + * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
> + * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
> + *
> + * Note: For a VM on Hyper-V, no emulated legacy device supports PCI MSI/MSI-X,
> + * and PCI MSI/MSI-X only come from the assigned physical PCIe device, and the
> + * PCI MSI/MSI-X interrupts are handled by the pci-hyperv driver. Here despite
> + * the word "msi" in the name "msi_ext_dest_id", actually the callback only
> + * affects how IOAPIC interrupts are routed.
> + */

I named it like that on purpose to make the point that the I/OAPIC is
just a device for turning line interrupts into MSIs. Some VMMs, just
like real hardware, really do implement their I/OAPIC emulation that
way. It makes a lot of sense to do so if you support interrupt
remapping.

FWIW I might have phrased your last paragraph in that comment as

   Note: for a VM on Hyper-V, the I/OAPIC is the only device which
   (logically) generates MSIs directly to the system APIC irq domain.
   There is no HPET, and PCI MSI/MSI-X interrupts are remapped by the
   pci-hyperv host bridge.

But don't bother to change it; I think I've made my point quite well
enough with https://git.kernel.org/tip/tip/c/5d5a97133 :)

--
dwmw2
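
For readers following along, the arithmetic behind the "15-bit APIC ID"
can be sketched as below. This is only an illustration, assuming the
split described in the quoted text (an 8-bit Dest Field plus a 7-bit
extension holding bits 8..14). The struct and function names are
hypothetical and are not the kernel's; only the field names echo
virt_destid_8_14 from struct IO_APIC_route_entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest ID that fits in 8 + 7 = 15 bits, i.e. 32768 distinct APIC IDs. */
#define EXT_DEST_ID_MAX 0x7fffu

/* Hypothetical container for the two destination fields (not a kernel type). */
struct split_dest_id {
	uint8_t destid_0_7;       /* bits 0..7: the classic 8-bit Dest Field */
	uint8_t virt_destid_8_14; /* bits 8..14: the Extended Dest ID */
};

/* Split an APIC ID into the two fields; fail if it needs more than 15 bits. */
static bool split_apic_id(uint32_t apic_id, struct split_dest_id *out)
{
	if (apic_id > EXT_DEST_ID_MAX)
		return false;

	out->destid_0_7 = apic_id & 0xff;
	out->virt_destid_8_14 = (apic_id >> 8) & 0x7f;
	return true;
}

/* Recombine the two fields into the full APIC ID. */
static uint32_t join_apic_id(const struct split_dest_id *in)
{
	return ((uint32_t)in->virt_destid_8_14 << 8) | in->destid_0_7;
}
```

With this split, an APIC ID of 300 (0x12c) would put 0x2c in the 8-bit
field and 1 in the extension. Without the extension that ID cannot be
expressed at all, which is the >255 limitation the patch description
talks about.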