> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Sent: Wednesday, June 23, 2021 7:59 AM
> [...]
> I only can assume that back then the main problem was vector exhaustion
> on the host and to avoid allocating memory for interrupt descriptors
> etc, right?
>
> The host vector exhaustion problem was that each MSI-X vector consumed a
> real CPU vector, which is a limited resource on x86. This is no longer
> the case today:
>
>  1) pci_msix_enable[range]() consumes exactly zero CPU vectors from
>     the allocatable range, independent of the number of MSI-X vectors
>     it allocates, unless it is in multi-queue managed mode, where it
>     will actually reserve a vector (maybe two) per CPU.
>
>     But devices which are not using that mode just opportunistically
>     "reserve" vectors.
>
>     All entries are initialized with a special system vector which,
>     when raised, will emit a nastigram in dmesg.
>
>  2) request_irq() actually allocates a CPU vector from the
>     allocatable vector space, which can obviously still fail, which is
>     perfectly fine.
>
> So the only downside today of allocating more MSI-X vectors than
> necessary is memory consumption for the irq descriptors.

Curious about the IRTE entry when IRQ remapping is enabled: is it also
allocated at request_irq()?

> Though for virtualization there is still another problem:
>
> Even if all possible MSI-X vectors for a passthrough PCI device would
> be allocated upfront, independent of the actual usage in the guest,
> then there is still the issue of request_irq() failing on the host
> once the guest decides to use any of those interrupts.
>
> It's a halfways reasonable argumentation, by some definition of
> reasonable, that this case would be a host system configuration problem
> and the admin who overcommitted is responsible for the consequence.
>
> Where the only reasonable consequence is to kill the guest right there,
> because there is no mechanism to actually tell it that the host ran out
> of resources.
> Not at all a pretty solution, but it is, contrary to the status quo,
> well defined. The most important aspect is that it is well defined for
> the case of success:
>
> If it succeeds, then there is no way that already requested interrupts
> can be lost or end up being redirected to the legacy PCI irq due to
> clearing the MSI-X enable bit, which is a gazillion times better than
> the "let's hope it works" based tinkerware we have now.

Fair enough.

> So, aside of the existing VFIO/PCI/MSIX thing being just half thought
> out, even thinking about proliferating this with IMS is bonkers.
>
> IMS is meant to avoid the problem of MSI-X, which needs to disable
> MSI-X in order to expand the number of vectors. The use cases are going
> to be even more dynamic than the usual device drivers, so the lost
> interrupt issue will be much more likely to trigger.
>
> So no, we are not going to proliferate this complete ignorance of how
> MSI-X actually works and just cram another "feature" into code which is
> known to be incorrect.

So the correct flow is like below:

    guest::enable_msix()
        trapped_by_host()
            pci_alloc_irq_vectors(); // for all possible vMSI-X entries
            pci_enable_msix();

    guest::unmask()
        trapped_by_host()
            request_irqs();

The first trap calls a new VFIO ioctl, e.g. VFIO_DEVICE_ALLOC_IRQS. The
second trap can reuse the existing VFIO_DEVICE_SET_IRQS, which just does
request_irq() if the specified irqs have been allocated.
Then map IMS to this flow:

    guest::enable_msix()
        trapped_by_host()
            msi_domain_alloc_irqs(); // for all possible vMSI-X entries
            for_all_allocated_irqs(i)
                pci_update_msi_desc_id(i, default_pasid); // a new helper func

    guest::unmask(entry#0)
        trapped_by_host()
            request_irqs();
            ims_array_irq_startup(); // write msi_desc.id (default_pasid) to ims entry

    guest::set_msix_perm(entry#1, guest_sva_pasid)
        trapped_by_host()
            pci_update_msi_desc_id(1, host_sva_pasid);

    guest::unmask(entry#1)
        trapped_by_host()
            request_irqs();
            ims_array_irq_startup(); // write msi_desc.id (host_sva_pasid) to ims entry

Does the above match your thoughts?

Thanks
Kevin