On 6/22/2021 3:16 AM, Tian, Kevin wrote:
Hi, Alex,

Need your help to understand the current MSI-X virtualization flow in VFIO. Some background info first.

Recently we have been discussing how to virtualize MSI-X with Interrupt Message Storage (IMS) on mdev:

https://lore.kernel.org/kvm/87im2lyiv6.ffs@xxxxxxxxxxxxxxxxxxxxxxx/

IMS is a device-specific interrupt storage, allowing an optimized and scalable manner for generating interrupts. idxd mdev exposes a virtual MSI-X capability to the guest but uses IMS entries physically for generating interrupts.

Thomas has helped implement a generic ims irqchip driver:

https://lore.kernel.org/linux-hyperv/20200826112335.202234502@xxxxxxxxxxxxx/

The idxd device allows software to specify an IMS entry (for triggering the completion interrupt) when submitting a descriptor. To prevent one mdev from triggering a malicious interrupt into another mdev (by specifying an arbitrary entry), the idxd ims entry includes a PASID field for validation - only a matching PASID in the executed descriptor can trigger an interrupt via this entry. The idxd driver is expected to program ims entries with PASIDs that are allocated to the mdev which owns those entries.

Other devices may have a different ID and format to isolate ims entries. But we need to abstract a generic means for programming a vendor-specific ID into a vendor-specific ims entry, without violating the layering model.

Thomas suggested that the vendor driver first register the ID information (possibly plus the location where to write the ID to) in msi_desc when allocating irqs (extend the existing alloc function or via a new helper function) and then have the generic ims irqchip driver update the ID in the ims entry when it's started up by request_irq().

Then there are two questions to be answered:

1) How does the vendor driver decide the ID to be registered to msi_desc?
2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?

For the 1st open, there are two types of PASIDs on idxd mdev:

1) default PASID: one per mdev, allocated when the mdev is created;
2) sva PASIDs: multiple per mdev, allocated on demand (via vIOMMU);

If vIOMMU is not exposed, all ims entries of this mdev should be programmed with the default PASID, which is always available in the mdev's lifespan.

If vIOMMU is exposed and guest sva is enabled, entries used for sva should be tagged with sva PASIDs, leaving others tagged with the default PASID. To help achieve intra-guest interrupt isolation, the guest idxd driver needs to program guest sva PASIDs into the virtual MSIX_PERM register (one per MSI-X entry) for validation. Access to MSIX_PERM is trap-and-emulated by the host idxd driver, which then figures out which PASID to register to msi_desc (this requires PASID translation info via the new /dev/iommu proposal).

The guest driver is expected to update MSIX_PERM before request_irq().

Now the 2nd open requires your help. Below is what I learned from the current vfio/qemu code (for a vfio-pci device):

0) Qemu doesn't attempt to allocate all irqs as reported by msix->table_size. It is done in a dynamic and incremental way.

1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for allocating/enabling irqs given a set of vMSIX vectors [start, count]:
   a) if irqs are not allocated, allocate irqs [start+count]. Enable irqs for the specified vectors [start, count] via request_irq();
   b) if irqs are already allocated, enable irqs for the specified vectors;
   c) if an irq is already enabled, disable and re-enable irqs for the specified vectors because the user may specify a different eventfd;

2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0, even though it's currently masked by the guest. Interrupts are received by Qemu but blocked from the guest via mask/pending bit emulation.
   The main intention is to enable physical MSI-X;

3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different from the one provided in 2);

4) When the guest unmasks vector#1, Qemu finds it's outside of the allocated vectors (only vector#0 now):
   a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the irq for vector#0;
   b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable irqs for both vector#0 and vector#1;

5) When the guest unmasks vector#2, the same flow as in 4) continues.

....

If the above understanding is correct, how is a lost interrupt avoided between 4.a) and 4.b), given that the irq has been torn down for vector#0 in the middle while from the guest p.o.v. this vector is actually unmasked? There must be a mechanism in place, but I just didn't figure it out...

Given that the above flow is robust, mapping Thomas's model to this flow is straightforward. Assume idxd mdev has two vectors: vector#0 for the misc/error interrupt and vector#1 as the completion interrupt for guest sva. VFIO_DEVICE_SET_IRQS is handled by the idxd mdev driver:

2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not used for sva, MSIX_PERM#0 has PASID disabled. The host idxd driver knows to register the default PASID to msi_desc#0 when allocating irqs. Then the .startup() callback of the ims irqchip is called at request_irq() time to program the default PASID saved in msi_desc#0 into the target ims entry.

3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0 again. Following the same logic as vfio-pci, the idxd driver first disables irq#0 via free_irq() and then re-enables irq#0 via request_irq(). It's still the default PASID being used according to msi_desc#0.

Hi Kevin, slight correction here. Because vector#0 is emulated for the idxd vdev, it has no IMS backing, so there is no msi_desc#0 for that vector. msi_desc#0 actually starts at vector#1, where IMS is allocated to back it. vector#0 does not go through request_irq(); it only has the eventfd part. Everything you say is correct, but it starts at vector#1.
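
A minimal sketch of that offset, assuming a plain one-to-one layout behind the emulated vector. The names below are hypothetical and are not idxd driver symbols:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: guest vector#0 is emulated (eventfd only),
 * guest vector#N (N >= 1) is backed by the IMS entry described by
 * msi_desc#(N - 1). */
static bool vector_has_ims_backing(unsigned int vector)
{
    return vector != 0;
}

/* Map a guest MSI-X vector to the index of its msi_desc.
 * Only valid for vectors that have IMS backing. */
static unsigned int vector_to_msi_desc_index(unsigned int vector)
{
    assert(vector_has_ims_backing(vector));
    return vector - 1;
}
```

With this layout, only guest vectors for which vector_has_ims_backing() is true would reach request_irq() and the ims irqchip .startup() path.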

4) When the guest unmasks vector#1, Qemu finds it's outside of the allocated vectors (only vector#0 now):

   a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the irq for vector#0. msi_desc#0 is also freed.

   b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable irqs for both vector#0 and vector#1. At this point, MSIX_PERM#0 has PASID disabled while MSIX_PERM#1 has a valid guest PASID1 for sva. The idxd driver registers the default PASID to msi_desc#0 and host PASID2 (translated from guest PASID1) to msi_desc#1 when allocating irqs. Later, when both irqs are enabled via request_irq(), the ims irqchip driver updates the target ims entries according to msi_desc#0 and msi_desc#1 respectively.

But this is specific to how Qemu virtualizes MSI-X today. What if it changes (or another device model is used) to allocate all table_size irqs when the guest enables the MSI-X capability? At that point we don't have valid MSIX_PERM content to register PASID info to msi_desc. Possibly what we really require is a separate helper function allowing the driver to update msi_desc after irq allocation, e.g. when the guest unmasks a vector...

And do you see any other facets which are overlooked here?

Thanks
Kevin
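
To make the last suggestion concrete, here is a rough user-space model of such a helper. It is only a sketch under stated assumptions: struct model_msi_desc, struct ims_entry and every function below are hypothetical stand-ins, not the real msi_desc or any existing kernel API. It shows the ID being recorded at allocation time, written to the entry at startup, and updated afterwards once MSIX_PERM becomes valid:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this only models the idea. */
#define IMS_PASID_NONE 0xffffffffu

struct ims_entry {              /* models one device IMS slot */
    uint32_t pasid;             /* ID the device validates against */
    bool started;               /* set once the irq is started up */
};

struct model_msi_desc {         /* models only the ID part of msi_desc */
    uint32_t id;                /* vendor-specific ID, a PASID for idxd */
    struct ims_entry *slot;     /* where the ID is eventually written */
};

/* Allocation time: MSIX_PERM may not be valid yet, so the driver
 * registers the default PASID. Nothing is written to the device. */
static void model_alloc(struct model_msi_desc *desc, struct ims_entry *slot,
                        uint32_t default_pasid)
{
    assert(desc != NULL && slot != NULL);
    desc->id = default_pasid;
    desc->slot = slot;
    slot->pasid = IMS_PASID_NONE;
    slot->started = false;
}

/* Models the irqchip .startup() step reached from request_irq():
 * the registered ID is programmed into the IMS entry. */
static void model_startup(struct model_msi_desc *desc)
{
    assert(desc != NULL && desc->slot != NULL);
    desc->slot->pasid = desc->id;
    desc->slot->started = true;
}

/* Hypothetical helper: update the ID after allocation, e.g. when the
 * guest unmasks a vector and MSIX_PERM now holds a valid PASID. If the
 * irq is already started, the IMS entry is rewritten right away. */
static void model_update_id(struct model_msi_desc *desc, uint32_t new_id)
{
    assert(desc != NULL && desc->slot != NULL);
    desc->id = new_id;
    if (desc->slot->started)
        desc->slot->pasid = new_id;
}
```

In this model a device model that allocates all table_size irqs up front would call model_alloc() with the default PASID for every entry, and the driver would call model_update_id() from the MSIX_PERM or unmask emulation path. Whether the update should also be synchronized against in-flight descriptors is left open here.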