> From: Jiang, Dave <dave.jiang@xxxxxxxxx>
> Sent: Tuesday, June 22, 2021 11:51 PM
>
> On 6/22/2021 3:16 AM, Tian, Kevin wrote:
> > Hi, Alex,
> >
> > Need your help to understand the current MSI-X virtualization flow in
> > VFIO. Some background info first.
> >
> > Recently we have been discussing how to virtualize MSI-X with Interrupt
> > Message Storage (IMS) on mdev:
> > https://lore.kernel.org/kvm/87im2lyiv6.ffs@xxxxxxxxxxxxxxxxxxxxxxx/
> >
> > IMS is a device-specific interrupt storage, allowing an optimized and
> > scalable manner for generating interrupts. idxd mdev exposes a virtual
> > MSI-X capability to the guest but uses IMS entries physically for
> > generating interrupts.
> >
> > Thomas has helped implement a generic ims irqchip driver:
> > https://lore.kernel.org/linux-hyperv/20200826112335.202234502@xxxxxxxxxxxxx/
> >
> > idxd device allows software to specify an IMS entry (for triggering
> > completion interrupt) when submitting a descriptor. To prevent one
> > mdev from triggering a malicious interrupt into another mdev (by
> > specifying an arbitrary entry), the idxd ims entry includes a PASID
> > field for validation - only a matching PASID in the executed descriptor
> > can trigger an interrupt via this entry. idxd driver is expected to
> > program ims entries with PASIDs that are allocated to the mdev which
> > owns those entries.
> >
> > Other devices may have a different ID and format to isolate ims entries.
> > But we need to abstract a generic means for programming a vendor-specific
> > ID into a vendor-specific ims entry, without violating the layering model.
> >
> > Thomas suggested that the vendor driver first register ID information
> > (possibly plus the location where to write the ID to) in msi_desc when
> > allocating irqs (extend the existing alloc function or via a new helper
> > function) and then have the generic ims irqchip driver update the ID in
> > the ims entry when it's started up by request_irq().
> >
> > Then there are two questions to be answered:
> >
> > 1) How does the vendor driver decide the ID to be registered to msi_desc?
> > 2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?
> >
> > For the 1st open, there are two types of PASIDs on idxd mdev:
> >
> > 1) default PASID: one per mdev and allocated when mdev is created;
> > 2) sva PASIDs: multiple per mdev and allocated on-demand (via vIOMMU);
> >
> > If vIOMMU is not exposed, all ims entries of this mdev should be
> > programmed with the default PASID, which is always available in the
> > mdev's lifespan.
> >
> > If vIOMMU is exposed and guest sva is enabled, entries used for sva
> > should be tagged with sva PASIDs, leaving others tagged with the default
> > PASID. To help achieve intra-guest interrupt isolation, the guest idxd
> > driver needs to program guest sva PASIDs into the virtual MSIX_PERM
> > register (one per MSI-X entry) for validation. Access to MSIX_PERM is
> > trap-and-emulated by the host idxd driver, which then figures out which
> > PASID to register to msi_desc (requires PASID translation info via the
> > new /dev/iommu proposal).
> >
> > The guest driver is expected to update MSIX_PERM before request_irq().
> >
> > Now the 2nd open requires your help. Below is what I learned from
> > current vfio/qemu code (for vfio-pci device):
> >
> > 0) Qemu doesn't attempt to allocate all irqs as reported by msix->
> >    table_size. It is done in a dynamic and incremental way.
> >
> > 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
> >    allocating/enabling irqs given a set of vMSIX vectors [start, count]:
> >
> >    a) if irqs not allocated, allocate irqs [start+count]. Enable irqs for
> >       specified vectors [start, count] via request_irq();
> >    b) if irqs already allocated, enable irqs for specified vectors;
> >    c) if irq already enabled, disable and re-enable irqs for specified
> >       vectors because user may specify a different eventfd;
> >
> > 2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
> >    DEVICE_SET_IRQS to enable vector#0, even though it's currently
> >    masked by the guest. Interrupts are received by Qemu but blocked
> >    from guest via mask/pending bit emulation. The main intention is
> >    to enable physical MSI-X;
> >
> > 3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
> >    DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different
> >    from the one provided in 2);
> >
> > 4) When guest unmasks vector#1, Qemu finds it's outside of allocated
> >    vectors (only vector#0 now):
> >
> >    a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free
> >       irq for vector#0;
> >
> >    b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> >       irqs for both vector#0 and vector#1;
> >
> > 5) When guest unmasks vector#2, same flow in 4) continues.
> >
> > ....
> >
> > If the above understanding is correct, how is a lost interrupt avoided
> > between 4.a) and 4.b), given that the irq has been torn down for vector#0
> > in the middle while from the guest p.o.v this vector is actually
> > unmasked? There must be a mechanism in place, but I just didn't figure
> > it out...
> >
> > Assuming the above flow is robust, mapping Thomas's model to this flow
> > is straightforward. Assume idxd mdev has two vectors: vector#0 for
> > misc/error interrupt and vector#1 as completion interrupt for guest
> > sva. VFIO_DEVICE_SET_IRQS is handled by idxd mdev driver:
> >
> > 2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
> >    DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
> >    used for sva, MSIX_PERM#0 has PASID disabled. Host idxd driver
> >    knows to register default PASID to msi_desc#0 when allocating irqs.
> >    Then the .startup() callback of ims irqchip is called to program the
> >    default PASID saved in msi_desc#0 to the target ims entry at
> >    request_irq() time.
> >
> > 3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
> >    DEVICE_SET_IRQS to enable vector#0 again. Following the same logic
> >    as vfio-pci, idxd driver first disables irq#0 via free_irq() and then
> >    re-enables irq#0 via request_irq(). It's still the default PASID
> >    being used according to msi_desc#0.
>
> Hi Kevin, slight correction here. Because vector#0 is emulated for idxd
> vdev, it has no IMS backing. So there is no msi_desc#0 for that vector.
> msi_desc#0 actually starts at vector#1 where IMS is allocated to back
> it. vector#0 does not go through request_irq(). It only has the eventfd
> part. Everything you say is correct but starts at vector#1.
>

You are right. But for illustration simplicity, let's still assume both
vector #0 and #1 are backed by ims in the following discussion, since a
purely emulated vector is outside of this context anyway. 😊

Thanks
Kevin
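
For readers who want to trace steps 2) to 4) of the flow above, here is a toy Python model of it. It is only a sketch of the behaviour as described in this mail, not kernel or Qemu code. The class names (ToyMdev, ToyQemu), the PASID values and the eventfd names are hypothetical, and it follows the simplification that both vector#0 and vector#1 are backed by ims. It records the window between 4.a) and 4.b) in which a guest-unmasked vector has no irq behind it, and the PASID that would be handed to the ims entry for each vector. It does not model whatever mechanism avoids the lost interrupt.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyMdev:
    """Toy stand-in for the SET_IRQS handling in step 1); hypothetical."""
    default_pasid: int
    # vector -> host PASID derived from the emulated MSIX_PERM; None = PASID disabled
    msix_perm: Dict[int, Optional[int]] = field(default_factory=dict)
    allocated: int = 0                                   # number of allocated irqs
    live: Dict[int, dict] = field(default_factory=dict)  # vectors with a requested irq

    def set_irqs(self, start: int, eventfds: List[str]) -> None:
        if self.allocated == 0:                 # 1.a) nothing allocated yet
            self.allocated = start + len(eventfds)
        if start + len(eventfds) > self.allocated:
            raise ValueError("vector outside of allocated range")
        for i, efd in enumerate(eventfds):
            vec = start + i
            self.live.pop(vec, None)            # 1.c) free_irq() if already enabled
            pasid = self.msix_perm.get(vec)
            if pasid is None:                   # not used for sva: default PASID
                pasid = self.default_pasid
            # request_irq(): .startup() would write this PASID to the ims entry
            self.live[vec] = {"eventfd": efd, "pasid": pasid}

    def disable_all(self) -> None:
        self.live.clear()                       # free_irq() for every vector
        self.allocated = 0                      # and free the irqs themselves


@dataclass
class ToyQemu:
    dev: ToyMdev
    unmasked: Dict[int, str] = field(default_factory=dict)  # guest-unmasked vectors
    windows: List[List[int]] = field(default_factory=list)  # unmasked but no irq

    def enable_msix(self) -> None:
        # step 2): enable vector#0 although the guest still masks it
        self.dev.set_irqs(0, ["qemu-efd0"])

    def guest_unmask(self, vec: int) -> None:
        previously = sorted(self.unmasked)
        self.unmasked[vec] = f"kvm-efd{vec}"
        if vec < self.dev.allocated:
            # step 3): same vector, different eventfd
            self.dev.set_irqs(vec, [self.unmasked[vec]])
            return
        self.dev.disable_all()                  # step 4.a)
        self.windows.append([v for v in previously if v not in self.dev.live])
        fds = [self.unmasked.get(v, f"qemu-efd{v}") for v in range(vec + 1)]
        self.dev.set_irqs(0, fds)               # step 4.b)


dev = ToyMdev(default_pasid=10, msix_perm={1: 42})  # vector#1 used for guest sva
qemu = ToyQemu(dev)
qemu.enable_msix()      # step 2)
qemu.guest_unmask(0)    # step 3)
qemu.guest_unmask(1)    # step 4)
print("windows without irq:", qemu.windows)
print("live vectors:", dev.live)
```

In this model the only recorded window contains vector#0, which is the case the question about 4.a) and 4.b) is asking about.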