On Thu, Apr 22, 2021 at 08:34:32AM +0000, Tian, Kevin wrote: > The shim layer could be considered as a new iommu backend in VFIO, > which connects VFIO iommu ops to the internal helpers in > drivers/ioasid. It may be the best we can do because of SPAPR, but the ideal outcome should be to remove the entire pluggable IOMMU stuff from vfio entirely and have it only use /dev/ioasid We should never add another pluggable IOMMU type to vfio - everything should be done through drives/iommu now that it is much more capable. > Another tricky thing is that a container may be linked to multiple iommu > domains in VFIO, as devices in the container may locate behind different > IOMMUs with inconsistent capability (commit 1ef3e2bc). Frankly this sounds over complicated. I would think /dev/ioasid should select the IOMMU when the first device is joined, and all future joins must be compatible with the original IOMMU - ie there is only one set of IOMMU capabilities in a /dev/ioasid. This means qemue might have multiple /dev/ioasid's if the system has multiple incompatible IOMMUs (is this actually a thing?) The platform should design its IOMMU domains to minimize the number of /dev/ioasid's required. Is there a reason we need to share IOASID'd between completely divergance IOMMU implementations? I don't expect the HW should be able to physically share page tables?? That decision point alone might be the thing that just says we can't ever have /dev/vfio/vfio == /dev/ioasid > Just to confirm. Above flow is for current map/unmap flavor as what > VFIO/vDPA do today. Later when nested translation is supported, > there is no need to detach gpa_ioasid_fd. Instead, a new cmd will > be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd: Sure.. The tricky bit will be to define both of the common nested operating modes. nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id); ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..) // IOMMU will match on the device RID, no PASID: ioctl(vfio_device, ATTACH_IOASID, nested_ioasid); // IOMMU will match on the device RID and PASID: ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); Notice that ATTACH (or bind, whatever) is always done on the vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF&PASID to a specific page table defined by an IOASID. I expect we have many flavours of IOASID tables, eg we have normal, and 'nested with table controlled by hypervisor'. ARM has 'nested with table controlled by guest' right? So like this? nested_ioasid = ioctl(ioasid_fd, CREATE_DELGATED_IOASID, gpa_ioasid_id, <some kind of viommu_id>) // PASID now goes to <viommu_id> ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); Where <viommu_id> is some internal to the guest handle of the viommu page table scoped within gpa_ioasid_id? Like maybe it is GPA of the base of the page table? The guest can't select its own PASIDs without telling the hypervisor, right? > I also feel hiding group from uAPI is a good thing and is interested in > the rationale behind for explicitly managing group in vfio (which is > essentially the same boundary as provided by iommu group), e.g. for > better user experience when group security is broken? Indeed, I can see how things might have just evolved into this, but if it has a purpose it seems pretty hidden. we need it or not seems pretty hidden. Jason