> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, June 25, 2021 10:36 PM
>
> On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
>
> > - When receiving the binding call for the 1st device in a group, iommu_fd
> > calls iommu_group_set_block_dma(group, dev->driver) which does
> > several things:
>
> The whole problem here is trying to match this new world where we want
> devices to be in charge of their own IOMMU configuration and the old
> world where groups are in charge.
>
> Inserting the group fd and then calling a device-centric
> VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> necessary. We can always get the group back from the device at any
> point in the sequence to do a group wide operation.
>
> What I saw as the appeal of the sort of idea was to just completely
> leave all the difficult multi-device-group scenarios behind on the old
> group centric API and then we don't have to deal with them at all, or
> at least not right away.
>
> I'd see some progression where iommu_fd only works with 1:1 groups at
> the start. Other scenarios continue with the old API.
>
> Then maybe groups where all devices use the same IOASID.
>
> Then 1:N groups if the source device is reliably identifiable, this
> requires iommu subsystem work to attach domains to sub-group objects -
> not sure it is worthwhile.
>
> But at least we can talk about each step with well thought out patches
>
> The only thing that needs to be done to get the 1:1 step is to broadly
> define how the other two cases will work so we don't get into trouble
> and set some way to exclude the problematic cases from even getting to
> iommu_fd in the first place.
>
> For instance if we go ahead and create /dev/vfio/device nodes we could
> do this only if the group was 1:1, otherwise the group cdev has to be
> used, along with its API.

Thinking more along your direction, here is an updated sketch:

[Stage-1]

Multi-device groups (1:N) are handled by the existing vfio group fd and
the vfio_iommu_type1 driver.

Singleton groups (1:1) are handled via a new device-centric protocol:

1) /dev/vfio/device nodes are created for devices in a singleton group
   or devices w/o a group (mdev).

2) User gets iommu_fd by open("/dev/iommu"). A default block_dma domain
   is created per iommu_fd (or globally) with an empty I/O page table.

3) iommu_fd reports that only 1:1 groups are supported.

4) User gets device_fd by open("/dev/vfio/device"). At this point
   mmap() should be blocked since a security context hasn't been
   established for this fd. This could be done by returning an error
   (EACCES or EAGAIN?), or by succeeding w/o actually setting up the
   mapping.

5) User requests to bind device_fd to iommu_fd, which verifies that the
   group is not 1:N (for mdev the check is on the parent device).
   Successful binding automatically attaches the device to the
   block_dma domain via iommu_attach_group(). From now on the user is
   permitted to access the device. If mmap() in 4) is allowed, vfio
   actually sets up the MMIO mapping at this point.

6) Before the device is unbound from iommu_fd, it is always in a
   security context. Attaching/detaching just switches the security
   context between the block_dma domain and an ioasid domain.

7) Unbinding detaches the device from the block_dma domain and
   re-attaches it to the default domain. From now on the user should be
   denied access to the device. vfio should tear down the MMIO mapping
   at this point.

[Stage-2]

Both 1:1 and 1:N groups are handled via the new device-centric
protocol. The old vfio uAPI is kept for legacy applications. All devices
in the same group must share the same I/O address space.

A key difference from stage-1 is the additional check on group
viability:

1) vfio creates /dev/vfio/device nodes for all devices.

2) Same as stage-1 for getting iommu_fd.

3) iommu_fd reports that both 1:1 and 1:N groups are supported.

4) Same as stage-1 for getting device_fd.

5) When receiving the binding call for the 1st device in a group,
   iommu_fd does several things:

   a) Identify the group of this device and check group viability. A
      group is viable only when all devices in the group are in one of
      the states below:

      * driver-less
      * bound to a driver which is the same as the one which does the
        binding call (vfio in this case)
      * bound to an otherwise allowed driver (which indicates that it
        is safe for iommu_fd usage around probe())

   b) Attach all devices in the group to the block_dma domain, via the
      existing iommu_attach_group().

   c) Register a notifier callback to verify group viability on the
      IOMMU_GROUP_NOTIFY_BOUND_DRIVER event. BUG_ON() might be
      eliminated if we can find a way to deny probe of non-iommu-safe
      drivers.

   From now on the user is permitted to access the device. Similar to
   stage-1, vfio may set up the MMIO mapping at this point.

6) Binding other devices in the same group just succeeds.

7) Before the last device in the group is unbound from iommu_fd, all
   devices in the group (even those not bound to iommu_fd) switch
   together between the block_dma domain and an ioasid domain,
   initiated by attaching to or detaching from an ioasid.

   a) iommu_fd verifies that all bound devices in the same group are
      attached to a single IOASID.

   b) The 1st device attach in the group moves the entire group to use
      the new IOASID domain.

   c) The last device detach moves the entire group back to the
      block_dma domain.

8) A device is allowed to be unbound from iommu_fd when other devices
   in the group are still bound. In this case all devices in this group
   are still attached to a security context (block_dma or ioasid).
   vfio may still zap the MMIO mapping (though the device is still in a
   security context) since it doesn't know about groups in this new
   flow. The unbound device should not be bound to another driver which
   could break the group viability.

9) When the user requests to unbind the last device in the group,
   iommu_fd detaches the whole group from the block_dma domain. All
   MMIO mappings must be zapped immediately. Devices in the group are
   re-attached to the default domain from now on (not safe for the user
   to access).

[Stage-3]

It's still an open question whether we want to further allow devices
within a group to be attached to different IOASIDs in case the source
devices are reliably identifiable. This is a usage not supported by
existing vfio and might not be worthwhile due to improved isolation
over time.

When it's required, the iommu layer has to create sub-group objects and
expose the sub-group topology to userspace. In the meantime, the iommu
API will be extended to allow sub-group attach/detach operations.

In this case, there is not much difference in the stage-2 flow.
iommu_fd just needs to understand the sub-group topology when allowing
a group of devices to be attached to different IOASIDs. The key is
still to enforce that the entire group is in iommu_fd managed security
contexts (block_dma or ioasid) as long as one or more devices in the
group are still bound to it.

Thanks
Kevin
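
P.S. To make the stage-2 rules easier to discuss, here is a toy
userspace model of the bind/attach/detach/unbind transitions in steps
5) to 9). It is only a sketch under assumptions: the class and method
names (Device, Group, IommuFd, viable, bind, attach, detach, unbind)
and the "ioasid:N" domain strings are hypothetical and are not a
proposed uAPI or kernel interface. It tracks which domain a group sits
in and nothing else.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

DEFAULT = "default"      # kernel default domain, not safe for user access
BLOCK_DMA = "block_dma"  # empty I/O page table, all DMA blocked


@dataclass
class Device:
    name: str
    driver: Optional[str] = None  # None models a driver-less device


@dataclass
class Group:
    devices: List[Device]
    domain: str = DEFAULT
    bound: Set[str] = field(default_factory=set)            # device names
    attached: Dict[str, int] = field(default_factory=dict)  # name -> ioasid


class IommuFd:
    """Hypothetical model of the stage-2 group rules, not a real API."""

    def __init__(self, allowed_drivers=()):
        self.allowed = set(allowed_drivers)

    def viable(self, group: Group, binder: str) -> bool:
        # Step 5a: every device is driver-less, bound to the binding
        # driver, or bound to an otherwise allowed driver.
        return all(d.driver is None or d.driver == binder
                   or d.driver in self.allowed for d in group.devices)

    def bind(self, group: Group, dev: Device, binder: str = "vfio") -> None:
        if not group.bound:
            # First bind in the group: check viability, then move the
            # whole group to block_dma (step 5b).
            if not self.viable(group, binder):
                raise PermissionError("group is not viable")
            group.domain = BLOCK_DMA
        group.bound.add(dev.name)  # step 6: later binds just succeed

    def attach(self, group: Group, dev: Device, ioasid: int) -> None:
        if dev.name not in group.bound:
            raise PermissionError("device is not bound to iommu_fd")
        in_use = set(group.attached.values())
        if in_use and in_use != {ioasid}:
            # Step 7a: one IOASID per group.
            raise ValueError("group is already attached to another IOASID")
        group.attached[dev.name] = ioasid
        group.domain = f"ioasid:{ioasid}"  # step 7b

    def detach(self, group: Group, dev: Device) -> None:
        if dev.name not in group.attached:
            return
        del group.attached[dev.name]
        if not group.attached:
            group.domain = BLOCK_DMA  # step 7c: last detach

    def unbind(self, group: Group, dev: Device) -> None:
        if dev.name not in group.bound:
            return
        self.detach(group, dev)
        group.bound.discard(dev.name)
        if not group.bound:
            group.domain = DEFAULT  # step 9: last unbind


if __name__ == "__main__":
    a, b = Device("a", driver="vfio"), Device("b")
    g = Group([a, b])
    fd = IommuFd()
    fd.bind(g, a)
    fd.attach(g, a, 1)
    print(g.domain)
    fd.unbind(g, a)
    print(g.domain)
```

The model deliberately keeps one domain per group, which is the stage-2
constraint; the stage-3 sub-group case would need per-sub-group state
instead.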