> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 4:03 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > /dev/ioasid provides a unified interface for managing I/O page tables
> > for devices assigned to userspace. Device passthrough frameworks
> > (VFIO, vDPA, etc.) are expected to use this interface instead of
> > creating their own logic to isolate untrusted device DMAs initiated
> > by userspace.
>
> It is very long, but I think this has turned out quite well. It
> certainly matches the basic sketch I had in my head when we were
> talking about how to create vDPA devices a few years ago.
>
> When you get down to the operations they all seem pretty common sense
> and straightforward. Create an IOASID. Connect to a device. Fill the
> IOASID with pages somehow. Worry about PASID labeling.
>
> It really is critical to get all the vendor IOMMU people to go over it
> and see how their HW features map into this.

Agree. Btw, I feel it might be good to have several design opens
discussed centrally after going through all the comments. Otherwise
they may be buried in different sub-threads and potentially receive
insufficient attention (especially from people who haven't completed
the reading).

I summarized five opens here, about:

1) Finalizing the name to replace /dev/ioasid;
2) Whether one device is allowed to bind to multiple IOASID fd's;
3) Carrying device information in the invalidation/fault reporting uAPI;
4) What should/could be specified when allocating an IOASID;
5) The protocol between the vfio group and kvm.

For 1), two alternative names have been mentioned: /dev/iommu and
/dev/ioas. I don't have a strong preference and would like to hear
votes from all stakeholders. /dev/iommu is slightly better imho for two
reasons. First, per AMD's presentation at the last KVM Forum they
implement vIOMMU in hardware and thus need to support user-managed
domains; an iommu uAPI notation might make more sense going forward.
Second, it makes later uAPI naming easier, as 'IOASID' can always be
put as an object, e.g. IOMMU_ALLOC_IOASID instead of
IOASID_ALLOC_IOASID. :)

Another naming open is about IOASID (the software handle for an ioas)
and the associated hardware ID (PASID or substream ID). Jason thought
PASID is defined more from the SVA angle, while ARM's convention sounds
clearer from the device p.o.v. Following this direction, SID/SSID will
be used to replace RID/PASID in this RFC (possibly also implying that
the kernel IOASID allocator should be renamed to the SSID allocator). I
don't have a better alternative. If no one objects, I'll change to this
new naming in the next version.

For 2), Jason prefers not to block it if there is no kernel design
reason. If one device is allowed to bind to multiple IOASID fd's, the
main problem is cross-fd IOASID nesting, e.g. having gpa_ioasid created
in fd1 and giova_ioasid created in fd2 and then nesting them together
(and whether any cross-fd notification is required when handling
invalidations, etc.). We thought that this just adds complexity while
we are unsure about the value of supporting it (when one fd can already
afford all discussed usages). Therefore this RFC proposes that a device
be bound to at most one IOASID fd. Does this rationale make sense?

At the other extreme there was also the thought of whether we should
make it a single I/O address space per IOASID fd. It was discussed in a
previous thread that fd's are too scarce to afford the theoretical 1M
address spaces per device. But let's revisit this and draw a clear
conclusion on whether that option is viable.

For 3), Jason/Jean both think it's cleaner to carry device info in the
uAPI. Actually this was one option we developed in earlier internal
versions of this RFC; later we changed it to the current way based on a
misinterpretation of a previous discussion. Thinking about it more, we
will adopt this suggestion in the next version, for both efficiency
(the I/O page fault path is already long) and security (some faults are
unrecoverable, so the faulting device must be identified/isolated)
reasons.

This implies that VFIO_BOUND_IOASID will be extended to allow the user
to specify a device label. This label will be recorded in /dev/iommu to
serve per-device invalidation requests from the user and to report
per-device fault data to the user. In addition, the vPASID (if provided
by the user) will also be recorded in /dev/iommu so that
vPASID<->pPASID conversion is conducted properly, e.g. an invalidation
request from the user carries a vPASID which must be converted into a
pPASID before calling the iommu driver. Vice versa for raw fault data,
which carries a pPASID while the user expects a vPASID.
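To make this bookkeeping concrete, below is a minimal sketch in plain C
of the per-device state /dev/iommu could record at bind time and the
two translation helpers it implies. All names here (struct
iommu_dev_record, vpasid_to_ppasid(), etc.) are mine for illustration
only, not proposed uAPI or kAPI:

/*
 * Hypothetical per-device state kept by /dev/iommu once
 * VFIO_BOUND_IOASID carries a device label and (optionally) vPASIDs.
 */
#include <stdint.h>
#include <stddef.h>

struct pasid_mapping {
	uint32_t vpasid;	/* PASID as seen by the user/guest */
	uint32_t ppasid;	/* PASID programmed into the physical IOMMU */
};

struct iommu_dev_record {
	uint64_t dev_label;		/* user-provided label from bind time */
	struct pasid_mapping *maps;	/* vPASID<->pPASID table */
	size_t nr_maps;
};

/* Invalidation path: user-supplied vPASID -> pPASID for the iommu driver */
static int vpasid_to_ppasid(const struct iommu_dev_record *dev,
			    uint32_t vpasid, uint32_t *ppasid)
{
	for (size_t i = 0; i < dev->nr_maps; i++) {
		if (dev->maps[i].vpasid == vpasid) {
			*ppasid = dev->maps[i].ppasid;
			return 0;
		}
	}
	return -1;	/* unknown vPASID: reject the request */
}

/* Fault reporting path: raw pPASID from hardware -> vPASID for the user */
static int ppasid_to_vpasid(const struct iommu_dev_record *dev,
			    uint32_t ppasid, uint32_t *vpasid)
{
	for (size_t i = 0; i < dev->nr_maps; i++) {
		if (dev->maps[i].ppasid == ppasid) {
			*vpasid = dev->maps[i].vpasid;
			return 0;
		}
	}
	return -1;	/* no vPASID recorded; report the device label only */
}

The point of the sketch is that both directions are keyed by the device
label recorded at bind time, which is also what lets faults be reported
(and an offending device be isolated) per device.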
For 4), there are two options for specifying the IOASID attributes:

In this RFC, an IOASID has no attributes before it's attached to any
device. After device attach, the user queries capability/format info
about the IOMMU which the device belongs to, and then calls different
ioctl commands to set the attributes for the IOASID (e.g. map/unmap,
bind/unbind user page table, nesting, etc.). This follows how the
underlying iommu-layer API is designed: a domain reports
capability/format info and serves iommu ops only after it's attached
to a device.

Jason suggests having the user specify all attributes about how an
IOASID is expected to work when creating it. This requires /dev/iommu
to provide capability/format info once a device is bound to the ioasid
fd (before creating any IOASID). In concept this should work, since
given a device we can always find its IOMMU. The only gap is the
aforementioned one: the current iommu API is designed per domain
instead of per device. It seems that to close this design open we have
to touch the kAPI design, and Joerg's input is highly appreciated here.

For 5), I'd expect Alex to chime in. Per my understanding, the original
purpose of this protocol is not about I/O address space. It's for KVM
to know whether any device is assigned to this VM and then do something
special (e.g. posted interrupts, EPT cache attributes, etc.). Because
KVM deduces some policy based on the fact of an assigned device, it
needs to hold a reference to the related vfio group. This part is
irrelevant to this RFC. But ARM's VMID usage is related to I/O address
space and thus needs some consideration.

Another strange thing is about PPC. It looks like PPC also leverages
this protocol to do iommu group attach:
kvm_spapr_tce_attach_iommu_group. I don't know why it's done through
KVM instead of the VFIO uAPI in the first place.

Thanks
Kevin