> From: Joerg Roedel <joro@xxxxxxxxxx>
> Sent: Monday, May 17, 2021 11:35 PM
>
> On Mon, May 17, 2021 at 10:35:00AM -0300, Jason Gunthorpe wrote:
> > Well, I'm sorry, but there is a huge other thread talking about the
> > IOASID design in great detail and why this is all needed. Jumping into
> > this thread without context and basically rejecting all the
> > conclusions that were reached over the last several weeks is really
> > not helpful - especially since your objection is not technical.
> >
> > I think you should wait for Intel to put together the /dev/ioasid uAPI
> > proposal and the example use cases it should address then you can give
> > feedback there, with proper context.
>
> Yes, I think the next step is that someone who read the whole thread
> writes up the conclusions and a rough /dev/ioasid API proposal, also
> mentioning the use-cases it addresses. Based on that we can discuss the
> implications this needs to have for IOMMU-API and code.
>
> From the use-cases I know the mdev concept is just fine. But if there is
> a more generic one we can talk about it.
>

Although the /dev/iommu v2 proposal is still in progress, I think enough
background has been gathered in v1 to resume this discussion now.

In a nutshell, /dev/iommu requires two sets of services from the iommu
layer:

  - for a kernel-managed I/O page table, via map/unmap;
  - for a user-managed I/O page table, via bind/invalidate, nested on a
    kernel-managed parent I/O page table;

Each I/O page table can be attached to multiple devices. /dev/iommu
maintains device-specific routing information (RID, or RID+PASID) for
where to install the I/O page table in the IOMMU for each attached
device.

A kernel-managed page table is represented by an iommu domain. The
existing IOMMU-API allows /dev/iommu to attach a RID device to an iommu
domain. A new interface is required, e.g.
iommu_attach_device_pasid(domain, dev, pasid), to cover RID+PASID
attaching. Once attaching succeeds, there is no change to the following
map/unmap calls, which are domain-wide and thus apply to both RID and
RID+PASID. In case dev_iotlb invalidation is required, the iommu driver
is responsible for issuing it for every attached RID or RID+PASID if ATS
is enabled.

To take one example, the parent device (RID1) has three work queues. WQ1
is for the parent's own DMA-API usage, with WQ2 (PASID-x) assigned to
vm1 and WQ3 (PASID-y) assigned to vm2. vm2 is also assigned another
device (RID2). In this case there are three kernel-managed I/O page
tables (IOVA in the kernel, GPA for vm1 and GPA for vm2), thus RID1 is
attached to three domains:

    RID1 --- domain1 (default, IOVA)
     |           |
     |           |-- [RID1]
     |
     |-- domain2 (vm1, GPA)
     |           |
     |           |-- [RID1, PASID-x]
     |
     |-- domain3 (vm2, GPA)
     |           |
     |           |-- [RID1, PASID-y]
     |           |
     |           |-- [RID2]

The iommu layer should maintain the above attaching status per device
and per iommu domain. There is no mdev/subdev concept in the iommu
layer. It's just about RID or PASID.

A user-managed I/O page table might be represented by a new object which
describes:

  - routing information (RID or RID+PASID);
  - a pointer to the iommu_domain of the parent I/O page table (to
    inherit the domain ID in the iotlb due to nesting);
  - the address of the I/O page table;

There might be a chance to share this structure with native SVA, which
also has its page table managed outside of the iommu subsystem. But we
can leave that to be figured out when coding.
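For the sake of discussion only, such an object might look roughly like
below. This is a sketch; the structure name and field names are made up
here and nothing about the layout is settled:

/*
 * Rough sketch of the new object described above, for discussion
 * only. Types come from <linux/device.h>, <linux/iommu.h> and
 * <linux/ioasid.h>; the structure and field names are invented.
 */
struct iommu_user_pgtable {
	/* routing information */
	struct device		*dev;		/* RID */
	ioasid_t		pasid;		/* or INVALID_IOASID when the
						 * table is bound to the RID
						 * alone */
	/* parent I/O page table; its domain ID is inherited in the
	 * iotlb due to nesting */
	struct iommu_domain	*parent;
	/* address of the user-managed I/O page table, e.g. the GPA of
	 * a guest page table (vendor-specific format info would also
	 * be needed, not shown here) */
	u64			pgtbl_addr;
};

Whether the PASID lives in such an object or is only carried in the bind
call is obviously still open.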
And a new set of IOMMU-API calls:

  - iommu_{un}bind_pgtable(domain, dev, addr);
  - iommu_{un}bind_pgtable_pasid(domain, dev, addr, pasid);
  - iommu_cache_invalidate(domain, dev, invalid_info);
  - and APIs for registering a fault handler and completing faults;

(here 'domain' is the one representing the parent I/O page table)

Because nesting essentially creates a new reference to the parent I/O
page table, iommu_bind_pgtable_pasid() implicitly calls
__iommu_attach_device_pasid() to set up the connection between the
parent domain and the new [RID, PASID]. It's not necessary to do so for
iommu_bind_pgtable() since the RID is already attached when the parent
I/O page table is created.

In consequence, the example topology is updated as below, with guest SVA
enabled in both vm1 and vm2:

    RID1 --- domain1 (default, IOVA)
     |           |
     |           |-- [RID1]
     |
     |-- domain2 (vm1, GPA)
     |           |
     |           |-- [RID1, PASID-x]
     |           |-- [RID1, PASID-a] // nested for vm1 process1
     |           |-- [RID1, PASID-b] // nested for vm1 process2
     |
     |-- domain3 (vm2, GPA)
     |           |
     |           |-- [RID1, PASID-y]
     |           |-- [RID1, PASID-c] // nested for vm2 process1
     |           |
     |           |-- [RID2]
     |           |-- [RID2, PASID-a] // nested for vm2 process2

Thoughts?

Thanks
Kevin
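P.S. to make the intended flow concrete, below is a rough sketch of how
/dev/iommu might drive the proposed calls when vm1 binds a guest I/O
page table on [RID1, PASID-a], nested on domain2. Everything here is
illustrative; the helper name, argument types and error handling are
placeholders rather than a proposal of the actual signatures:

/*
 * Illustration only: binding a guest I/O page table for [RID1,
 * PASID-a] on top of the kernel-managed GPA domain (domain2).
 * All names and types below are placeholders.
 */
static int example_bind_guest_pgtable(struct iommu_domain *domain2,
				      struct device *rid1,
				      ioasid_t pasid_a,
				      u64 guest_pgtbl_addr)
{
	int ret;

	/*
	 * Creates a new reference to the parent I/O page table, so the
	 * implementation internally does __iommu_attach_device_pasid()
	 * for [RID1, PASID-a] before installing the nested table.
	 */
	ret = iommu_bind_pgtable_pasid(domain2, rid1, guest_pgtbl_addr,
				       pasid_a);
	if (ret)
		return ret;

	/*
	 * Later, guest-initiated invalidations of its own page table
	 * are forwarded down as:
	 *
	 *	iommu_cache_invalidate(domain2, rid1, invalid_info);
	 *
	 * and the iommu driver additionally issues dev_iotlb
	 * invalidation for the attached RID/RID+PASID if ATS is
	 * enabled.
	 */
	return 0;
}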