> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, December 9, 2021 2:31 AM
>
> On Wed, Dec 08, 2021 at 05:20:39PM +0000, Jean-Philippe Brucker wrote:
> > On Wed, Dec 08, 2021 at 08:56:16AM -0400, Jason Gunthorpe wrote:
> > > From a progress perspective I would like to start with simple 'page
> > > tables in userspace', ie no PASID in this step.
> > >
> > > 'page tables in userspace' means an iommufd ioctl to create an
> > > iommu_domain where the IOMMU HW is directly traversing a
> > > device-specific page table structure in user space memory. All the HW
> > > today implements this by using another iommu_domain to allow the IOMMU
> > > HW DMA access to user memory - ie nesting or multi-stage or whatever.
> > >
> > > This would come along with some ioctls to invalidate the IOTLB.
> > >
> > > I'm imagining this step as an iommu_group->op->create_user_domain()
> > > driver callback which will create a new kind of domain with
> > > domain-unique ops. Ie map/unmap related should all be NULL as those
> > > are impossible operations.
> > >
> > > From there the usual struct device (ie RID) attach/detach stuff needs
> > > to take care of routing DMAs to this iommu_domain.
> > >
> > > Step two would be to add the ability for an iommufd using driver to
> > > request that a RID&PASID is connected to an iommu_domain. This
> > > connection can be requested for any kind of iommu_domain, kernel owned
> > > or user owned.
> > >
> > > I don't quite have an answer how exactly the SMMUv3 vs Intel
> > > difference in PASID routing should be resolved.
> >
> > In SMMUv3 the user pgd is always stored in the PASID table (actually
> > called "context descriptor table" but I want to avoid confusion with
> > the VT-d "context table"). And to access the PASID table, the SMMUv3 first
> > translates its GPA into a PA using the stage-2 page table.
> > For userspace to pass individual pgds to the kernel, as opposed to
> > passing whole PASID tables, the host kernel needs to reserve GPA space
> > and map it in stage-2, so it can store the PASID table in there.
> > Userspace manages GPA space.
>
> It is what I thought.. So in the SMMUv3 spec the STE is completely in
> kernel memory, but it points to an S1ContextPtr that must be an IPA if
> the "stage 1 translation tables" are IPA. Only via S1ContextPtr can we
> decode the substream?
>
> So in SMMUv3 land we don't really ever talk about PASID, we have a
> 'user page table' that is bound to an entire RID and *all* PASIDs.
>
> While Intel would have a 'user page table' that is only bound to a RID
> & PASID
>
> Certainly it is not a difference we can hide from userspace.

Concept-wise it is still a 'user page table' with a vendor-specific
format. Taking your earlier analogy, it's just for a single 84-bit
address space (20PASID+64bitVA) per RID. So what we require is still one
unified ioctl in your step-1 to support a per-RID 'user page table'.

For ARM it's SMMU's PASID table format. There is no step-2 since PASID
is already within the address space covered by the user PASID table.

For Intel it's VT-d's 1st level page table format. Moving to step-2
then allows multiple 'user page tables' to be connected to RID & PASID.

>
> > This would be easy for a single pgd. In this case the PASID table has a
> > single entry and userspace could just pass one GPA page during
> > registration. However it isn't easily generalized to full PASID support,
> > because managing a multi-level PASID table will require runtime GPA
> > allocation, and that API is awkward. That's why we opted for "attach PASID
> > table" operation rather than "attach page table" (back then the choice was
> > easy since VT-d used the same concept).
>
> I think the entire context descriptor table should be in userspace,
> and filled in by userspace, as part of the userspace page table.
>
> The kernel API should accept the S1ContextPtr IPA and all the parts of
> the STE that relate to defining the layout of what the S1Context
> points to and that's it.
>
> We should have another mode where the kernel owns everything, and the
> S1ContextPtr is a PA with Stage 2 bypassed.

I guess this is for usage like DPDK. In that case yes, we can have a
unified ioctl since the kernel manages everything including the PASID
table.

>
> That part is fine, the more open question is what does the driver
> interface look like when userspace tells something like vfio-pci to
> connect to this thing. At some level the attaching device needs to
> authorize iommufd to take the entire PASID table and RID.

As long as the SMMU driver supports only the step-1 ioctl, this
authorization should already be implied.

>
> Specifically we cannot use this thing with a mdev, while the Intel
> version of a userspace page table can be.

Yes. Supporting mdev is the whole reason why Intel puts the PASID table
in host physical address space, to be managed by the kernel.

>
> Maybe that is just some 'allow whole device' flag in an API
>

As said, I feel this special flag is not required as long as the vendor
iommu driver only supports your step-1 interface, which implies 'allow
whole device' for ARM.

Thanks
Kevin
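
P.S. To make the step-1/step-2 split above concrete, here is a minimal
sketch in plain C. Every name in it is hypothetical and none of it is an
existing iommufd or kernel interface. It only encodes the reading that a
step-1 'user page table' in SMMU PASID table format already covers all
PASIDs of a RID, while one in VT-d 1st-level format does not. The VT-d
side of the 'whole device' helper is my assumption, not something stated
in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this is an existing iommufd API. */

#define PASID_BITS 20u
#define PASID_MAX  ((1u << PASID_BITS) - 1u)

/* Vendor-specific format of a step-1, per-RID 'user page table'. */
enum user_pgtable_format {
	USER_PGTABLE_SMMUV3_PASID_TABLE, /* ARM: user-owned PASID (CD) table */
	USER_PGTABLE_VTD_FIRST_LEVEL,    /* Intel: one 1st-level page table */
};

/* One address in the per-RID "84-bit" space: 20-bit PASID + 64-bit VA. */
struct rid_address {
	uint32_t pasid;
	uint64_t va;
};

static struct rid_address make_rid_address(uint32_t pasid, uint64_t va)
{
	struct rid_address addr = { .pasid = pasid, .va = va };

	assert(pasid <= PASID_MAX); /* a PASID is at most 20 bits wide */
	return addr;
}

/*
 * True when the step-1 table itself decodes the PASID, so a single
 * per-RID attach already covers every PASID of the device.
 */
static bool step1_covers_all_pasids(enum user_pgtable_format fmt)
{
	return fmt == USER_PGTABLE_SMMUV3_PASID_TABLE;
}

/* Step-2 (attach to RID & PASID) only applies when step-1 does not. */
static bool step2_applicable(enum user_pgtable_format fmt)
{
	return !step1_covers_all_pasids(fmt);
}

/*
 * Step-1 attach implies 'allow whole device' in the ARM case.
 * Assumption for VT-d: the kernel keeps the PASID table, so a step-1
 * attach does not hand over the whole device.
 */
static bool step1_implies_whole_device(enum user_pgtable_format fmt)
{
	return step1_covers_all_pasids(fmt);
}
```

With this shape, no separate 'allow whole device' flag is needed: the
format advertised by the vendor driver already answers the question.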