On Thu, Sep 13, 2018 at 04:03:01PM +0100, Jean-Philippe Brucker wrote:
> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer granularity
> >>> second level translation with no impact on architectures which only
> >>> support Source ID or the similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> Nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU aware mediated device
> >> (assigning mdevs to userspace drivers, for example)
> >
> > yes. aux domain concept applies only to vt-d rev3 which introduces
> > scalable mode. Care is taken to avoid breaking usages on existing
> > architectures.
> >
> > one note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods to
> > isolate DMA. There is only one potential IOMMU aware mdev usage
> > which we talked not rely on vt-d rev3 scalable mode - wrap a random
> > PCI device into a single mdev instance (no sharing). In that case mdev
> > inherits RID from parent PCI device, thus is isolated by IOMMU in RID
> > granular. Our RFC supports this usage too. In VFIO two usages (PASID-
> > based and RID-based) use same code path, i.e. always binding domain to
> > the parent device of mdev. But within IOMMU they go different paths.
> > PASID-based will go to aux-domain as iommu_enable_aux_domain
> > has been called on that device. RID-based will follow existing
> > unmanaged domain path, as if it is parent device assignment.
>
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one.
> It doesn't necessarily require vt-d rev3 scalable
> mode, the following example can be implemented with an SMMUv3, since it
> only needs PASID-granular first-level translation:

You are right, you can simply use the first level as IOVA for every PASID.
The only issue arises when you need to assign that to a guest: you would
then be required to shadow the first level. If you have a second level
per PASID, the first level can be managed in the guest and doesn't need
to be shadowed.

> We have a PCI function that supports PASID, and can be partitioned into
> multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
> vector and a PASID.
>
> Different processes (userspace drivers, not QEMU) each open one mdev. A
> process controlling one mdev has two ways of doing DMA:
>
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables associated
> with PASID #35.
>
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process address
> space to the device, gets PASID #35. Simpler, but not everyone wants to
> use SVA, especially not userspace drivers which need the highest
> performance.
>
>
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the guest.
>
> So (2) would use iommu_sva_bind_device(), but (1) needs something else.
> Aren't auxiliary domains suitable for (1)? Why limit auxiliary domain to
> second-level or nested translation?
> It seems silly to use a different
> API for first-level, since the flow in userspace and VFIO is the same as
> your second-level case as far as MAP_DMA ioctl goes. The difference is
> that in your case the auxiliary domain supports an additional operation
> which binds first-level page tables. An auxiliary domain that only
> supports first-level wouldn't support this operation, but it can still
> implement iommu_map/unmap/etc.
>
>
> Another note: if for some reason you did want to allow userspace to
> choose between first-level or second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So DMA_MAP
> ioctl on a NESTING container would populate second-level, and DMA_MAP on
> a normal container populates first-level. But if you're always going to
> use second-level by default, the distinction isn't necessary.

Where is the nesting attribute specified? In vt-d2 it was part of the
context entry, which also meant that all PASIDs are nested. In vt-d3 it's
part of the PASID context. It seems unsafe to share PASIDs with different
VMs, since any request without PASID has only one mapping.

> >> Sounds good, I'll drop the private PASID patch if we can figure out a
> >> solution to the attach/detach_dev problem discussed on patch 8/10
> >>
> >
> > Can you elaborate a bit on private PASID usage? what is the
> > high level flow on it?
> >
> > Again based on earlier explanation, aux domain is specific to IOMMU
> > architecture supporting vtd scalable mode-like capability, which allows
> > separate 2nd/1st level translations per PASID. Need a better understanding
> > how private PASID is relevant here.
>
> Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
> (first-level translation):
> https://www.spinics.net/lists/dri-devel/msg177003.html
> As above, some
> people don't want SVA, some can't do it, some may even want a few
> private address spaces just for their kernel driver. They need a way to
> allocate PASIDs and do iommu_map/iommu_unmap on them, without binding to
> a process. I was planning to add the private PASID patch to my SVA
> series, but in my opinion the feature overlaps with auxiliary domains.

It sounds like it maps to AUX domains.
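
For reference, the userspace side of flow (1) quoted above is just the
ordinary type1 sequence. A minimal sketch, assuming the container and
group file descriptors are already open; the helper name mdev_type1_map
is made up for illustration, only the ioctls and the struct come from
<linux/vfio.h>. Whether the kernel backs the container with an auxiliary
domain and a PASID is not visible at this level, which I take to be the
point about the MAP_DMA flow being the same in both cases.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper (not part of any existing API): attach an already
 * opened VFIO group to an already opened container, select the type1v2
 * backend, then map one buffer for DMA at the given IOVA.
 * Returns 0 on success or a negative errno value on failure.
 */
int mdev_type1_map(int container, int group, void *vaddr,
		   uint64_t iova, uint64_t size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)vaddr,	/* process virtual address */
		.iova  = iova,			/* address the device will use */
		.size  = size,
	};

	/* The group must join the container before an IOMMU type is set. */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0)
		return -errno;

	/* Same container type as a plain RID-granular assignment. */
	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU) < 0)
		return -errno;

	/* VFIO forwards this to iommu_map() on whatever domain it attached. */
	if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map) < 0)
		return -errno;

	return 0;
}
```

A caller would open /dev/vfio/vfio and the mdev's group node first and
pass both descriptors in. With invalid descriptors the helper simply
returns -EBADF from the first ioctl.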