Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

Jean-Philippe Brucker <jean-philippe.brucker@xxxxxxx> · Thu, 13 Sep 2018 16:03:01 +0100

On 13/09/2018 01:19, Tian, Kevin wrote:
>>> This is proposed for architectures which support finer granularity
>>> second level translation with no impact on architectures which only
>>> support Source ID or the similar granularity.
>>
>> Just to be clear, in this paragraph you're only referring to the
>> Nested/second-level translation for mdev, which is specific to vt-d
>> rev3? Other architectures can still do first-level translation with
>> PASID, to support some use-cases of IOMMU aware mediated device
>> (assigning mdevs to userspace drivers, for example)
> 
> yes. aux domain concept applies only to vt-d rev3 which introduces
> scalable mode. Care is taken to avoid breaking usages on existing
> architectures.
> 
> one note. Assigning mdevs to user space alone doesn't imply IOMMU
> aware. All existing mdev usages use software or proprietary methods to
> isolate DMA. There is only one potential IOMMU aware mdev usage 
> which we talked not rely on vt-d rev3 scalable mode - wrap a random 
> PCI device into a single mdev instance (no sharing). In that case mdev 
> inherits RID from parent PCI device, thus is isolated by IOMMU in RID 
> granular. Our RFC supports this usage too. In VFIO two usages (PASID-
> based and RID-based) use same code path, i.e. always binding domain to
> the parent device of mdev. But within IOMMU they go different paths.
> PASID-based will go to aux-domain as iommu_enable_aux_domain
> has been called on that device. RID-based will follow existing 
> unmanaged domain path, as if it is parent device assignment.

For Arm SMMU we're more interested in the PASID-granular case than the
RID-granular one. It doesn't necessarily require vt-d rev3 scalable
mode, the following example can be implemented with an SMMUv3, since it
only needs PASID-granular first-level translation:

We have a PCI function that supports PASID, and can be partitioned into
multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
vector and a PASID.

Different processes (userspace drivers, not QEMU) each open one mdev. A
process controlling one mdev has two ways of doing DMA:

(1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
creates an auxiliary domain for the mdev, with PASID #35. The process
creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
the auxiliary domain. The IOMMU driver populates the pgtables associated
with PASID #35.

(2) SVA. One way of doing it: the process uses a new
"VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process address
space to the device, gets PASID #35. Simpler, but not everyone wants to
use SVA, especially not userspace drivers which need the highest
performance.

This example only needs to modify first-level translation, and works
with SMMUv3. The kernel here could be the host, in which case
second-level translation is disabled in the SMMU, or it could be the
guest, in which case second-level mappings are created by QEMU and
first-level translation is managed by assigning PASID tables to the guest.

So (2) would use iommu_sva_bind_device(), but (1) needs something else.
Aren't auxiliary domains suitable for (1)? Why limit auxiliary domain to
second-level or nested translation? It seems silly to use a different
API for first-level, since the flow in userspace and VFIO is the same as
your second-level case as far as MAP_DMA ioctl goes. The difference is
that in your case the auxiliary domain supports an additional operation
which binds first-level page tables. An auxiliary domain that only
supports first-level wouldn't support this operation, but it can still
implement iommu_map/unmap/etc.

Another note: if for some reason you did want to allow userspace to
choose between first-level or second-level, you could implement the
VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So DMA_MAP
ioctl on a NESTING container would populate second-level, and DMA_MAP on
a normal container populates first-level. But if you're always going to
use second-level by default, the distinction isn't necessary.

>> Sounds good, I'll drop the private PASID patch if we can figure out a
>> solution to the attach/detach_dev problem discussed on patch 8/10
>>
> 
> Can you elaborate a bit on private PASID usage? what is the
> high level flow on it? 
> 
> Again based on earlier explanation, aux domain is specific to IOMMU
> architecture supporting vtd scalable mode-like capability, which allows
> separate 2nd/1st level translations per PASID. Need a better understanding
> how private PASID is relevant here.

Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
(first-level translation):
https://www.spinics.net/lists/dri-devel/msg177003.html As above, some
people don't want SVA, some can't do it, some may even want a few
private address spaces just for their kernel driver. They need a way to
allocate PASIDs and do iommu_map/iommu_unmap on them, without binding to
a process. I was planning to add the private PASID patch to my SVA
series, but in my opinion the feature overlaps with auxiliary domains.

Thanks,
Jean