Re: [RFC v2] /dev/iommu uAPI proposal

Eric Auger <eric.auger@xxxxxxxxxx> · Tue, 10 Aug 2021 09:17:10 +0200

Hi Kevin,

On 8/5/21 2:36 AM, Tian, Kevin wrote:
>> From: Eric Auger <eric.auger@xxxxxxxxxx>
>> Sent: Wednesday, August 4, 2021 11:59 PM
>>
> [...] 
>>> 1.2. Attach Device to I/O address space
>>> +++++++++++++++++++++++++++++++++++++++
>>>
>>> Device attach/bind is initiated through passthrough framework uAPI.
>>>
>>> Device attaching is allowed only after a device is successfully bound to
>>> the IOMMU fd. User should provide a device cookie when binding the
>>> device through VFIO uAPI. This cookie is used when the user queries
>>> device capability/format, issues per-device iotlb invalidation and
>>> receives per-device I/O page fault data via IOMMU fd.
>>>
>>> Successful binding puts the device into a security context which isolates
>>> its DMA from the rest system. VFIO should not allow user to access the
>> s/from the rest system/from the rest of the system
>>> device before binding is completed. Similarly, VFIO should prevent the
>>> user from unbinding the device before user access is withdrawn.
>> With Intel scalable IOV, I understand you could assign an RID/PASID to
>> one VM and another one to another VM (which is not the case for ARM). Is
>> it a targeted use case? How would it be handled? Is it related to the
>> sub-groups mentioned below?
> Not related to sub-group. Each mdev is bound to the IOMMU fd respectively
> with the defPASID which represents the mdev.
But how does it work in terms of security? The device (RID) is bound to
an IOMMU fd, but then each SID/PASID may be working for a different VM.
How do you determine that this is safe, i.e. that each SID can safely
serve a different VM, as opposed to the ARM case where this is not possible?

1.3 says
"

1)  A successful binding call for the first device in the group creates
    the security context for the entire group, by:
"
What does this mean for the scalable IOV use case above?

>
>> Actually all devices bound to an IOMMU fd should have the same parent
>> I/O address space or root address space, am I correct? If so, maybe add
>> this comment explicitly?
> in most cases yes but it's not mandatory. multiple roots are allowed
> (e.g. with vIOMMU but no nesting).
OK, right, this corresponds to example 4.2, for instance. I misinterpreted
the notion of security context. The security context does not match the
IOMMU fd but is something implicitly created on the first device binding.
>
> [...]
>>> The device in the /dev/iommu context always refers to a physical one
>>> (pdev) which is identifiable via RID. Physically each pdev can support
>>> one default I/O address space (routed via RID) and optionally multiple
>>> non-default I/O address spaces (via RID+PASID).
>>>
>>> The device in VFIO context is a logic concept, being either a physical
>>> device (pdev) or mediated device (mdev or subdev). Each vfio device
>>> is represented by RID+cookie in IOMMU fd. User is allowed to create
>>> one default I/O address space (routed by vRID from user p.o.v) per
>>> each vfio_device.
>> The concept of default address space is not fully clear to me. I
>> currently understand this is a root address space (not nesting). Is that
>> correct? This may need clarification.
> w/o PASID there is only one address space (either GPA or GIOVA)
> per device. This one is called default. whether it's root is orthogonal
> (e.g. GIOVA could be also nested) to the device view of this space.
>
> w/ PASID additional address spaces can be targeted by the device.
> those are called non-default.
>
> I could also rename default to RID address space and non-default to 
> RID+PASID address space if doing so makes it clearer.
Yes, I think it is worth having a kind of glossary that defines root and
default, just as you clearly defined child/parent.
>
>>> VFIO decides the routing information for this default
>>> space based on device type:
>>>
>>> 1)  pdev, routed via RID;
>>>
>>> 2)  mdev/subdev with IOMMU-enforced DMA isolation, routed via
>>>     the parent's RID plus the PASID marking this mdev;
>>>
>>> 3)  a purely sw-mediated device (sw mdev), no routing required i.e. no
>>>     need to install the I/O page table in the IOMMU. sw mdev just uses
>>>     the metadata to assist its internal DMA isolation logic on top of
>>>     the parent's IOMMU page table;
>> Maybe you should introduce this concept of SW mediated device earlier
>> because it seems to special case the way the attach behaves. I am
>> especially referring to
>>
>> "Successful attaching activates an I/O address space in the IOMMU, if the
>> device is not purely software mediated"
> makes sense.
>
>>> In addition, VFIO may allow user to create additional I/O address spaces
>>> on a vfio_device based on the hardware capability. In such case the user
>>> has its own view of the virtual routing information (vPASID) when marking
>>> these non-default address spaces.
>> I do not understand what "marking these non-default address spaces" means.
> as explained above, those non-default address spaces are identified/routed
> via PASID. 
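To check my understanding of the routing rules, here is a minimal sketch in C.
All type and function names are hypothetical and only for illustration; none
of them is part of the proposed uAPI. The only outside fact assumed is that a
PCIe PASID is at most 20 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device kinds, mirroring cases 1) to 3) of the proposal. */
enum dev_kind { DEV_PDEV, DEV_MDEV_IOMMU, DEV_MDEV_SW };

/* Hypothetical routing descriptor for one I/O address space. */
struct routing {
    bool     in_iommu;  /* false: no I/O page table installed (sw mdev) */
    bool     has_pasid; /* true: routed via RID+PASID, false: via RID only */
    uint16_t rid;       /* RID of the pdev (or of the mdev's parent) */
    uint32_t pasid;     /* meaningful only when has_pasid is true */
};

#define PASID_MAX ((1u << 20) - 1) /* PCIe PASIDs are at most 20 bits */

/* Routing chosen by VFIO for the default address space of a vfio device. */
static struct routing default_routing(enum dev_kind kind, uint16_t parent_rid,
                                      uint32_t def_pasid)
{
    struct routing r = { false, false, parent_rid, 0 };

    switch (kind) {
    case DEV_PDEV:              /* 1) routed via RID */
        r.in_iommu = true;
        break;
    case DEV_MDEV_IOMMU:        /* 2) parent's RID plus the mdev's PASID */
        assert(def_pasid <= PASID_MAX);
        r.in_iommu = true;
        r.has_pasid = true;
        r.pasid = def_pasid;
        break;
    case DEV_MDEV_SW:           /* 3) no routing required */
        break;
    }
    return r;
}

/* Non-default address spaces are always identified via RID+PASID. */
static struct routing nondefault_routing(uint16_t rid, uint32_t pasid)
{
    struct routing r = { true, true, rid, pasid };

    assert(pasid <= PASID_MAX);
    return r;
}
```

With this model, "marking" a non-default address space simply means picking
the PASID passed to nondefault_routing(); the user sees it as a vPASID.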
>
>>> 1.3. Group isolation
>>> ++++++++++++++++++++
> [...]
>>> 1)  A successful binding call for the first device in the group creates
>>>     the security context for the entire group, by:
>>>
>>>     * Verifying group viability in a similar way as VFIO does;
>>>
>>>     * Calling IOMMU-API to move the group into a block-dma state,
>>>       which makes all devices in the group attached to a block-dma
>>>       domain with an empty I/O page table;
>> this block-dma state/domain deserves to be better defined (I know
>> you already mentioned it in 1.1 with the dma mapping protocol though).
>> Does it activate an empty I/O page table in the IOMMU (if the device is
>> not purely SW mediated)?
> sure. some explanations are scattered in following paragraph, but I
> can consider to further clarify it.
>
>> How does that relate to the default address space? Is it the same?
> different. this block-dma domain doesn't hold any valid mapping. The
> default address space is represented by a normal unmanaged domain.
> the ioasid attaching operation will detach the device from the block-dma
> domain and then attach it to the target ioasid.
OK
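For the record, this is how I now picture the transitions, as a small C model.
The names below are hypothetical and do not come from the proposal or from the
IOMMU API; it is only a sketch of the behaviour you describe, not of any real
implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical domain kinds a bound device can sit in. */
enum domain_kind {
    DOMAIN_NONE,       /* device not bound to the IOMMU fd yet */
    DOMAIN_BLOCK_DMA,  /* empty I/O page table, no valid mapping */
    DOMAIN_UNMANAGED   /* normal domain backing an IOASID */
};

struct dev_state {
    enum domain_kind domain;
    int ioasid;        /* -1 unless domain == DOMAIN_UNMANAGED */
};

/* Binding puts the device into the block-dma domain. */
static void bind_device(struct dev_state *d)
{
    d->domain = DOMAIN_BLOCK_DMA;
    d->ioasid = -1;
}

/*
 * Attaching detaches the device from its current domain (block-dma right
 * after binding) and attaches it to the target IOASID.
 * Returns 0 on success, -1 if the device has not been bound yet.
 */
static int attach_ioasid(struct dev_state *d, int ioasid)
{
    if (d->domain == DOMAIN_NONE)
        return -1;     /* attach is allowed only after a successful bind */
    d->domain = DOMAIN_UNMANAGED;
    d->ioasid = ioasid;
    return 0;
}

/* Only a device attached to an IOASID can have valid DMA mappings. */
static bool may_have_mappings(const struct dev_state *d)
{
    return d->domain == DOMAIN_UNMANAGED;
}
```

So a freshly bound device has may_have_mappings() == false until the first
successful attach_ioasid() call.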

Thanks

Eric
>
>>> 2. uAPI Proposal
>>> ----------------------
> [...]
>>> /*
>>>   * Allocate an IOASID.
>>>   *
>>>   * IOASID is the FD-local software handle representing an I/O address
>>>   * space. Each IOASID is associated with a single I/O page table. User
>>>   * must call this ioctl to get an IOASID for every I/O address space that is
>>>   * intended to be tracked by the kernel.
>>>   *
>>>   * User needs to specify the attributes of the IOASID and associated
>>>   * I/O page table format information according to one or multiple devices
>>>   * which will be attached to this IOASID right after. The I/O page table
>>>   * is activated in the IOMMU when it's attached by a device. Incompatible
>> .. if not SW mediated
>>>   * format between device and IOASID will lead to attaching failure.
>>>   *
>>>   * The root IOASID should always have a kernel-managed I/O page
>>>   * table for safety. Locked page accounting is also conducted on the root.
>> The definition of root IOASID is not easily found in this spec. Maybe
>> this would deserve some clarification.
> makes sense.
>
> and thanks for other typo-related comments.
>
> Thanks
> Kevin