> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 4:03 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > /dev/ioasid provides a unified interface for managing I/O page tables
> > for devices assigned to userspace. Device passthrough frameworks
> > (VFIO, vDPA, etc.) are expected to use this interface instead of
> > creating their own logic to isolate untrusted device DMAs initiated
> > by userspace.
>
> It is very long, but I think this has turned out quite well. It
> certainly matches the basic sketch I had in my head when we were
> talking about how to create vDPA devices a few years ago.
>
> When you get down to the operations they all seem pretty common sense
> and straightforward. Create an IOASID. Connect to a device. Fill the
> IOASID with pages somehow. Worry about PASID labeling.
>
> It really is critical to get all the vendor IOMMU people to go over it
> and see how their HW features map into this.

Agree. Btw, I feel it might be good to have several design opens
discussed centrally after going through all the comments. Otherwise
they may be buried in different sub-threads and potentially receive
insufficient attention (especially from people who haven't completed
the reading).

I summarized five opens here, about:

1) Finalizing the name to replace /dev/ioasid;
2) Whether one device is allowed to bind to multiple IOASID fd's;
3) Carrying device information in the invalidation/fault reporting uAPI;
4) What should/could be specified when allocating an IOASID;
5) The protocol between the vfio group and kvm.

For 1), two alternative names have been mentioned: /dev/iommu and
/dev/ioas. I don't have a strong preference and would like to hear
votes from all stakeholders. /dev/iommu is slightly better imho for two
reasons. First, per AMD's presentation at the last KVM Forum they
implement vIOMMU in hardware and thus need to support user-managed
domains; an iommu uAPI notation might make more sense going forward.
Second, it makes later uAPI naming easier, as 'IOASID' can always be
put as an object, e.g. IOMMU_ALLOC_IOASID instead of
IOASID_ALLOC_IOASID. :)

Another naming open is about IOASID (the software handle for an ioas)
and the associated hardware ID (PASID or substream ID). Jason thought
PASID is defined more from the SVA angle, while ARM's convention sounds
clearer from the device p.o.v. Following this direction, SID/SSID will
be used to replace RID/PASID in this RFC (possibly also implying that
the kernel IOASID allocator should be renamed to the SSID allocator). I
don't have a better alternative. If no one objects, I'll change to this
new naming in the next version.

For 2), Jason prefers not to block it if there is no kernel design
reason. If one device is allowed to bind to multiple IOASID fd's, the
main problem is cross-fd IOASID nesting, e.g. having gpa_ioasid created
in fd1 and giova_ioasid created in fd2 and then nesting them together
(and whether any cross-fd notification is required when handling
invalidations, etc.). We thought that this just adds complexity while
we are unsure about the value of supporting it (when one fd can already
afford all discussed usages). Therefore this RFC proposes that a device
be bound to at most one IOASID fd. Does this rationale make sense?

At the other extreme there was also the thought of whether we should
make it a single I/O address space per IOASID fd. It was discussed in a
previous thread that fd's are too scarce to afford the theoretical 1M
address spaces per device. But let's revisit this and draw a clear
conclusion on whether that option is viable.

For 3), Jason/Jean both think it's cleaner to carry device info in the
uAPI. Actually this was one option we developed in earlier internal
versions of this RFC; later we changed it to the current way based on a
misinterpretation of a previous discussion. Thinking about it more, we
will adopt this suggestion in the next version, for both efficiency
(the I/O page fault path is already long) and security (some faults are
unrecoverable, so the faulting device must be identified/isolated)
reasons.

This implies that VFIO_BOUND_IOASID will be extended to allow the user
to specify a device label. This label will be recorded in /dev/iommu to
serve per-device invalidation requests from the user and to report
per-device fault data to the user. In addition, the vPASID (if provided
by the user) will also be recorded in /dev/iommu so that
vPASID<->pPASID conversion is conducted properly, e.g. an invalidation
request from the user carries a vPASID which must be converted into a
pPASID before calling the iommu driver. Vice versa for raw fault data,
which carries a pPASID while the user expects a vPASID.
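To make this bookkeeping concrete, below is a minimal sketch in plain C
of the per-device state /dev/iommu could record at bind time and the
two translation helpers it implies. All names here (struct
iommu_dev_record, vpasid_to_ppasid(), etc.) are mine for illustration
only, not proposed uAPI or kAPI:

/*
 * Hypothetical per-device state kept by /dev/iommu once
 * VFIO_BOUND_IOASID carries a device label and (optionally) vPASIDs.
 */
#include <stdint.h>
#include <stddef.h>

struct pasid_mapping {
	uint32_t vpasid;	/* PASID as seen by the user/guest */
	uint32_t ppasid;	/* PASID programmed into the physical IOMMU */
};

struct iommu_dev_record {
	uint64_t dev_label;		/* user-provided label from bind time */
	struct pasid_mapping *maps;	/* vPASID<->pPASID table */
	size_t nr_maps;
};

/* Invalidation path: user-supplied vPASID -> pPASID for the iommu driver */
static int vpasid_to_ppasid(const struct iommu_dev_record *dev,
			    uint32_t vpasid, uint32_t *ppasid)
{
	for (size_t i = 0; i < dev->nr_maps; i++) {
		if (dev->maps[i].vpasid == vpasid) {
			*ppasid = dev->maps[i].ppasid;
			return 0;
		}
	}
	return -1;	/* unknown vPASID: reject the request */
}

/* Fault reporting path: raw pPASID from hardware -> vPASID for the user */
static int ppasid_to_vpasid(const struct iommu_dev_record *dev,
			    uint32_t ppasid, uint32_t *vpasid)
{
	for (size_t i = 0; i < dev->nr_maps; i++) {
		if (dev->maps[i].ppasid == ppasid) {
			*vpasid = dev->maps[i].vpasid;
			return 0;
		}
	}
	return -1;	/* no vPASID recorded; report the device label only */
}

The point of the sketch is that both directions are keyed by the device
label recorded at bind time, which is also what lets faults be reported
(and an offending device be isolated) per device.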
For 4), there are two options for specifying the IOASID attributes:

In this RFC, an IOASID has no attributes before it's attached to any
device. After device attach, the user queries capability/format info
about the IOMMU which the device belongs to, and then calls different
ioctl commands to set the attributes for the IOASID (e.g. map/unmap,
bind/unbind user page table, nesting, etc.). This follows how the
underlying iommu-layer API is designed: a domain reports
capability/format info and serves iommu ops only after it's attached
to a device.

Jason suggests having the user specify all attributes about how an
IOASID is expected to work when creating it. This requires /dev/iommu
to provide capability/format info once a device is bound to the ioasid
fd (before creating any IOASID). In concept this should work, since
given a device we can always find its IOMMU. The only gap is the
aforementioned one: the current iommu API is designed per domain
instead of per device. It seems that to close this design open we have
to touch the kAPI design, and Joerg's input is highly appreciated here.

For 5), I'd expect Alex to chime in. Per my understanding, the original
purpose of this protocol is not about I/O address space. It's for KVM
to know whether any device is assigned to this VM and then do something
special (e.g. posted interrupts, EPT cache attributes, etc.). Because
KVM deduces some policy based on the fact of an assigned device, it
needs to hold a reference to the related vfio group. This part is
irrelevant to this RFC. But ARM's VMID usage is related to I/O address
space and thus needs some consideration.

Another strange thing is about PPC. It looks like PPC also leverages
this protocol to do iommu group attach:
kvm_spapr_tce_attach_iommu_group. I don't know why it's done through
KVM instead of the VFIO uAPI in the first place.

Thanks
Kevin