Re: [RFC] /dev/ioasid uAPI proposal

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> IOASID nesting can be implemented in two ways: hardware nesting and 
> software nesting. With hardware support the child and parent I/O page 
> tables are walked consecutively by the IOMMU to form a nested translation. 
> When it's implemented in software, the ioasid driver is responsible for 
> merging the two-level mappings into a single-level shadow I/O page table. 
> Software nesting requires both the child and parent page tables to be 
> operated through the dma mapping protocol, so any change at either level 
> can be captured by the kernel to update the corresponding shadow mapping.

Why? A SW emulation could do this synchronization during invalidation
processing if the invalidation contained an IOVA range.
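
As a sketch only (the names here are made up, not part of the proposal),
I'm picturing something like:

	/* Hypothetical invalidation command carrying an IOVA range, so a
	 * SW-nesting implementation can re-shadow just the touched range
	 * at invalidation time.
	 */
	struct ioasid_invalidate_range {
		__u32	ioasid;		/* child IOASID being invalidated */
		__u32	flags;		/* e.g. leaf-only vs. whole space */
		__u64	iova;		/* start of the invalidated range */
		__u64	length;		/* size in bytes, ~0ULL for "all" */
	};

With that the kernel only has to re-walk [iova, iova + length) when the
guest invalidates, instead of forcing both levels through the dma
mapping protocol.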

I think this document would be stronger if it included some "Rationale"
statements in key places

> Based on the underlying IOMMU capability one device might be allowed 
> to attach to multiple I/O address spaces, with DMAs accessing them by 
> carrying different routing information. One of them is the default I/O 
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
> remaining ones are routed by RID + Process Address Space ID (PASID) or 
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I wonder if we should just adopt the ARM naming as the API
standard. It is general and doesn't have the SVA connotation that
"Process Address Space ID" carries.
 
> A device must be bound to an IOASID FD before an attach operation can be
> conducted. This is also done through the VFIO uAPI. In this proposal one 
> device should not be bound to multiple FDs. Not sure about the gain of 
> allowing it except adding unnecessary complexity. But if others have a 
> different view we can further discuss.

Unless there is some internal kernel design reason to block it, I
wouldn't go out of my way to prevent it.
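
For illustration, the flow I have in mind is roughly this (the ioctl and
struct names below are invented, not what VFIO will actually call them):

	/* Userspace sketch with hypothetical VFIO ioctls */
	int ioasid_fd = open("/dev/ioasid", O_RDWR);

	/* Bind the device to the IOASID FD first... */
	ioctl(vfio_device_fd, VFIO_DEVICE_BIND_IOASID_FD, &ioasid_fd);

	/* ...then attach it (by RID, or RID + PASID) to a specific IOASID */
	struct vfio_device_attach_ioasid attach = {
		.ioasid = gpa_ioasid,
		.flags  = 0,
	};
	ioctl(vfio_device_fd, VFIO_DEVICE_ATTACH_IOASID, &attach);

If a second FD were ever allowed, the only real question is which FD a
given attach refers to, and the bind step above already answers that.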

> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since the vPASID is 
> directly programmed into the device by guest software. For mdev this 
> implies any guest operation carrying a vPASID on this device must be 
> trapped into VFIO and then converted to a pPASID before being sent to the 
> device. A detailed explanation of the PASID virtualization policies can be 
> found in section 4. 

vPASID and the related policies seem like they need other IOMMU vendors
to take a very careful look. I'm really glad to see this starting to be
spelled out in such a clear way, as it was hard to see from the patches
that there is vendor variation.

> With the above design the /dev/ioasid uAPI is all about I/O address spaces. 
> It doesn't include any device routing information, which is only 
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). 

I agree with Jean-Philippe - at the very least erasing this
information needs a major rationale - but I don't really see why it
must be erased? The HW reports the originating device, so is it just a
matter of labeling the devices attached to the /dev/ioasid FD so it
can be reported to userspace?
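
Something like this is what I'd expect userspace to read back (a sketch
only, the field names are invented):

	/* Hypothetical fault record reported per IOASID via /dev/ioasid */
	struct ioasid_fault_event {
		__u32	ioasid;		/* the faulting I/O address space */
		__u32	device_label;	/* label userspace gave the device at attach */
		__u32	pasid;		/* substream ID/PASID if one was present */
		__u32	flags;		/* read/write, pasid-valid, etc. */
		__u64	fault_addr;	/* faulting IOVA */
	};

That keeps the per-IOASID reporting model but lets the vIOMMU generate
the per-device virtual fault without the kernel erasing the originating
device.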

> multiple attached devices) and then generates a per-device virtual I/O 
> page fault into the guest. Similarly the iotlb invalidation uAPI describes the 
> granularity in the I/O address space (all, or a range), different from the 
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

This seems OK though, I can't think of a reason to allow an IOASID to
be left partially invalidated???
 
> I/O page tables routed through PASID are installed in a per-RID PASID 
> table structure. Some platforms implement the PASID table in the guest 
> physical space (GPA), expecting it to be managed by the guest. The guest
> PASID table is also bound to the IOMMU by attaching to an IOASID, 
> representing the per-RID vPASID space. 
> 
> We propose that the host kernel explicitly track guest I/O page tables 
> even on these platforms, i.e. the same pgtable binding protocol should 
> be used universally on all platforms (with the only difference being who
> actually writes the PASID table). One opinion from previous discussion 
> was treating this special IOASID as a container for all guest I/O page 
> tables i.e. hiding them from the host. 

> However, this significantly 
> violates the philosophy of this /dev/ioasid proposal: it is no longer 
> one IOASID, one address space. Device routing information (indirectly 
> marking hidden I/O spaces) has to be carried in the iotlb invalidation 
> and page faulting uAPI to help connect the vIOMMU with the underlying 
> pIOMMU. This is one design choice to be confirmed with the ARM guys.

I'm confused by this rationale.

For a vIOMMU that has IO page tables in the guest the basic
choices are:
 - Do we have a hypervisor trap to bind the page table or not? (RID
   and PASID may differ here)
 - Do we have a hypervisor trap to invalidate the page tables or not?

If the first is a hypervisor trap then I agree it makes sense to create a
child IOASID that points to each guest page table and manage it
directly. This should not require walking guest page tables as it is
really just informing the HW where the page table lives. HW will walk
them.

If there are no hypervisor traps (does this exist?) then there is no
way to involve the hypervisor here and the child IOASID should simply
be a pointer to the guest's data structure that describes binding. In
this case that IOASID should claim all PASIDs when bound to a
RID. 
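
As a sketch of the trap-based case (nothing below is a real ioctl or
struct name):

	/* Hypothetical: create a child IOASID that just tells the HW where
	 * the guest-managed I/O page table lives; the HW walks it.
	 */
	struct ioasid_bind_user_pgtable {
		__u32	parent_ioasid;	/* the stage-2/GPA IOASID */
		__u32	format;		/* vendor page table format */
		__u64	pgtable_ptr;	/* guest address of the page table root */
	};

In the no-trap case the same kind of binding would instead point at the
guest's whole PASID table, and that IOASID would claim every PASID on
the RID it is attached to.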

Invalidation should be passed up to the IOMMU driver in terms of
the guest table information, and either the HW or software has to walk
the guest tables to make sense of it.

Events from the IOMMU to userspace should be tagged with the attached
device label and the PASID/substream ID. This means there is no issue
with having an 'all PASID' IOASID.

> Notes:
> -   It might be confusing as IOASID is also used in the kernel (drivers/
>     iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
>     to find a better name later to differentiate.

+1 on Jean-Philippe's remarks

> -   PPC has not been considered yet as we haven't had time to fully understand
>     its semantics. According to previous discussion there is some generality 
>     between the PPC window-based scheme and the VFIO type1 semantics. Let's 
>     first reach consensus on this proposal and then further discuss how to 
>     extend it to cover PPC's requirements.

From what I understood PPC is not so bad; nested IOASIDs cover its
preload feature, and it needed a way to specify/query the IOVA range an
IOASID will cover.
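
i.e. something as small as (names invented again):

	/* Hypothetical: set or query the IOVA aperture an IOASID covers,
	 * which is what the PPC window scheme seems to need.
	 */
	struct ioasid_iova_range {
		__u32	ioasid;
		__u32	flags;		/* set vs. query */
		__u64	iova_start;
		__u64	iova_end;
	};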

> -   There is a protocol between vfio group and kvm. Needs to think about
>     how it will be affected following this proposal.

Ugh, I always stop looking when I reach that boundary. Can anyone
summarize what is going on there?

Most likely passing the /dev/ioasid FD into KVM's FD (or vice versa) is
the right answer. Eg if ARM needs to get the VMID from KVM and set it
on the ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
reasonable. Certainly better than the symbol_get stuff we have right
now.
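
Roughly (a made-up ioctl, just to show the direction):

	/* Hypothetical KVM ioctl: hand KVM the ioasid FD so the arch code
	 * can share the VMID (or whatever else it needs) with the IOMMU side.
	 */
	struct kvm_bind_ioasid_fd {
		__s32	ioasid_fd;
		__u32	flags;
	};

	struct kvm_bind_ioasid_fd bind = { .ioasid_fd = ioasid_fd };
	ioctl(kvm_vm_fd, KVM_BIND_IOASID_FD, &bind);

That way the two subsystems talk through FDs rather than the symbol_get
plumbing we have today.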

I will read through the details below in another email

Jason


