On 2020/10/12 4:38 PM, Tian, Kevin wrote:
From: Jason Wang <jasowang@xxxxxxxxxx>
Sent: Monday, September 14, 2020 12:20 PM
[...]
> If it's possible, I would suggest a generic uAPI instead of a VFIO
specific one.
Jason suggested something like /dev/sva. There will be a lot of other
subsystems that could benefit from this (e.g. vDPA).
Have you ever considered this approach?
Hi, Jason,
We studied this approach and the outcome is below. It is a long
write-up, but I could not find a way to condense it further without
losing necessary context. Sorry about that.
Overall, the real purpose of this series is to enable the IOMMU nested
translation capability, with vSVA as one major usage, through
the new uAPIs below:
1) Report/enable IOMMU nested translation capability;
2) Allocate/free PASID;
3) Bind/unbind guest page table;
4) Invalidate IOMMU cache;
5) Handle IOMMU page request/response (not in this series);
1), 3) and 4) form the minimal set for using IOMMU nested translation,
with the other two optional. For example, the guest may enable vSVA on
a device without using PASID. Or, it may bind its gIOVA page table,
which doesn't require page fault support. Finally, all operations can
be applied to either a physical device or a subdevice.
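As a rough illustration of the "minimal set" statement above, the five
uAPI groups can be modelled as capability bits. This is only a sketch:
the enum names, NESTING_MINIMAL_SET and nesting_usable() are
hypothetical and do not come from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the five uAPI groups listed above; these
 * names are illustrative only and are not part of the patch series. */
enum nesting_op {
    NESTING_OP_CAP        = 1u << 0, /* 1) report/enable nested translation */
    NESTING_OP_PASID      = 1u << 1, /* 2) allocate/free PASID */
    NESTING_OP_BIND_PGTBL = 1u << 2, /* 3) bind/unbind guest page table */
    NESTING_OP_CACHE_INV  = 1u << 3, /* 4) invalidate IOMMU cache */
    NESTING_OP_PAGE_REQ   = 1u << 4, /* 5) page request/response */
};

#define NESTING_ALL_OPS     0x1fu
#define NESTING_MINIMAL_SET \
    (NESTING_OP_CAP | NESTING_OP_BIND_PGTBL | NESTING_OP_CACHE_INV)

/* Returns true when groups 1), 3) and 4) are all available; groups 2)
 * and 5) are optional, as described in the text. */
static bool nesting_usable(uint32_t supported_ops)
{
    assert((supported_ops & ~NESTING_ALL_OPS) == 0); /* no unknown bits */
    return (supported_ops & NESTING_MINIMAL_SET) == NESTING_MINIMAL_SET;
}
```

With this model, a setup offering only 1), 3) and 4) passes the check,
while one that lacks cache invalidation does not.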
We then evaluated, for each uAPI, whether generalizing it is a good
idea, both conceptually and in terms of complexity.
First, unlike other uAPIs which are all backed by iommu_ops, PASID
allocation/free is through the IOASID sub-system.
A question here: is IOASID expected to be the single management
interface for PASID?
(I'm asking since there are already vendor-specific, IDA-based PASID
allocators, e.g. amdgpu_pasid_alloc().)
From this angle
we feel generalizing PASID management does make some sense.
First, PASID is just a number and not related to any device before
it's bound to a page table and IOMMU domain. Second, PASID is a
global resource (at least on Intel VT-d),
I think we need a definition of "global" here. It looks to me that for
VT-d the PASID table is per device.
Another question: is it possible to have two DMAR hardware units? (I can
see two even in my laptop.) In that case, is PASID still a global
resource?
while having separate VFIO/
VDPA allocation interfaces may easily cause confusion in userspace,
e.g. which interface should be used if both VFIO and VDPA devices exist.
Moreover, a unified interface allows centralized control over how
many PASIDs are allowed per process.
Yes.
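To make the "centralized control over how many PASIDs are allowed per
process" point concrete, here is a toy allocator with a single pool and
a per-process quota. It is a sketch under stated assumptions, not the
IOASID implementation: all names, the pool size, and treating PASID 0
as reserved are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PASID_MAX     64u        /* toy size; PCIe PASIDs are up to 20 bits */
#define PASID_INVALID UINT32_MAX
#define MAX_PROCS     8u         /* hypothetical number of process slots */

struct pasid_pool {
    int      owner[PASID_MAX];   /* owning process slot, or -1 if free */
    uint32_t used[MAX_PROCS];    /* PASIDs currently held per process */
    uint32_t quota;              /* per-process limit */
};

static void pasid_pool_init(struct pasid_pool *p, uint32_t quota)
{
    for (uint32_t i = 0; i < PASID_MAX; i++)
        p->owner[i] = -1;
    for (uint32_t i = 0; i < MAX_PROCS; i++)
        p->used[i] = 0;
    p->quota = quota;
}

/* Allocate from the single shared pool, enforcing the per-process quota.
 * PASID 0 is skipped (assumed reserved in this sketch). */
static uint32_t pasid_alloc(struct pasid_pool *p, unsigned int proc)
{
    assert(proc < MAX_PROCS);
    if (p->used[proc] >= p->quota)
        return PASID_INVALID;
    for (uint32_t id = 1; id < PASID_MAX; id++) {
        if (p->owner[id] < 0) {
            p->owner[id] = (int)proc;
            p->used[proc]++;
            return id;
        }
    }
    return PASID_INVALID;        /* pool exhausted */
}

/* Only the owning process may free a PASID. */
static bool pasid_free(struct pasid_pool *p, unsigned int proc, uint32_t pasid)
{
    assert(proc < MAX_PROCS);
    if (pasid == 0 || pasid >= PASID_MAX || p->owner[pasid] != (int)proc)
        return false;
    p->owner[pasid] = -1;
    p->used[proc]--;
    return true;
}
```

Because every allocation goes through one pool, the quota holds no
matter whether the request originates from a VFIO or a VDPA device.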
One unclear part of this generalization is permission.
Do we open this interface to any process or only to those which
have assigned devices? If the latter, what would be the mechanism
to coordinate between this new interface and specific passthrough
frameworks?
I'm not sure, but if you just want a permission check, you could
probably introduce a new capability (CAP_XXX) for this.
A trickier case: vSVA support on ARM (Eric/Jean,
please correct me) plans to use a per-device PASID namespace, which
is built on a bind_pasid_table iommu callback, to allow the guest to
fully manage its PASIDs on a given passthrough device.
I see, so I think the answer is to prepare for namespace support
from the start. (By the way, I don't see how namespaces are handled in
the current IOASID module.)
I'm not sure
how such a requirement can be unified without involving passthrough
frameworks, or whether ARM could also switch to the global PASID
style...
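The difference between a global PASID space and per-device PASID
namespaces can be sketched as follows. This is an illustrative model
only: struct pasid_ns, struct sketch_dev and the helper names are
hypothetical and do not reflect the IOASID module or any ARM SMMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NS_PASID_MAX 64u                /* toy namespace size */

/* One PASID namespace: a bitmap of values currently in use. */
struct pasid_ns {
    uint64_t in_use;
};

/* A device points at the namespace it allocates from. In the global
 * style all devices share one namespace; in the per-device style each
 * device has its own. */
struct sketch_dev {
    const char      *name;
    struct pasid_ns *ns;
};

/* Claim a specific PASID value in a namespace; fails if already taken. */
static bool ns_claim(struct pasid_ns *ns, uint32_t pasid)
{
    assert(pasid < NS_PASID_MAX);
    uint64_t bit = UINT64_C(1) << pasid;
    if (ns->in_use & bit)
        return false;
    ns->in_use |= bit;
    return true;
}

static bool dev_claim_pasid(struct sketch_dev *dev, uint32_t pasid)
{
    return ns_claim(dev->ns, pasid);
}
```

With a shared namespace, a second device claiming the same value fails;
with per-device namespaces, both claims succeed, which is why a single
allocator would need namespace support to cover both styles.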
Second, IOMMU nested translation is a per IOMMU domain
capability. Since IOMMU domains are managed by VFIO/VDPA
(alloc/free domain, attach/detach device, set/get domain attribute,
etc.), reporting/enabling the nesting capability is a natural
extension to the domain uAPI of existing passthrough frameworks.
Actually, VFIO already includes a nesting enable interface even
before this series. So it doesn't make sense to generalize this uAPI
out.
So my understanding is that VFIO already:
1) uses multiple fds
2) separates IOMMU ops into a dedicated container fd (type1 iommu)
3) provides an API to associate devices/groups with a container
And all that this series proposes is to reuse the container fd. It
should be possible to replace e.g. the type1 IOMMU with a unified module.
Then the tricky part comes with the remaining operations (3/4/5),
which are all backed by iommu_ops and thus effective only within an
IOMMU domain. To generalize them, the first thing is to find a way
to associate the sva_FD (opened through generic /dev/sva) with an
IOMMU domain that is created by VFIO/VDPA. The second thing is
to replicate the {domain<->device/subdevice} association in the /dev/sva
path, because some operations (e.g. page fault) are triggered/handled
per device/subdevice.
Is there any reason that the #PF cannot be handled via the SVA fd?
Therefore, /dev/sva must provide both per-domain and per-device
uAPIs, similar to what VFIO/VDPA already do. Moreover, mapping a page
fault to a subdevice requires pre-registering subdevice fault data with
the IOMMU layer when binding the guest page table, while such fault
data can only be retrieved from the parent driver through VFIO/VDPA.
However, we failed to find a good way to handle even the first step,
domain association. The IOMMU domains are not exposed to
userspace, and there is no 1:1 mapping between domain and device.
In VFIO, all devices within the same VFIO container share the address
space, but they may be organized in multiple IOMMU domains based
on their bus type. How userspace should learn the domain information
and open an sva_FD for each domain is the main
problem here.
The SVA fd is not necessarily opened by userspace. It could be obtained
through subsystem-specific uAPIs.
E.g. for vDPA, if a vDPA device contains several vSVA-capable domains, we can:
1) introduce a uAPI for userspace to learn the number of vSVA-capable domains
2) introduce e.g. VDPA_GET_SVA_FD to get the fd for each vSVA-capable domain
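The two steps above could be used from userspace roughly as follows.
This is a mock, not a real interface: VDPA_GET_SVA_FD is only a
suggested name in this thread, and the struct and mock_* helpers below
are hypothetical stand-ins for whatever ioctls would eventually exist.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_DOMAINS 4

/* Stand-in for a vDPA device; a real device would sit behind an fd. */
struct mock_vdpa_dev {
    int nr_sva_domains;             /* number of vSVA-capable domains */
    int sva_fd[MOCK_MAX_DOMAINS];   /* pretend per-domain SVA fds */
};

/* Step 1 (hypothetical): query the number of vSVA-capable domains. */
static int mock_vdpa_get_sva_domain_count(const struct mock_vdpa_dev *dev)
{
    return dev->nr_sva_domains;
}

/* Step 2 (hypothetical, modelled on the suggested VDPA_GET_SVA_FD):
 * return the SVA fd for one domain, or -1 for a bad index. */
static int mock_vdpa_get_sva_fd(const struct mock_vdpa_dev *dev, int domain)
{
    if (domain < 0 || domain >= dev->nr_sva_domains)
        return -1;
    return dev->sva_fd[domain];
}

/* Collect one SVA fd per domain; returns the count, or -1 if the
 * caller's buffer is too small. */
static int collect_sva_fds(const struct mock_vdpa_dev *dev, int *out, size_t cap)
{
    int n = mock_vdpa_get_sva_domain_count(dev);

    assert(n >= 0 && n <= MOCK_MAX_DOMAINS);
    if ((size_t)n > cap)
        return -1;
    for (int i = 0; i < n; i++)
        out[i] = mock_vdpa_get_sva_fd(dev, i);
    return n;
}
```

The point of the flow is that userspace never has to discover IOMMU
domains on its own; the subsystem hands out one fd per domain.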
In the end we realized that such generalization doesn't
really lead to a clear design and instead requires tight coordination
between /dev/sva and VFIO/VDPA for almost every new uAPI
(especially about synchronization when the domain/device
association is changed or when the device/subdevice is being
reset/drained). Finally, properly using the two interfaces on the
assigned device may become a usability burden for userspace.
Based on the above analysis, we feel that generalizing just PASID
management might be worth looking at, while the remaining operations
are better kept as VFIO/VDPA-specific uAPIs. In concept, those are
just a subset of the page table management capabilities that an
IOMMU domain affords. Since all other aspects of the IOMMU domain
are already managed by VFIO/VDPA, continuing down this path for the new
nesting capability sounds natural. Another option is to generalize the
entire IOMMU domain management (more or less the entire
vfio_iommu_type1), but it's unclear whether such an intrusive change is
worthwhile (especially when VFIO and VDPA already take different routes
even in the legacy mapping uAPI: map/unmap vs. IOTLB).
Thoughts?
I'm OK with starting with unified PASID management and considering the
unified vSVA/vIOMMU uAPI later.
Thanks
Thanks
Kevin