RE: [RFC] /dev/ioasid uAPI proposal

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Thu, 3 Jun 2021 06:39:30 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 7:37 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> 
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> >
> > /*
> >   * Check whether an uAPI extension is supported.
> >   *
> >   * This is for FD-level capabilities, such as locked page pre-registration.
> >   * IOASID-level capabilities are reported through IOASID_GET_INFO.
> >   *
> >   * Return: 0 if not supported, 1 if supported.
> >   */
> > #define IOASID_CHECK_EXTENSION	_IO(IOASID_TYPE, IOASID_BASE + 0)
> 
> 
> > /*
> >   * Register user space memory where DMA is allowed.
> >   *
> >   * It pins user pages and does the locked memory accounting so sub-
> >   * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> >   *
> >   * When this ioctl is not used, one user page might be accounted
> >   * multiple times when it is mapped by multiple IOASIDs which are
> >   * not nested together.
> >   *
> >   * Input parameters:
> >   *	- vaddr;
> >   *	- size;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define IOASID_REGISTER_MEMORY	_IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY	_IO(IOASID_TYPE,
> IOASID_BASE + 2)
> 
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?

yes.

> 
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
> 
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then

Agree. Regarding uAPI there is no difference between SW IOASID and
HW IOASID. The main difference is behind /dev/ioasid, that SW IOASID 
is not linked to the IOMMU.

> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
> 
> Either way this seems like a smart direction
> 
> > /*
> >   * Allocate an IOASID.
> >   *
> >   * IOASID is the FD-local software handle representing an I/O address
> >   * space. Each IOASID is associated with a single I/O page table. User
> >   * must call this ioctl to get an IOASID for every I/O address space that is
> >   * intended to be enabled in the IOMMU.
> >   *
> >   * A newly-created IOASID doesn't accept any command before it is
> >   * attached to a device. Once attached, an empty I/O page table is
> >   * bound with the IOMMU then the user could use either DMA mapping
> >   * or pgtable binding commands to manage this I/O page table.
> 
> Can the IOASID can be populated before being attached?
> 
> >   * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> >   *
> >   * Return: allocated ioasid on success, -errno on failure.
> >   */
> > #define IOASID_ALLOC	_IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE	_IO(IOASID_TYPE, IOASID_BASE + 4)
> 
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?

I'll skip below /dev/ioasid uAPI comments about alloc/bind. It's already 
covered in other sub-threads.

[...]

> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
> 
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?

Exactly

> 
> > /*
> >    * Bind a vfio_device to the specified IOASID fd
> >    *
> >    * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> >    * vfio device should not be bound to multiple ioasid_fd's.
> >    *
> >    * Input parameters:
> >    *  - ioasid_fd;
> >    *
> >    * Return: 0 on success, -errno on failure.
> >    */
> > #define VFIO_BIND_IOASID_FD           _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

As chatted earlier, either an input or output "device id" is fine here.

> 
> >
> > 2.3. KVM uAPI
> > ++++++++++++
> >
> > /*
> >   * Update CPU PASID mapping
> >   *
> >   * This is necessary when ENQCMD will be used in the guest while the
> >   * targeted device doesn't accept the vPASID saved in the CPU MSR.
> >   *
> >   * This command allows user to set/clear the vPASID->pPASID mapping
> >   * in the CPU, by providing the IOASID (and FD) information representing
> >   * the I/O address space marked by this vPASID.
> >   *
> >   * Input parameters:
> >   *	- user_pasid;
> >   *	- ioasid_fd;
> >   *	- ioasid;
> >   */
> > #define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
> > #define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)
> 
> It seems simple enough.. So the physical PASID can only be assigned if
> the user has an IOASID that points at it? Thus it is secure?

Yes. The kernel doesn't trust user to provide a random physical PASID.

> 
> > 3. Sample structures and helper functions
> >
> > Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> >
> > 	struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> > 	int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev
> *dev);
> > 	int ioasid_unregister_device(struct ioasid_dev *dev);
> >
> > An ioasid_ctx is created for each fd:
> >
> > 	struct ioasid_ctx {
> > 		// a list of allocated IOASID data's
> > 		struct list_head		ioasid_list;
> 
> Would expect an xarray
> 
> > 		// a list of registered devices
> > 		struct list_head		dev_list;
> 
> xarray of device_id

list of ioasid_dev objects. device_id will be put inside each object.

> 
> > 		// a list of pre-registered virtual address ranges
> > 		struct list_head		prereg_list;
> 
> Should re-use the existing SW IOASID table, and be an interval tree.

What is the existing SW IOASID table?

> 
> > Each registered device is represented by ioasid_dev:
> >
> > 	struct ioasid_dev {
> > 		struct list_head		next;
> > 		struct ioasid_ctx	*ctx;
> > 		// always be the physical device
> > 		struct device 		*device;
> > 		struct kref		kref;
> > 	};
> >
> > Because we assume one vfio_device connected to at most one ioasid_fd,
> > here ioasid_dev could be embedded in vfio_device and then linked to
> > ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> > device should be the pointer to the parent device. PASID marking this
> > mdev is specified later when VFIO_ATTACH_IOASID.
> 
> Don't embed a struct like this in something with vfio_device - that
> just makes a mess of reference counting by having multiple krefs in
> the same memory block. Keep it as a pointer, the attach operation
> should return a pointer to the above struct.

OK. Also based on the agreement that one device can bind to multiple
fd's, this struct embed approach also doesn't work then.

> 
> > An ioasid_data is created when IOASID_ALLOC, as the main object
> > describing characteristics about an I/O page table:
> >
> > 	struct ioasid_data {
> > 		// link to ioasid_ctx->ioasid_list
> > 		struct list_head		next;
> >
> > 		// the IOASID number
> > 		u32			ioasid;
> >
> > 		// the handle to convey iommu operations
> > 		// hold the pgd (TBD until discussing iommu api)
> > 		struct iommu_domain *domain;
> 
> But at least for the first coding draft I would expect to see this API
> presented with no PASID support and a simple 1:1 with iommu_domain.
> How
> PASID gets modeled is the big TBD, right?

yes. As the starting point we will assume 1:1 association. This should
work for PF/VF. But very soon mdev must be considered. I expect 
we can start conversation on PASID support once this uAPI proposal
is settled down.

> 
> > ioasid_data and iommu_domain have overlapping roles as both are
> > introduced to represent an I/O address space. It is still a big TBD how
> > the two should be corelated or even merged, and whether new iommu
> > ops are required to handle RID+PASID explicitly.
> 
> I think it is OK that the uapi and kernel api have different
> structs. The uapi focused one should hold the uapi related data, which
> is what you've shown here, I think.
> 
> > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> >
> > 	struct attach_info {
> > 		u32	ioasid;
> > 		// If valid, the PASID to be used physically
> > 		u32	pasid;
> > 	};
> > 	int ioasid_device_attach(struct ioasid_dev *dev,
> > 		struct attach_info info);
> > 	int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> 
> Honestly, I still prefer this to be highly explicit as this is where
> all device driver authors get invovled:
> 
> ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> u32 ioasid);
> ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid,
> struct ioasid_dev *dev, u32 ioasid);

Then better naming it as pci_device_attach_ioasid since the 1st parameter
is struct pci_device?

By keeping physical_pasid as a pointer, you want to remove the last helper
function (ioasid_get_global_pasid) so the global pasid is returned along
with the attach function?

> 
> And presumably a variant for ARM non-PCI platform (?) devices.
> 
> This could boil down to a __ioasid_device_attach() as you've shown.
> 
> > A new object is introduced and linked to ioasid_data->attach_data for
> > each successful attach operation:
> >
> > 	struct ioasid_attach_data {
> > 		struct list_head		next;
> > 		struct ioasid_dev	*dev;
> > 		u32 			pasid;
> > 	}
> 
> This should be returned as a pointer and detatch should be:
> 
> int ioasid_device_detach(struct ioasid_attach_data *);

ok

> 
> > As explained in the design section, there is no explicit group enforcement
> > in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> > implicit group check - before every device within an iommu group is
> > attached to this IOASID, the previously-attached devices in this group are
> > put in ioasid_data->partial_devices. The IOASID rejects any command if
> > the partial_devices list is not empty.
> 
> It is simple enough. Would be good to design in a diagnostic string so
> userspace can make sense of the failure. Eg return something like
> -EDEADLK and provide an ioctl 'why did EDEADLK happen' ?
> 

Make sense.

> 
> > Then is the last helper function:
> > 	u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> > 		u32 ioasid, bool alloc);
> >
> > ioasid_get_global_pasid is necessary in scenarios where multiple devices
> > want to share a same PASID value on the attached I/O page table (e.g.
> > when ENQCMD is enabled, as explained in next section). We need a
> > centralized place (ioasid_data->pasid) to hold this value (allocated when
> > first called with alloc=true). vfio device driver calls this function (alloc=
> > true) to get the global PASID for an ioasid before calling ioasid_device_
> > attach. KVM also calls this function (alloc=false) to setup PASID translation
> > structure when user calls KVM_MAP_PASID.
> 
> When/why would the VFIO driver do this? isn't this just some varient
> of pasid_attach?
> 
> ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32
> *physical_pasid, struct ioasid_dev *dev, u32 ioasid);
> 
> ?

will adopt this way.

> 
> > 4. PASID Virtualization
> >
> > When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> > created on the assigned vfio device. This leads to the concepts of
> > "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> > by the guest to mark an GVA address space while pPASID is the one
> > selected by the host and actually routed in the wire.
> >
> > vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
> 
> Should the vPASID programmed into the IOASID before calling
> VFIO_ATTACH_IOASID?

No. As explained in earlier reply, when multiple devices are attached
to the same IOASID the guest may link the page table to different
vPASID# cross attached devices. Anyway vPASID is a per-RID thing.

> 
> > vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> > device, with two factors to be considered:
> >
> > -    Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> >      should be instead converted to a newly-allocated one (vPASID!=
> >      pPASID);
> >
> > -    If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> >      space or a global PASID space (implying sharing pPASID cross devices,
> >      e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> >      as part of the process context);
> 
> This whole section is 4 really confusing. I think it would be more
> understandable to focus on the list below and minimize the vPASID
> 
> > The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> > supported. There are three possible scenarios:
> >
> > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> > policies.)
> 
> This has become unclear. I think this should start by identifying the
> 6 main type of devices and how they can use pPASID/vPASID:
> 
> 0) Device is a RID and cannot issue PASID
> 1) Device is a mdev and cannot issue PASID
> 2) Device is a mdev and programs a single fixed PASID during bind,
>    does not accept PASID from the guest

There are no vPASID per se in above 3 types. So this section only
focus on the latter 3 types. But I can include them in next version
if it sets the tone clearer.

> 
> 3) Device accepts any PASIDs from the guest. No
>    vPASID/pPASID translation is possible. (classic vfio_pci)
> 4) Device accepts any PASID from the guest and has an
>    internal vPASID/pPASID translation (enhanced vfio_pci)

what is enhanced vfio_pci? In my writing this is for mdev
which doesn't support ENQCMD

> 5) Device accepts and PASID from the guest and relys on
>    external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)
> 
> 0-2 don't use vPASID at all
> 
> 3-5 consume a vPASID but handle it differently.
> 
> I think the 3-5 map into what you are trying to explain in the table
> below, which is the rules for allocating the vPASID depending on which
> of device types 3-5 are present and or mixed.

Exactly

> 
> For instance device type 3 requires vPASID == pPASID because it can't
> do translation at all.
> 
> This probably all needs to come through clearly in the /dev/ioasid
> interface. Once the attached devices are labled it would make sense to
> have a 'query device' /dev/ioasid IOCTL to report the details based on
> how the device attached and other information.

This is a good point. Another benefit of having a device label.

for 0-2 the device will report no PASID support. Although this may duplicate
with other information (e.g. PCI PASID cap), this provides a vendor-agnostic
way for reporting details around IOASID.

for 3-5 the device will report PASID support. In these cases the user is
expected to always provide a vPASID. 

for 5 in addition the device will report a requirement on CPU PASID 
translation. For such device the user should talk to KVM to setup the PASID
mapping. This way the user doesn't need to know whether a device is
pdev or mdev. Just follows what device capability reports.

> 
> > 2)  mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> >
> >      PASIDs are also used by kernel to mark the default I/O address space
> >      for mdev, thus cannot be delegated to the guest. Instead, the mdev
> >      driver must allocate a new pPASID for each vPASID (thus vPASID!=
> >      pPASID) and then use pPASID when attaching this mdev to an ioasid.
> 
> I don't understand this at all.. What does "PASIDs are also used by
> the kernel" mean?

Just refer to your type-2. Because PASIDs on this device are already used
by the parent driver to mark mdev, we cannot delegate the per-RID space
to the guest.

> 
> >      The mdev driver needs cache the PASID mapping so in mediation
> >      path vPASID programmed by the guest can be converted to pPASID
> >      before updating the physical MMIO register.
> 
> This is my scenario #4 above. Device and internally virtualize
> vPASID/pPASID - how that is done is up to the device. But this is all
> just labels, when such a device attaches, it should use some specific
> API:
> 
> ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
>  u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

yes.

> 
> And then maintain its internal translation
> 
> >      In previous thread a PASID range split scheme was discussed to support
> >      this combination, but we haven't worked out a clean uAPI design yet.
> >      Therefore in this proposal we decide to not support it, implying the
> >      user should have some intelligence to avoid such scenario. It could be
> >      a TODO task for future.
> 
> It really just boils down to how to allocate the PASIDs to get around
> the bad viommu interface that assumes all PASIDs are usable by all
> devices.

viommu (e.g. Intel VT-d) has good interface to restrict how many PASIDs
are available to the guest. There is a PASID size filed in the viommu 
register. Here the puzzle is just about how to design a good uAPI to 
handle this mixed scenario where vPASID/pPASID are in split range and 
must be linked to the same I/O page table together.

I'll see whether this can be afforded after addressing other comments
in this section.

> 
> > In spite of those subtle considerations, the kernel implementation could
> > start simple, e.g.:
> >
> > -    v==p for pdev;
> > -    v!=p and always use a global PASID pool for all mdev's;
> 
> Regardless all this mess needs to be hidden from the consuming drivers
> with some simple APIs as above. The driver should indicate what its HW
> can do and the PASID #'s that magically come out of /dev/ioasid should
> be appropriate.
> 

Yes, I see how it should work now.

Thanks
Kevin