> From: Jason Gunthorpe <jgg@xxxxxxxxxx> > Sent: Saturday, May 29, 2021 7:37 AM > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote: > > > 2.1. /dev/ioasid uAPI > > +++++++++++++++++ > > > > /* > > * Check whether an uAPI extension is supported. > > * > > * This is for FD-level capabilities, such as locked page pre-registration. > > * IOASID-level capabilities are reported through IOASID_GET_INFO. > > * > > * Return: 0 if not supported, 1 if supported. > > */ > > #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0) > > > > /* > > * Register user space memory where DMA is allowed. > > * > > * It pins user pages and does the locked memory accounting so sub- > > * sequent IOASID_MAP/UNMAP_DMA calls get faster. > > * > > * When this ioctl is not used, one user page might be accounted > > * multiple times when it is mapped by multiple IOASIDs which are > > * not nested together. > > * > > * Input parameters: > > * - vaddr; > > * - size; > > * > > * Return: 0 on success, -errno on failure. > > */ > > #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1) > > #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, > IOASID_BASE + 2) > > So VA ranges are pinned and stored in a tree and later references to > those VA ranges by any other IOASID use the pin cached in the tree? yes. > > It seems reasonable and is similar to the ioasid parent/child I > suggested for PPC. > > IMHO this should be merged with the all SW IOASID that is required for > today's mdev drivers. If this can be done while keeping this uAPI then Agree. Regarding uAPI there is no difference between SW IOASID and HW IOASID. The main difference is behind /dev/ioasid, that SW IOASID is not linked to the IOMMU. > great, otherwise I don't think it is so bad to weakly nest a physical > IOASID under a SW one just to optimize page pinning. > > Either way this seems like a smart direction > > > /* > > * Allocate an IOASID. > > * > > * IOASID is the FD-local software handle representing an I/O address > > * space. Each IOASID is associated with a single I/O page table. User > > * must call this ioctl to get an IOASID for every I/O address space that is > > * intended to be enabled in the IOMMU. > > * > > * A newly-created IOASID doesn't accept any command before it is > > * attached to a device. Once attached, an empty I/O page table is > > * bound with the IOMMU then the user could use either DMA mapping > > * or pgtable binding commands to manage this I/O page table. > > Can the IOASID can be populated before being attached? > > > * Device attachment is initiated through device driver uAPI (e.g. VFIO) > > * > > * Return: allocated ioasid on success, -errno on failure. > > */ > > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3) > > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4) > > I assume alloc will include quite a big structure to satisfy the > various vendor needs? I'll skip below /dev/ioasid uAPI comments about alloc/bind. It's already covered in other sub-threads. [...] > > > > 2.2. /dev/vfio uAPI > > ++++++++++++++++ > > To be clear you mean the 'struct vfio_device' API, these are not > IOCTLs on the container or group? Exactly > > > /* > > * Bind a vfio_device to the specified IOASID fd > > * > > * Multiple vfio devices can be bound to a single ioasid_fd, but a single > > * vfio device should not be bound to multiple ioasid_fd's. > > * > > * Input parameters: > > * - ioasid_fd; > > * > > * Return: 0 on success, -errno on failure. > > */ > > #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22) > > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23) > > This is where it would make sense to have an output "device id" that > allows /dev/ioasid to refer to this "device" by number in events and > other related things. As chatted earlier, either an input or output "device id" is fine here. > > > > > 2.3. KVM uAPI > > ++++++++++++ > > > > /* > > * Update CPU PASID mapping > > * > > * This is necessary when ENQCMD will be used in the guest while the > > * targeted device doesn't accept the vPASID saved in the CPU MSR. > > * > > * This command allows user to set/clear the vPASID->pPASID mapping > > * in the CPU, by providing the IOASID (and FD) information representing > > * the I/O address space marked by this vPASID. > > * > > * Input parameters: > > * - user_pasid; > > * - ioasid_fd; > > * - ioasid; > > */ > > #define KVM_MAP_PASID _IO(KVMIO, 0xf0) > > #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1) > > It seems simple enough.. So the physical PASID can only be assigned if > the user has an IOASID that points at it? Thus it is secure? Yes. The kernel doesn't trust user to provide a random physical PASID. > > > 3. Sample structures and helper functions > > > > Three helper functions are provided to support VFIO_BIND_IOASID_FD: > > > > struct ioasid_ctx *ioasid_ctx_fdget(int fd); > > int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev > *dev); > > int ioasid_unregister_device(struct ioasid_dev *dev); > > > > An ioasid_ctx is created for each fd: > > > > struct ioasid_ctx { > > // a list of allocated IOASID data's > > struct list_head ioasid_list; > > Would expect an xarray > > > // a list of registered devices > > struct list_head dev_list; > > xarray of device_id list of ioasid_dev objects. device_id will be put inside each object. > > > // a list of pre-registered virtual address ranges > > struct list_head prereg_list; > > Should re-use the existing SW IOASID table, and be an interval tree. What is the existing SW IOASID table? > > > Each registered device is represented by ioasid_dev: > > > > struct ioasid_dev { > > struct list_head next; > > struct ioasid_ctx *ctx; > > // always be the physical device > > struct device *device; > > struct kref kref; > > }; > > > > Because we assume one vfio_device connected to at most one ioasid_fd, > > here ioasid_dev could be embedded in vfio_device and then linked to > > ioasid_ctx->dev_list when registration succeeds. For mdev the struct > > device should be the pointer to the parent device. PASID marking this > > mdev is specified later when VFIO_ATTACH_IOASID. > > Don't embed a struct like this in something with vfio_device - that > just makes a mess of reference counting by having multiple krefs in > the same memory block. Keep it as a pointer, the attach operation > should return a pointer to the above struct. OK. Also based on the agreement that one device can bind to multiple fd's, this struct embed approach also doesn't work then. > > > An ioasid_data is created when IOASID_ALLOC, as the main object > > describing characteristics about an I/O page table: > > > > struct ioasid_data { > > // link to ioasid_ctx->ioasid_list > > struct list_head next; > > > > // the IOASID number > > u32 ioasid; > > > > // the handle to convey iommu operations > > // hold the pgd (TBD until discussing iommu api) > > struct iommu_domain *domain; > > But at least for the first coding draft I would expect to see this API > presented with no PASID support and a simple 1:1 with iommu_domain. > How > PASID gets modeled is the big TBD, right? yes. As the starting point we will assume 1:1 association. This should work for PF/VF. But very soon mdev must be considered. I expect we can start conversation on PASID support once this uAPI proposal is settled down. > > > ioasid_data and iommu_domain have overlapping roles as both are > > introduced to represent an I/O address space. It is still a big TBD how > > the two should be corelated or even merged, and whether new iommu > > ops are required to handle RID+PASID explicitly. > > I think it is OK that the uapi and kernel api have different > structs. The uapi focused one should hold the uapi related data, which > is what you've shown here, I think. > > > Two helper functions are provided to support VFIO_ATTACH_IOASID: > > > > struct attach_info { > > u32 ioasid; > > // If valid, the PASID to be used physically > > u32 pasid; > > }; > > int ioasid_device_attach(struct ioasid_dev *dev, > > struct attach_info info); > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid); > > Honestly, I still prefer this to be highly explicit as this is where > all device driver authors get invovled: > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev, > u32 ioasid); > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid, > struct ioasid_dev *dev, u32 ioasid); Then better naming it as pci_device_attach_ioasid since the 1st parameter is struct pci_device? By keeping physical_pasid as a pointer, you want to remove the last helper function (ioasid_get_global_pasid) so the global pasid is returned along with the attach function? > > And presumably a variant for ARM non-PCI platform (?) devices. > > This could boil down to a __ioasid_device_attach() as you've shown. > > > A new object is introduced and linked to ioasid_data->attach_data for > > each successful attach operation: > > > > struct ioasid_attach_data { > > struct list_head next; > > struct ioasid_dev *dev; > > u32 pasid; > > } > > This should be returned as a pointer and detatch should be: > > int ioasid_device_detach(struct ioasid_attach_data *); ok > > > As explained in the design section, there is no explicit group enforcement > > in /dev/ioasid uAPI or helper functions. But the ioasid driver does > > implicit group check - before every device within an iommu group is > > attached to this IOASID, the previously-attached devices in this group are > > put in ioasid_data->partial_devices. The IOASID rejects any command if > > the partial_devices list is not empty. > > It is simple enough. Would be good to design in a diagnostic string so > userspace can make sense of the failure. Eg return something like > -EDEADLK and provide an ioctl 'why did EDEADLK happen' ? > Make sense. > > > Then is the last helper function: > > u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx, > > u32 ioasid, bool alloc); > > > > ioasid_get_global_pasid is necessary in scenarios where multiple devices > > want to share a same PASID value on the attached I/O page table (e.g. > > when ENQCMD is enabled, as explained in next section). We need a > > centralized place (ioasid_data->pasid) to hold this value (allocated when > > first called with alloc=true). vfio device driver calls this function (alloc= > > true) to get the global PASID for an ioasid before calling ioasid_device_ > > attach. KVM also calls this function (alloc=false) to setup PASID translation > > structure when user calls KVM_MAP_PASID. > > When/why would the VFIO driver do this? isn't this just some varient > of pasid_attach? > > ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32 > *physical_pasid, struct ioasid_dev *dev, u32 ioasid); > > ? will adopt this way. > > > 4. PASID Virtualization > > > > When guest SVA (vSVA) is enabled, multiple GVA address spaces are > > created on the assigned vfio device. This leads to the concepts of > > "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned > > by the guest to mark an GVA address space while pPASID is the one > > selected by the host and actually routed in the wire. > > > > vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID. > > Should the vPASID programmed into the IOASID before calling > VFIO_ATTACH_IOASID? No. As explained in earlier reply, when multiple devices are attached to the same IOASID the guest may link the page table to different vPASID# cross attached devices. Anyway vPASID is a per-RID thing. > > > vfio device driver translates vPASID to pPASID before calling ioasid_attach_ > > device, with two factors to be considered: > > > > - Whether vPASID is directly used (vPASID==pPASID) in the wire, or > > should be instead converted to a newly-allocated one (vPASID!= > > pPASID); > > > > - If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID > > space or a global PASID space (implying sharing pPASID cross devices, > > e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR > > as part of the process context); > > This whole section is 4 really confusing. I think it would be more > understandable to focus on the list below and minimize the vPASID > > > The actual policy depends on pdev vs. mdev, and whether ENQCMD is > > supported. There are three possible scenarios: > > > > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization > > policies.) > > This has become unclear. I think this should start by identifying the > 6 main type of devices and how they can use pPASID/vPASID: > > 0) Device is a RID and cannot issue PASID > 1) Device is a mdev and cannot issue PASID > 2) Device is a mdev and programs a single fixed PASID during bind, > does not accept PASID from the guest There are no vPASID per se in above 3 types. So this section only focus on the latter 3 types. But I can include them in next version if it sets the tone clearer. > > 3) Device accepts any PASIDs from the guest. No > vPASID/pPASID translation is possible. (classic vfio_pci) > 4) Device accepts any PASID from the guest and has an > internal vPASID/pPASID translation (enhanced vfio_pci) what is enhanced vfio_pci? In my writing this is for mdev which doesn't support ENQCMD > 5) Device accepts and PASID from the guest and relys on > external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev) > > 0-2 don't use vPASID at all > > 3-5 consume a vPASID but handle it differently. > > I think the 3-5 map into what you are trying to explain in the table > below, which is the rules for allocating the vPASID depending on which > of device types 3-5 are present and or mixed. Exactly > > For instance device type 3 requires vPASID == pPASID because it can't > do translation at all. > > This probably all needs to come through clearly in the /dev/ioasid > interface. Once the attached devices are labled it would make sense to > have a 'query device' /dev/ioasid IOCTL to report the details based on > how the device attached and other information. This is a good point. Another benefit of having a device label. for 0-2 the device will report no PASID support. Although this may duplicate with other information (e.g. PCI PASID cap), this provides a vendor-agnostic way for reporting details around IOASID. for 3-5 the device will report PASID support. In these cases the user is expected to always provide a vPASID. for 5 in addition the device will report a requirement on CPU PASID translation. For such device the user should talk to KVM to setup the PASID mapping. This way the user doesn't need to know whether a device is pdev or mdev. Just follows what device capability reports. > > > 2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global) > > > > PASIDs are also used by kernel to mark the default I/O address space > > for mdev, thus cannot be delegated to the guest. Instead, the mdev > > driver must allocate a new pPASID for each vPASID (thus vPASID!= > > pPASID) and then use pPASID when attaching this mdev to an ioasid. > > I don't understand this at all.. What does "PASIDs are also used by > the kernel" mean? Just refer to your type-2. Because PASIDs on this device are already used by the parent driver to mark mdev, we cannot delegate the per-RID space to the guest. > > > The mdev driver needs cache the PASID mapping so in mediation > > path vPASID programmed by the guest can be converted to pPASID > > before updating the physical MMIO register. > > This is my scenario #4 above. Device and internally virtualize > vPASID/pPASID - how that is done is up to the device. But this is all > just labels, when such a device attaches, it should use some specific > API: > > ioasid_pci_device_vpasid_attach(struct pci_device *pdev, > u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid); yes. > > And then maintain its internal translation > > > In previous thread a PASID range split scheme was discussed to support > > this combination, but we haven't worked out a clean uAPI design yet. > > Therefore in this proposal we decide to not support it, implying the > > user should have some intelligence to avoid such scenario. It could be > > a TODO task for future. > > It really just boils down to how to allocate the PASIDs to get around > the bad viommu interface that assumes all PASIDs are usable by all > devices. viommu (e.g. Intel VT-d) has good interface to restrict how many PASIDs are available to the guest. There is a PASID size filed in the viommu register. Here the puzzle is just about how to design a good uAPI to handle this mixed scenario where vPASID/pPASID are in split range and must be linked to the same I/O page table together. I'll see whether this can be afforded after addressing other comments in this section. > > > In spite of those subtle considerations, the kernel implementation could > > start simple, e.g.: > > > > - v==p for pdev; > > - v!=p and always use a global PASID pool for all mdev's; > > Regardless all this mess needs to be hidden from the consuming drivers > with some simple APIs as above. The driver should indicate what its HW > can do and the PASID #'s that magically come out of /dev/ioasid should > be appropriate. > Yes, I see how it should work now. Thanks Kevin