> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 3:59 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through the legacy container/group interface. For illustration purposes
> > those devices are just called dev[1...N]:
> >
> >         device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> >         ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity the examples below are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
>
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, I want to confirm here. Per earlier discussion we were under the
impression that you want VFIO to be a pure device driver, with
container/group used only by legacy applications. From this comment, are
you suggesting that VFIO can still keep the container/group concepts, and
userspace simply deprecates the vfio iommu uAPI (e.g. VFIO_SET_IOMMU) by
using /dev/ioasid instead (which has a simple policy that an IOASID will
reject commands if a partially-attached group exists)?

> >
> > Three types of IOASIDs are considered:
> >
> >         gpa_ioasid[1...N]:      for GPA address space
> >         giova_ioasid[1...N]:    for guest IOVA address space
> >         gva_ioasid[1...N]:      for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant only as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). The VFIO device driver in the kernel will figure out
> > the associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and
> > IOASID_GET_INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through the DMA mapping protocol:
> >
> >         /* Bind device to IOASID fd */
> >         device_fd = open("/dev/vfio/devices/dev1", mode);
> >         ioasid_fd = open("/dev/ioasid", mode);
> >         ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> >         /* Attach device to IOASID */
> >         gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >         at_data = { .ioasid = gpa_ioasid };
> >         ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup GPA mapping */
> >         dma_map = {
> >                 .ioasid = gpa_ioasid;
> >                 .iova   = 0;            // GPA
> >                 .vaddr  = 0x40000000;   // HVA
> >                 .size   = 1GB;
> >         };
> >         ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If more devices than just dev1 are assigned to the guest, the user follows
> > the above sequence to attach the other devices to the same gpa_ioasid,
> > i.e. sharing the GPA address space across all assigned devices.
>
> eg
>
>  device2_fd = open("/dev/vfio/devices/dev1", mode);
>  ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>  ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
>
> Right?

Exactly, except a small typo ('dev1' -> 'dev2'). :)
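i.e. with the path fixed it would read (nothing new here, just the corrected
version of the same sequence, with at_data still carrying gpa_ioasid):

        device2_fd = open("/dev/vfio/devices/dev2", mode);
        ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
        ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);

so dev2 ends up sharing the GPA address space with dev1.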
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in
> > pass-through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu needs to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> >         device_fd1 = open("/dev/vfio/devices/dev1", mode);
> >         device_fd2 = open("/dev/vfio/devices/dev2", mode);
> >         ioasid_fd = open("/dev/ioasid", mode);
> >         ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> >         ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> >         /* pre-register the virtual address range for accounting */
> >         mem_info = { .vaddr = 0x40000000; .size = 1GB };
> >         ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> >         /* Attach dev1 and dev2 to gpa_ioasid */
> >         gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >         at_data = { .ioasid = gpa_ioasid };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup GPA mapping */
> >         dma_map = {
> >                 .ioasid = gpa_ioasid;
> >                 .iova   = 0;            // GPA
> >                 .vaddr  = 0x40000000;   // HVA
> >                 .size   = 1GB;
> >         };
> >         ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> >         /* After boot, guest enables a GIOVA space for dev2 */
> >         giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> >         /* First detach dev2 from the previous address space */
> >         at_data = { .ioasid = gpa_ioasid };
> >         ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> >         /* Then attach dev2 to the new address space */
> >         at_data = { .ioasid = giova_ioasid };
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup a shadow DMA mapping according to vIOMMU
> >          * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> >          */
>
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' here means the merged mapping: GIOVA (0x2000) -> HVA (0x40001000).

> >         dma_map = {
> >                 .ioasid = giova_ioasid;
> >                 .iova   = 0x2000;       // GIOVA
> >                 .vaddr  = 0x40001000;   // HVA
>
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

Yes.
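To spell out the composition (just an illustration of the userspace-side
bookkeeping; viommu_translate() and gpa_to_hva() are made-up Qemu-side
helpers, not part of the proposed uAPI):

        gpa = viommu_translate(giova);  /* walk vIOMMU table: 0x2000 -> 0x1000 */
        hva = gpa_to_hva(gpa);          /* Qemu memory map: 0x1000 -> 0x40001000 */

        dma_map = {
                .ioasid = giova_ioasid;
                .iova   = giova;        // GIOVA
                .vaddr  = hva;          // HVA
                .size   = 4KB;
        };
        ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

The GPA only shows up transiently in userspace; without nesting the kernel
only ever sees the merged GIOVA -> HVA mapping.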
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of the user that creates
> > the shadow mapping.
> >
> > The flow before the guest boots is the same as 5.2, except for one point.
> > Because giova_ioasid is nested on gpa_ioasid, locked page accounting is
> > only conducted for gpa_ioasid. So it's not necessary to pre-register
> > virtual memory.
> >
> > To save space we only list the steps after boot (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before the guest boots):
> >
> >         /* After boot */
> >         /* Make GIOVA space nested on GPA space */
> >         giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev2 to the new address space (child)
> >          * Note dev2 is still attached to gpa_ioasid (parent)
> >          */
> >         at_data = { .ioasid = giova_ioasid };
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> >          * merged by the kernel with the GPA->HVA mapping of gpa_ioasid
> >          * to form a shadow mapping.
> >          */
> >         dma_map = {
> >                 .ioasid = giova_ioasid;
> >                 .iova   = 0x2000;       // GIOVA
> >                 .vaddr  = 0x1000;       // GPA
> >                 .size   = 4KB;
> >         };
> >         ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient yes.
>
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> >         /* After boot */
> >         /* Make GIOVA space nested on GPA space */
> >         giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev2 to the new address space (child)
> >          * Note dev2 is still attached to gpa_ioasid (parent)
> >          */
> >         at_data = { .ioasid = giova_ioasid };
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Bind guest I/O page table */
> >         bind_data = {
> >                 .ioasid = giova_ioasid;
> >                 .addr   = giova_pgtable;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
>
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boot the guest further creates a GVA address space (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, the user should avoid exposing ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (pdev or mdev), except for
> > one additional step to call KVM for an ENQCMD-capable mdev:
> >
> >         /* After boot */
> >         /* Make GVA space nested on GPA space */
> >         gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev1 to the new address space and specify vPASID */
> >         at_data = {
> >                 .ioasid     = gva_ioasid;
> >                 .flag       = IOASID_ATTACH_USER_PASID;
> >                 .user_pasid = gpasid1;
> >         };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane for the guest to link an I/O page table to
different vPASIDs on dev1 and dev2. The IOMMU doesn't mandate that
multiple devices sharing an I/O page table must use the same PASID#.
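The example above only attaches dev1, but hypothetically, if the guest
assigned the same page table to dev2 under a different vPASID, the flow
would simply be (a sketch only, reusing the names above; gpasid2 is a
second guest-allocated PASID, not part of the original example):

        /* dev1: vPASID gpasid1 for this I/O page table */
        at_data = {
                .ioasid     = gva_ioasid;
                .flag       = IOASID_ATTACH_USER_PASID;
                .user_pasid = gpasid1;
        };
        ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

        /* dev2: a different vPASID (gpasid2) for the same IOASID */
        at_data.user_pasid = gpasid2;
        ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

So the vPASID is really a per-(device, IOASID) attribute, which is why it
sits on the attach call rather than on the gva_ioasid itself.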
> >         /* if dev1 is an ENQCMD-capable mdev, update the CPU PASID
> >          * translation structure through KVM
> >          */
> >         pa_data = {
> >                 .ioasid_fd   = ioasid_fd;
> >                 .ioasid      = gva_ioasid;
> >                 .guest_pasid = gpasid1;
> >         };
> >         ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> Make sense
>
> >         /* Bind guest I/O page table */
> >         bind_data = {
> >                 .ioasid = gva_ioasid;
> >                 .addr   = gva_pgtable1;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just the high-level flow from the host IOMMU driver
> > to the guest IOMMU driver and back.)
> >
> > - Host IOMMU driver receives a page request with raw fault_data {rid,
> > pasid, addr};
> >
> > - Host IOMMU driver identifies the faulting I/O page table according to
> > the information registered by the IOASID fault handler;
> >
> > - IOASID fault handler is called with the raw fault_data (rid, pasid, addr),
> > which is saved in ioasid_data->fault_data (used for the response);
> >
> > - IOASID fault handler generates a user fault_data (ioasid, addr), links it
> > to the shared ring buffer and triggers the eventfd to userspace;
>
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

Yes, I acknowledged this input from you and Jean about page fault and
bind_pasid_table, and summarized it as open#3 in another mail.
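Just to illustrate the direction (the uAPI is TBD as noted above, and the
field names below are made up for discussion, not a proposal): with the
device-label idea, the user-visible fault record pulled off the shared
ring could look roughly like:

        struct ioasid_fault_data {
                __u32   ioasid;     /* faulting I/O address space */
                __u32   dev_label;  /* label given at VFIO_BIND_IOASID_FD time */
                __u32   pasid;      /* meaningful if the label covers a whole RID */
                __u64   addr;       /* faulting address */
        };

and the completion ioctl would only need to carry (ioasid, response_code)
back, with the kernel recovering {rid, pasid} from the saved raw
fault_data.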
The following is thus skipped...

Thanks
Kevin

> > - Upon receiving the event, Qemu needs to find the virtual routing
> > information (v_rid + v_pasid) of the device attached to the faulting
> > ioasid. If there are multiple, pick a random one. This should be fine
> > since the purpose is to fix the I/O page table on the guest;
>
> The device label should fix this
>
> > - Qemu finds the pending fault event, converts the virtual completion data
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > complete the pending fault;
> >
> > - /dev/ioasid finds the pending fault data {rid, pasid, addr} saved in
> > ioasid_data->fault_data, and then calls the iommu api to complete it with
> > {rid, pasid, response_code};
>
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
>
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > The PASID table is put in the GPA space on some platforms, thus it must be
> > updated by the guest. It is treated as another user page table to be bound
> > with the IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified across platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which, once
> > enabled, requires the guest to invalidate the PASID cache for any change
> > on the PASID table. This allows Qemu to track the lifespan of guest I/O
> > page tables.
> >
> > In case such a capability is missing, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> >         /* After boot */
> >         /* Make vPASID space nested on GPA space */
> >         pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev1 to pasidtbl_ioasid */
> >         at_data = { .ioasid = pasidtbl_ioasid };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Bind PASID table */
> >         bind_data = {
> >                 .ioasid = pasidtbl_ioasid;
> >                 .addr   = gpa_pasid_table;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> >         /* vIOMMU detects a new GVA I/O space created */
> >         gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev1 to the new address space, with gpasid1 */
> >         at_data = {
> >                 .ioasid     = gva_ioasid;
> >                 .flag       = IOASID_ATTACH_USER_PASID;
> >                 .user_pasid = gpasid1;
> >         };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> >          * used, the kernel will not update the PASID table. Instead, it
> >          * just tracks the bound I/O page table for handling invalidation
> >          * and I/O page faults.
> >          */
> >         bind_data = {
> >                 .ioasid = gva_ioasid;
> >                 .addr   = gva_pgtable1;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I still don't quite get the benefit from doing this.
>
> The idea to create an all-PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
>
> Cache invalidate seems easy enough to support
>
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
>
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
>
> Jason