> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, June 3, 2021 12:59 AM
>
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > 	/* Bind guest I/O page table */
> > > > 	bind_data = {
> > > > 		.ioasid = gva_ioasid;
> > > > 		.addr = gva_pgtable1;
> > > > 		// and format information
> > > > 	};
> > > > 	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > >
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> >
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually.
>
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

I would expect the capability to be reported per-device via /dev/iommu.
Incompatible devices can bind to the same fd but cannot attach to the
same IOASID. This allows incompatible devices to share locked page
accounting.

> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.

If my memory serves, the original concept was that devices attached to
the same IOASID must support the same format. Otherwise they need to
attach to different IOASIDs (but still within the same fd).

> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

With the above concept we don't need the IOASID container then.

> I don't know if that small advantage is worth the extra complexity
> though.
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats. Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept. And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
>
> The basic process for a user space owned pgtable mode would be:
>
> 1) qemu has to figure out what format of pgtable to use
>
>    Presumably it uses query functions using the device label. The
>    kernel code should look at the entire device path through all the
>    IOMMU HW to determine what is possible.
>
>    Or it already knows because the VM's vIOMMU is running in some
>    fixed page table format, or the VM's vIOMMU already told it, or
>    something.

I'd expect both: first get the hardware format, then detect whether it's
compatible with the vIOMMU format.

> 2) qemu creates an IOASID and based on #1 and says 'I want this format'

Based on the earlier discussion this will possibly be:

struct iommu_ioasid_create_info {
	// if set this is a guest-managed page table, use bind+invalidate,
	// with info provided in struct pgtable_info;
	// if clear it's host-managed and use map+unmap;
#define IOMMU_IOASID_FLAG_USER_PGTABLE		1
	// if set it is for pasid table binding. same implication as
	// USER_PGTABLE except it's for a different pgtable type
#define IOMMU_IOASID_FLAG_USER_PASID_TABLE	2
	int	flags;

	// Create nesting if not INVALID_IOASID
	u32	parent_ioasid;

	// additional info about the page table
	union {
		// for user-managed page table
		struct {
			u64	user_pgd;
			u32	format;
			u32	addr_width;
			// and other vendor format info
		} pgtable_info;
		// for kernel-managed page table
		struct {
			// not required on x86
			// for ppc, iirc the user wants to claim a window
			// explicitly?
		} map_info;
	};
};

Then there will be no UNBIND_PGTABLE ioctl. The unbind is done
automatically when the IOASID is freed.

> 3) qemu binds the IOASID to the device.

Let's use 'attach' for consistency. 😊 'bind' is for the ioasid fd, which
must be completed in step 0) so the format can be reported in step 1).

> If qemu gets it wrong then it just fails.
>
> 4) For the next device qemu would have to figure out if it can re-use
>    an existing IOASID based on the required properties.
>
> You pointed to the case of mixing vIOMMU's of different platforms. So
> it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
> page table mode v2" while running on an x86 because that is what the
> vIOMMU is wired to work with.
>
> Presumably qemu will fall back to software emulation if this is not
> possible.
>
> One interesting option for software emulation is to just transform the
> ARM page table format to a x86 page table format in userspace and use
> nested bind/invalidate to synchronize with the kernel. With SW nesting
> I suspect this would be much faster

Or just use map+unmap. It's no different from how a virtio-iommu works
on all platforms, which by definition is not the same type as the
underlying hardware.

Thanks
Kevin