> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, June 3, 2021 1:20 AM
> [...]
> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations. e.g.
> >
> > 	'prereg' IOAS
> > 	|
> > 	\- 'rid' IOAS
> > 	   |
> > 	   \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices. But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> > would be an alternative to attaching devices.
>
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregistration data should be rather purpose built.

Yes. For now I prefer managing prereg through a separate cmd instead of
special-casing it in the IOASID graph. Anyway this is sort of a per-fd
thing.

> > > /*
> > >  * Map/unmap process virtual addresses to I/O virtual addresses.
> > >  *
> > >  * Provide VFIO type1 equivalent semantics. Start with the same
> > >  * restriction, e.g. the unmap size should match those used in the
> > >  * original mapping call.
> > >  *
> > >  * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >  * must already be in the preregistered list.
> > >  *
> > >  * Input parameters:
> > >  *	- u32 ioasid;
> > >  *	- refer to vfio_iommu_type1_dma_{un}map
> > >  *
> > >  * Return: 0 on success, -errno on failure.
> > >  */
> > > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user-managed
> > pagetable has been bound?
> Me too, or a SVA page table.
>
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..
>
> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too

Sure, I incorporated this comment in the last reply.

> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio, thus a device fd can be acquired
> > > w/o going through the legacy container/group interface. For
> > > illustration purposes those devices are just called dev[1...N]:
> > >
> > >	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> > for mdevs. That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
>
> There are a bunch of nice options here if we go this path

Yes, this part was only roughly sketched so as to focus on /dev/iommu
first. It will be considered more seriously in later versions.

> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++++++++
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.

'Initially' here still refers to a user-requested action. For PAPR you
would do the attach only when it's necessary.

Thanks
Kevin