Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

On Thu, Apr 28, 2022 at 11:22:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 28, 2022 at 03:58:30PM +1000, David Gibson wrote:
> > On Thu, Mar 31, 2022 at 09:58:41AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 31, 2022 at 03:36:29PM +1100, David Gibson wrote:
> > > 
> > > > > +/**
> > > > > + * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
> > > > > + * @size: sizeof(struct iommu_ioas_iova_ranges)
> > > > > + * @ioas_id: IOAS ID to read ranges from
> > > > > + * @out_num_iovas: Output total number of ranges in the IOAS
> > > > > + * @__reserved: Must be 0
> > > > > + * @out_valid_iovas: Array of valid IOVA ranges. The array length is the smaller
> > > > > + *                   of out_num_iovas or the length implied by size.
> > > > > + * @out_valid_iovas.start: First IOVA in the allowed range
> > > > > + * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
> > > > > + *
> > > > > + * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
> > > > > + * not allowed. out_num_iovas will be set to the total number of iovas
> > > > > + * and the out_valid_iovas[] will be filled in as space permits.
> > > > > + * size should include the allocated flex array.
> > > > > + */
> > > > > +struct iommu_ioas_iova_ranges {
> > > > > +	__u32 size;
> > > > > +	__u32 ioas_id;
> > > > > +	__u32 out_num_iovas;
> > > > > +	__u32 __reserved;
> > > > > +	struct iommu_valid_iovas {
> > > > > +		__aligned_u64 start;
> > > > > +		__aligned_u64 last;
> > > > > +	} out_valid_iovas[];
> > > > > +};
> > > > > +#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
> > > > 
> > > > Is the information returned by this valid for the lifetime of the IOAS,
> > > > or can it change?  If it can change, what events can change it?
> > > >
> > > > If it *can't* change, then how do we have enough information to
> > > > determine this at ALLOC time, since we don't necessarily know which
> > > > (if any) hardware IOMMU will be attached to it.
> > > 
> > > It is a good point worth documenting. It can change. Particularly
> > > after any device attachment.
> > 
> > Right.. this is vital and needs to be front and centre in the
> > comments/docs here.  Really, I think an interface that *doesn't* have
> > magically changing status would be better (which is why I was
> > advocating that the user set the constraints, and the kernel either
> > satisfied them or failed outright).  Still I recognize that has its
> > own problems.
> 
> That is a neat idea, it could be a nice option, it lets userspace
> further customize the kernel allocator.
> 
> But I don't have a use case in mind.  The simple cases I know about
> want to attach their devices and then allocate valid IOVAs; they
> don't really have a notion of what IOVA regions they are willing to
> accept, and don't necessarily do hotplug.

The obvious use case is qemu (or whatever) emulating a vIOMMU.  The
emulation code knows the IOVA windows that are expected of the vIOMMU
(because that's a property of the emulated platform), and requests
them of the host IOMMU.  If the host can supply that, you're good
(this doesn't necessarily mean the host windows match exactly, just
that the requested windows fit within the host windows).  If not,
you report an error.  This can be done at any point when the host
windows might change - so try to attach a device that can't support
the requested windows, and it will fail.  Attach a device which
merely shrinks the host windows, but still fits the requested windows
within them, and you're still good to go.

For a typical direct userspace case you don't want that.  However, it
probably *does* make sense for userspace to specify how large a window
it wants.  So some form that allows you to specify size without base
address also makes sense.  In that case the kernel would set a base
address according to the host IOMMU's capabilities, or fail if it
can't supply any window of the requested size.  When to allocate that
base address is a bit unclear though.  If you do it at window request
time, then you might pick something that a later device can't work
with.  If you do it later, it's less clear how to sensibly report it
to userspace.

One option might be to only allow IOAS_MAP (or COPY) operations after
windows are requested, but otherwise you can choose the order.  So,
things that have strict requirements for the windows (vIOMMU
emulation) would request the windows then add devices: they know the
windows they need, if the devices can't work with that, that's what
needs to fail.  A userspace driver, however, would attach the devices
it wants to use, then request a window (without specifying base
address).

A query ioctl to give the largest possible windows in the current
state could still be useful for debugging here, of course, but
wouldn't need to be used in the normal course of operation.

> What might be interesting is to have some option to load in a machine
> specific default ranges - ie the union of every group and every
> iommu_domain. The idea being that after such a call hotplug of a
> device should be very likely to succeed.
> 
> Though I don't have a user in mind..
> 
> > > I added this:
> > > 
> > >  * Query an IOAS for ranges of allowed IOVAs. Mapping IOVA outside these ranges
> > >  * is not allowed. out_num_iovas will be set to the total number of iovas and
> > >  * the out_valid_iovas[] will be filled in as space permits. size should include
> > >  * the allocated flex array.
> > >  *
> > >  * The allowed ranges are dependent on the HW path the DMA operation takes, and
> > >  * can change during the lifetime of the IOAS. A fresh empty IOAS will have a
> > >  * full range, and each attached device will narrow the ranges based on that
> > >  * device's HW restrictions.
> > 
> > I think you need to be even more explicit about this: which exact
> > operations on the fd can invalidate exactly which items in the
> > information from this call?  Can it only ever be narrowed, or can it
> > be broadened with any operations?
> 
> I think "attach" is the phrase we are using for that operation - it is
> not a specific IOCTL here because it happens on, say, the VFIO device FD.
> 
> Let's add "detaching a device can widen the ranges. Userspace should
> query ranges after every attach/detach to know what IOVAs are valid
> for mapping."
> 
> > > > > +#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
> > > > 
> > > > Since it can only copy a single mapping, what's the benefit of this
> > > > over just repeating an IOAS_MAP in the new IOAS?
> > > 
> > > It causes the underlying pin accounting to be shared and can avoid
> > > calling GUP entirely.
> > 
> > If that's the only purpose, then that needs to be right here in the
> > comments too.  So is expected best practice to IOAS_MAP everything you
> > might want to map into a sort of "scratch" IOAS, then IOAS_COPY the
> > mappings you actually end up wanting into the "real" IOASes for use?
> 
> That is one possibility, yes. qemu seems to be using this to establish
> a clone IOAS of an existing operational one, which is another usage
> model.

Right, for qemu (or other hypervisors) the obvious choice would be to
create a "staging" IOAS where IOVA == GPA, then COPY that into the various
emulated bus IOASes.  For a userspace driver situation, I'm guessing
you'd map your relevant memory pool into an IOAS, then COPY to the
IOAS you need for whatever specific devices you're using.

> I added this additionally:
> 
>  * This may be used to efficiently clone a subset of an IOAS to another, or as a
>  * kind of 'cache' to speed up mapping. Copy has an efficiency advantage over
>  * establishing equivalent new mappings, as internal resources are shared, and
>  * the kernel will pin the user memory only once.

I think adding that helps substantially.

> > Seems like it would be nicer for the interface to just figure it out
> > for you: I can see there being sufficient complications with that to
> > have this slightly awkward interface, but I think it needs a rationale
> > to accompany it.
> 
> It is more than just complications: the kernel has no way to accurately know
> when a user pointer is an alias of an existing user pointer or is
> something new because the mm has become incoherent.
> 
> It is possible that uncoordinated modules in userspace could
> experience data corruption if the wrong decision is made - mm
> coherence with pinning is pretty weak in Linux.. Since I dislike that
> kind of unreliable magic I made it explicit.

Fair enough.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
