Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:

> > The default iommu_domain that the iommu driver creates will be used
> > here, it is up to the iommu driver to choose something reasonable for
> > use by applications like DPDK. ie PPC should probably pick its biggest
> > x86-like aperture.
> 
> So, using the big aperture means a very high base IOVA
> (1<<59)... which means that it won't work at all if you want to attach
> any devices that aren't capable of 64-bit DMA.

I'd expect to include the 32 bit window too..

> Using the maximum possible window size would mean we either
> potentially waste a lot of kernel memory on pagetables, or we use
> unnecessarily large number of levels to the pagetable.

All drivers have this issue to one degree or another. We seem to be
ignoring it - in any case this is a micro-optimization, not a
functional need?

> More generally, the problem with the interface advertising limitations
> and it being up to userspace to work out if those are ok or not is
> that it's fragile.  It's pretty plausible that some future IOMMU model
> will have some new kind of limitation that can't be expressed in the
> query structure we invented now.

The basic API is very simple - the driver needs to provide ranges of
IOVA and map/unmap - I don't think we have a future problem here that
we need to try to guess at and solve today.

Even PPC fits this just fine; the open question for DPDK is more
around optimization, not functionality.
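
To make it concrete, the whole DPDK-style flow is roughly this (a
sketch using the IOMMU_IOAS_* names from this series; fields are
elided and the exact uAPI may still change):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Sketch only - ioctl/struct names follow this proposal and may
 * change before merging. Error handling elided. */
int iommufd = open("/dev/iommu", O_RDWR);

struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);	/* get an IOAS */

/* attach the device to alloc.out_ioas_id, then ask the kernel
 * which IOVA ranges the attachment left usable */
struct iommu_ioas_iova_ranges ranges = {
	.size = sizeof(ranges),
	.ioas_id = alloc.out_ioas_id,
	/* .allowed_iovas points at a user buffer for the ranges */
};
ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, &ranges);

/* map the hugepage pool at an IOVA inside a reported range */
struct iommu_ioas_map map = {
	.size = sizeof(map),
	.ioas_id = alloc.out_ioas_id,
	/* .user_va / .length / .iova chosen from the ranges result */
};
ioctl(iommufd, IOMMU_IOAS_MAP, &map);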

> But if userspace requests the capabilities it wants, and the kernel
> acks or nacks that, we can support the new host IOMMU with existing
> software just fine.

No, this just makes it fragile in the other direction, because now
userspace has to know what platform specific things to ask for *or it
doesn't work at all*. This is not an improvement for the DPDK cases.

The kernel decides, using all the knowledge it has, and tells the
application what it can do - this is the basic simplified interface.
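
Put differently, the app's half of that contract is just "stay inside
what the kernel reported", e.g. (sketch):

#include <stdint.h>

/* Sketch: pick a base inside whatever the kernel reported instead
 * of hard-coding a platform-specific guess. */
struct iova_range { uint64_t start, last; };

static uint64_t pick_iova(const struct iova_range *r, unsigned int n,
			  uint64_t need)
{
	for (unsigned int i = 0; i != n; i++)
		if (r[i].last - r[i].start + 1 >= need)
			return r[i].start;	/* first fit */
	return UINT64_MAX;			/* no usable window */
}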

> > The iommu-driver-specific struct is the "advanced" interface and
> > allows a user-space IOMMU driver to tightly control the HW with full
> > HW specific knowledge. This is where all the weird stuff that is not
> > general should go.
> 
> Right, but forcing anything more complicated than "give me some IOVA
> region" to go through the advanced interface means that qemu (or any
> hypervisor where the guest platform need not identically match the
> host) has to have n^2 complexity to match each guest IOMMU model to
> each host IOMMU model.

I wouldn't say n^2, but yes, qemu needs to have a userspace driver for
the platform IOMMU, and yes it needs this to reach optimal
behavior. We already know this is a hard requirement for using nesting
as acceleration; I don't see why apertures are so different.

> Errr.. how do you figure?  On ppc the ranges and pagesizes are
> definitely negotiable.  I'm not really familiar with other models, but
> anything which allows *any* variations in the pagetable structure will
> effectively have at least some negotiable properties.

As above, if you ask for the wrong thing then you don't get
anything. If DPDK asks for something that works on ARM, like 0 -> 4G,
then PPC and x86 will always fail. How does it improve anything to
require applications to carefully ask for exactly the right platform
specific ranges?

It isn't like there is some hard coded value we can put into DPDK that
will work on every platform. So the kernel must pick for DPDK, IMHO. I
don't see any feasible alternative.

> Which is why I'm suggesting that the base address be an optional
> request.  DPDK *will* care about the size of the range, so it just
> requests that and gets told a base address.

We've talked before about the size of the IOVA address space, strictly
as a hint to possibly optimize the page table layout, and I'm fine
with that idea. But we have no driver implementation today, so I'm not
sure what we can really do with this right now..

Kevin, could Intel consume a hint about the size of the IOVA space and
optimize the number of IO page table levels?
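
If we did grow such a hint I'd imagine it at IOAS creation time,
something like this (purely hypothetical, nothing like it exists in
this series):

/* Purely hypothetical - not in this series or any driver. A
 * creation-time hint the kernel is free to ignore; e.g. a small
 * iova_size_hint could let the Intel driver stay at 3 page table
 * levels instead of 4 or 5. */
struct iommu_ioas_alloc_hinted {
	__u32 size;
	__u32 flags;
	__u32 out_ioas_id;
	__u32 __reserved;
	__aligned_u64 iova_size_hint;	/* app's expected max IOVA span */
};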

> > and IMHO, qemu
> > is fine to have a PPC specific userspace driver to tweak this PPC
> > unique thing if the default windows are not acceptable.
> 
> Not really, because it's the ppc *target* (guest) side which requires
> the specific properties, but selecting the "advanced" interface
> requires special knowledge on the *host* side.

The ppc specific driver would be on the generic side of qemu in its
viommu support framework. There is lots of host driver optimization
possible here with knowledge of the underlying host iommu HW. It
should not be connected to the qemu target.

It is not so different from today, where qemu generically has to know
about ppc's special vfio interface even to emulate x86.

> > IMHO it is no different from imagining an Intel specific userspace
> > driver that is using userspace IO pagetables to optimize
> > cross-platform qemu vIOMMU emulation.
> 
> I'm not quite sure what you have in mind here.  How is it both Intel
> specific and cross-platform?

It is part of the generic qemu iommu interface layer. For nesting,
qemu would copy the guest page table format to the host page table
format in userspace and trigger invalidations - no pinning, no kernel
map/unmap calls. This can only be done with detailed knowledge of the
host iommu, since the host iommu's IO page table format is exposed
directly to userspace.
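
In rough pseudocode that driver's fast path is (all names illustrative
only):

/* Illustrative only - the shape of a user page table / nesting
 * flow. No pinning and no per-map ioctls: qemu rewrites the guest
 * PTEs into the host HW format in a user-owned table, then only
 * flushes the IOTLB through the kernel. */
static void viommu_guest_invalidate(struct viommu *v, uint64_t iova,
				    uint64_t len)
{
	/* translate guest-format PTEs to the host's native format */
	copy_guest_ptes_to_host_format(v, iova, len);

	struct hypothetical_invalidate inv = {
		.iova = iova,
		.length = len,
	};
	ioctl(v->iommufd, HYPOTHETICAL_HWPT_INVALIDATE, &inv);
}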

> Note however, that having multiple apertures isn't really ppc specific.
> Anything with an IO hole effectively has separate apertures above and
> below the hole.  They're much closer together in address than POWER's
> two apertures, but I don't see that makes any fundamental difference
> to the handling of them.

The iommu core handles the IO holes and the like through the group's
reserved IOVA list - there isn't actually a limit in the iommu_domain,
which has a flat pagetable format - and in cases like PASID/SVA the
group reserved list doesn't apply at all.
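
For reference, the core side looks roughly like this -
iommu_get_group_resv_regions() is the real helper, the loop body is
illustrative:

/* Kernel-side sketch: the holes come from the group's reserved
 * region list, not from the iommu_domain itself. */
LIST_HEAD(resv);
struct iommu_resv_region *region;

iommu_get_group_resv_regions(group, &resv);
list_for_each_entry(region, &resv, list) {
	/* carve [start, start + length) out of the IOVA space
	 * reported to userspace */
}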

> Another approach would be to give the required apertures / pagesizes
> in the initial creation of the domain/IOAS.  In that case they would
> be static for the IOAS, as well as the underlying iommu_domains: any
> ATTACH which would be incompatible would fail.

This is the device-specific iommu_domain creation path. The domain can
have information defining its aperture.
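
That information already exists today:

/* From include/linux/iommu.h - every iommu_domain carries its
 * aperture, which an incompatible attach could be checked against. */
struct iommu_domain_geometry {
	dma_addr_t aperture_start; /* First address that can be mapped */
	dma_addr_t aperture_end;   /* Last address that can be mapped */
	bool force_aperture;	   /* DMA only allowed in mappable range? */
};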

> That makes life hard for the DPDK case, though.  Obviously we can
> still make the base address optional, but for it to be static the
> kernel would have to pick it immediately, before we know what devices
> will be attached, which will be a problem on any system where there
> are multiple IOMMUs with different constraints.

Which is why the current scheme is fully automatic and we rely on the
iommu driver to automatically select something sane for DPDK/etc
today.

> > In general I have no issue with limiting the IOVA allocator in the
> > kernel, I just don't have a use case of an application that could use
> > the IOVA allocator (like DPDK) and also needs a limitation..
> 
> Well, I imagine DPDK has at least the minimal limitation that it needs
> the aperture to be a certain minimum size (I'm guessing at least the
> size of its pinned hugepage working memory region).  That's a
> limitation that's unlikely to fail on modern hardware, but it's there.

Yes, DPDK does assume there is some fairly large available aperture;
that should be the driver's default behavior, IMHO.

> > That breaks what I'm
> > trying to do to make DPDK/etc portable and dead simple.
> 
> It doesn't break portability at all.  As for simplicity, yes it adds
> an extra required step, but the payoff is that it's now impossible to
> subtly screw up by failing to recheck your apertures after an ATTACH.
> That is, it's taking a step which was implicitly required and
> replacing it with one that's explicitly required.

Again, as above, it breaks portability because apps have no hope of
knowing what window range to ask for in order to succeed. It cannot
just be a hard coded range.

Jason


