Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

On Fri, May 06, 2022 at 03:25:03PM +1000, David Gibson wrote:
> On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:

> > When the iommu_domain is created I want to have a
> > iommu-driver-specific struct, so PPC can customize its iommu_domain
> > however it likes.
> 
> This requires that the client be aware of the host side IOMMU model.
> That's true in VFIO now, and it's nasty; I was really hoping we could
> *stop* doing that.

iommufd has two modes. The 'generic interface', which is what this
patch series shows, does not require any device-specific knowledge.

The default iommu_domain that the iommu driver creates will be used
here; it is up to the iommu driver to choose something reasonable for
use by applications like DPDK, i.e. PPC should probably pick its
biggest x86-like aperture.
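
For concreteness, the generic path for a DPDK-like application could
look something like the below. This is only a sketch; the struct
layouts and ioctl names follow the direction of this series but should
not be read as settled uAPI:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Sketch: allocate an IOAS and map a buffer at a kernel-chosen IOVA,
 * with zero knowledge of the host IOMMU model. */
int generic_map(void *buf, size_t len, __u64 *out_iova)
{
	int fd = open("/dev/iommu", O_RDWR);
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	struct iommu_ioas_map map = { .size = sizeof(map) };

	if (fd < 0 || ioctl(fd, IOMMU_IOAS_ALLOC, &alloc))
		return -1;

	/* No IOMMU_IOAS_MAP_FIXED_IOVA flag: the kernel picks an IOVA
	 * inside whatever default aperture the iommu driver set up. */
	map.ioas_id = alloc.out_ioas_id;
	map.user_va = (uintptr_t)buf;
	map.length = len;
	map.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE;
	if (ioctl(fd, IOMMU_IOAS_MAP, &map))
		return -1;
	*out_iova = map.iova;
	return fd;
}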

The iommu-driver-specific struct is the "advanced" interface and
allows a user-space IOMMU driver to tightly control the HW with full
HW specific knowledge. This is where all the weird stuff that is not
general should go.
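
To illustrate the shape of it, the driver-specific struct for PPC
might look vaguely like this - every field name here is made up,
nothing like it exists; it only shows where the weird stuff would
live:

/* Hypothetical PPC creation attribute for the "advanced" interface,
 * letting userspace pick the TCE window geometry at iommu_domain
 * creation time. Illustrative only. */
struct iommu_ppc_tce_domain_data {
	__aligned_u64 window_base[2];	/* base IOVA of each TCE window */
	__aligned_u64 window_size[2];	/* bytes covered by each window */
	__u32 io_pagesize;		/* fixed TCE page size */
	__u32 flags;
};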

> Note that I'm talking here *purely* about the non-optimized case where
> all updates to the host side IO pagetables are handled by IOAS_MAP /
> IOAS_COPY, with no direct hardware access to user or guest managed IO
> pagetables.  The optimized case obviously requires end-to-end
> agreement on the pagetable format amongst other domain properties.

Sure, this is how things are already..

> What I'm hoping is that qemu (or whatever) can use this non-optimized
> path as a fallback case where it doesn't need to know the properties of
> whatever host side IOMMU models there are.  It just requests what it
> needs based on the vIOMMU properties it needs to replicate and the
> host kernel either can supply it or can't.

There aren't really any negotiable vIOMMU properties beyond the
ranges, and the ranges are not *really* negotiable.

There are lots of dragons with the idea that we can actually negotiate
ranges - asking for a range the HW can't do means you don't get
anything, which is completely contrary to the idea of easy generic
support for things like DPDK.

So DPDK can't ask for ranges; asking for them is inherently not generic.

This means we are really talking about a qemu-only API, and IMHO, qemu
is fine to have a PPC specific userspace driver to tweak this PPC
unique thing if the default windows are not acceptable.

IMHO it is no different from imagining an Intel specific userspace
driver that is using userspace IO pagetables to optimize
cross-platform qemu vIOMMU emulation. We should be comfortable with
the idea that accessing the full device-specific feature set requires
a HW specific user space driver.

> Admittedly those are pretty niche cases, but allowing for them gives
> us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
> ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
> the future, and AFAICT, ARM are much less conservative than x86 about
> maintaining similar hw interfaces over time.  That's why I think
> considering these ppc cases will give a more robust interface for
> other future possibilities as well.

I don't think so - PPC has just done two things that are completely
divergent from everything else - having two IO page tables for the same
end point, and using map/unmap hypercalls instead of a nested page
table.

Everyone else seems to be focused on IOPTEs that are similar to CPU
PTEs, particularly to enable SVA and other tricks, and CPUs don't
have either of those quirks.

> You can consider that a map/unmap hypercall, but the size of the
> mapping is fixed (the IO pagesize which was previously set for the
> aperture).

Yes, I would consider that a map/unmap hypercall vs a nested translation.
 
> > Assuming yes, I'd expect that:
> > 
> > The iommu_domain for nested PPC is just a log of map/unmap hypervisor
> > calls to make. Whenever a new PE is attached to that domain it gets
> > the logged maps replayed to set it up, and when a PE is detached the
> > log is used to unmap everything.
> 
> And likewise duplicate every H_PUT_TCE to all the PEs in the domain.
> Sure.  It means the changes won't be atomic across the domain, but I
> guess that doesn't matter.  I guess you could have the same thing on a
> sufficiently complex x86 or ARM system, if you put two devices into
> the IOAS that were sufficiently far from each other in the bus
> topology that they use a different top-level host IOMMU.

Yes, strict atomicity is not needed.

> > It is not perfectly memory efficient - and we could perhaps talk about
> > an API modification to allow re-use of the iommufd data structure
> > somehow, but I think this is a good logical starting point.
> 
> Because the map size is fixed, a "replay log" is effectively
> equivalent to a mirror of the entire IO pagetable.

So, for virtualized PPC the iommu_domain is an xarray of mapped PFNs
and device attach/detach just sweeps the xarray and does the
hypercalls. Very similar to what we discussed for S390.

It seems OK, this isn't even really overhead in most cases as we
always have to track the mapped PFNs anyhow.
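
In rough kernel pseudo-C, I am imagining something like this - the
ppc_domain/ppc_pe types and ppc_map_hcall() are made up for
illustration:

#include <linux/xarray.h>

struct ppc_pe;
int ppc_map_hcall(struct ppc_pe *pe, unsigned long iova, unsigned long pfn);

/* Sketch of a "replay log" domain: the xarray tracks a pfn per IO
 * page index, and attach sweeps it, replaying each entry as a map
 * hypercall against the newly attached PE. */
struct ppc_domain {
	struct xarray pfns;		/* IO page index -> pfn */
	unsigned int io_pgshift;	/* fixed IO page size */
};

static int ppc_domain_attach_pe(struct ppc_domain *dom, struct ppc_pe *pe)
{
	unsigned long index;
	void *entry;

	xa_for_each(&dom->pfns, index, entry) {
		int rc = ppc_map_hcall(pe, index << dom->io_pgshift,
				       xa_to_value(entry));
		if (rc)
			return rc;	/* caller sweeps again to unmap */
	}
	return 0;
}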

> > Once a domain is created and attached to a group the aperture should
> > be immutable.
> 
> I realize that's the model right now, but is there a strong rationale
> for that immutability?

I have a strong preference that iommu_domains have immutable
properties once they are created just for overall sanity. Otherwise
everything becomes a racy mess.

If the iommu_domain changes dynamically then things using the aperture
data get all broken - it is complicated to fix. So it would need a big
reason to do it, I think.
 
> Come to that, IIUC that's true for the iommu_domain at the lower
> level, but not for the IOAS at a higher level.  You've stated that's
> *not* immutable, since it can shrink as new devices are added to the
> IOAS.  So I guess in that case the IOAS must be mapped by multiple
> iommu_domains?

Yes, that's right. The IOAS is expressly mutable because it sits on
multiple domains and multiple groups, each of which contributes to the
aperture. The IOAS aperture is the narrowest of everything, and we
have semi-reasonable semantics here. iommufd has all the special code
to juggle this, but even in this case we can't handle an iommu_domain
or group changing asynchronously.
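
The narrowing rule itself fits in a few lines (single-range case only,
just to make the semantics concrete):

/* Sketch: the usable IOAS aperture is the intersection of the
 * apertures of every attached iommu_domain and group. */
struct aperture {
	unsigned long start;
	unsigned long last;	/* inclusive */
};

static void ioas_narrow(struct aperture *ioas, const struct aperture *dom)
{
	if (dom->start > ioas->start)
		ioas->start = dom->start;
	if (dom->last < ioas->last)
		ioas->last = dom->last;
	/* if start > last then no usable IOVA remains in the IOAS */
}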

> Right but that aperture is coming only from the hardware.  What I'm
> suggesting here is that userspace can essentially opt into a *smaller*
> aperture (or apertures) than the hardware permits.  The value of this
> is that if the effective hardware aperture shrinks due to adding
> devices, the kernel has the information it needs to determine if this
> will be a problem for the userspace client or not.

We can do this check in userspace without more kernel APIs; userspace
should fetch the ranges and confirm they are still good after
attaching devices.
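
Sketching the userspace side of that check - again the struct layout
mirrors where this series is going and is not settled uAPI:

#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* After attaching a device, re-fetch the IOAS ranges and confirm an
 * already-allocated [iova, iova+length) is still inside one of them. */
static bool iova_still_valid(int iommufd, __u32 ioas_id,
			     __u64 iova, __u64 length)
{
	struct iommu_iova_range ranges[32];
	struct iommu_ioas_iova_ranges cmd = {
		.size = sizeof(cmd),
		.ioas_id = ioas_id,
		.num_iovas = 32,
		.allowed_iovas = (uintptr_t)ranges,
	};
	unsigned int i;

	if (ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, &cmd))
		return false;	/* conservatively treat failure as invalid */

	for (i = 0; i < cmd.num_iovas && i < 32; i++)
		if (iova >= ranges[i].start &&
		    iova + length - 1 <= ranges[i].last)
			return true;
	return false;
}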

In general I have no issue with limiting the IOVA allocator in the
kernel, I just don't have a use case of an application that could use
the IOVA allocator (like DPDK) and also needs a limitation..

> > >   * A newly created kernel-managed IOAS has *no* IOVA windows
> > 
> > Already done, the iommufd IOAS has no iommu_domains inside it at
> > creation time.
> 
> That.. doesn't seem like the same thing at all.  If there are no
> domains, there are no restrictions from the hardware, so there's
> effectively an unlimited aperture.

Yes.. You wanted a zero-sized window instead? Why? That breaks what I'm
trying to do to make DPDK/etc portable and dead simple.
 
> > >   * A CREATE_WINDOW operation is added
> > >       * This takes a size, pagesize/granularity, optional base address
> > >         and optional additional attributes 
> > >       * If any of the specified attributes are incompatible with the
> > >         underlying hardware, the operation fails
> > 
> > iommu layer has nothing called a window. The closest thing is a
> > domain.
> 
> Maybe "window" is a bad name.  You called it "aperture" above (and
> I've shifted to that naming) and implied it *was* part of the IOMMU
> domain. 

It is, but not as an object that can be mutated - it is just a
property.

You are talking about a window object that exists somehow; I don't see
how this fits beyond being some creation attribute of the domain.

> That said, it doesn't need to be at the iommu layer - as I
> said I'm considering this primarily a software concept and it could be
> at the IOAS layer.  The linkage would be that every iommufd operation
> which could change either the user-selected windows or the effective
> hardware apertures must ensure that all the user windows (still) lie
> within the hardware apertures.

As Kevin said, we need to start at the iommu_domain layer first -
when we understand how that needs to look then we can imagine what the
uAPI should be.

I don't want to create something in iommufd that is wildly divergent
from what the iommu_domain/etc can support.

> > The only thing the kernel can do is relay a notification that something
> > happened to a PE. The kernel gets an event on the PE basis, I would
> > like it to replicate it to all the devices and push it through the
> > VFIO device FD.
> 
> Again: it's not about the notifications (I don't even really know how
> those work in EEH).

Well, then I don't know what we are talking about - I'm interested in
what uAPI is needed to support this; as far as I can tell that is a
notification that something bad happened and some control knobs?

As I said, I'd prefer this to be on the vfio_device FD vs on iommufd
and would like to find a way to make that work.

> expects that to happen at the (virtual) hardware level.  So if a guest
> trips an error on one device, it expects IO to stop for every other
> device in the PE.  But if hardware's notion of the PE is smaller than
> the guest's, the host hardware won't accomplish that itself.

So, why do that to the guest? Shouldn't the PE in the guest be
strictly a subset of the PE in the host, otherwise you hit all these
problems you are talking about?

> used this in anger, I think a better approach is to just straight up
> fail any attempt to do any EEH operation if there's not a 1 to 1
> mapping from guest PE (domain) to host PE (group).

That makes sense

Jason


