Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

On Fri, Apr 29, 2022 at 04:20:36PM +1000, David Gibson wrote:

> > I think PPC and S390 are solving the same problem here. I think S390
> > is going to go to a SW nested model where it has an iommu_domain
> > controlled by iommufd that is populated with the pinned pages, eg
> > stored in an xarray.
> > 
> > Then the performance map/unmap path is simply copying pages from the
> > xarray to the real IOPTEs - and this would be modeled as a nested
> > iommu_domain with a SW vIOPTE walker instead of a HW vIOPTE walker.
> > 
> > Perhaps this is agreeable for PPC too?
> 
> Uh.. maybe?  Note that I'm making these comments based on working on
> this some years ago (the initial VFIO for ppc implementation in
> particular).  I'm no longer actively involved in ppc kernel work.

OK
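
FWIW, by the SW-nested model I just mean something like the toy below
(a userspace model only, nothing here is real kernel or iommufd code):
the parent iommu_domain keeps the pinned PFNs keyed by IOVA page index
(that is the xarray), and the hot map path is nothing more than copying
those PFNs into the real IOPTEs.

/* Userspace toy only, not kernel code.  "pinned_pfn" stands in for the
 * xarray of pinned pages held by the iommufd parent domain, and
 * sw_nested_map() is the SW vIOPTE walker: it only copies PFNs from
 * that store into the real IOPTEs for the requested range. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NPAGES     16

static uint64_t pinned_pfn[NPAGES];     /* stand-in for the xarray */
static uint64_t real_iopte[NPAGES];     /* stand-in for the HW IO page table */

static int sw_nested_map(uint64_t iova, uint64_t len)
{
        for (uint64_t idx = iova >> PAGE_SHIFT;
             idx < (iova + len) >> PAGE_SHIFT; idx++) {
                if (idx >= NPAGES || !pinned_pfn[idx])
                        return -1;      /* out of range or never pinned */
                real_iopte[idx] = pinned_pfn[idx];
        }
        return 0;
}

int main(void)
{
        pinned_pfn[2] = 0x1234;         /* "pin" one page */
        printf("map: %d, iopte[2] = %#llx\n",
               sw_nested_map(2ULL << PAGE_SHIFT, 1ULL << PAGE_SHIFT),
               (unsigned long long)real_iopte[2]);
        return 0;
}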
 
> > > 3) "dynamic DMA windows" (DDW).  The IBM IOMMU hardware allows for 2 IOVA
> > > windows, which aren't contiguous with each other.  The base addresses
> > > of each of these are fixed, but the size of each window, the pagesize
> > > (i.e. granularity) of each window and the number of levels in the
> > > IOMMU pagetable are runtime configurable.  Because it's true in the
> > > hardware, it's also true of the vIOMMU interface defined by the IBM
> > > hypervisor (and adopted by KVM as well).  So, guests can request
> > > changes in how these windows are handled.  Typical Linux guests will
> > > use the "low" window (IOVA 0..2GiB) dynamically, and the high window
> > > (IOVA 1<<60..???) to map all of RAM.  However, as a hypervisor we
> > > can't count on that; the guest can use them however it wants.
> > 
> > As part of nesting iommufd will have a 'create iommu_domain using
> > iommu driver specific data' primitive.
> > 
> > The driver specific data for PPC can include a description of these
> > windows so the PPC specific qemu driver can issue this new ioctl
> > using the information provided by the guest.
> 
> Hmm.. not sure if that works.  At the moment, qemu (for example) needs
> to set up the domains/containers/IOASes as it constructs the machine,
> because that's based on the virtual hardware topology.  Initially they
> use the default windows (0..2GiB first window, second window
> disabled).  Only once the guest kernel is up and running does it issue
> the hypercalls to set the final windows as it prefers.  In theory the
> guest could change them during runtime though it's unlikely in
> practice.  They could change during machine lifetime in practice,
> though, if you rebooted from one guest kernel to another that uses a
> different configuration.
> 
> *Maybe* IOAS construction can be deferred somehow, though I'm not sure
> because the assigned devices need to live somewhere.

This is a general requirement for all the nesting implementations: we
start out with some default nested page table and then later the VM
does the vIOMMU call to change it. So nesting will have to come along
with some kind of 'switch domains IOCTL'.

In this case I would guess PPC could do the same and start out with a
small (nested) iommu_domain and then create the VM's desired
iommu_domain from the hypercall, and switch to it.

It is a bit more CPU work since maps in the lower range would have to
be copied over, but conceptually the model matches the HW nesting.

> > > You might be able to do this by simply failing this outright if
> > > there's anything other than exactly one IOMMU group bound to the
> > > container / IOAS (which I think might be what VFIO itself does now).
> > > Handling that with a device centric API gets somewhat fiddlier, of
> > > course.
> > 
> > Maybe every device gets a copy of the error notification?
> 
> Alas, it's harder than that.  One of the things that can happen on an
> EEH fault is that the entire PE gets suspended (blocking both DMA and
> MMIO, IIRC) until the proper recovery steps are taken.  

I think qemu would have to de-duplicate the per-device copies of the
notification, and then it can go from a device notification to the
device's iommu_group to the IOAS to the vPE?

A simple serial number in the event would make this pretty simple.
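
Something like this toy, say (again all made up; that every device's
copy of the event carries the same serial number is just my assumption
of how it could be tagged):

/* Userspace toy of the qemu-side de-duplication idea.  Assumes,
 * hypothetically, that each device's copy of an EEH event carries the
 * same serial number; qemu drops copies it has already seen and
 * resolves the remaining one to the vPE. */
#include <stdint.h>
#include <stdio.h>

struct eeh_event {
        uint64_t serial;        /* assumed: same value on every device's copy */
        int      devid;
};

static uint64_t last_serial;    /* per-vPE would be more precise; kept simple */

static int device_to_vpe(int devid)
{
        /* device -> iommu_group -> IOAS -> vPE lookup would go here */
        return devid / 4;       /* dummy mapping for the sketch */
}

static void handle_eeh_event(const struct eeh_event *ev)
{
        if (ev->serial == last_serial)
                return;                         /* duplicate copy, drop it */
        last_serial = ev->serial;
        printf("EEH event %llu -> vPE %d\n",
               (unsigned long long)ev->serial, device_to_vpe(ev->devid));
        /* suspend the vPE, run recovery, then clear the event by sending
         * the recovery commands through any one device in the group */
}

int main(void)
{
        struct eeh_event a = { 1, 0 }, b = { 1, 1 };    /* one fault, two copies */
        handle_eeh_event(&a);
        handle_eeh_event(&b);                           /* de-duplicated */
        return 0;
}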

The way back to clear the event would be to just forward the commands
through any one device in the iommu_group to the PE?

Thanks,
Jason


