Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 24 Apr 2024 14:13:49 -0600

On Wed, 24 Apr 2024 15:36:26 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote:
> > > The only reason to pass the PF's PASID cap is to give free space to
> > > the VMM. If we are saying that gaps are free space (excluding a list
> > > of bad devices) then we don't actually need to do that anymore.  
> > 
> > Are we saying that now??  That's new.  
> 
> I suggested it a few times
> 
> >   
> > > VMM will always create a synthetic PASID cap and kernel will always
> > > suppress a real one.
> > > 
> > > An iommufd query will indicate if the vIOMMU can support vPASID on
> > > that device.
> > > 
> > > Same for all the troublesome non-physical caps.
> > >   
> > > > > There are migration considerations too - the blocks need to be
> > > > > migrated over and end up in the same place as well..    
> > > > 
> > > > Can you elaborate what is the problem with the kernel emulating
> > > > the PASID cap in this consideration?    
> > > 
> > > If the kernel changes the algorithm, say it wants to do PASID, PRI,
> > > something_new then it might change the layout
> > > 
> > > We can't just have the kernel decide without also providing a way for
> > > userspace to say what the right layout actually is. :\  
> > 
> > The capability layout is only relevant to migration, right?    
> 
> Yes, probably
> 
> > A variant
> > driver that supports migration is a prerequisite and would also be
> > responsible for exposing the PASID capability.  This isn't as disjoint
> > as it's being portrayed.  
> 
> I guess..  But also not quite. We still have the problem that kernel
> migration driver V1 could legitimately create a different config space
> than migration driver V2
> 
> And now you are saying that the migration driver has to parse the
> migration stream and readjust its own layout
> 
> And every driver needs to do this?
> 
> We can, it is quite a big bit of infrastructure I think, but sure..
> 
> I fear the VMM still has to be involved somehow because it still has
> to know if the source VMM has removed any kernel created caps.

This is kind of an absurd example to portray as a ubiquitous problem.
Typically the config space layout is a reflection of hardware whether
the device supports migration or not.  If a driver were to insert a
virtual capability, then yes it would want to be consistent about it if
it also cares about migration.  If the driver needs to change the
location of a virtual capability, problems will arise, but that's also
not something that every driver needs to do.

Also, how exactly does emulating the capability in the VMM solve this
problem?  Currently QEMU migration simply applies state to an identical
VM on the target.  QEMU doesn't modify the target VM to conform to the
data stream.  So in either case, the problem might be more along the
lines of how to make a V1 device from a V2 driver, which is more the
device type/flavor/persona problem.

> > Outside of migration, what does it matter if the cap layout is
> > different?  A driver should never hard code the address for a
> > capability.  
> 
> Yes, talking about migration here - migration is the hardest case it
> seems.
>  
> > > At least if the VMM is doing this then the VMM can include the
> > > information in its migration scheme and use it to recreate the PCI
> > > layout without having to create a bunch of uAPI to do so.  
> > 
> > We're again back to migration compatibility, where again the capability
> > layout would be governed by the migration support in the in-kernel
> > variant driver.  Once migration is involved the location of a PASID
> > shouldn't be arbitrary, whether it's provided by the kernel or the VMM.  
> 
> I wasn't going in this direction. I was thinking to make the VMM
> create the config space layout that is appropriate and hold it stable
> as a migration ABI.
> 
> I think in practice many VMMs are going to do this anyhow unless we
> put full support for config space synthesis, stable versions, and
> version selection in the kernel directly. I was thinking to avoid
> doing that.

Currently QEMU relies on determinism: a given command line results in
an identical machine configuration and identical devices.  The state of
that target VM is then populated by, not defined by, the migration stream.

> > Regardless, the VMM ultimately has authority over what the guest
> > sees in config space.  The VMM is not bound to expose the PASID at the
> > offset provided by the kernel, or bound to expose it at all.  The
> > kernel exposed PASID can simply provide an available location and set
> > of enabled capabilities.   
> 
> And if the VMM is going to ignore the kernel layout then why do so
> much work in the kernel to create it?

Ok, let's not ignore it ;)

> I think we need to decide, either only the VMM or only the kernel
> should do this.

What are you actually proposing?  Are you suggesting that if a device
supports the Power Management capability it will be virtualized at
offset 0x60, if the device supports the MSI capability it will be
virtualized at 0x68,... if a device supports PASID it will be
virtualized at offset 0x300, etc...?

That's not only impractical because we can't lay out all the capabilities
within the available space, but also because we will run into masking
hidden registers and devices where the driver hard codes a capability
offset.
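
For reference, the lookup a driver is expected to do instead of hard
coding an offset can be sketched as below.  This is a minimal,
hypothetical userspace sketch over a 4K config space image, not vfio or
QEMU code; cfg_read32(), put_ext_cap_hdr() and find_ext_cap() are
made-up names.  The only facts assumed from the PCIe spec are the
extended capability header layout (ID in bits 15:0, next pointer in
bits 31:20, list starting at 0x100) and the PASID capability ID 0x1b.

```c
#include <stdint.h>

#define CFG_SIZE          4096
#define EXT_CAP_START     0x100
#define EXT_CAP_ID_PASID  0x1b

/* Read a little-endian dword from the config space image. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Illustration helper: write an extended capability header at 'off'. */
static void put_ext_cap_hdr(uint8_t *cfg, unsigned int off, uint16_t id,
			    uint8_t ver, uint16_t next)
{
	uint32_t hdr = (uint32_t)id | ((uint32_t)(ver & 0xf) << 16) |
		       ((uint32_t)(next & 0xffc) << 20);

	cfg[off] = hdr & 0xff;
	cfg[off + 1] = (hdr >> 8) & 0xff;
	cfg[off + 2] = (hdr >> 16) & 0xff;
	cfg[off + 3] = (hdr >> 24) & 0xff;
}

/*
 * Walk the extended capability chain and return the offset of the
 * first capability with the given ID, or -1 if it is not present.
 * The ttl bounds the walk in case the chain loops.
 */
static int find_ext_cap(const uint8_t *cfg, uint16_t id)
{
	unsigned int off = EXT_CAP_START;
	int ttl = (CFG_SIZE - EXT_CAP_START) / 8;

	while (off >= EXT_CAP_START && off <= CFG_SIZE - 4 && ttl-- > 0) {
		uint32_t hdr = cfg_read32(cfg, off);

		if (hdr == 0 || hdr == 0xffffffff)
			break;			/* empty or absent list */
		if ((hdr & 0xffff) == id)
			return (int)off;
		off = (hdr >> 20) & 0xffc;	/* next pointer, dword aligned */
	}
	return -1;
}
```

A guest driver doing this finds PASID wherever the VMM or kernel chose
to place it; only a driver that skips the walk breaks when the
capability moves.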

If the VMM implements the "find a gap" solution then it's just as
subject to config space changes in hardware or provided by the variant
driver.
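
To make that concrete, here is a rough sketch of what a "find a gap"
placement might look like.  It is hypothetical: struct cap_range and
find_gap() are invented for illustration, the caller is assumed to
already know the length of every occupied capability (the header does
not encode it), and it assumes gaps really are free, which is exactly
what hidden registers can violate.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_CFG_START  0x100
#define EXT_CFG_END    0x1000

/* One occupied region of extended config space (hypothetical type). */
struct cap_range {
	uint16_t start;
	uint16_t len;
};

/*
 * Return the lowest dword-aligned offset in extended config space with
 * 'need' free bytes, or -1 if none exists.  'r' must be sorted by
 * start offset.
 */
static int find_gap(const struct cap_range *r, size_t n, uint16_t need)
{
	unsigned int cur = EXT_CFG_START;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned int end;

		/* Enough room between the cursor and this capability? */
		if (r[i].start >= cur && r[i].start - cur >= need)
			return (int)cur;

		/* Otherwise skip past it, rounding up to a dword. */
		end = ((unsigned int)r[i].start + r[i].len + 3) & ~3u;
		if (end > cur)
			cur = end;
	}
	return (cur + need <= EXT_CFG_END) ? (int)cur : -1;
}
```

The result is purely a function of the occupied ranges passed in, so
two layouts that differ in one capability's size or position can yield
different offsets for the same 8-byte PASID capability.  Doing this in
the VMM rather than the kernel does not by itself make the result
stable.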

If neither of those, are we hard coding a device specific config
space map into the VMM or providing one on the command line?  I thought
we were using vfio-pci variant drivers and a defined vfio migration API
in order to prevent modifying the VMM for every device we want to
support migration.  Also I'd wonder if the driver itself shouldn't be
configured to provide a compatible type.  Thanks,

Alex