Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach

Alex Williamson <alex.williamson@xxxxxxxxxx> · Fri, 26 Apr 2024 14:13:54 -0600

On Fri, 26 Apr 2024 11:11:17 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> 
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.  
> 
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

So as I understand it, the concern is that you have firmware that
supports migration, but it also openly hostile to the fundamental
aspects of exposing a stable device ABI in support of migration.

> > If a driver were to insert a
> > virtual capability, then yes it would want to be consistent about it if
> > it also cares about migration.  If the driver needs to change the
> > location of a virtual capability, problems will arise, but that's also
> > not something that every driver needs to do.  
> 
> Well, mlx5 has to cope with this. It supports so many devices with so
> many config space layouts :( I don't know if we can just hard wire an
> offset to stick in a PASID cap and expect that to work...
> 
> > Also, how exactly does emulating the capability in the VMM solve this
> > problem?  Currently QEMU migration simply applies state to an identical
> > VM on the target.  QEMU doesn't modify the target VM to conform to the
> > data stream.  So in either case, the problem might be more along the
> > lines of how to make a V1 device from a V2 driver, which is more the
> > device type/flavor/persona problem.  
> 
> Yes, it doesn't solve anything, it just puts the responsibility for
> something that is very complicated in userspace where there are more
> options to configure and customize it to the environment.
> 
> > Currently QEMU replies on determinism that a given command line results
> > in an identical machine configuration and identical devices.  State of
> > that target VM is then populated, not defined by, the migration stream.  
> 
> But that won't be true if the kernel is making decisions. The config
> space layout depends now on the kernel driver version too.

But in the cases where we support migration there's a device specific
variant driver that supports that migration.  It's the job of that
variant driver to not only export and import the device state, but also
to provide a consistent ABI to the user, which includes the config
space layout.  I don't understand why we'd say the device programming
ABI itself falls within the purview of the device/variant driver, but
PCI config space is defined by device specific code at a higher level.

> > > I think we need to decide, either only the VMM or only the kernel
> > > should do this.  
> > 
> > What are you actually proposing?  
> 
> Okay, what I'm thinking about is a text file that describes the vPCI
> function configuration space to create. The community will standardize
> this and VMMs will have to implement to get PASID/etc. Maybe the
> community will provide a BSD licensed library to do this job.
> 
> The text file allows the operator to specify exactly the configuration
> space the VFIO function should have. It would not be derived
> automatically from physical. AFAIK qemu does not have this capability
> currently.
> 
> This reflects my observation and discussions around the live migration
> standardization. I belive we are fast reaching a point where this is
> required.
> 
> Consider standards based migration between wildly different
> devices. The devices will not standardize their physical config space,
> but an operator could generate a consistent vPCI config space that
> works with all the devices in their fleet.
> 
> Consider the usual working model of the large operators - they define
> instance types with some regularity. But an instance type is fixed in
> concrete once it is specified, things like the vPCI config space are
> fixed.
> 
> Running Instance A on newer hardware with a changed physical config
> space should continue to present Instance A's vPCI config layout
> regardless. Ie Instance A might not support PASID but Instance B can
> run on newer HW that does. The config space layout depends on the
> requested Instance Type, not the physical layout.
> 
> The auto-configuration of the config layout from physical is a nice
> feature and is excellent for development/small scale, but it shouldn't
> be the only way to work.
> 
> So - if we accept that text file configuration should be something the
> VMM supports then let's reconsider how to solve the PASID problem.
> 
> I'd say the way to solve it should be via a text file specifying a
> full config space layout that includes the PASID cap. From the VMM
> perspective this works fine, and it ports to every VMM directly via
> processing the text file.
> 
> The autoconfiguration use case can be done by making a tool build the
> text file by deriving it from physical, much like today. The single
> instance of that tool could have device specific knowledge to avoid
> quirks. This way the smarts can still be shared by all the VMMs
> without going into the kernel. Special devices with hidden config
> space could get special quirks or special reference text files into
> the tool repo.
> 
> Serious operators doing production SRIOV/etc would negotiate the text
> file with the HW vendors when they define their Instance Type. Ideally
> these reference text files would be contributed to the tool repo
> above. I think there would be some nice idea to define fully open
> source Instance Types that include VFIO devices too.

Regarding "if we accept that text file configuration should be
something the VMM supports", I'm not on board with this yet, so
applying it to PASID discussion seems premature.

We've developed variant drivers specifically to host the device specific
aspects of migration support.  The requirement of a consistent config
space layout is a problem that only exists relative to migration.  This
is an issue that I would have considered the responsibility of the
variant driver, which would likely expect a consistent interface from
the hardware/firmware.  Why does hostile firmware suddenly make it the
VMM's problem to provide a consistent ABI to the config space of the
device rather than the variant driver?

Obviously config maps are something that a VMM could do, but it also
seems to impose a non-trivial burden that every VMM requires an
implementation of a config space map and integration for each device
rather than simply expecting the exposed config space of the device to
be part of the migration ABI.  Also this solution specifically only
addresses config space compatibility without considering the more
generic issue that a variant driver can expose different device
personas.  A versioned persona and config space virtualization in the
variant driver is a much more flexible solution.  Thanks,

Alex