* Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> On Wed, 25 Apr 2018 21:00:39 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
> > On 4/25/2018 4:29 AM, Alex Williamson wrote:
> > > On Wed, 25 Apr 2018 01:20:08 +0530
> > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > >
> > >> On 4/24/2018 3:10 AM, Alex Williamson wrote:
> > >>> On Wed, 18 Apr 2018 12:31:53 -0600
> > >>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> > >>>
> > >>>> On Mon, 9 Apr 2018 12:35:10 +0200
> > >>>> Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote:
> > >>>>
> > >>>>> This little series adds three drivers, for demo-ing and testing vfio
> > >>>>> display interface code. There is one mdev device for each interface
> > >>>>> type (mdpy.ko for region and mbochs.ko for dmabuf).
> > >>>>
> > >>>> Erik Skultety brought up a good question today regarding how libvirt is
> > >>>> meant to handle these different flavors of display interfaces and
> > >>>> knowing whether a given mdev device has display support at all. It
> > >>>> seems that we cannot simply use the default display=auto because
> > >>>> libvirt needs to specifically configure gl support for a dmabuf type
> > >>>> interface versus not having such a requirement for a region interface,
> > >>>> perhaps even removing the emulated graphics in some cases (though I
> > >>>> don't think we have boot graphics through either solution yet).
> > >>>> Additionally, GVT-g seems to need the x-igd-opregion support
> > >>>> enabled(?), which is a non-starter for libvirt as it's an experimental
> > >>>> option!
> > >>>>
> > >>>> Currently the only way to determine display support is through the
> > >>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
> > >>>> their own they'd need to get to the point where they could open the
> > >>>> vfio device and perform the ioctl. That means opening a vfio
> > >>>> container, adding the group, setting the iommu type, and getting the
> > >>>> device. I was initially a bit appalled at asking libvirt to do that,
> > >>>> but the alternative is to put this information in sysfs, and doing
> > >>>> that risks that we need to describe every nuance of the mdev device
> > >>>> through sysfs and it becomes a dumping ground for every possible
> > >>>> feature an mdev device might have.
> ...
> > >>>> So I was ready to return and suggest that maybe libvirt should probe
> > >>>> the device to know about these ancillary configuration details, but
> > >>>> then I remembered that both mdev vGPU vendors had external dependencies
> > >>>> to even allow probing the device. KVMGT will fail to open the device
> > >>>> if it's not associated with an instance of KVM, and NVIDIA vGPU, I
> > >>>> believe, will fail if the vGPU manager process cannot find the QEMU
> > >>>> instance to extract the VM UUID. (Both of these were bad ideas.)
> > >>>
> > >>> Here's another proposal that's really growing on me:
> > >>>
> > >>> * Fix the vendor drivers! Allow devices to be opened and probed
> > >>>   without these external dependencies.
> > >>> * Libvirt uses the existing vfio API to open the device and probe the
> > >>>   necessary ioctls; if it can't probe the device, the feature is
> > >>>   unavailable, i.e. display=off, no migration.
> > >>>
> > >>
> > >> I'm trying to think of a simpler mechanism using sysfs that could work
> > >> for any feature, and that would let libvirt check source-destination
> > >> migration compatibility before initiating migration.
> > >>
> > >> I have another proposal:
> > >> * Add an ioctl VFIO_DEVICE_PROBE_FEATURES
> > >>   struct vfio_device_features {
> > >>       __u32 argsz;
> > >>       __u32 features;
> > >>   };
> > >>
> > >>   Define a bit for each feature:
> > >>   #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
> > >>   #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
> > >>   #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)
> > >>
> > >> * Vendor driver returns a bitmask of supported features during the
> > >>   initialization phase.
> > >>
> > >> * In the vfio core module, trap this ioctl for each device in
> > >>   vfio_device_fops_unl_ioctl(),
> > >
> > > Whoops, chicken-and-egg problem: VFIO_GROUP_GET_DEVICE_FD is our
> > > blocking point with mdev drivers; we can't get a device fd, so we can't
> > > call an ioctl on the device fd.
> > >
> >
> > I'm sorry, I thought we could expose features when QEMU initializes, but
> > libvirt needs to know the supported features before QEMU initializes.
> >
> >
> > >> check the features bitmask returned by the vendor
> > >> driver and add a sysfs file if the feature is supported by that device.
> > >> This sysfs file would return 0/1.
> > >
> > > I don't understand why we have an ioctl interface: if the user can get
> > > to the device fd then we have existing interfaces to probe these
> > > things. It seems like you're just wanting to pass a features bitmap
> > > through to vfio_add_group_dev() that vfio-core would expose through
> > > sysfs, but a list of feature bits doesn't convey enough info except for
> > > the most basic uses.
> > >
> >
> > Yes, vfio_add_group_dev() seems to be a better way to convey features to
> > the vfio core.
> >
> > >> For migration this bit will only indicate whether the host driver
> > >> supports the migration feature.
> > >>
> > >> For the source-destination compatibility check, libvirt would need
> > >> more data/variables to check, like:
> > >> * whether the same 'mdev_type' device is create-able at the
> > >>   destination, i.e. if ('mdev_type'->available_instances > 0)
> > >>
> > >> * whether host_driver_version at source and destination are compatible.
> > >>   Host drivers from the same release branch should be mostly
> > >>   compatible, but if there are major changes in structures or APIs,
> > >>   host drivers from different branches might not be compatible. For
> > >>   example, if source and destination are from different branches and
> > >>   one of the structures had changed, then data collected at the source
> > >>   might not be compatible with the structures at the destination, and
> > >>   typecasting it to the changed structures would mess up the migrated
> > >>   data during restoration.
> > >
> > > Of course now you're asking that libvirt understand the release
> > > versioning scheme of every vendor driver and that it remain
> > > programmatically consistent. We can't even do this with in-kernel
> > > drivers. And in the end, still the best we can do is guess.
> > >
> >
> > Libvirt doesn't need to understand the version; libvirt needs to do a
> > strcmp of the version strings from source and destination. If they are
> > equal, then libvirt would understand that they are compatible.
>
> Who's to say that the driver version and migration compatibility have
> any relation at all? Some drivers might focus on designing their own
> migration interface that can maintain compatibility across versions
> (QEMU does this); some drivers may only allow identical-version
> migration (which is going to frustrate upper-level management tools and
> customers - RHEL goes to great lengths to support cross-version
> migration). We cannot have a one-size-fits-all rule here where the
> driver version completely defines migration compatibility.
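To make the proposal above concrete, here is a minimal userspace sketch
of how the suggested VFIO_DEVICE_PROBE_FEATURES ioctl might be
exercised. The struct and feature bits are taken from the proposal; the
ioctl number encoding is an assumption, and, as the thread notes,
obtaining device_fd via VFIO_GROUP_GET_DEVICE_FD is exactly the
sticking point for mdev drivers:

#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

/* Proposed additions -- not in linux/vfio.h; the ioctl number is
 * hypothetical. */
struct vfio_device_features {
	__u32 argsz;
	__u32 features;
};
#define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
#define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
#define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
#define VFIO_DEVICE_PROBE_FEATURES	_IO(VFIO_TYPE, VFIO_BASE + 20)

/* device_fd would come from VFIO_GROUP_GET_DEVICE_FD -- the
 * chicken-and-egg problem discussed above. */
static int probe_features(int device_fd, __u32 *features)
{
	struct vfio_device_features feat = { .argsz = sizeof(feat) };

	if (ioctl(device_fd, VFIO_DEVICE_PROBE_FEATURES, &feat) < 0)
		return -1;
	*features = feat.features;
	return 0;
}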
I'll agree; I don't know enough about these devices, but to give you
some examples of things I'd expect to work:
  a) A user adds new machines to their data centre with a larger/newer
     version of the same vendor's GPU; in some cases that should work
     (depending on vendor details etc.)
  b) The same thing, but with identical hardware and a newer driver on
     the destination.

Obviously there will be some cut-offs that say some versions are
incompatible; but for normal migration we jump through serious hoops to
make sure stuff works, and customers will expect the same with some
VFIO devices.

> > >> * whether guest_driver_version is compatible with the host driver at
> > >>   the destination. For mdev devices, the guest driver communicates
> > >>   with the host driver in some form. If there are changes in the
> > >>   structures/APIs of such communication, the guest driver at the
> > >>   source might not be compatible with the host driver at the
> > >>   destination.
> > >
> > > And another guess, plus now the guest driver is involved, which
> > > libvirt has no visibility to.
> > >
> >
> > As above, libvirt needs to do a strcmp.
>
> Insufficient, imo
>
> > >> The 'available_instances' sysfs attribute already exists; the latter
> > >> two should be added by the vendor driver, and libvirt can use them
> > >> for the migration compatibility check.
> > >
> > > As noted previously, display and migration are not necessarily
> > > mdev-only features; it's possible that vfio-pci or vfio-platform could
> > > also implement these, so the sysfs interface cannot be restricted to
> > > the mdev template and lifecycle interface.
> > >
> >
> > I agree.
> > The feature bitmask passed to the vfio core is not mdev specific. But
> > here 'available_instances' for the migration compatibility check is
> > mdev specific. If the mdev device is not create-able at the
> > destination, there is no point in libvirt initiating the migration.
>
> 'available_instances' for migration compatibility check...? We use
> available_instances to know whether we have the resources to create a
> given mdev type. It's certainly a prerequisite to have a device of the
> identical type at the migration target, and how we define what is an
> identical device for a directly assigned PCI device is yet another
> overly complicated rat hole. But an identical device doesn't
> necessarily imply migration compatibility, and I think that's the
> problem we're tackling. We cannot assume based only on the device type
> that migration is compatible; that's basically saying we're never going
> to have any bugs or oversights or new features in the migration stream.

Those things certainly happen: state that we forgot to transfer, new
features enabled on devices, devices configured in different ways.
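For reference, the prerequisite check being discussed is already
possible with the existing mdev sysfs layout, roughly as sketched
below. The parent device path is illustrative (a real tool would
discover the parent), and, per the point above, this establishes
create-ability only, not migration-stream compatibility:

#include <stdio.h>

/*
 * Check whether at least one instance of an mdev type can be created
 * on this host, using the existing mdev_supported_types sysfs layout.
 * Returns 1 if create-able, 0 if not, -1 on error.
 */
static int mdev_type_creatable(const char *parent, const char *type)
{
	char path[512];
	FILE *f;
	int avail;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/mdev_supported_types/%s/available_instances",
		 parent, type);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &avail) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return avail > 0;
}

For a GVT-g device this might be called as, e.g.,
mdev_type_creatable("0000:00:02.0", "i915-GVTg_V4_2") -- both arguments
here are example values.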
> Chatting with Laine, it may be worth a step back to include migration
> experts and people up the stack with more visibility into how openstack
> operates. The issue here is that if vfio gains migration support then
> we have a portion of the migration stream that is not under the control
> of QEMU; we cannot necessarily tie it to a QEMU machine type and we
> cannot necessarily dictate how the vfio bus driver (vendor driver)
> handles versioning and compatibility. My intent was to expose some
> sort of migration information through the vfio API so that upper-level
> tools could determine source and target compatibility, but this is, I
> think, something new in itself, and those tools need to agree on how it
> might be done. How would something like openstack want to handle not
> only finding a migration target with a compatible device, but also
> verifying whether the device supports the migration format of the
> source device?
>
> Alternatively, should we do anything? Is the problem too hard, and
> should we just let the driver return an error when it receives an
> incompatible migration stream, aborting the migration?

It's a bit nasty; if you've hit the 'evacuate host' button, then what
happens when you've got some incompatible hosts?

Dave

> > > One more try... we have a vfio_group fd. This is created by the bus
> > > drivers calling vfio_add_group_dev() and registers a struct device, a
> > > struct vfio_device_ops, and private data. Typically we only wire the
> > > device_ops to the resulting file descriptor we get from
> > > VFIO_GROUP_GET_DEVICE_FD, but could we enable a sort of nested ioctl
> > > through the group fd? The ioctl would need to take a string arg to
> > > match to a device name, plus an ioctl cmd and arg for the device_ops
> > > ioctl. The group ioctl would need to filter cmds to known, benign
> > > queries. We'd also need to verify that the allowed ioctls have no
> > > dependencies on setup done in device_ops.open().
> >
> > So these ioctls would be called without the device's open() call;
> > doesn't this seem to go against the file-operations standard?
>
> vfio_device_ops is modeled largely after file operations, but I don't
> think we're bound by that for the interaction between vfio-core and the
> vfio bus drivers. We could make a separate callback for unprivileged
> ioctls, but that seems like more work per driver when we really want to
> maintain the identical API; we just want to provide a more limited
> interface and change the calling point.
>
> An issue I thought of for migration, though, is that this path wouldn't
> have access to the migration region, and therefore if we place a header
> within that region containing the compatibility and versioning
> information, the user still couldn't access it. This doesn't seem to
> be a blocker though, as we could put that information within the region
> capability that defines the region as used for migration. Possibly a
> device could have multiple migration regions with different formats
> for backwards compatibility; of course then we'd need a way to
> determine which to use and which combinations have been validated.
> Thanks,
>
> Alex

-- 
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
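A rough sketch of the region-capability idea from the closing paragraph
above: version/compatibility data could be chained off the region info
through the existing vfio_info_cap_header mechanism, so userspace can
read it via VFIO_DEVICE_GET_REGION_INFO before touching the migration
region itself. Only vfio_info_cap_header is existing UAPI; the
capability ID and payload below are hypothetical:

#include <linux/types.h>
#include <linux/vfio.h>

/* Hypothetical capability, chained from vfio_region_info via
 * cap_offset in the same way VFIO_REGION_INFO_CAP_SPARSE_MMAP is
 * today. */
#define VFIO_REGION_INFO_CAP_MIGRATION	8

struct vfio_region_info_cap_migration {
	struct vfio_info_cap_header header;	/* existing UAPI header */
	__u32 stream_version;	/* vendor's migration stream format */
	__u32 flags;		/* reserved for future use */
	__u8  compat[64];	/* opaque vendor compatibility string for
				 * source/target matching by upper-level
				 * tools */
};

A device exposing multiple migration regions for backwards
compatibility, as suggested above, would then carry one such capability
per region, and a management tool could match stream_version and compat
across source and target before starting the migration.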