On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote: > * Alex Williamson (alex.williamson@xxxxxxxxxx) wrote: >> On Wed, 25 Apr 2018 21:00:39 +0530 >> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: >> >>> On 4/25/2018 4:29 AM, Alex Williamson wrote: >>>> On Wed, 25 Apr 2018 01:20:08 +0530 >>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: >>>> >>>>> On 4/24/2018 3:10 AM, Alex Williamson wrote: >>>>>> On Wed, 18 Apr 2018 12:31:53 -0600 >>>>>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: >>>>>> >>>>>>> On Mon, 9 Apr 2018 12:35:10 +0200 >>>>>>> Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote: >>>>>>> >>>>>>>> This little series adds three drivers, for demo-ing and testing vfio >>>>>>>> display interface code. There is one mdev device for each interface >>>>>>>> type (mdpy.ko for region and mbochs.ko for dmabuf). >>>>>>> >>>>>>> Erik Skultety brought up a good question today regarding how libvirt is >>>>>>> meant to handle these different flavors of display interfaces and >>>>>>> knowing whether a given mdev device has display support at all. It >>>>>>> seems that we cannot simply use the default display=auto because >>>>>>> libvirt needs to specifically configure gl support for a dmabuf type >>>>>>> interface versus not having such a requirement for a region interface, >>>>>>> perhaps even removing the emulated graphics in some cases (though I >>>>>>> don't think we have boot graphics through either solution yet). >>>>>>> Additionally, GVT-g seems to need the x-igd-opregion support >>>>>>> enabled(?), which is a non-starter for libvirt as it's an experimental >>>>>>> option! >>>>>>> >>>>>>> Currently the only way to determine display support is through the >>>>>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on >>>>>>> their own they'd need to get to the point where they could open the >>>>>>> vfio device and perform the ioctl. That means opening a vfio >>>>>>> container, adding the group, setting the iommu type, and getting the >>>>>>> device. I was initially a bit appalled at asking libvirt to do that, >>>>>>> but the alternative is to put this information in sysfs, but doing that >>>>>>> we risk that we need to describe every nuance of the mdev device >>>>>>> through sysfs and it becomes a dumping ground for every possible >>>>>>> feature an mdev device might have. >> ... >>>>>>> So I was ready to return and suggest that maybe libvirt should probe >>>>>>> the device to know about these ancillary configuration details, but >>>>>>> then I remembered that both mdev vGPU vendors had external dependencies >>>>>>> to even allow probing the device. KVMGT will fail to open the device >>>>>>> if it's not associated with an instance of KVM and NVIDIA vGPU, I >>>>>>> believe, will fail if the vGPU manager process cannot find the QEMU >>>>>>> instance to extract the VM UUID. (Both of these were bad ideas) >>>>>> >>>>>> Here's another proposal that's really growing on me: >>>>>> >>>>>> * Fix the vendor drivers! Allow devices to be opened and probed >>>>>> without these external dependencies. >>>>>> * Libvirt uses the existing vfio API to open the device and probe the >>>>>> necessary ioctls, if it can't probe the device, the feature is >>>>>> unavailable, ie. display=off, no migration. >>>>>> >>>>> >>>>> I'm trying to think simpler mechanism using sysfs that could work for >>>>> any feature and knowing source-destination migration compatibility check >>>>> by libvirt before initiating migration. >>>>> >>>>> I have another proposal: >>>>> * Add a ioctl VFIO_DEVICE_PROBE_FEATURES >>>>> struct vfio_device_features { >>>>> __u32 argsz; >>>>> __u32 features; >>>>> } >>>>> >>>>> Define bit for each feature: >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0) >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1) >>>>> #define VFIO_DEVICE_FEATURE_MIGRATION (1 << 2) >>>>> >>>>> * Vendor driver returns bitmask of supported features during >>>>> initialization phase. >>>>> >>>>> * In vfio core module, trap this ioctl for each device in >>>>> vfio_device_fops_unl_ioctl(), >>>> >>>> Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our >>>> blocking point with mdev drivers, we can't get a device fd, so we can't >>>> call an ioctl on the device fd. >>>> >>> >>> I'm sorry, I thought we could expose features when QEMU initialize, but >>> libvirt needs to know supported features before QEMU initialize. >>> >>> >>>>> check features bitmask returned by vendor >>>>> driver and add a sysfs file if feature is supported that device. This >>>>> sysfs file would return 0/1. >>>> >>>> I don't understand why we have an ioctl interface, if the user can get >>>> to the device fd then we have existing interfaces to probe these >>>> things, it seems like you're just wanting to pass a features bitmap >>>> through to vfio_add_group_dev() that vfio-core would expose through >>>> sysfs, but a list of feature bits doesn't convey enough info except for >>>> the most basic uses. >>>> >>> >>> Yes, vfio_add_group_dev() seems to be better way to convey features to >>> vfio core. >>> >>>>> For migration this bit will only indicate if host driver supports >>>>> migration feature. >>>>> >>>>> For source and destination compatibility check libvirt would need more >>>>> data/variables to check like, >>>>> * if same type of 'mdev_type' device create-able at destination, >>>>> i.e. if ('mdev_type'->available_instances > 0) >>>>> >>>>> * if host_driver_version at source and destination are compatible. >>>>> Host driver from same release branch should be mostly compatible, but if >>>>> there are major changes in structures or APIs, host drivers from >>>>> different branches might not be compatible, for example, if source and >>>>> destination are from different branches and one of the structure had >>>>> changed, then data collected at source might not be compatible with >>>>> structures at destination and typecasting it to changed structures would >>>>> mess up migrated data during restoration. >>>> >>>> Of course now you're asking that libvirt understand the release >>>> versioning scheme of every vendor driver and that it remain >>>> programatically consistent. We can't even do this with in-kernel >>>> drivers. And in the end, still the best we can do is guess. >>>> >>> >>> Libvirt doesn't need to understand the version, libvirt need to do >>> strcmp version string from source and destination. If those are equal, >>> then libvirt would understand that they are compatible. >> >> Who's to say that the driver version and migration compatibility have >> any relation at all? Some drivers might focus on designing their own >> migration interface that can maintain compatibility across versions >> (QEMU does this), some drivers may only allow identical version >> migration (which is going to frustrate upper level management tools and >> customers - RHEL goes to great extents to support cross version >> migration). We cannot have a one size fits all here that driver version >> defines completely the migration compatibility. > > I'll agree; I don't know enough about these devices, but to give you > some example of things I'd expect to work: > a) User adds new machines to their data centre with larger/newer > version of the same vendors GPU; in some cases that should work > (depending on vendor details etc) > b) The same thing but with identical hardware but a newer driver on > the destination. > > Obviously there will be some cut offs that say some versions are > incompatible; but for normal migration we jump through serious hoops > to make sure stuff works; customers will expect the same with some > VFIO devices. > How libvirt checks that cut off where some versions are incompatible? >>>>> * if guest_driver_version is compatible with host driver at destination. >>>>> For mdev devices, guest driver communicates with host driver in some >>>>> form. If there are changes in structures/APIs of such communication, >>>>> guest driver at source might not be compatible with host driver at >>>>> destination. >>>> >>>> And another guess plus now the guest driver is involved which libvirt >>>> has no visibility to. >>>> >>> >>> Like above libvirt need to do strcmp. >> >> Insufficient, imo >> >>>>> 'available_instances' sysfs already exist, later two should be added by >>>>> vendor driver which libvirt can use for migration compatibility check. >>>> >>>> As noted previously, display and migration are not necessarily >>>> mdev-only features, it's possible that vfio-pci or vfio-platform could >>>> also implement these, so the sysfs interface cannot be restricted to >>>> the mdev template and lifecycle interface. >>>> >>> >>> I agree. >>> Feature bitmask passed to vfio core is not mdev specific. But here >>> 'available_instances' for migration compatibility check is mdev >>> specific. If mdev device is not create-able at destination, there is no >>> point in initiating migration by libvirt. >> >> 'available_instances' for migration compatibility check...? We use >> available_instances to know whether we have the resources to create a >> given mdev type. It's certainly a prerequisite to have a device of the >> identical type at the migration target and how we define what is an >> identical device for a directly assigned PCI device is yet another >> overly complicated rat hole. But an identical device doesn't >> necessarily imply migration compatibility and I think that's the >> problem we're tackling. We cannot assume based only on the device type >> that migration is compatible, that's basically saying we're never going >> to have any bugs or oversights or new features in the migration stream. > > Those things certainly happen; state that we forgot to transfer, new > features enables on devices, devices configured in different ways. > How libvirt checks migration compatibility for other devices across QEMU versions where source support a device and destination running with older QEMU version doesn't support that device or that device doesn't exist in that system? Thanks, Kirti >> Chatting with Laine, it may be worth a step back to include migration >> experts and people up the stack with more visibility to how openstack >> operates. The issue here is that if vfio gains migration support then >> we have a portion of the migration stream that is not under the control >> of QEMU, we cannot necessarily tie it to a QEMU machine type and we >> cannot necessarily dictate how the vfio bus driver (vendor driver) >> handles versioning and compatibility. My intent was to expose some >> sort of migration information through the vfio API so that upper level >> tools could determine source and target compatibility, but this in >> itself is I think something new that those tools need to agree how it >> might be done. How would something like openstack want to handle not >> only finding a migration target with a compatible device, but also >> verifying if the device supports the migration format of the source >> device? >> >> Alternatively, should we do anything? Is the problem too hard and we >> should let the driver return an error when it receives an incompatible >> migration stream, aborting the migration? > > It's a bit nasty; if you've hit the 'evacuate host' button then what > happens when you've got some incompatible hosts. > > Dave > >>>> One more try... we have a vfio_group fd. This is created by the bus >>>> drivers calling vfio_add_group_dev() and registers a struct device, a >>>> struct vfio_device_ops, and private data. Typically we only wire the >>>> device_ops to the resulting file descriptor we get from >>>> VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl >>>> through the group fd? The ioctl would need to take a string arg to >>>> match to a device name, plus an ioctl cmd and arg for the device_ops >>>> ioctl. The group ioctl would need to filter cmds to known, benign >>>> queries. We'd also need to verify that the allowed ioctls have no >>>> dependencies on setup done in device_ops.open(). >>> >>> So these ioctls would be called without devices open() call, doesn't >>> this seem to be against file operations standard? >> >> vfio_device_ops is modeled largely after file operations, but I don't >> think we're bound by that for the interaction between vfio-core and the >> vfio bus drivers. We could make a separate callback for unprivileged >> ioctls, but that seems like more work per driver when we really want to >> maintain the identical API, we just want to provide a more limited >> interface and change the calling point. >> >> An issue I thought of for migration though is that this path wouldn't >> have access to the migration region and therefore if we place a header >> within that region containing the compatibility and versioning >> information, the user still couldn't access it. This doesn't seem to >> be a blocker though as we could put that information within the region >> capability that defines the region as used for migration. Possibly a >> device could have multiple migration regions with different formats >> for backwards compatibility, of course then we'd need a way to >> determine which to use and which combinations have been validated. >> Thanks, >> >> Alex > -- > Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK >