* Kirti Wankhede (kwankhede@xxxxxxxxxx) wrote: > > > On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote: > > * Alex Williamson (alex.williamson@xxxxxxxxxx) wrote: > >> On Wed, 25 Apr 2018 21:00:39 +0530 > >> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > >> > >>> On 4/25/2018 4:29 AM, Alex Williamson wrote: > >>>> On Wed, 25 Apr 2018 01:20:08 +0530 > >>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > >>>> > >>>>> On 4/24/2018 3:10 AM, Alex Williamson wrote: > >>>>>> On Wed, 18 Apr 2018 12:31:53 -0600 > >>>>>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > >>>>>> > >>>>>>> On Mon, 9 Apr 2018 12:35:10 +0200 > >>>>>>> Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote: > >>>>>>> > >>>>>>>> This little series adds three drivers, for demo-ing and testing vfio > >>>>>>>> display interface code. There is one mdev device for each interface > >>>>>>>> type (mdpy.ko for region and mbochs.ko for dmabuf). > >>>>>>> > >>>>>>> Erik Skultety brought up a good question today regarding how libvirt is > >>>>>>> meant to handle these different flavors of display interfaces and > >>>>>>> knowing whether a given mdev device has display support at all. It > >>>>>>> seems that we cannot simply use the default display=auto because > >>>>>>> libvirt needs to specifically configure gl support for a dmabuf type > >>>>>>> interface versus not having such a requirement for a region interface, > >>>>>>> perhaps even removing the emulated graphics in some cases (though I > >>>>>>> don't think we have boot graphics through either solution yet). > >>>>>>> Additionally, GVT-g seems to need the x-igd-opregion support > >>>>>>> enabled(?), which is a non-starter for libvirt as it's an experimental > >>>>>>> option! > >>>>>>> > >>>>>>> Currently the only way to determine display support is through the > >>>>>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on > >>>>>>> their own they'd need to get to the point where they could open the > >>>>>>> vfio device and perform the ioctl. That means opening a vfio > >>>>>>> container, adding the group, setting the iommu type, and getting the > >>>>>>> device. I was initially a bit appalled at asking libvirt to do that, > >>>>>>> but the alternative is to put this information in sysfs, but doing that > >>>>>>> we risk that we need to describe every nuance of the mdev device > >>>>>>> through sysfs and it becomes a dumping ground for every possible > >>>>>>> feature an mdev device might have. > >> ... > >>>>>>> So I was ready to return and suggest that maybe libvirt should probe > >>>>>>> the device to know about these ancillary configuration details, but > >>>>>>> then I remembered that both mdev vGPU vendors had external dependencies > >>>>>>> to even allow probing the device. KVMGT will fail to open the device > >>>>>>> if it's not associated with an instance of KVM and NVIDIA vGPU, I > >>>>>>> believe, will fail if the vGPU manager process cannot find the QEMU > >>>>>>> instance to extract the VM UUID. (Both of these were bad ideas) > >>>>>> > >>>>>> Here's another proposal that's really growing on me: > >>>>>> > >>>>>> * Fix the vendor drivers! Allow devices to be opened and probed > >>>>>> without these external dependencies. > >>>>>> * Libvirt uses the existing vfio API to open the device and probe the > >>>>>> necessary ioctls, if it can't probe the device, the feature is > >>>>>> unavailable, ie. display=off, no migration. > >>>>>> > >>>>> > >>>>> I'm trying to think simpler mechanism using sysfs that could work for > >>>>> any feature and knowing source-destination migration compatibility check > >>>>> by libvirt before initiating migration. > >>>>> > >>>>> I have another proposal: > >>>>> * Add a ioctl VFIO_DEVICE_PROBE_FEATURES > >>>>> struct vfio_device_features { > >>>>> __u32 argsz; > >>>>> __u32 features; > >>>>> } > >>>>> > >>>>> Define bit for each feature: > >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0) > >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1) > >>>>> #define VFIO_DEVICE_FEATURE_MIGRATION (1 << 2) > >>>>> > >>>>> * Vendor driver returns bitmask of supported features during > >>>>> initialization phase. > >>>>> > >>>>> * In vfio core module, trap this ioctl for each device in > >>>>> vfio_device_fops_unl_ioctl(), > >>>> > >>>> Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our > >>>> blocking point with mdev drivers, we can't get a device fd, so we can't > >>>> call an ioctl on the device fd. > >>>> > >>> > >>> I'm sorry, I thought we could expose features when QEMU initialize, but > >>> libvirt needs to know supported features before QEMU initialize. > >>> > >>> > >>>>> check features bitmask returned by vendor > >>>>> driver and add a sysfs file if feature is supported that device. This > >>>>> sysfs file would return 0/1. > >>>> > >>>> I don't understand why we have an ioctl interface, if the user can get > >>>> to the device fd then we have existing interfaces to probe these > >>>> things, it seems like you're just wanting to pass a features bitmap > >>>> through to vfio_add_group_dev() that vfio-core would expose through > >>>> sysfs, but a list of feature bits doesn't convey enough info except for > >>>> the most basic uses. > >>>> > >>> > >>> Yes, vfio_add_group_dev() seems to be better way to convey features to > >>> vfio core. > >>> > >>>>> For migration this bit will only indicate if host driver supports > >>>>> migration feature. > >>>>> > >>>>> For source and destination compatibility check libvirt would need more > >>>>> data/variables to check like, > >>>>> * if same type of 'mdev_type' device create-able at destination, > >>>>> i.e. if ('mdev_type'->available_instances > 0) > >>>>> > >>>>> * if host_driver_version at source and destination are compatible. > >>>>> Host driver from same release branch should be mostly compatible, but if > >>>>> there are major changes in structures or APIs, host drivers from > >>>>> different branches might not be compatible, for example, if source and > >>>>> destination are from different branches and one of the structure had > >>>>> changed, then data collected at source might not be compatible with > >>>>> structures at destination and typecasting it to changed structures would > >>>>> mess up migrated data during restoration. > >>>> > >>>> Of course now you're asking that libvirt understand the release > >>>> versioning scheme of every vendor driver and that it remain > >>>> programatically consistent. We can't even do this with in-kernel > >>>> drivers. And in the end, still the best we can do is guess. > >>>> > >>> > >>> Libvirt doesn't need to understand the version, libvirt need to do > >>> strcmp version string from source and destination. If those are equal, > >>> then libvirt would understand that they are compatible. > >> > >> Who's to say that the driver version and migration compatibility have > >> any relation at all? Some drivers might focus on designing their own > >> migration interface that can maintain compatibility across versions > >> (QEMU does this), some drivers may only allow identical version > >> migration (which is going to frustrate upper level management tools and > >> customers - RHEL goes to great extents to support cross version > >> migration). We cannot have a one size fits all here that driver version > >> defines completely the migration compatibility. > > > > I'll agree; I don't know enough about these devices, but to give you > > some example of things I'd expect to work: > > a) User adds new machines to their data centre with larger/newer > > version of the same vendors GPU; in some cases that should work > > (depending on vendor details etc) > > b) The same thing but with identical hardware but a newer driver on > > the destination. > > > > Obviously there will be some cut offs that say some versions are > > incompatible; but for normal migration we jump through serious hoops > > to make sure stuff works; customers will expect the same with some > > VFIO devices. > > > > How libvirt checks that cut off where some versions are incompatible? We have versioned 'machine types' - so for example QEMU has pc-i440fx-2.11 pc-i440fx-2.10 machine types; any version of qemu that supports machine type pc-i440fx-2.10 should behave the same to it's emulated devices. If we change the behaviour then we tie it to the new machine type; so the behaviour of a device in pc-i440fx-2.11 might be a bit different. Occasionally we'll kill off old machine types; (actually we should do it more!) - but certainly when we do downstream versions we tie it to machine types as well. We also have some migration-capability flags, so some features can only be used if both sides have that flag, and also Libvirt has some checking of host CPU flags. > >>>>> * if guest_driver_version is compatible with host driver at destination. > >>>>> For mdev devices, guest driver communicates with host driver in some > >>>>> form. If there are changes in structures/APIs of such communication, > >>>>> guest driver at source might not be compatible with host driver at > >>>>> destination. > >>>> > >>>> And another guess plus now the guest driver is involved which libvirt > >>>> has no visibility to. > >>>> > >>> > >>> Like above libvirt need to do strcmp. > >> > >> Insufficient, imo > >> > >>>>> 'available_instances' sysfs already exist, later two should be added by > >>>>> vendor driver which libvirt can use for migration compatibility check. > >>>> > >>>> As noted previously, display and migration are not necessarily > >>>> mdev-only features, it's possible that vfio-pci or vfio-platform could > >>>> also implement these, so the sysfs interface cannot be restricted to > >>>> the mdev template and lifecycle interface. > >>>> > >>> > >>> I agree. > >>> Feature bitmask passed to vfio core is not mdev specific. But here > >>> 'available_instances' for migration compatibility check is mdev > >>> specific. If mdev device is not create-able at destination, there is no > >>> point in initiating migration by libvirt. > >> > >> 'available_instances' for migration compatibility check...? We use > >> available_instances to know whether we have the resources to create a > >> given mdev type. It's certainly a prerequisite to have a device of the > >> identical type at the migration target and how we define what is an > >> identical device for a directly assigned PCI device is yet another > >> overly complicated rat hole. But an identical device doesn't > >> necessarily imply migration compatibility and I think that's the > >> problem we're tackling. We cannot assume based only on the device type > >> that migration is compatible, that's basically saying we're never going > >> to have any bugs or oversights or new features in the migration stream. > > > > Those things certainly happen; state that we forgot to transfer, new > > features enables on devices, devices configured in different ways. > > > > How libvirt checks migration compatibility for other devices across QEMU > versions where source support a device and destination running with > older QEMU version doesn't support that device or that device doesn't > exist in that system? Libvirt inspects the qemu to get lists of devices and capabilities; I'll leave it to the libvirt guys to add more detail if needed. Dave > Thanks, > Kirti > > >> Chatting with Laine, it may be worth a step back to include migration > >> experts and people up the stack with more visibility to how openstack > >> operates. The issue here is that if vfio gains migration support then > >> we have a portion of the migration stream that is not under the control > >> of QEMU, we cannot necessarily tie it to a QEMU machine type and we > >> cannot necessarily dictate how the vfio bus driver (vendor driver) > >> handles versioning and compatibility. My intent was to expose some > >> sort of migration information through the vfio API so that upper level > >> tools could determine source and target compatibility, but this in > >> itself is I think something new that those tools need to agree how it > >> might be done. How would something like openstack want to handle not > >> only finding a migration target with a compatible device, but also > >> verifying if the device supports the migration format of the source > >> device? > >> > >> Alternatively, should we do anything? Is the problem too hard and we > >> should let the driver return an error when it receives an incompatible > >> migration stream, aborting the migration? > > > > It's a bit nasty; if you've hit the 'evacuate host' button then what > > happens when you've got some incompatible hosts. > > > > Dave > > > >>>> One more try... we have a vfio_group fd. This is created by the bus > >>>> drivers calling vfio_add_group_dev() and registers a struct device, a > >>>> struct vfio_device_ops, and private data. Typically we only wire the > >>>> device_ops to the resulting file descriptor we get from > >>>> VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl > >>>> through the group fd? The ioctl would need to take a string arg to > >>>> match to a device name, plus an ioctl cmd and arg for the device_ops > >>>> ioctl. The group ioctl would need to filter cmds to known, benign > >>>> queries. We'd also need to verify that the allowed ioctls have no > >>>> dependencies on setup done in device_ops.open(). > >>> > >>> So these ioctls would be called without devices open() call, doesn't > >>> this seem to be against file operations standard? > >> > >> vfio_device_ops is modeled largely after file operations, but I don't > >> think we're bound by that for the interaction between vfio-core and the > >> vfio bus drivers. We could make a separate callback for unprivileged > >> ioctls, but that seems like more work per driver when we really want to > >> maintain the identical API, we just want to provide a more limited > >> interface and change the calling point. > >> > >> An issue I thought of for migration though is that this path wouldn't > >> have access to the migration region and therefore if we place a header > >> within that region containing the compatibility and versioning > >> information, the user still couldn't access it. This doesn't seem to > >> be a blocker though as we could put that information within the region > >> capability that defines the region as used for migration. Possibly a > >> device could have multiple migration regions with different formats > >> for backwards compatibility, of course then we'd need a way to > >> determine which to use and which combinations have been validated. > >> Thanks, > >> > >> Alex > > -- > > Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK > > -- Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK