* Alex Williamson (alex.williamson@xxxxxxxxxx) wrote:
> On Wed, 25 Apr 2018 21:00:39 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
> > On 4/25/2018 4:29 AM, Alex Williamson wrote:
> > > On Wed, 25 Apr 2018 01:20:08 +0530
> > > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> > >
> > >> On 4/24/2018 3:10 AM, Alex Williamson wrote:
> > >>> On Wed, 18 Apr 2018 12:31:53 -0600
> > >>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> > >>>
> > >>>> On Mon, 9 Apr 2018 12:35:10 +0200
> > >>>> Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote:
> > >>>>
> > >>>>> This little series adds three drivers, for demo-ing and testing vfio
> > >>>>> display interface code. There is one mdev device for each interface
> > >>>>> type (mdpy.ko for region and mbochs.ko for dmabuf).
> > >>>>
> > >>>> Erik Skultety brought up a good question today regarding how libvirt is
> > >>>> meant to handle these different flavors of display interfaces and
> > >>>> knowing whether a given mdev device has display support at all. It
> > >>>> seems that we cannot simply use the default display=auto because
> > >>>> libvirt needs to specifically configure gl support for a dmabuf type
> > >>>> interface versus not having such a requirement for a region interface,
> > >>>> perhaps even removing the emulated graphics in some cases (though I
> > >>>> don't think we have boot graphics through either solution yet).
> > >>>> Additionally, GVT-g seems to need the x-igd-opregion support
> > >>>> enabled(?), which is a non-starter for libvirt as it's an experimental
> > >>>> option!
> > >>>>
> > >>>> Currently the only way to determine display support is through the
> > >>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
> > >>>> their own they'd need to get to the point where they could open the
> > >>>> vfio device and perform the ioctl. That means opening a vfio
> > >>>> container, adding the group, setting the iommu type, and getting the
> > >>>> device. I was initially a bit appalled at asking libvirt to do that,
> > >>>> but the alternative is to put this information in sysfs, and doing
> > >>>> that risks that we need to describe every nuance of the mdev device
> > >>>> through sysfs and it becomes a dumping ground for every possible
> > >>>> feature an mdev device might have.
> ...
> > >>>> So I was ready to return and suggest that maybe libvirt should probe
> > >>>> the device to know about these ancillary configuration details, but
> > >>>> then I remembered that both mdev vGPU vendors had external dependencies
> > >>>> to even allow probing the device. KVMGT will fail to open the device
> > >>>> if it's not associated with an instance of KVM, and NVIDIA vGPU, I
> > >>>> believe, will fail if the vGPU manager process cannot find the QEMU
> > >>>> instance to extract the VM UUID. (Both of these were bad ideas.)
> > >>>
> > >>> Here's another proposal that's really growing on me:
> > >>>
> > >>> * Fix the vendor drivers! Allow devices to be opened and probed
> > >>>   without these external dependencies.
> > >>> * Libvirt uses the existing vfio API to open the device and probe the
> > >>>   necessary ioctls; if it can't probe the device, the feature is
> > >>>   unavailable, i.e. display=off, no migration.
> > >>>
> > >>
> > >> I'm trying to think of a simpler mechanism using sysfs that could work
> > >> for any feature, and that would let libvirt check source-destination
> > >> migration compatibility before initiating migration.
> > >>
> > >> I have another proposal:
> > >> * Add an ioctl VFIO_DEVICE_PROBE_FEATURES
> > >>   struct vfio_device_features {
> > >>       __u32 argsz;
> > >>       __u32 features;
> > >>   };
> > >>
> > >>   Define a bit for each feature:
> > >>   #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0)
> > >>   #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1)
> > >>   #define VFIO_DEVICE_FEATURE_MIGRATION      (1 << 2)
> > >>
> > >> * Vendor driver returns a bitmask of supported features during the
> > >>   initialization phase.
> > >>
> > >> * In the vfio core module, trap this ioctl for each device in
> > >>   vfio_device_fops_unl_ioctl(),
> > >
> > > Whoops, chicken-and-egg problem: VFIO_GROUP_GET_DEVICE_FD is our
> > > blocking point with mdev drivers; we can't get a device fd, so we can't
> > > call an ioctl on the device fd.
> > >
> >
> > I'm sorry, I thought we could expose features when QEMU initializes, but
> > libvirt needs to know the supported features before QEMU initializes.
> >
> >
> > >> check the features bitmask returned by the vendor
> > >> driver and add a sysfs file if the feature is supported by that device.
> > >> This sysfs file would return 0/1.
> > >
> > > I don't understand why we have an ioctl interface: if the user can get
> > > to the device fd then we have existing interfaces to probe these
> > > things. It seems like you're just wanting to pass a features bitmap
> > > through to vfio_add_group_dev() that vfio-core would expose through
> > > sysfs, but a list of feature bits doesn't convey enough info except for
> > > the most basic uses.
> > >
> >
> > Yes, vfio_add_group_dev() seems to be a better way to convey features to
> > the vfio core.
> >
> > >> For migration this bit will only indicate whether the host driver
> > >> supports the migration feature.
> > >>
> > >> For the source-destination compatibility check, libvirt would need
> > >> more data/variables to check, like:
> > >> * whether the same 'mdev_type' device is create-able at the
> > >>   destination, i.e. if ('mdev_type'->available_instances > 0)
> > >>
> > >> * whether host_driver_version at source and destination are compatible.
> > >>   Host drivers from the same release branch should be mostly
> > >>   compatible, but if there are major changes in structures or APIs,
> > >>   host drivers from different branches might not be compatible. For
> > >>   example, if source and destination are from different branches and
> > >>   one of the structures had changed, then data collected at the source
> > >>   might not be compatible with the structures at the destination, and
> > >>   typecasting it to the changed structures would mess up the migrated
> > >>   data during restoration.
> > >
> > > Of course now you're asking that libvirt understand the release
> > > versioning scheme of every vendor driver and that it remain
> > > programmatically consistent. We can't even do this with in-kernel
> > > drivers. And in the end, still the best we can do is guess.
> > >
> >
> > Libvirt doesn't need to understand the version; libvirt needs to do a
> > strcmp of the version strings from source and destination. If they are
> > equal, then libvirt would understand that they are compatible.
>
> Who's to say that the driver version and migration compatibility have
> any relation at all? Some drivers might focus on designing their own
> migration interface that can maintain compatibility across versions
> (QEMU does this); some drivers may only allow identical-version
> migration (which is going to frustrate upper-level management tools and
> customers - RHEL goes to great lengths to support cross-version
> migration). We cannot have a one-size-fits-all rule here where the
> driver version completely defines migration compatibility.
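To make the proposal above concrete, here is a minimal userspace sketch
of how the suggested VFIO_DEVICE_PROBE_FEATURES ioctl might be
exercised. The struct and feature bits are taken from the proposal; the
ioctl number encoding is an assumption, and, as the thread notes,
obtaining device_fd via VFIO_GROUP_GET_DEVICE_FD is exactly the
sticking point for mdev drivers:

#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

/* Proposed additions -- not in linux/vfio.h; the ioctl number is
 * hypothetical. */
struct vfio_device_features {
	__u32 argsz;
	__u32 features;
};
#define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
#define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
#define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
#define VFIO_DEVICE_PROBE_FEATURES	_IO(VFIO_TYPE, VFIO_BASE + 20)

/* device_fd would come from VFIO_GROUP_GET_DEVICE_FD -- the
 * chicken-and-egg problem discussed above. */
static int probe_features(int device_fd, __u32 *features)
{
	struct vfio_device_features feat = { .argsz = sizeof(feat) };

	if (ioctl(device_fd, VFIO_DEVICE_PROBE_FEATURES, &feat) < 0)
		return -1;
	*features = feat.features;
	return 0;
}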
I'll agree; I don't know enough about these devices, but to give you
some examples of things I'd expect to work:
  a) A user adds new machines to their data centre with a larger/newer
     version of the same vendor's GPU; in some cases that should work
     (depending on vendor details etc.)
  b) The same thing, but with identical hardware and a newer driver on
     the destination.

Obviously there will be some cut-offs that say some versions are
incompatible; but for normal migration we jump through serious hoops to
make sure stuff works, and customers will expect the same with some
VFIO devices.

> > >> * whether guest_driver_version is compatible with the host driver at
> > >>   the destination. For mdev devices, the guest driver communicates
> > >>   with the host driver in some form. If there are changes in the
> > >>   structures/APIs of such communication, the guest driver at the
> > >>   source might not be compatible with the host driver at the
> > >>   destination.
> > >
> > > And another guess, plus now the guest driver is involved, which
> > > libvirt has no visibility to.
> > >
> >
> > As above, libvirt needs to do a strcmp.
>
> Insufficient, imo
>
> > >> The 'available_instances' sysfs attribute already exists; the latter
> > >> two should be added by the vendor driver, and libvirt can use them
> > >> for the migration compatibility check.
> > >
> > > As noted previously, display and migration are not necessarily
> > > mdev-only features; it's possible that vfio-pci or vfio-platform could
> > > also implement these, so the sysfs interface cannot be restricted to
> > > the mdev template and lifecycle interface.
> > >
> >
> > I agree.
> > The feature bitmask passed to the vfio core is not mdev specific. But
> > here 'available_instances' for the migration compatibility check is
> > mdev specific. If the mdev device is not create-able at the
> > destination, there is no point in libvirt initiating the migration.
>
> 'available_instances' for migration compatibility check...? We use
> available_instances to know whether we have the resources to create a
> given mdev type. It's certainly a prerequisite to have a device of the
> identical type at the migration target, and how we define what is an
> identical device for a directly assigned PCI device is yet another
> overly complicated rat hole. But an identical device doesn't
> necessarily imply migration compatibility, and I think that's the
> problem we're tackling. We cannot assume based only on the device type
> that migration is compatible; that's basically saying we're never going
> to have any bugs or oversights or new features in the migration stream.

Those things certainly happen: state that we forgot to transfer, new
features enabled on devices, devices configured in different ways.
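For reference, the prerequisite check being discussed is already
possible with the existing mdev sysfs layout, roughly as sketched
below. The parent device path is illustrative (a real tool would
discover the parent), and, per the point above, this establishes
create-ability only, not migration-stream compatibility:

#include <stdio.h>

/*
 * Check whether at least one instance of an mdev type can be created
 * on this host, using the existing mdev_supported_types sysfs layout.
 * Returns 1 if create-able, 0 if not, -1 on error.
 */
static int mdev_type_creatable(const char *parent, const char *type)
{
	char path[512];
	FILE *f;
	int avail;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/mdev_supported_types/%s/available_instances",
		 parent, type);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &avail) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return avail > 0;
}

For a GVT-g device this might be called as, e.g.,
mdev_type_creatable("0000:00:02.0", "i915-GVTg_V4_2") -- both arguments
here are example values.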
> Chatting with Laine, it may be worth a step back to include migration
> experts and people up the stack with more visibility into how openstack
> operates. The issue here is that if vfio gains migration support then
> we have a portion of the migration stream that is not under the control
> of QEMU; we cannot necessarily tie it to a QEMU machine type and we
> cannot necessarily dictate how the vfio bus driver (vendor driver)
> handles versioning and compatibility. My intent was to expose some
> sort of migration information through the vfio API so that upper-level
> tools could determine source and target compatibility, but this is, I
> think, something new in itself, and those tools need to agree on how it
> might be done. How would something like openstack want to handle not
> only finding a migration target with a compatible device, but also
> verifying whether the device supports the migration format of the
> source device?
>
> Alternatively, should we do anything? Is the problem too hard, and
> should we just let the driver return an error when it receives an
> incompatible migration stream, aborting the migration?

It's a bit nasty; if you've hit the 'evacuate host' button, then what
happens when you've got some incompatible hosts?

Dave

> > > One more try... we have a vfio_group fd. This is created by the bus
> > > drivers calling vfio_add_group_dev() and registers a struct device, a
> > > struct vfio_device_ops, and private data. Typically we only wire the
> > > device_ops to the resulting file descriptor we get from
> > > VFIO_GROUP_GET_DEVICE_FD, but could we enable a sort of nested ioctl
> > > through the group fd? The ioctl would need to take a string arg to
> > > match to a device name, plus an ioctl cmd and arg for the device_ops
> > > ioctl. The group ioctl would need to filter cmds to known, benign
> > > queries. We'd also need to verify that the allowed ioctls have no
> > > dependencies on setup done in device_ops.open().
> >
> > So these ioctls would be called without the device's open() call;
> > doesn't this seem to go against the file-operations standard?
>
> vfio_device_ops is modeled largely after file operations, but I don't
> think we're bound by that for the interaction between vfio-core and the
> vfio bus drivers. We could make a separate callback for unprivileged
> ioctls, but that seems like more work per driver when we really want to
> maintain the identical API; we just want to provide a more limited
> interface and change the calling point.
>
> An issue I thought of for migration, though, is that this path wouldn't
> have access to the migration region, and therefore if we place a header
> within that region containing the compatibility and versioning
> information, the user still couldn't access it. This doesn't seem to
> be a blocker though, as we could put that information within the region
> capability that defines the region as used for migration. Possibly a
> device could have multiple migration regions with different formats
> for backwards compatibility; of course then we'd need a way to
> determine which to use and which combinations have been validated.
> Thanks,
>
> Alex

-- 
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
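A rough sketch of the region-capability idea from the closing paragraph
above: version/compatibility data could be chained off the region info
through the existing vfio_info_cap_header mechanism, so userspace can
read it via VFIO_DEVICE_GET_REGION_INFO before touching the migration
region itself. Only vfio_info_cap_header is existing UAPI; the
capability ID and payload below are hypothetical:

#include <linux/types.h>
#include <linux/vfio.h>

/* Hypothetical capability, chained from vfio_region_info via
 * cap_offset in the same way VFIO_REGION_INFO_CAP_SPARSE_MMAP is
 * today. */
#define VFIO_REGION_INFO_CAP_MIGRATION	8

struct vfio_region_info_cap_migration {
	struct vfio_info_cap_header header;	/* existing UAPI header */
	__u32 stream_version;	/* vendor's migration stream format */
	__u32 flags;		/* reserved for future use */
	__u8  compat[64];	/* opaque vendor compatibility string for
				 * source/target matching by upper-level
				 * tools */
};

A device exposing multiple migration regions for backwards
compatibility, as suggested above, would then carry one such capability
per region, and a management tool could match stream_version and compat
across source and target before starting the migration.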