On 4/25/2018 4:29 AM, Alex Williamson wrote: > On Wed, 25 Apr 2018 01:20:08 +0530 > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > >> On 4/24/2018 3:10 AM, Alex Williamson wrote: >>> On Wed, 18 Apr 2018 12:31:53 -0600 >>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: >>> >>>> On Mon, 9 Apr 2018 12:35:10 +0200 >>>> Gerd Hoffmann <kraxel@xxxxxxxxxx> wrote: >>>> >>>>> This little series adds three drivers, for demo-ing and testing vfio >>>>> display interface code. There is one mdev device for each interface >>>>> type (mdpy.ko for region and mbochs.ko for dmabuf). >>>> >>>> Erik Skultety brought up a good question today regarding how libvirt is >>>> meant to handle these different flavors of display interfaces and >>>> knowing whether a given mdev device has display support at all. It >>>> seems that we cannot simply use the default display=auto because >>>> libvirt needs to specifically configure gl support for a dmabuf type >>>> interface versus not having such a requirement for a region interface, >>>> perhaps even removing the emulated graphics in some cases (though I >>>> don't think we have boot graphics through either solution yet). >>>> Additionally, GVT-g seems to need the x-igd-opregion support >>>> enabled(?), which is a non-starter for libvirt as it's an experimental >>>> option! >>>> >>>> Currently the only way to determine display support is through the >>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on >>>> their own they'd need to get to the point where they could open the >>>> vfio device and perform the ioctl. That means opening a vfio >>>> container, adding the group, setting the iommu type, and getting the >>>> device. I was initially a bit appalled at asking libvirt to do that, >>>> but the alternative is to put this information in sysfs, but doing that >>>> we risk that we need to describe every nuance of the mdev device >>>> through sysfs and it becomes a dumping ground for every possible >>>> feature an mdev device might have. >>>> >> >> One or two sysfs file for each feature shouldn't be that much of over >> head? In kernel, other subsystem modules expose capability through >> sysfs, like PCI subsystem adds 'boot_vga' file for VGA device which >> returns 0/1 depending on if its boot VGA device. Similarly >> 'd3cold_allowed', 'msi_bus'... > > Obviously we could add sysfs files, but unlike properties that the PCI > core exposes about struct pci_dev fields, the idea of a vfio_device is > much more abstract. Each bus driver creates its own device > representation, so we have a top level vfio_device referencing through > an opaque pointer a vfio_pci_device, vfio_platform_device, or > mdev_device, and each mdev vendor driver creates its own private data > structure below the mdev_device. So it's not quite a simple as one new > attribute "show" function to handle all devices of that bus_type. We > need a consistent implementation in each bus driver and vendor driver > or we need to figure out how to percolate the information up to the > vfio core. Your idea below seems to take the percolate approach. > >>>> So I was ready to return and suggest that maybe libvirt should probe >>>> the device to know about these ancillary configuration details, but >>>> then I remembered that both mdev vGPU vendors had external dependencies >>>> to even allow probing the device. KVMGT will fail to open the device >>>> if it's not associated with an instance of KVM and NVIDIA vGPU, I >>>> believe, will fail if the vGPU manager process cannot find the QEMU >>>> instance to extract the VM UUID. (Both of these were bad ideas) >>> >>> Here's another proposal that's really growing on me: >>> >>> * Fix the vendor drivers! Allow devices to be opened and probed >>> without these external dependencies. >>> * Libvirt uses the existing vfio API to open the device and probe the >>> necessary ioctls, if it can't probe the device, the feature is >>> unavailable, ie. display=off, no migration. >>> >> >> I'm trying to think simpler mechanism using sysfs that could work for >> any feature and knowing source-destination migration compatibility check >> by libvirt before initiating migration. >> >> I have another proposal: >> * Add a ioctl VFIO_DEVICE_PROBE_FEATURES >> struct vfio_device_features { >> __u32 argsz; >> __u32 features; >> } >> >> Define bit for each feature: >> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION (1 << 0) >> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF (1 << 1) >> #define VFIO_DEVICE_FEATURE_MIGRATION (1 << 2) >> >> * Vendor driver returns bitmask of supported features during >> initialization phase. >> >> * In vfio core module, trap this ioctl for each device in >> vfio_device_fops_unl_ioctl(), > > Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our > blocking point with mdev drivers, we can't get a device fd, so we can't > call an ioctl on the device fd. > I'm sorry, I thought we could expose features when QEMU initialize, but libvirt needs to know supported features before QEMU initialize. >> check features bitmask returned by vendor >> driver and add a sysfs file if feature is supported that device. This >> sysfs file would return 0/1. > > I don't understand why we have an ioctl interface, if the user can get > to the device fd then we have existing interfaces to probe these > things, it seems like you're just wanting to pass a features bitmap > through to vfio_add_group_dev() that vfio-core would expose through > sysfs, but a list of feature bits doesn't convey enough info except for > the most basic uses. > Yes, vfio_add_group_dev() seems to be better way to convey features to vfio core. >> For migration this bit will only indicate if host driver supports >> migration feature. >> >> For source and destination compatibility check libvirt would need more >> data/variables to check like, >> * if same type of 'mdev_type' device create-able at destination, >> i.e. if ('mdev_type'->available_instances > 0) >> >> * if host_driver_version at source and destination are compatible. >> Host driver from same release branch should be mostly compatible, but if >> there are major changes in structures or APIs, host drivers from >> different branches might not be compatible, for example, if source and >> destination are from different branches and one of the structure had >> changed, then data collected at source might not be compatible with >> structures at destination and typecasting it to changed structures would >> mess up migrated data during restoration. > > Of course now you're asking that libvirt understand the release > versioning scheme of every vendor driver and that it remain > programatically consistent. We can't even do this with in-kernel > drivers. And in the end, still the best we can do is guess. > Libvirt doesn't need to understand the version, libvirt need to do strcmp version string from source and destination. If those are equal, then libvirt would understand that they are compatible. >> * if guest_driver_version is compatible with host driver at destination. >> For mdev devices, guest driver communicates with host driver in some >> form. If there are changes in structures/APIs of such communication, >> guest driver at source might not be compatible with host driver at >> destination. > > And another guess plus now the guest driver is involved which libvirt > has no visibility to. > Like above libvirt need to do strcmp. >> 'available_instances' sysfs already exist, later two should be added by >> vendor driver which libvirt can use for migration compatibility check. > > As noted previously, display and migration are not necessarily > mdev-only features, it's possible that vfio-pci or vfio-platform could > also implement these, so the sysfs interface cannot be restricted to > the mdev template and lifecycle interface. > I agree. Feature bitmask passed to vfio core is not mdev specific. But here 'available_instances' for migration compatibility check is mdev specific. If mdev device is not create-able at destination, there is no point in initiating migration by libvirt. > One more try... we have a vfio_group fd. This is created by the bus > drivers calling vfio_add_group_dev() and registers a struct device, a > struct vfio_device_ops, and private data. Typically we only wire the > device_ops to the resulting file descriptor we get from > VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl > through the group fd? The ioctl would need to take a string arg to > match to a device name, plus an ioctl cmd and arg for the device_ops > ioctl. The group ioctl would need to filter cmds to known, benign > queries. We'd also need to verify that the allowed ioctls have no > dependencies on setup done in device_ops.open(). So these ioctls would be called without devices open() call, doesn't this seem to be against file operations standard? Thanks, Kirti > *_INFO and > QUERY_GFX_PLANE ioctls would be the only candidates. Bus drivers could > of course keep an open count in their private data so they know how the > ioctl is being called (if necessary) and the group fd only allows a > single open, so there's no risk that another user could interact with > the group in bad ways once the device is opened (and of course we use > file level access control on the group device file anyway). This is > sort of a rethink of Paolo's suggestion of a read-only fd, but the fd > is the existing group fd and any references to the device would only be > held around the calling of the nested ioctl. Could it work? Thanks, >