Hi folks,

At KVM Forum we had a BoF session primarily around the mediated device
sysfs interface.  I'd like to share what I think we agreed on and the
"problem areas" that still need some work so we can get the thoughts
and ideas from those who weren't able to attend.

DanPB expressed some concern about the mdev_supported_types sysfs
interface, which exposes a flat csv file with fields like "type",
"number of instance", "vendor string", and then a bunch of type
specific fields like "framebuffer size", "resolution", "frame rate
limit", etc.  This is not entirely machine parsing friendly and sort of
abuses the sysfs concept of one value per file.  Example output taken
from Neo's libvirt RFC:

cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160

The create/destroy then looks like this:

echo "$mdev_UUID:vendor_specific_argument_list" >
	/sys/bus/pci/devices/.../mdev_create

echo "$mdev_UUID:vendor_specific_argument_list" >
	/sys/bus/pci/devices/.../mdev_destroy

"vendor_specific_argument_list" is nebulous.

So the idea to fix this is to explode this into a directory structure,
something like:

├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances

Note that I'm only exposing the minimal attributes here for simplicity,
the other attributes would be included in separate files and we would
require vendors to create standard attributes for common device classes.

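
As a rough sketch of how a management tool might consume the layout
above, the following shell functions walk an mdev_supported_types
directory and request a device of a given type.  The function names and
the "id|description|max_instances" output format are hypothetical; the
attribute files (create, description, max_instances) are the ones
proposed in this mail, not an existing kernel interface.

```shell
#!/bin/sh
# Sketch only: assumes the proposed per-type layout
#   <types_dir>/<type_id>/{create,description,max_instances}
# which is a proposal in this thread, not an existing kernel ABI.

# list_mdev_types TYPES_DIR
# Print one "id|description|max_instances" line per type directory.
list_mdev_types() {
    types_dir=$1
    for type_path in "$types_dir"/*/; do
        # Skip the unexpanded glob when there are no type directories.
        [ -d "$type_path" ] || continue
        type_id=$(basename "$type_path")
        desc=$(cat "$type_path/description")
        max=$(cat "$type_path/max_instances")
        printf '%s|%s|%s\n' "$type_id" "$desc" "$max"
    done
}

# create_mdev TYPES_DIR TYPE_ID UUID
# Request a new mdev of the given type by writing its UUID to "create".
create_mdev() {
    echo "$3" > "$1/$2/create"
}
```

With such a layout in place, a caller could run something like
"list_mdev_types /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types"
and need no csv parsing at all, since each value lives in its own file.
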
For vGPUs like NVIDIA where we don't support multiple types
concurrently, this directory structure would update as mdev devices are
created, removing no longer available types.  I carried forward
max_instances here, but perhaps we really want to copy SR-IOV and
report a max and current allocation.  Creation and deletion is
simplified as we can simply "echo $UUID > create" per type.  I don't
understand why destroy had a parameter list, so here I imagine we can
simply do the same... in fact, I'd actually rather see a "remove" sysfs
entry under each mdev device, so we remove it at the device rather than
in some central location (any objections?).

We discussed how this might look with Intel devices which do allow
mixed vGPU types concurrently.  We believe, but need confirmation, that
the vendor driver could still make a finite set of supported types,
perhaps with additional module options to the vendor driver to enable
more "exotic" types.  So for instance if IGD vGPUs are based on
power-of-2 portions of the framebuffer size, then the vendor driver
could list types with 32MB, 64MB, 128MB, etc in useful and popular
sizes.  As vGPUs are allocated, the larger sizes may become unavailable.

We still don't have any way for the admin to learn in advance how the
available supported types will change once mdev devices start to be
created.  I'm not sure how we can create a specification for this, so
probing by creating devices may be the most flexible model.

The other issue is the start/stop requirement, which turned out to
exist in order to set up peer-to-peer resources between vGPUs, a
limited hardware resource.  We'd really like to have these happen
automatically on the first open of a vfio mdev device file and final
release.  So we brainstormed how the open/release callbacks could know
the other mdev devices for a given user.  This is where the instance
number came into play previously.  This is an area that needs work.

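
Going back to the power-of-2 example for Intel devices above, here is a
small sketch of how the set of advertised sizes could shrink as
framebuffer is consumed.  The function name, the 32MB default minimum,
and the 512MB total in the usage note are illustrative assumptions, not
how any vendor driver actually computes its types.

```shell
# Sketch only: models "larger sizes may become unavailable" with plain
# arithmetic; a real vendor driver may apply very different rules.

# available_sizes TOTAL_MB USED_MB [MIN_MB]
# Print each power-of-2 size (in MB), starting at MIN_MB (default 32),
# that still fits in the unallocated framebuffer.
available_sizes() {
    free_mb=$(( $1 - $2 ))
    size=${3:-32}
    while [ "$size" -le "$free_mb" ]; do
        echo "$size"
        size=$(( size * 2 ))
    done
}
```

For example, "available_sizes 512 0" lists 32 through 512, while
"available_sizes 512 384" stops at 128 because only 128MB remains.
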
There was a thought that perhaps on open() the vendor driver could look
at the user pid and use that to associate with other devices, but the
problem here is that we open and begin access to each device, so
devices do this discovery serially rather than in parallel as desired.
(We might not fault in mmio space yet though, so I wonder if open()
could set the association of mdev to pid, then the first mmio fault
would trigger the resource allocation?  Then all the "magic" would live
in the vendor driver.  open() could fail if the pid already has running
mdev devices and the vendor driver chooses not to support hotplug.)

One comment was that for a GPU that only supports homogeneous vGPUs,
libvirt may choose to create all the vGPUs in advance and handle them
as we do SR-IOV VFs.  The UUID+instance model would preclude such a use
case.

We also considered whether iommu groups could be (ab)used for this use
case, peer-to-peer would in fact be an iommu grouping constraint after
all.  This would have the same UUID+instance constraint as above though
and would require some sort of sysfs interface for the user to be able
to create multiple mdevs within a group.

Everyone was given homework to think about this on their flights home,
so I expect plenty of ideas by now ;)

Overall I think mediated devices were well received by the community,
so let's keep up the development and discussion to bring it to fruition.

Thanks,
Alex

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list