On 08/31/2016 02:12 PM, Tian, Kevin wrote: >> From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx] >> Sent: Wednesday, August 31, 2016 12:17 AM >> >> Hi folks, >> >> At KVM Forum we had a BoF session primarily around the mediated device >> sysfs interface. I'd like to share what I think we agreed on and the >> "problem areas" that still need some work so we can get the thoughts >> and ideas from those who weren't able to attend. >> >> DanPB expressed some concern about the mdev_supported_types sysfs >> interface, which exposes a flat csv file with fields like "type", >> "number of instance", "vendor string", and then a bunch of type >> specific fields like "framebuffer size", "resolution", "frame rate >> limit", etc. This is not entirely machine parsing friendly and sort of >> abuses the sysfs concept of one value per file. Example output taken >> from Neo's libvirt RFC: >> >> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types >> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, >> max_resolution >> 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 >> 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 >> 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 >> 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 >> 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 >> 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 >> 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 >> 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160 >> >> The create/destroy then looks like this: >> >> echo "$mdev_UUID:vendor_specific_argument_list" > >> /sys/bus/pci/devices/.../mdev_create >> >> echo "$mdev_UUID:vendor_specific_argument_list" > >> /sys/bus/pci/devices/.../mdev_destroy >> >> "vendor_specific_argument_list" is nebulous. >> >> So the idea to fix this is to explode this into a directory structure, >> something like: >> >> ├── mdev_destroy >> └── mdev_supported_types >> ├── 11 >> │ ├── create >> │ ├── description >> │ └── max_instances >> ├── 12 >> │ ├── create >> │ ├── description >> │ └── max_instances >> └── 13 >> ├── create >> ├── description >> └── max_instances >> >> Note that I'm only exposing the minimal attributes here for simplicity, >> the other attributes would be included in separate files and we would >> require vendors to create standard attributes for common device classes. > > I like this idea. All standard attributes are reflected into this hierarchy. > In the meantime, can we still allow optional vendor string in create > interface? libvirt doesn't need to know the meaning, but allows upper > layer to do some vendor specific tweak if necessary. > Not sure whether this can done within MDEV framework (attrs provided by vendor driver of course), or must be within the vendor driver. >> >> For vGPUs like NVIDIA where we don't support multiple types >> concurrently, this directory structure would update as mdev devices are >> created, removing no longer available types. I carried forward > > or keep the type with max_instances cleared to ZERO. > +1 :) >> max_instances here, but perhaps we really want to copy SR-IOV and >> report a max and current allocation. Creation and deletion is > > right, cur/max_instances look reasonable. > >> simplified as we can simply "echo $UUID > create" per type. I don't >> understand why destroy had a parameter list, so here I imagine we can >> simply do the same... in fact, I'd actually rather see a "remove" sysfs >> entry under each mdev device, so we remove it at the device rather than >> in some central location (any objections?). > > OK to me. IIUC, "destroy" has a parameter list is only because the previous $VM_UUID + instnace implementation. It should be safe to move the "destroy" file under mdev now. >> We discussed how this might look with Intel devices which do allow >> mixed vGPU types concurrently. We believe, but need confirmation, that >> the vendor driver could still make a finite set of supported types, >> perhaps with additional module options to the vendor driver to enable >> more "exotic" types. So for instance if IGD vGPUs are based on >> power-of-2 portions of the framebuffer size, then the vendor driver >> could list types with 32MB, 64MB, 128MB, etc in useful and popular >> sizes. As vGPUs are allocated, the larger sizes may become unavailable. > > Yes, Intel can do such type of definition. One thing I'm not sure is > about impact cross listed types, i.e. when creating a new instance > under a given type, max_instances under other types would be > dynamically decremented based on available resource. Would it be > a problem for libvirt or upper level stack, since a natural interpretation > of max_instances should be a static number? > > An alternative is to make max_instances configurable, so libvirt has > chance to define a pool of available instances with different types > before creating any instance. For example, initially IGD driver may > report max_instances only for a minimal sharing granularity: > 128MB: > max_instances (8) > 256MB: > max_instances (0) > 512MB: > max_instances (0) > > Then libvirt can configure more types as: > 128MB: > max_instances (2) > 256MB: > max_instances (1) > 512MB: > max_instances (1) > > Starting from this point, max_instances would be static and then > mdev instance can be created under each type. But I'm not > sure whether such additional configuration role is reasonable to libvirt... >> >> We still don't have any way for the admin to learn in advance how the >> available supported types will change once mdev devices start to be >> created. I'm not sure how we can create a specification for this, so >> probing by creating devices may be the most flexible model. >> >> The other issue is the start/stop requirement, which was revealed to >> setup peer-to-peer resources between vGPUs which is a limited hardware >> resource. We'd really like to have these happen automatically on the >> first open of a vfio mdev device file and final release. So we >> brainstormed how the open/release callbacks could know the other mdev >> devices for a given user. This is where the instance number came into >> play previously. This is an area that needs work. > > IGD doesn't have such peer-to-peer resource setup requirement. So > it's sufficient to create/destroy a mdev instance in a single action on > IGD. However I'd expect we still keep the "start/stop" interface ( > maybe not exposed as sysfs node, instead being a VFIO API), as > required to support future live migration usage. We've made prototype > working for KVMGT today. It's good for the framework to define start/stop interfaces, but as Alex said below, it should be MDEV oriented, not VM oriented. I don't know a lot about the peer-to-peer resource, but to me, although VM_UUID + instance is not applicable, userspace can always achieve the same purpose by, let us assume a mdev hierarchy, providing the VM UUID under every mdev: /sys/bus/pci/devices/<sbdf>/mdev/ |-- mdev01/ | `-- vm_uuid `-- mdev02/ `-- vm_uuid Did I miss something? >> >> There was a thought that perhaps on open() the vendor driver could look >> at the user pid and use that to associate with other devices, but the >> problem here is that we open and begin access to each device, so >> devices do this discovery serially rather than in parallel as desired. >> (we might not fault in mmio space yet though, so I wonder if open() >> could set the association of mdev to pid, then the first mmio fault >> would trigger the resource allocation? Then all the "magic" would live >> in the vendor driver. open() could fail if the pid already has running >> mdev devices and the vendor driver chooses not to support hotplug) >> >> One comment was that for a GPU that only supports homogeneous vGPUs, >> libvirt may choose to create all the vGPUs in advance and handle them >> as we do SR-IOV VFs. The UUID+instance model would preclude such a use >> case. >> >> We also considered whether iommu groups could be (ab)used for this use >> case, peer-to-peer would in fact be an iommu grouping constraint >> afterall. This would have the same UUID+instance constraint as above >> though and would require some sort of sysfs interface for the user to >> be able to create multiple mdevs within a group. >> >> Everyone was given homework to think about this on their flights home, >> so I expect plenty of ideas by now ;) >> >> Overall I think mediated devices were well received by the community, >> so let's keep up the development and discussion to bring it to >> fruition. Thanks, > > Thanks a lot Alex for your help on driving this discussion. Mediated device > technique has the potential to be used for other type of I/O virtualizations > in the future, not limited to GPU virtualization. So getting the core framework > ready earlier would be highly welcomed. :-) > -- Thanks, Jike -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html