On 18.08.2016 18:41, Neo Jia wrote:
> Hi libvirt experts,

Hi, welcome to the list.

> I am starting this email thread to discuss the potential solution / proposal
> of integrating vGPU support into libvirt for QEMU.
>
> Some quick background: NVIDIA is implementing a VFIO-based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU and Intel KVMGT. Within this framework, we are reusing
> the VFIO API to process the memory / interrupts just as QEMU does today with
> a passthrough device.

So as far as I understand, this is solely NVIDIA's API and other vendors
(e.g. Intel) will use their own? Or is this a standard that others will
comply with?

> The difference here is that we are introducing a set of new sysfs files for
> virtual device discovery and life cycle management, due to the devices'
> virtual nature.
>
> Here is a summary of the sysfs files, when they will be created, and how
> they should be used:
>
> 1. Discover mediated device
>
> As part of the physical device initialization process, the vendor driver
> will register their physical devices, which will be used to create virtual
> devices (mediated devices, aka mdev), with the mediated framework.
>
> Then, the sysfs file "mdev_supported_types" will be available under the
> physical device sysfs, and it will indicate the supported mdev types and
> configurations for this particular physical device. The content may change
> dynamically based on the system's current configuration, so libvirt needs to
> query this file every time before creating a mdev.

Ah, that was gonna be my question. Because in the example below, you used
"echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create" and I was wondering
where the number 20 comes from. Now what I am wondering about is how libvirt
should expose these to users. Moreover, how it should let users choose.
We have a node device driver where I guess we could expose possible options
and then require some explicit value in the domain XML (but what value would
that be? I don't think taking vgpu_type_id-s as they are would be a great
idea).

> Note: different vendors might have their own specific configuration sysfs
> files as well, if they don't have pre-defined types.
>
> For example, we have an NVIDIA Tesla M60 registered on 86:00.0 here, and
> below is the NVIDIA-specific configuration on an idle system.
>
> To query the "mdev_supported_types" on this Tesla M60:
>
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
> 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
> 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
> 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
> 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
> 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
> 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
> 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
>
> 2. Create/destroy mediated device
>
> Two sysfs files are available under the physical device sysfs path:
> mdev_create and mdev_destroy.
>
> The syntax for creating a mdev is:
>
>   echo "$mdev_UUID:vendor_specific_argument_list" > \
>     /sys/bus/pci/devices/.../mdev_create
>
> The syntax for destroying a mdev is:
>
>   echo "$mdev_UUID:vendor_specific_argument_list" > \
>     /sys/bus/pci/devices/.../mdev_destroy
>
> The $mdev_UUID is a unique identifier for the mdev device to be created, and
> it is unique per system.

Ah, so a caller (the one doing the echo, e.g. libvirt) can generate their own
UUID under which the mdev will be known? I'm asking because of migration - we
might want to preserve UUIDs when a domain is migrated to the other side.
Speaking of which, is there such a limitation, or will a guest be able to
migrate even if the UUID changed?
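Going back to the discovery question above: if libvirt has to re-read
"mdev_supported_types" before every create anyway, the table itself looks
easy to consume. A minimal sketch (Python rather than the C libvirt would
actually use), assuming the exact NVIDIA-style format shown in the example -
a '#'-prefixed header line followed by comma-separated rows; the field names
are taken from that header, and other vendors' output may differ entirely:

```python
# Hypothetical parser for the NVIDIA-style "mdev_supported_types" table
# quoted above. This assumes the example's format; it is not a defined
# cross-vendor interface.

def parse_mdev_supported_types(text):
    """Return a list of dicts, one per supported vGPU type."""
    types = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and the header/comment line
        fields = [f.strip().strip('"') for f in line.split(',')]
        types.append({
            'vgpu_type_id':   int(fields[0]),
            'vgpu_type':      fields[1],
            'max_instance':   int(fields[2]),
            'num_heads':      int(fields[3]),
            'frl_config':     int(fields[4]),
            'framebuffer':    fields[5],
            'max_resolution': fields[6],
        })
    return types
```

Something like this could back a node-device XML listing of available types,
with the user then naming a type (hopefully by something more stable than the
raw vgpu_type_id) in the domain XML.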
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id
> in the Tesla M60 output above) and a VM UUID to be passed as the
> "vendor_specific_argument_list".

I understand the need for vgpu_type_id, but can you shed more light on the VM
UUID? Why is that required?

> If there are no vendor-specific arguments required, either "$mdev_UUID" or
> "$mdev_UUID:" will be accepted as input syntax for the above two commands.
>
> To create a M60-4Q device, libvirt needs to do:
>
>   echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > \
>     /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
>
> Then, you will see a virtual device show up at:
>
>   /sys/bus/mdev/devices/$mdev_UUID/
>
> For NVIDIA, to create multiple virtual devices per VM, they have to be
> created upfront, before bringing any of them online.
>
> Regarding error reporting and detection: on failure, write() to the sysfs
> file returns an error code, and a write to the sysfs file from the command
> prompt shows the string corresponding to that error code.
>
> 3. Start/stop mediated device
>
> Under the virtual device sysfs, you will see a new "online" sysfs file.
>
> You can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current
> status of the virtual device (0 or 1), and to start or stop a virtual device
> you can do:
>
>   echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
>
> libvirt needs to query the current state before changing state.
>
> Note: if you have multiple devices, you need to write to each "online" file
> individually.
>
> For NVIDIA, if there are multiple mdevs per VM, libvirt needs to bring all
> of them "online" before starting QEMU.

This is a valid requirement, indeed.

> 4. Launch QEMU/VM
>
> Pass the mdev sysfs path to QEMU as a vfio-pci device:
>
>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0

One question here. Libvirt allows users to run qemu under a different
user:group than root:root.
If that's the case, libvirt sets security labels on all files qemu can/will
touch. Are we going to need to do something in that respect here?

> 5. Shutdown sequence
>
> libvirt needs to shut down the qemu process, bring the virtual device
> offline, then destroy the virtual device.
>
> 6. VM Reset
>
> No change or requirement for libvirt, as this will be handled via the VFIO
> reset API, and the QEMU process will keep running as before.
>
> 7. Hot-plug
>
> It is optional for vendors to support hot-plug.
>
> The same syntax is used to create a virtual device for hot-plug.
>
> For hot-unplug, after executing the QEMU monitor "device_del" command,
> libvirt needs to write to the "destroy" sysfs file to complete the
> hot-unplug process.
>
> Since hot-plug is optional, the mdev_create or mdev_destroy operations may
> return an error if it is not supported.

Thank you for the very detailed description! In general, I like the API as it
looks usable from my POV (I'm no VFIO devel though).

Michal

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
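[Editorial aside: for readers following along, the whole life cycle the
proposal describes (sections 2, 3 and 5) can be sketched from the
management-tool side. This is a rough Python illustration, not libvirt code;
the function names are invented, and the sysfs roots are parameters purely so
the helpers can be exercised against a mock directory tree instead of the
real /sys/bus/pci/devices/... and /sys/bus/mdev/devices paths.]

```python
import os

def _write(path, value):
    # sysfs attributes are written as plain strings
    with open(path, 'w') as f:
        f.write(value)

def mdev_create(pdev_path, mdev_uuid, vendor_args=''):
    # e.g. vendor_args = 'vgpu_type_id=11,vm_uuid=...' for NVIDIA;
    # "$UUID:" with empty args is also accepted per the proposal
    _write(os.path.join(pdev_path, 'mdev_create'),
           '%s:%s' % (mdev_uuid, vendor_args))

def mdev_set_online(mdev_root, mdev_uuid, online):
    online_path = os.path.join(mdev_root, mdev_uuid, 'online')
    # the proposal says to query the current state before changing it
    with open(online_path) as f:
        current = f.read().strip()
    target = '1' if online else '0'
    if current != target:
        _write(online_path, target)

def mdev_destroy(pdev_path, mdev_uuid):
    _write(os.path.join(pdev_path, 'mdev_destroy'), mdev_uuid + ':')

def shutdown_sequence(pdev_path, mdev_root, mdev_uuids):
    # section 5: shut down QEMU first (not shown here), then bring each
    # device offline individually, then destroy each device
    for uuid in mdev_uuids:
        mdev_set_online(mdev_root, uuid, False)
    for uuid in mdev_uuids:
        mdev_destroy(pdev_path, uuid)
```

For the NVIDIA multi-mdev-per-VM case, the start-up mirror of this would be:
create all mdevs first, bring every one online, and only then exec QEMU with
the corresponding -device vfio-pci,sysfsdev=... arguments.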