On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>
> Hi Alex, Kevin and Jike,
>
> (Seems I shouldn't use an attachment, so I am resending this to the list; patches are inline at the end.)
>
> Thanks for adding me to this technical discussion, a great opportunity for us to design together something that can bring both the Intel and NVIDIA vGPU solutions to the KVM platform.
>
> Instead of directly jumping to the proposal that we have been working on recently for NVIDIA vGPU on KVM, I think it is better for me to put out a couple of quick comments / thoughts on the existing discussions in this thread, as fundamentally I think we are solving the same problems: DMA, interrupts and MMIO.
>
> Then we can look at what we have; hopefully we can reach some consensus soon.
>
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices created here rather than at the point where we start
> > doing vfio "stuff".
>
> In fact, to keep vfio-vgpu more generic, vgpu device creation and management can be centralized and done in vfio-vgpu. That also includes adding the device to the IOMMU group and the VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to vfio; we want vfio to be a driver for a vgpu, not an integral part of the lifecycle of a vgpu.  That certainly doesn't exclude adding infrastructure to make lifecycle management of a vgpu more consistent between drivers, but it should be done independently of vfio.  I'll go back to the SR-IOV model: vfio is often used with SR-IOV VFs, but vfio does not create the VF; that's done in coordination with the PF, making use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver core support to perhaps set up a vgpu bus and class, with vfio-vgpu just being a driver for those devices.

> Graphics drivers can register with vfio-vgpu to get management and emulation callbacks.
>
> We already have a struct vgpu_device in our proposal that keeps a pointer to the physical device.
>
> > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > purpose. Anyway they can share the same injection interface;
>
> The eventfd used to inject the interrupt is known to vfio-vgpu; that fd should be made available to the graphics driver so that it can inject interrupts directly when the physical device triggers an interrupt.
>
> Here is the proposal we have, please review.
>
> Please note that the patches we have put out here are mainly for POC purposes, to verify our understanding, reduce confusion and speed up our design, although we are very happy to refine them into something that can eventually be used by both parties and upstreamed.
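[Editorial aside: to make the "vgpu bus + class, with vfio-vgpu as just a driver" idea above a bit more concrete, here is a rough, untested sketch.  None of these names exist today; the structures, the bus, and the probe contents are all hypothetical and only illustrate how the driver core could own the vgpu lifecycle while vfio-vgpu binds to it like any other driver.]

/*
 * Rough sketch only -- not an existing API.  Core vgpu infrastructure
 * (independent of vfio) would own the bus and the device lifecycle.
 */
#include <linux/device.h>
#include <linux/module.h>
#include <linux/pci.h>

struct vgpu_device {
        struct device dev;              /* device on the "vgpu" bus */
        struct pci_dev *parent;         /* link back to the physical GPU */
};
#define to_vgpu_device(d) container_of(d, struct vgpu_device, dev)

static struct bus_type vgpu_bus_type = {
        .name = "vgpu",
};
/* core vgpu module: bus_register(&vgpu_bus_type) at init time */

/*
 * vfio-vgpu is then just one driver bound to devices on that bus; it
 * does vfio group/IOMMU setup in probe() but never creates the vgpu.
 */
static int vfio_vgpu_probe(struct device *dev)
{
        struct vgpu_device *vdev = to_vgpu_device(dev);

        /* add a vfio device for vdev, join its IOMMU group, etc. */
        dev_info(&vdev->dev, "bound to vfio-vgpu\n");
        return 0;
}

static struct device_driver vfio_vgpu_driver = {
        .name  = "vfio-vgpu",
        .bus   = &vgpu_bus_type,
        .probe = vfio_vgpu_probe,
};
/* vfio-vgpu module: driver_register(&vfio_vgpu_driver) at init time */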
>
> Linux vGPU kernel design
> ==================================================================================
>
> Here we are proposing a generic Linux kernel module, based on the VFIO framework, which allows different GPU vendors to plug in and provide their GPU virtualization solution on KVM. The benefits of having such a generic kernel module are:
>
> 1) Reuse of the QEMU VFIO driver, supporting the VFIO UAPI
>
> 2) GPU-HW-agnostic management API for upper layer software such as libvirt
>
> 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors
>
> 0. High level overview
> ==================================================================================
>
> user space:
>                        +-----------+   VFIO IOMMU IOCTLs
>             +----------| QEMU VFIO |-----------------------------------+
> VFIO IOCTLs |          +-----------+                                   |
>             |                                                          |
> ------------|----------------------------------------------------------|--------
>             |                                                          |
> kernel space:                                                          |
>             |                                                          V
>  +----------+     +----------+    (register)   +-----------+    +--------------+
>  |          |     |          |<--------------->| nvidia.ko |--->|     VGPU     |
>  | VFIO Bus |<====| VGPU.ko  |    (callback)   +-----------+    | TYPE1 IOMMU  |
>  |          |     |          |                                  +------++------+
>  |          |     |          |    (register)   +-----------+           ||
>  +----------+     |          |<--------------->|  i915.ko  |    +------VV------+
>                   +----------+    (callback)   +-----------+    |    TYPE1     |
>                                                                 |    IOMMU     |
>                                                                 +--------------+
>
> access flow:
>
> Guest MMIO / PCI config access
>        |
> -------------------------------------------------
>        |
>        +-----> KVM VM_EXITs (kernel)
>        |
> -------------------------------------------------
>        |
>        +-----> QEMU VFIO driver (user)
>        |
> -------------------------------------------------
>        |
>        +----> VGPU kernel driver (kernel)
>        |
>        |
>        +----> vendor driver callback
>
>
> 1. VGPU management interface
> ==================================================================================
>
> This is the interface that allows upper layer software (mostly libvirt) to query and configure virtual GPU devices in a HW-agnostic fashion. This management interface also gives the underlying GPU vendor the flexibility to support virtual device hotplug, multiple virtual devices per VM, multiple virtual devices from different physical devices, etc.
>
> 1.1 Under per-physical device sysfs:
> ----------------------------------------------------------------------------------
>
> vgpu_supported_types - RO, lists the currently supported virtual GPU types and
>                        their VGPU_ID.  VGPU_ID is a vGPU type identifier returned
>                        from reads of "vgpu_supported_types".
>
> vgpu_create          - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
>                        gpu device on a target physical GPU.  idx: virtual device
>                        index inside a VM
>
> vgpu_destroy         - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu
>                        device on a target physical GPU

I've noted in previous discussions that we need to separate user policy from kernel policy here; the kernel policy should not require a "VM UUID".  A UUID simply represents a set of one or more devices and an index picks the device within the set.  Whether that UUID matches a VM or is independently used is up to the user policy when creating the device.  Personally, I'd also prefer to get rid of the concept of indexes within a UUID set of devices and instead have each device be independent.  This seems to be an imposition of the nvidia implementation onto the kernel interface design.
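[Editorial aside: purely to make the proposed create syntax concrete (regardless of whether the UUID ends up in the kernel interface), here is a minimal, untested sketch of a vgpu_create sysfs store hook parsing "<VM_UUID:idx:VGPU_ID>".  It is not taken from the posted patches; vgpu_uuid_parse() and vgpu_device_create() are hypothetical helpers.]

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/uuid.h>

static ssize_t vgpu_create_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        struct pci_dev *pdev = to_pci_dev(dev);
        char *str, *cur, *uuid_str, *idx_str, *type_str;
        uint32_t instance, vgpu_id;
        uuid_le uuid;
        ssize_t ret = -EINVAL;

        str = kstrndup(buf, count, GFP_KERNEL);
        if (!str)
                return -ENOMEM;

        cur = strim(str);                /* strip the trailing newline from echo */
        uuid_str = strsep(&cur, ":");
        idx_str  = strsep(&cur, ":");
        type_str = strsep(&cur, ":");
        if (!uuid_str || !idx_str || !type_str)
                goto out;

        if (kstrtouint(idx_str, 10, &instance) ||
            kstrtouint(type_str, 10, &vgpu_id))
                goto out;

        if (vgpu_uuid_parse(uuid_str, &uuid))           /* hypothetical helper */
                goto out;

        /* hypothetical: allocate the vgpu and call the vendor's vgpu_create op */
        ret = vgpu_device_create(pdev, uuid, instance, vgpu_id);
        if (!ret)
                ret = count;
out:
        kfree(str);
        return ret;
}
static DEVICE_ATTR_WO(vgpu_create);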
> 1.3 Under vgpu class sysfs:
> ----------------------------------------------------------------------------------
>
> vgpu_start    - WO, input syntax <VM_UUID>, this will trigger the registration
>                 interface to notify the GPU vendor driver to commit virtual GPU
>                 resources for this target VM.
>
>                 Also, vgpu_start is a synchronous call; a successful return of
>                 this call indicates that all the requested vGPU resources have
>                 been fully committed and the VMM should continue.
>
> vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
>                 interface to notify the GPU vendor driver to release the virtual
>                 GPU resources of this target VM.
>
> 1.4 Virtual device Hotplug
> ----------------------------------------------------------------------------------
>
> To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be accessed during VM runtime, and the corresponding registration callback will be invoked to allow the GPU vendor to support hotplug.
>
> To support hotplug, the vendor driver should take the necessary action to handle a vgpu_create that is done on a VM_UUID after vgpu_start; that implies both create and start for that vgpu device.
>
> Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if the vendor driver supports vgpu hotplug.
>
> If hotplug is not supported and the VM is still running, the vendor driver can return an error code to indicate that it is not supported.
>
> Separating create from start gives the flexibility to have:
>
> - multiple vgpu instances for a single VM, and
> - the hotplug feature.
>
> 2. GPU driver vendor registration interface
> ==================================================================================
>
> 2.1 Registration interface definition (include/linux/vgpu.h)
> ----------------------------------------------------------------------------------
>
> extern int vgpu_register_device(struct pci_dev *dev,
>                                 const struct gpu_device_ops *ops);
>
> extern void vgpu_unregister_device(struct pci_dev *dev);
>
> /**
>  * struct gpu_device_ops - Structure to be registered for each physical GPU to
>  * register the device to the vgpu module.
>  *
>  * @owner:                  The module owner.
>  * @vgpu_supported_config:  Called to get information about supported vgpu types.
>  *                          @dev: pci device structure of the physical GPU.
>  *                          @config: should return a string listing the supported
>  *                                   config
>  *                          Returns integer: success (0) or error (< 0)
>  * @vgpu_create:            Called to allocate basic resources in the graphics
>  *                          driver for a particular vgpu.
>  *                          @dev: physical pci device structure on which the vgpu
>  *                                should be created
>  *                          @vm_uuid: UUID of the VM for which it is intended
>  *                          @instance: vgpu instance in that VM
>  *                          @vgpu_id: the type of vgpu to be created
>  *                          Returns integer: success (0) or error (< 0)
>  * @vgpu_destroy:           Called to free resources in the graphics driver for
>  *                          a vgpu instance of that VM.
>  *                          @dev: physical pci device structure to which this
>  *                                vgpu points
>  *                          @vm_uuid: UUID of the VM to which the vgpu belongs
>  *                          @instance: vgpu instance in that VM
>  *                          Returns integer: success (0) or error (< 0)
>  *                          If the VM is running and vgpu_destroy is called, the
>  *                          vGPU is being hot-unplugged.  Return an error if the
>  *                          VM is running and the graphics driver doesn't support
>  *                          vgpu hotplug.
>  * @vgpu_start:             Called to initiate the vGPU initialization process in
>  *                          the graphics driver when the VM boots, before qemu
>  *                          starts.
>  *                          @vm_uuid: UUID of the VM which is booting.
>  *                          Returns integer: success (0) or error (< 0)
>  * @vgpu_shutdown:          Called to tear down vGPU-related resources for the VM.
>  *                          @vm_uuid: UUID of the VM which is shutting down.
>  *                          Returns integer: success (0) or error (< 0)
>  * @read:                   Read emulation callback.
>  *                          @vdev: vgpu device structure
>  *                          @buf: read buffer
>  *                          @count: number of bytes to read
>  *                          @address_space: specifies which address space the
>  *                                          request is for: pci_config_space, IO
>  *                                          register space or MMIO space.
>  *                          Returns number of bytes read on success, or an error.
>  * @write:                  Write emulation callback.
>  *                          @vdev: vgpu device structure
>  *                          @buf: write buffer
>  *                          @count: number of bytes to be written
>  *                          @address_space: specifies which address space the
>  *                                          request is for: pci_config_space, IO
>  *                                          register space or MMIO space.
>  *                          Returns number of bytes written on success, or an
>  *                          error.
>  * @vgpu_set_irqs:          Called to pass on the interrupt configuration
>  *                          information that qemu set.
>  *                          @vdev: vgpu device structure
>  *                          @flags, index, start, count and *data: same as those
>  *                          of struct vfio_irq_set of the VFIO_DEVICE_SET_IRQS
>  *                          API.
>  *
>  * A physical GPU that supports vGPU should be registered with the vgpu module
>  * with a gpu_device_ops structure.
>  */
>
> struct gpu_device_ops {
>         struct module   *owner;
>         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
>         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
>                                uint32_t instance, uint32_t vgpu_id);
>         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
>                                 uint32_t instance);
>         int     (*vgpu_start)(uuid_le vm_uuid);
>         int     (*vgpu_shutdown)(uuid_le vm_uuid);
>         ssize_t (*read)(struct vgpu_device *vdev, char *buf, size_t count,
>                         uint32_t address_space, loff_t pos);
>         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space, loff_t pos);
>         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
>                                  unsigned index, unsigned start, unsigned count,
>                                  void *data);
> };

I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia) that register these ops with the main vfio-vgpu driver, and they should also include a probe() function which allows us to associate a given vgpu device with a set of vendor ops.

> 2.2 Details for callbacks we haven't mentioned above.
> ---------------------------------------------------------------------------------
>
> vgpu_supported_config: allows the vendor driver to specify the supported vGPU
>                        types/configurations
>
> vgpu_create          : creates a virtual GPU device, can be used for device
>                        hotplug.
>
> vgpu_destroy         : destroys a virtual GPU device, can be used for device
>                        hotplug.
>
> vgpu_start           : callback function to notify the vendor driver that a vgpu
>                        device has come to life for a given virtual machine.
>
> vgpu_shutdown        : callback function to notify the vendor driver to tear down
>                        the vGPU resources for a given virtual machine.
>
> read                 : callback to the vendor driver to handle virtual device
>                        config space or MMIO read access
>
> write                : callback to the vendor driver to handle virtual device
>                        config space or MMIO write access
>
> vgpu_set_irqs        : callback to the vendor driver to pass along the interrupt
>                        information for the target virtual device, so that the
>                        vendor driver can inject interrupts into the virtual
>                        machine for this device.
>
> 2.3 Potential additional virtual device configuration registration interface:
> ---------------------------------------------------------------------------------
>
> callback function to describe the MMAP behavior of the virtual GPU
>
> callback function to allow the GPU vendor driver to provide PCI config space
> backing memory.
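[Editorial aside: to show how a vendor driver would consume the registration interface from 2.1, here is a minimal, hypothetical vendor-side sketch.  The my_* callbacks are placeholders, and <linux/vgpu.h> refers to the header proposed above, not to anything that exists in mainline.]

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/uuid.h>
#include <linux/vgpu.h>   /* proposed: vgpu_register_device(), struct gpu_device_ops */

static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
                          uint32_t instance, uint32_t vgpu_id)
{
        /* allocate per-vgpu HW resources (framebuffer carve-out, channels, ...) */
        return 0;
}

static int my_vgpu_destroy(struct pci_dev *dev, uuid_le vm_uuid,
                           uint32_t instance)
{
        /* release per-vgpu HW resources; reject if hotplug is unsupported */
        return 0;
}

static ssize_t my_vgpu_read(struct vgpu_device *vdev, char *buf, size_t count,
                            uint32_t address_space, loff_t pos)
{
        /* emulate config space / MMIO reads for this vgpu */
        return count;
}

static const struct gpu_device_ops my_gpu_ops = {
        .owner        = THIS_MODULE,
        .vgpu_create  = my_vgpu_create,
        .vgpu_destroy = my_vgpu_destroy,
        .read         = my_vgpu_read,
        /* .vgpu_supported_config, .write, .vgpu_set_irqs, ... omitted */
};

/* called from the vendor driver's PCI probe() for each physical GPU */
static int my_gpu_probe_vgpu(struct pci_dev *pdev)
{
        return vgpu_register_device(pdev, &my_gpu_ops);
}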
> 3. VGPU TYPE1 IOMMU
> ==================================================================================
>
> Here we are providing a TYPE1 IOMMU for the vGPU which will basically keep track of the <iova, hva, size, flag> tuples and save the QEMU mm for later reference.
>
> You can find the quick/ugly implementation in the attached patch file, which is actually just a simple version of Alex's type1 IOMMU without the actual real mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.
>
> We have thought about providing another vendor driver registration interface so that such tracking information would be sent to the vendor driver, which would then use the QEMU mm to do the get_user_pages / remap_pfn_range when required. After doing a quick implementation within our driver, I noticed the following issues:
>
> 1) It pushes OS/VFIO logic into the vendor driver, which will be a maintenance issue.
>
> 2) Every driver vendor has to implement their own RB tree, instead of reusing the common existing VFIO code (vfio_find/link/unlink_dma).
>
> 3) IOMMU_UNMAP_DMA is expected to return "unmapped bytes" to the caller/QEMU; better not to have anything inside a vendor driver that the VFIO caller immediately depends on.
>
> Based on the above considerations, we decided to implement the DMA tracking logic within the VGPU TYPE1 IOMMU code (ideally, this should be merged into the current TYPE1 IOMMU code) and expose two symbols to the outside, one for MMIO mapping and one for page translation and pinning.
>
> Also, with an mmap MMIO interface between virtual and physical, this allows a para-virtualized guest driver to access its virtual MMIO without taking an mmap fault hit, and we can also support different MMIO sizes between the virtual and physical device.
>
> int vgpu_map_virtual_bar
> (
>     uint64_t virt_bar_addr,
>     uint64_t phys_bar_addr,
>     uint32_t len,
>     uint32_t flags
> )
>
> EXPORT_SYMBOL(vgpu_map_virtual_bar);

Per the implementation provided, this needs to be implemented in the vfio device driver, not in the iommu interface.  Finding the DMA mapping of the device and replacing it is wrong.  It should be remapped at the vfio device file interface using vm_ops.

> int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
>
> EXPORT_SYMBOL(vgpu_dma_do_translate);
>
> There is still a lot to be added and modified, such as supporting multiple VMs and multiple virtual devices, tracking the mapped / pinned regions within the VGPU IOMMU kernel driver, error handling, roll-back, the locked memory size per user, etc.

Particularly, handling of mapping changes is completely missing.  This cannot be a point-in-time translation; the user is free to remap addresses whenever they wish and device translations need to be updated accordingly.

> 4. Modules
> ==================================================================================
>
> Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
>
> vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU
>                            TYPE1 v1 and v2 interface.

Depending on how intrusive it is, this can possibly be done within the existing type1 driver.  Either that or we can split out common code for use by a separate module.

> vgpu.ko                  - provides the registration interface and virtual
>                            device VFIO access.
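[Editorial aside: to illustrate the earlier comment that the virtual BAR should be remapped at the vfio device file interface using vm_ops rather than by rewriting DMA mappings in the IOMMU backend, here is a rough, untested sketch.  The vgpu device's mmap() would install these vm_ops and stash the vgpu in vm_private_data; vgpu_bar_to_pfn() is a hypothetical helper and vm_fault_t-era kernel APIs are assumed.]

#include <linux/mm.h>

static vm_fault_t vgpu_mmio_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vgpu_device *vdev = vma->vm_private_data;
        unsigned long pfn;

        /*
         * Look up whichever physical BAR page currently backs this virtual
         * BAR offset; the vendor driver can change the backing at any time
         * and zap the old mapping to force a fresh fault.
         */
        pfn = vgpu_bar_to_pfn(vdev, vmf->pgoff);        /* hypothetical helper */

        return vmf_insert_pfn(vma, vmf->address, pfn);
}

static const struct vm_operations_struct vgpu_mmio_vm_ops = {
        .fault = vgpu_mmio_fault,
};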
> 5. QEMU note
> ==================================================================================
>
> To allow us to focus on the VGPU kernel driver prototyping, we have introduced a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing vfio/pci.c file and can use it as a reference for our implementation. It is basically just a quick copy & paste from vfio/pci.c to quickly meet our needs.
>
> Once this proposal is finalized, we will move to vfio/pci.c instead of a new class, and probably the only thing required will be a new way to discover the device.
>
> 6. Examples
> ==================================================================================
>
> On this server, we have two NVIDIA M60 GPUs.
>
> [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
>
> After nvidia.ko gets initialized, we can query the supported vGPU types by reading "vgpu_supported_types" as follows:
>
> [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> 11:GRID M60-0B
> 12:GRID M60-0Q
> 13:GRID M60-1B
> 14:GRID M60-1Q
> 15:GRID M60-2B
> 16:GRID M60-2Q
> 17:GRID M60-4Q
> 18:GRID M60-8Q
>
> For example, if the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818 and we would like to create a "GRID M60-4Q" vGPU on this physical GPU:
>
> echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
>
> Note: the number 0 here is the vGPU device index. So far the change has not been tested with multiple vgpu devices yet, but we will support that.
>
> At this moment, if you query "vgpu_supported_types" it will still show all supported virtual GPU types, as no virtual GPU resource has been committed yet.
>
> Starting the VM:
>
> echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
>
> then the supported vGPU type query will return:
>
> [root@cjia-vgx-kvm /home/cjia]$ cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> 17:GRID M60-4Q
>
> So vgpu_supported_config needs to be called whenever a new virtual device gets created, as the underlying HW might limit the supported types if there are any existing VMs running.
>
> Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will inform the GPU vendor driver to clean up its resources.
>
> Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under the device sysfs.

I'd like to hear Intel's thoughts on this interface.  Are there different vgpu capacities or priority classes that would necessitate different types of vgpus on Intel?  I think there are some gaps in translating from named vgpu types to indexes here, along with my previous mention of the UUID/set oddity.  Does Intel have a need for the start and shutdown interfaces?

Neo, wasn't there at some point information about how many of each type could be supported through these interfaces?  How does a user know their capacity limits?

Thanks,
Alex