Hi Alex, let's continue with a new thread :)

Basically we agree with you: exposing vGPU via VFIO can make QEMU share as much code as possible with pcidev (PF or VF) assignment. And yes, different vGPU vendors can share quite a lot of the QEMU part, which will benefit upper layers such as libvirt.

To achieve this there is quite a lot to do; I'll summarize it below. I dived into VFIO for a while but may still have misunderstood things, so please correct me :)

First, let me illustrate my understanding of the current VFIO framework used to pass a pcidev through to a guest:

         +----------------------------------+
         |             vfio qemu            |
         +-----+------------------------+---+
               |DMA                 ^   |CFG
 QEMU          |map              IRQ|   |
---------------|--------------------|---|---------------
 KERNEL        |                    |   |
         +-----|--------------------|---|--------------+
         |VFIO |                    |   |              |
         |     v                    |   v              |
         |  +-------------------+  +-----------------+ |
 IOMMU   |  | vfio iommu driver |  | vfio bus driver | |
 API <------+                   |  |                 | |
 Layer   |  |    e.g. type1     |  |  e.g. vfio_pci  | |
         |  +-------------------+  +-----------------+ |
         +---------------------------------------------+

Here, when a particular pcidev is passed through to a KVM guest, it is attached to the vfio_pci driver in the host, and guest memory is mapped into the IOMMU via the type1 iommu driver.

Then, the draft infrastructure of the future VFIO-based vGPU:

         +----------------------------------+
         |             vfio qemu            |
         +-----+------------------------+---+
               |DMA                 ^   |CFG
 QEMU          |map              IRQ|   |
---------------|--------------------|---|---------------
 KERNEL        |                    |   |
         +-----|--------------------|---|--------------+
         |VFIO |                    |   |              |
         |     v                    |   v              |
         |  +-------------------+  +-----------------+ |
 DMA     |  | vfio iommu driver |  | vfio bus driver | |
 API <------+                   |  |                 | |
 Layer   |  |  e.g. vfio_type2  |  | e.g. vfio_vgpu  | |
         |  +-------------------+  +-----------------+ |
         |     |  ^                     |  ^           |
         +-----|--|---------------------|--|-----------+
               |  |                     |  |
               |  |                     v  |
         +-----|--|--------------+ +-------------------+
         |  +--v--------------+  | |                   |
         |  |      KVMGT      |  | |  host gfx driver  |
         |  +-----------------+  | |                   |
         |                       | |                   |
         |    KVM hypervisor     | |                   |
         +-----------------------+ +-------------------+

NOTE: vfio_type2 and vfio_vgpu are only *logically* parts of VFIO; they may be implemented inside the KVM hypervisor or the host gfx driver.

Here we need to implement a new vfio IOMMU driver instead of type1; let's call it vfio_type2 temporarily. The main difference from pcidev assignment is that a vGPU doesn't have its own DMA requester id, so it has to share mappings with the host and with other vGPUs:

- the type1 iommu driver maps gpa to hpa for pass-through, whereas type2 maps iova to hpa;

- a hardware iommu is always needed by type1, whereas for type2 a hardware iommu is optional;

- type1 invokes the low-level IOMMU API (iommu_map et al.) to set up the IOMMU page table directly, whereas type2 doesn't -- it only needs to invoke a higher-level DMA API such as dma_map_page (a rough sketch contrasting the two paths follows after these lists);

We also need to implement a new 'bus' driver instead of vfio_pci; let's call it vfio_vgpu temporarily:

- vfio_pci is a real pci driver with a probe method called when the device is attached, whereas vfio_vgpu is a pseudo driver and won't attach to any device -- the GPU is always owned by the host gfx driver. Its 'probing' has to happen elsewhere, still inside the host gfx driver attached to the device;

- a pcidev (PF or VF) attached to vfio_pci has a natural path in sysfs, whereas a vGPU is purely a software concept: vfio_vgpu needs to create/destroy vgpu instances, maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"), etc. Something has to be added in a higher layer (VFIO or DRM) to do this; a sketch of this part is also given below;

- vfio_pci in most cases lets QEMU access pcidev hardware, whereas vfio_vgpu gives access to virtual resources emulated by another device model;

- vfio_pci injects an IRQ into the guest only when a physical IRQ is generated, whereas vfio_vgpu may inject an IRQ for emulation purposes. Either way, they can share the same injection interface;
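To make the type1/type2 difference above more concrete, here is a rough, illustrative sketch of how a single page might be mapped in each case. This is not working code: the function names are invented, error handling is trimmed, and the kernel API signatures are only approximate.

/*
 * Rough sketch only -- invented names, minimal error handling,
 * approximate signatures.  The point is the shape of the two paths,
 * not the exact code.
 */
#include <linux/mm.h>
#include <linux/iommu.h>
#include <linux/dma-mapping.h>

/*
 * type1-style: pin the user page backing a gpa and program the hardware
 * IOMMU directly, so that DMA from the assigned device to 'iova' (== gpa)
 * reaches that page.
 */
static int type1_style_map_one(struct iommu_domain *domain,
                               unsigned long vaddr, unsigned long iova)
{
        struct page *page;
        int ret;

        ret = get_user_pages_fast(vaddr, 1, 1 /* write */, &page);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        ret = iommu_map(domain, iova, page_to_phys(page), PAGE_SIZE,
                        IOMMU_READ | IOMMU_WRITE);
        if (ret)
                put_page(page);
        return ret;
}

/*
 * type2-style: the vGPU has no requester id of its own, so we only pin
 * the page and obtain a dma address usable by the *physical* GPU through
 * the DMA API; the host gfx driver then writes that address into the
 * shadow GTT entries it maintains for this vGPU.
 */
static int type2_style_map_one(struct device *gpu_dev, unsigned long vaddr,
                               dma_addr_t *dma)
{
        struct page *page;
        int ret;

        ret = get_user_pages_fast(vaddr, 1, 1 /* write */, &page);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        *dma = dma_map_page(gpu_dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(gpu_dev, *dma)) {
                put_page(page);
                return -EFAULT;
        }
        return 0;
}

And for the 'bus driver' part, something along the following lines would be needed to create the /sys/class/vgpu/... nodes. Again this is only a sketch under my assumptions: vgpu_class, vgpu_register_instance etc. are made-up names, and the exact sysfs layout (e.g. the per-vendor subdirectory in the path above) is left open.

#include <linux/module.h>
#include <linux/device.h>
#include <linux/err.h>

static struct class *vgpu_class;

/* done once, e.g. at init time: creates /sys/class/vgpu/ */
static int vgpu_sysfs_init(void)
{
        vgpu_class = class_create(THIS_MODULE, "vgpu");
        return PTR_ERR_OR_ZERO(vgpu_class);
}

/*
 * called by the host gfx driver when it carves a new vGPU instance out of
 * the physical GPU; the returned device backs /sys/class/vgpu/vgpuN
 */
static struct device *vgpu_register_instance(struct device *gpu_parent,
                                             int id, void *vgpu_priv)
{
        return device_create(vgpu_class, gpu_parent, MKDEV(0, 0),
                             vgpu_priv, "vgpu%d", id);
}

Whether this piece lives in VFIO, in DRM, or in the vendor driver is exactly the open question mentioned in the list above.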
Questions:

[1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode"). In my opinion, vfio_type2 doesn't rely on it to support the No-IOMMU case; instead it needs a new implementation which fits both the w/ and w/o hardware IOMMU cases. Is this correct?

Things not mentioned above might be discussed in other threads, or temporarily kept on a TODO list (we can get back to them once the big picture is agreed):

- How to expose the guest framebuffer via VFIO for SPICE;

- How to avoid double translation with the two stages, GTT + IOMMU: whether an identity map is possible, and if yes, how to make it more efficient;

- Application acceleration: you mentioned that with VFIO, a vGPU may be used by applications to get GPU acceleration. That is a potential opportunity to use vGPU for containers, worthy of further investigation.

--
Thanks,
Jike