On 01/18/2016 12:47 PM, Alex Williamson wrote:
> Hi Jike,
>
> On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
>> Hi Alex, let's continue with a new thread :)
>>
>> Basically we agree with you: exposing vGPU via VFIO can make QEMU
>> share as much code as possible with pcidev (PF or VF) assignment.
>> And yes, different vGPU vendors can share quite a lot of the QEMU
>> part, which will do good for upper layers such as libvirt.
>>
>> To achieve this there is quite a lot to do; I'll summarize it below.
>> I dived into VFIO for a while, but I may still have misunderstood
>> things, so please correct me :)
>>
>> First, let me illustrate my understanding of the current VFIO
>> framework used to pass a pcidev through to a guest:
>>
>>            +----------------------------------+
>>            |             vfio qemu            |
>>            +-----+------------------+--+------+
>>                  |DMA               ^  |CFG
>>     QEMU         |map            IRQ|  |
>> -----------------|------------------|--|------------------
>>    KERNEL +------|------------------|--|----------------+
>>           | VFIO |                  |  |                 |
>>           |      v                  |  v                 |
>>           | +-------------------+ +-+--+--------------+ |
>>   IOMMU   | | vfio iommu driver | |  vfio bus driver  | |
>>    API <--+ |                   | |                   | |
>>   Layer   | |    e.g. type1     | |   e.g. vfio_pci   | |
>>           | +-------------------+ +-------------------+ |
>>           +---------------------------------------------+
>>
>> Here, when a particular pcidev is passed through to a KVM guest, it
>> is attached to the vfio_pci driver in the host, and guest memory is
>> mapped into the IOMMU via the type1 iommu driver.
>>
>> Then, the draft infrastructure of the future VFIO-based vGPU:
>>
>>            +----------------------------------+
>>            |             vfio qemu            |
>>            +-----+------------------+--+------+
>>                  |DMA               ^  |CFG
>>     QEMU         |map            IRQ|  |
>> -----------------|------------------|--|------------------
>>    KERNEL +------|------------------|--|----------------+
>>           | VFIO |                  |  |                 |
>>           |      v                  |  v                 |
>>           | +-------------------+ +-+--+--------------+ |
>>    DMA    | | vfio iommu driver | |  vfio bus driver  | |
>>    API <--+ |                   | |                   | |
>>   Layer   | |  e.g. vfio_type2  | |  e.g. vfio_vgpu   | |
>>           | +----+--+-----------+ +----+--+-----------+ |
>>           |      |  ^                  |  ^             |
>>           +------|--|------------------|--|-------------+
>>                  |  |                  |  |
>>                  |  |                  v  |
>>         +--------|--|-----------+ +----+--+-------------+
>>         | +------v--+---------+ | |                     |
>>         | |       KVMGT       | | |   host gfx driver   |
>>         | +-------------------+ | |                     |
>>         |    KVM hypervisor     | |                     |
>>         +-----------------------+ +---------------------+
>>
>> NOTE: vfio_type2 and vfio_vgpu are only *logically* parts of VFIO;
>> they may be implemented in the KVM hypervisor or in the host gfx
>> driver.
>>
>> Here we need to implement a new vfio IOMMU driver instead of type1;
>> let's call it vfio_type2 temporarily. The main difference from
>> pcidev assignment is that a vGPU doesn't have its own DMA requester
>> id, so it has to share mappings with the host and with other vGPUs:
>>
>> - the type1 iommu driver maps gpa to hpa for pass-through, whereas
>>   type2 maps iova to hpa;
>>
>> - a hardware iommu is always needed by type1, whereas for type2 a
>>   hardware iommu is optional;
>>
>> - type1 invokes the low-level IOMMU API (iommu_map et al.) to set up
>>   the IOMMU page tables directly, whereas type2 doesn't (it only
>>   needs to invoke a higher-level DMA API such as dma_map_page).
>
> Yes, the current type1 implementation is not compatible with vgpu
> since there are not separate requester IDs on the bus and you
> probably don't want or need to pin all of guest memory like we do for
> direct assignment.  However, let's separate the type1 user API from
> the current implementation.  It's quite easy within the vfio code to
> consider "type1" to be an API specification that may have multiple
> implementations.  A minor code change would allow us to continue
> looking for compatible iommu backends if the group we're trying to
> attach is rejected.

Would you elaborate a bit on the 'iommu backends' here? Previously I
thought the entire type1 driver would be duplicated. If not, what is
supposed to be added - a new vfio_dma_do_map?
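
To make the question concrete, below is roughly the kind of backend I
currently picture: it would keep the type1 MAP/UNMAP user interface,
but a map operation would only record the iova -> vaddr translation
instead of touching a hardware iommu, and pages would be pinned only
when the host gfx driver actually needs an hpa (which, reading further
down, seems to match your 'kernel-based database' idea).  Everything
named vgpu_* here is invented, just a sketch for discussion, not
working code:

/*
 * Sketch only: a second backend speaking the type1 user API
 * (VFIO_IOMMU_MAP_DMA/UNMAP_DMA), registered with the vfio core via
 * vfio_register_iommu_driver() just like type1, but merely recording
 * translations instead of programming a hardware IOMMU.
 */
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct vgpu_dma {
        struct list_head next;
        dma_addr_t       iova;
        unsigned long    vaddr;         /* QEMU virtual address */
        size_t           size;
};

struct vgpu_iommu {
        struct list_head dma_list;      /* real code: an interval tree */
        struct mutex     lock;
};

/* VFIO_IOMMU_MAP_DMA: just remember the translation, pin nothing. */
static int vgpu_dma_do_map(struct vgpu_iommu *iommu, dma_addr_t iova,
                           unsigned long vaddr, size_t size)
{
        struct vgpu_dma *dma = kzalloc(sizeof(*dma), GFP_KERNEL);

        if (!dma)
                return -ENOMEM;

        dma->iova  = iova;
        dma->vaddr = vaddr;
        dma->size  = size;

        mutex_lock(&iommu->lock);
        list_add(&dma->next, &iommu->dma_list);
        mutex_unlock(&iommu->lock);
        return 0;
}

/*
 * Called later by the host gfx driver, e.g. while shadowing the guest
 * GTT: iova -> vaddr becomes iova -> hpa only when actually needed.
 */
static int vgpu_iommu_pin_page(struct vgpu_iommu *iommu, dma_addr_t iova,
                               struct page **page)
{
        struct vgpu_dma *dma;
        int ret = -EINVAL;

        mutex_lock(&iommu->lock);
        list_for_each_entry(dma, &iommu->dma_list, next) {
                if (iova >= dma->iova && iova < dma->iova + dma->size) {
                        unsigned long va = dma->vaddr + (iova - dma->iova);

                        /* pin exactly one page (4.4-era signature) */
                        ret = get_user_pages_fast(va & PAGE_MASK, 1, 1, page);
                        ret = (ret == 1) ? 0 : -EFAULT;
                        break;
                }
        }
        mutex_unlock(&iommu->lock);
        return ret;
}

The lookup side is of course too naive (a real backend would want an
interval tree, pinning that also works outside the QEMU process
context, accounting for unpinning, and the locked-memory limit checks
that type1 already does), but is this the general shape you have in
mind?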
> The benefit here is that QEMU could work unmodified, using the type1
> vfio-iommu API regardless of whether a device is directly assigned or
> virtual.
>
> Let's look at the type1 interface; we have simple map and unmap
> interfaces which map and unmap process virtual address space (vaddr)
> to the device address space (iova).  The host physical address is
> obtained by pinning the vaddr.  In the current implementation, a map
> operation pins pages and populates the hardware iommu.  A vgpu
> compatible implementation might simply register the translation into
> a kernel-based database to be called upon later.  When the host
> graphics driver needs to enable dma for the vgpu, it doesn't need to
> go to QEMU for the translation, it already possesses the iova to
> vaddr mapping, which becomes iova to hpa after a pinning operation.
>
> So, I would encourage you to look at creating a vgpu vfio iommu
> backend that makes use of the type1 api since it will reduce the
> changes necessary for userspace.

Yes, keeping the type1 API sounds like a great idea.

>> We also need to implement a new 'bus' driver instead of vfio_pci;
>> let's call it vfio_vgpu temporarily:
>>
>> - vfio_pci is a real pci driver with a probe method called when a
>>   device is attached; vfio_vgpu is a pseudo driver that won't attach
>>   to any device - the GPU is always owned by the host gfx driver.
>>   It has to do its 'probing' elsewhere, but still in the host gfx
>>   driver attached to the device;
>>
>> - a pcidev (PF or VF) attached to vfio_pci has a natural path in
>>   sysfs, whereas a vgpu is purely a software concept: vfio_vgpu
>>   needs to create/destroy vgpu instances, maintain their paths in
>>   sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"), etc.  Something should
>>   be added in a higher layer (VFIO or DRM) to do this;
>>
>> - vfio_pci in most cases lets QEMU access the pcidev hardware,
>>   whereas vfio_vgpu accesses virtual resources emulated by another
>>   device model;
>>
>> - vfio_pci injects an IRQ into the guest only when a physical IRQ is
>>   generated, whereas vfio_vgpu may inject an IRQ purely for
>>   emulation purposes.  Either way, they can share the same injection
>>   interface;
>
> Here too, I think you're making assumptions based on an
> implementation path.  Personally, I think each vgpu should be a
> struct device and that an iommu group should be created for each.  I
> think this is a valid abstraction; dma isolation is provided through
> something other than a system-level iommu, but it's still provided.
> Without this, the entire vfio core would need to be aware of vgpu,
> since the core operates on devices and groups.  I believe creating a
> struct device also gives you basic probe and release support for a
> driver.

Indeed.  BTW, that should be done in the 'bus' driver, right?
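
If it helps to make that concrete as well: per vgpu instance I would
expect the 'bus' driver to create a struct device as a child of the
physical GPU, give it its own iommu group (even though there is no
real requester id behind it, isolation being enforced by the host gfx
driver), and then hand it to the vfio core.  A rough sketch, with all
the vgpu_*/vfio_vgpu_* names invented for illustration:

/* Sketch only: per-vGPU device and iommu group creation. */
#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/slab.h>
#include <linux/vfio.h>

struct vgpu_device {
        struct device dev;              /* child of the physical GPU */
        int id;
};

static void vgpu_device_release(struct device *dev)
{
        kfree(container_of(dev, struct vgpu_device, dev));
}

static int vfio_vgpu_open(void *device_data)
{
        return 0;
}

static void vfio_vgpu_release(void *device_data)
{
}

/*
 * read/write/ioctl/mmap would forward region and irq accesses to the
 * device model in the host gfx driver; omitted in this sketch.
 */
static const struct vfio_device_ops vfio_vgpu_dev_ops = {
        .name    = "vfio-vgpu",
        .open    = vfio_vgpu_open,
        .release = vfio_vgpu_release,
};

static struct vgpu_device *vgpu_create(struct device *gpu, int id)
{
        struct vgpu_device *vgpu;
        struct iommu_group *group;
        int ret;

        vgpu = kzalloc(sizeof(*vgpu), GFP_KERNEL);
        if (!vgpu)
                return ERR_PTR(-ENOMEM);

        vgpu->id          = id;
        vgpu->dev.parent  = gpu;        /* appears under the physical GPU */
        vgpu->dev.release = vgpu_device_release;
        dev_set_name(&vgpu->dev, "vgpu%d", id);

        ret = device_register(&vgpu->dev);
        if (ret) {
                put_device(&vgpu->dev);
                return ERR_PTR(ret);
        }

        /*
         * A group per vgpu, although there is no per-vgpu requester
         * id: isolation comes from the host gfx driver, not from the
         * system IOMMU.
         */
        group = iommu_group_alloc();
        if (IS_ERR(group)) {
                ret = PTR_ERR(group);
                goto err_device;
        }

        ret = iommu_group_add_device(group, &vgpu->dev);
        iommu_group_put(group);
        if (ret)
                goto err_device;

        ret = vfio_add_group_dev(&vgpu->dev, &vfio_vgpu_dev_ops, vgpu);
        if (ret)
                goto err_group;

        return vgpu;

err_group:
        iommu_group_remove_device(&vgpu->dev);
err_device:
        device_unregister(&vgpu->dev);
        return ERR_PTR(ret);
}

With something like this the vfio core could treat a vgpu as an
ordinary device/group pair, which, if I read you correctly, is the
point.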
> There will be a need for some sort of lifecycle management of a
> vgpu.  How is it created?  Destroyed?  Can it be given more or less
> resources than other vgpus, etc.  This could be implemented in sysfs
> for each physical gpu with vgpu support, sort of like how we support
> sr-iov now, where the PF exports controls for creating VFs.  The more
> commonality we can get for lifecycle and device access for userspace,
> the better.

I will have a look at the VF management interfaces, thanks for the
info.

> As for virtual vs physical resources and interrupts, part of the
> purpose of vfio is to abstract a device into basic components.  It's
> up to the bus driver how accesses to each space map to the physical
> device.  Take PCI config space, for instance: the existing vfio-pci
> driver emulates some portions of config space for the user.
>
>> Questions:
>>
>> [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>>     upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU
>>     mode").  In my opinion, vfio_type2 doesn't rely on it to support
>>     the No-IOMMU case; instead it needs a new implementation which
>>     fits both w/ and w/o a hardware IOMMU.  Is this correct?
>
> vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba); this
> was simply a case where the kernel development outpaced the intended
> user and I didn't want to commit to the user api changes until it had
> been completely vetted.  In any case, vgpu should have no dependency
> whatsoever on no-iommu.  As above, I think vgpu should create virtual
> devices and add them to an iommu group, similar to how no-iommu does,
> but without the kernel tainting, because you are actually providing
> isolation through other means than a system iommu.

Thanks for the confirmation.

>> For things not mentioned above, we might have them discussed in
>> other threads, or temporarily keep them in a TODO list (we can get
>> back to them after the big picture is agreed on):
>>
>> - How to expose the guest framebuffer via VFIO for SPICE;
>
> Potentially through a new, device specific region, which I think can
> be done within the existing vfio API.  The API can already expose an
> arbitrary number of regions to the user, it's just a matter of how we
> tell the user the purpose of a region index beyond the fixed set we
> map to PCI resources.
>
>> - How to avoid double translation with the two stages GTT + IOMMU:
>>   whether an identity map is possible, and if so, how to make it
>>   more efficient;
>>
>> - Application acceleration
>>   You mentioned that with VFIO, a vGPU may be used by applications
>>   to get GPU acceleration.  It's a potential opportunity to use vGPU
>>   for containers, worth further investigation.
>
> Yes, interesting topics.  Thanks,

It looks like things are getting clearer overall, with only small
exceptions.  Thanks for the advice :)

> Alex
>

--
Thanks,
Jike
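
P.S. Regarding the sr-iov style lifecycle controls: is something along
these lines on the physical GPU's sysfs node what you have in mind?
Again just a sketch; intel_vgpu_create() is an invented name standing
in for whatever hook the host gfx driver ends up providing (see the
vgpu_create() sketch above):

/* Sketch only: a PF-like sysfs control on the physical GPU. */
#include <linux/device.h>
#include <linux/err.h>
#include <linux/kernel.h>

struct vgpu_device;

/* Invented name: provided by the host gfx driver. */
extern struct vgpu_device *intel_vgpu_create(struct device *gpu, int id);

static ssize_t vgpu_create_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        struct vgpu_device *vgpu;
        unsigned int id;
        int ret;

        ret = kstrtouint(buf, 0, &id);
        if (ret)
                return ret;

        /* creates the vgpuN device and its iommu group */
        vgpu = intel_vgpu_create(dev, id);
        if (IS_ERR(vgpu))
                return PTR_ERR(vgpu);

        return count;
}
static DEVICE_ATTR_WO(vgpu_create);

/* Called by the host gfx driver when it binds to the physical GPU. */
static int vgpu_sysfs_init(struct device *gpu)
{
        return device_create_file(gpu, &dev_attr_vgpu_create);
}

Then e.g. 'echo 0 > /sys/bus/pci/devices/0000:00:02.0/vgpu_create'
would instantiate vgpu0, roughly the way writing sriov_numvfs creates
VFs.  A matching vgpu_destroy control (and probably a resource/type
parameter) would be needed as well.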