On 01/18/2016 12:47 PM, Alex Williamson wrote:
> Hi Jike,
>
> On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
>> Hi Alex, let's continue with a new thread :)
>>
>> Basically we agree with you: exposing vGPU via VFIO can make QEMU
>> share as much code as possible with pcidev (PF or VF) assignment.
>> And yes, different vGPU vendors can share quite a lot of the QEMU
>> part, which will do good for upper layers such as libvirt.
>>
>> To achieve this there is quite a lot to do; I'll summarize it below.
>> I dived into VFIO for a while, but I may still have misunderstood
>> things, so please correct me :)
>>
>> First, let me illustrate my understanding of the current VFIO
>> framework used to pass a pcidev through to a guest:
>>
>>            +----------------------------------+
>>            |             vfio qemu            |
>>            +-----+------------------+--+------+
>>                  |DMA               ^  |CFG
>>     QEMU         |map            IRQ|  |
>> -----------------|------------------|--|------------------
>>    KERNEL +------|------------------|--|----------------+
>>           | VFIO |                  |  |                 |
>>           |      v                  |  v                 |
>>           | +-------------------+ +-+--+--------------+ |
>>   IOMMU   | | vfio iommu driver | |  vfio bus driver  | |
>>    API <--+ |                   | |                   | |
>>   Layer   | |    e.g. type1     | |   e.g. vfio_pci   | |
>>           | +-------------------+ +-------------------+ |
>>           +---------------------------------------------+
>>
>> Here, when a particular pcidev is passed through to a KVM guest, it
>> is attached to the vfio_pci driver in the host, and guest memory is
>> mapped into the IOMMU via the type1 iommu driver.
>>
>> Then, the draft infrastructure of the future VFIO-based vGPU:
>>
>>            +----------------------------------+
>>            |             vfio qemu            |
>>            +-----+------------------+--+------+
>>                  |DMA               ^  |CFG
>>     QEMU         |map            IRQ|  |
>> -----------------|------------------|--|------------------
>>    KERNEL +------|------------------|--|----------------+
>>           | VFIO |                  |  |                 |
>>           |      v                  |  v                 |
>>           | +-------------------+ +-+--+--------------+ |
>>    DMA    | | vfio iommu driver | |  vfio bus driver  | |
>>    API <--+ |                   | |                   | |
>>   Layer   | |  e.g. vfio_type2  | |  e.g. vfio_vgpu   | |
>>           | +----+--+-----------+ +----+--+-----------+ |
>>           |      |  ^                  |  ^             |
>>           +------|--|------------------|--|-------------+
>>                  |  |                  |  |
>>                  |  |                  v  |
>>         +--------|--|-----------+ +----+--+-------------+
>>         | +------v--+---------+ | |                     |
>>         | |       KVMGT       | | |   host gfx driver   |
>>         | +-------------------+ | |                     |
>>         |    KVM hypervisor     | |                     |
>>         +-----------------------+ +---------------------+
>>
>> NOTE: vfio_type2 and vfio_vgpu are only *logically* parts of VFIO;
>> they may be implemented in the KVM hypervisor or in the host gfx
>> driver.
>>
>> Here we need to implement a new vfio IOMMU driver instead of type1;
>> let's call it vfio_type2 temporarily. The main difference from
>> pcidev assignment is that a vGPU doesn't have its own DMA requester
>> id, so it has to share mappings with the host and with other vGPUs:
>>
>> - the type1 iommu driver maps gpa to hpa for pass-through, whereas
>>   type2 maps iova to hpa;
>>
>> - a hardware iommu is always needed by type1, whereas for type2 a
>>   hardware iommu is optional;
>>
>> - type1 invokes the low-level IOMMU API (iommu_map et al.) to set up
>>   the IOMMU page tables directly, whereas type2 doesn't (it only
>>   needs to invoke a higher-level DMA API such as dma_map_page).
>
> Yes, the current type1 implementation is not compatible with vgpu
> since there are not separate requester IDs on the bus and you
> probably don't want or need to pin all of guest memory like we do for
> direct assignment.  However, let's separate the type1 user API from
> the current implementation.  It's quite easy within the vfio code to
> consider "type1" to be an API specification that may have multiple
> implementations.  A minor code change would allow us to continue
> looking for compatible iommu backends if the group we're trying to
> attach is rejected.

Would you elaborate a bit on the 'iommu backends' here? Previously I
thought the entire type1 driver would be duplicated. If not, what is
supposed to be added - a new vfio_dma_do_map?
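
To make the question concrete, below is roughly the kind of backend I
currently picture: it would keep the type1 MAP/UNMAP user interface,
but a map operation would only record the iova -> vaddr translation
instead of touching a hardware iommu, and pages would be pinned only
when the host gfx driver actually needs an hpa (which, reading further
down, seems to match your 'kernel-based database' idea).  Everything
named vgpu_* here is invented, just a sketch for discussion, not
working code:

/*
 * Sketch only: a second backend speaking the type1 user API
 * (VFIO_IOMMU_MAP_DMA/UNMAP_DMA), registered with the vfio core via
 * vfio_register_iommu_driver() just like type1, but merely recording
 * translations instead of programming a hardware IOMMU.
 */
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct vgpu_dma {
        struct list_head next;
        dma_addr_t       iova;
        unsigned long    vaddr;         /* QEMU virtual address */
        size_t           size;
};

struct vgpu_iommu {
        struct list_head dma_list;      /* real code: an interval tree */
        struct mutex     lock;
};

/* VFIO_IOMMU_MAP_DMA: just remember the translation, pin nothing. */
static int vgpu_dma_do_map(struct vgpu_iommu *iommu, dma_addr_t iova,
                           unsigned long vaddr, size_t size)
{
        struct vgpu_dma *dma = kzalloc(sizeof(*dma), GFP_KERNEL);

        if (!dma)
                return -ENOMEM;

        dma->iova  = iova;
        dma->vaddr = vaddr;
        dma->size  = size;

        mutex_lock(&iommu->lock);
        list_add(&dma->next, &iommu->dma_list);
        mutex_unlock(&iommu->lock);
        return 0;
}

/*
 * Called later by the host gfx driver, e.g. while shadowing the guest
 * GTT: iova -> vaddr becomes iova -> hpa only when actually needed.
 */
static int vgpu_iommu_pin_page(struct vgpu_iommu *iommu, dma_addr_t iova,
                               struct page **page)
{
        struct vgpu_dma *dma;
        int ret = -EINVAL;

        mutex_lock(&iommu->lock);
        list_for_each_entry(dma, &iommu->dma_list, next) {
                if (iova >= dma->iova && iova < dma->iova + dma->size) {
                        unsigned long va = dma->vaddr + (iova - dma->iova);

                        /* pin exactly one page (4.4-era signature) */
                        ret = get_user_pages_fast(va & PAGE_MASK, 1, 1, page);
                        ret = (ret == 1) ? 0 : -EFAULT;
                        break;
                }
        }
        mutex_unlock(&iommu->lock);
        return ret;
}

The lookup side is of course too naive (a real backend would want an
interval tree, pinning that also works outside the QEMU process
context, accounting for unpinning, and the locked-memory limit checks
that type1 already does), but is this the general shape you have in
mind?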
> The benefit here is that QEMU could work unmodified, using the type1
> vfio-iommu API regardless of whether a device is directly assigned or
> virtual.
>
> Let's look at the type1 interface; we have simple map and unmap
> interfaces which map and unmap process virtual address space (vaddr)
> to the device address space (iova).  The host physical address is
> obtained by pinning the vaddr.  In the current implementation, a map
> operation pins pages and populates the hardware iommu.  A vgpu
> compatible implementation might simply register the translation into
> a kernel-based database to be called upon later.  When the host
> graphics driver needs to enable dma for the vgpu, it doesn't need to
> go to QEMU for the translation, it already possesses the iova to
> vaddr mapping, which becomes iova to hpa after a pinning operation.
>
> So, I would encourage you to look at creating a vgpu vfio iommu
> backend that makes use of the type1 api since it will reduce the
> changes necessary for userspace.

Yes, keeping the type1 API sounds like a great idea.

>> We also need to implement a new 'bus' driver instead of vfio_pci;
>> let's call it vfio_vgpu temporarily:
>>
>> - vfio_pci is a real pci driver with a probe method called when a
>>   device is attached; vfio_vgpu is a pseudo driver that won't attach
>>   to any device - the GPU is always owned by the host gfx driver.
>>   It has to do its 'probing' elsewhere, but still in the host gfx
>>   driver attached to the device;
>>
>> - a pcidev (PF or VF) attached to vfio_pci has a natural path in
>>   sysfs, whereas a vgpu is purely a software concept: vfio_vgpu
>>   needs to create/destroy vgpu instances, maintain their paths in
>>   sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"), etc.  Something should
>>   be added in a higher layer (VFIO or DRM) to do this;
>>
>> - vfio_pci in most cases lets QEMU access the pcidev hardware,
>>   whereas vfio_vgpu accesses virtual resources emulated by another
>>   device model;
>>
>> - vfio_pci injects an IRQ into the guest only when a physical IRQ is
>>   generated, whereas vfio_vgpu may inject an IRQ purely for
>>   emulation purposes.  Either way, they can share the same injection
>>   interface;
>
> Here too, I think you're making assumptions based on an
> implementation path.  Personally, I think each vgpu should be a
> struct device and that an iommu group should be created for each.  I
> think this is a valid abstraction; dma isolation is provided through
> something other than a system-level iommu, but it's still provided.
> Without this, the entire vfio core would need to be aware of vgpu,
> since the core operates on devices and groups.  I believe creating a
> struct device also gives you basic probe and release support for a
> driver.

Indeed.  BTW, that should be done in the 'bus' driver, right?
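
If it helps to make that concrete as well: per vgpu instance I would
expect the 'bus' driver to create a struct device as a child of the
physical GPU, give it its own iommu group (even though there is no
real requester id behind it, isolation being enforced by the host gfx
driver), and then hand it to the vfio core.  A rough sketch, with all
the vgpu_*/vfio_vgpu_* names invented for illustration:

/* Sketch only: per-vGPU device and iommu group creation. */
#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/slab.h>
#include <linux/vfio.h>

struct vgpu_device {
        struct device dev;              /* child of the physical GPU */
        int id;
};

static void vgpu_device_release(struct device *dev)
{
        kfree(container_of(dev, struct vgpu_device, dev));
}

static int vfio_vgpu_open(void *device_data)
{
        return 0;
}

static void vfio_vgpu_release(void *device_data)
{
}

/*
 * read/write/ioctl/mmap would forward region and irq accesses to the
 * device model in the host gfx driver; omitted in this sketch.
 */
static const struct vfio_device_ops vfio_vgpu_dev_ops = {
        .name    = "vfio-vgpu",
        .open    = vfio_vgpu_open,
        .release = vfio_vgpu_release,
};

static struct vgpu_device *vgpu_create(struct device *gpu, int id)
{
        struct vgpu_device *vgpu;
        struct iommu_group *group;
        int ret;

        vgpu = kzalloc(sizeof(*vgpu), GFP_KERNEL);
        if (!vgpu)
                return ERR_PTR(-ENOMEM);

        vgpu->id          = id;
        vgpu->dev.parent  = gpu;        /* appears under the physical GPU */
        vgpu->dev.release = vgpu_device_release;
        dev_set_name(&vgpu->dev, "vgpu%d", id);

        ret = device_register(&vgpu->dev);
        if (ret) {
                put_device(&vgpu->dev);
                return ERR_PTR(ret);
        }

        /*
         * A group per vgpu, although there is no per-vgpu requester
         * id: isolation comes from the host gfx driver, not from the
         * system IOMMU.
         */
        group = iommu_group_alloc();
        if (IS_ERR(group)) {
                ret = PTR_ERR(group);
                goto err_device;
        }

        ret = iommu_group_add_device(group, &vgpu->dev);
        iommu_group_put(group);
        if (ret)
                goto err_device;

        ret = vfio_add_group_dev(&vgpu->dev, &vfio_vgpu_dev_ops, vgpu);
        if (ret)
                goto err_group;

        return vgpu;

err_group:
        iommu_group_remove_device(&vgpu->dev);
err_device:
        device_unregister(&vgpu->dev);
        return ERR_PTR(ret);
}

With something like this the vfio core could treat a vgpu as an
ordinary device/group pair, which, if I read you correctly, is the
point.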
> There will be a need for some sort of lifecycle management of a
> vgpu.  How is it created?  Destroyed?  Can it be given more or less
> resources than other vgpus, etc.  This could be implemented in sysfs
> for each physical gpu with vgpu support, sort of like how we support
> sr-iov now, where the PF exports controls for creating VFs.  The more
> commonality we can get for lifecycle and device access for userspace,
> the better.

I will have a look at the VF management interfaces, thanks for the
info.

> As for virtual vs physical resources and interrupts, part of the
> purpose of vfio is to abstract a device into basic components.  It's
> up to the bus driver how accesses to each space map to the physical
> device.  Take PCI config space, for instance: the existing vfio-pci
> driver emulates some portions of config space for the user.
>
>> Questions:
>>
>> [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>>     upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU
>>     mode").  In my opinion, vfio_type2 doesn't rely on it to support
>>     the No-IOMMU case; instead it needs a new implementation which
>>     fits both w/ and w/o a hardware IOMMU.  Is this correct?
>
> vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba); this
> was simply a case where the kernel development outpaced the intended
> user and I didn't want to commit to the user api changes until it had
> been completely vetted.  In any case, vgpu should have no dependency
> whatsoever on no-iommu.  As above, I think vgpu should create virtual
> devices and add them to an iommu group, similar to how no-iommu does,
> but without the kernel tainting, because you are actually providing
> isolation through other means than a system iommu.

Thanks for the confirmation.

>> For things not mentioned above, we might have them discussed in
>> other threads, or temporarily keep them in a TODO list (we can get
>> back to them after the big picture is agreed on):
>>
>> - How to expose the guest framebuffer via VFIO for SPICE;
>
> Potentially through a new, device specific region, which I think can
> be done within the existing vfio API.  The API can already expose an
> arbitrary number of regions to the user, it's just a matter of how we
> tell the user the purpose of a region index beyond the fixed set we
> map to PCI resources.
>
>> - How to avoid double translation with the two stages GTT + IOMMU:
>>   whether an identity map is possible, and if so, how to make it
>>   more efficient;
>>
>> - Application acceleration
>>   You mentioned that with VFIO, a vGPU may be used by applications
>>   to get GPU acceleration.  It's a potential opportunity to use vGPU
>>   for containers, worth further investigation.
>
> Yes, interesting topics.  Thanks,

It looks like things are getting clearer overall, with only small
exceptions.  Thanks for the advice :)

> Alex
>

--
Thanks,
Jike
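
P.S. Regarding the sr-iov style lifecycle controls: is something along
these lines on the physical GPU's sysfs node what you have in mind?
Again just a sketch; intel_vgpu_create() is an invented name standing
in for whatever hook the host gfx driver ends up providing (see the
vgpu_create() sketch above):

/* Sketch only: a PF-like sysfs control on the physical GPU. */
#include <linux/device.h>
#include <linux/err.h>
#include <linux/kernel.h>

struct vgpu_device;

/* Invented name: provided by the host gfx driver. */
extern struct vgpu_device *intel_vgpu_create(struct device *gpu, int id);

static ssize_t vgpu_create_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        struct vgpu_device *vgpu;
        unsigned int id;
        int ret;

        ret = kstrtouint(buf, 0, &id);
        if (ret)
                return ret;

        /* creates the vgpuN device and its iommu group */
        vgpu = intel_vgpu_create(dev, id);
        if (IS_ERR(vgpu))
                return PTR_ERR(vgpu);

        return count;
}
static DEVICE_ATTR_WO(vgpu_create);

/* Called by the host gfx driver when it binds to the physical GPU. */
static int vgpu_sysfs_init(struct device *gpu)
{
        return device_create_file(gpu, &dev_attr_vgpu_create);
}

Then e.g. 'echo 0 > /sys/bus/pci/devices/0000:00:02.0/vgpu_create'
would instantiate vgpu0, roughly the way writing sriov_numvfs creates
VFs.  A matching vgpu_destroy control (and probably a resource/type
parameter) would be needed as well.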