Hi Alex, let's continue with a new thread :)

Basically we agree with you: exposing vGPU via VFIO can make QEMU share as much code as possible with pcidev (PF or VF) assignment. And yes, different vGPU vendors can share quite a lot of the QEMU part, which will benefit upper layers such as libvirt.

To achieve this there is quite a lot to do; I'll summarize it below. I dived into VFIO for a while but may still have misunderstood things, so please correct me :)

First, let me illustrate my understanding of the current VFIO framework used to pass a pcidev through to a guest:

         +----------------------------------+
         |             vfio qemu            |
         +-----+------------------------+---+
               |DMA                 ^   |CFG
 QEMU          |map              IRQ|   |
---------------|--------------------|---|---------------
 KERNEL        |                    |   |
         +-----|--------------------|---|--------------+
         |VFIO |                    |   |              |
         |     v                    |   v              |
         |  +-------------------+  +-----------------+ |
 IOMMU   |  | vfio iommu driver |  | vfio bus driver | |
 API <------+                   |  |                 | |
 Layer   |  |    e.g. type1     |  |  e.g. vfio_pci  | |
         |  +-------------------+  +-----------------+ |
         +---------------------------------------------+

Here, when a particular pcidev is passed through to a KVM guest, it is attached to the vfio_pci driver in the host, and guest memory is mapped into the IOMMU via the type1 iommu driver.

Then, the draft infrastructure of the future VFIO-based vGPU:

         +----------------------------------+
         |             vfio qemu            |
         +-----+------------------------+---+
               |DMA                 ^   |CFG
 QEMU          |map              IRQ|   |
---------------|--------------------|---|---------------
 KERNEL        |                    |   |
         +-----|--------------------|---|--------------+
         |VFIO |                    |   |              |
         |     v                    |   v              |
         |  +-------------------+  +-----------------+ |
 DMA     |  | vfio iommu driver |  | vfio bus driver | |
 API <------+                   |  |                 | |
 Layer   |  |  e.g. vfio_type2  |  | e.g. vfio_vgpu  | |
         |  +-------------------+  +-----------------+ |
         |     |  ^                     |  ^           |
         +-----|--|---------------------|--|-----------+
               |  |                     |  |
               |  |                     v  |
         +-----|--|--------------+ +-------------------+
         |  +--v--------------+  | |                   |
         |  |      KVMGT      |  | |  host gfx driver  |
         |  +-----------------+  | |                   |
         |                       | |                   |
         |    KVM hypervisor     | |                   |
         +-----------------------+ +-------------------+

NOTE: vfio_type2 and vfio_vgpu are only *logically* parts of VFIO; they may be implemented inside the KVM hypervisor or the host gfx driver.

Here we need to implement a new vfio IOMMU driver instead of type1; let's call it vfio_type2 temporarily. The main difference from pcidev assignment is that a vGPU doesn't have its own DMA requester id, so it has to share mappings with the host and with other vGPUs:

- the type1 iommu driver maps gpa to hpa for pass-through, whereas type2 maps iova to hpa;

- a hardware iommu is always needed by type1, whereas for type2 a hardware iommu is optional;

- type1 invokes the low-level IOMMU API (iommu_map et al.) to set up the IOMMU page table directly, whereas type2 doesn't -- it only needs to invoke a higher-level DMA API such as dma_map_page (a rough sketch contrasting the two paths follows after these lists);

We also need to implement a new 'bus' driver instead of vfio_pci; let's call it vfio_vgpu temporarily:

- vfio_pci is a real pci driver with a probe method called when the device is attached, whereas vfio_vgpu is a pseudo driver and won't attach to any device -- the GPU is always owned by the host gfx driver. Its 'probing' has to happen elsewhere, still inside the host gfx driver attached to the device;

- a pcidev (PF or VF) attached to vfio_pci has a natural path in sysfs, whereas a vGPU is purely a software concept: vfio_vgpu needs to create/destroy vgpu instances, maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"), etc. Something has to be added in a higher layer (VFIO or DRM) to do this; a sketch of this part is also given below;

- vfio_pci in most cases lets QEMU access pcidev hardware, whereas vfio_vgpu gives access to virtual resources emulated by another device model;

- vfio_pci injects an IRQ into the guest only when a physical IRQ is generated, whereas vfio_vgpu may inject an IRQ for emulation purposes. Either way, they can share the same injection interface;
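To make the type1/type2 difference above more concrete, here is a rough, illustrative sketch of how a single page might be mapped in each case. This is not working code: the function names are invented, error handling is trimmed, and the kernel API signatures are only approximate.

/*
 * Rough sketch only -- invented names, minimal error handling,
 * approximate signatures.  The point is the shape of the two paths,
 * not the exact code.
 */
#include <linux/mm.h>
#include <linux/iommu.h>
#include <linux/dma-mapping.h>

/*
 * type1-style: pin the user page backing a gpa and program the hardware
 * IOMMU directly, so that DMA from the assigned device to 'iova' (== gpa)
 * reaches that page.
 */
static int type1_style_map_one(struct iommu_domain *domain,
                               unsigned long vaddr, unsigned long iova)
{
        struct page *page;
        int ret;

        ret = get_user_pages_fast(vaddr, 1, 1 /* write */, &page);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        ret = iommu_map(domain, iova, page_to_phys(page), PAGE_SIZE,
                        IOMMU_READ | IOMMU_WRITE);
        if (ret)
                put_page(page);
        return ret;
}

/*
 * type2-style: the vGPU has no requester id of its own, so we only pin
 * the page and obtain a dma address usable by the *physical* GPU through
 * the DMA API; the host gfx driver then writes that address into the
 * shadow GTT entries it maintains for this vGPU.
 */
static int type2_style_map_one(struct device *gpu_dev, unsigned long vaddr,
                               dma_addr_t *dma)
{
        struct page *page;
        int ret;

        ret = get_user_pages_fast(vaddr, 1, 1 /* write */, &page);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        *dma = dma_map_page(gpu_dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(gpu_dev, *dma)) {
                put_page(page);
                return -EFAULT;
        }
        return 0;
}

And for the 'bus driver' part, something along the following lines would be needed to create the /sys/class/vgpu/... nodes. Again this is only a sketch under my assumptions: vgpu_class, vgpu_register_instance etc. are made-up names, and the exact sysfs layout (e.g. the per-vendor subdirectory in the path above) is left open.

#include <linux/module.h>
#include <linux/device.h>
#include <linux/err.h>

static struct class *vgpu_class;

/* done once, e.g. at init time: creates /sys/class/vgpu/ */
static int vgpu_sysfs_init(void)
{
        vgpu_class = class_create(THIS_MODULE, "vgpu");
        return PTR_ERR_OR_ZERO(vgpu_class);
}

/*
 * called by the host gfx driver when it carves a new vGPU instance out of
 * the physical GPU; the returned device backs /sys/class/vgpu/vgpuN
 */
static struct device *vgpu_register_instance(struct device *gpu_parent,
                                             int id, void *vgpu_priv)
{
        return device_create(vgpu_class, gpu_parent, MKDEV(0, 0),
                             vgpu_priv, "vgpu%d", id);
}

Whether this piece lives in VFIO, in DRM, or in the vendor driver is exactly the open question mentioned in the list above.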
Questions:

[1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode"). In my opinion, vfio_type2 doesn't rely on it to support the No-IOMMU case; instead it needs a new implementation which fits both the w/ and w/o hardware IOMMU cases. Is this correct?

Things not mentioned above might be discussed in other threads, or temporarily kept on a TODO list (we can get back to them once the big picture is agreed):

- How to expose the guest framebuffer via VFIO for SPICE;

- How to avoid double translation with the two stages, GTT + IOMMU: whether an identity map is possible, and if yes, how to make it more efficient;

- Application acceleration: you mentioned that with VFIO, a vGPU may be used by applications to get GPU acceleration. That is a potential opportunity to use vGPU for containers, worthy of further investigation.

--
Thanks,
Jike