Hi Jike,

On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
> Hi Alex, let's continue with a new thread :)
>
> Basically we agree with you: exposing vGPU via VFIO can make
> QEMU share as much code as possible with pcidev (PF or VF) assignment.
> And yes, different vGPU vendors can share quite a lot of the
> QEMU part, which will benefit upper layers such as libvirt.
>
> To achieve this, there is quite a lot to do; I'll summarize
> it below.  I dived into VFIO for a while but may still have
> things misunderstood, so please correct me :)
>
> First, let me illustrate my understanding of the current VFIO
> framework used to pass through a pcidev to a guest:
>
>
>                  +----------------------------------+
>                  |             vfio qemu            |
>                  +-----+------------------------+---+
>                        |DMA                  ^  |CFG
>  QEMU                  |map               IRQ|  |
> -----------------------|---------------------|--|-----------
>  KERNEL   +------------|---------------------|--|----------+
>           | VFIO       |                     |  |          |
>           |            v                     |  v          |
>           |  +-------------------+     +-----+-----------+ |
>  IOMMU    |  | vfio iommu driver |     | vfio bus driver | |
>  API <-------+                   |     |                 | |
>  Layer    |  | e.g. type1        |     | e.g. vfio_pci   | |
>           |  +-------------------+     +-----------------+ |
>           +------------------------------------------------+
>
>
> Here when a particular pcidev is passed through to a KVM guest,
> it is attached to the vfio_pci driver on the host, and guest memory
> is mapped into the IOMMU via the type1 iommu driver.
>
>
> Then, the draft infrastructure of the future VFIO-based vgpu:
>
>
>                  +-------------------------------------+
>                  |              vfio qemu              |
>                  +----+-------------------------+------+
>                       |DMA                   ^  |CFG
>  QEMU                 |map                IRQ|  |
> ----------------------|----------------------|--|-----------
>  KERNEL               |                      |  |
>          +------------|----------------------|--|----------+
>          |VFIO        |                      |  |          |
>          |            v                      |  v          |
>          |  +--------------------+     +-----+-----------+ |
>  DMA     |  | vfio iommu driver  |     | vfio bus driver | |
>  API <------+                    |     |                 | |
>  Layer   |  | e.g. vfio_type2    |     | e.g. vfio_vgpu  | |
>          |  +--------------------+     +-----------------+ |
>          |         |  ^                      |  ^          |
>          +---------|--|----------------------|--|----------+
>                    |  |                      |  |
>                    |  |                      v  |
>          +---------|--|----------+     +---------------------+
>          | +-------v-----------+ |     |                     |
>          | |                   | |     |                     |
>          | |       KVMGT       | |     |                     |
>          | |                   | |     |   host gfx driver   |
>          | +-------------------+ |     |                     |
>          |                       |     |                     |
>          |   KVM hypervisor      |     |                     |
>          +-----------------------+     +---------------------+
>
> NOTE: vfio_type2 and vfio_vgpu are only *logically* parts
> of VFIO; they may be implemented in the KVM hypervisor
> or the host gfx driver.
>
>
> Here we need to implement a new vfio IOMMU driver instead of type1;
> let's call it vfio_type2 temporarily.  The main difference from
> pcidev assignment is that a vGPU doesn't have its own DMA requester
> ID, so it has to share mappings with the host and other vGPUs.
>
> - the type1 iommu driver maps gpa to hpa for passing through,
>   whereas type2 maps iova to hpa;
>
> - a hardware iommu is always needed by type1, whereas for
>   type2 a hardware iommu is optional;
>
> - type1 will invoke the low-level IOMMU API (iommu_map et al.) to
>   set up the IOMMU page table directly, whereas type2 doesn't (it
>   only needs to invoke a higher-level DMA API like dma_map_page);

Yes, the current type1 implementation is not compatible with vgpu
since there are no separate requester IDs on the bus and you probably
don't want or need to pin all of guest memory like we do for direct
assignment.  However, let's separate the type1 user API from the
current implementation.  It's quite easy within the vfio code to
consider "type1" to be an API specification that may have multiple
implementations.
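Just to be concrete about what that user API amounts to: from QEMU's
side it is essentially the map/unmap ioctls on the container fd,
roughly as in the sketch below.  The function name and the fd/address
values here are only placeholders, not anything that exists today.

#include <linux/vfio.h>
#include <sys/ioctl.h>

/*
 * Sketch of the existing type1 user API: map a range of process
 * virtual address space (vaddr) into device address space (iova).
 */
int vgpu_dma_map(int container_fd, void *vaddr, __u64 iova, __u64 size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (__u64)(unsigned long)vaddr,
                .iova  = iova,
                .size  = size,
        };

        /*
         * Whether the backend pins these pages and programs a hardware
         * iommu (current type1) or merely records the iova->vaddr
         * translation for later pinning (a vgpu backend) is invisible
         * at this level.
         */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}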
A minor code change would allow us to continue looking for compatible
iommu backends if the group we're trying to attach is rejected.  The
benefit here is that QEMU could work unmodified, using the type1
vfio-iommu API regardless of whether a device is directly assigned or
virtual.

Let's look at the type1 interface; we have simple map and unmap
interfaces which map and unmap process virtual address space (vaddr)
to the device address space (iova).  The host physical address is
obtained by pinning the vaddr.  In the current implementation, a map
operation pins pages and populates the hardware iommu.  A
vgpu-compatible implementation might simply register the translation
into a kernel-based database to be called upon later.  When the host
graphics driver needs to enable dma for the vgpu, it doesn't need to
go to QEMU for the translation; it already possesses the iova to
vaddr mapping, which becomes iova to hpa after a pinning operation.
So, I would encourage you to look at creating a vgpu vfio iommu
backend that makes use of the type1 API, since it will reduce the
changes necessary for userspace.

> We also need to implement a new 'bus' driver instead of vfio_pci;
> let's call it vfio_vgpu temporarily:
>
> - vfio_pci is a real pci driver; it has a probe method called
>   during device attach.  vfio_vgpu, in contrast, is a pseudo driver:
>   it won't attach any device - the GPU is always owned by the host
>   gfx driver.  It has to do its 'probing' elsewhere, but still in
>   the host gfx driver attached to the device;
>
> - a pcidev (PF or VF) attached to vfio_pci has a natural path in
>   sysfs, whereas a vgpu is purely a software concept: vfio_vgpu
>   needs to create/destroy vgpu instances, maintain their paths in
>   sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"), etc.  There should be
>   something added in a higher layer to do this (VFIO or DRM);
>
> - vfio_pci in most cases will allow QEMU to access pcidev hardware,
>   whereas vfio_vgpu is to access virtual resources emulated by
>   another device model;
>
> - vfio_pci will inject an IRQ to the guest only when a physical IRQ
>   is generated, whereas vfio_vgpu may inject an IRQ for emulation
>   purposes.  Anyway, they can share the same injection interface;

Here too, I think you're making assumptions based on an implementation
path.  Personally, I think each vgpu should be a struct device and
that an iommu group should be created for each.  I think this is a
valid abstraction; dma isolation is provided through something other
than a system-level iommu, but it's still provided.  Without this,
the entire vfio core would need to be aware of vgpu, since the core
operates on devices and groups.  I believe creating a struct device
also gives you basic probe and release support for a driver.

There will be a need for some sort of lifecycle management of a vgpu.
How is it created?  Destroyed?  Can it be given more or fewer
resources than other vgpus, etc.?  This could be implemented in sysfs
for each physical gpu with vgpu support, sort of like how we support
SR-IOV now, where the PF exports controls for creating VFs.  The more
commonality we can get for lifecycle and device access for userspace,
the better.

As for virtual vs. physical resources and interrupts, part of the
purpose of vfio is to abstract a device into basic components.  It's
up to the bus driver how accesses to each space map to the physical
device.  Take PCI config space, for instance: the existing vfio-pci
driver emulates some portions of config space for the user.
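To make the struct device / iommu group suggestion above a bit more
concrete, the host gfx driver (or a small vgpu core) might do
something along these lines when a vgpu instance is created.  This is
only a sketch; every vgpu_* name in it is invented, and only the
driver core and iommu group calls are existing kernel interfaces.

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/slab.h>

struct vgpu_device {
        struct device dev;      /* child of the physical GPU device */
        int id;
};

static void vgpu_device_release(struct device *dev)
{
        kfree(container_of(dev, struct vgpu_device, dev));
}

static struct vgpu_device *vgpu_create(struct device *parent, int id)
{
        struct vgpu_device *vgpu;
        struct iommu_group *group;
        int ret;

        vgpu = kzalloc(sizeof(*vgpu), GFP_KERNEL);
        if (!vgpu)
                return ERR_PTR(-ENOMEM);

        vgpu->id = id;
        vgpu->dev.parent = parent;
        vgpu->dev.release = vgpu_device_release;
        dev_set_name(&vgpu->dev, "vgpu%d", id);

        ret = device_register(&vgpu->dev);
        if (ret) {
                put_device(&vgpu->dev);
                return ERR_PTR(ret);
        }

        /*
         * Isolation comes from the vendor gfx code rather than a
         * system-level iommu, but the vfio core still gets the
         * device/group model it expects.
         */
        group = iommu_group_alloc();
        if (IS_ERR(group)) {
                device_unregister(&vgpu->dev);
                return ERR_CAST(group);
        }

        ret = iommu_group_add_device(group, &vgpu->dev);
        iommu_group_put(group);
        if (ret) {
                device_unregister(&vgpu->dev);
                return ERR_PTR(ret);
        }

        return vgpu;
}

With a device and a group per vgpu, the existing vfio group and
container code has something to operate on without needing to know
anything vgpu-specific.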
> Questions:
>
> [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>     in upstream ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
>     In my opinion, vfio_type2 doesn't rely on it to support the
>     no-IOMMU case; instead it needs a new implementation which fits
>     both w/ and w/o IOMMU.  Is this correct?

vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba); this was
simply a case where kernel development outpaced the intended user and
I didn't want to commit to the user API changes until they had been
completely vetted.  In any case, vgpu should have no dependency
whatsoever on no-iommu.  As above, I think vgpu should create virtual
devices and add them to an iommu group, similar to how no-iommu does,
but without the kernel tainting, because you are actually providing
isolation through means other than a system iommu.

> For things not mentioned above, we might have them discussed in
> other threads, or temporarily maintained in a TODO list (we might
> get back to them after the big picture gets agreed):
>
> - How to expose the guest framebuffer via VFIO for SPICE;

Potentially through a new, device-specific region, which I think can
be done within the existing vfio API.  The API can already expose an
arbitrary number of regions to the user; it's just a matter of how we
tell the user the purpose of a region index beyond the fixed set we
map to PCI resources.

> - How to avoid double translation with the two stages (GTT + IOMMU),
>   whether an identity map is possible, and if so, how to make it
>   more efficient;
>
> - Application acceleration
>   You mentioned that with VFIO, a vGPU may be used by applications
>   to get GPU acceleration.  It's a potential opportunity to use
>   vGPU for container usage, worthy of further investigation.

Yes, interesting topics.  Thanks,

Alex
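P.S. For the device-specific region idea above, the userspace side
could look roughly like the sketch below; the extra region index is
purely hypothetical, and how we actually describe the region's purpose
to the user is exactly the open question.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Hypothetical index for a framebuffer region, placed after the fixed
 * set of indexes we already map to PCI BARs, ROM, config and VGA. */
#define VGPU_FB_REGION_INDEX    VFIO_PCI_NUM_REGIONS

int query_fb_region(int device_fd)
{
        struct vfio_region_info info = {
                .argsz = sizeof(info),
                .index = VGPU_FB_REGION_INDEX,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
                return -1;      /* region not provided by this device */

        printf("fb region: size 0x%llx offset 0x%llx flags 0x%x\n",
               (unsigned long long)info.size,
               (unsigned long long)info.offset, info.flags);

        /*
         * If VFIO_REGION_INFO_FLAG_MMAP is set, the framebuffer can be
         * mmap()'d through the device fd at info.offset.
         */
        return 0;
}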