RE: [RFC 0/7] drm/virtio: Import scanout buffers from other devices

"Kasireddy, Vivek" <vivek.kasireddy@xxxxxxxxx> · Fri, 24 May 2024 06:56:34 +0000

Hi Gurchetan,

Thank you for taking a look at this patch series!

On Thu, Mar 28, 2024 at 2:01 AM Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx> wrote:

Having virtio-gpu import scanout buffers (via prime) from other
devices means that we'd be adding a head to headless GPUs assigned
to a Guest VM or additional heads to regular GPU devices that are
passthrough'd to the Guest. In these cases, the Guest compositor
can render into the scanout buffer using a primary GPU and have the
secondary GPU (virtio-gpu) import it for display purposes.

The main advantage with this is that the imported scanout buffer can
either be displayed locally on the Host (e.g., using Qemu + GTK UI)
or encoded and streamed to a remote client (e.g., Qemu + Spice UI).
Note that since Qemu uses the udmabuf driver, there would be no copies
made of the scanout buffer as it is displayed. This should be
possible even when it might reside in device memory such as VRAM.

The specific use-case that can be supported with this series is when
running Weston or other guest compositors with the "additional-devices"
feature (./weston --drm-device=card1 --additional-devices=card0).
More info about this feature can be found at:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736

In the above scenario, card1 could be a dGPU or an iGPU and card0
would be virtio-gpu in KMS only mode. However, the case where this
patch series could be particularly useful is when card1 is a GPU VF
that needs to share its scanout buffer (in a zero-copy way) with the
GPU PF on the Host. Or, it can also be useful when the scanout buffer
needs to be shared between any two GPU devices (assuming one of them
is assigned to a Guest VM) as long as they are P2P DMA compatible.

Is passthrough iGPU-only or passthrough dGPU-only something you intend to use?

Our main use-case involves passing through a headless dGPU VF device and sharing
the Guest compositor's scanout buffer with the dGPU PF device on the Host. The same
goal applies to a headless iGPU VF and the iGPU PF device as well.

However, using a combination of iGPU and dGPU where either of them can be passthrough’d
to the Guest is something I think can be supported with this patch series as well.

If it's a dGPU + iGPU setup, then the way other people seem to do it is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through the dGPU.

For example, AMD seems to use virgl to allocate and import into the dGPU.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896

https://lore.kernel.org/all/20231221100016.4022353-1-julia.zhang@xxxxxxx/

ChromeOS also uses that method (see crrev.com/c/3764931) [cc: dGPU architect +Dominik Behr]

So if iGPU + dGPU is the primary use case, you should be able to use these methods as well. The model would be "virtualized iGPU" + passthrough dGPU, not split SoCs.

In our use-case, the goal is to have only one primary GPU (passthrough'd iGPU/dGPU)
do all the rendering (using native DRI drivers) for clients/compositor and all the outputs
and share the scanout buffers with the secondary GPU (virtio-gpu). Since this is mostly
how Mutter (and also Weston) work in a multi-GPU setup, I am not sure if virgl is needed.

As part of the import, the virtio-gpu driver shares the dma
addresses and lengths with Qemu which then determines whether the
memory region they belong to is owned by a PCI device or whether it
is part of the Guest's system ram. If it is the former, it identifies
the devid (or bdf) and bar and provides this info (along with offsets
and sizes) to the udmabuf driver. In the latter case, instead of the
devid and bar it provides the memfd. The udmabuf driver then
creates a dmabuf using this info that Qemu shares with Spice for
encode via Gstreamer.
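
The classification step described above can be sketched roughly as follows. This is only an illustrative user-space model in C, not code from the Qemu patches: the region table, the struct layouts, and the function name classify_addr() are all hypothetical, and a real implementation would consult Qemu's own memory map rather than a flat array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kinds of guest-physical regions (names are assumptions). */
enum region_kind { REGION_SYSTEM_RAM, REGION_PCI_BAR };

/* Hypothetical entry in a flat table describing the guest address space. */
struct region {
    uint64_t base;
    uint64_t size;
    enum region_kind kind;
    uint16_t bdf;  /* meaningful only for REGION_PCI_BAR */
    uint8_t bar;   /* meaningful only for REGION_PCI_BAR */
};

/* What would be handed on for one (address, length) pair. */
struct backing_entry {
    enum region_kind kind; /* RAM: pair with a memfd; BAR: pair with bdf + bar */
    uint16_t bdf;
    uint8_t bar;
    uint64_t offset;       /* offset into the memfd or into the BAR */
    uint64_t size;
};

/*
 * Find the region that fully contains [addr, addr + len) and fill *out.
 * Returns 0 on success, -1 if no single region contains the range.
 */
int classify_addr(const struct region *regions, size_t n,
                  uint64_t addr, uint64_t len, struct backing_entry *out)
{
    for (size_t i = 0; i < n; i++) {
        const struct region *r = &regions[i];

        /* Written to avoid overflow in addr + len. */
        if (addr >= r->base && len <= r->size &&
            addr - r->base <= r->size - len) {
            out->kind = r->kind;
            out->bdf = (r->kind == REGION_PCI_BAR) ? r->bdf : 0;
            out->bar = (r->kind == REGION_PCI_BAR) ? r->bar : 0;
            out->offset = addr - r->base;
            out->size = len;
            return 0;
        }
    }
    return -1;
}
```

In this sketch, a range that straddles two regions is rejected rather than split; whether the actual patches split such ranges is not stated here.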

Note that the virtio-gpu driver registers a move_notify() callback
to track location changes associated with the scanout buffer and
sends attach/detach backing cmds to Qemu when appropriate. And,
synchronization (that is, ensuring that Guest and Host are not
using the scanout buffer at the same time) is ensured by pinning/unpinning
the dmabuf as part of plane update and using a fence
in resource_flush cmd.
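
The ordering described above (pin on plane update, fence on flush, unpin once the Host is done, detach backing when the buffer moves) can be modelled with a small user-space state machine. This is a toy sketch under stated assumptions, not kernel code: every struct and function name below is hypothetical, and the boolean flags merely stand in for the attach/detach backing commands and the fence.

```c
#include <stdbool.h>

/* Hypothetical model of an imported scanout buffer. */
struct scanout_bo {
    int pin_count;       /* > 0 means the exporter must not move the buffer */
    bool attached;       /* backing currently attached on the Host side */
    bool fence_pending;  /* Host has not yet finished with the buffer */
};

/* Plane update: pin first, then make sure the Host has valid backing. */
void plane_update_begin(struct scanout_bo *bo)
{
    bo->pin_count++;
    if (!bo->attached)
        bo->attached = true;  /* stands in for an attach-backing cmd */
}

/* resource_flush carries a fence that the Host signals when it is done. */
void resource_flush(struct scanout_bo *bo)
{
    bo->fence_pending = true;
}

/* Fence signaled: the Host is done (blit or encode), so drop the pin. */
void fence_signaled(struct scanout_bo *bo)
{
    bo->fence_pending = false;
    if (bo->pin_count > 0)
        bo->pin_count--;
}

/* Importer side of a move: the old backing is stale, so detach it. */
void importer_move_notify(struct scanout_bo *bo)
{
    bo->attached = false;     /* stands in for a detach-backing cmd */
}

/*
 * Exporter side: a pinned buffer stays where it is; an unpinned one may
 * move, in which case the importer is notified. Returns true if it moved.
 */
bool exporter_try_move(struct scanout_bo *bo)
{
    if (bo->pin_count > 0)
        return false;
    importer_move_notify(bo);
    return true;
}
```

In this model the buffer can only move between a signaled fence and the next plane update, which is the window in which neither side is using it.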

I'm not sure how QEMU's display paths work, but with crosvm if you share the guest-created dmabuf with the display, and the guest moves the backing pages, the only recourse is to destroy the surface and show a black screen to the user: not the best thing experience-wise.

Since Qemu GTK UI uses EGL, there is a blit done from the guest's scanout buffer onto an EGL
backed buffer on the Host. So, this problem would not happen as of now.

Only amdgpu calls dma_buf_move_notify(..), and you're probably testing on Intel only, so you may not be hitting that code path anyways.

I have tested with the Xe driver in the Guest which also calls dma_buf_move_notify(). However,
note that for dGPUs, both Xe and amdgpu migrate the scanout buffer from vram to system
memory as part of export, because virtio-gpu is not P2P compatible.

I forgot the exact reason, but apparently udmabuf may not work with amdgpu displays and it seems the virtualized iGPU + dGPU is the way to go for amdgpu anyways.

I am curious why udmabuf would not work with amdgpu?

So I recommend just pinning the buffer for the lifetime of the import for simplicity and correctness.

Yeah, in this patch series, the dmabuf is indeed pinned but only for a short duration in the Guest,
just until the Host is done using it (blit or encode).

Thanks,
Vivek

This series is available at:
https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc

along with additional patches for Qemu and Spice here:
https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4

Patchset overview:

Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
Patch 2-3: Helpers to initialize, import, free imported object
Patch 4-5: Import and use buffers from other devices for scanout
Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P DMA

This series is tested using the following method:
- Run Qemu with the following relevant options:
  qemu-system-x86_64 -m 4096m ....
  -device vfio-pci,host=0000:03:00.0
  -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
  -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
  -object memory-backend-memfd,id=mem1,size=4096M
  -machine memory-backend=mem1 ...
- Run upstream Weston with the following options in the Guest VM:
  ./weston --drm-device=card1 --additional-devices=card0

where card1 is a DG2 dGPU (passthrough'd and using xe driver in Guest VM),
card0 is virtio-gpu and the Host is using a RPL iGPU.

Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx>
Cc: Dongwon Kim <dongwon.kim@xxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
Cc: Christian Koenig <christian.koenig@xxxxxxx>
Cc: Dmitry Osipenko <dmitry.osipenko@xxxxxxxxxxxxx>
Cc: Rob Clark <robdclark@xxxxxxxxxxxx>
Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
Cc: Oded Gabbay <ogabbay@xxxxxxxxxx>
Cc: Michal Wajdeczko <michal.wajdeczko@xxxxxxxxx>
Cc: Michael Tretter <m.tretter@xxxxxxxxxxxxxx>

Vivek Kasireddy (7):
  drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
  drm/virtio: Add a helper to map and note the dma addrs and lengths
  drm/virtio: Add helpers to initialize and free the imported object
  drm/virtio: Import prime buffers from other devices as guest blobs
  drm/virtio: Ensure that bo's backing store is valid while updating
    plane
  udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar regions
  udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl

 drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
 drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
 drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
 drivers/gpu/drm/virtio/virtgpu_prime.c | 167 ++++++++++++++++++++++++-
 drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
 include/uapi/linux/udmabuf.h           |  11 +-
 6 files changed, 368 insertions(+), 11 deletions(-)

-- 

2.43.0