Re: Xen memory management primitives for GPU virtualization

On Sun, Feb 02, 2025 at 12:08:46AM -0500, Demi Marie Obenour wrote:
> Recently, AMD submitted patches to the dri-devel mailing list to support
> using application-provided buffers in virtio-GPU.  This feature is
> called Shared Virtual Memory (SVM) and it is implemented via an API
> called User Pointer (userptr).  This led to some discussion on
> dri-devel@xxxxxxxxxxxxxxxxxxxxx and dri-devel IRC, from which I
> concluded that Xen is missing critical primitives for GPU-accelerated
> graphics and compute.  The missing primitives for graphics are the ones
> discussed at Xen Project Summit 2024, but it turns out that additional
> primitives are needed for compute workloads.
> 
> As discussed at Xen Project Summit 2024, GPU acceleration via virtio-GPU
> requires that an IOREQ server have access to the following primitives:
> 
> 1. Map: Map a backend-provided buffer into the frontend.  The buffer
>    might point to system memory or to a PCIe BAR.  The frontend is _not_
>    allowed to use these buffers in hypercalls or grant them to other
>    domains.  Accessing the pages using hypercalls directed at the
>    frontend fails as if the frontend did not have the pages.  The only
>    exception is that the frontend _may_ be allowed to use the buffer in
>    a Map operation, provided that Revoke (below) is transitive.

Further note: if the frontend has an assigned PCI device, I believe that
pages provided by Map _should not_ be included in the device's IOMMU
mappings.  This avoids needing a synchronous IOTLB flush when Revoke is
performed, which would be slow.  Delaying the flush would allow the
frontend to DMA into freed backend memory and so would be a security
vulnerability.  Furthermore, such entries would not be useable by the
frontend in any way, as the frontend cannot perform DMA to them without
racing against a concurrent call to Revoke made by the backend.
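
To make the Map/Revoke semantics concrete, here is a rough sketch of
the sort of interface I have in mind, written as a hypothetical
XENMEM-style memory op.  None of these names exist in Xen today; this
is purely illustrative and the real design belongs on xen-devel:

    /* Hypothetical interface sketch -- these operations do not exist
     * in Xen today. */
    #include <stdint.h>

    typedef uint16_t domid_t;

    /* Map: expose a backend-provided frame (system RAM or part of a
     * PCIe BAR) at a GFN in the frontend.  The resulting GFN must not
     * be usable by the frontend in hypercalls, grant operations, or
     * (if the frontend has an assigned PCI device) IOMMU mappings. */
    struct gpu_map {
        domid_t  frontend;      /* domain to map into               */
        uint16_t pad;
        uint32_t nr_frames;     /* number of contiguous frames      */
        uint64_t backend_gfn;   /* first backend frame to expose    */
        uint64_t frontend_gfn;  /* where it appears in the frontend */
    };

    /* Revoke: tear down a previous Map.  Must not fail for a correctly
     * behaving backend; afterwards no path in or through the frontend
     * can reach the old frames, and further frontend accesses fault in
     * a way the backend (IOREQ server) can intercept. */
    struct gpu_revoke {
        domid_t  frontend;
        uint16_t pad;
        uint32_t nr_frames;
        uint64_t frontend_gfn;
    };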

> 2. Revoke: Revoke access to a buffer provided by the backend.  Once
>    access is revoked, no operation on or in the frontend domain can
>    access or modify the pages, and the backend can safely reuse the
>    backing memory for other purposes.  Furthermore, revocation is not
>    allowed to fail unless the backend or hypervisor is buggy, and if it
>    does fail for any reason, the backend will panic.  Once access is
>    revoked, further accesses by the frontend will cause a fault that the
>    backend can intercept.

How should this interact with emulated I/O devices?  If the emulated I/O
device uses mmap() on the dmabuf to access the pages, things will work
fine, but if it uses foreign mapping operations, things will fail rather
miserably.  I think it is okay for this to fail: DMA to pages provided
by Map is always a guest bug, and that includes DMA by an emulated
device.  Guest userspace is prepared for such errors, because dmabufs on
bare silicon work the same way.
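
For clarity, the "works fine" path I mean is just the normal dmabuf
CPU-access sequence from the device model's own mapping, roughly the
following (function name made up, error handling mostly omitted):

    /* Sketch: emulated-device access to a Map-provided buffer via
     * mmap() on the dmabuf, rather than via foreign mappings of
     * frontend GFNs. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/dma-buf.h>

    static int copy_into_dmabuf(int dmabuf_fd, size_t size,
                                const void *src, size_t len, size_t off)
    {
        struct dma_buf_sync sync;
        void *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, dmabuf_fd, 0);

        if (map == MAP_FAILED)
            return -1;

        sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);  /* begin CPU access */
        memcpy((uint8_t *)map + off, src, len);
        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);  /* end CPU access */
        return munmap(map, size);
    }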

> 3. Steal: Convert frontend-owned pages to backend-owned pages and
>    provide the backend with a mapping of the page.  After a successful
>    Steal operation, the pages are in the same state as if they had been
>    provided via Map.  Steal fails if the pages are currently being used
>    in a hypercall, are MMIO (as opposed to system memory), were provided
>    by another domain via Map or grant tables, are currently foreign
>    mapped, are currently granted to another domain, or more generally
>    are accessible to any domain other than the target domain.  The
>    frontend's quota is decreased by the number of pages stolen, and the
>    backend's quota is increased by the same amount.  A successful Steal
>    operation means that Revoke and Map can be used to operate on the
>    pages.

How should this work if the frontend has an assigned PCI device?  Xen
can unmap the pages from the frontend's IOMMU mappings, but the frontend
might continue to try to perform DMA to these pages.  This would result
in runtime misbehavior or data corruption.

PV devices could be a significant problem too, because Steal is
incompatible with grant tables for security reasons.  If Steal were
permitted on granted pages, a malicious guest could grant a page to
domain A (which expects it to be ordinary RAM) and then ask domain B to
steal it.  Domain B steals the page, revokes it (for page migration
purposes, say), and reuses the backing storage.  Now domain A has a
mapping of freed domain B memory.  If Steal instead revoked domain A's
mapping too, domain B could block domain A forever.  This means that
unless domain B is privileged over domain A (and domain A has no
assigned PCI devices), the Steal operation must fail.

Matthew: Do you know if Linux supports marking anonymous memory as "does
not support DMA/pin_user_pages()"?  If not, how hard would it be to
implement this?  Without this, all I/O by a guest using virtio-GPU
shared virtual memory would need to be bounce-buffered, which isn't
great for performance.  This includes I/O using paravirtualized devices,
unless the PV driver can handle the grant operation failing and fall
back to a bounce buffer.  It would be much better if only I/O from stolen
pages needed to be bounced.
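
What I mean by "fall back to a bounce buffer" is roughly the following
frontend-side logic; grant_page_to_backend() and alloc_io_page() are
made-up names standing in for whatever the PV driver actually uses:

    /* Hypothetical frontend-side fallback: if a page cannot be granted
     * (for example because a virtio-GPU backend has stolen it), copy
     * the data into an ordinary RAM page and grant that instead. */
    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    extern int grant_page_to_backend(void *page);   /* made-up */
    extern void *alloc_io_page(void);               /* made-up */

    struct grant_or_bounce {
        void *granted;     /* page actually granted to the backend */
        void *bounce;      /* non-NULL if a bounce page is in use  */
        int   ref;         /* grant reference                      */
    };

    static int grant_for_io(struct grant_or_bounce *g, void *data_page)
    {
        g->ref = grant_page_to_backend(data_page);
        if (g->ref >= 0) {
            g->granted = data_page;
            g->bounce = NULL;
            return 0;
        }

        /* Grant failed: the page is not grantable (e.g. stolen), so
         * bounce it through a page that is known to be ordinary RAM. */
        g->bounce = alloc_io_page();
        if (!g->bounce)
            return -ENOMEM;
        memcpy(g->bounce, data_page, PAGE_SIZE);
        g->ref = grant_page_to_backend(g->bounce);
        if (g->ref < 0)
            return -EIO;
        g->granted = g->bounce;
        return 0;
    }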

What _might_ work is this sequence of operations:

1. The frontend asks backend userspace to give the pages back.
2. The backend converts the memory to pinned CPU memory, perhaps via
   mlock().
3. The backend returns the pages to the guest via Return (below).  This
   can be done without blocking GPU access because these pages are
   pinned to system RAM.
4. The frontend uses the pages for DMA/grant tables/etc as normal.

This requires a blocking cross-VM round-trip, though, so it won't be
fast.
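
Concretely, I would expect steps 2 and 3 on the backend side to look
something like the sketch below.  mlock() is real;
return_pages_to_guest() is a stand-in for the not-yet-existing Return
operation:

    /* Backend userspace sketch for steps 2-3 above. */
    #include <stddef.h>
    #include <sys/mman.h>

    extern int return_pages_to_guest(void *buf, size_t len);  /* made-up */

    static int pin_and_return(void *buf, size_t len)
    {
        /* Step 2: keep the SVM buffer resident in system RAM so the GPU
         * can continue to access it after the pages are handed back. */
        if (mlock(buf, len))
            return -1;

        /* Step 3: hand the pages back to the frontend (hypothetical). */
        return return_pages_to_guest(buf, len);
    }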

As an aside, emulated devices will work fine if the device model is in
the same domain as the backend and uses the same mappings as are passed
to the GPU driver.  In that case, access to the stolen pages will be
handled correctly.  It's only when there are multiple IOREQ servers
or PV drivers involved that sadness occurs.  It seems that shared
virtual memory is very unfriendly to disaggregated setups.

> 4. Return: Convert a backend-owned page to a frontend-owned page.  After
>    a successful call to Return, the backend is no longer able to use
>    Revoke or Map.  The returned page ceases to count against backend
>    quota and now counts against frontend quota.
> 
> Are these operations ones that Xen is interested in providing?  There
> may be other primitives that are sufficient to implement the above four,
> but I believe that any solution that allows virtio-GPU to work must
> allow the above four operations to be implemented.  Without the first
> two, virtio-GPU will not be able to support Vulkan or native contexts,
> and without the second two also being present, shared virtual memory
> and compute APIs that require it will not work.

In light of all of the above limitations, I think that there is an
important special case: if the GPU is an iGPU, then all SVM buffers
should be allocated from pinned CPU memory, which makes the problem go
away entirely.  Even for a discrete GPU, if access to the SVM data is
not at all performance-critical, it might still be possible to keep all
of the data in system memory.

The underlying limitation that causes all of the above problems is that
both DMA and Xen grant table operations require pinned memory, and there
is no way to pin memory from userspace.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
