Re: Xen memory management primitives for GPU virtualization

On Sun, Feb 02, 2025 at 12:08:46AM -0500, Demi Marie Obenour wrote:
> 
> Recently, AMD submitted patches to the dri-devel mailing list to support
> using application-provided buffers in virtio-GPU.  This feature is
> called Shared Virtual Memory (SVM) and it is implemented via an API
> called User Pointer (userptr).  This led to some discussion on
> dri-devel@xxxxxxxxxxxxxxxxxxxxx and dri-devel IRC, from which I
> concluded that Xen is missing critical primitives for GPU-accelerated
> graphics and compute.  The missing primitives for graphics are the ones
> discussed at Xen Project Summit 2024, but it turns out that additional
> primitives are needed for compute workloads.
> 
> As discussed at Xen Project Summit 2024, GPU acceleration via virtio-GPU
> requires that an IOREQ server have access to the following primitives:
> 
> 1. Map: Map a backend-provided buffer into the frontend.  The buffer
>    might point to system memory or to a PCIe BAR.  The frontend is _not_
>    allowed to use these buffers in hypercalls or grant them to other
>    domains.  Accessing the pages using hypercalls directed at the
>    frontend fails as if the frontend did not have the pages.

Do you really need to strictly enforce that accesses fail when the
pages are used as hypercall buffers?

Would it be fine to just get failures when the p2m entries are not
populated?  I assume the point is that accesses to those guest pages
from Xen should never go into the IOREQ?

>    The only
>    exception is that the frontend _may_ be allowed to use the buffer in
>    a Map operation, provided that Revoke (below) is transitive.

The fact that the mapped memory can be either RAM or MMIO makes it a
bit harder to handle any possible reference counting, as MMIO regions
don't have backing page_info structs, and hence cannot be reference
counted.  I think that might be hidden by the p2m handling, but it
needs to be checked for correctness.

Also, when mapping MMIO pages, will those maps respect the domain's
d->iomem_caps permission ranges (and hence require d->iomem_caps to be
adjusted for the mappings to succeed), or will they just ignore
d->iomem_caps?
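
For concreteness, a rough sketch of the shape such a Map operation
could take if exposed as a dm_op sub-op; every name and field below is
illustrative, not an existing Xen interface:

    /* Illustrative only: not an existing Xen ABI. */
    struct xen_dm_op_map_backend_buffer {
        domid_t frontend;       /* domain to map the buffer into */
        uint16_t pad;
        uint32_t nr_frames;     /* number of 4k frames to map */
        uint64_t backend_gfn;   /* buffer start in the backend (RAM or BAR) */
        uint64_t frontend_gfn;  /* where the frontend should see it */
        uint32_t flags;         /* e.g. read-only vs read-write */
    };

The resulting p2m entries would then need a type that the hypercall
buffer and grant paths refuse to operate on when issued by the
frontend, similar in spirit to how foreign mappings are handled today.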

> 
> 2. Revoke: Revoke access to a buffer provided by the backend.  Once
>    access is revoked, no operation on or in the frontend domain can
>    access or modify the pages, and the backend can safely reuse the
>    backing memory for other purposes.

It looks to me like revocation means removing the page from the p2m?

(and additionally adjusting d->iomem_caps if required to revoke domain
permission to map the page)
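
A rough per-page sketch of what that would boil down to, built from
existing Xen helpers (illustrative only, not a proposal for the actual
interface):

    /* Illustrative sketch of what a per-page revoke would amount to. */
    static int revoke_one(struct domain *fe, gfn_t gfn, mfn_t mfn, bool is_mmio)
    {
        /* Drop the frontend's p2m entry so further accesses fault. */
        int rc = guest_physmap_remove_page(fe, gfn, mfn, PAGE_ORDER_4K);

        if ( !rc && is_mmio )
            /* Also drop the frontend's permission to (re)map the BAR page. */
            rc = iomem_deny_access(fe, mfn_x(mfn), mfn_x(mfn));

        return rc;
    }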

>    Furthermore, revocation is not
>    allowed to fail unless the backend or hypervisor is buggy, and if it
>    does fail for any reason, the backend will panic.  Once access is
>    revoked, further accesses by the frontend will cause a fault that the
>    backend can intercept.

Such faults would translate into a new IOREQ type, maybe IOREQ_TYPE_FAULT.

I think that just having a rangeset on the IOREQ server to signal the
accesses that should trigger an IOREQ_TYPE_FAULT instead of an
IOREQ_TYPE_COPY should be enough?

The p2m type could be set as p2m_mmio_dm for those ranges.
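
A minimal sketch of that dispatch, assuming a hypothetical per-server
rangeset and a new IOREQ_TYPE_FAULT (rangeset_contains_singleton() and
IOREQ_TYPE_COPY are the existing pieces):

    /* Illustrative only: IOREQ_TYPE_FAULT and s->faulting_gfns don't exist yet. */
    static uint8_t ioreq_type_for(struct ioreq_server *s, paddr_t addr)
    {
        if ( rangeset_contains_singleton(s->faulting_gfns, PFN_DOWN(addr)) )
            return IOREQ_TYPE_FAULT;  /* revoked page: let the backend decide */

        return IOREQ_TYPE_COPY;       /* regular emulated MMIO access */
    }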

> 
> Map can be handled by userspace, but Revoke must be handled entirely
> in-kernel.  This is because Revoke happens from a Linux MMU notifier
> callback, and those are not allowed to block, fail, or involve userspace
> in any way.  Since MMU notifier callbacks are called before freeing
> memory, failure means that some other part of the system still has
> access to freed memory that might be reused for other purposes, which
> is a security vulnerability.

This "revoke" action would just be an hypercall, I think that would
satisfy your requirements?
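
For illustration, the Linux-side call site would look roughly like the
sketch below; the mmu_interval_notifier machinery is the existing
kernel one, while XENMEM_revoke_backend_mapping and its argument
struct are hypothetical names for whatever Xen ends up providing:

    /*
     * Illustrative sketch only: XENMEM_revoke_backend_mapping and
     * struct xen_revoke_backend_mapping are hypothetical, not existing ABI.
     */
    #include <linux/bug.h>
    #include <linux/kernel.h>
    #include <linux/mmu_notifier.h>
    #include <asm/xen/hypercall.h>
    #include <xen/interface/xen.h>

    struct xen_revoke_backend_mapping {
        domid_t frontend;
        uint16_t pad;
        uint32_t nr_frames;
        uint64_t first_gfn;
    };

    struct backend_range {
        struct mmu_interval_notifier mni;
        struct xen_revoke_backend_mapping op;   /* filled in at map time */
    };

    static bool backend_invalidate(struct mmu_interval_notifier *mni,
                                   const struct mmu_notifier_range *range,
                                   unsigned long cur_seq)
    {
        struct backend_range *br = container_of(mni, struct backend_range, mni);

        mmu_interval_set_seq(mni, cur_seq);

        /*
         * Runs before the pages are freed, in a context that may not block,
         * fail, or call into userspace.  If the revoke fails, freed memory
         * would still be reachable from the frontend, so panic.
         */
        BUG_ON(HYPERVISOR_memory_op(XENMEM_revoke_backend_mapping, &br->op));

        return true;
    }

    static const struct mmu_interval_notifier_ops backend_mni_ops = {
        .invalidate = backend_invalidate,
    };

The notifier would be registered over the userptr range with
mmu_interval_notifier_insert() when the backend maps it.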

> 
> It turns out that compute has additional requirements.  Graphics APIs
> use DMA buffers (dmabufs), which only support a subset of operations.
> In particular, direct I/O doesn't work.  Compute APIs allow users to
> make malloc'd memory accessible to the GPU.  This memory can be used
> in Linux kernel direct I/O and in other operations that do not work
> with dmabufs.  However, such memory starts out as frontend-owned pages,
> so it must be converted to backend pages before it can be used by the
> GPU.  Linux supports migration of userspace pages, but this is too
> unreliable to be used for this purpose.  Instead, it will need to be
> done by Xen and the backend.
> 
> This requires two additional primitives:
> 
> 3. Steal: Convert frontend-owned pages to backend-owned pages and
>    provide the backend with a mapping of the page.

What does "owned" exactly mean in this context?

What you describe above sounds very much like a foreign map, but I'm
not sure I fully understand the constraints below.

Does this "steal" operation make the pages inaccessible to the domain
running the frontend (i.e. the original owner of the memory)?

>    After a successful
>    Steal operation, the pages are in the same state as if they had been
>    provided via Map.  Steal fails if the pages are currently being used
>    in a hypercall, are MMIO (as opposed to system memory), were provided
>    by another domain via Map or grant tables, are currently foreign
>    mapped, are currently granted to another domain, or more generally
>    are accessible to any domain other than the target domain.

I think the above means that "stolen" pages must have the
"p2m_ram_rw" type in the frontend domain's p2m.  IOW: they must be
strictly RAM and owned by the domain running the frontend.
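
Roughly the kind of check I'd expect on the Xen side before a steal is
allowed (illustrative sketch only, not existing code):

    /* Illustrative only: sketch of the preconditions for "stealing" a page. */
    static bool page_stealable(const struct domain *fe,
                               const struct page_info *pg, p2m_type_t t)
    {
        /* Must be plain, writable RAM owned by the frontend... */
        if ( t != p2m_ram_rw || page_get_owner(pg) != fe )
            return false;

        /*
         * ... with no references beyond the allocation one: no foreign maps,
         * no grants, no in-flight hypercall buffer uses.
         */
        return (pg->count_info & (PGC_count_mask | PGC_allocated)) ==
               (1 | PGC_allocated);
    }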

>    The
>    frontend's quota is decreased by the number of pages stolen, and the
>    backend's quota is increased by the same amount.  A successful Steal
>    operation means that Revoke and Map can be used to operate on the
>    pages.

Hm, why do you need this quota adjustment?  Aren't the "stolen" pages
still owned by the domain running the frontend (have
page_info->v.inuse._domain == frontend domain)?

> 
> 4. Return: Convert a backend-owned page to a frontend-owned page.  After
>    a successful call to Return, the backend is no longer able to use
>    Revoke or Map.  The returned page ceases to count against backend
>    quota and now counts against frontend quota.
> 
> Are these operations ones that Xen is interested in providing?  There
> may be other primitives that are sufficient to implement the above four,
> but I believe that any solution that allows virtio-GPU to work must
> allow the above four operations to be implemented.  Without the first
> two, virtio-GPU will not be able to support Vulkan or native contexts,
> and without the second two also being present, shared virtual memory
> and compute APIs that require it will not work.

I'm sure Xen can arrange for what's required, but the Xen primitives
should be as simple as possible, offloading all possible logic to the
backend.

Thanks, Roger.


