Re: Xen memory management primitives for GPU virtualization

Demi Marie Obenour <demi@xxxxxxxxxxxxxxxxxxxxxx> · Thu, 6 Feb 2025 17:46:29 -0500

On Thu, Feb 06, 2025 at 06:43:19PM +0100, Roger Pau Monné wrote:
> On Sun, Feb 02, 2025 at 12:08:46AM -0500, Demi Marie Obenour wrote:
> > Recently, AMD submitted patches to the dri-devel mailing list to support
> > using application-provided buffers in virtio-GPU.  This feature is
> > called Shared Virtual Memory (SVM) and it is implemented via an API
> > called User Pointer (userptr).  This led to some discussion on
> > dri-devel@xxxxxxxxxxxxxxxxxxxxx and dri-devel IRC, from which I
> > concluded that Xen is missing critical primitives for GPU-accelerated
> > graphics and compute.  The missing primitives for graphics are the ones
> > discussed at Xen Project Summit 2024, but it turns out that additional
> > primitives are needed for compute workloads.
> > 
> > As discussed at Xen Project Summit 2024, GPU acceleration via virtio-GPU
> > requires that an IOREQ server have access to the following primitives:
> > 
> > 1. Map: Map a backend-provided buffer into the frontend.  The buffer
> >    might point to system memory or to a PCIe BAR.  The frontend is _not_
> >    allowed to use these buffers in hypercalls or grant them to other
> >    domains.  Accessing the pages using hypercalls directed at the
> >    frontend fails as if the frontend did not have the pages.
> 
> Do you really need to strictly enforce failure of access when used as
> hypercall buffers?
> 
> Would it be fine to just get failures when the p2m entries are not
> populated?  I assume the point is that accesses to those guest pages
> from Xen should never go into the IOREQ?

These pages might point to PCIe BAR memory.  I'm not sure if that is
allowed to be used in hypercalls, and what the security consequences are
if it is allowed.  Also, allowing the guest to determine if the pages
are mapped in the p2m violates encapsulation and risks pointers to the
pages winding up in places they must not.

> >    The only
> >    exception is that the frontend _may_ be allowed to use the buffer in
> >    a Map operation, provided that Revoke (below) is transitive.
> 
> The fact that the mapped memory can either be RAM or MMIO makes it a
> bit harder to handle any possible reference counting, as MMIO regions
> don't have backing page_info structs, and hence no reference counting.
> I think that might be hidden by the p2m handling, but needs to be
> checked to be correct.

These pages must never have p2m type p2m_ram_rw, as that would allow
them to be foreign-mapped or used in grant table operations.  I also
believe this feature should be x86-only to begin with, as it is clear
that parts of the code (like get_paged_frame()) are not ready for other
architectures.  On x86, p2m_mmio_direct seems to be the appropriate type
when the pages are accessible via CPU instructions, and p2m_mmio_dm
when they are not.  Will using p2m_mmio_direct for pages backed by
system RAM cause problems?
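
To make the question concrete, here is a minimal sketch in C of the rule
described above.  The enum and both functions are hypothetical stand-ins
made up for illustration; they are not Xen's real p2m_type_t or any
existing Xen function, and the real logic may well need to differ.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the x86 p2m types named in this thread. */
enum sketch_p2m_type {
    SKETCH_P2M_RAM_RW,      /* ordinary guest RAM */
    SKETCH_P2M_MMIO_DIRECT, /* CPU accesses reach the backing frame */
    SKETCH_P2M_MMIO_DM,     /* CPU accesses are forwarded to the backend */
};

/*
 * Pick the type for a backend-provided page.  The type depends only on
 * whether the frontend may touch the page with CPU instructions, not on
 * whether the backing frame is system RAM or a PCIe BAR.
 */
enum sketch_p2m_type sketch_type_for_mapped_page(bool cpu_accessible)
{
    return cpu_accessible ? SKETCH_P2M_MMIO_DIRECT : SKETCH_P2M_MMIO_DM;
}

/*
 * Assumption of this sketch: only ordinary RAM may be foreign-mapped or
 * used in grant table operations, so Map'd pages are always refused.
 */
bool sketch_may_grant_or_foreign_map(enum sketch_p2m_type type)
{
    return type == SKETCH_P2M_RAM_RW;
}
```

The point of the sketch is that sketch_type_for_mapped_page() never
returns the RAM type, so a Map'd page can never pass the grant or
foreign-map check.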

> Also when mapping MMIO pages, will those maps respect the domain
> d->iomem_caps permission ranges, and then require modifications for
> the mappings to succeed, or just ignore d->iomem_caps?

They should respect the backend's permissions and ignore those of the
frontend.  This might require a global lock (or cleverness) to avoid
lock order inversions.

> > 2. Revoke: Revoke access to a buffer provided by the backend.  Once
> >    access is revoked, no operation on or in the frontend domain can
> >    access or modify the pages, and the backend can safely reuse the
> >    backing memory for other purposes.
> 
> It looks to me that revocation means removing the page from the p2m?

This is one part.

> (and additionally adjusting d->iomem_caps if required to revoke domain
> permission to map the page)

The frontend domain should never have permission to map the pages.

> >    Furthermore, revocation is not
> >    allowed to fail unless the backend or hypervisor is buggy, and if it
> >    does fail for any reason, the backend will panic.  Once access is
> >    revoked, further accesses by the frontend will cause a fault that the
> >    backend can intercept.
> 
> Such faults would translate into a new IOREQ type, maybe IOREQ_TYPE_FAULT.
> 
> I think that just having a rangeset on the ioreq to signal the
> accesses that should trigger an IOREQ_TYPE_FAULT instead of an
> IOREQ_TYPE_COPY should be enough?

I think so.  Better documentation would be very helpful.

> The p2m type could be set as p2m_mmio_dm for those ranges.

That seems correct.
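
As a rough illustration of how I read the rangeset idea, here is a
sketch in C.  Everything in it is hypothetical: the struct layout and
function name are invented, a plain array stands in for Xen's rangeset,
and IOREQ_TYPE_FAULT is only the name proposed above.

```c
#include <stddef.h>
#include <stdint.h>

enum sketch_ioreq_type {
    SKETCH_IOREQ_TYPE_COPY,  /* ordinary emulated access */
    SKETCH_IOREQ_TYPE_FAULT, /* type proposed in this thread */
};

/* One inclusive range of guest frame numbers. */
struct sketch_gfn_range {
    uint64_t first;
    uint64_t last;
};

/* Stand-in for a per-IOREQ-server rangeset of revoked pages. */
struct sketch_fault_ranges {
    const struct sketch_gfn_range *ranges;
    size_t count;
};

/*
 * Classify an access that hit a p2m_mmio_dm entry: report it as a fault
 * if the gfn lies in a registered range, otherwise as a normal copy.
 */
enum sketch_ioreq_type
sketch_classify_access(const struct sketch_fault_ranges *rs, uint64_t gfn)
{
    for (size_t i = 0; i < rs->count; i++) {
        if (gfn >= rs->ranges[i].first && gfn <= rs->ranges[i].last)
            return SKETCH_IOREQ_TYPE_FAULT;
    }
    return SKETCH_IOREQ_TYPE_COPY;
}
```

A linear scan is used only to keep the sketch short; a real
implementation would presumably reuse the existing rangeset code.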

> > Map can be handled by userspace, but Revoke must be handled entirely
> > in-kernel.  This is because Revoke happens from a Linux MMU notifier
> > callback, and those are not allowed to block, fail, or involve userspace
> > in any way.  Since MMU notifier callbacks are called before freeing
> > memory, failure means that some other part of the system still has
> > access to freed memory that might be reused for other purposes, which
> > is a security vulnerability.
> 
> This "revoke" action would just be a hypercall, I think that would
> satisfy your requirements?

It would, provided that (unless misused) it always succeeds promptly and
is security-supported (including DoS) with partially trusted callers.  I
don't care about DoS but AMD's automotive customers definitely do.
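
For clarity, this is roughly the contract the backend relies on, as a
userspace model in C.  The function names and the hypercall signature
are hypothetical, abort() stands in for a kernel panic, and the stub
only counts pages so the model can be exercised.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical Revoke hypercall wrapper: returns 0 on success. */
typedef int (*sketch_revoke_fn)(uint16_t frontend_domid,
                                uint64_t first_gfn, uint64_t nr_pages);

/* Pages revoked so far by the stub below. */
uint64_t sketch_pages_revoked;

/* Stub standing in for the hypercall; it always succeeds. */
int sketch_revoke_stub(uint16_t frontend_domid,
                       uint64_t first_gfn, uint64_t nr_pages)
{
    (void)frontend_domid;
    (void)first_gfn;
    sketch_pages_revoked += nr_pages;
    return 0;
}

/*
 * Model of the invalidation path.  The callback cannot block, retry via
 * userspace, or report an error, so any failure is treated as fatal:
 * continuing would leave the frontend with access to freed memory.
 */
void sketch_invalidate_range(sketch_revoke_fn revoke, uint16_t domid,
                             uint64_t first_gfn, uint64_t nr_pages)
{
    int rc = revoke(domid, first_gfn, nr_pages);

    if (rc != 0) {
        fprintf(stderr, "revoke failed (rc=%d), refusing to continue\n", rc);
        abort(); /* stands in for panic() in a real kernel */
    }
}
```

Note that there is no error return from sketch_invalidate_range(); that
is the property the hypercall has to make safe.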

> > It turns out that compute has additional requirements.  Graphics APIs
> > use DMA buffers (dmabufs), which only support a subset of operations.
> > In particular, direct I/O doesn't work.  Compute APIs allow users to
> > make malloc'd memory accessible to the GPU.  This memory can be used
> > in Linux kernel direct I/O and in other operations that do not work
> > with dmabufs.  However, such memory starts out as frontend-owned pages,
> > so it must be converted to backend pages before it can be used by the
> > GPU.  Linux supports migration of userspace pages, but this is too
> > unreliable to be used for this purpose.  Instead, it will need to be
> > done by Xen and the backend.
> > 
> > This requires two additional primitives:
> > 
> > 3. Steal: Convert frontend-owned pages to backend-owned pages and
> >    provide the backend with a mapping of the page.
> 
> What does "owned" exactly mean in this context?
> 
> What you describe above sounds very much like a foreign map, but I'm
> not sure I fully understand the constraints below.
> 
> Does this "steal" operation make the pages inaccessible by the domain
> running the frontend (so the original owner of the memory)?

It's a foreign map for which the backend can deny the frontend access to
its own pages if it so chooses.  It was needed for a feature (Shared
Virtual Memory) that Qubes OS doesn't need.

> >    After a successful
> >    Steal operation, the pages are in the same state as if they had been
> >    provided via Map.  Steal fails if the pages are currently being used
> >    in a hypercall, are MMIO (as opposed to system memory), were provided
> >    by another domain via Map or grant tables, are currently foreign
> >    mapped, are currently granted to another domain, or more generally
> >    are accessible to any domain other than the target domain.
> 
> I think the above means that "stolen" pages must have the
> "p2m_ram_rw" type in the frontend domain p2m.   IOW: must be strictly
> RAM and owned by the domain running the frontend.

I'm not sure what this looks like from the Xen PoV, but I'm no longer
interested in seeing page steal and return implemented.

> >    The
> >    frontend's quota is decreased by the number of pages stolen, and the
> >    backend's quota is increased by the same amount.  A successful Steal
> >    operation means that Revoke and Map can be used to operate on the
> >    pages.
> 
> Hm, why do you need this quota adjustment?  Aren't the "stolen" pages
> still owned by the domain running the frontend (have
> page_info->v.inuse._domain == frontend domain)?

Nope, they were meant to be owned by the backend, but I'm no longer
interested in this feature.

> > 4. Return: Convert a backend-owned page to a frontend-owned page.  After
> >    a successful call to Return, the backend is no longer able to use
> >    Revoke or Map.  The returned page ceases to count against backend
> >    quota and now counts against frontend quota.
> > 
> > Are these operations ones that Xen is interested in providing?  There
> > may be other primitives that are sufficient to implement the above four,
> > but I believe that any solution that allows virtio-GPU to work must
> > allow the above four operations to be implemented.  Without the first
> > two, virtio-GPU will not be able to support Vulkan or native contexts,
> > and without the second two also being present, shared virtual memory
> > and compute APIs that require it will not work.
> 
> I'm sure Xen can arrange for what's required, but the Xen primitives
> should be as simple as possible, offloading all possible logic to the
> backend.

Makes sense, though I don't want to make the backend 10x more complex to
make the Xen code slightly simpler, as the backend in Qubes OS will
often be dom0.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab