On Sun, Feb 02, 2025 at 12:08:46AM -0500, Demi Marie Obenour wrote:
> Recently, AMD submitted patches to the dri-devel mailing list to support
> using application-provided buffers in virtio-GPU. This feature is
> called Shared Virtual Memory (SVM) and it is implemented via an API
> called User Pointer (userptr). This led to some discussion on
> dri-devel@xxxxxxxxxxxxxxxxxxxxx and dri-devel IRC, from which I
> concluded that Xen is missing critical primitives for GPU-accelerated
> graphics and compute. The missing primitives for graphics are the ones
> discussed at Xen Project Summit 2024, but it turns out that additional
> primitives are needed for compute workloads.
>
> As discussed at Xen Project Summit 2024, GPU acceleration via virtio-GPU
> requires that an IOREQ server have access to the following primitives:
>
> 1. Map: Map a backend-provided buffer into the frontend. The buffer
>    might point to system memory or to a PCIe BAR. The frontend is _not_
>    allowed to use these buffers in hypercalls or grant them to other
>    domains. Accessing the pages using hypercalls directed at the
>    frontend fails as if the frontend did not have the pages. The only
>    exception is that the frontend _may_ be allowed to use the buffer in
>    a Map operation, provided that Revoke (below) is transitive.

Further note: if the frontend has an assigned PCI device, I believe that
pages provided by Map _should not_ be included in the device's IOMMU
mappings. This avoids needing a synchronous IOTLB flush when Revoke is
performed, which would be slow. Delaying the flush would allow the
frontend to DMA into freed backend memory and so would be a security
vulnerability. Furthermore, such entries would not be usable by the
frontend in any way, as the frontend cannot perform DMA to them without
racing against a concurrent call to Revoke made by the backend.

> 2. Revoke: Revoke access to a buffer provided by the backend. Once
>    access is revoked, no operation on or in the frontend domain can
>    access or modify the pages, and the backend can safely reuse the
>    backing memory for other purposes. Furthermore, revocation is not
>    allowed to fail unless the backend or hypervisor is buggy, and if it
>    does fail for any reason, the backend will panic. Once access is
>    revoked, further accesses by the frontend will cause a fault that the
>    backend can intercept.

How should this interact with emulated I/O devices? If the emulated I/O
device uses mmap() on the dmabuf to access the pages, things will work
fine, but if it uses foreign mapping operations, things will fail rather
miserably.

I think it is okay for this to fail: DMA to pages provided by Map is
always a guest bug, and that includes DMA by an emulated device. Guest
userspace is prepared for such errors, because dmabufs on bare silicon
work the same way.
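For concreteness, the mmap() path I have in mind for a device model looks
roughly like the sketch below. The fd and size are placeholders and the
exporter must support mmap at all; the DMA_BUF_IOCTL_SYNC bracketing is
the standard CPU-access protocol from <linux/dma-buf.h>.

/*
 * Rough sketch only: a device model reading guest-visible data through
 * a dma-buf mmap() instead of a Xen foreign mapping. The fd and size
 * are placeholders; the exporter must support mmap for this to work.
 */
#include <linux/dma-buf.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

static int read_dmabuf(int dmabuf_fd, size_t size, void *out)
{
    struct dma_buf_sync sync = {
        .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
    };
    void *map;

    map = mmap(NULL, size, PROT_READ, MAP_SHARED, dmabuf_fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* Bracket CPU access so the exporter can flush/invalidate caches. */
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

    memcpy(out, map, size);

    sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

    munmap(map, size);
    return 0;
}

The assumption here is that the device model has (or can be passed) the
dmabuf fd for the backend buffer, so it never needs a foreign mapping of
the corresponding frontend gfns.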
> 3. Steal: Convert frontend-owned pages to backend-owned pages and
>    provide the backend with a mapping of the pages. After a successful
>    Steal operation, the pages are in the same state as if they had been
>    provided via Map. Steal fails if the pages are currently being used
>    in a hypercall, are MMIO (as opposed to system memory), were provided
>    by another domain via Map or grant tables, are currently foreign
>    mapped, are currently granted to another domain, or more generally
>    are accessible to any domain other than the target domain. The
>    frontend's quota is decreased by the number of pages stolen, and the
>    backend's quota is increased by the same amount. A successful Steal
>    operation means that Revoke and Map can be used to operate on the
>    pages.

How should this work if the frontend has an assigned PCI device? Xen can
unmap the pages from the frontend's IOMMU mappings, but the frontend
might continue to try to perform DMA to these pages. This would result
in runtime misbehavior or data corruption.

PV devices could be a significant problem too, because Steal is
incompatible with grant tables for security reasons: if Steal were
allowed on granted pages, a malicious guest could grant a page to domain
A (which expects it to be ordinary RAM) and then ask domain B to steal
it. Domain B steals the page, revokes it (for page migration purposes,
say), and reuses the backing storage. Now domain A has a mapping of
freed domain B memory. If Steal instead revoked domain A's mapping too,
domain B could block domain A forever. This means that unless domain B
is privileged over domain A (and domain A has no assigned PCI devices),
the Steal operation must fail.

Matthew: Do you know if Linux supports marking anonymous memory as "does
not support DMA/pin_user_pages()"? If not, how hard would it be to
implement this? Without this, all I/O by a guest using virtio-GPU shared
virtual memory would need to be bounce-buffered, which isn't great for
performance. This includes I/O using paravirtualized devices, unless the
PV driver can handle the grant operation failing and fall back to a
bounce buffer. It would be much better if only I/O from stolen pages
needed to be bounced.

What _might_ work is this sequence of operations:

1. The frontend asks backend userspace to give the pages back.
2. The backend converts the memory to pinned CPU memory, perhaps via
   mlock().
3. The backend returns the pages to the guest via Return (below). This
   can be done without blocking GPU access because these pages are
   pinned to system RAM.
4. The frontend uses the pages for DMA/grant tables/etc as normal.

This requires a blocking cross-VM round-trip, though, so it won't be
fast.

As an aside, emulated devices will work fine if the device model is in
the same domain as the backend and uses the same mappings as are passed
to the GPU driver. In that case, access to the stolen pages will be
handled correctly. It's only when there are multiple IOREQ servers or PV
drivers involved that sadness occurs. It seems that shared virtual
memory is very unfriendly to disaggregated setups.

> 4. Return: Convert a backend-owned page to a frontend-owned page. After
>    a successful call to Return, the backend is no longer able to use
>    Revoke or Map. The returned page ceases to count against backend
>    quota and now counts against frontend quota.
>
> Are these operations ones that Xen is interested in providing? There
> may be other primitives that are sufficient to implement the above four,
> but I believe that any solution that allows virtio-GPU to work must
> allow the above four operations to be implemented. Without the first
> two, virtio-GPU will not be able to support Vulkan or native contexts,
> and without the second two also being present, shared virtual memory
> and compute APIs that require it will not work.
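To make the request concrete, here is a purely hypothetical sketch of
how the four primitives could be expressed as a single backend-issued
operation. None of these type or constant names exist in Xen; the exact
shape does not matter, only the semantics described above.

/*
 * Hypothetical sketch only: none of these names exist in Xen. It simply
 * restates Map/Revoke/Steal/Return as one operation that a virtio-GPU
 * backend (IOREQ server) would issue against its frontend domain.
 */
#include <stdint.h>

enum gpu_mem_op {
    GPU_MEM_OP_map,     /* expose a backend buffer (RAM or BAR) to the frontend */
    GPU_MEM_OP_revoke,  /* pull a mapped or stolen buffer back; must not fail */
    GPU_MEM_OP_steal,   /* convert frontend RAM to backend-owned; quota moves too */
    GPU_MEM_OP_return,  /* give a stolen page back; backend loses Map/Revoke rights */
};

struct gpu_mem_request {
    uint16_t op;              /* enum gpu_mem_op */
    uint16_t flags;
    uint32_t frontend_domid;  /* domain in whose address space the buffer appears */
    uint64_t frontend_gfn;    /* first guest frame number in the frontend */
    uint64_t backend_handle;  /* backend buffer: gfn, dmabuf, or BAR offset */
    uint64_t nr_frames;       /* length of the contiguous run */
};

Whatever the final shape, the properties that matter are the ones listed
above: Revoke must be infallible, Steal must refuse pages that are
granted, foreign mapped, or otherwise in use, and Steal/Return must move
the pages between the two domains' quotas.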
In light of all of the above limitations, I think that there is an
important special case: if the GPU is an iGPU, then all SVM buffers
should be allocated from pinned CPU memory, which makes the problem go
away entirely. If access to SVM data is not at all performance critical,
then it might still be possible to keep all the data in system memory.

The underlying limitation that causes all of the above problems is that
both DMA and Xen grant table operations require pinned memory, and there
is no way to do that from userspace.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab