On Mon, May 13, 2024 at 1:09 PM Christian König <christian.koenig@xxxxxxx> wrote:Am 10.05.24 um 18:34 schrieb Zack Rusin:Hey, so this is a bit of a silly problem but I'd still like to solve it properly. The tldr is that virtualized drivers abuse drm_driver::gem_prime_import_sg_table (at least vmwgfx and xen do, virtgpu and xen punt on it) because there doesn't seem to be a universally supported way of converting the sg_table back to a list of pages without some form of gart to do it.Well the whole point is that you should never touch the pages in the sg_table in the first place. The long term plan is actually to completely remove the pages from that interface.First let me clarify that I'm not arguing for access to those pages. What I'd like to figure out are precise semantics for all of this prime import/map business on drivers that don't have some dedicated hardware to turn dma_addr_t array into something readable. If the general consensus is that those devices are broken, so be it.
Well that stuff is actually surprisingly well documented, see here: https://docs.kernel.org/driver-api/dma-buf.html#cpu-access-to-dma-buffer-objects
It's just that the documentation is written from the perspective of the exporter and userspace, so it's probably not that easy to understand what you should do as an importer.
Maybe we should add a sentence or two to clarify this.
drm_prime_sg_to_page_array is deprecated (for all the right reasons on actual hardware) but in our cooky virtualized world we don't have gart's so what are we supposed to do with the dma_addr_t from the imported sg_table? What makes it worse (and definitely breaks xen) is that with CONFIG_DMABUF_DEBUG the sg page_link is mangled via mangle_sg_table so drm_prime_sg_to_page_array won't even work.XEN and KVM were actually adjusted to not touch the struct pages any more. I'm not sure if that work is already upstream or not but I had to explain it over and over again why their approach doesn't work.I'd love to see those patches. Upstream xen definitely still uses it: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/xen/xen_drm_front_gem.c#n263 which looks pretty broken to me, especially with CONFIG_DMABUF_DEBUG because drm_gem_prime_import will call dma_buf_map_attachment_unlocked: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_prime.c#n940 which will call __map_dma_buf https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/dma-buf.c#n1131 which will mangle the sg's page_list before calling xen's gem_prime_import_sg_table. Which means the drm_prime_sg_to_page_array that's used in xen's gem_prime_import_sg_table is silently generating broken pages and the entire thing should just be a kernel oops (btw, it'd probably be a good idea to not have drm_prime_sg_to_page_array generate garbage with CONFIG_DMABUF_DEBUG and print some kind of a warning).
I honestly didn't followed the discussion to the end, but both Sima and me pointed that out to the XEN people and there were quite a bit of back and forth how to fix it.
Let me try to dig that up.
The reason why I'm saying it's a bit of a silly problem is that afaik currently it only affects IGT testing with vgem (because the rest of external gem objects will be from the virtualized gpu itself which is different). But do you have any ideas on what we'd like to do with this long term? i.e. we have a virtualized gpus without iommu, we have sg_table with some memory and we'd like to import it. Do we just assume that the sg_table on those configs will always reference cpu accessible memory (i.e. if it's external it only comes through drm_gem_shmem_object) and just do some horrific abomination like: for (i = 0; i < bo->ttm->num_pages; ++i) { phys_addr_t pa = dma_to_phys(vmw->drm.dev, bo->ttm->dma_address[i]); pages[i] = pfn_to_page(PHYS_PFN(pa)); } or add a "i know this is cpu accessible, please demangle" flag to drm_prime_sg_to_page_array or try to have some kind of more permanent solution?Well there is no solution for that. Accessing the underlying struct page through the sg_table is illegal in the first place. So the question is not how to access the struct page, but rather why do you want to do this?Rob mentioned one usecase, but to be honest, as I mentioned in the beginning I'd like to have a semantic clarity to the general problem of going from dma_addr_t to something readable on non iomem resources, e.g. get the IGT vgem<->vmwgfx tests running, i.e.: vgem_handle = dumb_buffer_create(vgem_fd, ....); dma_buf_fd = prime_handle_to_fd(vgem_fd, vgem_handle); vmw_handle = prime_fd_to_handle(vmw_fd, dma_buf_fd); void *ptr = vmw_map_bo(vmw_fd, vmw_handle, ...); <- crash trying to map that bo will crash because we'll endup in ttm_bo_vm_fault_reserved which will check whether bo->resource->bus.is_iomem, which it won't be because every vmwgfx buffer is just system memory and it will try to access the ttm pages which don't exist and crash: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/ttm/ttm_bo_vm.c#n249
Going through TTM for the VM fault is completely illegal to begin with. What you do instead is to re-route the mmap() request to the exporter, see how dma_buf_mmap() is being used by drm_gem_shmem_mmap() for an example.
It was even discussed to do this generally in the GEM layer, but IIRC it was rejected because driver stacks should avoid doing this and use the original exporter for the mmap() instead when possible.
Sima and Thomas probably knew this better than me, but I think the problem might be that VMWGFX didn't used that GEM layer until recently and so most likely never enforced some of the restrictions there.
On our hypervisors that are less than 8 years old all of vmwgfx buffers are always system memory, and when we get an external buffer from vgem, which is also system memory we have no currently supported way of populating the ttm::pages to be able to map/read it. It's fine if the general consensus is "that's crazy, we can't fault on external buffers pages" and "without some gart like thing in your device we can not make prime work" but I do want to have some clarity on how/whether this is supposed to work.
Short summary is that you redirect your mmap() call to the dma_buf_mmap() instead of handling it yourself.
Or to put it another way: imagine two different cpu vgem like drivers, how does one driver import the sg_table from another and convert it to something userspace readable?
Well you don't touch the sg_table at all in that case.
If you need an userspace mapping you use dma_buf_mmap(), if you need a kernel mapping you use the vmap interface.
If you are a hypervisor like XEN, KVM and VMW you basically don't touch any of that and just mirror parts of an address space between host and guest through an MMU notifier (include faults and invalidation).
See the recent (~2 month old) discussion between Christoph Hellwig, the KVM people and me on the some mailing lists.
Regards,
Christian.
z