Re: dma-buf sg mangling

Zack Rusin <zack.rusin@xxxxxxxxxxxx> · Tue, 14 May 2024 00:15:10 -0400

On Mon, May 13, 2024 at 1:09 PM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> Am 10.05.24 um 18:34 schrieb Zack Rusin:
> > Hey,
> >
> > so this is a bit of a silly problem but I'd still like to solve it
> > properly. The tldr is that virtualized drivers abuse
> > drm_driver::gem_prime_import_sg_table (at least vmwgfx and xen do,
> > virtgpu and xen punt on it) because there doesn't seem to be a
> > universally supported way of converting the sg_table back to a list of
> > pages without some form of gart to do it.
>
> Well the whole point is that you should never touch the pages in the
> sg_table in the first place.
>
> The long term plan is actually to completely remove the pages from that
> interface.

First let me clarify that I'm not arguing for access to those pages.
What I'd like to figure out are precise semantics for all of this
prime import/map business on drivers that don't have some dedicated
hardware to turn dma_addr_t array into something readable. If the
general consensus is that those devices are broken, so be it.

> > drm_prime_sg_to_page_array is deprecated (for all the right reasons on
> > actual hardware) but in our cooky virtualized world we don't have
> > gart's so what are we supposed to do with the dma_addr_t from the
> > imported sg_table? What makes it worse (and definitely breaks xen) is
> > that with CONFIG_DMABUF_DEBUG the sg page_link is mangled via
> > mangle_sg_table so drm_prime_sg_to_page_array won't even work.
>
> XEN and KVM were actually adjusted to not touch the struct pages any more.
>
> I'm not sure if that work is already upstream or not but I had to
> explain it over and over again why their approach doesn't work.

I'd love to see those patches. Upstream xen definitely still uses it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/xen/xen_drm_front_gem.c#n263
which looks pretty broken to me, especially with CONFIG_DMABUF_DEBUG
because drm_gem_prime_import
will call dma_buf_map_attachment_unlocked:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/drm_prime.c#n940
which will call __map_dma_buf
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/dma-buf.c#n1131
which will mangle the sg's page_list before calling xen's
gem_prime_import_sg_table. Which means the drm_prime_sg_to_page_array
that's used in xen's gem_prime_import_sg_table is silently generating
broken pages and the entire thing should just be a kernel oops (btw,
it'd probably be a good idea to not have drm_prime_sg_to_page_array
generate garbage with CONFIG_DMABUF_DEBUG and print some kind of a
warning).

> > The reason why I'm saying it's a bit of a silly problem is that afaik
> > currently it only affects IGT testing with vgem (because the rest of
> > external gem objects will be from the virtualized gpu itself which is
> > different). But do you have any ideas on what we'd like to do with
> > this long term? i.e. we have a virtualized gpus without iommu, we have
> > sg_table with some memory and we'd like to import it. Do we just
> > assume that the sg_table on those configs will always reference cpu
> > accessible memory (i.e. if it's external it only comes through
> > drm_gem_shmem_object) and just do some horrific abomination like:
> > for (i = 0; i < bo->ttm->num_pages; ++i) {
> >      phys_addr_t pa = dma_to_phys(vmw->drm.dev, bo->ttm->dma_address[i]);
> >      pages[i] = pfn_to_page(PHYS_PFN(pa));
> > }
> > or add a "i know this is cpu accessible, please demangle" flag to
> > drm_prime_sg_to_page_array or try to have some kind of more permanent
> > solution?
>
> Well there is no solution for that. Accessing the underlying struct page
> through the sg_table is illegal in the first place.
>
> So the question is not how to access the struct page, but rather why do
> you want to do this?

Rob mentioned one usecase, but to be honest, as I mentioned in the
beginning I'd like to have a semantic clarity to the general problem
of going from dma_addr_t to something readable on non iomem resources,
e.g. get the IGT vgem<->vmwgfx tests running, i.e.:
vgem_handle = dumb_buffer_create(vgem_fd, ....);
dma_buf_fd = prime_handle_to_fd(vgem_fd, vgem_handle);
vmw_handle = prime_fd_to_handle(vmw_fd, dma_buf_fd);
void *ptr = vmw_map_bo(vmw_fd, vmw_handle, ...); <- crash

trying to map that bo will crash because we'll endup in
ttm_bo_vm_fault_reserved which will check whether
bo->resource->bus.is_iomem, which it won't be because every vmwgfx
buffer is just system memory and it will try to access the ttm pages
which don't exist and crash:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/ttm/ttm_bo_vm.c#n249

On our hypervisors that are less than 8 years old all of vmwgfx
buffers are always system memory, and when we get an external buffer
from vgem, which is also system memory we have no currently supported
way of populating the ttm::pages to be able to map/read it.

It's fine if the general consensus is "that's crazy, we can't fault on
external buffers pages" and "without some gart like thing in your
device we can not make prime work" but I do want to have some clarity
on how/whether this is supposed to work.

Or to put it another way: imagine two different cpu vgem like drivers,
how does one driver import the sg_table from another and convert it to
something userspace readable?

z