Re: Try to address the DMA-buf coherency problem

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Fri, 4 Nov 2022 10:03:14 +0100

On 03.11.22 at 23:16, Nicolas Dufresne wrote:
[SNIP]
>> We already had numerous projects where we reported this practice as bugs
>> to the GStreamer and FFmpeg projects because it won't work on x86 with dGPUs.
>
> Links? Remember that I read every single bug report and email around the
> GStreamer project. I maintain both the older and the newer V4L2 support in
> there. I also contributed a lot to the mechanism GStreamer has in place to
> reverse the allocation. In fact, it's implemented; the problem is that on
> generic Linux the receiver elements, like the GL element and the display
> sink, don't have any API they can rely on to allocate memory. Thus, they
> don't implement what we call the allocation offer in GStreamer terms. Very
> often though, on other modern OSes or in APIs like VA, the memory offer is
> replaced by a context. So the allocation is done from a "context" which is
> neither an importer nor an exporter. This is mostly found on macOS and
> Windows.
>
> Were any APIs suggested to actually make it manageable for userland to
> allocate from the GPU? Yes, this is what the Linux Device Allocator idea is
> for. Is that API ready? No.

Well, that stuff is absolutely ready: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_heap.c#L175 
What do you think I'm talking about all the time?

DMA-buf has a lengthy section about CPU access to buffers and clearly 
documents how all of that is supposed to work: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L1160 
This includes bracketing of CPU access with dma_buf_begin_cpu_access() 
and dma_buf_end_cpu_access(), as well as transaction management between 
devices and the CPU and even implicit synchronization.
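
From userspace, the same bracketing is available for an mmap()ed DMA-buf 
through DMA_BUF_IOCTL_SYNC, which ends up in dma_buf_begin_cpu_access() 
and dma_buf_end_cpu_access() in the kernel. A minimal sketch, assuming 
dmabuf_fd is a DMA-buf file descriptor and map is its CPU mapping; the 
helper names dmabuf_sync() and dmabuf_cpu_fill() are made up for this 
illustration and are not part of any existing library:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Hypothetical helper: issue one DMA_BUF_IOCTL_SYNC call and retry if the
 * ioctl was interrupted. Returns 0 on success, -1 with errno set on error.
 */
static int dmabuf_sync(int dmabuf_fd, uint64_t flags)
{
	struct dma_buf_sync sync = { .flags = flags };
	int ret;

	do {
		ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

	return ret;
}

/*
 * Hypothetical helper: fill an already mmap()ed DMA-buf from the CPU.
 * The write is bracketed by SYNC_START and SYNC_END so the exporter can
 * do whatever cache maintenance it needs before and after CPU access.
 */
static int dmabuf_cpu_fill(int dmabuf_fd, void *map, size_t len, int value)
{
	assert(map != NULL);

	if (dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE) == -1)
		return -1;

	memset(map, value, len);

	return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE);
}
```

A reader would use DMA_BUF_SYNC_READ, or DMA_BUF_SYNC_RW for both 
directions, with the same START/END pairing.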

This specification is then implemented by the different drivers 
including V4L2: 
https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-dma-sg.c#L473

As well as the different DRM drivers: 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L117 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L234

This design was then used by us with various media players on different 
customer projects, including QNAP https://www.qnap.com/en/product/ts-877 
as well as the newest Tesla 
https://www.amd.com/en/products/embedded-automotive-solutions

I won't go into the details here, but we are using exactly the approach 
I've outlined to let userspace control the DMA between the different 
devices in question. I'm one of the main designers of that and our 
multimedia and Mesa teams have up-streamed quite a number of changes for 
this project.

I'm not that familiar with the different ARM-based solutions because we 
are only just getting results showing that this starts to work with AMD 
GPUs, but I'm pretty sure that the design should be able to handle that 
as well.

So we have clearly proven that this design works, even with special 
requirements which are way more complex than what we are discussing 
here. We had cases where we used GStreamer to feed DMA-buf handles into 
multiple devices with different format requirements and that seems to 
work fine.

-----

But enough of this rant. As I wrote to Lucas as well, this doesn't help 
us any further in the technical discussion.

The only technical argument I have is that if some userspace 
applications fail to use the provided UAPI while others use it correctly 
then this is clearly not a good reason to change the UAPI, but rather an 
argument to change the applications.

If the application should be kept simple and device independent then 
allocating the buffer from the device independent DMA heaps would be 
enough as well, because that provider implements the necessary handling 
for dma_buf_begin_cpu_access() and dma_buf_end_cpu_access().
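
As a rough illustration of that path, the sketch below allocates a buffer 
from a DMA heap through DMA_HEAP_IOCTL_ALLOC from <linux/dma-heap.h>. The 
function name dma_heap_alloc_buf() is made up for this example, and the 
heap path is an assumption: the system heap is commonly exposed as 
/dev/dma_heap/system, but that depends on the kernel configuration and 
on device permissions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/*
 * Hypothetical helper: allocate len bytes from the DMA heap at heap_path.
 * Returns a DMA-buf file descriptor on success, -1 with errno set on error.
 * The caller owns the returned fd and must close() it when done.
 */
static int dma_heap_alloc_buf(const char *heap_path, size_t len)
{
	struct dma_heap_allocation_data data = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,	/* flags for the new DMA-buf fd */
	};
	int heap_fd, ret, saved_errno;

	assert(heap_path != NULL);

	heap_fd = open(heap_path, O_RDONLY | O_CLOEXEC);
	if (heap_fd == -1)
		return -1;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);

	/* The heap fd is only needed for the allocation itself. */
	saved_errno = errno;
	close(heap_fd);
	errno = saved_errno;

	return ret == -1 ? -1 : (int)data.fd;
}
```

The returned file descriptor can be mmap()ed for CPU access or handed to 
the importing devices, for example 
dma_heap_alloc_buf("/dev/dma_heap/system", 4096) where that heap exists.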

I'm a bit surprised that we are arguing about stuff like this because we 
spent a lot of effort trying to document this. Daniel gave me the job to 
fix this documentation, but after reading through it multiple times now 
I can't seem to find where the design and the desired behavior are unclear.

What is clearly a bug in the kernel is that we don't reject things which 
won't work correctly, and this is what this patch here addresses. What we 
could talk about is backward compatibility for this patch, because it 
might look like it breaks things which previously used to work at least 
partially.

Regards,
Christian.