Re: Try to address the DMA-buf coherency problem

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Wed, 2 Nov 2022 13:21:46 +0100

Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:
Hi Christian,

going to reply in more detail when I have some more time, so just some
quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
[SNIP]
As far as I can see it you guys just allocate a buffer from a V4L2
device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed
back on this practice already.

This only works because the Wayland as well as X display pipeline is
smart enough to insert an extra copy when it find that an imported
buffer can't be used as a framebuffer directly.

With bracketed access you could even make this case work, as the dGPU
would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should 
pump the data into LMEM in the first place.

The only case the commonly fails is whenever we try to display UVC
created dmabuf,
Well, exactly that's not correct! The whole x86 use cases of direct
display for dGPUs are broken because media players think they can do the
simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but
rather something completely special to ARM.
It the normal case for a lot of ARM SoCs.

Yeah, but it's not the normal case for everybody.

We had numerous projects where customers wanted to pump video data 
directly from a decoder into an GPU frame or from a GPU frame into an 
encoder.

The fact that media frameworks doesn't support that out of the box is 
simply a bug.

That world is certainly not
any less big than the x86 dGPU world. A huge number of devices are ARM
based set-top boxes and other video players. Just because it is a
special case for you doesn't mean it's a global special case.

Ok, let's stop with that. This isn't helpful in the technical discussion.

That we haven't heard anybody screaming that x86 doesn't work is just
because we handle the case that a buffer isn't directly displayable in
X/Wayland anyway, but this is absolutely not the optimal solution.

The argument that you want to keep the allocation on the codec side is
completely false as far as I can see.

We already had numerous projects where we reported this practice as bugs
to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

And on a lot of ARM SoCs it's exactly the right thing to do.

Yeah and that's fine, it just doesn't seem to work in all cases.

For both x86 as well as the case here that the CPU cache might be dirty 
the exporter needs to be the device with the requirements.

For x86 dGPUs that's the backing store is some local memory. For the 
non-coherent ARM devices it's that the CPU cache is not dirty.

For a device driver which solely works with cached system memory 
inserting cache flush operations is something it would never do for 
itself. It would just be doing this for the importer and exactly that 
would be bad design because we then have handling for the display driver 
outside of the driver.

This is just a software solution which works because of coincident and
not because of engineering.
By mandating a software fallback for the cases where you would need
bracketed access to the dma-buf, you simply shift the problem into
userspace. Userspace then creates the bracket by falling back to some
other import option that mostly do a copy and then the appropriate
cache maintenance.

While I understand your sentiment about the DMA-API design being
inconvenient when things are just coherent by system design, the DMA-
API design wasn't done this way due to bad engineering, but due to the
fact that performant DMA access on some systems just require this kind
of bracketing.

Well, this is exactly what I'm criticizing on the DMA-API. Instead of 
giving you a proper error code when something won't work in a specific 
way it just tries to hide the requirements inside the DMA layer.

For example when your device can only access 32bits the DMA-API 
transparently insert bounce buffers instead of giving you a proper error 
code that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it 
into the upper layer where it can be handled much more gracefully.

Regards,
Christian.

Regards,
Lucas