Re: Try to address the DMA-buf coherency problem

Lucas Stach <l.stach@xxxxxxxxxxxxxx> · Fri, 28 Oct 2022 10:09:54 +0200

Hi Christian,

On Thursday, 20.10.2022 at 14:13 +0200, Christian König wrote:
> Hi guys,
> 
> after finding that we essentially have two separate worlds for coherent sharing
> of buffer through DMA-buf I thought I will tackle that problem a bit and at
> least allow the framework to reject attachments which won't work.
> 
> So those patches here add a new dma_coherent flag to each DMA-buf object
> telling the framework that dev_is_dma_coherent() needs to return true for an
> importing device to be able to attach. Since we should always have a fallback
> path this should give userspace the chance to still keep the use case working,
> either by doing a CPU copy instead or reversing the roles of exporter and
> importer.
> 
The fallback would likely be a CPU copy with the appropriate cache
flushes done by the device driver, which would be a massive performance
issue.

> For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
> return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
> the flag for their buffers if special handling like the USWC flag in amdgpu or
> the uncached allocations for radeon/nouveau are in use.
> 
I don't think the V4L2 part works for most ARM systems. The default
there is for devices to be noncoherent unless explicitly marked
otherwise. I don't think any of the "devices" writing the video buffers
in cached memory with the CPU do this. While we could probably mark
them as coherent, I don't think this is moving in the right direction.

> Additional to that importers can also check the flag if they have some
> non-snooping operations like the special scanout case for amdgpu for example.
> 
> The patches are only smoke tested and the solution isn't ideal, but as far as
> I can see should at least keep things working.
> 
I would like to see this solved properly. By properly I mean that we
make things work on systems with mixed coherent/noncoherent masters
in the same way the DMA API does on such systems: by inserting the
proper cache maintenance operations.
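
To make the contrast concrete, here is a rough sketch. It is a simplified
userspace model, not kernel code: struct model_device, enum cache_op and
all three functions are hypothetical names invented for illustration. Only
the attach rule (importer must be coherent if the buffer is flagged
dma_coherent) is taken from the cover letter quoted above. The model
assumes the CPU accesses the buffer through a cached mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a DMA master; not a kernel structure. */
struct model_device {
	const char *name;
	bool dma_coherent; /* what dev_is_dma_coherent() would report */
};

/* Cache maintenance the CPU side has to perform, simplified. */
enum cache_op {
	CACHE_OP_NONE,       /* device snoops the CPU caches, nothing to do */
	CACHE_OP_CLEAN,      /* write dirty CPU cache lines back to memory */
	CACHE_OP_INVALIDATE, /* drop stale CPU cache lines */
};

/*
 * Attach rule as described in the cover letter: a buffer flagged
 * dma_coherent can only be attached by a coherent importer.
 */
static bool attach_allowed_by_flag(bool buf_dma_coherent,
				   const struct model_device *importer)
{
	return !buf_dma_coherent || importer->dma_coherent;
}

/*
 * Alternative: never reject, but decide per handover which maintenance
 * is needed. Before the device reads data the CPU wrote, dirty lines
 * must reach memory unless the device snoops.
 */
static enum cache_op op_before_device_access(const struct model_device *dev)
{
	return dev->dma_coherent ? CACHE_OP_NONE : CACHE_OP_CLEAN;
}

/*
 * Before the CPU reads data the device wrote, stale lines must be
 * dropped unless the device snoops.
 */
static enum cache_op op_before_cpu_access(const struct model_device *dev)
{
	return dev->dma_coherent ? CACHE_OP_NONE : CACHE_OP_INVALIDATE;
}
```

With the flag-based rule a noncoherent importer is simply turned away
from a dma_coherent buffer, while the second approach lets it attach at
the cost of a clean or invalidate on each handover. Real architectures
need more cases than this (DMA direction, partial cache lines), so treat
it as an illustration of the idea only.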

I also think that we should keep in mind that the world is moving in
a direction where DMA masters may not only snoop the CPU caches (which
is the definition of cache coherent on x86), but actually take part in
the system coherence and are able to have writeback caches for shared
data on their own. I can only speculate, as I haven't seen the amdgpu
side yet, but I think this proposal is moving in the other direction by
assuming a central system cache, where the importer has some magic way
to clean this central cache.

Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
dma-buf sharing work on ARM systems and seem to hold some strong
opinions on how this should work, I guess I need to make some time
available to type it up, so we can discuss over code rather than
abstract ideas. If I come up with something that works for my use-cases,
would you be up for taking a shot at an amdgpu implementation?

Regards,
Lucas