Re: DMA-buf and uncached system memory

Lucas Stach <l.stach@xxxxxxxxxxxxxx> · Thu, 23 Jun 2022 12:13:16 +0200

On Thursday, 23.06.2022 at 11:46 +0200, Christian König wrote:
> On 23.06.22 at 11:33, Lucas Stach wrote:
> > [SNIP]
> > > > > > In the DMA API keeping things mapped is also a valid use-case, but then
> > > > > > you need to do explicit domain transfers via the dma_sync_* family,
> > > > > > which DMA-buf has not inherited. Again those sync are no-ops on cache
> > > > > > coherent architectures, but do any necessary cache maintenance on non
> > > > > > coherent arches.
> > > > > Correct, yes. Coherency is mandatory for DMA-buf, you can't use
> > > > > dma_sync_* on it when you are the importer.
> > > > > 
> > > > > The exporter could of course make use of that because he is the owner of
> > > > > the buffer.
> > > > In the example given here with UVC video, you don't know that the
> > > > buffer will be exported and needs to be coherent without
> > > > synchronization points, due to the mapping cache at the DRM side. So
> > > > V4L2 naturally allocates the buffers from CPU cached memory. If the
> > > > expectation is that those buffers are device coherent without relying
> > > > on the map/unmap_attachment calls, then V4L2 needs to always
> > > > synchronize caches on DQBUF when the buffer is allocated from CPU
> > > > cached memory and a single DMA-buf attachment exists. And while writing
> > > > this I realize that this is probably exactly what V4L2 should do...
> > > No, the expectation is that the importer can deal with whatever the
> > > exporter provides.
> > > 
> > > If the importer can't access the DMA-buf coherently it's his job to
> > > handle that gracefully.
> > How does the importer know that the memory behind the DMA-buf is in CPU
> > cached memory?
> > 
> > If you now tell me that an importer always needs to assume this and
> > reject the import if it can't do snooping, then any DMA-buf usage on
> > most ARM SoCs is currently invalid usage.
> 
> Yes, exactly that. I've pointed out a couple of times now that a lot of 
> ARM SoCs don't implement that the way we need it.
> 
> We already had tons of bug reports because somebody attached a random 
> PCI root complex to an ARM SoC and expected it to work with for example 
> an AMD GPU.
> 
> Non-cache coherent applications are currently not really supported by 
> the DMA-buf framework in any way.
> 
I'm not talking about bolting a PCIe root complex, with its implicitly
inherited "PCI is cache coherent" expectations, onto an ARM SoC. My
point is that the standard VPU/GPU/display engines don't snoop on most
ARM SoCs.

> > On most of the multimedia
> > targeted ARM SoCs being unable to snoop the cache is the norm, not an
> > exception.
> > 
> > > See for example on AMD/Intel hardware most of the engines can perfectly
> > > deal with cache coherent memory accesses. Only the display engines can't.
> > > 
> > > So on import time we can't even say if the access can be coherent and
> > > snoop the CPU cache or not because we don't know how the imported
> > > DMA-buf will be used later on.
> > > 
> > So for those mixed use cases, wouldn't it help to have something
> > similar to the dma_sync in the DMA-buf API, so your scanout usage can
> > tell the exporter that it's going to do non-snoop access and any dirty
> > cache lines must be cleaned? Signaling this to the exporter would allow
> > to skip the cache maintenance if the buffer is in CPU uncached memory,
> > which again is a default case for the ARM SoC world.
> 
> Well for the AMD and Intel use cases we at least have the opportunity to 
> signal cache flushing, but I'm not sure if that counts for everybody.
> 
Sure, all the non-coherent arches have some explicit way to do the
cache maintenance. Non-coherent with no cache maintenance instructions
would be a recipe for disaster. ;)

> What we would rather do for those use cases is an indicator on the 
> DMA-buf if the underlying backing store is CPU cached or not. The 
> importer can then cleanly reject the use cases where it can't support 
> CPU cache snooping.
> 
> This then results in the normal fallback paths which we have anyway for 
> those use cases because DMA-buf sharing is not always possible.
> 
That's a very x86-centric world view you have there. 99% of DMA-buf
uses on those cheap ARM SoCs are non-snooping. We cannot do any
fallbacks here, as the whole graphics world on those SoCs, with their
different IP cores mixed together, depends on DMA-buf sharing working
efficiently even when the SoC is mostly non-coherent.

In fact DMA-buf sharing works fine on most of those SoCs because
everyone just assumes that all the accelerators don't snoop, so the
memory shared via DMA-buf is mostly CPU uncached. It only falls apart
for uses like the UVC cameras, where the shared buffer ends up being
CPU cached.

Non-coherent without explicit domain transfer points is just not going
to work. So why can't we solve the issue for DMA-buf in the same way as
the DMA API already solved it years ago: by adding the equivalent of
the dma_sync calls that do cache maintenance when necessary? On x86 (or
any system where things are mostly coherent) you could still no-op them
for the common case and only trigger cache cleaning if the importer
explicitly says that it is going to do a non-snooping access.
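
A rough sketch of what such a call could look like, written as a small
user-space model rather than kernel code. Every name here (model_buf,
model_importer, model_sync_for_device) is made up for illustration and
is not existing DMA-buf API; the cache clean is only a counter standing
in for whatever maintenance the architecture would really do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a shared buffer; not a real kernel structure. */
struct model_buf {
	bool cpu_cached;     /* backing store is mapped CPU cached */
	unsigned int cleans; /* how many cache cleans were performed */
};

/* Hypothetical description of how an importer accesses the buffer. */
struct model_importer {
	bool snoops;         /* device accesses snoop the CPU cache */
};

/* Stand-in for the architecture specific cache maintenance. */
static void model_cache_clean(struct model_buf *buf)
{
	buf->cleans++;
}

/*
 * Hypothetical sync-for-device call: the importer announces a device
 * access, and cache maintenance happens only when it is needed, i.e.
 * the importer does not snoop and the buffer is CPU cached.
 * Returns true if a cache clean was done, false if it was a no-op.
 */
static bool model_sync_for_device(struct model_buf *buf,
				  const struct model_importer *imp)
{
	assert(buf != NULL && imp != NULL);

	if (imp->snoops || !buf->cpu_cached)
		return false;

	model_cache_clean(buf);
	return true;
}
```

In this model a snooping importer and a CPU uncached buffer both hit
the no-op path, so only the UVC-style case (CPU cached buffer,
non-snooping importer) pays for the cache clean.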

Regards,
Lucas