On Thursday, 23.06.2022 at 11:46 +0200, Christian König wrote:
> On 23.06.22 at 11:33, Lucas Stach wrote:
> > [SNIP]
> > > > > > In the DMA API keeping things mapped is also a valid use-case, but then
> > > > > > you need to do explicit domain transfers via the dma_sync_* family,
> > > > > > which DMA-buf has not inherited. Again those sync are no-ops on cache
> > > > > > coherent architectures, but do any necessary cache maintenance on non
> > > > > > coherent arches.
> > > > > Correct, yes. Coherency is mandatory for DMA-buf, you can't use
> > > > > dma_sync_* on it when you are the importer.
> > > > >
> > > > > The exporter could of course make use of that because he is the owner of
> > > > > the buffer.
> > > > In the example given here with UVC video, you don't know that the
> > > > buffer will be exported and needs to be coherent without
> > > > synchronization points, due to the mapping cache at the DRM side. So
> > > > V4L2 naturally allocates the buffers from CPU cached memory. If the
> > > > expectation is that those buffers are device coherent without relying
> > > > on the map/unmap_attachment calls, then V4L2 needs to always
> > > > synchronize caches on DQBUF when the buffer is allocated from CPU
> > > > cached memory and a single DMA-buf attachment exists. And while writing
> > > > this I realize that this is probably exactly what V4L2 should do...
> > > No, the expectation is that the importer can deal with whatever the
> > > exporter provides.
> > >
> > > If the importer can't access the DMA-buf coherently it's his job to
> > > handle that gracefully.
> > How does the importer know that the memory behind the DMA-buf is in CPU
> > cached memory?
> >
> > If you now tell me that an importer always needs to assume this and
> > reject the import if it can't do snooping, then any DMA-buf usage on
> > most ARM SoCs is currently invalid usage.
>
> Yes, exactly that.
> I've pointed out a couple of times now that a lot of
> ARM SoCs don't implement that the way we need it.
>
> We already had tons of bug reports because somebody attached a random
> PCI root complex to an ARM SoC and expected it to work with for example
> an AMD GPU.
>
> Non-cache coherent applications are currently not really supported by
> the DMA-buf framework in any way.
>
I'm not talking about bolting a PCIe root complex, with its implicitly
inherited "PCI is cache coherent" expectations, onto an ARM SoC. I'm
talking about the standard VPU/GPU/display engines, which are not
snooping on most ARM SoCs.

> > On most of the multimedia
> > targeted ARM SoCs being unable to snoop the cache is the norm, not an
> > exception.
> >
> > > See for example on AMD/Intel hardware most of the engines can perfectly
> > > deal with cache coherent memory accesses. Only the display engines can't.
> > >
> > > So on import time we can't even say if the access can be coherent and
> > > snoop the CPU cache or not because we don't know how the imported
> > > DMA-buf will be used later on.
> > >
> > So for those mixed use cases, wouldn't it help to have something
> > similar to the dma_sync in the DMA-buf API, so your scanout usage can
> > tell the exporter that it's going to do non-snoop access and any dirty
> > cache lines must be cleaned? Signaling this to the exporter would allow
> > to skip the cache maintenance if the buffer is in CPU uncached memory,
> > which again is a default case for the ARM SoC world.
>
> Well for the AMD and Intel use cases we at least have the opportunity to
> signal cache flushing, but I'm not sure if that counts for everybody.
>
Sure, all the non-coherent arches have some explicit way to do the
cache maintenance. Non-coherent and no cache maintenance instruction
would be a recipe for disaster. ;)

> What we would rather do for those use cases is an indicator on the
> DMA-buf if the underlying backing store is CPU cached or not.
> The importer can then cleanly reject the use cases where it can't support
> CPU cache snooping.
>
> This then results in the normal fallback paths which we have anyway for
> those use cases because DMA-buf sharing is not always possible.
>
That's a very x86 centric world view you have there. 99% of DMA-buf
uses on those cheap ARM SoCs are non-snooping. We cannot do any
fallbacks here, as the whole graphics world on those SoCs, with their
different IP cores mixed together, depends on DMA-buf sharing working
efficiently even when the SoC is mostly non-coherent.

In fact DMA-buf sharing works fine on most of those SoCs because
everyone just assumes that all the accelerators don't snoop, so the
memory shared via DMA-buf is mostly CPU uncached. It only falls apart
for uses like the UVC cameras, where the shared buffer ends up being
CPU cached.

Non-coherent without explicit domain transfer points is just not going
to work. So why can't we solve the issue for DMA-buf in the same way as
the DMA API already solved it years ago: by adding the equivalent of
the dma_sync calls that do cache maintenance when necessary? On x86 (or
any system where things are mostly coherent) you could still no-op them
for the common case and only trigger cache cleaning if the importer
explicitly says that it is going to do a non-snooping access.

Regards,
Lucas
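
To make the proposal above a bit more concrete, here is a rough
userspace model of the decision logic, not kernel code. Every name in
it (model_buf, model_begin_device_access and so on) is hypothetical and
does not exist in the DMA-buf or DMA API. Real cache maintenance is
replaced by counters, so the sketch only shows when a clean or
invalidate would be triggered, assuming the importer announces whether
its access snoops the CPU caches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, for illustration only. Nothing here is kernel API. */

enum model_access {
	MODEL_ACCESS_SNOOPING,     /* device access snoops the CPU caches */
	MODEL_ACCESS_NON_SNOOPING, /* device access bypasses the CPU caches */
};

struct model_buf {
	bool cpu_cached;          /* backing store is mapped CPU cached */
	bool cache_dirty;         /* CPU wrote through the cache since last clean */
	unsigned int cleans;      /* simulated cache clean operations */
	unsigned int invalidates; /* simulated cache invalidate operations */
};

/* CPU writes only leave dirty cache lines if the buffer is CPU cached. */
void model_cpu_write(struct model_buf *buf)
{
	assert(buf != NULL);
	if (buf->cpu_cached)
		buf->cache_dirty = true;
}

/*
 * Importer announces a device access. This is a no-op for snooping
 * access and for CPU uncached buffers. Dirty lines are cleaned only
 * when a non-snooping device is about to touch CPU cached memory.
 */
void model_begin_device_access(struct model_buf *buf, enum model_access access)
{
	assert(buf != NULL);
	if (access == MODEL_ACCESS_SNOOPING || !buf->cpu_cached)
		return;
	if (buf->cache_dirty) {
		buf->cleans++;
		buf->cache_dirty = false;
	}
}

/*
 * Importer finished its access. If a non-snooping device wrote to CPU
 * cached memory, stale cache lines must be invalidated before the CPU
 * reads the buffer again.
 */
void model_end_device_access(struct model_buf *buf, enum model_access access,
			     bool device_wrote)
{
	assert(buf != NULL);
	if (access == MODEL_ACCESS_SNOOPING || !buf->cpu_cached)
		return;
	if (device_wrote)
		buf->invalidates++;
}
```

In this model the UVC case (CPU cached buffer, CPU-side writes,
non-snooping scanout) costs exactly one clean per frame, while the
common cases (snooping importer, or CPU uncached buffer) cost nothing.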