Re: Try to address the DMA-buf coherency problem

Nicolas Dufresne <nicolas@xxxxxxxxxxxx> · Fri, 28 Oct 2022 11:46:20 -0400

Hi,

just dropping some real live use case, sorry I'm not really proposing solutions,
I believe you are much more knowledgeable in this regard.

Le vendredi 28 octobre 2022 à 16:26 +0200, Christian König a écrit :
> Am 28.10.22 um 13:42 schrieb Lucas Stach:
> > Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:
> > > But essentially the right thing to do. The only alternative I can see is
> > > to reverse the role of exporter and importer.
> > > 
> > I don't think that would work generally either, as buffer exporter and
> > importer isn't always a 1:1 thing. As soon as any attached importer has
> > a different coherency behavior than the others, things fall apart.
> 
> I've just mentioned it because somebody noted that when you reverse the 
> roles of exporter and importer with the V4L driver and i915 then the use 
> case suddenly starts working.

Though, its not generically possible to reverse these roles. If you want to do
so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
because you will have to allocate DRM buffers that knows about importer specific
requirements. See link [1] for what it looks like for RK3399, with Motion Vector
size calculation copied from the kernel driver into a userspace lib (arguably
that was available from V4L2 sizeimage, but this is technically difficult to
communicate within the software layers). If you could let the decoder export
(with proper cache management) the non-generic code would not be needed.

Another case where reversing the role is difficult is for case where you need to
multiplex the streams (let's use a camera to illustrate) and share that with
multiple processes. In these uses case, the DRM importers are volatile, which
one do you abuse to do allocation from ? In multimedia server like PipeWire, you
are not really aware if the camera will be used by DRM or not, and if something
"special" is needed in term of role inversion. It is relatively easy to deal
with matching modifiers, but using downstream (display/gpu) as an exporter is
always difficult (and require some level of abuse and guessing).

[1]
https://android.googlesource.com/platform/external/minigbm/+/refs/heads/master/rockchip.c#140

> 
> > > > > For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
> > > > > return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
> > > > > the flag for their buffers if special handling like the USWC flag in amdgpu or
> > > > > the uncached allocations for radeon/nouveau are in use.
> > > > > 
> > > > I don't think the V4L2 part works for most ARM systems. The default
> > > > there is for devices to be noncoherent unless explicitly marked
> > > > otherwise. I don't think any of the "devices" writing the video buffers
> > > > in cached memory with the CPU do this. While we could probably mark
> > > > them as coherent, I don't think this is moving in the right direction.
> > > Well why not? Those devices are coherent in the sense of the DMA API
> > > that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.
> > > 
> > > We could come up with a better name for coherency, e.g. snooping for
> > > example. But that is just an documentation detail.
> > > 
> > I agree that those devices copying data into a CPU cacheable buffer
> > should be marked as coherent, just not sure right now if other things
> > like DMA mappings are done on that device, which would require the
> > cache maintenance.
> 
> Yeah, good point.
> 
> > > And this the exact wrong approach as far as I can see. As Daniel noted
> > > as well we absolutely need some kind of coherency between exporter and
> > > importer.
> > > 
> > I think it's important that we are very specific about the thing we are
> > talking about here: I guess when you say coherency you mean hardware
> > enforced coherency on cacheable memory, which is the default on
> > x86/PCI.
> 
> Well, no. What I mean with coherency is that the devices don't need 
> insert special operation to access each others data.
> 
> This can be archived by multiple approaches, e.g. by the PCI coherency 
> requirements, device internal connections (XGMI, NVLink, CXL etc...) as 
> well as using uncached system memory.
> 
> The key point is what we certainly don't want is special operations 
> which say: Ok, now device A can access the data, now device B..... 
> because this breaks tons of use cases.

I'm coming back again with the multiplexing case. We keep having mixed uses case
with multiple receiver. In some case, data may endup on CPU while being encoded
in HW. Current approach of disabling cache does work, but CPU algorithm truly
suffer in performance. Doing a full memcpy to a cached buffer helps, but remains
slower then if the cache had been snooped by the importer (encoder here) driver.

> 
> > The other way to enforce coherency is to either insert cache
> > maintenance operations, or make sure that the buffer is not cacheable
> > by any device taking part in the sharing, including the CPU.
> 
> Yes and no. When we want the devices to interact with each other without 
> the CPU then we need some negotiated coherency between the two.
> 
> > > This can either be provided by the PCI specification which makes it
> > > mandatory for device to snoop the caches or by platform devices agreeing
> > > that they only work on uncached memory.
> > What you disregard here is the fact that there are many systems out
> > there with mixed topologies, where some masters are part of the
> > coherency domain and some are not.
> > 
> > We have two options here: either mandate that coherency for dma-bufs
> > need to be established by hardware, which is the option that you
> > strongly prefer, which means forcing all buffers to be uncacheable in a
> > system with masters that are not coherent, or allowing some form of
> > bracketed DMA access with cache maintenance ops.
> 
> Well I don't prefer that option, it's just the only one I can see. One 
> of the main goals of DMA-buf is to allow device to share data without 
> the need for CPU interactions.
> 
> In other words we negotiate the high level properties and then the 
> device can talk to each other without explicitly noting who is accessing 
> what.
> 
> And this concept is completely incompatible with maintenance ops. We 
> made that mistake with SWIOTLB for the DMA API and I don't really want 
> to repeat that stunt.
> 
> > > Explicit cache flush operations are simple not part of the design of
> > > DMA-buf because they are not part of the design of the higher level APIs
> > > like Vulkan and OpenGL.
> > I'm aware that some graphics APIs have been living in a universe
> > blissfully unaware of systems without hardware enforced coherency. But
> > that isn't the only use for dma-bufs.
> 
> Yeah, but the other use cases are extremely limited. As far as I can see
> 
> > 
> > I totally agree that some higher level API primitives aren't possible
> > without coherency at the hardware level and for those uses we should
> > require either HW enforced coherency or uncachable memory. But I don't
> > think we should make things slow deliberately on systems that allow to
> > optimize things with the help of bracketed access.
> > 
> > If I understood things right your amdgpu use-case even falls into this
> > category: normally you would want to use cacheable memory for
> > everything, but just make sure to clean the caches before using the
> > buffer with the non-coherent display engine.
> 
> No, that won't work like this. The caching attributes must be coherent 
> for the display engine to work correctly.
> 
> > > Adding this to DMA-buf for the rather special use case would completely
> > > break that and make live much more complicated for everybody.
> > > 
> > > > I also think that we should keep in mind that the world is moving into
> > > > a direction where DMA masters may not only snoop the CPU caches (what
> > > > is the definition of cache coherent on x86), but actually take part in
> > > > the system coherence and are able to have writeback caches for shared
> > > > data on their own. I can only speculate, as I haven't seen the amdgpu
> > > > side yet, but I think this proposal is moving in the other direction by
> > > > assuming a central system cache, where the importer has some magic way
> > > > to clean this central cache.
> > > What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link
> > Or ARM AMBA CHI.
> > > And yes we support that in a bunch of configurations and also have
> > > worked with that and amdgpu with DMA-buf based shared.
> > > 
> > > This should not be a problem with this approach.
> > It works as long as all masters sharing the buffer are accessing the
> > buffer through the HW cache coherence facilities provided by CXL. As
> > soon as a master wants to bypass it (like your nosnoop scanout) you
> > need some way to force all other masters sharing access to the buffer
> > to clean their caches.
> 
> That won't work like this. The problem is that this is an APU and so the 
> display is part of the CPU. When either the MTRR or PAT says that the 
> physical address is cacheable the engine might just hang on access.
> 
> > > > Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
> > > > dma-buf sharing work on ARM systems and seem to hold some strong
> > > > opinions on how this should work, I guess I need to make some time
> > > > available to type it up, so we can discuss over coder rather than
> > > > abstract ideas. If I come up with something that works for my use-cases
> > > > would you be up for taking a shot at a amdgpu implementation?
> > > Well, not really. As I said I see what you have in mind here as
> > > completely wrong approach we will certainly not support in any GPU driver.
> > > 
> > > What we can do is to request the use case which won't work and this is
> > > exactly what the patch here does.
> > > 
> > Did you mean to write "prevent the use case which won't work" here?
> 
> Oh, yes. To fast typing.
> 
> Regards,
> Christian.
> 
> > 
> > Regards,
> > Lucas
> > 
>