Re: Try to address the DMA-buf coherency problem

On 06.12.22 at 19:26, Nicolas Dufresne wrote:
On Monday, 5 December 2022 at 09:28 +0100, Christian König wrote:
Hi Tomasz,

On 05.12.22 at 07:41, Tomasz Figa wrote:
[SNIP]
In other words, explicit ownership transfer is not something we would
want as a requirement in the framework, because otherwise we break tons of
use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer, it's perfectly
possible to implement this using the dma_fence objects mentioned above.
E.g. drivers can already look at who is currently accessing a buffer and
can even grab explicit ownership of it by adding their own dma_fence
objects.
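
For reference, roughly what such an explicit claim could look like on the
kernel side, assuming a driver that already has a dma_fence for its own DMA
work; the my_* names are hypothetical and the dma_resv API details vary
between kernel versions, so treat this as a sketch rather than a recipe:

/* Sketch: claim write access to a shared DMA-buf by publishing our own
 * fence in its reservation object (hypothetical helper). */
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

static int my_dev_claim_buffer(struct dma_buf *dmabuf,
			       struct dma_fence *my_fence)
{
	struct dma_resv *resv = dmabuf->resv;
	int ret;

	ret = dma_resv_lock(resv, NULL);
	if (ret)
		return ret;

	/* Reserve a fence slot, then add our fence as a writer so other
	 * importers see the buffer as busy until the fence signals. */
	ret = dma_resv_reserve_fences(resv, 1);
	if (!ret)
		dma_resv_add_fence(resv, my_fence, DMA_RESV_USAGE_WRITE);

	dma_resv_unlock(resv);
	return ret;
}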

The only exception is CPU-based access: e.g. when something is written
with the CPU a cache flush might be necessary, and when something is read
with the CPU a cache invalidation might be necessary.
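
For completeness, the userspace side of the begin/end CPU access calls is
the DMA_BUF_IOCTL_SYNC ioctl. A minimal sketch, assuming the buffer was
already mmap()ed from the DMA-buf fd:

/* Sketch: bracket CPU writes to a mapped DMA-buf so the exporter can do
 * the necessary cache flush/invalidation. */
#include <linux/dma-buf.h>
#include <sys/ioctl.h>
#include <string.h>

static int cpu_fill_dmabuf(int dmabuf_fd, void *map, size_t size)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
		return -1;

	memset(map, 0, size);	/* the actual CPU access */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}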

Okay, that's much clearer now, thanks for clarifying this. So we
should be covered for the cache maintenance needs originating from CPU
accesses already, +/- the broken cases which don't call the begin/end
CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines,
we should be covered either by the userspace explicitly flushing the
hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing
begin/end CPU access calls.)
Well there are multiple problems here:

1. A lot of userspace applications/frameworks assume that they can
allocate the buffer anywhere and it just works.
I know you have said that about 10 times, perhaps I'm about to believe it, but
why do you think userspace assumes this? Did you actually read code that does
this (that isn't meant to run in a controlled environment)?

Yes, absolutely.

Kodi for example used to do this and it was actually me who pointed out that this whole approach is flawed in the first place.

And we had tons of customer projects which ran into trouble with that. It became better in the last few years, but only after pushing back on this many many times.

Regards,
Christian.

  And can you provide
some example of broken generic userspace? The DMABuf flow is meant to be trial
and error. At least in GStreamer, yes, mostly only device allocation (when
generically usable) is implemented, but the code that has been contributed will
try and fall back as documented. It still fails sometimes, but that's exactly the
kind of kernel bugs your patchset is trying to address. I don't blame anyone
here, since why would folks on GStreamer/FFMPEG or any other "generic media
framework" spend so much time implementing "per Linux device code", when non-
embedded (constrained) Linux has just a handful of users (compared to Windows,
Android, iOS users).

To me, this shouldn't be the #1 issue. Perhaps it should simply be replaced by
userspace not supporting DMABuf Heaps. Perhaps add that Linux distributions don't
always enable (or allow normal users to access) heaps (though your point 2. gets
in the way)? Unlike virtual memory, I don't think there is a very good accounting
and reclaiming mechanism for that memory, hence opening these up means any
userspace could possibly impair the system's functioning. If you can't e.g. limit
their usage within containers, this is pretty difficult for generic Linux to
carry. This is a wider problem of course, which likely affects a lot of GPU usage
too, but perhaps it should be in the lower-priority part of the todo.

This isn't true at all, we have tons of cases where devices can only
access their special memory for certain use cases.
Just look at scanout for displaying on a dGPU: neither AMD nor NVidia
supports system memory here. Similar cases exist for audio/video codecs
where intermediate memory is only accessible by certain devices because
of content protection.
nit: content protection is not codec specific, it's a platform feature, and it's
also not really a thing upstream yet from what I'm aware of. This needs unified
design and documentation imho, but also enough standardisation so that a generic
application can use it. Right now, content protection people have been
complaining that V4L2 (and most generic userspace) doesn't work with their design,
rather than trying to figure out a design that works with the existing APIs.

2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps, userspace can currently only
guess whether normal, CMA or even device-specific memory is needed (see the
sketch below).

3. We seem to lack some essential parts of those restrictions in the
documentation.
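
To make point 2 concrete: this is roughly what allocating from a named
DMA-BUF heap looks like from userspace today. A sketch only; nothing in the
API tells a generic application which heap name the consuming devices
actually require, so the choice is guesswork:

/* Sketch: allocate a DMA-buf from a named heap. Picking "system" vs. the
 * CMA heap vs. a vendor heap is pure guesswork for a generic application. */
#include <linux/dma-heap.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

static int alloc_from_heap(const char *heap_name, size_t size)
{
	struct dma_heap_allocation_data alloc = {
		.len = size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	char path[64];
	int heap_fd, ret;

	snprintf(path, sizeof(path), "/dev/dma_heap/%s", heap_name);
	heap_fd = open(path, O_RDONLY | O_CLOEXEC);
	if (heap_fd < 0)
		return -1;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
	close(heap_fd);
	return ret ? -1 : alloc.fd;	/* DMA-buf fd on success */
}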
Agreed (can't always disagree).

regards,
Nicolas

So if a device driver uses cached system memory on an architecture where
devices can't access it, the right approach is clearly to reject
the access.
I'd like to stress the fact that "requires cache maintenance" != "can't access".
Well that depends. As said above the exporter exports the buffer as it
was allocated.

If that means that the exporter provides a piece of memory which requires
CPU cache snooping to access correctly, then the best thing we can do is
to prevent an importer which can't do this from attaching.
Could you elaborate more about this case? Does it exist in practice?
Do I assume correctly that it's about sharing a buffer between one DMA
engine that is cache-coherent and another that is non-coherent, where
the first one ends up having its accesses always go through some kind
of a cache (CPU cache, L2/L3/... cache, etc.)?
Yes, exactly that. What happens in this particular use case is that one
device driver wrote to its internal buffer with the CPU (so some cache
lines were dirty) and then a device which couldn't deal with that tried
to access it.

We could say that all device drivers must always look at the coherency
of the devices which want to access their buffers. But that would
horribly complicate things for maintaining the drivers, because then
drivers would need to take into account requirements from other drivers
while allocating their internal buffers.

That's essentially the reason why we have DMA-buf heaps. Those heaps
expose system memory buffers with certain properties (size, CMA, DMA bit
restrictions etc...) and also implement the callback functions for CPU
cache maintenance.

We do have the system and CMA dma-buf heaps for cases where a device
driver wants to access the buffer with caches enabled. So this is
not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.
Yeah, when that's a requirement for the V4L2 device it also makes
perfect sense.

It's just that we shouldn't force any specific allocation behavior on a
device driver because of the requirements of a different device.

To give an example, a V4L2 device shouldn't be forced to use
videobuf2-dma-contig because some other device needs CMA. Instead we
should use the common DMA-buf heaps which implement this as a neutral
supplier of system memory buffers.
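
As a sketch of that direction: importing a buffer that was allocated
elsewhere (e.g. from a DMA-buf heap) into a V4L2 queue with
V4L2_MEMORY_DMABUF, instead of having videobuf2 allocate it. Error handling
is trimmed down and a single-planar capture queue is assumed:

/* Sketch: queue an externally allocated DMA-buf on a capture queue. */
#include <linux/videodev2.h>
#include <sys/ioctl.h>

static int import_dmabuf(int video_fd, int dmabuf_fd)
{
	struct v4l2_requestbuffers req = {
		.count = 1,
		.type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.memory = V4L2_MEMORY_DMABUF,
	};
	struct v4l2_buffer buf = {
		.index = 0,
		.type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.memory = V4L2_MEMORY_DMABUF,
		.m.fd = dmabuf_fd,
	};

	if (ioctl(video_fd, VIDIOC_REQBUFS, &req))
		return -1;
	return ioctl(video_fd, VIDIOC_QBUF, &buf);
}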

The problem is that in this particular case the exporter provides the
buffer as is, e.g. with dirty CPU caches. And the importer can't deal
with that.
Why does the exporter leave the buffer with dirty CPU caches?
Because exporters always export the buffers as they would use them. And in
this case that means writing to them with the CPU.

Either reverting the roles of the importer or exporter or switching over
to the common DMA-heaps implementation for the buffer would solve that.

It's the job of the higher level to prepare the buffer so that a device
can work with it, not the job of the lower level.
What are higher and lower levels here?
The MM as higher level and the DMA mapping framework as lower level.

Hmm, I generally consider the DMA mapping framework a part of MM,
especially its allocation capabilities. It heavily relies on low level
MM primitives internally and exposes another set of low level
primitives that is more useful for device drivers. At least it should
be seen this way, but I agree that it currently includes things that
shouldn't necessarily be there, like the transparent bouncing.
Exactly that, yes! Good to hear that more people think that way.

Christoph made some comments which sounded like he agreed to some of the
points as well, but nobody ever said it so clearly.

As per the existing design of the DMA mapping framework, the framework
handles the system DMA architecture details and DMA master drivers
take care of invoking the right DMA mapping operations around the DMA
accesses. This makes sense to me, as DMA master drivers have no idea
about the specific SoCs or buses they're plugged into, while the DMA
mapping framework has no idea when the DMA accesses are taking place.
Yeah and exactly that's a bit problematic. In an ideal world device
drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following:
1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
2. DMA mapping framework is called by the driver to create an sg-table.
3. The DMA mapping framework sees that this won't work and inserts
bounce buffers.
4. The driver does the DMA to the bounce buffers instead.
5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes control over what happens
here. Drivers can't, for example, prevent the use of bounce buffers when
they need coherent access for DMA-buf.

What we should have instead is that bounce buffers are applied at a
higher level, for example by get_user_pages() or more generally in the MM.
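
For readers not familiar with that flow, a rough sketch of steps 1 and 2 as
a driver typically does them today; the bouncing in step 3 happens inside
dma_map_sgtable() and the driver can neither see nor veto it. The my_*
helper is hypothetical and error unwinding is omitted:

/* Sketch: pin user pages and hand them to the DMA mapping framework.
 * Assumes uaddr is page aligned; pages[] has room for npages entries. */
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

static int my_map_userptr(struct device *dev, unsigned long uaddr,
			  size_t size, struct page **pages,
			  struct sg_table *sgt)
{
	unsigned int npages = PAGE_ALIGN(size) >> PAGE_SHIFT;
	int ret;

	/* Step 1: grab references to the user pages. */
	ret = pin_user_pages_fast(uaddr, npages,
				  FOLL_WRITE | FOLL_LONGTERM, pages);
	if (ret < 0)
		return ret;

	/* Step 2: build an sg-table and map it for the device. If the
	 * device can't reach these pages, bounce buffers get inserted
	 * transparently inside dma_map_sgtable(). */
	ret = sg_alloc_table_from_pages(sgt, pages, npages, 0, size,
					GFP_KERNEL);
	if (ret)
		return ret;

	return dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0);
}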

I tend to agree with you on this, but I see a lot of challenges that
would need to be solved if we want to make it otherwise. The whole
reason for transparent bouncing came from the fact that many existing
subsystems didn't really care about the needs of the underlying
hardware, e.g. block or network subsystems would just pass random
pages to the drivers. I think the idea of DMA mapping handling this
internally was to avoid implementing the bouncing here and there in
each block or network driver that needs it. (Arguably, an optional
helper could be provided instead for use in those subsystems...)
Yes, totally agree. The problem is really that we moved a bunch of MM and
DMA functions into one API.

The bounce buffers are something we should really have in a separate
component.

Then the functionality of allocating system memory for a specific device
or devices should be something provided by the MM.

And finally, making this memory or any other CPU address accessible to a
device (IOMMU programming etc.) should then be part of a DMA API.

In other words, in a proper design the higher level would prepare the
memory in such a way that the device driver can work with it, not the other
way around.

When a device driver gets memory it can't work with the correct response
is to throw an error and bubble that up into a layer where it can be
handled gracefully.

For example, instead of the DMA layer using bounce buffers under the hood,
the MM should make sure that when you call read() with O_DIRECT the
pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation"
like bounce buffers, but cache maintenance is a hardware feature
existing there by default and it's often much cheaper to operate on
cached memory and synchronize the caches rather than have everything
in uncached (or write-combined) memory.
Well that we should have proper cache maintenance is really not the
question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only
take care of translating addresses.
The DMA mapping framework is actually a DMA allocation and mapping
framework. Generic memory allocation primitives (like alloc_pages(),
kmalloc()) shouldn't be used for buffers that are expected to be
passed to DMA engines - there exist DMA-aware replacements, like
dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
etc.
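
A quick sketch of what using one of those replacements looks like, e.g.
dma_alloc_noncontiguous(), which hands out cached memory the device is
known to reach and leaves the explicit sync points to the driver. API
details differ between kernel versions; the my_* name is made up:

/* Sketch: allocate device-reachable, cached memory and bracket DMA
 * transfers with explicit ownership transfers. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int my_do_dma(struct device *dev, size_t size)
{
	struct sg_table *sgt;

	/* Cached system memory the device is guaranteed to reach. */
	sgt = dma_alloc_noncontiguous(dev, size, DMA_BIDIRECTIONAL,
				      GFP_KERNEL, 0);
	if (!sgt)
		return -ENOMEM;

	/* CPU fills the buffer here (e.g. via dma_vmap_noncontiguous()),
	 * then ownership moves to the device ... */
	dma_sync_sgtable_for_device(dev, sgt, DMA_BIDIRECTIONAL);
	/* ... DMA runs ... */
	dma_sync_sgtable_for_cpu(dev, sgt, DMA_BIDIRECTIONAL);
	/* ... and the CPU can read the result. */

	dma_free_noncontiguous(dev, size, sgt, DMA_BIDIRECTIONAL);
	return 0;
}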

Or the higher level (get_user_pages() is just one example of that), which
says: ok, device X wants to access memory Y, I need to make sure that
caches are flushed/invalidated.
Okay, so here comes the userptr case, which I really feel like is just
doomed at the very start, because it does nothing to accommodate
hardware needs at allocation time and then just expects some magic to
happen to make it possible for the hardware to make use of the buffer.

That said, it should be still possible to handle that pretty well for
hardware that allows it, i.e. without much memory access constraints.
What would be still missing if we just use the existing gup() +
dma_map() + dma_sync() sequence?
Error or rather fallback handling and *not* transparently inserting
bounce buffers.

The whole implicit bounce buffer concept falls apart as soon as you
don't have streaming access any more.

It's a use-case that is working fine today with many devices (e.g. network
adapters) in the ARM world, exactly because the architecture specific
implementation of the DMA API inserts the cache maintenance operations
on buffer ownership transfer.
Yeah, I'm perfectly aware of that. The problem is that exactly that
design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that
this design was a really really bad idea because it favors just one use
case while making other use cases practically impossible if not really
hard to implement.
Sorry, I haven't worked with Xen. Could you elaborate what's the
problem that this introduces for it?
That's a bit of a longer topic. The AMD Xen devs are already working on this
as far as I know. I can ping internally to see how far they got with sending
the patches out to avoid this problem.
Hmm, I see. It might be a good data point to understand in which
direction we should be going, but I guess we can wait until they send
some patches.
There was just recently a longer thread on the amd-gfx mailing list
about that. I think looking in there as well might be beneficial.

Regards,
Christian.

Best regards,
Tomasz



