Re: Try to address the DMA-buf coherency problem

On 09.12.22 09:26, Tomasz Figa wrote:
[SNIP]
Although I think the most common case on mainstream Linux today is
properly allocating for device X (e.g. V4L2 video decoder or DRM-based
GPU) and hoping that other devices would accept the buffers just fine,
which isn't a given on most platforms (although often it's just about
pixel format, width/height/stride alignment, tiling, etc. rather than
the memory itself). That's why ChromiumOS has minigbm and Android has
gralloc that act as the central point of knowledge on buffer
allocation.

Yeah, completely agree. The crux is really that we need to improve those user space allocators by providing more information from the kernel.

2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-buf heaps, userspace can currently only
guess whether normal, CMA or even device-specific memory is needed.
DMA-buf heaps actually make it even more difficult for the userspace,
because now it needs to pick the right heap. With allocation built
into the specific UAPI (like V4L2), it's at least possible to allocate
for one specific device without having any knowledge about allocation
constraints in the userspace.

Yes, same as what Daniel said as well. We need to provide more hints from the kernel about which allocator to use.

So if a device driver uses cached system memory on an architecture where
some devices can't access it, the right approach is clearly to reject the
access.
I'd like to accent the fact that "requires cache maintenance" != "can't access".
Well that depends. As said above the exporter exports the buffer as it
was allocated.

If that means the exporter provides a piece of memory which requires
CPU cache snooping to access correctly, then the best thing we can do is
to prevent an importer which can't do this from attaching.
Could you elaborate more about this case? Does it exist in practice?
Do I assume correctly that it's about sharing a buffer between one DMA
engine that is cache-coherent and another that is non-coherent, where
the first one ends up having its accesses always go through some kind
of a cache (CPU cache, L2/L3/... cache, etc.)?
Yes, exactly that. What happens in this particular use case is that one
device driver wrote to its internal buffer with the CPU (so some cache
lines were dirty) and then a device which couldn't deal with that tried
to access it.
If so, shouldn't that driver surround its CPU accesses with
begin/end_cpu_access() in the first place?

The problem is that the roles are reversed. The callbacks let the exporter know that it needs to flush the caches when the importer is done accessing the buffer with the CPU.

But here the exporter is the one accessing the buffer with the CPU and the importer then accesses stale data because it doesn't snoop the caches.

What we could do is to force all exporters to use begin/end_cpu_access() even on their own buffers and look at all the importers when the access is completed. But that would increase the complexity of the handling in the exporter.

In other words we would then have code in the exporters which is only written for handling the constraints of the importers. This has a wide variety of consequences, especially that this functionality of the exporter can't be tested without a proper importer.

I was also thinking about reversing the role of exporter and importer in the kernel, but came to the conclusion that doing this under the hood without userspace knowing about it is probably not going to work either.

The case that I was suggesting was of a hardware block that actually
sits behind the CPU cache and thus dirties it on writes, not the
driver doing that. (I haven't personally encountered such a system,
though.)

Never heard of anything like that either, but who knows.

We could say that all device drivers must always look at the coherency
of the devices which want to access their buffers. But that would
horribly complicate things for maintaining the drivers because then
drivers would need to take into account requirements from other drivers
while allocating their internal buffers.
I think it's partially why we have the allocation part of the DMA
mapping API, but currently it's only handling requirements of one
device. And we don't have any information from the userspace what
other devices the buffer would be used with...

Exactly that, yes.

Actually, do we even have such information in the userspace today?
Let's say I do a video call in a web browser on a typical Linux
system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
V4L2 camera fills in buffers with video frames and both encoder and
display consume them. Do we have a central place which would know that
a buffer needs to be allocated that works with the producer and all
consumers?

Both X11 and Wayland have protocols which can be used to display a certain DMA-buf handle; their feedback packets contain information about how ideal your configuration is, e.g. whether the DMA-buf handle could be used directly or an extra copy was needed, etc.

Something similar exists between VAAPI and V4L2 as far as I know, but as you noted here as well, it's indeed more about negotiating pixel format, stride, padding, alignment, etc.

The best we can do is to reject combinations which won't work in the kernel and then userspace could react accordingly.

That's essentially the reason why we have DMA-buf heaps. Those heaps
expose system memory buffers with certain properties (size, CMA, DMA bit
restrictions etc...) and also implement the callback functions for CPU
cache maintenance.

How do DMA-buf heaps change anything here? We already have CPU cache
maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
(which arguably shouldn't actually do CPU cache maintenance, unless
that's implied by how either of the involved DMA engines work).

DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should make sure that the allocated memory fits the needs of everyone. In addition, calls to the DMA API cache management functions are inserted whenever CPU access begins/ends.

That's the best we can do for system memory sharing, only device specific memory can't be allocated like this.

We do have the system and CMA DMA-buf heaps for cases where a device
driver wants to access the buffer with caches enabled. So this is not a
limitation in functionality, it's just a matter of using it correctly.

V4L2 also has the ability to allocate buffers and map them with caches enabled.
Yeah, when that's a requirement for the V4L2 device it also makes
perfect sense.

It's just that we shouldn't force any specific allocation behavior on a
device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use
videobuf2-dma-contig because some other device needs CMA. Instead we
should use the common DMA-buf heaps which implement this as neutral
supplier of system memory buffers.
Agreed, but that's only possible if we have a central entity that
understands what devices the requested buffer would be used with. My
understanding is that we're nowhere close to that with mainstream
Linux today.

// As a side note, videobuf2-dma-contig can actually allocate
discontiguous memory, if the device is behind an IOMMU. Actually with
the recent DMA API improvements, we could probably coalesce
vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

That would probably make life a little bit simpler, yes.

The problem is that in this particular case the exporter provides the
buffer as is, e.g. with dirty CPU caches. And the importer can't deal
with that.
Why does the exporter leave the buffer with dirty CPU caches?
Because exporters always export the buffers as they would use them. And in
this case that means writing to them with the CPU.

Sorry for the question not being very clear. I meant: How do the CPU
caches get dirty in that case?

The exporter wrote to it. As far as I understand the exporter just copies things from A to B with memcpy to construct the buffer content.

[SNIP]
Yes, totally agree. The problem is really that we moved a bunch of MM and
DMA functions into one API.

The bounce buffers are something we should really have in a separate
component.

Then the functionality of allocating system memory for a specific device
or devices should be something provided by the MM.

And finally, making this memory or any other CPU address accessible to a
device (IOMMU programming etc.) should then be part of the DMA API.

Remember that actually making the memory accessible to a device often
needs to be handled already as a part of the allocation (e.g. dma_mask
in the non-IOMMU case). So I tend to think that the current division
of responsibilities is mostly fine - the dma_alloc family should be
seen as a part of MM already, especially with all the recent
improvements from Christoph, like dma_alloc_pages().

Yes, that's indeed a very interesting development which as far as I can see goes in the right direction.

That said, it may indeed make sense to move things like IOMMU mapping
management out of the dma_alloc() and just reduce those functions to
simply returning a set of pages that satisfy the allocation
constraints. One would need to call dma_map() after the allocation,
but that's actually a fair expectation. Sometimes it might be
preferred to pre-allocate the memory, but only map it into the device
address space when it's really necessary.

What I'm still missing is the functionality to allocate pages for multiple devices, and proper error codes when dma_map() can't make the page accessible to a device.

Regards,
Christian.

It's a use-case that is working fine today with many devices (e.g. network
adapters) in the ARM world, exactly because the architecture specific
implementation of the DMA API inserts the cache maintenance operations
on buffer ownership transfer.
Yeah, I'm perfectly aware of that. The problem is that exactly that
design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example. I can certainly say from experience that
this design was a really bad idea because it favors just one use case
while making other use cases really hard, if not practically impossible,
to implement.
Sorry, I haven't worked with Xen. Could you elaborate what's the
problem that this introduces for it?
That's a bit longer topic. The AMD Xen devs are already working on this
as far as I know. I can ping internally how far they got with sending
the patches out to avoid this problem.
Hmm, I see. It might be a good data point to understand in which
direction we should be going, but I guess we can wait until they send
some patches.
There was just recently a longer thread on the amd-gfx mailing list
about that. I think looking in there as well might be beneficial.
Okay, let me check. Thanks,

Best regards,
Tomasz



