Re: [Linaro-mm-sig] Re: DMA-buf and uncached system memory

Christian König <christian.koenig@xxxxxxx> · Wed, 10 Aug 2022 07:55:15 +0200

On 09.08.22 at 16:46, Daniel Vetter wrote:
On Mon, Jul 04, 2022 at 03:48:03PM +0200, Christian König wrote:
Hi Daniel,

On 25.06.22 at 00:02, Daniel Vetter wrote:
On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
On 23.06.22 at 13:27, Daniel Stone wrote:
[SNIP]
If it's really your belief that dmabuf requires universal snooping, I
recommend you send the patch to update the documentation, as well as
to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.

What we have created are essentially two different worlds, one for PCI
devices and one for the rest.

This was indeed not the intention, but it's a fact that basically all
DMA-buf based PCI drivers assume coherent access.
dma-buf does not require universal snooping.

It does de facto require that all device access is coherent with all other
device access, and consistent with the exporter's notion of how cpu
coherency is achieved. Note that coherent does not mean snooping: as long
as all devices do unsnooped access and the exporter either does wc/uc or
flushes caches, that's perfectly fine, and it's how all the arm soc dma-buf
sharing works.
We should probably start documenting that better.
Agreed :-)

Are you volunteering to type up something that reflects the current sorry
state of affairs? I'm not sure I'm the best since I guess I've been too
badly involved in this ...

Yeah, already working on this. But you know, normal human being with two 
hands and one head.

With all the workload I'm pretty sure people would have cloned me by now 
if tech would be just a bit more advanced.

Christian.

We did originally have the wording in there that you have to map/unmap
around every device access, but that got dropped because no one was doing
that anyway.

Now where this totally breaks down is how we make this work, because the
idea was that dma_buf_attach validates this all. Here "this" means all the
hilarious reasons buffer sharing might not work:
- wrong coherency mode (cpu cached or not)
- not contiguous (we do check that, but only once we get the sg from
    dma_buf_map_attachment, which strictly speaking is a bit too late but
    most drivers do attach&map as one step so not that bad in practice)
- whether the dma api will throw in bounce buffers or not
- random shit like "oh this is in the wrong memory bank", which I think
    never landed in upstream

p2p connectivity is about the only one that gets this right, yay. And the
only reason we can even get it right is because all the information is
exposed to drivers fully.
Yeah, that's why I designed P2P that way :)

I also don't think it's that bad, at least for radeon, nouveau and amdgpu
all the migration restrictions are actually handled correctly.

In other words when a DMA-buf is about to be used by another device we use
TTM to move the buffer around so that it can actually be accessed by that
device.

What I haven't foreseen in here is that we need to deal with different
caching behaviors between exporter and importer.
Yeah we should have done caching explicitly and full opt-in like with p2p.
The trouble is that this would have been a multi-year fight with dma api
folks, who insist it must be all transparent. So the politically clever
thing was to just ignore the problem and land dma-buf, but it comes back
to bite us now :-/

The issue is that the device dma api refuses to share this information
because it would "leak". Which sucks, because we have de facto built every
single cross-device use-case of dma-buf on the assumption we can check
this (up to gl/vk specs), but oh well.

So in practice this gets sorted out by endless piles of hacks to make
individual use-cases work.

Oh and: This is definitely not limited to arm socs. x86 socs with intel
at least have exactly the same issues, and they get solved by adding
various shitty hacks to the involved drivers (like i915+amdgpu). Luckily
the intel camera driver isn't in upstream yet, since that would break a
bunch of the hacks, because suddenly there would be 2 cpu cache
incoherent devices in an x86 system.

Ideally someone fixes this, but I'm not hopeful.

I recommend pouring more drinks.

What is definitely not correct is claiming that dma-buf wasn't meant for
this. We discussed cache coherency issues endlessly in Budapest 12 or so
years ago, I was there. It's just that the reality of the current
implementation is falling short, and every time someone tries to fix it we
get shouted down by dma api maintainers for looking behind their curtain.
Well that explains this, I joined the party a year later and didn't
witness all of this.
Yay, cleared up another confusion!

tldr; You have to magically know to not use cpu cached allocators on these
machines.
Or reject the attachment. As far as I can see that is still the cleanest
option.
Yeah rejecting is always an ok thing if it just doesn't work.
-Daniel
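
For readers following along, the coherency rule and the "reject the
attachment" option discussed above can be sketched as a small user-space
model. This is only an illustration under stated assumptions: every type
and function name below (exporter_model, importer_model,
model_attach_check) is hypothetical and is not part of the kernel's
dma-buf or DMA API, and a real attach callback would have to get this
information from somewhere, which is exactly the problem the thread
describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model. None of these names exist in the
 * kernel; they only illustrate the rule discussed in the thread.
 */
enum cpu_mapping {
    CPU_CACHED,   /* exporter maps the buffer CPU cached */
    CPU_WC_OR_UC  /* exporter maps the buffer write-combined or uncached */
};

struct exporter_model {
    enum cpu_mapping cpu_mapping;
    bool flushes_caches;     /* exporter flushes CPU caches around device access */
};

struct importer_model {
    bool snoops_cpu_caches;  /* device DMA snoops the CPU caches */
};

/*
 * Returns 0 if the importer can share the buffer coherently under the
 * modelled rule, -1 if the attachment should be rejected.
 */
int model_attach_check(const struct exporter_model *exp,
                       const struct importer_model *imp)
{
    assert(exp != NULL && imp != NULL);

    /* A snooping device observes CPU caches, so a cached buffer is fine. */
    if (imp->snoops_cpu_caches)
        return 0;

    /* Unsnooped access is fine if the CPU never caches the buffer ... */
    if (exp->cpu_mapping == CPU_WC_OR_UC)
        return 0;

    /* ... or if the exporter flushes caches around device access. */
    if (exp->flushes_caches)
        return 0;

    /* CPU cached, no flushing, unsnooped device: reject the attachment. */
    return -1;
}
```

In this model an unsnooped importer attaching to a CPU cached exporter
that does no flushing is the only rejected combination; it deliberately
ignores the other failure reasons listed earlier (contiguity, bounce
buffers, memory banks).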