On Mon, Jul 04, 2022 at 03:48:03PM +0200, Christian König wrote:
> Hi Daniel,
>
> Am 25.06.22 um 00:02 schrieb Daniel Vetter:
> > On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
> > > Am 23.06.22 um 13:27 schrieb Daniel Stone:
> > > > [SNIP]
> > > > If it's really your belief that dmabuf requires universal snooping, I
> > > > recommend you send the patch to update the documentation, as well as
> > > > to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
> > >
> > > Well, to be honest I think that would indeed be necessary.
> > >
> > > What we have created are essentially two different worlds, one for PCI
> > > devices and one for the rest.
> > >
> > > This was indeed not the intention, but it's a fact that basically all
> > > DMA-buf based PCI drivers assume coherent access.
> >
> > dma-buf does not require universal snooping.
> >
> > It does de facto require that all device access is coherent with all
> > other device access, and consistent with the exporter's notion of how
> > cpu coherency is achieved. Note that coherent does not mean snooping:
> > as long as all devices do unsnooped access and the exporter either does
> > wc/uc or flushes caches, that's perfectly fine, and that's how all the
> > arm soc dma-buf sharing works.
>
> We should probably start documenting that better.

Agreed :-) Are you volunteering to type up something that reflects the
current sorry state of affairs? I'm not sure I'm the best person for that
since I guess I've been too badly involved in this ...

> > We did originally have the wording in there that you have to map/unmap
> > around every device access, but that got dropped because no one was
> > doing that anyway.
> >
> > Now where this totally breaks down is how we make this work, because
> > the idea was that dma_buf_attach validates this all. Where this means
> > all the hilarious reasons buffer sharing might not work:
> > - wrong coherency mode (cpu cached or not)
> > - not contiguous (we do check that, but only once we get the sg from
> >   dma_buf_attachment_map, which strictly speaking is a bit too late,
> >   but most drivers do attach&map as one step so not that bad in
> >   practice)
> > - whether the dma api will throw in bounce buffers or not
> > - random shit like "oh this is in the wrong memory bank", which I think
> >   never landed in upstream
> >
> > p2p connectivity is about the only one that gets this right, yay. And
> > the only reason we can even get it right is because all the information
> > is exposed to drivers fully.
>
> Yeah, that's why I designed P2P that way :)
>
> I also don't think it's that bad, at least for radeon, nouveau and amdgpu
> all the migration restrictions are actually handled correctly.
>
> In other words when a DMA-buf is about to be used by another device we
> use TTM to move the buffer around so that it can actually be accessed by
> that device.
>
> What I haven't foreseen in here is that we need to deal with different
> caching behaviors between exporter and importer.

Yeah, we should have done caching explicitly and full opt-in like with
p2p. The trouble is that this would have been a multi-year fight with dma
api folks, who insist it must be all transparent. So the politically
clever thing was to just ignore the problem and land dma-buf, but it comes
back to bite us now :-/
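
For concreteness, this is roughly the importer-side pattern in question:
you only find out whether sharing actually works (e.g. whether the mapping
is contiguous) after attach and map. A minimal sketch, assuming the
non-dynamic dma-buf attach API; my_check_sharing and the single-segment
requirement are made up for illustration:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

static int my_check_sharing(struct device *dev, struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	int ret = 0;

	/* attach is where coherency/placement would ideally be validated */
	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	/* ... but only the mapping tells us what we actually got */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		ret = PTR_ERR(sgt);
		goto out_detach;
	}

	/* e.g. a device that needs contiguous memory can only check here */
	if (sgt->nents != 1)
		ret = -EINVAL;

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
out_detach:
	dma_buf_detach(dmabuf, attach);
	return ret;
}

A p2p-aware importer would instead (if I remember the current API
correctly) use dma_buf_dynamic_attach() with
dma_buf_attach_ops.allow_peer2peer set and then look at attach->peer2peer,
which is exactly the explicit opt-in model discussed above.
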
> > The issue is that the device dma api refuses to share this information
> > because it would "leak". Which sucks, because we have de facto built
> > every single cross-device use-case of dma-buf on the assumption we can
> > check this (up to gl/vk specs), but oh well.
> >
> > So in practice this gets sorted out by endless piles of hacks to make
> > individual use-cases work.
> >
> > Oh and: This is definitely not limited to arm socs. x86 socs with intel
> > at least have exactly all the same issues, and they get solved by
> > adding various shitty hacks to the involved drivers (like i915+amdgpu).
> > Luckily the intel camera driver isn't in upstream yet, since that would
> > break a bunch of the hacks, because suddenly there would be two cpu
> > cache incoherent devices in an x86 system.
> >
> > Ideally someone fixes this, but I'm not hopeful.
> >
> > I recommend pouring more drinks.
> >
> > What is definitely not correct is claiming that dma-buf wasn't meant
> > for this. We discussed cache coherency issues endlessly in Budapest 12
> > or so years ago, I was there. It's just that the reality of the current
> > implementation is falling short, and every time someone tries to fix it
> > we get shouted down by dma api maintainers for looking behind their
> > curtain.
>
> Well, that explains it; I joined the party a year later and haven't
> witnessed all of this.

Yay, cleared up another confusion!

> > tldr; You have to magically know not to use cpu cached allocators on
> > these machines.
>
> Or reject the attachment. As far as I can see that is still the cleanest
> option.

Yeah, rejecting is always an ok thing if it just doesn't work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
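
For reference, a sketch of what "reject the attachment" can look like on
the exporter side, assuming the upstream dma_buf_ops.attach hook;
my_exporter_attach and the dev_is_pci() test are made-up placeholders for
whatever real platform/coherency knowledge the exporter has:

#include <linux/dma-buf.h>
#include <linux/errno.h>
#include <linux/pci.h>

/* made-up stand-in for real platform/coherency knowledge */
static bool my_importer_is_compatible(struct device *importer)
{
	/* e.g. only PCI peers are assumed to snoop in this example */
	return dev_is_pci(importer);
}

static int my_exporter_attach(struct dma_buf *dmabuf,
			      struct dma_buf_attachment *attach)
{
	/*
	 * Fail early if the importing device cannot cope with this
	 * buffer's caching mode, instead of handing out a mapping that
	 * is going to be incoherent in practice.
	 */
	if (!my_importer_is_compatible(attach->dev))
		return -EOPNOTSUPP;

	return 0;
}

static const struct dma_buf_ops my_exporter_ops = {
	.attach		= my_exporter_attach,
	/* .map_dma_buf, .unmap_dma_buf, .release, ... elided */
};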