On Mon, Jul 04, 2022 at 03:48:03PM +0200, Christian König wrote:
> Hi Daniel,
>
> Am 25.06.22 um 00:02 schrieb Daniel Vetter:
> > On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
> > > Am 23.06.22 um 13:27 schrieb Daniel Stone:
> > > > [SNIP]
> > > > If it's really your belief that dmabuf requires universal snooping, I
> > > > recommend you send the patch to update the documentation, as well as
> > > > to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
> > >
> > > Well, to be honest I think that would indeed be necessary.
> > >
> > > What we have created are essentially two different worlds, one for PCI
> > > devices and one for the rest.
> > >
> > > This was indeed not the intention, but it's a fact that basically all
> > > DMA-buf based PCI drivers assume coherent access.
> >
> > dma-buf does not require universal snooping.
> >
> > It does de facto require that all device access is coherent with all
> > other device access, and consistent with the exporter's notion of how
> > cpu coherency is achieved. Note that coherent does not mean snooping:
> > as long as all devices do unsnooped access and the exporter either does
> > wc/uc or flushes caches, that's perfectly fine, and that's how all the
> > arm soc dma-buf sharing works.
>
> We should probably start documenting that better.

Agreed :-) Are you volunteering to type up something that reflects the
current sorry state of affairs? I'm not sure I'm the best person for that
since I guess I've been too badly involved in this ...

> > We did originally have the wording in there that you have to map/unmap
> > around every device access, but that got dropped because no one was
> > doing that anyway.
> >
> > Now where this totally breaks down is how we make this work, because
> > the idea was that dma_buf_attach validates this all. Where this means
> > all the hilarious reasons buffer sharing might not work:
> > - wrong coherency mode (cpu cached or not)
> > - not contiguous (we do check that, but only once we get the sg from
> >   dma_buf_attachment_map, which strictly speaking is a bit too late,
> >   but most drivers do attach&map as one step so not that bad in
> >   practice)
> > - whether the dma api will throw in bounce buffers or not
> > - random shit like "oh this is in the wrong memory bank", which I think
> >   never landed in upstream
> >
> > p2p connectivity is about the only one that gets this right, yay. And
> > the only reason we can even get it right is because all the information
> > is exposed to drivers fully.
>
> Yeah, that's why I designed P2P that way :)
>
> I also don't think it's that bad, at least for radeon, nouveau and amdgpu
> all the migration restrictions are actually handled correctly.
>
> In other words when a DMA-buf is about to be used by another device we
> use TTM to move the buffer around so that it can actually be accessed by
> that device.
>
> What I haven't foreseen in here is that we need to deal with different
> caching behaviors between exporter and importer.

Yeah, we should have done caching explicitly and full opt-in like with
p2p. The trouble is that this would have been a multi-year fight with dma
api folks, who insist it must be all transparent. So the politically
clever thing was to just ignore the problem and land dma-buf, but it comes
back to bite us now :-/
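
For concreteness, this is roughly the importer-side pattern in question:
you only find out whether sharing actually works (e.g. whether the mapping
is contiguous) after attach and map. A minimal sketch, assuming the
non-dynamic dma-buf attach API; my_check_sharing and the single-segment
requirement are made up for illustration:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

static int my_check_sharing(struct device *dev, struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	int ret = 0;

	/* attach is where coherency/placement would ideally be validated */
	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	/* ... but only the mapping tells us what we actually got */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		ret = PTR_ERR(sgt);
		goto out_detach;
	}

	/* e.g. a device that needs contiguous memory can only check here */
	if (sgt->nents != 1)
		ret = -EINVAL;

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
out_detach:
	dma_buf_detach(dmabuf, attach);
	return ret;
}

A p2p-aware importer would instead (if I remember the current API
correctly) use dma_buf_dynamic_attach() with
dma_buf_attach_ops.allow_peer2peer set and then look at attach->peer2peer,
which is exactly the explicit opt-in model discussed above.
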
> > The issue is that the device dma api refuses to share this information
> > because it would "leak". Which sucks, because we have de facto built
> > every single cross-device use-case of dma-buf on the assumption we can
> > check this (up to gl/vk specs), but oh well.
> >
> > So in practice this gets sorted out by endless piles of hacks to make
> > individual use-cases work.
> >
> > Oh and: This is definitely not limited to arm socs. x86 socs with intel
> > at least have exactly all the same issues, and they get solved by
> > adding various shitty hacks to the involved drivers (like i915+amdgpu).
> > Luckily the intel camera driver isn't in upstream yet, since that would
> > break a bunch of the hacks, because suddenly there would be two cpu
> > cache incoherent devices in an x86 system.
> >
> > Ideally someone fixes this, but I'm not hopeful.
> >
> > I recommend pouring more drinks.
> >
> > What is definitely not correct is claiming that dma-buf wasn't meant
> > for this. We discussed cache coherency issues endlessly in Budapest 12
> > or so years ago, I was there. It's just that the reality of the current
> > implementation is falling short, and every time someone tries to fix it
> > we get shouted down by dma api maintainers for looking behind their
> > curtain.
>
> Well, that explains it; I joined the party a year later and haven't
> witnessed all of this.

Yay, cleared up another confusion!

> > tldr; You have to magically know not to use cpu cached allocators on
> > these machines.
>
> Or reject the attachment. As far as I can see that is still the cleanest
> option.

Yeah, rejecting is always an ok thing if it just doesn't work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
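
For reference, a sketch of what "reject the attachment" can look like on
the exporter side, assuming the upstream dma_buf_ops.attach hook;
my_exporter_attach and the dev_is_pci() test are made-up placeholders for
whatever real platform/coherency knowledge the exporter has:

#include <linux/dma-buf.h>
#include <linux/errno.h>
#include <linux/pci.h>

/* made-up stand-in for real platform/coherency knowledge */
static bool my_importer_is_compatible(struct device *importer)
{
	/* e.g. only PCI peers are assumed to snoop in this example */
	return dev_is_pci(importer);
}

static int my_exporter_attach(struct dma_buf *dmabuf,
			      struct dma_buf_attachment *attach)
{
	/*
	 * Fail early if the importing device cannot cope with this
	 * buffer's caching mode, instead of handing out a mapping that
	 * is going to be incoherent in practice.
	 */
	if (!my_importer_is_compatible(attach->dev))
		return -EOPNOTSUPP;

	return 0;
}

static const struct dma_buf_ops my_exporter_ops = {
	.attach		= my_exporter_attach,
	/* .map_dma_buf, .unmap_dma_buf, .release, ... elided */
};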