Re: Try to address the DMA-buf coherency problem

Daniel Vetter <daniel@xxxxxxxx> · Tue, 22 Nov 2022 19:26:47 +0100

On Tue, 22 Nov 2022 at 18:34, Christian König <christian.koenig@xxxxxxx> wrote:
> On 22.11.22 at 15:36, Daniel Vetter wrote:
> > On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:
> >> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne <nicolas@xxxxxxxxxxxx> wrote:
> >>> On Thursday, 17 November 2022 at 13:10 +0100, Christian König wrote:
> >>>>>> DMA-Buf lets the exporter set up the DMA addresses the importer uses to
> >>>>>> be able to directly decide where a certain operation should go. E.g. we
> >>>>>> have cases where a P2P write doesn't even go to memory, but
> >>>>>> rather a doorbell BAR to trigger another operation. Throwing in CPU
> >>>>>> round trips for explicit ownership transfer completely breaks that
> >>>>>> concept.
> >>>>> It sounds like we should have a dma_dev_is_coherent_with_dev() which
> >>>>> accepts two (or an array?) of devices and tells the caller whether the
> >>>>> devices need explicit ownership transfer.
> >>>> No, exactly that's the concept I'm pushing back on very hard here.
> >>>>
> >>>> In other words explicit ownership transfer is not something we would
> >>>> want as a requirement in the framework, because otherwise we break tons of
> >>>> use cases which require concurrent access to the underlying buffer.
> >>> I'm not pushing for this solution, but really felt the need to correct you here.
> >>> I have quite some experience with ownership transfer mechanisms, as this is how
> >>> the GStreamer framework has worked since 2000. Concurrent access is a really common
> >>> use case and it is quite well defined in that context. The bracketing system (in
> >>> this case called map()/unmap(), with flags stating the usage intention like reads
> >>> and writes) is combined with the refcount. The basic rules are simple:
> >> This is all CPU oriented. I think Christian is talking about the case
> >> where ownership transfer happens without CPU involvement, such as via
> >> a GPU waiting on a fence.
> > Yeah for pure device2device handover the rule pretty much has to be that
> > any coherency management that needs to be done must be done from the
> > device side (flushing device side caches and stuff like that) only. But
> > under the assumption that _all_ cpu side management has been done already
> > before the first device access started.
> >
> > And then the map/unmap respectively begin/end_cpu_access can be used for
> > what it was meant for, with cpu side invalidation/flushing and stuff like
> > that, while having pretty clear handover/ownership rules and hopefully not
> > doing any unnecessary flushes. And all that while allowing device access to
> > be pipelined. Worst case the exporter has to insert some pipelined cache
> > flushes as a dma_fence pipelined work of its own between the device accesses
> > when moving from one device to the other. That last part sucks a bit right
> > now, because we don't have any dma_buf_attachment method which does this
> > syncing without recreating the mapping, but in reality this is solved by
> > caching mappings in the exporter (well dma-buf layer) nowadays.
> >
> > True concurrent access like vk/compute expects is still a model that
> > dma-buf needs to support on top, but that's a special case and pretty much
> > needs hw that supports such concurrent access without explicit handover
> > and fencing.
> >
> > Aside from some historical accidents and still a few warts, I do think
> > dma-buf does support both of these models.
>
> We should have come up with dma-heaps earlier and made it clear that
> exporting a DMA-buf from a device gives you something device specific
> which might or might not work with others.

Yeah, but engineering practicalities were pretty clear that no one
would rewrite the entire Xorg stack and all the drivers just to make
that happen for prime.

> Apart from that I agree, DMA-buf should be capable of handling this.
> The remaining question is which documentation is missing to make it clear
> how things are supposed to work?

Given the historical baggage of existing use-cases, I think the only
way out is that we look at concrete examples from real world users
that break, and figure out how to fix them without breaking any of
the existing mess.

One idea might be that we have a per-platform
dma_buf_legacy_coherency_mode(), which tells you what mode (cpu cache
snooping or uncached memory) you need to use to make sure that all
devices agree. On x86 the rule might be that it's cpu cache snooping
by default, but if you have an integrated gpu then everyone needs to
use uncached. That sucks, but at least we could keep the existing mess
going and clean it up. Everyone else would be uncached, except maybe
arm64 servers with pcie connectors. Essentially least common
denominator to make this work. Note that uncached actually means
device access doesn't snoop; the cpu side you can handle with either
uc/wc mappings or explicit flushing.
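
To make the idea a bit more concrete, here is a rough userspace sketch of
what such a per-platform query could look like. Nothing in it exists in the
kernel today: the enum, struct platform_info with its fields, and the
function body are all made up for illustration, and the rules encoded are
just the guesses from the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: the two legacy modes discussed above. */
enum dma_buf_coherency_mode {
	DMA_BUF_COHERENCY_SNOOPED,  /* device access snoops the cpu caches */
	DMA_BUF_COHERENCY_UNCACHED, /* device access does not snoop */
};

/* Hypothetical platform description, only for this sketch. */
enum platform_arch { ARCH_X86, ARCH_ARM64, ARCH_OTHER };

struct platform_info {
	enum platform_arch arch;
	bool has_integrated_gpu; /* only looked at on x86 */
	bool is_pcie_server;     /* e.g. arm64 server with pcie connectors */
};

/*
 * Least common denominator for the whole platform: pick the one mode
 * that every device sharing buffers can be assumed to agree on.
 */
static enum dma_buf_coherency_mode
dma_buf_legacy_coherency_mode(const struct platform_info *p)
{
	assert(p);

	switch (p->arch) {
	case ARCH_X86:
		/* Snooping by default, but an igpu forces everyone to uncached. */
		return p->has_integrated_gpu ? DMA_BUF_COHERENCY_UNCACHED
					     : DMA_BUF_COHERENCY_SNOOPED;
	case ARCH_ARM64:
		/* The "maybe" exception: servers with pcie connectors. */
		return p->is_pcie_server ? DMA_BUF_COHERENCY_SNOOPED
					 : DMA_BUF_COHERENCY_UNCACHED;
	default:
		/* Everyone else would be uncached. */
		return DMA_BUF_COHERENCY_UNCACHED;
	}
}
```

The point of having a single platform-wide answer is that drivers which
never heard of any negotiation still end up agreeing with each other.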

Then once we have that we could implement the coherency negotiation
protocol on top as an explicit opt-in, so that you can still use
coherent buffers across two pcie gpus even if you also have an
integrated gpu.
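
As a sketch of how the opt-in could sit on top of the legacy mode (all
names here are hypothetical, none of this is an existing dma-buf interface):
the new protocol only kicks in when every device attached to a buffer opts
in, otherwise the platform-wide legacy mode wins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum coherency_mode { COHERENCY_SNOOPED, COHERENCY_UNCACHED };

/* Hypothetical per-attachment info, only for this sketch. */
struct attached_dev {
	bool opts_in;   /* driver speaks the new negotiation protocol */
	bool can_snoop; /* device access can snoop the cpu caches */
};

/*
 * Pick the mode for one buffer. 'legacy' is whatever the platform-wide
 * legacy coherency mode says.
 */
static enum coherency_mode
pick_coherency_mode(const struct attached_dev *devs, size_t n,
		    enum coherency_mode legacy)
{
	assert(devs || n == 0);

	if (n == 0)
		return legacy;

	/* A single device that doesn't opt in forces the legacy mode. */
	for (size_t i = 0; i < n; i++)
		if (!devs[i].opts_in)
			return legacy;

	/* Everyone negotiates: snooped only if all attached devices can. */
	for (size_t i = 0; i < n; i++)
		if (!devs[i].can_snoop)
			return COHERENCY_UNCACHED;

	return COHERENCY_SNOOPED;
}
```

With two pcie gpus that both opt in and can snoop, this gives a snooped
buffer even on a platform where the legacy mode is uncached because of an
integrated gpu, as long as the igpu isn't attached to that buffer.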

Doing only the new protocol without some means to keep the existing
pile of carefully hacked up assumptions working would break things, and
we can't do that. Also I have no idea whether that global legacy device
coherency mode would work. Also we might need more than just
snooped/unsnooped, since depending upon architecture you might want to
only snoop one transaction (reads vs writes) instead of both of them:
If device writes snoop then cpu reads would never need to invalidate
caches beforehand, but cpu writes would still need to flush (which
gives you faster reads on the device side since those can still bypass
snooping). Some igpu platforms work like that, but I'm not sure
whether there's any other device that would care enough about these
for this to matter. Yes it's a hw mis-design (well I don't like it
personally), they fixed it :-)
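
Spelled out as a small table in code, assuming the cpu uses normal cached
mappings (the struct and function names are made up for this sketch, they
are not an existing interface):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: which device transactions snoop the cpu caches. */
struct snoop_caps {
	bool dev_writes_snoop;
	bool dev_reads_snoop;
};

/* Cache maintenance the cpu side has to do around its own access. */
struct cpu_sync_ops {
	bool invalidate_before_cpu_read; /* drop stale lines before reading */
	bool flush_after_cpu_write;      /* push dirty lines out to memory */
};

static struct cpu_sync_ops cpu_sync_needed(const struct snoop_caps *caps)
{
	assert(caps);

	struct cpu_sync_ops ops = {
		/*
		 * Snooping device writes keep the cpu caches up to date, so
		 * the cpu never reads stale lines.
		 */
		.invalidate_before_cpu_read = !caps->dev_writes_snoop,
		/*
		 * Non-snooping device reads go straight to memory, so dirty
		 * cpu lines have to be flushed first.
		 */
		.flush_after_cpu_write = !caps->dev_reads_snoop,
	};

	return ops;
}
```

The igpu case above is dev_writes_snoop = true, dev_reads_snoop = false:
no invalidate before cpu reads, but still a flush after cpu writes.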

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch