Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

Thomas Hellström (Intel) <thomas_os@xxxxxxxxxxxx> · Wed, 22 Jul 2020 12:31:27 +0200

On 2020-07-22 11:45, Daniel Vetter wrote:
On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:

On 2020-07-22 09:11, Daniel Vetter wrote:
On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:
On 2020-07-22 00:45, Dave Airlie wrote:
On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:
On 7/21/20 9:45 AM, Christian König wrote:
Am 21.07.20 um 09:41 schrieb Daniel Vetter:
On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:
Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:
Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.
Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.
Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.
I'm currently writing that up, but probably still need a few days for
this.
Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.
Hmm, yes to begin with it's important to note that this is not a
replacement for new programming models or APIs, This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like
TTM, and the corresponding drivers.  The only cross-driver-like
visibility would be that the dma-buf move_notify() callback would not be
allowed to wait on dma-fences or something that depends on a dma-fence.
Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.
For non-preemptible hardware the memory fence typically *is* the
end-of-batch fence. (Unless, as documented, there is a scheduler
consuming sync-file dependencies in which case the memory fence wait
needs to be able to break out of that). The key thing is not that we can
break out of execution, but that we can break out of dependencies, since
when we're executing all dependecies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.

That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new&entirely free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to container that completely).
Sure, it's definitely a big chunk of work. I think the big win would be
allowing memory allocation in dma-fence critical sections. But I
completely buy the above argument. I just wanted to point out that many
of the dma-fence restrictions are IMHO fixable, should we need to do
that for whatever reason.
I'm still not sure that's possible, without preemption at least. We
have 4 edges:
- Kernel has internal depencies among memory fences. We want that to
allow (mild) amounts of overcommit, since that simplifies live so
much.
- Memory fences can block gpu ctx execution (by nature of the memory
simply not being there yet due to our overcommit)
- gpu ctx have (if we allow this) userspace controlled semaphore
dependencies. Of course userspace is expected to not create deadlocks,
but that's only assuming the kernel doesn't inject additional
dependencies. Compute folks really want that.
- gpu ctx can hold up memory allocations if all we have is
end-of-batch fences. And end-of-batch fences are all we have without
preempt, plus if we want backwards compat with the entire current
winsys/compositor ecosystem we need them, which allows us to inject
stuff dependent upon them pretty much anywhere.

Fundamentally that's not fixable without throwing one of the edges
(and the corresponding feature that enables) out, since no entity has
full visibility into what's going on. E.g. forcing userspace to tell
the kernel about all semaphores just brings up back to the
drm_timeline_syncobj design we have merged right now. And that's imo
no better.

Indeed, HW waiting for semaphores without being able to preempt that 
wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.

That's kinda why I'm not seeing much benefits in a half-way state:
Tons of work, and still not what userspace wants. And for the full
deal that userspace wants we might as well not change anything with
dma-fences. For that we need a) ctx preempt and b) new entirely
decoupled fences that never feed back into a memory fences and c) are
controlled entirely by userspace. And c) is the really important thing
people want us to provide.

And once we're ok with dma_fence == memory fences, then enforcing the
strict and painful memory allocation limitations is actually what we
want.

Let's hope you're right. My fear is that that might be pretty painful as 
well.

Cheers, Daniel

/Thomas