On 21.07.20 12:47, Thomas Hellström (Intel) wrote:
On 7/21/20 11:50 AM, Daniel Vetter wrote:
On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
<thomas_os@xxxxxxxxxxxx> wrote:
On 7/21/20 10:55 AM, Christian König wrote:
On 21.07.20 10:47, Thomas Hellström (Intel) wrote:
On 7/21/20 9:45 AM, Christian König wrote:
On 21.07.20 09:41, Daniel Vetter wrote:
On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote:
Hi,
On 7/9/20 2:33 PM, Daniel Vetter wrote:
Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.
What I'm not sure about is whether the text should be more explicit in flat out mandating the amdkfd eviction fences for long-running compute workloads, or for workloads where userspace fencing is allowed.
Although (in my humble opinion) it might be possible to completely untangle kernel-introduced fences for resource management from dma-fences used for completion- and dependency tracking, and to lift a lot of restrictions on the dma-fences, including prohibiting infinite ones, I think this text makes sense as a description of the current state.
Yeah, I think a future patch needs to type up how we want to make that happen (for some cross-driver consistency) and what needs to be considered. Some of the necessary parts are already there (the preemption fences amdkfd has being one example), but I think some clear docs on what's required from hw, drivers and userspace would be really good.
I'm currently writing that up, but probably still need a few days
for this.
Great! I put down some (very) initial thoughts a couple of weeks ago, building on eviction fences for various hardware complexity levels, here:
https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
I don't think that this will ever be possible.
What Daniel describes in his text is that indefinite fences are a bad idea for memory management, and I think that is a settled fact.
In other words, the whole concept of submitting work to the kernel which depends on some userspace interaction doesn't work and never will.
Well, the idea here is that memory management will *never* depend on indefinite fences: as soon as someone waits on a memory manager fence (be it eviction, shrinker or mmu notifier), it breaks out of any dma-fence dependencies and/or userspace interaction. The text tries to describe what's required to be able to do that (save for non-preemptible gpus where someone submits a forever-running shader).
Yeah I think that part of your text is good to describe how to
untangle memory fences from synchronization fences given how much the
hw can do.
So while I think this is possible (until someone comes up with a case where it wouldn't work, of course), I guess Daniel has a point that it won't happen because of inertia, and there might be better options.
Yeah it's just I don't see much chance for splitting dma-fence itself.
Well, that's the whole idea with the timeline semaphores and waiting for a signal number to appear.
E.g. instead of doing the wait with the dma_fence, we separate that out into the timeline semaphore object.
This not only avoids the indefinite fence problem for the wait-before-signal case in Vulkan, but also prevents userspace from submitting stuff which can't be processed immediately.
That's also why I'm not positive on the "no hw preemption, only
scheduler" case: You still have a dma_fence for the batch itself,
which means still no userspace controlled synchronization or other
form of indefinite batches allowed. So not getting us any closer to
enabling the compute use cases people want.
What compute use case are you talking about? I'm only aware about the
wait before signal case from Vulkan, the page fault case and the KFD
preemption fence case.
Yes, we can't do magic. As soon as an indefinite batch makes it to
such hardware we've lost. But since we can break out while the batch
is stuck in the scheduler waiting, what I believe we *can* do with
this approach is to avoid deadlocks due to locally unknown
dependencies, which has some bearing on this documentation patch, and
also to allow memory allocation in dma-fence (not memory-fence)
critical sections, like gpu fault- and error handlers without
resorting to using memory pools.
Avoiding deadlocks is only the tip of the iceberg here.
When you allow the kernel to depend on user space to proceed with some
operation there are a lot more things which need consideration.
E.g. what happens when a userspace process which has submitted stuff to the kernel is killed? Are the prepared commands sent to the hardware, or aborted as well? What do we do with other processes waiting for that stuff?
How do we do resource accounting? When processes need to block while submitting stuff to the hardware which is not ready, we have a process we can punish for blocking resources. But how is kernel memory used for a submission accounted? How do we avoid denial-of-service attacks where somebody eats up all memory by doing submissions which can't finish?
But again. I'm not saying we should actually implement this. Better to
consider it and reject it than not consider it at all.
Agreed.
Same thing as it turned out with wait-before-signal for Vulkan: initially it looked simpler to do it in the kernel, but as far as I know the solution in userspace now works so well that we don't really want the pain of a kernel implementation any more.
Christian.
/Thomas