Re: [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

On 26.06.20 at 13:10, Chris Wilson wrote:
> Quoting Christian König (2020-06-26 09:54:19)
>> [SNIP]
>> In other words "fence -> userspace -> page fault -> fence" or "fence ->
>> userspace -> system call -> fence" can easily cause the same problem and
>> that is not avoidable.

> An example
>
> Thread A                              Thread B
>
>        submit(VkCmdWaitEvents)
>        recvfrom(ThreadB)       ...     sendto(ThreadB)
>                                        \- alloc_page
>                                         \- direct reclaim
>                                          \- dma_fence_wait(A)
>        VkSetEvent()
>
> Regardless of that actual deadlock, waiting on an arbitrary fence incurs
> an unbounded latency which is unacceptable for direct reclaim.
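
A minimal C sketch of how Thread B ends up inside dma_fence_wait() (illustrative
only: the shrinker callback shape, SHRINK_STOP and dma_fence_wait() are the real
kernel API; example_shrink_scan() and example_pick_victim_fence() are invented
for the example):

#include <linux/dma-fence.h>
#include <linux/shrinker.h>

static struct dma_fence *example_pick_victim_fence(void); /* hypothetical */

static unsigned long example_shrink_scan(struct shrinker *shrinker,
                                         struct shrink_control *sc)
{
        /* Entered from direct reclaim, e.g. the alloc_page() in sendto(). */
        struct dma_fence *fence = example_pick_victim_fence();

        if (!fence)
                return SHRINK_STOP;

        /*
         * Unbounded, uninterruptible wait in reclaim context: progress now
         * depends on Thread A eventually calling VkSetEvent().
         */
        dma_fence_wait(fence, false);
        dma_fence_put(fence);

        return 1;
}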

>>> Online debugging can indefinitely suspend fence signaling, and the only
>>> guarantee we make of forward progress, in some cases, is process
>>> termination.
>> And exactly that is what doesn't work. You don't have any forward
>> progress any more because you ran into a software deadlock.
> Only one side is halted. Everything on that side comes to a grinding
> halt.

> What about checkpoint/restore, suspend/resume? Where we need to suspend
> all execution, move all the resources to one side, then put everything
> back, without cancelling the fences. Same halting problem, no?

What are you talking about? Of course we either wait for all fences to complete or cancel them on suspend.
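
A minimal sketch of that suspend-side rule (illustrative; dma_fence_wait_timeout(),
dma_fence_set_error() and dma_fence_signal() are the existing dma-fence API,
example_suspend_fence() is invented):

#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

static void example_suspend_fence(struct dma_fence *fence)
{
        long ret = dma_fence_wait_timeout(fence, false,
                                          msecs_to_jiffies(100));

        if (ret <= 0) {
                /* Did not complete in time: cancel by signalling an error. */
                dma_fence_set_error(fence, -ETIMEDOUT);
                dma_fence_signal(fence);
        }
}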

> We also do similar for resets. Suspend the hanging context, move it and
> all dependent execution off to one side; record what we can, clean up
> what we have to, then move what remains of the execution back to finish
> signaling.

Yes, but this is not possible in this situation. In the bad case you have a kernel deadlock and that can't be cleaned up in any way.

The only solution left in that situation is to reset the system or at least reload the kernel and that is not acceptable.

>> In other words the signaling of a fence depends on the welfare of
>> userspace. You can try to kill userspace, but this can wait for the
>> fence you try to signal in the first place.
> The only scenario that fits what you are describing here [userspace
> ignoring a signal] is if you used an uninterruptible wait. Under what
> circumstances during normal execution would you do that? If it's
> someone else's wait, a bug outside of our control.

Uninterruptible waits are a necessity.

Just take a look at the dma_fence_wait() interface. Why do you think we have the ability to wait uninterruptibly there?

We need this when there is no other way of recovering, for example when operations are already partially flushed to the hardware and can't be aborted any more.
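
For reference, a small sketch of the two modes the existing dma_fence_wait()
interface offers (the bool intr parameter is real, example_flush() is invented):

#include <linux/dma-fence.h>

static void example_flush(struct dma_fence *fence, bool interruptible)
{
        if (interruptible) {
                /* May return -ERESTARTSYS when a signal (even SIGKILL) arrives. */
                if (dma_fence_wait(fence, true) == -ERESTARTSYS)
                        return;
        } else {
                /*
                 * Uninterruptible: no signal gets through.  This is what a
                 * caller must use once the operation is already partially
                 * flushed to the hardware and cannot be unwound.
                 */
                dma_fence_wait(fence, false);
        }
}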

> But if you have chosen to cancel the fences, there is nothing to stop
> the signaling.

And just to repeat myself: You can't cancel the fence!

For example, assume that cancelling the proxy fence means sending a SIGKILL to the process which issued it. But then you need to wait for the SIGKILL to be processed.

Now what can happen is that the process is waiting uninterruptibly for something which in turn needs the SIGKILL to be delivered -> kernel deadlock.
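
A hedged sketch of that failure mode (the struct and helper are invented; the
wait_event()/wait_event_killable() semantics are the real ones):

#include <linux/wait.h>

struct example_ctx {
        wait_queue_head_t wq;
        bool event_set;         /* set by work that is stuck behind the fence */
};

static void example_block_until_event(struct example_ctx *ctx)
{
        /*
         * TASK_UNINTERRUPTIBLE: signals, including the SIGKILL meant to
         * "cancel" the proxy fence, are not even looked at until event_set
         * becomes true.  wait_event_killable() would at least let SIGKILL
         * through, but not every such wait can be made killable.
         */
        wait_event(ctx->wq, ctx->event_set);
}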

>> See the difference to a deadlock on the GPU is that you can always
>> kill a running job or process even if it is stuck on something else.
>> But if the kernel is deadlocked with itself you can't kill the process
>> any more; the only option left to get cleanly out of this is to reboot
>> the kernel.
> However, I say that is under our control. We know what fences are in an
> execution context, just as easily as we know that we are inside an
> execution context. And yes, the easiest, the most restrictive way to
> control it is to say don't bother.

No, that is absolutely not under our control.

dma_fences need to be waited on in a lot of different contexts, including the reclaim path as well as MMU notifiers, memory pressure callbacks, the OOM killer....

Just see Daniel's patches on the lockdep fence-signaling annotations and the problems this work has bubbled up.
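
A sketch of how those annotations are used (proposed in Daniel's series and
later merged as dma_fence_begin_signalling()/dma_fence_end_signalling();
example_submit() is invented):

#include <linux/dma-fence.h>

static void example_submit(struct dma_fence *fence)
{
        bool cookie = dma_fence_begin_signalling();

        /*
         * Everything in here must happen before the fence can signal, so
         * lockdep now reports any allocation or lock in this section that
         * could recurse back into dma_fence_wait().
         */

        dma_fence_signal(fence);
        dma_fence_end_signalling(cookie);
}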

>> The only way to avoid this would be to never ever wait for the fence in
>> the kernel and then your whole construct is not useful any more.
> I advocate for moving as much as is feasible, for some waits are required
> by userspace as a necessary evil, into the parallelised pipeline.

>> I'm running out of ideas how to explain what the problem is here....
> Oh we agree on the problem, we appear to disagree that the implicit waits
> themselves are a serious existent problem. That they are worth effort to
> avoid or, at least, mitigate.

No, as far as I can see you don't seem to understand either the problem or its implications.

The only way to solve this would be to audit the whole Linux kernel and remove all uninterruptible waits, and that is not feasible.

As long as you don't provide me with a working solution to the problem I've outlined here, the whole approach is a clear NAK, since it allows really bad kernel deadlocks to be created.

Sorry to say that, but this whole thing doesn't look like it has been thought through to the end. You should probably take a step back and talk to Daniel here.

Regards,
Christian.

> -Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



