Re: [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Quoting Christian König (2020-06-26 12:35:30)
> Am 26.06.20 um 13:10 schrieb Chris Wilson:
> > Quoting Christian König (2020-06-26 09:54:19)
> > [SNIP]
> >> In other words "fence -> userspace -> page fault -> fence" or "fence ->
> >> userspace -> system call -> fence" can easily cause the same problem and
> >> that is not avoidable.
> >>
> >>> An example
> >>>
> >>> Thread A                              Thread B
> >>>
> >>>        submit(VkCmdWaitEvents)
> >>>        recvfrom(ThreadB)       ...     sendto(ThreadB)
> >>>                                        \- alloc_page
> >>>                                         \- direct reclaim
> >>>                                          \- dma_fence_wait(A)
> >>>        VkSetEvent()
> >>>
> >>> Regardless of that actual deadlock, waiting on an arbitrary fence incurs
> >>> an unbounded latency which is unacceptable for direct reclaim.
> >>>
> >>> Online debugging can indefinitely suspend fence signaling, and the only
> >>> guarantee we make of forward progress, in some cases, is process
> >>> termination.
> >> And exactly that is what doesn't work. You don't have any forward
> >> progress any more because you ran into a software deadlock.
> > Only one side is halted. Everything on that side comes to a grinding
> > halt.
> >
> > What about checkpoint/restore, suspend/resume? Where we need to suspend
> > all execution, move all the resources to one side, then put everything
> > back, without cancelling the fences. Same halting problem, no?
> 
> What are you talking about? Of course we either wait for all fences to 
> complete or cancel them on suspend.

I do not want to have to cancel incomplete fences as we do today.
I want to restore the suspended execution back to waiting on its
VkEvent.

> > We also do similar for resets. Suspend the hanging context, move it and
> > all dependent execution off to one side; record what we can, clean up
> > what we have to, then move what remains of the execution back to finish
> > signaling.
> 
> Yes, but this is not possible in this situation. In the bad case you 
> have a kernel deadlock and that can't be cleaned up in any way.

Fences are not disturbed in this process.
> 
> The only solution left in that situation is to reset the system or at 
> least reload the kernel and that is not acceptable.
> 
> >> In other words the signaling of a fence depends on the welfare of
> >> userspace. You can try to kill userspace, but this can wait for the
> >> fence you try to signal in the first place.
> > The only scenario that fits what you are describing here [userspace
> > ignoring a signal] is if you used an uninterruptible wait. Under what
> > circumstances during normal execution would you do that? If it is
> > someone else's wait, that is a bug outside of our control.
> 
> Uninterruptible waits are a necessity.
> 
> Just take a look at the dma_fence_wait() interface. Why do you think we
> have the ability to wait uninterruptibly there?
>
> We need this when there is no other way of recovering. For example when 
> operations are already partially flushed to the hardware and can't be 
> aborted any more.

So why wait in the middle of submission, rather than defer the submission
to the fence callback if the HW wasn't ready? You then have your
uninterruptible continuation.
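
Roughly what I have in mind, as a sketch only; struct submit,
submit_to_hw() and defer_submit() are illustrative names, not actual
i915 code, and fence reference counting is elided:

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct submit {
        struct dma_fence_cb cb;
        struct work_struct work;
        /* ... the remainder of the request to flush to HW ... */
};

static void submit_to_hw(struct work_struct *work)
{
        struct submit *s = container_of(work, struct submit, work);

        /* continue the submission from a process-independent context */
        (void)s;
}

/* may run from the fence's signaling context, so only queue the work */
static void submit_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
        struct submit *s = container_of(cb, struct submit, cb);

        schedule_work(&s->work);
}

static void defer_submit(struct submit *s, struct dma_fence *fence)
{
        INIT_WORK(&s->work, submit_to_hw);

        /* already signaled? run the continuation immediately */
        if (dma_fence_add_callback(fence, &s->cb, submit_fence_cb))
                schedule_work(&s->work);
}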

> > But if you have chosen to cancel the fences, there is nothing to stop
> > the signaling.
> 
> And just to repeat myself: You can't cancel the fence!
> 
> For example assume that canceling the proxy fence would mean that you 
> send a SIGKILL to the process which issued it. But then you need to wait 
> for the SIGKILL to be processed.

What? Where does SIGKILL come from for fence handling?

The proxy fence is force signaled in an error state (e.g. -ETIMEDOUT),
every waiter then inherits the error state and all of their waiters down
the chain. Those waiters are now presumably ready to finish their own
signaling.

The proxy fence is constructed to always complete if it does not get
resolved; after resolution, the onus is on the real fence to complete.

The same as handling any other error or context cancellation during
fence submission.
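
As a minimal sketch of that path (proxy_fence_cancel() is just a name
for this illustration, not an existing interface, and it assumes the
proxy embeds an ordinary struct dma_fence):

#include <linux/dma-fence.h>

static void proxy_fence_cancel(struct dma_fence *proxy, int error)
{
        /* record e.g. -ETIMEDOUT; waiters see it via dma_fence_get_status() */
        dma_fence_set_error(proxy, error);

        /* wake every waiter and run their callbacks; the chain unblocks */
        dma_fence_signal(proxy);
}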
 
> Now what can happen is that the process is uninterruptibly waiting for
> something which then needs the SIGKILL to be delivered -> kernel deadlock.
> 
> >> See the difference to a deadlock on the GPU is that you can always
> >> kill a running job or process even if it is stuck with something else.
> >> But if the kernel is deadlocked with itself you can't kill the process
> >> any more, the only option left to get cleanly out of this is to reboot
> >> the kernel.
> > However, I say that is under our control. We know what fences are in an
> > execution context, just as easily as we know that we are inside an
> > execution context. And yes, the easiest, the most restrictive way to
> > control it is to say don't bother.
> 
> No, that is absolutely not under our control.
> 
> dma_fences need to be waited on under a lot of different contexts,
> including the reclaim path as well as the MMU notifiers, memory pressure 
> callbacks, OOM killer....

Oh yes, they are under our control. That list boils down to reclaim,
since mmu notifiers outside of reclaim are outside of a nested context.

That in particular is the same old question as whether GFP_IO should be
a gfp_t or live in the task_struct. If we are inside an execution context,
we can track that, and the fences, on the task_struct if we wanted to,
avoiding reclaim waiting on fences being used by the outer context and its
descendants...
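
For reference, the task_struct side of that precedent already exists for
GFP_IO; a fence-context marker is hypothetical, but it would take the
same shape as memalloc_noio_save()/memalloc_noio_restore():

#include <linux/sched/mm.h>

static void example_noio_section(void)
{
        unsigned int noio_flags;

        /* reclaim entered from this task now avoids the IO path */
        noio_flags = memalloc_noio_save();

        /* ... allocations that must not recurse into IO ... */

        memalloc_noio_restore(noio_flags);
}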

But as we have stated multiple times now, and as I thought you had agreed
for the VkEvents example, one cannot wait inside direct reclaim. Not least
because the latency in doing so impacts other users, sometimes severely.

That pushes the burden onto kswapd to make objects reclaimable, and onto
the driver in general not to hold onto objects beyond their use.

> Just see Daniel's patches on the lockdep fence signaling annotations and
> the problems that work has bubbled up.
> 
> >> The only way to avoid this would be to never ever wait for the fence in
> >> the kernel and then your whole construct is not useful any more.
> > I advocate for moving as much as is feasible into the parallelised
> > pipeline, for some waits are required by userspace as a necessary evil.
> >
> >> I'm running out of ideas how to explain what the problem is here....
> > Oh, we agree on the problem; we appear to disagree on whether the implicit
> > waits themselves are a serious existing problem, one worth the effort to
> > avoid or, at least, mitigate.
> 
> No, as far as I can see you don't seem to either understand the problem 
> or the implications of it.
> 
> The only way to solve this would be to audit the whole Linux kernel and 
> remove all uninterruptible waits and that is not feasible.
> 
> As long as you don't provide me with a working solution to the problem
> I've outlined here, the whole approach is a clear NAK since it will allow
> really bad kernel deadlocks to be created.

You are confusing multiple things here. The VkEvents example is real.
How do you avoid that deadlock? We avoid it by not waiting in direct
reclaim.

It has also shown up any waits in our submit ioctl [prior to fence
publication, I might add] as potential deadlocks with userspace.
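
To make the reclaim side concrete, this is the shape of a shrinker scan
that skips busy objects instead of waiting; struct obj and its fields
are illustrative only, not the i915 shrinker:

#include <linux/dma-fence.h>
#include <linux/list.h>

struct obj {
        struct list_head link;
        struct dma_fence *fence; /* last activity on the object */
};

static unsigned long scan_objects(struct list_head *shrink_list)
{
        struct obj *o, *next;
        unsigned long freed = 0;

        list_for_each_entry_safe(o, next, shrink_list, link) {
                /* still busy? skip it; never dma_fence_wait() here */
                if (o->fence && !dma_fence_is_signaled(o->fence))
                        continue;

                list_del(&o->link);
                dma_fence_put(o->fence);
                /* ... release the backing pages; freeing of 'o' elided ... */
                freed++;
        }

        return freed;
}
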
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx