Quoting Christian König (2020-06-25 13:59:16)
> Am 25.06.20 um 14:48 schrieb Chris Wilson:
> > Quoting Christian König (2020-06-25 09:11:35)
> >> Am 24.06.20 um 22:18 schrieb Chris Wilson:
> >>> Quoting Dave Airlie (2020-06-24 20:04:02)
> >>>> On Wed, 24 Jun 2020 at 07:19, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>> Quoting Dave Airlie (2020-06-23 22:01:24)
> >>>>>> On Tue, 23 Jun 2020 at 20:03, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>> Quoting Thomas Hellström (Intel) (2020-06-23 10:33:20)
> >>>>>>>> Hi, Chris!
> >>>>>>>>
> >>>>>>>> On 6/22/20 11:59 AM, Chris Wilson wrote:
> >>>>>>>>> In order to actually handle eviction and what not, we need to process
> >>>>>>>>> all the objects together under a common lock, reservation_ww_class. As
> >>>>>>>>> such, do a memory reservation pass after looking up the object/vma,
> >>>>>>>>> which then feeds into the rest of execbuf [relocation, cmdparsing,
> >>>>>>>>> flushing and ofc execution].
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>>>>>>> ---
> >>>>>>>>>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 91 ++++++++++++++-----
> >>>>>>>>>  1 file changed, 70 insertions(+), 21 deletions(-)
> >>>>>>>>>
> >>>>>>>> Which tree is this against? The series doesn't apply cleanly against
> >>>>>>>> drm-tip?
> >>>>>>> It's continuing on from the scheduler patches, the bug fixes and the
> >>>>>>> iris-deferred-fence work. I thought throwing all of those old patches
> >>>>>>> into the pile would have been distracting.
> >>>>>>>
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>>> +static int eb_reserve_mm(struct i915_execbuffer *eb)
> >>>>>>>>> +{
> >>>>>>>>> +       const u64 idx = eb->context->timeline->fence_context;
> >>>>>>>>> +       struct ww_acquire_ctx acquire;
> >>>>>>>>> +       struct eb_vma *ev;
> >>>>>>>>> +       int err;
> >>>>>>>>> +
> >>>>>>>>> +       eb->mm_fence = __dma_fence_create_proxy(0, 0);
> >>>>>>>>> +       if (!eb->mm_fence)
> >>>>>>>>> +               return -ENOMEM;
> >>>>>>>> Where are the proxy fence functions defined?
> >>>>>>> In dma-fence-proxy.c ;)
> >>>>>> The dma-fence-proxy that Christian NAKed before?
> >>>>> I do not have an email from Christian about dma-fence-proxy in the last
> >>>>> 3 years it has been on the list.
> >>>> https://lore.kernel.org/dri-devel/aeb0373d-0583-d922-3b73-93668c27d177@amd.com/
> >>> Darn, I skimmed the thread title and thought it was just about the
> >>> timelines.
> >>>
> >>>> I'm assuming this was about patch 8 there, which to me looks like proxy
> >>>> fences, but maybe my threading is off reading that.
> >>> The deadlocks are easy to resolve. The fence is either signaled normally
> >>> by userspace; or they create a deadlock, which is rejected by checking the
> >>> DAG and the fence is signaled with an error (and the work cancelled, the
> >>> error propagated back to userspace if they kept the output fence around);
> >>> or userspace forgets entirely about the fence they were waiting on, in
> >>> which case it is signaled by closing the syncobjs [sadly not in error
> >>> though, I'm hoping to report EPIPE] on process termination.
> >> And exactly that concept is still a big NAK.
> >>
> >> The kernel memory management depends on dma_fences to be signaling as
> >> soon as they are existing.
> >>
> >> Just imagine what Daniel's dependency patches would splat out when you
> >> do something like this and correctly annotate the signaling code path.
> > Nothing at all. Forward progress of the waiter does not solely depend on
> > the signaler, just as in bc9c80fe01a2570a2fd78abbc492b377b5fda068.
> >
> >> Proxy fences, especially when they depend on userspace for signaling,
> >> are an absolute NO-GO.
> > We are in full control of the signaling and are able to cancel the pending
> > userspace operation, move it off to one side and shut down the HW,
> > whatever. We can and do do dependency analysis of the fence contexts to
> > avoid deadlocks, just as easily as detecting recursion.
> >
> > To claim that userspace is not already able to control signaling is a
> > false dichotomy. Userspace is fully able to lock the HW resources
> > indefinitely (even if you cap every job, one can always build a chain of
> > jobs to circumvent any imposed timeout; a couple of seconds of timeout
> > becomes several months of jobs before the GPU runs out of memory and is
> > unable to accept any more jobs). Any ioctl that blocks while holding a HW
> > resource renders itself liable to a user-controllable livelock; you know
> > this, because it is blocking the signaling of those earlier jobs.
> > Worrying about things that are entirely within our control, and hence
> > avoidable, misses the point.
>
> You are completely missing the problem here.
>
> As you correctly pointed out, a userspace thread blocking on
> something is perfectly acceptable. And that's how
> bc9c80fe01a2570a2fd78abbc492b377b5fda068 works as well.
>
> And bc9c80fe01a2570a2fd78abbc492b377b5fda068 only implements waiting, so
> that during CS or the WAIT IOCTL we can block for the fence to appear.
>
>
> What happens in your approach is that the kernel starts to wait for
> userspace in its memory reclaim path. That is exactly the kind of
> problem Daniel's patches now point out immediately.

No, we don't.
-Chris
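
For context on "Daniel's dependency patches" above: that is the dma-fence signalling lockdep annotation series. Below is a minimal sketch, under the assumption that the dma_fence_begin_signalling()/dma_fence_end_signalling() helpers from that series are available; the function name itself is made up for illustration. The point being argued is that everything between the two annotations must complete without waiting on memory reclaim or on a fence that may never signal, which is why a fence completed by userspace trips the annotations.

#include <linux/dma-fence.h>

/*
 * Illustrative sketch only: everything between the begin/end annotations
 * is treated by lockdep as being on the fence-signalling critical path,
 * so blocking there on reclaim (e.g. a GFP_KERNEL allocation) or on
 * another fence that may never signal is reported as a potential deadlock.
 */
static void example_signal_job_fence(struct dma_fence *fence)
{
        bool cookie;

        cookie = dma_fence_begin_signalling();

        /*
         * A kmalloc(size, GFP_KERNEL) here would splat: reclaim may itself
         * wait on dma_fences, so the signalling path must not wait on
         * reclaim, and certainly must not wait on userspace.
         */

        dma_fence_signal(fence);
        dma_fence_end_signalling(cookie);
}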
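
As for the DAG check Chris mentions for rejecting userspace-created deadlocks, the mail does not spell out the mechanism. The idea of refusing a dependency that would close a cycle between proxy fences can be sketched roughly as follows; every structure and function name here is hypothetical, purely for illustration, and is not the i915 implementation (locking omitted).

#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Hypothetical proxy fence that records which other proxies it waits on. */
struct proxy_fence {
        struct dma_fence base;
        struct list_head deps;          /* list of struct proxy_dep */
};

struct proxy_dep {
        struct proxy_fence *on;
        struct list_head link;
};

/* Depth-first walk: can @from reach @target through its dependencies? */
static bool proxy_reaches(struct proxy_fence *from, struct proxy_fence *target)
{
        struct proxy_dep *dep;

        if (from == target)
                return true;

        list_for_each_entry(dep, &from->deps, link)
                if (proxy_reaches(dep->on, target))
                        return true;

        return false;
}

/*
 * Refuse to let @waiter depend on @signaler if that would close a cycle;
 * the caller would then signal the fence with an error and cancel the work
 * instead of letting a deadlock form.
 */
static int proxy_add_dependency(struct proxy_fence *waiter,
                                struct proxy_fence *signaler)
{
        struct proxy_dep *dep;

        if (proxy_reaches(signaler, waiter))
                return -EDEADLK;

        dep = kmalloc(sizeof(*dep), GFP_KERNEL);
        if (!dep)
                return -ENOMEM;

        dep->on = signaler;
        list_add(&dep->link, &waiter->deps);
        return 0;
}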