Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Tue, 13 Nov 2018 17:10:11 +0000

Quoting Tvrtko Ursulin (2018-11-13 10:24:43)
> 
> On 09/11/2018 11:44, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
> >>
> >> On 08/11/2018 16:48, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
> >>>>
> >>>> On 08/11/2018 08:17, Chris Wilson wrote:
> >>>>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
> >>>>> in the shrinker while performing direct-reclaim. The trade-off being
> >>>>> (much) lower latency for non-i915 clients at an increased risk of being
> >>>>> unable to obtain a page from direct-reclaim without hitting the
> >>>>> oom-notifier. The proviso being that we still keep trying to hard
> >>>>> obtain the lock for oom so that we can reap under heavy memory pressure.
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>>>> ---
> >>>>>     drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
> >>>>>     1 file changed, 11 insertions(+), 13 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>>>> index ea90d3a0d511..d461f458f4af 100644
> >>>>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>>>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>>>> @@ -36,7 +36,9 @@
> >>>>>     #include "i915_drv.h"
> >>>>>     #include "i915_trace.h"
> >>>>>     
> >>>>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>>>> +static bool shrinker_lock(struct drm_i915_private *i915,
> >>>>> +                       unsigned int flags,
> >>>>> +                       bool *unlock)
> >>>>>     {
> >>>>>         switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
> >>>>>         case MUTEX_TRYLOCK_RECURSIVE:
> >>>>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>>>>     
> >>>>>         case MUTEX_TRYLOCK_FAILED:
> >>>>>                 *unlock = false;
> >>>>> -             preempt_disable();
> >>>>> -             do {
> >>>>> -                     cpu_relax();
> >>>>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
> >>>>> -                             *unlock = true;
> >>>>> -                             break;
> >>>>> -                     }
> >>>>> -             } while (!need_resched());
> >>>>> -             preempt_enable();
> >>>>> +             if (flags & I915_SHRINK_ACTIVE) {
> >>>>
> >>>> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
> >>>> in the normal case (direct reclaim?) or oom, we bail out on the first
> >>>> sign of struct mutex contention. Doesn't this make our shrinker much
> >>>> less effective at runtime and why is that OK?
> >>>
> >>> As I said, it's a tradeoff between blocking others for _several_
> >>> _seconds_ and making no progress and returning immediately and making no
> >>> progress. My argument is along the lines of if direct-reclaim is running
> >>> in another process and something else is engaged in the driver hopefully
> >>> the driver will be cleaning up as it goes along or else what remains is
> >>> active and won't be reaped anyway. If direct reclaim is failing, the
> >>> delay before trying the oom path is insignificant.
> >>
> >> What was the rationale behind busy looping there btw?
> > 
> > Emulating the optimistic spin for mutex (my patches to expose it from
> > kernel/locking were kept hidden for public decency). My thinking was the
> > exact opposite to this patch, that direct reclaim was of paramount
> > importance and spending the time to try and ensure we grabbed the
> > struct_mutex to search for some pages to free was preferable.
> > 
> > It's just on the basis of looking at the actual syslatency and realising
> > the cause is this spinner, I want to swing the axe in other direction.
> > 
> > (There's probably a compromise, but honestly I'd prefer to sell the
> > struct_mutex free version of the shrinker first :)
> > 
> >> Compared to
> >> perhaps an alternative of micro-sleeps and trying a few times? I know it
> >> would be opposite from what this patch is trying to achieve, I Just
> >> don't had a good judgment on what makes most sense for the shrinker. Is
> >> it better to perhaps try a little bit harder instead of giving up
> >> immediately, but try a little bit harder in a softer way? Or that ends
> >> up blocking the callers and has the same effect of making no progress?
> > 
> > Exactly. We can definitely measure the impact of the spinner on
> > unrelated processes, but detecting the premature allocation failure is
> > harder (we wait for more dmesg-warns). The compromise that I've tried to
> > reach here is that if direct-reclaim isn't enough, then we should still
> > try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
> > vulnerable to not shrinking i915, but a worthwhile compromise as it's
> > allowed to fail?)
> > 
> >>>> Or in other words, for what use cases, tests or benchmark was the
> >>>> existing approach of busy looping a problem?
> >>>
> >>> Do something like 'find / -exec cat' while running i915 and see how long
> >>> you have to wait for a khungtaskd :|
> >>
> >> I couldn't reproduce anything strange with this. Assuming you meant
> >> something like -exec cat { } \; >dev/null.
> >>
> >> Either case I think explanations like this should go into the commit
> >> message.
> > 
> > Weird, I spent so long over the last few weeks talking about the impact
> > on gem_syslatency, I thought it was mentioned here.
> > 
> >>> gem_syslatency reports max latencies of over 600s, and I'm sure it's
> >>> pretty much unbounded. It's bad enough that I expect we are the cause of
> >>> significant hitching (mainly in other tasks) on the desktop running at
> >>> memory capacity.
> >>
> >> Running gem_syslatency in parallel to the find, or gem_syslatency -b in
> >> parallel or not did not do anything.
> > 
> > gem_syslatency -b -m
> > 
> > +------------------------------------------------------------------------+
> > |      x                                                                 |
> > |      x                                                                 |
> > .
> > .
> > .
> > |      x                                                                 |
> > |      x xxx      xx x           x  x      x                            x|
> > ||_____M_A_______|                                                       |
> > +------------------------------------------------------------------------+
> >      N           Min           Max        Median           Avg        Stddev
> > x 120       2284928  6.752085e+08       3385097      20825362      80352645
> > 
> > Unpatched, and similar have been observed on ivb/byt/bdw/bsw/skl/kbl and
> > while we were fighting ksoftirqd. 675s maximum latency for a RT thread to wake up.
> > 
> > Patched, max is around 261ms and order of magnitude better than the best
> > previously!
> 
> Yes I have reproduced the ~20x better max latency myself. Mean latency 
> stays roughly the same but has less variance. From that point of view 
> this is very good.
> 
> > 
> >> Then I tries gem_shrink but that doesn't seem to be making any progress
> >> with or without the patch. But it's not even hitting the i915 hard, but
> >> perhaps I need to turn off lockdep.. Which would be a bummer since I
> >> wanted to have a lockdep to check the next bit..
> > 
> > gem_shrink never has completed... It's not quite the right test for this
> > as we need to hit the shrinker & oom on both i915 and non-i915 paths to
> > try and hit all the different lock chains.
> 
> Ok. :)
> 
> > 
> >>>>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
> >>>>> +                                       I915_MM_SHRINKER);
> >>>>
> >>>> _nested is needed since abandoning trylock would otherwise cause lockdep
> >>>> issues? But is it really safe? I don't know.. how can it be? It is
> >>>> guaranteed to be a different process here otherwise the result wouldn't
> >>>> be MUTEX_TRYLOCK_FAILED.
> >>>
> >>> The easy option was to only force the mutex_lock for kswapd. However,
> >>> noting that we do need to forcibly shrink before oom, I opted for the
> >>> more inclusive attempt to take it in both. For the oom approach, my only
> >>> handwaving is we shouldn't be nested under oom serialisation and so
> >>> avoid an ABBA deadlock. It's survived some oom abuse, but the machine
> >>> still becomes unresponsive (but pingable) eventually (as it does before
> >>> the patch).
> >>
> >> ...which I completely did not understand! :(
> >>
> >> When you say kswapd - that means mutex->owner != current in the shrinker
> >> and otherwise it is direct reclaim?
> >>
> >> If true, how is the nested annotation correct for the kswapd case?
> > 
> > We have 3 paths,
> > direct-reclaim / oom from under i915 + struct_mutex
> > direct-reclaim / oom from elsewher > kswapd
> 
> Is the kernel guaranteeing only one reclaim path at a time can happen? 
> In other words if it is doing direct reclaim, kswapd won't enter the 
> same shrinking notifier and vice versa?

direct-reclaim is parallelised; the shrink_slab is only guarded by a
read lock (to prevent the shrinker lists being modified). Multiple
callers into direct-reclaim is allowed, and the caller is not always
guaranteed of obtaining the pages they themselves freed. Afaik, there's
no lock there to cause contention and inversion for us.

> > The nested annotation doesn't mean that the lock is nested per-se, it
> > just means that here it is acting as a different subclass of the lock.
> > Since the kswapd path can't have struct_mutex, it is safe from being
> > confused. The complication is if there is serialisation on
> > direct-reclaim (which I think can be run concurrently, i.e. no
> > mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
> > guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
> > under struct_mutex while fighting an oom in another process.
> 
> When you say direct reclaim can run concurrently, in that case we would 
> have one thread potentially holding the struct_mutex and entering 
> shrinker. Second thread would grab the mutex in the shrinker with the 
> nested annotation. But it is not nested but overlapping. And you say 
> nested is effectively a hack to enable different lock classes.
> 
> Actually in this case it is the same situation for any two simultaneous 
> threads in direct reclaim, even if none holds struct mutex on entry. But 
> in this case the lockdep annotation is also not needed.
> 
> Hm.. maybe to turn the question on it's head.. if you had a normal 
> mutex_lock in there, what deadlock would lockdep notice?

The direct-reclaim paths are explicitly marked up so as to avoid taking
a mutex inside the shrinker that may be held while allocating. That's
the contention we avoid by special casing the recursive mutex; but it
also means we can't use a plain mutex_lock() for the cases where we are
happy that we won't deadlock. Hence why previously it was a trylock to
avoid the same lockdep warning.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx