Re: [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 13 Nov 2018 10:24:43 +0000

On 09/11/2018 11:44, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-11-09 07:30:34)

On 08/11/2018 16:48, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-11-08 16:23:08)

On 08/11/2018 08:17, Chris Wilson wrote:
Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
in the shrinker while performing direct-reclaim. The trade-off being
(much) lower latency for non-i915 clients at an increased risk of being
unable to obtain a page from direct-reclaim without hitting the
oom-notifier. The proviso being that we still keep trying to hard
obtain the lock for oom so that we can reap under heavy memory pressure.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
---
    drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
    1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
index ea90d3a0d511..d461f458f4af 100644
--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -36,7 +36,9 @@
    #include "i915_drv.h"
    #include "i915_trace.h"
    
-static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
+static bool shrinker_lock(struct drm_i915_private *i915,
+                       unsigned int flags,
+                       bool *unlock)
    {
        switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
        case MUTEX_TRYLOCK_RECURSIVE:
@@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
    
        case MUTEX_TRYLOCK_FAILED:
                *unlock = false;
-             preempt_disable();
-             do {
-                     cpu_relax();
-                     if (mutex_trylock(&i915->drm.struct_mutex)) {
-                             *unlock = true;
-                             break;
-                     }
-             } while (!need_resched());
-             preempt_enable();
+             if (flags & I915_SHRINK_ACTIVE) {

So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
in the normal case (direct reclaim?) or oom, we bail out on the first
sign of struct mutex contention. Doesn't this make our shrinker much
less effective at runtime and why is that OK?

As I said, it's a tradeoff between blocking others for _several_
_seconds_ and making no progress and returning immediately and making no
progress. My argument is along the lines of if direct-reclaim is running
in another process and something else is engaged in the driver hopefully
the driver will be cleaning up as it goes along or else what remains is
active and won't be reaped anyway. If direct reclaim is failing, the
delay before trying the oom path is insignificant.

What was the rationale behind busy looping there btw?

Emulating the optimistic spin for mutex (my patches to expose it from
kernel/locking were kept hidden for public decency). My thinking was the
exact opposite to this patch, that direct reclaim was of paramount
importance and spending the time to try and ensure we grabbed the
struct_mutex to search for some pages to free was preferable.

It's just on the basis of looking at the actual syslatency and realising
the cause is this spinner, I want to swing the axe in other direction.

(There's probably a compromise, but honestly I'd prefer to sell the
struct_mutex free version of the shrinker first :)

Compared to
perhaps an alternative of micro-sleeps and trying a few times? I know it
would be opposite from what this patch is trying to achieve, I Just
don't had a good judgment on what makes most sense for the shrinker. Is
it better to perhaps try a little bit harder instead of giving up
immediately, but try a little bit harder in a softer way? Or that ends
up blocking the callers and has the same effect of making no progress?

Exactly. We can definitely measure the impact of the spinner on
unrelated processes, but detecting the premature allocation failure is
harder (we wait for more dmesg-warns). The compromise that I've tried to
reach here is that if direct-reclaim isn't enough, then we should still
try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
vulnerable to not shrinking i915, but a worthwhile compromise as it's
allowed to fail?)

Or in other words, for what use cases, tests or benchmark was the
existing approach of busy looping a problem?

Do something like 'find / -exec cat' while running i915 and see how long
you have to wait for a khungtaskd :|

I couldn't reproduce anything strange with this. Assuming you meant
something like -exec cat { } \; >dev/null.

Either case I think explanations like this should go into the commit
message.

Weird, I spent so long over the last few weeks talking about the impact
on gem_syslatency, I thought it was mentioned here.

gem_syslatency reports max latencies of over 600s, and I'm sure it's
pretty much unbounded. It's bad enough that I expect we are the cause of
significant hitching (mainly in other tasks) on the desktop running at
memory capacity.

Running gem_syslatency in parallel to the find, or gem_syslatency -b in
parallel or not did not do anything.

gem_syslatency -b -m

+------------------------------------------------------------------------+
|      x                                                                 |
|      x                                                                 |
.
.
.
|      x                                                                 |
|      x xxx      xx x           x  x      x                            x|
||_____M_A_______|                                                       |
+------------------------------------------------------------------------+
     N           Min           Max        Median           Avg        Stddev
x 120       2284928  6.752085e+08       3385097      20825362      80352645

Unpatched, and similar have been observed on ivb/byt/bdw/bsw/skl/kbl and
while we were fighting ksoftirqd. 675s maximum latency for a RT thread to wake up.

Patched, max is around 261ms and order of magnitude better than the best
previously!

Yes I have reproduced the ~20x better max latency myself. Mean latency 
stays roughly the same but has less variance. From that point of view 
this is very good.


Then I tries gem_shrink but that doesn't seem to be making any progress
with or without the patch. But it's not even hitting the i915 hard, but
perhaps I need to turn off lockdep.. Which would be a bummer since I
wanted to have a lockdep to check the next bit..

gem_shrink never has completed... It's not quite the right test for this
as we need to hit the shrinker & oom on both i915 and non-i915 paths to
try and hit all the different lock chains.

Ok. :)


+                     mutex_lock_nested(&i915->drm.struct_mutex,
+                                       I915_MM_SHRINKER);

_nested is needed since abandoning trylock would otherwise cause lockdep
issues? But is it really safe? I don't know.. how can it be? It is
guaranteed to be a different process here otherwise the result wouldn't
be MUTEX_TRYLOCK_FAILED.

The easy option was to only force the mutex_lock for kswapd. However,
noting that we do need to forcibly shrink before oom, I opted for the
more inclusive attempt to take it in both. For the oom approach, my only
handwaving is we shouldn't be nested under oom serialisation and so
avoid an ABBA deadlock. It's survived some oom abuse, but the machine
still becomes unresponsive (but pingable) eventually (as it does before
the patch).

...which I completely did not understand! :(

When you say kswapd - that means mutex->owner != current in the shrinker
and otherwise it is direct reclaim?

If true, how is the nested annotation correct for the kswapd case?

We have 3 paths,
direct-reclaim / oom from under i915 + struct_mutex
direct-reclaim / oom from elsewher > kswapd

Is the kernel guaranteeing only one reclaim path at a time can happen? 
In other words if it is doing direct reclaim, kswapd won't enter the 
same shrinking notifier and vice versa?


The nested annotation doesn't mean that the lock is nested per-se, it
just means that here it is acting as a different subclass of the lock.
Since the kswapd path can't have struct_mutex, it is safe from being
confused. The complication is if there is serialisation on
direct-reclaim (which I think can be run concurrently, i.e. no
mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
under struct_mutex while fighting an oom in another process.

When you say direct reclaim can run concurrently, in that case we would 
have one thread potentially holding the struct_mutex and entering 
shrinker. Second thread would grab the mutex in the shrinker with the 
nested annotation. But it is not nested but overlapping. And you say 
nested is effectively a hack to enable different lock classes.

Actually in this case it is the same situation for any two simultaneous 
threads in direct reclaim, even if none holds struct mutex on entry. But 
in this case the lockdep annotation is also not needed.

Hm.. maybe to turn the question on it's head.. if you had a normal 
mutex_lock in there, what deadlock would lockdep notice?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx