Re: [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class

Thomas Hellström (Intel) <thomas_os@xxxxxxxxxxxx> · Wed, 29 Jul 2020 09:56:46 +0200

On 7/28/20 1:17 PM, Thomas Hellström (Intel) wrote:

On 7/16/20 5:53 PM, Tvrtko Ursulin wrote:

On 15/07/2020 16:43, Maarten Lankhorst wrote:
Op 15-07-2020 om 13:51 schreef Chris Wilson:
Our goal is to pull all memory reservations (next iteration
obj->ops->get_pages()) under a ww_mutex, and to align those 
reservations
with other drivers, i.e. control all such allocations with the
reservation_ww_class. Currently, this is under the purview of the
obj->mm.mutex, and while obj->mm remains an embedded struct we can
"simply" switch to using the reservation_ww_class obj->base.resv->lock

The major consequence is the impact on the shrinker paths as the
reservation_ww_class is used to wrap allocations, and a ww_mutex does
not support subclassing so we cannot do our usual trick of knowing 
that
we never recurse inside the shrinker and instead have to finish the
reclaim with a trylock. This may result in us failing to release the
pages after having released the vma. This will have to do until a 
better
idea comes along.

However, this step only converts the mutex over and continues to treat
everything as a single allocation and pinning the pages. With the
ww_mutex in place we can remove the temporary pinning, as we can then
reserve all storage en masse.

One last thing to do: kill the implict page pinning for active vma.
This will require us to invalidate the vma->pages when the backing 
store
is removed (and we expect that while the vma is active, we mark the
backing store as active so that it cannot be removed while the HW is
busy.)

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

[snip]

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
index dc8f052a0ffe..4e928103a38f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
@@ -47,10 +47,7 @@ static bool unsafe_drop_pages(struct 
drm_i915_gem_object *obj,
      if (!(shrink & I915_SHRINK_BOUND))
          flags = I915_GEM_OBJECT_UNBIND_TEST;
  -    if (i915_gem_object_unbind(obj, flags) == 0)
-        __i915_gem_object_put_pages(obj);
-
-    return !i915_gem_object_has_pages(obj);
+    return i915_gem_object_unbind(obj, flags) == 0;
  }
    static void try_to_writeback(struct drm_i915_gem_object *obj,
@@ -199,14 +196,14 @@ i915_gem_shrink(struct drm_i915_private *i915,
spin_unlock_irqrestore(&i915->mm.obj_lock, flags);
  -            if (unsafe_drop_pages(obj, shrink)) {
-                /* May arrive from get_pages on another bo */
-                mutex_lock(&obj->mm.lock);
+            if (unsafe_drop_pages(obj, shrink) &&
+                i915_gem_object_trylock(obj)) {

Why trylock? Because of the nesting? In that case, still use ww ctx 
if provided please

By "if provided" you mean for code paths where we are calling the 
shrinker ourselves, as opposed to reclaim, like shmem_get_pages?

That indeed sounds like the right thing to do, since all the 
get_pages from execbuf are in the reservation phase, collecting a 
list of GEM objects to lock, the ones to shrink sound like should be 
on that list.

+ __i915_gem_object_put_pages(obj);
                  if (!i915_gem_object_has_pages(obj)) {
                      try_to_writeback(obj, shrink);
                      count += obj->base.size >> PAGE_SHIFT;
                  }
-                mutex_unlock(&obj->mm.lock);
+                i915_gem_object_unlock(obj);
              }
                scanned += obj->base.size >> PAGE_SHIFT;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c 
b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
index ff72ee2fd9cd..ac12e1c20e66 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_tiling.c
@@ -265,7 +265,6 @@ i915_gem_object_set_tiling(struct 
drm_i915_gem_object *obj,
       * pages to prevent them being swapped out and causing 
corruption
       * due to the change in swizzling.
       */
-    mutex_lock(&obj->mm.lock);
      if (i915_gem_object_has_pages(obj) &&
          obj->mm.madv == I915_MADV_WILLNEED &&
          i915->quirks & QUIRK_PIN_SWIZZLED_PAGES) {
@@ -280,7 +279,6 @@ i915_gem_object_set_tiling(struct 
drm_i915_gem_object *obj,
              obj->mm.quirked = true;
          }
      }
-    mutex_unlock(&obj->mm.lock);
        spin_lock(&obj->vma.lock);
      for_each_ggtt_vma(vma, obj) {
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index e946032b13e4..80907c00c6fd 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -129,8 +129,15 @@ userptr_mn_invalidate_range_start(struct 
mmu_notifier *_mn,
          ret = i915_gem_object_unbind(obj,
                           I915_GEM_OBJECT_UNBIND_ACTIVE |
                           I915_GEM_OBJECT_UNBIND_BARRIER);

Question: What happens above if someone is preparing cs with the above 
object, is holding its reservation object and is just about to submit? 
Doesn't the ppgtt binding go away from under it?


-        if (ret == 0)
-            ret = __i915_gem_object_put_pages(obj);
+        if (ret == 0) {
+            /* ww_mutex and mmu_notifier is fs_reclaim tainted */
+            if (i915_gem_object_trylock(obj)) {
+                ret = __i915_gem_object_put_pages(obj);
+                i915_gem_object_unlock(obj);
+            } else {
+                ret = -EAGAIN;
+            }
+        }

I'm not sure upstream will agree with this kind of API:

1. It will deadlock when RT tasks are used.

It will or it can? Which part? It will break out of the loop if 
trylock fails.


2. You start throwing -EAGAIN because you don't have the correct 
ordering of locking, this needs fixing first.

Is it about correct ordering of locks or something else? If memory 
allocation is allowed under dma_resv.lock, then the opposite order 
cannot be taken in any case.

I've had a brief look at the amdgpu solution and maybe I 
misunderstood something, but it looks like a BKL approach with the 
device level notifier_lock. Their userptr notifier blocks on that 
one, not on dma_resv lock, but that also means their command 
submission (amdgpu_cs_submit) blocks on the same lock while obtaining 
backing store.

If I read Christian right, it blocks on that lock only just before 
command submission to validate that sequence number. If there is a 
mismatch, it needs to rerun CS. I'm not sure how common userptr 
buffers are, but if a device-wide mutex hurts too much, There are 
perhaps more fine-grained solutions. (Like an rw semaphore, and unlock 
before the fence wait in the notifier: CS which are unaffected 
shouldn't need to wait...).


So it looks like a big hammer approach not directly related to the 
story of dma_resv locking. Maybe we could do the same big hammer 
approach, although I am not sure how it is deadlock free.

What happens for instance if someone submits an userptr batch which 
gets unmapped while amdgpu_cs_submit is holding the notifier_lock?

My understanding is the unmapping operation blocks on the 
notifier_lock in the mmu notifier?

/Thomas



If you understand amdgpu better please share some insights. I 
certainly only looked at it briefly today so may be wrong.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx