Re: [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

Christian König <christian.koenig@xxxxxxx> · Tue, 18 Feb 2020 11:42:58 +0100

Am 17.02.20 um 20:38 schrieb Daniel Vetter:
On Mon, Feb 17, 2020 at 7:58 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Am 17.02.20 um 18:55 schrieb Daniel Vetter:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
[SNIP]
And is also the sole reason why I started looking into the ww_mutex
cursor handling a while back (e.g. the initial version with the horrible
macro hack).

But this is really really hard to get right. So my thinking for now is
to push this series upstream to at least unblock my ongoing P2P work.
Hm, but at least the move_notify stuff and the locking nightmare
around that feels rushed if we just push that. Otoh it's indeed
getting painful, and we'll probably have another few rounds of
headaches to sort this all out. What about a

config EXPERIMENTAL_DYNAMIC_DMA_BUF
     depends on BROKEN

Wrapped around the new ->move_notify hook, plus all relevant code?

Oh, that is a really good idea and trivial to to do.

My initial thinking was to make all of this part of the core ww_mutex
implementation, but then I quickly found that this won't work.

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects that could force
a ww_mutex_lock_slow).
The next idea was to have it based on dma_resv objects, but as you also
figured out you then need to drop the reference to the contended lock
somehow...

So my current working plan was to use GEM object to avoid the callback...
I've heard noise that someone is looking into adding dynamic dma-buf
support to stuff like rdma drivers. Because interconnects and big
machines. Plus feels a bit awkward to mandate a gem library if you
want to use dynamic dma-buf support in your driver. Hence why I think
something around dma_resv (but with enough flexibility that it doesn't
insist that the contending lock must be a dma_resv itself).

Ok, good to know. So to hell with the idea of using a GEM object.

But this also means that we can't do this with a single drop_ref() 
callback in the context because the context might contains different 
objects of all kind.....

[SNIP]

Oh cool, I was waiting for the upload. Will watch asap. btw slides somewhere?

Attached.

[SNIP]
For the rather specific amdgpu case I could work around that by
utilizing the HMM work to invalidate page tables on the fly, but that
doesn't really help with memory management in general.
Yeah, so move_notify is maybe solveable with better hw and hmm,

You don't even need better hardware and HMM.

All you need to do is clever locking because in this case you will never 
export page tables to other devices.

  but
there's other scenarios where I think the cross-driver ww_mutex
locking will be needed, for fundamental reasons. Scenario:
- a bunch of gpus in pcie slots, all in the same machine
- because pcie is slot a nice interconnect (iirc you guys call yours
xgmi or something like that)
- working sets that are bigger than vram of a single gpu
- lots of buffer sharing ofc

Yeah, completely agree. The issue with the page tables is actually a 
rather specific use case.

1. So driver has an imported dma-buf, currently not mapped anywhere
because ti got thrown out (or first use).
2. Importer calls dma_buf_map_attachement
3. Exporter realizes there's a nice xgmi link and p2p would be much
better if that object is in vram.
4. Exporter does ttm_bo_validate or equivalent to get the bo into
vram, including eviction and lots of locking
5. In turn this might bite back to the importer through some
move_notify of objects still mapped, but at the end of the lru.

So ->move_notify might not be the worst, eventually I think we'll need
the full locking dance across drivers (or at least across drm_device
instances, there might be internal upcasting going on so you get your
buffers placed in the right vram and all that directly).

YES, EXACTLY! That's the reason why I'm working on this stuff and not 
try to get P2P/XGMI/etc.. upstream directly without it.

I mean using P2P without all this is certainly possible, but sooner or 
later your memory management will just fall apart.

So YES, I totally agree that we need some sort of GEM execution context
or something like this to lock buffers on the fly as we try to make room
for others.
So what's the plan? Merge current series (with the bikesheds address)
under this CONFIG_EXPERIMENTAL_DYN_DMABUF and then see where we land
from there? Trying to get all the pieces lined up out of tree feels
like it's going to be too much :-/

At least I hoped for something like that.

Developing this out of tree and especially since I have this only as a 
background task turned out to be delaying things over and over again.

This way I can get it upstream (even when still under experimental flag) 
and start to convince our internal team/customers that we should work on 
investing time into this.

Thanks,
Christian.

-Daniel

Regards,
Christian.

Cheers, Daniel

+                    /* When we get an error here it means that somebody
+                     * else is holding the VM lock and updating page tables
+                     * So we can just continue here.
+                     */
+                    r = dma_resv_lock(resv, ticket);
+                    if (r)
+                            continue;
+
+            } else {
+                    /* TODO: This is more problematic and we actually need
+                     * to allow page tables updates without holding the
+                     * lock.
+                     */
+                    if (!dma_resv_trylock(resv))
+                            continue;
+            }
+
+            r = amdgpu_vm_clear_freed(adev, vm, NULL);
+            if (!r)
+                    r = amdgpu_vm_handle_moved(adev, vm);
+
+            if (r && r != -EBUSY)
+                    DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
+                              r);
+
+            dma_resv_unlock(resv);
+    }
+}
+
   static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = {
+    .move_notify = amdgpu_dma_buf_move_notify
   };

   /**
@@ -489,7 +553,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev,
              return obj;

      attach = dma_buf_dynamic_attach(dma_buf, dev->dev,
-                                    &amdgpu_dma_buf_attach_ops, NULL);
+                                    &amdgpu_dma_buf_attach_ops, obj);
      if (IS_ERR(attach)) {
              drm_gem_object_put(obj);
              return ERR_CAST(attach);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 8ae260822908..8c480c898b0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -926,6 +926,9 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,
              return 0;
      }

+    if (bo->tbo.base.import_attach)
+            dma_buf_pin(bo->tbo.base.import_attach);
+
      bo->flags |= AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS;
      /* force to pin into visible video ram */
      if (!(bo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS))
@@ -1009,6 +1012,9 @@ int amdgpu_bo_unpin(struct amdgpu_bo *bo)

      amdgpu_bo_subtract_pin_size(bo);

+    if (bo->tbo.base.import_attach)
+            dma_buf_unpin(bo->tbo.base.import_attach);
+
      for (i = 0; i < bo->placement.num_placement; i++) {
              bo->placements[i].lpfn = 0;
              bo->placements[i].flags &= ~TTM_PL_FLAG_NO_EVICT;
--
2.17.1



Attachment:
TTM FOSDEM 2020.pdf

Description: Adobe PDF document
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel