Re: [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2/20/20 9:08 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:
On 2/20/20 7:04 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
On 2/18/20 10:01 PM, Daniel Vetter wrote:
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
<thomas_os@xxxxxxxxxxxx> wrote:
On 2/17/20 6:55 PM, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
++++++++++++++++++++-
     drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
     2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
drm_device *dev, struct dma_buf *dma_buf)
        return ERR_PTR(ret);
     }

+/**
+ * amdgpu_dma_buf_move_notify - &attach.move_notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that
the we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+            return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+            DRM_ERROR("Failed to invalidate DMA-buf
import (%d))\n", r);
+            return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+            struct amdgpu_vm *vm = bo_base->vm;
+            struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+            if (ticket) {
Yeah so this is kinda why I've been a total pain about the
exact semantics
of the move_notify hook. I think we should flat-out require
that importers
_always_ have a ticket attach when they call this, and that
they can cope
with additional locks being taken (i.e. full EDEADLCK) handling.

Simplest way to force that contract is to add a dummy 2nd
ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the
end, it helps
      needless backoffs. I've played around a bit with that
but not even poc
      level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582


      Idea is essentially to track a list of objects we had to
lock as part of
      the ttm_bo_validate of the main object.

- Second one is if we get a EDEADLCK on one of these
sublocks (like the
      one here). We need to pass that up the entire callchain,
including a
      temporary reference (we have to drop locks to do the
ww_mutex_lock_slow
      call), and need a custom callback to drop that temporary reference
      (since that's all driver specific, might even be
internal ww_mutex and
      not anything remotely looking like a normal dma_buf).
This probably
      needs the exec util helpers from ttm, but at the
dma_resv level, so that
      we can do something like this:

struct dma_resv_ticket {
         struct ww_acquire_ctx base;

         /* can be set by anyone (including other drivers)
that got hold of
          * this ticket and had to acquire some new lock. This
lock might
          * protect anything, including driver-internal stuff, and isn't
          * required to be a dma_buf or even just a dma_resv. */
         struct ww_mutex *contended_lock;

         /* callback which the driver (which might be a dma-buf exporter
          * and not matching the driver that started this
locking ticket)
          * sets together with @contended_lock, for the main
driver to drop
          * when it calls dma_resv_unlock on the contended_lock. */
         void (drop_ref*)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects
that could force
a ww_mutex_lock_slow).

Just a short comment on this:

Neither the currently used wait-die or the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For
wait-die it's
just very convenient since it makes us sleep instead of spinning with
-EDEADLK on the contended lock. For wound-wait IIRC one could just
immediately restart the whole locking transaction after an
-EDEADLK, and
the transaction would automatically end up waiting on the contended
lock, provided the mutex lock stealing is not allowed. There is however
a possibility that the transaction will be wounded again on another
lock, taken before the contended lock, but I think there are ways to
improve the wound-wait algorithm to reduce that probability.

So in short, choosing the wound-wait algorithm instead of wait-die and
perhaps modifying the ww mutex code somewhat would probably help
passing
an -EDEADLK up the call chain without requiring passing the contended
lock, as long as each locker releases its own locks when receiving an
-EDEADLK.
Hm this is kinda tempting, since rolling out the full backoff tricker
across driver boundaries is going to be real painful.

What I'm kinda worried about is the debug/validation checks we're
losing with this. The required backoff has this nice property that
ww_mutex debug code can check that we've fully unwound everything when
we should, that we've blocked on the right lock, and that we're
restarting everything without keeling over. Without that I think we
could end up with situations where a driver in the middle feels like
handling the EDEADLCK, which might go well most of the times (the
deadlock will probably be mostly within a given driver, not across).
Right up to the point where someone creates a deadlock across drivers,
and the lack of full rollback will be felt.

So not sure whether we can still keep all these debug/validation
checks, or whether this is a step too far towards clever tricks.
I think we could definitely find a way to keep debugging to make sure
everything is unwound before attempting to restart the locking
transaction. But the debug check that we're restarting on the contended
lock only really makes sense for wait-die, (and we could easily keep it
for wait-die). The lock returning -EDEADLK for wound-wait may actually
not be the contending lock but an arbitrary lock that the wounded
transaction attempts to take after it is wounded.

So in the end IMO this is a tradeoff between added (possibly severe)
locking complexity into dma-buf and not being able to switch back to
wait-die efficiently if we need / want to do that.

/Thomas
And as a consequence an interface *could* be:

*) We introduce functions

void ww_acquire_relax(struct ww_acquire_ctx *ctx);
int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx);

that can be used instead of ww_mutex_lock_slow() in the absence of a
contending lock to avoid spinning on -EDEADLK. While trying to take the
contending lock is probably the best choice there are various second best
approaches that can be explored, for example waiting on the contending
acquire to finish or in the wound-wait case, perhaps do nothing. These
functions will also help us keep the debugging.
Hm ... I guess this could work. Trouble is, it only gets rid of the
slowpath locking book-keeping headaches, we still have quite a few others.

*) A function returning -EDEADLK to a caller *must* have already released
its own locks.
So this ties to another question, as in should these callbacks have to
drops the locks thei acquire (much simpler code) or not (less thrashing,
if we drop locks we might end up in a situation where threads thrash
around instead of realizing quicker that they're actually deadlocking and
one of them should stop and back off).
Hmm.. Could you describe such a thrashing case with an example?
Ignoring cross device fun and all that, just a simplified example of why
holding onto locks you've acquired for eviction is useful, at least in a
slow path.

- one thread trying to do an execbuf with a huge bo

vs.

- an entire pile of thread that try to do execbuf with just a few small bo

First thread is in the eviction loop, selects a bo, wins against all the
other thread since it's been doing this forever already, gets the bo moved
out, unlocks.

Since it's competing against lots of other threads with small bo, it'll
have to do that a lot of times. Often enough to create a contiguous hole.
If you have a smarter allocator that tries to create that hole more
actively, just assume that the single huge bo is a substantial part of
total vram.

The other threads will be quicker in cramming new stuff in, even if they
occasionally lose the ww dance against the single thread. So the big
thread livelocks.

If otoh the big thread would keep onto all the locks, eventually it have
the entire vram locked, and every other thread is guaranteed to lose
against it in the ww dance and queue up behind. And it could finally but
its huge bo into vram and execute.

Hmm, yes this indeed explains why it's beneficial in some cases to keep a number of  locks held across certain operations, but I still fail to see why we would like *all* locks held across the entire transaction? In the above case I'd envision us ending up with something like:

int validate(ctx, bo)
{

    for_each_suitable_bo_to_evict(ebo) {
        r = lock(ctx, ebo);
        if (r == EDEADLK)
            goto out_unlock;

        r = move_notify(ctx, ebo);// locks and unlocks GPU VM bo.
        if (r == EDEADLK)
            goto out_unlock;
        evict();
    }

    place_bo(bo);
    //Repeat until success.


out_unlock:
    for_each_locked_bo(ebo)
        unlock(ctx, ebo);
}


void command_submission()
{
    acquire_init(ctx);

restart:
    for_each_bo_in_cs(bo) {
        r = lock(ctx, bo);
        if (r == -EDEADLK)
            goto out_unreserve;
    }

    for_each_bo_in_cs(bo) {
        r = validate(ctx, bo);
        if (r == -EDEADLK)
            goto out_unreserve;
    };

    cs();

    for_each_bo_in_cs(bo)
        unlock(ctx, bo);

    acquire_fini(ctx);
    return 0;

out_unreserve:
    for_each_locked_bo()
        unlock(ctx, bo);

    acquire_relax();
    goto restart;
}



Vary example for multi-gpu and more realism, but that's roughly it.

Aside, a lot of the stuff Christian has been doing in ttm is to improve
the chances that the competing threads will hit one of the locked objects
of the big thread, and at least back off a bit. That's at least my
understanding of what's been happening.
-Daniel

OK unserstood. For vmwgfx the crude simplistic idea to avoid that situation has been to have an rwsem around command submission: When the thread with the big bo has run a full LRU worth of eviction without succeeding it would get annoyed and take the rwsem in write mode, blocking competing threads. But that would of course never work in a dma-buf setting, and IIRC the implementation is not complete either....

/Thomas


_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [AMD Graphics]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux