On Wed, Jan 30, 2013 at 5:52 AM, Rob Clark <robdclark@xxxxxxxxx> wrote:
> On Wed, Jan 30, 2013 at 5:08 AM, Daniel Vetter <daniel@xxxxxxxx> wrote:
>> On Wed, Jan 30, 2013 at 2:07 AM, Rob Clark <robdclark@xxxxxxxxx> wrote:
>>> ==========================
>>> Basic problem statement:
>>> ----- ------- ---------
>>> GPUs do operations that commonly involve many buffers.  Those buffers
>>> can be shared across contexts/processes, exist in different memory
>>> domains (for example VRAM vs system memory), and so on.  And with
>>> PRIME / dmabuf, they can even be shared across devices.  So there are
>>> a handful of situations where the driver needs to wait for buffers to
>>> become ready.  If you think about this in terms of waiting on a buffer
>>> mutex for it to become available, this presents a problem because
>>> there is no way to guarantee that buffers appear in an execbuf/batch
>>> in the same order in all contexts.  That is directly under the control
>>> of userspace, and a result of the sequence of GL calls that an
>>> application makes, which results in the potential for deadlock.  The
>>> problem gets more complex when you consider that the kernel may need
>>> to migrate the buffer(s) into VRAM before the GPU operates on the
>>> buffer(s), which may in turn require evicting some other buffers (and
>>> you don't want to evict other buffers which are already queued up to
>>> the GPU), but for a simplified understanding of the problem you can
>>> ignore this.
>>>
>>> The algorithm that TTM came up with for dealing with this problem is
>>> quite simple.  For each group of buffers (execbuf) that need to be
>>> locked, the caller is assigned a unique reservation_id from a global
>>> counter.  In case of deadlock while locking all the buffers associated
>>> with an execbuf, the one with the lowest reservation_id wins, and the
>>> one with the higher reservation_id unlocks all of the buffers that it
>>> has already locked, and then tries again.
>>>
>>> Originally TTM implemented this algorithm on top of an event-queue
>>> and atomic-ops, but Maarten Lankhorst realized that by merging this
>>> with the mutex code we could take advantage of the existing mutex
>>> fast-path code and end up with a simpler solution, and so
>>> ticket_mutex was born.  (Well, there were also some additional
>>> complexities with the original implementation when you start adding
>>> in cross-device buffer sharing for PRIME..  Maarten could probably
>>> explain better.)
>>
>> I think the motivational writeup above is really nice, but the example
>> code below is a bit wrong
>>
>>> How it is used:
>>> --- -- -- -----
>>>
>>> A very simplified version:
>>>
>>>     int submit_execbuf(execbuf)
>>>     {
>>>         /* acquiring locks, before queuing up to GPU: */
>>>         seqno = assign_global_seqno();
>>>     retry:
>>>         for (buf in execbuf->buffers) {
>>>             ret = mutex_reserve_lock(&buf->lock, seqno);
>>>             switch (ret) {
>>>             case 0:
>>>                 /* we got the lock */
>>>                 break;
>>>             case -EAGAIN:
>>>                 /* someone with a lower seqno, so unreserve and try again: */
>>>                 for (buf2 in reverse order starting before buf in execbuf->buffers)
>>>                     mutex_unreserve_unlock(&buf2->lock);
>>>                 goto retry;
>>>             default:
>>>                 goto err;
>>>             }
>>>         }
>>>
>>>         /* now everything is good to go, submit job to GPU: */
>>>         ...
>>>     }
>>>
>>>     int finish_execbuf(execbuf)
>>>     {
>>>         /* when GPU is finished: */
>>>         for (buf in execbuf->buffers)
>>>             mutex_unreserve_unlock(&buf->lock);
>>>     }
>>> ==========================
>>
>> Since gpu command submission is all async (hopefully at least) we
>> don't unlock once it completes, but right away after the commands are
>> submitted.  Otherwise you wouldn't be able to submit new execbufs
>> using the same buffer objects (and besides, holding locks while going
>> back out to userspace is evil).
>
> right.. but I was trying to simplify the explanation for non-gpu
> folk.. maybe that was an over-simplification ;-)

Ok, a bit expanded version..  I meant to send this yesterday, but I
forgot..

============================
Basic problem statement:
----- ------- ---------
GPUs do operations that commonly involve many buffers.  Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on.  And with
PRIME / dmabuf, they can even be shared across devices.  So there are
a handful of situations where the driver needs to wait for buffers to
become ready.  If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch
in the same order in all contexts.  That is directly under the control
of userspace, and a result of the sequence of GL calls that an
application makes, which results in the potential for deadlock.  The
problem gets more complex when you consider that the kernel may need
to migrate the buffer(s) into VRAM before the GPU operates on the
buffer(s), which may in turn require evicting some other buffers (and
you don't want to evict other buffers which are already queued up to
the GPU), but for a simplified understanding of the problem you can
ignore this.

The algorithm that TTM came up with for dealing with this problem is
quite simple.  For each group of buffers (execbuf) that need to be
locked, the caller is assigned a unique reservation_id from a global
counter.  In case of deadlock while locking all the buffers associated
with an execbuf, the one with the lowest reservation_id wins, and the
one with the higher reservation_id unlocks all of the buffers that it
has already locked, and then tries again.

Originally TTM implemented this algorithm on top of an event-queue and
atomic-ops, but Maarten Lankhorst realized that by merging this with
the mutex code we could take advantage of the existing mutex fast-path
code and end up with a simpler solution, and so ticket_mutex was born.
(Well, there were also some additional complexities with the original
implementation when you start adding in cross-device buffer sharing
for PRIME..  Maarten could probably explain better.)

How it is used:
--- -- -- -----

A very simplified version:

    int lock_execbuf(execbuf)
    {
        struct buf *res_buf = NULL;

        /* acquiring locks, before queuing up to GPU: */
        seqno = assign_global_seqno();
    retry:
        for (buf in execbuf->buffers) {
            if (buf == res_buf) {
                res_buf = NULL;
                continue;
            }
            ret = mutex_reserve_lock(&buf->lock, seqno);
            if (ret < 0)
                goto err;
        }

        /* now everything is good to go, submit job to GPU: */
        ...

        return 0;

    err:
        for (all buf2 before buf in execbuf->buffers)
            mutex_unreserve_unlock(&buf2->lock);
        if (res_buf)
            mutex_unreserve_unlock(&res_buf->lock);
        if (ret == -EAGAIN) {
            /* we lost out in a seqno race, lock and retry.. */
            mutex_reserve_lock_slow(&buf->lock, seqno);
            res_buf = buf;
            goto retry;
        }
        return ret;
    }

    int unlock_execbuf(execbuf)
    {
        /* when GPU is finished: */
        for (buf in execbuf->buffers)
            mutex_unreserve_unlock(&buf->lock);
    }
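For non-gpu folk it may help to see the same loop with the pseudocode
spelled out.  A rough, compilable sketch, assuming a made-up
struct buf / struct execbuf layout and an assign_global_seqno()
helper; only the mutex_reserve_lock() / mutex_unreserve_unlock() /
mutex_reserve_lock_slow() calls are the API being documented here:

    struct buf {
        struct ticket_mutex lock;
        /* ... actual buffer contents ... */
    };

    struct execbuf {
        struct buf **buffers;
        int nr_buffers;
    };

    int lock_execbuf(struct execbuf *execbuf)
    {
        struct buf *buf, *res_buf = NULL;
        unsigned long seqno = assign_global_seqno();
        int i, ret;

    retry:
        for (i = 0; i < execbuf->nr_buffers; i++) {
            buf = execbuf->buffers[i];

            /* skip the buffer we already took in the slow path: */
            if (buf == res_buf) {
                res_buf = NULL;
                continue;
            }

            ret = mutex_reserve_lock(&buf->lock, seqno);
            if (ret < 0)
                goto err;
        }

        /* all locks held, safe to queue the job up to the GPU */
        return 0;

    err:
        /* back off: drop everything locked so far, in reverse order */
        while (--i >= 0)
            mutex_unreserve_unlock(&execbuf->buffers[i]->lock);

        /* the slow-path buffer was already unlocked above if the loop
         * passed it (res_buf was set back to NULL), otherwise drop it
         * here: */
        if (res_buf)
            mutex_unreserve_unlock(&res_buf->lock);

        if (ret == -EAGAIN) {
            /* we lost to a lower seqno: sleep until the contended
             * buffer is released, take it, and restart with that one
             * lock already held: */
            mutex_reserve_lock_slow(&buf->lock, seqno);
            res_buf = buf;
            goto retry;
        }

        return ret;
    }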
What Really Happens:
---- ------ -------
(TODO maybe this should be Documentation/dma-fence-reservation.txt,
and this file should just refer to it??  Well, we can shuffle things
around later..)

In real life, you want to keep the GPU operating asynchronously to the
CPU as much as possible, and not have to wait to queue up more work
for the GPU until the previous work is finished.  So in practice, you
unlock (unreserve) all the buffers as soon as the execbuf is queued up
to the GPU.  The dma-buf fence objects, and the reservation code which
manages the fence objects (and is the primary user of ticket_mutex),
take care of the synchronization of different buffer users from this
point on.  If you really left the buffers locked until you got some
irq back from the GPU to let you know that the GPU was finished, then
you would be unable to queue up more rendering involving the same
buffer(s), which would be quite horrible for performance.

To just understand ticket_mutex, you can probably ignore this section.
If you want to make your driver share buffers with a GPU properly,
then you really need to be using reservation/fence, so you should read
on.

NOTE: the reservation object and fence are split out from dma-buf so
that a driver can use them both for its own internal buffers and for
imported dma-bufs, without having to create a dma-buf for every
internal buffer.

For each rendering command queued up to the GPU, a fence object is
created.  You can think of the fence as a sort of waitqueue, except
that (if it is supported by the other devices waiting on the same
buffer) it can be used for hw->hw signaling, so that CPU involvement
is not required.  A fence object is a transient, one-use object with
two states: initially it is created un-signaled, and later, after the
hw is done with the operation, it becomes signaled.  (TODO probably
should refer to a different .txt with more details about fences,
hw->hw vs hw->sw vs sw->sw signaling, etc)

The same fence is attached to the reservation_object of all buffers
involved in a rendering command: in the fence_excl slot if the buffer
is being written to, otherwise in one of the fence_shared slots.  (A
buffer can safely have many readers at once.)

If, when preparing to submit the rendering command, a buffer has an
un-signaled exclusive fence attached, then there must be some way to
wait for that fence to become signaled before the hw uses that buffer.
In the simple case, if that fence isn't one that the driver knows how
to instruct the hw to wait for, then it must call fence_wait() to
block until other devices have finished writing to the buffer.  But if
the driver has a way to instruct the hw to wait until the fence is
signaled, it can just emit commands to make the GPU wait, and avoid
blocking on the CPU.

NOTE: in actuality, if the fence was created on the same ring, and you
therefore know that it will be signaled by the earlier render command
before the hw sees the current render command, then inserting fence
cmds to the hw can be skipped.
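To make the last few paragraphs concrete, here is a rough sketch of
the decision a driver makes before the hw reads a buffer.  The
reservation_object layout is simplified from the description above,
the exact fence_wait() signature is abbreviated, and the ring helpers
(same_ring(), can_hw_wait(), emit_hw_wait()) are hypothetical driver
hooks, not part of the proposed API:

    /* simplified; see the reservation/fence patches for the real
     * object layout: */
    struct reservation_object {
        struct ticket_mutex lock;
        struct fence *fence_excl;      /* last writer, or NULL */
        struct fence **fence_shared;   /* current readers */
        u32 fence_shared_count;
    };

    /* called while preparing to submit a command that reads the
     * buffer, with the reservation held: */
    int sync_to_last_writer(struct ring *ring,
                            struct reservation_object *resv)
    {
        struct fence *fence = resv->fence_excl;

        if (!fence)
            return 0;      /* no pending writer, nothing to wait for */

        if (same_ring(ring, fence))
            return 0;      /* FIFO ordering on one ring, skip the wait */

        if (can_hw_wait(ring, fence))
            return emit_hw_wait(ring, fence);   /* hw->hw, no CPU block */

        /* fall back to blocking on the CPU until the writer is done */
        return fence_wait(fence, true /* interruptible */);
    }

A write would additionally need to wait for all the fence_shared
entries (the current readers), and then install its own new fence in
fence_excl, so that later users in turn wait on it.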
============================

It made me realize we also need some docs about fence/reservation..
not sure if I'll get a chance before fosdem, but I can take a stab at
that too

BR,
-R