On Wed, Jan 30, 2013 at 5:08 AM, Daniel Vetter <daniel@xxxxxxxx> wrote:
> On Wed, Jan 30, 2013 at 2:07 AM, Rob Clark <robdclark@xxxxxxxxx> wrote:
>> ==========================
>> Basic problem statement:
>> ----- ------- ---------
>> GPUs do operations that commonly involve many buffers.  Those buffers
>> can be shared across contexts/processes, exist in different memory
>> domains (for example VRAM vs system memory), and so on.  And with
>> PRIME / dmabuf, they can even be shared across devices.  So there are
>> a handful of situations where the driver needs to wait for buffers to
>> become ready.  If you think about this in terms of waiting on a buffer
>> mutex for it to become available, this presents a problem because
>> there is no way to guarantee that buffers appear in an execbuf/batch in
>> the same order in all contexts.  That is directly under control of
>> userspace, and a result of the sequence of GL calls that an
>> application makes, which results in the potential for deadlock.  The
>> problem gets more complex when you consider that the kernel may need
>> to migrate the buffer(s) into VRAM before the GPU operates on the
>> buffer(s), which may in turn require evicting some other buffers (and
>> you don't want to evict other buffers which are already queued up to
>> the GPU), but for a simplified understanding of the problem you can
>> ignore this.
>>
>> The algorithm that TTM came up with for dealing with this problem is
>> quite simple.  For each group of buffers (execbuf) that needs to be
>> locked, the caller is assigned a unique reservation_id from a global
>> counter.  In case of deadlock while locking all the buffers associated
>> with an execbuf, the one with the lowest reservation_id wins, and the
>> one with the higher reservation_id unlocks all of the buffers that it
>> has already locked, and then tries again.
>>
>> Originally TTM implemented this algorithm on top of an event-queue and
>> atomic-ops, but Maarten Lankhorst realized that by merging this with
>> the mutex code we could take advantage of the existing mutex fast-path
>> code and end up with a simpler solution, and so ticket_mutex was born.
>> (Well, there were also some additional complexities with the original
>> implementation when you start adding in cross-device buffer sharing
>> for PRIME..  Maarten could probably better explain.)
>
> I think the motivational writeup above is really nice, but the example
> code below is a bit wrong.
>
>> How it is used:
>> --- -- -- -----
>>
>> A very simplified version:
>>
>>     int submit_execbuf(execbuf)
>>     {
>>         /* acquiring locks, before queuing up to GPU: */
>>         seqno = assign_global_seqno();
>>     retry:
>>         for (buf in execbuf->buffers) {
>>             ret = mutex_reserve_lock(&buf->lock, seqno);
>>             switch (ret) {
>>             case 0:
>>                 /* we got the lock */
>>                 break;
>>             case -EAGAIN:
>>                 /* someone with a lower seqno, so unreserve and try again: */
>>                 for (buf2 in reverse order starting before buf in execbuf->buffers)
>>                     mutex_unreserve_unlock(&buf2->lock);
>>                 goto retry;
>>             default:
>>                 goto err;
>>             }
>>         }
>>
>>         /* now everything is good to go, submit job to GPU: */
>>         ...
>>     }
>>
>>     int finish_execbuf(execbuf)
>>     {
>>         /* when GPU is finished: */
>>         for (buf in execbuf->buffers)
>>             mutex_unreserve_unlock(&buf->lock);
>>     }
>> ==========================
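For reference, here is a minimal sketch that is closer to compilable C than the simplified loop quoted above.  The demo_* structures and fields are hypothetical placeholders, and mutex_reserve_lock() / mutex_unreserve_unlock() are used as described in this thread, not as a definitive statement of the proposed ticket-mutex API:

    #include <linux/errno.h>
    #include <linux/mutex.h>    /* assumed home of the proposed struct ticket_mutex */

    /* Hypothetical buffer / execbuf structures, just enough for the sketch. */
    struct demo_buf {
            struct ticket_mutex lock;   /* reservation-style lock discussed above */
            /* ... backing pages, placement, etc. ... */
    };

    struct demo_execbuf {
            unsigned int nr_bufs;
            struct demo_buf **bufs;
    };

    /*
     * Reserve every buffer in the execbuf under one ticket (seqno).  On
     * -EAGAIN a task holding a lower/older ticket owns one of our locks,
     * so we drop everything we already hold and start over, letting the
     * older ticket win, as in the TTM algorithm described above.
     */
    static int demo_reserve_buffers(struct demo_execbuf *eb, unsigned long seqno)
    {
            unsigned int i;
            int ret;

    retry:
            for (i = 0; i < eb->nr_bufs; i++) {
                    ret = mutex_reserve_lock(&eb->bufs[i]->lock, seqno);
                    if (ret == 0)
                            continue;           /* got this one */

                    /* unwind in reverse order: everything before bufs[i] */
                    while (i--)
                            mutex_unreserve_unlock(&eb->bufs[i]->lock);

                    if (ret == -EAGAIN)
                            goto retry;         /* back off, try the whole set again */

                    return ret;                 /* propagate any other error */
            }

            return 0;                           /* all buffers reserved */
    }

The important detail is the unwind: on -EAGAIN everything already held is dropped, in reverse order, so the task holding the older (lower) ticket can make forward progress before we retry.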
> Since gpu command submission is all async (hopefully at least) we
> don't unlock once it completes, but right away after the commands are
> submitted.  Otherwise you wouldn't be able to submit new execbufs using
> the same buffer objects (and besides, holding locks while going back
> out to userspace is evil).

right.. but I was trying to simplify the explanation for non-gpu folk..
maybe that was an over-simplification ;-)

BR,
-R

> The trick is to add a fence object for async operation (essentially a
> waitqueue on steroids to support gpu->gpu direct signalling).  And
> updating fences for a given execbuf needs to happen atomically for all
> buffers, for otherwise userspace could trick the kernel into creating
> a circular fence chain.  This wouldn't deadlock the kernel, since
> everything is async, but it'll nicely deadlock the gpus involved.
> Hence why we need ticketing locks to get dma_buf fences off the
> ground.
>
> Maybe wait for Maarten's feedback, then update your motivational blurb a bit?
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
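Combining Daniel's correction with the reservation loop sketched earlier, the overall flow looks roughly like the following: reserve the whole set under one ticket, install a single fence on every buffer while all the locks are held (the atomic fence update mentioned above), submit, and unlock immediately; GPU completion is signalled through the fence, not by holding the locks.  The demo_fence_* helpers are hypothetical placeholders standing in for whatever fence mechanism the driver uses; this is an illustration of the idea, not a real kernel API:

    /*
     * Hypothetical fence object: a "waitqueue on steroids" that is
     * signalled when the GPU work it represents has completed.
     */
    struct demo_fence;

    struct demo_fence *demo_fence_create(void);
    void demo_buf_attach_fence(struct demo_buf *buf, struct demo_fence *fence);
    void demo_hw_submit(struct demo_execbuf *eb, struct demo_fence *fence);

    static int demo_submit_execbuf(struct demo_execbuf *eb, unsigned long seqno)
    {
            struct demo_fence *fence;
            unsigned int i;
            int ret;

            ret = demo_reserve_buffers(eb, seqno);  /* retry loop sketched earlier */
            if (ret)
                    return ret;

            fence = demo_fence_create();
            if (!fence) {
                    ret = -ENOMEM;
                    goto unlock;
            }

            /*
             * Install the fence on *all* buffers while every reservation is
             * still held: this is the atomic fence update that keeps
             * userspace from constructing a circular fence chain across
             * buffers/devices.
             */
            for (i = 0; i < eb->nr_bufs; i++)
                    demo_buf_attach_fence(eb->bufs[i], fence);

            demo_hw_submit(eb, fence);  /* queue the job; fence signals on completion */
            ret = 0;

    unlock:
            /*
             * Unlock right away, without waiting for the GPU.  The next user
             * of these buffers reserves them and then waits on (or chains
             * after) the fence it finds attached.
             */
            for (i = 0; i < eb->nr_bufs; i++)
                    mutex_unreserve_unlock(&eb->bufs[i]->lock);

            return ret;
    }

A later execbuf that reserves the same buffers finds the fence attached and waits on it (or chains its own work after it), which is how ordering is enforced without anyone holding a lock across GPU execution.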