Re: Batched ww-mutexes, wound-wait vs wait-die etc.

On 04/14/2018 10:33 AM, Daniel Vetter wrote:
Hi Thomas,

On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
<thellstrom@xxxxxxxxxx> wrote:
On 04/13/2018 07:13 PM, Daniel Vetter wrote:
On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
Otherwise perhaps rename them or document that they're actually doing
wait-die.
I think a doc patch would be good at least. Including all the data you
assembled here.

Actually, further investigation appears to indicate that manipulating the
lock state under a local spinlock is about as fast as using atomic operations,
even for the completely uncontended cases.

This means that we could have a solution where you decide, on a per-mutex or
per-reservation-object basis, whether you want to manipulate lock state under
a "batch group" spinlock, meaning certain performance characteristics, or under
traditional local locking, meaning other performance characteristics.

Like, vmwgfx could choose batching locks and radeon traditional locks, but the
same API would work for both and locks could be shared between drivers.
Don't we need to make this decision at least on a per-class level?

No, I was thinking more along the lines of the ww_mutex having a pointer to the spinlock. It could either be the local mutex "wait_lock", or a per-batch-group lock. The mutex code wouldn't care. We do need a special API for batched locking, though, but not for ordinary locking.
Both APIs should be able to handle local or grouped spinlocks.

Note that this would of course require that there is no performance loss for users who don't use batch groups.
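
Roughly something like the following (a minimal sketch only; the struct and field names here are made up for illustration and don't match the existing ww_mutex layout):

#include <linux/spinlock.h>
#include <linux/list.h>

struct ww_batch_group {
	spinlock_t lock;	/* shared by every mutex in the group */
};

struct ww_mutex_sketch {
	spinlock_t wait_lock;	/* traditional per-mutex lock */
	spinlock_t *lock_ptr;	/* either &wait_lock or the group's lock;
				 * the locking code only ever dereferences
				 * lock_ptr, so it doesn't care which */
	struct list_head wait_list;
};

/* Initialization decides which spinlock protects the mutex state. */
static inline void
ww_mutex_sketch_init(struct ww_mutex_sketch *ww, struct ww_batch_group *group)
{
	spin_lock_init(&ww->wait_lock);
	ww->lock_ptr = group ? &group->lock : &ww->wait_lock;
	INIT_LIST_HEAD(&ww->wait_list);
}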

I guess the most efficient use for GPU command submission would be per-process batch-groups. Then when a batch encounters a ww_mutex with a different batch group (for example the display server's shared surface), it'll just switch batch lock, and this way contention for the batch spinlock will be mostly eliminated.
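
The switch itself could be as simple as this (again just a sketch; ww_batch_relock is a hypothetical helper, not an existing function):

#include <linux/spinlock.h>

/* Drop the currently held batch lock and take the one belonging to the
 * mutex we're about to touch, unless it's the same lock already. */
static void ww_batch_relock(spinlock_t **held, spinlock_t *wanted)
{
	if (*held == wanted)
		return;		/* same batch group, nothing to do */

	if (*held)
		spin_unlock(*held);
	spin_lock(wanted);	/* e.g. the display server's group lock */
	*held = wanted;
}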

Or how will the spinlock/batch-lock approach interact with the normal
ww_mutex_lock path (which does require the atomics/ordered stores
we're trying to avoid)?

We can use the same code with some extra if (use_ww_ctx) checks in the common locking and unlocking path.
Note that the "use_ww_ctx" parameter is defined at compile-time, so the ordinary mutex path (the binary) shouldn't change at all after optimization, but I need to verify that, of course.
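
In other words, something along these lines (a sketch of the idea only, reusing the hypothetical ww_mutex_sketch from above rather than the real kernel code):

/* Because use_ww_ctx is a compile-time constant at every call site, the
 * compiler can drop the ww-specific branches entirely from the plain path. */
static __always_inline void
lock_common_sketch(struct ww_mutex_sketch *ww, const bool use_ww_ctx)
{
	if (use_ww_ctx)
		spin_lock(ww->lock_ptr);	/* local or batch-group lock */
	else
		spin_lock(&ww->wait_lock);	/* ordinary mutex path */

	/* ... manipulate the lock state ... */

	if (use_ww_ctx)
		spin_unlock(ww->lock_ptr);
	else
		spin_unlock(&ww->wait_lock);
}

/* Each caller passes a constant, so each expansion keeps only one branch. */
static void plain_lock_sketch(struct ww_mutex_sketch *ww)
{
	lock_common_sketch(ww, false);
}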

What you can't do with such a change is to lock/unlock a ww_mutex using the standard mutex API, like mutex_lock(&ww_mutex->base), but I guess that would be OK?



If we can't mix them I'm kinda leaning towards a
ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
otherwise. We probably do need the new batch_start/end api, since
ww_acquire_done isn't quite the right place ...
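
A matching batched API could look roughly like this (all the ww_acquire_batch_* names below are hypothetical; nothing like this exists today):

#include <linux/ww_mutex.h>

/* Hypothetical batched context wrapping the existing acquire context. */
struct ww_acquire_batch_ctx {
	struct ww_acquire_ctx base;
	spinlock_t *batch_lock;		/* per-process batch-group lock */
};

/* Take the group lock once for the whole locking sequence ... */
void ww_acquire_batch_start(struct ww_acquire_batch_ctx *ctx);

/* ... and drop it again once every mutex in the batch has been acquired,
 * roughly where ww_acquire_done() is called today. */
void ww_acquire_batch_end(struct ww_acquire_batch_ctx *ctx);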

I'll see if I get time to put together an RFC.
Yeah I think there's definitely some use for batched ww locks, where
parallelism is generally low, or at least the ratio between "time
spent acquiring locks" and "time spent doing stuff while holding
locks" is small enough to not make the reduced parallelism while
acquiring an issue.

Yes. At least it's worth bringing up for discussion. The reduced parallelism shouldn't be an issue if per-process batch-groups are used, or when, as in vmwgfx, command submission itself is serialized due to a single FIFO.

/Thomas


-Daniel


_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel



