Re: Batched ww-mutexes, wound-wait vs wait-die etc.

On 04/14/2018 10:33 AM, Daniel Vetter wrote:
Hi Thomas,

On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
<thellstrom@xxxxxxxxxx> wrote:
On 04/13/2018 07:13 PM, Daniel Vetter wrote:
On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
Otherwise perhaps rename them or document that they're actually doing
wait-die.
I think a doc patch would be good at least. Including all the data you
assembled here.

Actually, further investigation appears to indicate that manipulating the
lock state under a local spinlock is about as fast as using atomic operations,
even for the completely uncontended cases.

This means that we could have a solution where you decide, on a per-mutex or
per-reservation-object basis, whether you want to manipulate lock state under
a "batch group" spinlock, meaning certain performance characteristics, or under
traditional local locking, meaning other performance characteristics.

Like, vmwgfx could choose batching locks and radeon traditional locks, but the
same API would work for both and locks could be shared between drivers.
Don't we need to make this decision at least on a per-class level?

No, I was thinking more along the lines of the ww_mutex having a pointer to the spinlock. It could either be the local mutex "wait_lock", or a per-batch-group lock. The mutex code wouldn't care. We do need a special API for batched locking, though, but not for ordinary locking.
Both APIs should be able to handle local or grouped spinlocks.

Note that this would of course require that there is no performance loss for users who don't use batch groups.
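
Roughly something like the following (a minimal sketch only; the struct and field names here are made up for illustration and don't match the existing ww_mutex layout):

#include <linux/spinlock.h>
#include <linux/list.h>

struct ww_batch_group {
	spinlock_t lock;	/* shared by every mutex in the group */
};

struct ww_mutex_sketch {
	spinlock_t wait_lock;	/* traditional per-mutex lock */
	spinlock_t *lock_ptr;	/* either &wait_lock or the group's lock;
				 * the locking code only ever dereferences
				 * lock_ptr, so it doesn't care which */
	struct list_head wait_list;
};

/* Initialization decides which spinlock protects the mutex state. */
static inline void
ww_mutex_sketch_init(struct ww_mutex_sketch *ww, struct ww_batch_group *group)
{
	spin_lock_init(&ww->wait_lock);
	ww->lock_ptr = group ? &group->lock : &ww->wait_lock;
	INIT_LIST_HEAD(&ww->wait_list);
}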

I guess the most efficient use for GPU command submission would be per-process batch-groups. Then when a batch encounters a ww_mutex with a different batch group (for example the display server's shared surface), it'll just switch batch lock, and this way contention for the batch spinlock will be mostly eliminated.
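
The switch itself could be as simple as this (again just a sketch; ww_batch_relock is a hypothetical helper, not an existing function):

#include <linux/spinlock.h>

/* Drop the currently held batch lock and take the one belonging to the
 * mutex we're about to touch, unless it's the same lock already. */
static void ww_batch_relock(spinlock_t **held, spinlock_t *wanted)
{
	if (*held == wanted)
		return;		/* same batch group, nothing to do */

	if (*held)
		spin_unlock(*held);
	spin_lock(wanted);	/* e.g. the display server's group lock */
	*held = wanted;
}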

Or how will the spinlock/batch-lock approach interact with the normal
ww_mutex_lock path (which does require the atomics/ordered stores
we're trying to avoid)?

We can use the same code with some extra if (use_ww_ctx) checks in the common locking and unlocking path.
Note that the "use_ww_ctx" parameter is defined at compile-time, so the ordinary mutex path (the binary) shouldn't change at all after optimization, but I need to verify that, of course.
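
In other words, something along these lines (a sketch of the idea only, reusing the hypothetical ww_mutex_sketch from above rather than the real kernel code):

/* Because use_ww_ctx is a compile-time constant at every call site, the
 * compiler can drop the ww-specific branches entirely from the plain path. */
static __always_inline void
lock_common_sketch(struct ww_mutex_sketch *ww, const bool use_ww_ctx)
{
	if (use_ww_ctx)
		spin_lock(ww->lock_ptr);	/* local or batch-group lock */
	else
		spin_lock(&ww->wait_lock);	/* ordinary mutex path */

	/* ... manipulate the lock state ... */

	if (use_ww_ctx)
		spin_unlock(ww->lock_ptr);
	else
		spin_unlock(&ww->wait_lock);
}

/* Each caller passes a constant, so each expansion keeps only one branch. */
static void plain_lock_sketch(struct ww_mutex_sketch *ww)
{
	lock_common_sketch(ww, false);
}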

What you can't do with such a change is to lock/unlock a ww_mutex using the standard mutex API, like mutex_lock(&ww_mutex->base), but I guess that would be OK?



If we can't mix them I'm kinda leaning towards a
ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
otherwise. We probably do need the new batch_start/end api, since
ww_acquire_done isn't quite the right place ...
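
A matching batched API could look roughly like this (all the ww_acquire_batch_* names below are hypothetical; nothing like this exists today):

#include <linux/ww_mutex.h>

/* Hypothetical batched context wrapping the existing acquire context. */
struct ww_acquire_batch_ctx {
	struct ww_acquire_ctx base;
	spinlock_t *batch_lock;		/* per-process batch-group lock */
};

/* Take the group lock once for the whole locking sequence ... */
void ww_acquire_batch_start(struct ww_acquire_batch_ctx *ctx);

/* ... and drop it again once every mutex in the batch has been acquired,
 * roughly where ww_acquire_done() is called today. */
void ww_acquire_batch_end(struct ww_acquire_batch_ctx *ctx);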

I'll see if I get time to put together an RFC.
Yeah I think there's definitely some use for batched ww locks, where
parallelism is generally low, or at least the ratio between "time
spent acquiring locks" and "time spent doing stuff while holding
locks" is small enough to not make the reduced parallelism while
acquiring an issue.

Yes. At least it's worth bringing up for discussion. The reduced parallelism shouldn't be an issue if per-process batch-groups are used, or when, as in vmwgfx, command submission itself is serialized due to a single FIFO.

/Thomas


-Daniel


_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel



