On Mon, Apr 16, 2018 at 10:23 AM, Thomas Hellstrom <thellstrom@xxxxxxxxxx> wrote:
> On 04/14/2018 10:33 AM, Daniel Vetter wrote:
>>
>> Hi Thomas,
>>
>> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
>> <thellstrom@xxxxxxxxxx> wrote:
>>>
>>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>>>
>>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>>>
>>>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes?
>>>>> Otherwise perhaps rename them, or document that they're actually doing
>>>>> wait-die.
>>>>
>>>> I think a doc patch would be good at least, including all the data you
>>>> assembled here.
>>>
>>> Actually, further investigation appears to indicate that manipulating the
>>> lock state under a local spinlock is about as fast as using atomic
>>> operations, even for the completely uncontended cases.
>>>
>>> This means that we could have a solution where you decide, on a per-mutex
>>> or per-reservation-object basis, whether you want to manipulate lock state
>>> under a "batch group" spinlock, with certain performance characteristics,
>>> or under traditional local locking, with other performance characteristics.
>>>
>>> Like, vmwgfx could choose batching locks and radeon traditional locks, but
>>> the same API would work for both, and locks could be shared between drivers.
>>
>> Don't we need to make this decision at least on a per-class level?
>
> No, I was thinking more along the lines of the ww_mutex having a pointer to
> the spinlock. It could either be the local mutex "wait_lock" or a
> per-batch-group lock; the mutex code wouldn't care. We do need a special API
> for batched locking, though not for ordinary locking.
> Both APIs should be able to handle local or grouped spinlocks.
>
> Note that this would of course require that there would be no performance
> loss for users that don't use batch groups.
>
> I guess the most efficient use for GPU command submission would be
> per-process batch groups. Then, when the batch encounters a ww_mutex with a
> different batch group (for example the display server's shared surface),
> it'll just switch batch lock, and this way the contention for the batch
> spinlock will be mostly eliminated.

But won't this force us through the spinlock case for all ww_mutexes? The
core mutex code goes to extreme lengths to avoid that for the uncontended
fast path. That's why I meant the batched and non-batched ww_mutex look
fundamentally incompatible. Or maybe I missed something somewhere.

>> Or how will the spinlock/batch-lock approach interact with the normal
>> ww_mutex_lock path (which does require the atomics/ordered stores we're
>> trying to avoid)?
>
> We can use the same code with some extra if (use_ww_ctx) in the common
> locking and unlocking paths.
> Note that the "use_ww_ctx" parameter is defined at compile time, so the
> ordinary mutex path (binary) shouldn't change at all after optimization,
> but I need to verify that, of course.

Agreed, no issue there.

> What you can't do with such a change is lock / unlock a ww_mutex using the
> standard mutex API, like mutex_lock(&ww_mutex->base), but I guess that
> would be OK?

Yeah, right now that works, but I don't care about that. The point of merging
ww_mutex into the core mutex was that we can fully reuse __mutex_trylock_fast,
and the same for the unlock path. Once we don't share that code anymore,
there's imo not that much point in having ww_mutex interleaved with the core
mutex code.
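To make the layout being discussed a bit more concrete, here is a rough
sketch of a ww_mutex that carries a pointer to either its own spinlock or a
shared batch-group lock. This is illustrative only, not in-tree code; the
ww_batch_group and ww_batch_mutex types, their fields, and the init helper
are all hypothetical names invented for this sketch:

/* Illustrative sketch only -- not the in-tree ww_mutex implementation. */
#include <linux/spinlock.h>
#include <linux/ww_mutex.h>

/* One shared lock per batch group, e.g. one per process. */
struct ww_batch_group {
	spinlock_t lock;
};

/*
 * Hypothetical wrapper: state_lock points either at the mutex's own
 * spinlock (traditional locking) or at the batch group's shared lock.
 * The lock/unlock paths would just dereference the pointer and not
 * care which of the two they got.
 */
struct ww_batch_mutex {
	struct ww_mutex base;
	spinlock_t local_lock;		/* used when no batch group is set */
	spinlock_t *state_lock;		/* &local_lock or &group->lock */
};

static inline void ww_batch_mutex_init(struct ww_batch_mutex *lock,
				       struct ww_class *ww_class,
				       struct ww_batch_group *group)
{
	ww_mutex_init(&lock->base, ww_class);
	spin_lock_init(&lock->local_lock);
	lock->state_lock = group ? &group->lock : &lock->local_lock;
}

Under this sketch, vmwgfx could pass a per-process batch group while radeon
passes NULL, and both would go through the same lock/unlock API; whether the
uncontended fast path can stay atomics-based in the NULL case is exactly the
open question raised above.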
>> If we can't mix them, I'm kinda leaning towards a
>> ww_batch_mutex/ww_batch_acquire_ctx, but with an exactly matching API
>> otherwise. We probably do need the new batch_start/end API, since
>> ww_acquire_done isn't quite the right place ...
>>
>>> I'll see if I get time to put together an RFC.
>>
>> Yeah, I think there's definitely some use for batched ww locks, where
>> parallelism is generally low, or at least where the ratio between "time
>> spent acquiring locks" and "time spent doing stuff while holding locks"
>> is small enough to not make the reduced parallelism while acquiring an
>> issue.
>
> Yes. At least it's worth bringing up for discussion. The reduced parallelism
> shouldn't be an issue if per-process batch groups are used, or if, as for
> vmwgfx, the command submission itself is serialized due to a single FIFO.

Random aside: we do seem to already implement wound semantics (or I'm
terribly confused). See __ww_mutex_add_waiter plus the related wakeup code
(in __ww_mutex_lock_check_stamp).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
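As background for the wait-die vs. wound-wait point raised at the top of the
thread, the difference between the two conflict-resolution policies can be
sketched roughly as follows. This is illustrative only; the struct and helper
names are made up for the sketch and this is not the kernel's actual
implementation (which also has to handle stamp wraparound):

#include <stdbool.h>

/* Lower stamp == older transaction, as with ww_acquire_ctx stamps. */
struct acquire_ctx {
	unsigned long stamp;
};

/*
 * Wait-die: on contention, a *younger* waiter backs off (dies) and
 * restarts its whole acquisition; an older waiter simply waits.
 */
static bool wait_die_waiter_must_backoff(const struct acquire_ctx *waiter,
					 const struct acquire_ctx *holder)
{
	return waiter->stamp > holder->stamp;	/* younger than holder: die */
}

/*
 * Wound-wait: on contention, an *older* waiter wounds the younger
 * holder, forcing it to drop the lock and restart; a younger waiter
 * simply waits.
 */
static bool wound_wait_holder_must_restart(const struct acquire_ctx *waiter,
					   const struct acquire_ctx *holder)
{
	return waiter->stamp < holder->stamp;	/* older than holder: wound it */
}

The practical difference is which side gets rolled back: wait-die only ever
restarts the task that is asking for the lock, while wound-wait needs a way
to make the current holder back off, which is what point 2) at the top of
the thread is about.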