The current Wound-Wait mutex algorithm is actually not Wound-Wait but Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait is, contrary to Wait-Die a preemptive algorithm and is known to generate fewer backoffs. Testing reveals that this is true if the number of simultaneous contending transactions is small. As the number of simultaneous contending threads increases, Wait-Wound becomes inferior to Wait-Die in terms of elapsed time. Possibly due to the larger number of held locks of sleeping transactions. Update documentation and callers. Timings using git://people.freedesktop.org/~thomash/ww_mutex_test tag patch-18-06-04 Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly chosen out of 100000. Four core Intel x86_64: Algorithm #threads Rollbacks time Wound-Wait 4 ~100 ~17s. Wait-Die 4 ~150000 ~19s. Wound-Wait 16 ~360000 ~109s. Wait-Die 16 ~450000 ~82s. Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> Cc: Ingo Molnar <mingo@xxxxxxxxxx> Cc: Jonathan Corbet <corbet@xxxxxxx> Cc: Gustavo Padovan <gustavo@xxxxxxxxxxx> Cc: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxxxx> Cc: Sean Paul <seanpaul@xxxxxxxxxxxx> Cc: David Airlie <airlied@xxxxxxxx> Cc: Davidlohr Bueso <dave@xxxxxxxxxxxx> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> Cc: Josh Triplett <josh@xxxxxxxxxxxxxxxx> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Cc: Kate Stewart <kstewart@xxxxxxxxxxxxxxxxxxx> Cc: Philippe Ombredanne <pombredanne@xxxxxxxx> Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> Cc: linux-doc@xxxxxxxxxxxxxxx Cc: linux-media@xxxxxxxxxxxxxxx Cc: linaro-mm-sig@xxxxxxxxxxxxxxxx Signed-off-by: Thomas Hellstrom <thellstrom@xxxxxxxxxx> --- Documentation/locking/ww-mutex-design.txt | 57 ++++++++++++++---- drivers/dma-buf/reservation.c | 2 +- drivers/gpu/drm/drm_modeset_lock.c | 2 +- include/linux/ww_mutex.h | 19 ++++-- kernel/locking/locktorture.c | 2 +- kernel/locking/mutex.c | 98 ++++++++++++++++++++++++++++--- kernel/locking/test-ww_mutex.c | 2 +- lib/locking-selftest.c | 2 +- 8 files changed, 152 insertions(+), 32 deletions(-) diff --git a/Documentation/locking/ww-mutex-design.txt b/Documentation/locking/ww-mutex-design.txt index 34c3a1b50b9a..29c85623b551 100644 --- a/Documentation/locking/ww-mutex-design.txt +++ b/Documentation/locking/ww-mutex-design.txt @@ -1,4 +1,4 @@ -Wait/Wound Deadlock-Proof Mutex Design +Wound/Wait Deadlock-Proof Mutex Design ====================================== Please read mutex-design.txt first, as it applies to wait/wound mutexes too. @@ -32,10 +32,23 @@ the oldest task) wins, and the one with the higher reservation id (i.e. the younger task) unlocks all of the buffers that it has already locked, and then tries again. -In the RDBMS literature this deadlock handling approach is called wait/wound: -The older tasks waits until it can acquire the contended lock. The younger tasks -needs to back off and drop all the locks it is currently holding, i.e. the -younger task is wounded. +In the RDBMS literature, a reservation ticket is associated with a transaction. +and the deadlock handling approach is called Wait-Die. The name is based on +the actions of a locking thread when it encounters an already locked mutex. +If the transaction holding the lock is younger, the locking transaction waits. +If the transaction holding the lock is older, the locking transaction backs off +and dies. Hence Wait-Die. +There is also another algorithm called Wound-Wait: +If the transaction holding the lock is younger, the locking transaction +preempts the transaction holding the lock, requiring it to back off. It +Wounds the other transaction. +If the transaction holding the lock is older, it waits for the other +transaction. Hence Wound-Wait. +The two algorithms are both fair in that a transaction will eventually succeed. +However, the Wound-Wait algorithm is typically stated to generate fewer backoffs +compared to Wait-Die, but is, on the other hand, associated with more work than +Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive +algorithm which requires a reliable way to preempt another transaction. Concepts -------- @@ -47,10 +60,12 @@ Acquire context: To ensure eventual forward progress it is important the a task trying to acquire locks doesn't grab a new reservation id, but keeps the one it acquired when starting the lock acquisition. This ticket is stored in the acquire context. Furthermore the acquire context keeps track of debugging state -to catch w/w mutex interface abuse. +to catch w/w mutex interface abuse. An acquire context is representing a +transaction. W/w class: In contrast to normal mutexes the lock class needs to be explicit for -w/w mutexes, since it is required to initialize the acquire context. +w/w mutexes, since it is required to initialize the acquire context. The lock +class also specifies what algorithm to use, Wound-Wait or Wait-Die. Furthermore there are three different class of w/w lock acquire functions: @@ -90,10 +105,15 @@ provided. Usage ----- +The algorithm (Wait-Die vs Wound-Wait) is chosen using the _is_wait_die +argument to DEFINE_WW_CLASS(). As a rough rule of thumb, use Wound-Wait iff you +typically expect the number of simultaneous competing transactions to be small, +and the rollback cost can be substantial. + Three different ways to acquire locks within the same w/w class. Common definitions for methods #1 and #2: -static DEFINE_WW_CLASS(ww_class); +static DEFINE_WW_CLASS(ww_class, false); struct obj { struct ww_mutex lock; @@ -243,7 +263,7 @@ struct obj { struct list_head locked_list; }; -static DEFINE_WW_CLASS(ww_class); +static DEFINE_WW_CLASS(ww_class, false); void __unlock_objs(struct list_head *list) { @@ -312,12 +332,23 @@ Design: We maintain the following invariants for the wait list: (1) Waiters with an acquire context are sorted by stamp order; waiters without an acquire context are interspersed in FIFO order. - (2) Among waiters with contexts, only the first one can have other locks - acquired already (ctx->acquired > 0). Note that this waiter may come - after other waiters without contexts in the list. + (2) For Wait-Die, among waiters with contexts, only the first one can have + other locks acquired already (ctx->acquired > 0). Note that this waiter + may come after other waiters without contexts in the list. + + The Wound-Wait preemption is implemented with a lazy-preemption scheme: + The wounded status of the transaction is checked only when there is + contention for a new lock and hence a true chance of deadlock. In that + situation, if the transaction is wounded, it backs off, clears the + wounded status and retries. A great benefit of implementing preemption in + this way is that the wounded transaction can identify a contending lock to + wait for before restarting the transaction. Just blindly restarting the + transaction would likely make the transaction end up in a situation where + it would have to back off again. In general, not much contention is expected. The locks are typically used to - serialize access to resources for devices. + serialize access to resources for devices, and optimization focus should + therefore be directed towards the uncontended cases. Lockdep: Special care has been taken to warn for as many cases of api abuse diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c index 314eb1071cce..039571b9fea1 100644 --- a/drivers/dma-buf/reservation.c +++ b/drivers/dma-buf/reservation.c @@ -46,7 +46,7 @@ * write-side updates. */ -DEFINE_WW_CLASS(reservation_ww_class); +DEFINE_WW_CLASS(reservation_ww_class, true); EXPORT_SYMBOL(reservation_ww_class); struct lock_class_key reservation_seqcount_class; diff --git a/drivers/gpu/drm/drm_modeset_lock.c b/drivers/gpu/drm/drm_modeset_lock.c index 8a5100685875..f22a7ef41de1 100644 --- a/drivers/gpu/drm/drm_modeset_lock.c +++ b/drivers/gpu/drm/drm_modeset_lock.c @@ -70,7 +70,7 @@ * lists and lookup data structures. */ -static DEFINE_WW_CLASS(crtc_ww_class); +static DEFINE_WW_CLASS(crtc_ww_class, true); /** * drm_modeset_lock_all - take all modeset locks diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h index 39fda195bf78..6278077f288b 100644 --- a/include/linux/ww_mutex.h +++ b/include/linux/ww_mutex.h @@ -8,6 +8,8 @@ * * Wound/wait implementation: * Copyright (C) 2013 Canonical Ltd. + * Choice of algorithm: + * Copyright (C) 2018 WMWare Inc. * * This file contains the main data structure and API definitions. */ @@ -23,15 +25,17 @@ struct ww_class { struct lock_class_key mutex_key; const char *acquire_name; const char *mutex_name; + bool is_wait_die; }; struct ww_acquire_ctx { struct task_struct *task; unsigned long stamp; unsigned acquired; + bool wounded; + struct ww_class *ww_class; #ifdef CONFIG_DEBUG_MUTEXES unsigned done_acquire; - struct ww_class *ww_class; struct ww_mutex *contending_lock; #endif #ifdef CONFIG_DEBUG_LOCK_ALLOC @@ -58,17 +62,19 @@ struct ww_mutex { # define __WW_CLASS_MUTEX_INITIALIZER(lockname, class) #endif -#define __WW_CLASS_INITIALIZER(ww_class) \ +#define __WW_CLASS_INITIALIZER(ww_class, _is_wait_die) \ { .stamp = ATOMIC_LONG_INIT(0) \ , .acquire_name = #ww_class "_acquire" \ - , .mutex_name = #ww_class "_mutex" } + , .mutex_name = #ww_class "_mutex" \ + , .is_wait_die = _is_wait_die } #define __WW_MUTEX_INITIALIZER(lockname, class) \ { .base = __MUTEX_INITIALIZER(lockname.base) \ __WW_CLASS_MUTEX_INITIALIZER(lockname, class) } -#define DEFINE_WW_CLASS(classname) \ - struct ww_class classname = __WW_CLASS_INITIALIZER(classname) +#define DEFINE_WW_CLASS(classname, _is_wait_die) \ + struct ww_class classname = __WW_CLASS_INITIALIZER(classname, \ + _is_wait_die) #define DEFINE_WW_MUTEX(mutexname, ww_class) \ struct ww_mutex mutexname = __WW_MUTEX_INITIALIZER(mutexname, ww_class) @@ -123,8 +129,9 @@ static inline void ww_acquire_init(struct ww_acquire_ctx *ctx, ctx->task = current; ctx->stamp = atomic_long_inc_return_relaxed(&ww_class->stamp); ctx->acquired = 0; -#ifdef CONFIG_DEBUG_MUTEXES ctx->ww_class = ww_class; + ctx->wounded = false; +#ifdef CONFIG_DEBUG_MUTEXES ctx->done_acquire = 0; ctx->contending_lock = NULL; #endif diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 6850ffd69125..778ed026382f 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -365,7 +365,7 @@ static struct lock_torture_ops mutex_lock_ops = { }; #include <linux/ww_mutex.h> -static DEFINE_WW_CLASS(torture_ww_class); +static DEFINE_WW_CLASS(torture_ww_class, true); static DEFINE_WW_MUTEX(torture_ww_mutex_0, &torture_ww_class); static DEFINE_WW_MUTEX(torture_ww_mutex_1, &torture_ww_class); static DEFINE_WW_MUTEX(torture_ww_mutex_2, &torture_ww_class); diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 2048359f33d2..b449a012c6f9 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -290,12 +290,47 @@ __ww_ctx_stamp_after(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b) (a->stamp != b->stamp || a > b); } +/* + * Wound the lock holder transaction if it's younger than the contending + * transaction, and there is a possibility of a deadlock. + * Also if the lock holder transaction isn't the current transaction, + * Make sure it's woken up in case it's sleeping on another ww mutex. + */ +static bool __ww_mutex_wound(struct mutex *lock, + struct ww_acquire_ctx *ww_ctx, + struct ww_acquire_ctx *hold_ctx) +{ + struct task_struct *owner = + __owner_task(atomic_long_read(&lock->owner)); + + lockdep_assert_held(&lock->wait_lock); + + if (owner && hold_ctx && __ww_ctx_stamp_after(hold_ctx, ww_ctx) && + ww_ctx->acquired > 0) { + WRITE_ONCE(hold_ctx->wounded, true); + if (owner != current) { + /* + * wake_up_process() inserts a write memory barrier to + * make sure owner sees it is wounded before + * TASK_RUNNING in case it's sleeping on another + * ww_mutex. Note that owner points to a valid + * task_struct as long as we hold the wait_lock. + */ + wake_up_process(owner); + } + return true; + } + + return false; +} + /* * Wake up any waiters that may have to back off when the lock is held by the * given context. * * Due to the invariants on the wait list, this can only affect the first - * waiter with a context. + * waiter with a context, unless the Wound-Wait algorithm is used where + * also subsequent waiters with a context main wound the lock holder. * * The current task must not be on the wait list. */ @@ -303,6 +338,7 @@ static void __sched __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx) { struct mutex_waiter *cur; + bool is_wait_die = ww_ctx->ww_class->is_wait_die; lockdep_assert_held(&lock->wait_lock); @@ -310,13 +346,14 @@ __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx) if (!cur->ww_ctx) continue; - if (cur->ww_ctx->acquired > 0 && + if (is_wait_die && cur->ww_ctx->acquired > 0 && __ww_ctx_stamp_after(cur->ww_ctx, ww_ctx)) { debug_mutex_wake_waiter(lock, cur); wake_up_process(cur->task); } - break; + if (is_wait_die || __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx)) + break; } } @@ -338,12 +375,17 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx) * and keep spinning, or it will acquire wait_lock, add itself * to waiter list and sleep. */ - smp_mb(); /* ^^^ */ + smp_mb(); /* See comments above and below. */ /* - * Check if lock is contended, if not there is nobody to wake up + * Check if lock is contended, if not there is nobody to wake up. + * Checking MUTEX_FLAG_WAITERS is not enough here, since we need to + * order against the lock->ctx check in __ww_mutex_wound called from + * __ww_mutex_add_waiter. We can use list_empty without taking the + * wait_lock, given the memory barrier above and the list_empty + * documentation. */ - if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS))) + if (likely(list_empty(&lock->base.wait_list))) return; /* @@ -653,6 +695,17 @@ __ww_mutex_lock_check_stamp(struct mutex *lock, struct mutex_waiter *waiter, struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx); struct mutex_waiter *cur; + /* + * If we miss a wounded == true here, we will have a pending + * TASK_RUNNING and pick it up on the next schedule fall-through. + */ + if (!ctx->ww_class->is_wait_die) { + if (READ_ONCE(ctx->wounded)) + goto deadlock; + else + return 0; + } + if (hold_ctx && __ww_ctx_stamp_after(ctx, hold_ctx)) goto deadlock; @@ -683,12 +736,15 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter, { struct mutex_waiter *cur; struct list_head *pos; + bool is_wait_die; if (!ww_ctx) { list_add_tail(&waiter->list, &lock->wait_list); return 0; } + is_wait_die = ww_ctx->ww_class->is_wait_die; + /* * Add the waiter before the first waiter with a higher stamp. * Waiters without a context are skipped to avoid starving @@ -701,7 +757,7 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter, if (__ww_ctx_stamp_after(ww_ctx, cur->ww_ctx)) { /* Back off immediately if necessary. */ - if (ww_ctx->acquired > 0) { + if (is_wait_die && ww_ctx->acquired > 0) { #ifdef CONFIG_DEBUG_MUTEXES struct ww_mutex *ww; @@ -721,13 +777,26 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter, * Wake up the waiter so that it gets a chance to back * off. */ - if (cur->ww_ctx->acquired > 0) { + if (is_wait_die && cur->ww_ctx->acquired > 0) { debug_mutex_wake_waiter(lock, cur); wake_up_process(cur->task); } } list_add_tail(&waiter->list, pos); + if (!is_wait_die) { + struct ww_mutex *ww = container_of(lock, struct ww_mutex, base); + + /* + * Make sure a racing lock taker sees a non-empty waiting list + * before we read ww->ctx, so that if we miss ww->ctx, the + * racing lock taker will call __ww_mutex_wake_up_for_backoff() + * and wound itself. + */ + smp_mb(); + __ww_mutex_wound(lock, ww_ctx, ww->ctx); + } + return 0; } @@ -750,6 +819,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, if (use_ww_ctx && ww_ctx) { if (unlikely(ww_ctx == READ_ONCE(ww->ctx))) return -EALREADY; + + /* + * Reset the wounded flag after a backoff. + * No other process can race and wound us here since they + * can't have a valid owner pointer at this time + */ + if (ww_ctx->acquired == 0) + ww_ctx->wounded = false; } preempt_disable(); @@ -858,6 +935,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, acquired: __set_current_state(TASK_RUNNING); + /* We stole the lock. Need to check wounded status. */ + if (use_ww_ctx && ww_ctx && !ww_ctx->ww_class->is_wait_die && + !__mutex_waiter_is_first(lock, &waiter)) + __ww_mutex_wakeup_for_backoff(lock, ww_ctx); + mutex_remove_waiter(lock, &waiter, current); if (likely(list_empty(&lock->wait_list))) __mutex_clear_flag(lock, MUTEX_FLAGS); diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c index 0e4cd64ad2c0..c7fc112d691d 100644 --- a/kernel/locking/test-ww_mutex.c +++ b/kernel/locking/test-ww_mutex.c @@ -26,7 +26,7 @@ #include <linux/slab.h> #include <linux/ww_mutex.h> -static DEFINE_WW_CLASS(ww_class); +static DEFINE_WW_CLASS(ww_class, true); struct workqueue_struct *wq; struct test_mutex { diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c index b5c1293ce147..e52065f2acbf 100644 --- a/lib/locking-selftest.c +++ b/lib/locking-selftest.c @@ -29,7 +29,7 @@ */ static unsigned int debug_locks_verbose; -static DEFINE_WW_CLASS(ww_lockdep); +static DEFINE_WW_CLASS(ww_lockdep, true); static int __init setup_debug_locks_verbose(char *str) { -- 2.14.3 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html