On Wed, 8 Jan 2025 at 21:36, Waiman Long <llong@xxxxxxxxxx> wrote:
>
>
> On 1/7/25 8:59 AM, Kumar Kartikeya Dwivedi wrote:
> > While the timeout logic provides guarantees for the waiter's forward
> > progress, the time until a stalling waiter unblocks can still be long.
> > The default timeout of 1/2 sec can be excessively long for some use
> > cases. Additionally, custom timeouts may exacerbate recovery time.
> >
> > Introduce logic to detect common cases of deadlocks and perform quicker
> > recovery. This is done by dividing the time from entry into the locking
> > slow path until the timeout into intervals of 1 ms. Then, after each
> > interval elapses, deadlock detection is performed, while also polling
> > the lock word to ensure we can quickly break out of the detection logic
> > and proceed with lock acquisition.
> >
> > A 'held_locks' table is maintained per-CPU, where the entry at the bottom
> > denotes a lock being waited for or already taken. Entries coming before
> > it denote locks that are already held. The current CPU's table can thus
> > be looked at to detect AA deadlocks. The tables from other CPUs can be
> > looked at to discover ABBA situations. Finally, when a matching entry
> > for the lock being taken on the current CPU is found on some other CPU,
> > a deadlock situation is detected. This function can take a long time,
> > therefore the lock word is constantly polled in each loop iteration to
> > ensure we can preempt detection and proceed with lock acquisition, using
> > the is_lock_released check.
> >
> > We set the 'spin' member of the rqspinlock_timeout struct to 0 to
> > trigger deadlock checks immediately and perform faster recovery.
> >
> > Note: Extending the lock word size by 4 bytes to record the owner CPU
> > would allow faster ABBA detection, as it is typically the owner that
> > participates in an ABBA situation. However, to keep compatibility with
> > existing lock words in the kernel (struct qspinlock), and given that
> > deadlocks are rare events triggered by bugs, we choose to favor
> > compatibility over faster detection.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> > ---
> >  include/asm-generic/rqspinlock.h |  56 +++++++++-
> >  kernel/locking/rqspinlock.c      | 178 ++++++++++++++++++++++++++++---
> >  2 files changed, 220 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
> > index 5c996a82e75f..c7e33ccc57a6 100644
> > --- a/include/asm-generic/rqspinlock.h
> > +++ b/include/asm-generic/rqspinlock.h
> > @@ -11,14 +11,68 @@
> >
> >  #include <linux/types.h>
> >  #include <vdso/time64.h>
> > +#include <linux/percpu.h>
> >
> >  struct qspinlock;
> >
> > +extern int resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val, u64 timeout);
> > +
> >  /*
> >   * Default timeout for waiting loops is 0.5 seconds
> >   */
> >  #define RES_DEF_TIMEOUT (NSEC_PER_SEC / 2)
> >
> > -extern int resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val, u64 timeout);
> > +#define RES_NR_HELD 32
> > +
> > +struct rqspinlock_held {
> > +        int cnt;
> > +        void *locks[RES_NR_HELD];
> > +};
> > +
> > +DECLARE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
> > +
> > +static __always_inline void grab_held_lock_entry(void *lock)
> > +{
> > +        int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);
> > +
> > +        if (unlikely(cnt > RES_NR_HELD)) {
> > +                /* Still keep the inc so we decrement later. */
> > +                return;
> > +        }
> > +
> > +        /*
> > +         * Implied compiler barrier in per-CPU operations; otherwise we can
> > +         * have the compiler reorder the inc with the write to the table,
> > +         * allowing interrupts to overwrite and erase our write to the table
> > +         * (as on interrupt exit it will be reset to NULL).
> > +         */
> > +        this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
> > +}
> > +
> > +/*
> > + * It is possible to run into misdetection scenarios of AA deadlocks on the
> > + * same CPU, and missed ABBA deadlocks on remote CPUs, when this function pops
> > + * entries out of order (due to a lock A, lock B, unlock A, unlock B pattern).
> > + * The correct logic to preserve the right entries in the table would be to
> > + * walk the array of held locks and swap and clear out-of-order entries, but
> > + * that's too complicated and we don't have a compelling use case for out of
> > + * order unlocking.
> Maybe we can pass in the lock and print a warning if out-of-order unlock
> is being done.

I think, alternatively, I will constrain the verifier in v2 to require lock
release to be in-order, which would obviate the need to warn at runtime and
would instead reject programs that could do out-of-order unlocks. This
doesn't cover in-kernel users, but we're not doing out-of-order unlocks with
this lock there, and it would be yet another branch in the unlock function
with little benefit.

> > + *
> > + * Therefore, we simply don't support such cases and keep the logic simple here.
> > + */
> > +static __always_inline void release_held_lock_entry(void)
> > +{
> > +        struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> > +
> > +        if (unlikely(rqh->cnt > RES_NR_HELD))
> > +                goto dec;
> > +        smp_store_release(&rqh->locks[rqh->cnt - 1], NULL);
> > +        /*
> > +         * The overwrite with NULL should appear to other CPUs before our
> > +         * decrement of the count; otherwise we have the issue of a stale
> > +         * non-NULL entry being visible in the array, leading to misdetection
> > +         * during deadlock detection.
> > +         */
> > +dec:
> > +        this_cpu_dec(rqspinlock_held_locks.cnt);
> AFAIU, smp_store_release() only guarantees memory ordering before it,
> not after. That shouldn't be a problem if the decrement is observed
> before clearing the entry, as that non-NULL entry won't be checked anyway.

Ack, I will improve the comment; it's a bit misleading right now.
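To be more precise about what the comment should convey: a remote CPU scans
the table by reading cnt first and then the entries below it, so the case we
care about is a stale non-NULL entry that is still visible within the count
the remote CPU observes. A rough sketch of that reader side
(remote_cpu_holds() is only an illustrative stand-in, not code from this
patch; the real walk is part of check_deadlock_ABBA()):

/*
 * Illustrative remote reader: snapshot the count, then scan the entries
 * below it. A slot already cleared to NULL is simply skipped; the
 * misdetection the comment in release_held_lock_entry() worries about is a
 * stale non-NULL entry that is still visible within the snapshotted cnt.
 */
static bool remote_cpu_holds(struct rqspinlock_held *rqh, void *lock)
{
        int cnt = min(RES_NR_HELD, READ_ONCE(rqh->cnt));

        for (int i = 0; i < cnt; i++) {
                if (READ_ONCE(rqh->locks[i]) == lock)
                        return true;
        }
        return false;
}

I'll rework the comment so it spells out this reader-side view instead of
implying a stronger ordering guarantee than smp_store_release() provides.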
> > +}
> >  #endif /* __ASM_GENERIC_RQSPINLOCK_H */
> > diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
> > index b63f92bd43b1..b7c86127d288 100644
> > --- a/kernel/locking/rqspinlock.c
> > +++ b/kernel/locking/rqspinlock.c
> > @@ -30,6 +30,7 @@
> >   * Include queued spinlock definitions and statistics code
> >   */
> >  #include "qspinlock.h"
> > +#include "rqspinlock.h"
> >  #include "qspinlock_stat.h"
> >
> >  /*
> > @@ -74,16 +75,141 @@
> >  struct rqspinlock_timeout {
> >          u64 timeout_end;
> >          u64 duration;
> > +        u64 cur;
> >          u16 spin;
> >  };
> >
> >  #define RES_TIMEOUT_VAL 2
> >
> > -static noinline int check_timeout(struct rqspinlock_timeout *ts)
> > +DEFINE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
> > +
> > +static bool is_lock_released(struct qspinlock *lock, u32 mask, struct rqspinlock_timeout *ts)
> > +{
> > +        if (!(atomic_read_acquire(&lock->val) & (mask)))
> > +                return true;
> > +        return false;
> > +}
> > +
> > +static noinline int check_deadlock_AA(struct qspinlock *lock, u32 mask,
> > +                                      struct rqspinlock_timeout *ts)
> > +{
> > +        struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> > +        int cnt = min(RES_NR_HELD, rqh->cnt);
> > +
> > +        /*
> > +         * Return an error if we hold the lock we are attempting to acquire.
> > +         * We'll iterate over max 32 locks; no need to do is_lock_released.
> > +         */
> > +        for (int i = 0; i < cnt - 1; i++) {
> > +                if (rqh->locks[i] == lock)
> > +                        return -EDEADLK;
> > +        }
> > +        return 0;
> > +}
> > +
> > +static noinline int check_deadlock_ABBA(struct qspinlock *lock, u32 mask,
> > +                                        struct rqspinlock_timeout *ts)
> > +{
>
> I think you should note that the ABBA check here is not exhaustive. It
> is just the most common case and there are corner cases that will be missed.

Ack, will add a comment.

>
> Cheers,
> Longman
>
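To make the ABBA description in the commit message more concrete, here is
roughly the shape of the walk it describes (the real body of
check_deadlock_ABBA() is cut off in the quoted hunk above, so the helper name
and exact loop below are illustrative only, not the patch's code):

/*
 * Sketch of an ABBA walk over the per-CPU held-locks tables: for each remote
 * CPU, the bottom entry of its table is the lock it is waiting for (or just
 * took). If that lock is one we already hold, and the remote CPU also holds
 * the lock we are trying to take, we have an ABBA cycle and return -EDEADLK.
 * The lock word is polled via is_lock_released() on every iteration so that
 * detection can be abandoned as soon as the lock becomes free.
 */
static noinline int check_deadlock_ABBA_sketch(struct qspinlock *lock, u32 mask,
                                               struct rqspinlock_timeout *ts)
{
        struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
        int rqh_cnt = min(RES_NR_HELD, rqh->cnt);
        int cpu;

        for_each_possible_cpu(cpu) {
                struct rqspinlock_held *rqh_cpu = per_cpu_ptr(&rqspinlock_held_locks, cpu);
                void *remote_waiting;
                int cnt;

                if (cpu == smp_processor_id())
                        continue;

                /* Bail out as soon as the lock we are spinning on is released. */
                if (is_lock_released(lock, mask, ts))
                        return 0;

                cnt = min(RES_NR_HELD, READ_ONCE(rqh_cpu->cnt));
                if (!cnt)
                        continue;

                /* Bottom entry: the lock the remote CPU is waiting for or taking. */
                remote_waiting = READ_ONCE(rqh_cpu->locks[cnt - 1]);
                if (!remote_waiting)
                        continue;

                /* Is the remote CPU waiting on a lock that we already hold... */
                for (int i = 0; i < rqh_cnt - 1; i++) {
                        if (rqh->locks[i] != remote_waiting)
                                continue;
                        /* ...while it holds the lock we are trying to take? */
                        for (int j = 0; j < cnt - 1; j++) {
                                if (READ_ONCE(rqh_cpu->locks[j]) == lock)
                                        return -EDEADLK;
                        }
                }
        }
        return 0;
}

As Waiman notes, a scan like this only catches the common two-lock cycle;
longer cycles and races with concurrent table updates can still slip through,
which is why the timeout remains the ultimate fallback.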