On Thu, 14 Feb 2019 16:09:48 +0800
Yimin Deng <yimin11.deng@xxxxxxxxx> wrote:

> I encountered deadlock in glibc's pthread_mutex_lock as below:
> 'pthread_mutex_lock.c:314: __pthread_mutex_lock_full: Assertion `(e)
> != 45 || (kind != PTHREAD_MUTEX_ERRORCHECK_NP && kind !=
> PTHREAD_MUTEX_RECURSIVE_NP)' failed.'
>
> glibc: 2.16
> linux: 3.10.87-rt80-Cavium-Octeon
> arch: MIPS

All three of the above are considered obsolete ;-)

>
> ThreadA called __pthread_mutex_lock_full(mutex). The type of mutex is
> PTHREAD_MUTEX_PI_RECURSIVE_NP or PTHREAD_MUTEX_PI_ERRORCHECK_NP.
>
> ThreadA found the value of mutex->__data.__lock is another task
> ThreadB's tid. So it entered the linux kernel via system call. (the
> auto variable 'oldval' in __pthread_mutex_lock_full was stored in the
> stack)
>
> The linux kernel find the value mutex->__data.__lock is ThreadA itself
> in 'if ((unlikely((uval & FUTEX_TID_MASK) == vpid)))' in
> futex_lock_pi_atomic(), So return -EDEADLK.

That sounds like the kernel found that ThreadB is blocked on something
owned by ThreadA which would be a deadlock.

>
> __pthread_mutex_lock_full() judge the return value and asserted.
>
> coredump file generated, and the value mutex->__data.__lock in the
> coredump file is 0. And the ThreadB is in the start of the entry
> function, for example waiting another message to be processed (i.e.
> has released the lock).
> $5 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 33,
> __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
> __size = '\000' <repeats 15 times>, "!\000\000\000\000\000\000\000",
> __align = 0}

Are you saying that ThreadB isn't blocked on anything? Or could it be
possible that the crash of ThreadA released whatever ThreadB was
blocked on before ThreadB was taken out as well?

>
> ThreadA and ThreadB belong to the same process, but run on different cpus (SMP).
>
>
> To debug this issue, i add printing in the kernel, and it indicates
> the ThreadA deadlocked itself.
> The displayed uaddr is
> &(mutex->__data.__lock).
> @@ -997,8 +1093,13 @@ static int futex_lock_pi_atomic(u32 __us
> /*
> * Detect deadlocks.
> */
> - if ((unlikely((uval & FUTEX_TID_MASK) == vpid)))
> + if ((unlikely((uval & FUTEX_TID_MASK) == vpid))) {
> + printk(KERN_ERR "uaddr:%p, uval:%u, vpid:%u,
> task:%s(%d),prio:%d,normal:%d, current:%s(%d),prio:%d,normal:%d\n",
> uaddr, (unsigned)uval, (unsigned)vpid, task->comm, task_pid_nr(task),
> task->prio, task->normal_prio, current->comm, task_pid_nr(current),
> current->prio, current->normal_prio);
> + show_stack(task, NULL);
> + if (current != task)
> + show_stack(current, NULL);
> return -EDEADLK;
> + }
>
> Fragment in __pthread_mutex_lock_full():
> int newval = id;
> #ifdef NO_INCR
> newval |= FUTEX_WAITERS;
> #endif
> oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
> newval, 0);
>
> if (oldval != 0)
> {
> /* The mutex is locked. The kernel will now take care of
> everything. */
> int private = (robust
> ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
> : PTHREAD_MUTEX_PSHARED (mutex));
> INTERNAL_SYSCALL_DECL (__err);
> int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
> __lll_private_flag (FUTEX_LOCK_PI,
> private), 1, 0);
>
>
> 0x77f4ed98 <+616>: sw zero,0(sp)
> 0x77f4ed9c <+620>: ll v1,0(s0) //v1: mutex->__data.__lock (Load
> linked (LL) and store conditional (SC))
> 0x77f4eda0 <+624>: bnez v1,0x77f4edb4 <__pthread_mutex_lock_full+644>
> 0x77f4eda4 <+628>: move at,s1
> 0x77f4eda8 <+632>: sc at,0(s0) //mutex->__data.__lock = current task's tid
> 0x77f4edac <+636>: beqz at,0x77f4ed9c <__pthread_mutex_lock_full+620>
> 0x77f4edb0 <+640>: nop
> 0x77f4edb4 <+644>: beqz v1,0x77f4eea0 <__pthread_mutex_lock_full+880>
> 0x77f4edb8 <+648>: sw v1,0(sp)
> 0x77f4edbc <+652>: bnez a4,0x77f4edcc <__pthread_mutex_lock_full+668>
> 0x77f4edc0 <+656>: li v0,128
> 0x77f4edc4 <+660>: lw v0,12(s0)
>
>
> I could not imagine a scenario that leads to 3 different values on the
> same variable
> mutex->__data.__lock seen in 3 positions.
> It's very difficult to reproduce this issue (About 1 ~ several months
> for 1 reproducing). And we failed to reproduce it using small
> application.
>
> Any help is welcome!

Have you been able to try a newer kernel at all? I don't look at
anything less than 3.18, and even for 3.18, I try to avoid it.

-- Steve
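
For reference, the check discussed in this thread can be modeled in user
space. The following is only a minimal sketch under stated assumptions,
not the kernel's futex_lock_pi_atomic(): the function model_lock_pi() and
the MODEL_* names are hypothetical, and the model ignores priority
inheritance, robust lists and all locking. It shows only how the owner
TID in the futex word is compared against the caller's TID. Note that
EDEADLK is 45 on MIPS, which is the value tested by the glibc assertion
quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bit layout of a PI futex word (same values as the Linux uapi futex.h). */
#define MODEL_FUTEX_WAITERS    0x80000000u
#define MODEL_FUTEX_OWNER_DIED 0x40000000u
#define MODEL_FUTEX_TID_MASK   0x3fffffffu

/*
 * Hypothetical, single-threaded model of the first steps of a
 * FUTEX_LOCK_PI attempt.
 *
 * Returns  1        if the word was free and now names the caller,
 *          0        if another TID owns the word (caller would block),
 *         -EDEADLK  if the word already names the caller.
 */
static int model_lock_pi(uint32_t *word, uint32_t vpid)
{
    uint32_t uval = *word;

    /* Free lock: take it by storing the caller's TID. */
    if (uval == 0) {
        *word = vpid;
        return 1;
    }

    /* Deadlock detection: the owner bits already hold our own TID. */
    if ((uval & MODEL_FUTEX_TID_MASK) == vpid)
        return -EDEADLK;

    /* Owned by someone else: flag that a waiter exists. */
    *word = uval | MODEL_FUTEX_WAITERS;
    return 0;
}

/* Walks through the three outcomes with made-up TIDs 100 and 200. */
static void model_demo(void)
{
    uint32_t word = 0;

    assert(model_lock_pi(&word, 100) == 1);        /* free -> owned by 100 */
    assert(model_lock_pi(&word, 200) == 0);        /* 200 would block      */
    assert(model_lock_pi(&word, 100) == -EDEADLK); /* 100 relocks itself   */
}
```

In this model -EDEADLK is only reachable when the owner bits of the word
equal the caller's TID at the moment of the check, regardless of the
FUTEX_WAITERS bit. That is why the three differing observations of
mutex->__data.__lock in the report are surprising.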