Re: Help me: deadlock reported in pthread_mutex_lock

Yimin Deng <yimin11.deng@xxxxxxxxx> · Sat, 16 Feb 2019 18:28:46 +0800

Hi CVS,

I sincerely appreciate your effort on this issue!

> > I could not imagine a scenario that leads to 3 different values of the
> > same variable mutex->__data.__lock seen in 3 positions.
> Is there a common pattern to these 3 different values?
> - Do they look like memory bit-flips (if yes, then misbehaviour /
> misconfigured HW?)

For example, when ThreadA's tid is 0x1100, the 'oldval' on the
stack is 0x11c1 (ThreadB's tid on another CPU). When ThreadA's tid is
0x1000, the 'oldval' is 0x1057. So this does not look like a bit-flip
between these two values. Which value do you think is bit-flipped?

> - Do they always contain the same pattern (if yes, then
> buffer-overflow? Add "guard-bytes" around the struct and attempt to
> reproduce the behaviour)

On at least one occurrence, the variable at the address
(&(mutex->__data.__lock) - 8) (a global variable that stores the value
of sysconf(_SC_NPROCESSORS_ONLN)) is correct, i.e. not overwritten.
(&(mutex->__data.__lock) - 4) is *fill* (i.e. padding that is not
actually used), and its value in the coredump is 0. According to the
coredump, mutex->__data.__kind (which is located after
&(mutex->__data.__lock)) is also correct. I could not add guard bytes
inside struct pthread_mutex_t, because glibc is part of the
toolchain provided by a third party.

> Slightly tangential.
> Have you checked for any unpatched HW errata in CPU, cache controller,
> memory controller
> that can cause writes posted by the CPU to be lost in some rare scenarios?

I'm asking for the latest version of the errata.

I'll update you if there's any progress.

B.R.
Yimin