On Sat, 2 Feb 2019, Heiko Carstens wrote:
> On Sat, Feb 02, 2019 at 11:14:27AM +0100, Thomas Gleixner wrote:
> > On Sat, 2 Feb 2019, Heiko Carstens wrote:
> > So after the unlock @timestamp 337.215675 the kernel does not deal with
> > that futex at all until the failed lock attempt where it rightfully rejects
> > the attempt due to the alleged owner being gone.
> >
> > So this looks more like user space doing something stupid...
> >
> > As we talked about the missing barriers before, I just looked at
> > pthread_mutex_trylock() and that does still:
> >
> >     if (robust)
> >       {
> >         ENQUEUE_MUTEX_PI (mutex);
> >         THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> >       }
> >
> > So it's missing the barriers which pthread_mutex_lock() has. Grasping for
> > straws obviously....

Looks more like a solid tree than a straw now. :)

> Excellent! Taking a look into the disassembly of nptl/pthread_mutex_trylock.o
> reveals this part:
>
>  140:   a5 1b 00 01             oill    %r1,1
>  144:   e5 48 a0 f0 00 00       mvghi   240(%r10),0   <--- THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>  14a:   e3 10 a0 e0 00 24       stg     %r1,224(%r10) <--- last THREAD_SETMEM of ENQUEUE_MUTEX_PI

Awesome.

> I added a barrier between those two and now the code looks like this:
>
>  140:   a5 1b 00 01             oill    %r1,1
>  144:   e3 10 a0 e0 00 24       stg     %r1,224(%r10)
>  14a:   e5 48 a0 f0 00 00       mvghi   240(%r10),0
>
> Looks like this was a one instruction race...

Fun. JFYI, I said that I reversed the stores in glibc and on my x86 test VM
it took more than _3_ days to trigger. But the good news is, that the trace
looks exactly like the ones you provided. So it looks we are on the right
track.

> I'll try to reproduce with the patch below (sprinkling compiler
> barriers just like the other files have).

Looks about right.

Thanks,

	tglx
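
For readers following along, the store ordering discussed above can be sketched roughly as follows. This is only an illustration, not the glibc patch: the struct robust_head_sketch, the compiler_barrier() macro and the function enqueue_then_clear_pending() are hypothetical stand-ins, and the barrier uses GCC/Clang inline asm syntax. It assumes that the point of the fix is to keep the compiler from moving the store which clears list_op_pending ahead of the last store of the enqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread robust list head.
   This is NOT glibc's real structure layout. */
struct robust_head_sketch {
    void *list;            /* head of the list of held robust mutexes */
    void *list_op_pending; /* mutex being acquired right now, or NULL */
};

/* Compiler-only barrier (GCC/Clang syntax): emits no instruction, but
   the compiler may not move memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Sketch of the intended order: publish the mutex on the list first,
   and only then clear the pending marker. */
static void enqueue_then_clear_pending(struct robust_head_sketch *head,
                                       void *mutex)
{
    assert(head != NULL);

    /* Stand-in for the last store of ENQUEUE_MUTEX_PI. */
    head->list = mutex;

    /* Without this, the compiler is free to emit the store below
       before the store above, as in the first disassembly quoted. */
    compiler_barrier();

    /* Stand-in for THREAD_SETMEM (..., list_op_pending, NULL). */
    head->list_op_pending = NULL;
}
```

Note that this is a compiler barrier only. It constrains the order in which the compiler emits the two stores and does nothing about reordering by the CPU, which matches the "sprinkling compiler barriers" description quoted above.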