On Wed, Oct 31, 2018 at 09:14:58AM +0100, Daniel Wagner wrote: > From: Daniel Wagner <daniel.wagner@xxxxxxxxxxx> > > Sebastian writes: > > """ > We reproducibly observe cache line starvation on a Core2Duo E6850 (2 > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4 > cores). > > The problem can be triggered with a v4.9-RT kernel by starting > > cyclictest -S -p98 -m -i2000 -b 200 > > and as "load" > > stress-ng --ptrace 4 > > The reported maximal latency is usually less than 60us. If the problem > triggers then values around 400us, 800us or even more are reported. The > upperlimit is the -i parameter. > > Reproduction with 4.9-RT is almost immediate on Core2Duo, ARM64 and SKL, > but it took 7.5 hours to trigger on v4.14-RT on the Core2Duo. > > Instrumentation show always the picture: > > CPU0 CPU1 > => do_syscall_64 => do_syscall_64 > => SyS_ptrace => syscall_slow_exit_work > => ptrace_check_attach => ptrace_do_notify / rt_read_unlock > => wait_task_inactive rt_spin_lock_slowunlock() > -> while task_running() __rt_mutex_unlock_common() > / check_task_state() mark_wakeup_next_waiter() > | raw_spin_lock_irq(&p->pi_lock); raw_spin_lock(¤t->pi_lock); > | . . > | raw_spin_unlock_irq(&p->pi_lock); . > \ cpu_relax() . > - . > *IRQ* <lock acquired> > > In the error case we observe that the while() loop is repeated more than > 5000 times which indicates that the pi_lock can be acquired. CPU1 on the > other side does not make progress waiting for the same lock with interrupts > disabled. > > This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ > the other CPU is able to acquire pi_lock and the situation relaxes. > """ > > This matches with the observeration for v4.4-rt on a Core2Duo E6850: > > CPU 0: > > - no progress for a very long time in rt_mutex_dequeue_pi): > > stress-n-1931 0d..11 5060.891219: function: __try_to_take_rt_mutex > stress-n-1931 0d..11 5060.891219: function: rt_mutex_dequeue > stress-n-1931 0d..21 5060.891220: function: rt_mutex_enqueue_pi > stress-n-1931 0....2 5060.891220: signal_generate: sig=17 errno=0 code=262148 comm=stress-ng-ptrac pid=1928 grp=1 res=1 > stress-n-1931 0d..21 5060.894114: function: rt_mutex_dequeue_pi > stress-n-1931 0d.h11 5060.894115: local_timer_entry: vector=239 > > CPU 1: > > - IRQ at 5060.894114 on CPU 1 followed by the IRQ on CPU 0 > > stress-n-1928 1....0 5060.891215: sys_enter: NR 101 (18, 78b, 0, 0, 17, 788) > stress-n-1928 1d..11 5060.891216: function: __try_to_take_rt_mutex > stress-n-1928 1d..21 5060.891216: function: rt_mutex_enqueue_pi > stress-n-1928 1d..21 5060.891217: function: rt_mutex_dequeue_pi > stress-n-1928 1....1 5060.891217: function: rt_mutex_adjust_prio > stress-n-1928 1d..11 5060.891218: function: __rt_mutex_adjust_prio > stress-n-1928 1d.h10 5060.894114: local_timer_entry: vector=239 > > Thomas writes: > > """ > This has nothing to do with RT. RT is merily exposing the > problem in an observable way. The same issue happens with upstream, it's > harder to trigger and it's harder to observe for obvious reasons. > > If you read through the discussions [see the links below] then you > really see that there is an upstream issue with the x86 qrlock > implementation and Peter has posted fixes which resolve it, both at > the practical and the theoretical level. > """ > > Backporting all qspinlock related patches is very likely to introduce > regressions on v4.4. Therefore, the recommended solution by Peter and > Thomas is to drop back to ticket spinlocks for v4.4. > > Link :https://lkml.kernel.org/r/20180921120226.6xjgr4oiho22ex75@xxxxxxxxxxxxx > Link: https://lkml.kernel.org/r/20180926110117.405325143@xxxxxxxxxxxxx > Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> > Signed-off-by: Daniel Wagner <daniel.wagner@xxxxxxxxxxx> Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>