On Mon, Oct 18, 2021 at 09:18:14AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 18, 2021 at 01:32:59PM +0200, Frederic Weisbecker wrote:
> > When an rdp is in the process of (de-)offloading, rcu_core() and the
> > nocb kthreads can process callbacks at the same time. This leaves many
> > possible scenarios leading to an rcu barrier to execute before
> > the preceding callbacks. Here is one such example:
> >
> >             CPU 0                                  CPU 1
> >        --------------                         ---------------
> >      call_rcu(callbacks1)
> >      call_rcu(callbacks2)
> >      // move callbacks1 and callbacks2 on the done list
> >      rcu_advance_callbacks()
> >      call_rcu(callbacks3)
> >      rcu_barrier_func()
> >          rcu_segcblist_entrain(...)
> >                                             nocb_cb_wait()
> >                                                 rcu_do_batch()
> >                                                     callbacks1()
> >                                                     cond_resched_tasks_rcu_qs()
> >      // move callbacks3 and rcu_barrier_callback()
> >      // on the done list
> >      rcu_advance_callbacks()
> >      rcu_core()
> >          rcu_do_batch()
> >              callbacks3()
> >              rcu_barrier_callback()
> >                                                     //MISORDERING
> >                                                     callbacks2()
> >
> > Fix this with preventing two concurrent rcu_do_batch() on a same rdp
> > as long as an rcu barrier callback is pending somewhere.
> >
> > Reported-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
> > Cc: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> > Cc: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
> > Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
> > Cc: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
> > Cc: Uladzislau Rezki <urezki@xxxxxxxxx>
>
> Yow!
>
> But how does the (de-)offloading procedure's acquisition of
> rcu_state.barrier_mutex play into this? In theory, that mutex was
> supposed to prevent these sorts of scenarios. In practice, it sounds
> like the shortcomings in this theory should be fully explained so that
> we don't get similar bugs in the future. ;-)

I think you're right.
The real issue is something I wanted to fix next: RCU_SEGCBLIST_RCU_CORE
isn't cleared when nocb is enabled on boot, so rcu_core() always runs
concurrently with the nocb kthreads in TREE04, without holding the
rcu_barrier mutex of course (I mean with the latest patchset).

Ok, forget this patch. I'm testing again with simply clearing
RCU_SEGCBLIST_RCU_CORE on boot.

Thanks.
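
For illustration only, the flag state described above can be modelled in a
few lines of userspace C. This is a rough sketch and not kernel code: every
toy_ and TOY_ name is hypothetical, the bit values are arbitrary, and the
real boot-time initialization may look quite different. It only shows why
leaving the RCU_CORE bit set on a CPU that is offloaded at boot allows both
rcu_core() and the nocb kthreads to run rcu_do_batch() on the same rdp.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this toy model; NOT the kernel's definitions. */
#define TOY_SEGCBLIST_RCU_CORE  (1u << 0) /* rcu_core() may process callbacks */
#define TOY_SEGCBLIST_OFFLOADED (1u << 1) /* nocb kthreads process callbacks */

/* Hypothetical stand-in for the per-CPU rdp; only the flags matter here. */
struct toy_rdp {
	unsigned int flags;
};

/* Models the reported state: RCU_CORE stays set even if offloaded at boot. */
static void toy_boot_init_buggy(struct toy_rdp *rdp, bool nocb)
{
	rdp->flags = TOY_SEGCBLIST_RCU_CORE;
	if (nocb)
		rdp->flags |= TOY_SEGCBLIST_OFFLOADED;
}

/* Models the direction being tested: clear RCU_CORE when offloaded at boot. */
static void toy_boot_init_fixed(struct toy_rdp *rdp, bool nocb)
{
	toy_boot_init_buggy(rdp, nocb);
	if (nocb)
		rdp->flags &= ~TOY_SEGCBLIST_RCU_CORE;
}

/* True when both contexts could process callbacks for this rdp. */
static bool toy_concurrent_batches_possible(const struct toy_rdp *rdp)
{
	return (rdp->flags & TOY_SEGCBLIST_RCU_CORE) &&
	       (rdp->flags & TOY_SEGCBLIST_OFFLOADED);
}

/* Small self-check tying the model together; returns 0 on success. */
int toy_demo(void)
{
	struct toy_rdp rdp;

	toy_boot_init_buggy(&rdp, true);
	assert(toy_concurrent_batches_possible(&rdp));

	toy_boot_init_fixed(&rdp, true);
	assert(!toy_concurrent_batches_possible(&rdp));
	return 0;
}
```

In this model a CPU that is not offloaded at boot keeps the RCU_CORE bit in
both variants, so only the nocb case changes.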